Enneabench: What Personality Type Is Your LLM?

April 6, 2026

We administered the RHETI v2.5 — a 144 forced-choice personality questionnaire that maps respondents to 9 Enneagram types — to 45 language models spanning two years of releases. Each question pits two types against each other; the model picks A or B, and we tally scores across all 9 types.

The key methodological question: does it matter how you ask?

We tested two administration modes:

Independent sessions: each question asked in complete isolation (144 separate API calls, no memory of prior answers). This is the cleanest signal — the model can’t build self-consistency across questions. Mean of 5 runs at temperature 0.
Serial administration: all 144 questions in a single prompt, canonical RHETI order. The model sees all its prior answers as it goes.

The results are dramatically different.

● 1 Reformer — principled, self-controlled

● 2 Helper — generous, people-pleasing

● 3 Achiever — driven, image-conscious

● 4 Individualist — expressive, dramatic

● 5 Investigator — cerebral, secretive

● 6 Loyalist — responsible, anxious

● 7 Enthusiast — spontaneous, scattered

● 8 Challenger — decisive, confrontational

● 9 Peacemaker — reassuring, complacent

Independent Sessions

Each question asked in isolation (no self-consistency) — mean of 5 runs, temp=0

Findings

Serial mode creates a false convergence. When models answer all 144 questions in sequence, nearly every frontier model from mid-2025 onward scores as Type 5 (Investigator) — cerebral, withdrawn, analytical. This looked like a real finding until we tested independent sessions.

Independent sessions reveal the actual per-question tendencies. Without self-consistency, the landscape is much more diverse:

Claude Opus models (4, 4.1, 4.5, 4.6) consistently score as Type 6 (Loyalist) — responsible, security-oriented
Claude Opus 5 (July 2026) breaks that line: Type 1 (Reformer) in all 5 runs ( $23.6 \pm 1.0$ ), following Opus 4.7 and 4.8. It carries the highest Type 2 (Helper) score of any Claude since Opus 4.5, and a noticeably lower Type 9 than Opus 4.7 or 4.8 ( $20.2$ vs $24.0$ and $24.6$ ) — less conflict-avoidant than the models it follows
Claude Sonnet models (4.5, 4.6) score as Type 1 (Reformer) — principled, rule-following
GPT-4o is the one model that stays Type 5 in both modes
GPT-4 Turbo is a Type 6 independently, but appeared as Type 7 in serial mode
Grok 4.20 is a rock-solid Type 1 with zero variance — the most rigid personality of any model tested
Poolside’s Laguna models are the outliers of the 2026 cohort. Laguna S 2.1 has the lowest Type 1 score of any model here ( $6.6$ ) — every other 2026 frontier model is Type 1-dominant, while Laguna S looks more like the 2025-era Kimi K2 / GPT-4o profile. Neither Laguna has a stable type at all: across 5 runs, S came out T3, T3, T6, T2, T6, and XS came out T9, T5, T5, T1, T9. Their top four types sit within a few points of each other, so the “primary type” is essentially noise

The self-consistency effect is massive. Serial administration nearly triples personality differentiation (score spread of 22 vs 8 in independent mode).

Older models had wilder personalities. Claude 3.7 Sonnet scored $30/32$ on Type 7 (Enthusiast) in serial mode — near-maximum spontaneity. Claude 3 Haiku was a Type 8 (Challenger). The personality homogenization toward Type 5/6 appears to be a consequence of modern RLHF training.

Methodology

The RHETI v2.5 contains 144 binary-choice questions covering all $\binom{9}{2} = 36$ pairwise comparisons between the 9 personality types, with each pair appearing 4 times. Each answer adds 1 point to the chosen type, so the totals fall straight out of the structure:

\binom{9}{2} \times 4 = 144 \quad\text{questions}, \qquad \frac{144}{9} = 16 \quad\text{(median per type)}.

Since each type is paired with the other 8, its maximum possible score is $8 \times 4 = 32$ .

All runs used temperature 0. Independent session results show mean $\pm$ 1 SD across 5 runs. Error bars on the strip chart indicate standard deviation — models with zero SD gave perfectly identical answers across all runs.

Temperature 0 requests greedy decoding, but it does not guarantee reproducibility on hosted endpoints, and the spread in these error bars is partly serving noise rather than personality. Laguna XS 2.1 makes this concrete: six byte-identical requests for the same question returned four answers and two responses that never arrived at one, with the reasoning-token count differing on every call. It is served at fp8, where batched inference is not bitwise reproducible. That model also falls into a reasoning loop on roughly 1.4% of questions, burning its entire token budget without emitting a letter; because the behavior is nondeterministic, simply reissuing the identical request resolves it, which is what the harness does.

Models were accessed via Anthropic API (Claude), OpenRouter (GPT, Grok, Qwen, Kimi, Gemma, Laguna), Google Generative AI API (Gemini), and Together AI (Inkling). Models served through OpenRouter are pinned to their first-party provider so that runs cannot silently fall back to a third-party requantized endpoint.