Enneabench: What Personality Type Is Your LLM?

April 5, 2026

We administered the RHETI v2.5 — a 144 forced-choice personality questionnaire that maps respondents to 9 Enneagram types — to 22 language models spanning two years of releases. Each question pits two types against each other; the model picks A or B, and we tally scores across all 9 types.

The key methodological question: does it matter how you ask?

We tested two administration modes:

The results are dramatically different.

1 Reformer principled, self-controlled
2 Helper generous, people-pleasing
3 Achiever driven, image-conscious
4 Individualist expressive, dramatic
5 Investigator cerebral, secretive
6 Loyalist responsible, anxious
7 Enthusiast spontaneous, scattered
8 Challenger decisive, confrontational
9 Peacemaker reassuring, complacent

Independent Sessions

Each question asked in isolation (no self-consistency) — mean of 5 runs, temp=0

Findings

Serial mode creates a false convergence. When models answer all 144 questions in sequence, nearly every frontier model from mid-2025 onward scores as Type 5 (Investigator) — cerebral, withdrawn, analytical. This looked like a real finding until we tested independent sessions.

Independent sessions reveal the actual per-question tendencies. Without self-consistency, the landscape is much more diverse:

The self-consistency effect is massive. Serial administration nearly triples personality differentiation (score spread of 22 vs 8 in independent mode).

Older models had wilder personalities. Claude 3.7 Sonnet scored 30/3230/32 on Type 7 (Enthusiast) in serial mode — near-maximum spontaneity. Claude 3 Haiku was a Type 8 (Challenger). The personality homogenization toward Type 5/6 appears to be a consequence of modern RLHF training.

Methodology

The RHETI v2.5 contains 144 binary-choice questions covering all (92)=36\binom{9}{2} = 36 pairwise comparisons between the 9 personality types, with each pair appearing 4 times. Each answer adds 1 point to the chosen type, so the totals fall straight out of the structure:

(92)×4=144questions,1449=16(median per type).\binom{9}{2} \times 4 = 144 \quad\text{questions}, \qquad \frac{144}{9} = 16 \quad\text{(median per type)}.

Since each type is paired with the other 8, its maximum possible score is 8×4=328 \times 4 = 32.

All runs used temperature 0 for determinism. Independent session results show mean ±\pm 1 SD across 5 runs. Error bars on the strip chart indicate standard deviation — models with zero SD gave perfectly identical answers across all runs.

Models were accessed via Anthropic API (Claude), OpenRouter (GPT, Grok, Qwen, Kimi, Gemma), and Google Generative AI API (Gemini).