CRBC News

AI Outperforms Humans on Emotional Intelligence Tests — and Can Write the Questions, Study Finds

Key points: A joint study by the University of Geneva and the University of Bern tested six major AI models on five established emotional-intelligence measures and found that the AIs averaged ~81% correct versus ~56% for human benchmarks. The models agreed closely with one another, and ChatGPT-4 was able to generate original test items that humans found equally demanding and valid. The researchers stress that the models do not feel emotions but can recognise emotional states and select appropriate responses, suggesting supervised applications in tutoring, therapy and education.

AI excels at tests of emotional reasoning — but it doesn’t feel emotions

Although artificial intelligence is often praised for coding and numerical skills, a new joint study from the University of Geneva and the University of Bern asked how well AI handles something distinctly human: emotions. The researchers tested six popular large language models on five established measures of emotional understanding and regulation and found that the models outperformed human benchmarks by a wide margin.

What the researchers did

Between December 2024 and January 2025, the teams evaluated six systems: ChatGPT-4, ChatGPT-o1, Gemini 1.5 Flash, Copilot 365, Claude 3.5 Haiku and DeepSeek V3. Each model completed the assessments ten times, allowing the researchers to compute average scores and compare machine performance with previously collected human benchmark data.

Results

Across five tests of “ability emotional intelligence” (forced-choice items with a single best answer), the six AI models answered about 81% of items correctly, compared with roughly 56% for human participants on the same measures. The systems also showed strong agreement with one another, producing similar emotional judgments despite not being explicitly trained to pass these specific tests.

Tests used

The researchers used well-established instruments psychologists use to measure emotion reasoning and management, including the Situational Test of Emotion Understanding (STEU), the Geneva Emotion Knowledge Test – Blends (GEMOK‑Blends), the Situational Test of Emotion Management (STEM), and subtests from the Geneva Emotional Competence Test (GECo). Each item presented a realistic social situation and asked respondents to choose the response that best demonstrated emotional intelligence.

Example: If an employee takes credit for a colleague’s idea and is praised by a supervisor, the emotionally intelligent choice is not retaliation but a calm, discreet conversation with the supervisor — an approach that shows emotional control and constructive problem‑solving.

Can AI write the tests?

To probe deeper, the team asked ChatGPT‑4 to generate new situational items with multiple-choice responses and indicate the best answer. They then had 467 human participants complete both the original, human-authored tests and the AI-generated versions.

The results were striking: the AI‑written items proved as demanding and valid as the originals. Participants scored similarly on both sets; judges rated the items as equally clear and realistic; statistical analyses found equivalent difficulty. According to the paper, about 88% of the items produced by ChatGPT‑4 were entirely original rather than direct rewrites, and the AI items correlated with vocabulary and other emotional‑intelligence measures in the same way as the human items.

Interpretation and caveats

Lead author Katja Schlegel (University of Bern) and senior scientist Marcello Mortillaro (Swiss Centre for Affective Sciences) note that these findings do not imply that models feel emotions. Instead, modern LLMs appear to have learned patterns of emotional reasoning from the large body of human text they were trained on and can reliably identify emotional states and select appropriate responses.

The researchers did flag modest differences: some human‑written items were judged slightly clearer, and the AI scenarios were somewhat less diverse. These differences, however, did not change the main conclusion that these models performed better than human benchmarks on the tested measures and can generate valid test items quickly.

Implications

While LLMs cannot have subjective experiences, their capacity to reason about emotions has practical implications. Properly designed and supervised tools could support tutoring, mental‑health triage, coaching, and education by recognising when someone is frustrated or distressed and suggesting appropriate responses — provided ethical safeguards, human oversight, and domain expertise are in place.

The study’s full results are available in Communications Psychology.
