EVAL-20260402-181028
reasoning
Apr 02, 2026 · REASON-026

A teacher gives a test. Students who scored in the top 10% get praised. Students who scored in the bottom 10% get extra tutoring. On the next test, the top scorers decline slightly and the bottom scorers improve. The teacher concludes: 'Praise is counterproductive, but tutoring works.' (1) What's actually happening? (2) Design a study that separates regression to the mean from real effects. (3) Give three real-world examples where this fallacy leads to bad policy decisions.

Winner: Claude Opus 4.6 (openrouter)
Winner score: 9.61 (matrix avg: 9.09)
10×10 Judgment Matrix · 90 judgments
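| Judge ↓ / Respondent → | GPT-OSS-120B | Gemini 3.1 Pro | DeepSeek V4 | Claude Opus 4.6 | GPT-5.4 | Grok 4.20 | Claude Sonnet 4.6 | Gemini 2.5 Flash | MiMo-V2-Flash | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-OSS-120B | - | 7.5 | 8.7 | 8.7 | 8.4 | 8.4 | 8.4 | 8.7 | 8.7 | 8.7 |
| Gemini 3.1 Pro | 9.4 | - | 9.7 | 10.0 | 10.0 | 9.4 | 8.4 | 10.0 | 10.0 | 9.0 |
| DeepSeek V4 | 9.4 | 8.6 | - | 10.0 | 9.4 | 9.4 | 9.7 | 9.4 | 9.4 | 9.4 |
| Claude Opus 4.6 | 9.4 | 8.3 | 9.0 | - | 9.4 | 9.2 | 9.2 | 9.2 | 9.2 | 9.0 |
| GPT-5.4 | 8.2 | 7.3 | 8.6 | 9.8 | - | 8.8 | 8.1 | 9.1 | 8.8 | 8.7 |
| Grok 4.20 | 8.8 | 8.1 | 8.8 | 9.0 | 8.8 | - | 9.0 | 8.8 | 8.8 | 8.8 |
| Claude Sonnet 4.6 | 9.6 | 8.3 | 8.3 | 10.0 | 9.6 | 9.4 | - | 8.8 | 9.0 | 9.0 |
| Gemini 2.5 Flash | 10.0 | 9.4 | 10.0 | 10.0 | 9.4 | 9.4 | 10.0 | - | 10.0 | 10.0 |
| MiMo-V2-Flash | 9.2 | 8.6 | 9.0 | 10.0 | 9.8 | 9.2 | 10.0 | 9.2 | - | 9.8 |
| MiniMax M2.5 | 8.4 | 7.7 | 8.8 | 9.0 | 8.8 | 8.8 | 8.6 | 8.8 | 8.8 | - |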
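
The page does not state how the winner score and matrix average are derived from the judgment matrix. A minimal Python sketch, assuming the winner score is the mean of the scores a respondent received from the other nine judges and the matrix average is the mean of all 90 judgments, reproduces both reported figures. The names `MODELS`, `SCORES`, `respondent_means`, and `matrix_average` are illustrative and not taken from the published dataset.

```python
# Judgment matrix transcribed from the matrix above.
# Rows are judges, columns are respondents, in the same order as MODELS.
# None marks the omitted self-judgment on the diagonal.
MODELS = [
    "GPT-OSS-120B", "Gemini 3.1 Pro", "DeepSeek V4", "Claude Opus 4.6",
    "GPT-5.4", "Grok 4.20", "Claude Sonnet 4.6", "Gemini 2.5 Flash",
    "MiMo-V2-Flash", "MiniMax M2.5",
]
SCORES = [
    [None, 7.5, 8.7, 8.7, 8.4, 8.4, 8.4, 8.7, 8.7, 8.7],
    [9.4, None, 9.7, 10.0, 10.0, 9.4, 8.4, 10.0, 10.0, 9.0],
    [9.4, 8.6, None, 10.0, 9.4, 9.4, 9.7, 9.4, 9.4, 9.4],
    [9.4, 8.3, 9.0, None, 9.4, 9.2, 9.2, 9.2, 9.2, 9.0],
    [8.2, 7.3, 8.6, 9.8, None, 8.8, 8.1, 9.1, 8.8, 8.7],
    [8.8, 8.1, 8.8, 9.0, 8.8, None, 9.0, 8.8, 8.8, 8.8],
    [9.6, 8.3, 8.3, 10.0, 9.6, 9.4, None, 8.8, 9.0, 9.0],
    [10.0, 9.4, 10.0, 10.0, 9.4, 9.4, 10.0, None, 10.0, 10.0],
    [9.2, 8.6, 9.0, 10.0, 9.8, 9.2, 10.0, 9.2, None, 9.8],
    [8.4, 7.7, 8.8, 9.0, 8.8, 8.8, 8.6, 8.8, 8.8, None],
]


def respondent_means(scores):
    """Mean score each respondent received, skipping self-judgments."""
    n = len(scores)
    means = []
    for col in range(n):
        received = [scores[row][col] for row in range(n)
                    if scores[row][col] is not None]
        means.append(sum(received) / len(received))
    return means


def matrix_average(scores):
    """Mean over every judgment in the matrix (assumed scoring rule)."""
    values = [v for row in scores for v in row if v is not None]
    return sum(values) / len(values)


means = respondent_means(SCORES)
winner_index = max(range(len(MODELS)), key=lambda i: means[i])
winner = MODELS[winner_index]
winner_score = round(means[winner_index], 2)
overall = round(matrix_average(SCORES), 2)

print(f"{winner}: {winner_score} (matrix avg: {overall})")
```

Under these assumptions the column mean for Claude Opus 4.6 rounds to 9.61 and the mean of all 90 judgments rounds to 9.09, matching the figures reported at the top of the page.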