reasoning
Apr 02, 2026
REASON-026
A teacher gives a test. Students who scored in the top 10% get praised. Students who scored in the bottom 10% get extra tutoring. On the next test, the top scorers decline slightly and the bottom scorers improve. The teacher concludes: 'Praise is counterproductive, but tutoring works.' (1) What's actually happening? (2) Design a study that separates regression to the mean from real effects. (3) Give three real-world examples where this fallacy leads to bad policy decisions.
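The scenario in the prompt can be illustrated with a small simulation. The sketch below is not part of the benchmark or of any respondent's answer. It assumes each test score is a stable ability plus independent noise, and it models no praise or tutoring effect at all. The function name `simulate`, the population size, and the means and standard deviations are illustrative assumptions.

```python
import random
import statistics


def simulate(n_students=10_000, seed=0):
    """Two noisy tests with NO intervention effect between them.

    Assumed model: score = stable ability + independent test-day noise.
    Returns the mean change (test 2 minus test 1) for the students who
    were in the top 10% and the bottom 10% on test 1.
    """
    rng = random.Random(seed)
    ability = [rng.gauss(70, 10) for _ in range(n_students)]
    test1 = [a + rng.gauss(0, 10) for a in ability]
    test2 = [a + rng.gauss(0, 10) for a in ability]

    # Rank students by their first score and take the extreme deciles.
    order = sorted(range(n_students), key=lambda i: test1[i])
    k = n_students // 10
    bottom, top = order[:k], order[-k:]

    def mean_change(group):
        return statistics.mean(test2[i] - test1[i] for i in group)

    return mean_change(top), mean_change(bottom)


top_change, bottom_change = simulate()
print(f"top 10% mean change:    {top_change:+.2f}")
print(f"bottom 10% mean change: {bottom_change:+.2f}")
```

Under these assumptions the top group's mean change is negative and the bottom group's is positive, even though neither group received any treatment in the simulation. Selecting on an extreme noisy score is enough to produce the pattern the teacher observed.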
Winner: Claude Opus 4.6 (openrouter)
Winner score: 9.61
Matrix avg: 9.09
10×10 Judgment Matrix · 90 judgments
| Judge ↓ / Respondent → | GPT-OSS-120B | Gemini 3.1 Pro | DeepSeek V4 | Claude Opus 4.6 | GPT-5.4 | Grok 4.20 | Claude Sonnet 4.6 | Gemini 2.5 Flash | MiMo-V2-Flash | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-OSS-120B | — | 7.5 | 8.7 | 8.7 | 8.4 | 8.4 | 8.4 | 8.7 | 8.7 | 8.7 |
| Gemini 3.1 Pro | 9.4 | — | 9.7 | 10.0 | 10.0 | 9.4 | 8.4 | 10.0 | 10.0 | 9.0 |
| DeepSeek V4 | 9.4 | 8.6 | — | 10.0 | 9.4 | 9.4 | 9.7 | 9.4 | 9.4 | 9.4 |
| Claude Opus 4.6 | 9.4 | 8.3 | 9.0 | — | 9.4 | 9.2 | 9.2 | 9.2 | 9.2 | 9.0 |
| GPT-5.4 | 8.2 | 7.3 | 8.6 | 9.8 | — | 8.8 | 8.1 | 9.1 | 8.8 | 8.7 |
| Grok 4.20 | 8.8 | 8.1 | 8.8 | 9.0 | 8.8 | — | 9.0 | 8.8 | 8.8 | 8.8 |
| Claude Sonnet 4.6 | 9.6 | 8.3 | 8.3 | 10.0 | 9.6 | 9.4 | — | 8.8 | 9.0 | 9.0 |
| Gemini 2.5 Flash | 10.0 | 9.4 | 10.0 | 10.0 | 9.4 | 9.4 | 10.0 | — | 10.0 | 10.0 |
| MiMo-V2-Flash | 9.2 | 8.6 | 9.0 | 10.0 | 9.8 | 9.2 | 10.0 | 9.2 | — | 9.8 |
| MiniMax M2.5 | 8.4 | 7.7 | 8.8 | 9.0 | 8.8 | 8.8 | 8.6 | 8.8 | 8.8 | — |
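
The two headline figures can be reproduced from the matrix. The sketch below assumes the winner score is the mean of the scores a respondent received from the nine other judges (its column, skipping the self-judgment diagonal) and that the matrix average is the mean of all 90 off-diagonal cells. The page does not state either formula; these are assumptions that happen to match the reported values. The names `received_mean`, `received`, and `matrix_avg` are illustrative.

```python
models = [
    "GPT-OSS-120B", "Gemini 3.1 Pro", "DeepSeek V4", "Claude Opus 4.6",
    "GPT-5.4", "Grok 4.20", "Claude Sonnet 4.6", "Gemini 2.5 Flash",
    "MiMo-V2-Flash", "MiniMax M2.5",
]

# Rows are judges, columns are respondents, in the order of `models`.
# None marks the diagonal, where a model would be judging itself.
matrix = [
    [None, 7.5, 8.7, 8.7, 8.4, 8.4, 8.4, 8.7, 8.7, 8.7],
    [9.4, None, 9.7, 10.0, 10.0, 9.4, 8.4, 10.0, 10.0, 9.0],
    [9.4, 8.6, None, 10.0, 9.4, 9.4, 9.7, 9.4, 9.4, 9.4],
    [9.4, 8.3, 9.0, None, 9.4, 9.2, 9.2, 9.2, 9.2, 9.0],
    [8.2, 7.3, 8.6, 9.8, None, 8.8, 8.1, 9.1, 8.8, 8.7],
    [8.8, 8.1, 8.8, 9.0, 8.8, None, 9.0, 8.8, 8.8, 8.8],
    [9.6, 8.3, 8.3, 10.0, 9.6, 9.4, None, 8.8, 9.0, 9.0],
    [10.0, 9.4, 10.0, 10.0, 9.4, 9.4, 10.0, None, 10.0, 10.0],
    [9.2, 8.6, 9.0, 10.0, 9.8, 9.2, 10.0, 9.2, None, 9.8],
    [8.4, 7.7, 8.8, 9.0, 8.8, 8.8, 8.6, 8.8, 8.8, None],
]


def received_mean(col):
    """Mean score a respondent received from all other judges."""
    scores = [row[col] for row in matrix if row[col] is not None]
    return sum(scores) / len(scores)


received = {name: received_mean(j) for j, name in enumerate(models)}
winner = max(received, key=received.get)

# Every off-diagonal cell is one judgment: 10 * 10 - 10 = 90.
all_scores = [s for row in matrix for s in row if s is not None]
matrix_avg = sum(all_scores) / len(all_scores)

print(winner, round(received[winner], 2))  # Claude Opus 4.6 9.61
print(len(all_scores), round(matrix_avg, 2))  # 90 9.09
```

Reading down a column gives what a model received as a respondent. Reading across a row shows how strict or lenient it was as a judge; for example, the Gemini 2.5 Flash row contains no score below 9.4.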