The Multivac — Ask any model, routed by evaluation

Evaluation EVAL-20260402-181028
reasoning · Apr 02, 2026 · REASON-026
A teacher gives a test. Students who scored in the top 10% get praised. Students who scored in the bottom 10% get extra tutoring. On the next test, the top scorers decline slightly and the bottom scorers improve. The teacher concludes: 'Praise is counterproductive, but tutoring works.' (1) What's actually happening? (2) Design a study that separates regression to the mean from real effects. (3) Give three real-world examples where this fallacy leads to bad policy decisions.
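The pattern described in the prompt can arise with no praise or tutoring effect at all. The following is a minimal simulation sketch, assuming each test score is a stable ability plus independent noise; the function name `simulate` and the parameters (mean 70, standard deviations of 10, 10,000 students) are illustrative assumptions, not part of the evaluation.

```python
import random
import statistics


def simulate(n_students=10_000, seed=0):
    """Two noisy tests of the same ability, with no intervention at all."""
    rng = random.Random(seed)
    # Hypothetical model: score = stable ability + independent test-day noise.
    ability = [rng.gauss(70, 10) for _ in range(n_students)]
    test1 = [a + rng.gauss(0, 10) for a in ability]
    test2 = [a + rng.gauss(0, 10) for a in ability]

    # Select the bottom and top 10% based on the first test only.
    order = sorted(range(n_students), key=lambda i: test1[i])
    k = n_students // 10
    bottom, top = order[:k], order[-k:]

    def mean_change(group):
        return statistics.mean(test2[i] - test1[i] for i in group)

    return mean_change(top), mean_change(bottom)


top_change, bottom_change = simulate()
print(f"top 10% mean change:    {top_change:+.2f}")
print(f"bottom 10% mean change: {bottom_change:+.2f}")
```

Under these assumptions the top group's average drops and the bottom group's average rises even though nobody was praised or tutored, which is why a study needs a randomized control group within each selected band to isolate a real effect.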
Winner: Claude Opus 4.6 (openrouter)
Winner score: 9.61 · matrix avg: 9.09
10×10 Judgment Matrix · 90 judgments
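| Judge ↓ / Respondent → | GPT-OSS-120B | Gemini 3.1 Pro | DeepSeek V4 | Claude Opus 4.6 | GPT-5.4 | Grok 4.20 | Claude Sonnet 4.6 | Gemini 2.5 Flash | MiMo-V2-Flash | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-OSS-120B | — | 7.5 | 8.7 | 8.7 | 8.4 | 8.4 | 8.4 | 8.7 | 8.7 | 8.7 |
| Gemini 3.1 Pro | 9.4 | — | 9.7 | 10.0 | 10.0 | 9.4 | 8.4 | 10.0 | 10.0 | 9.0 |
| DeepSeek V4 | 9.4 | 8.6 | — | 10.0 | 9.4 | 9.4 | 9.7 | 9.4 | 9.4 | 9.4 |
| Claude Opus 4.6 | 9.4 | 8.3 | 9.0 | — | 9.4 | 9.2 | 9.2 | 9.2 | 9.2 | 9.0 |
| GPT-5.4 | 8.2 | 7.3 | 8.6 | 9.8 | — | 8.8 | 8.1 | 9.1 | 8.8 | 8.7 |
| Grok 4.20 | 8.8 | 8.1 | 8.8 | 9.0 | 8.8 | — | 9.0 | 8.8 | 8.8 | 8.8 |
| Claude Sonnet 4.6 | 9.6 | 8.3 | 8.3 | 10.0 | 9.6 | 9.4 | — | 8.8 | 9.0 | 9.0 |
| Gemini 2.5 Flash | 10.0 | 9.4 | 10.0 | 10.0 | 9.4 | 9.4 | 10.0 | — | 10.0 | 10.0 |
| MiMo-V2-Flash | 9.2 | 8.6 | 9.0 | 10.0 | 9.8 | 9.2 | 10.0 | 9.2 | — | 9.8 |
| MiniMax M2.5 | 8.4 | 7.7 | 8.8 | 9.0 | 8.8 | 8.8 | 8.6 | 8.8 | 8.8 | — |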
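The headline numbers appear to follow from the matrix: averaging each respondent's column (excluding the self-judgment diagonal) and averaging all 90 cells reproduces the reported 9.61 and 9.09. The sketch below assumes that plain unweighted means are the scoring rule, which the page does not state; the names `MODELS`, `MATRIX`, and `respondent_means` are illustrative.

```python
from statistics import mean

MODELS = [
    "GPT-OSS-120B", "Gemini 3.1 Pro", "DeepSeek V4", "Claude Opus 4.6",
    "GPT-5.4", "Grok 4.20", "Claude Sonnet 4.6", "Gemini 2.5 Flash",
    "MiMo-V2-Flash", "MiniMax M2.5",
]

# Rows are judges, columns are respondents (same order as MODELS).
# None marks the diagonal, where a model would be judging itself.
MATRIX = [
    [None, 7.5, 8.7, 8.7, 8.4, 8.4, 8.4, 8.7, 8.7, 8.7],
    [9.4, None, 9.7, 10.0, 10.0, 9.4, 8.4, 10.0, 10.0, 9.0],
    [9.4, 8.6, None, 10.0, 9.4, 9.4, 9.7, 9.4, 9.4, 9.4],
    [9.4, 8.3, 9.0, None, 9.4, 9.2, 9.2, 9.2, 9.2, 9.0],
    [8.2, 7.3, 8.6, 9.8, None, 8.8, 8.1, 9.1, 8.8, 8.7],
    [8.8, 8.1, 8.8, 9.0, 8.8, None, 9.0, 8.8, 8.8, 8.8],
    [9.6, 8.3, 8.3, 10.0, 9.6, 9.4, None, 8.8, 9.0, 9.0],
    [10.0, 9.4, 10.0, 10.0, 9.4, 9.4, 10.0, None, 10.0, 10.0],
    [9.2, 8.6, 9.0, 10.0, 9.8, 9.2, 10.0, 9.2, None, 9.8],
    [8.4, 7.7, 8.8, 9.0, 8.8, 8.8, 8.6, 8.8, 8.8, None],
]


def respondent_means(models, matrix):
    """Mean score each respondent received from the other nine judges."""
    result = {}
    for col, name in enumerate(models):
        received = [row[col] for row in matrix if row[col] is not None]
        result[name] = mean(received)
    return result


scores = respondent_means(MODELS, MATRIX)
winner = max(scores, key=scores.get)

# Assumed definition of "matrix avg": unweighted mean of all off-diagonal cells.
all_cells = [s for row in MATRIX for s in row if s is not None]
matrix_avg = mean(all_cells)

print(f"{winner}: {scores[winner]:.2f}, matrix avg: {matrix_avg:.2f}")
# prints: Claude Opus 4.6: 9.61, matrix avg: 9.09
```

Averaging `MATRIX` by row instead of by column would give each judge's average leniency, a useful check when one judge (here Gemini 2.5 Flash) scores most respondents at or near 10.0.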