Evaluation EVAL-20260402-174539
reasoning
Apr 02, 2026 · REASON-022

For each claim, determine if it's causal or correlational, and design an experiment to test causality: (1) 'Learning a musical instrument improves math scores.' (2) 'Countries with more Nobel laureates consume more chocolate per capita.' (3) 'Code reviews reduce bugs.' (4) 'Remote workers are more productive.' For each, identify at least two confounders.

Winner: GPT-5.4 (openrouter)
Winner score: 9.28 · matrix avg: 7.70
10×10 Judgment Matrix · 79 judgments
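| Judge ↓ / Respondent → | MiMo-V2-Flash | Gemini 3.1 Pro | DeepSeek V4 | Claude Opus 4.6 | GPT-5.4 | Grok 4.20 | Claude Sonnet 4.6 | GPT-OSS-120B | Gemini 2.5 Flash | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| MiMo-V2-Flash |  | 7.5 | 9.0 | 9.2 | 9.0 | 9.2 | 8.0 | 8.6 | 8.6 | · |
| Gemini 3.1 Pro | 9.8 |  | 9.8 | · | 9.8 | 8.5 | 7.7 | 7.2 | 7.5 | · |
| DeepSeek V4 | 8.8 | 8.4 |  | 9.4 | · | 8.8 | 9.0 | 9.0 | 8.8 | · |
| Claude Opus 4.6 | 9.2 | 7.1 | 8.0 |  | 9.4 | 9.8 | 9.0 | 8.4 | 9.0 | · |
| GPT-5.4 | 8.4 | 5.1 | 9.0 | 9.2 |  | 8.8 | 5.2 | 6.1 | 7.7 | · |
| Grok 4.20 | 8.7 | 8.3 | 8.6 | 9.2 | 9.0 |  | 8.8 | 8.7 | 8.6 | · |
| Claude Sonnet 4.6 | 9.0 | 7.3 | 8.4 | 9.6 | 9.6 | 9.6 |  | 8.4 | 8.8 | · |
| GPT-OSS-120B | 8.2 | 5.7 | 7.8 | 8.2 | 9.2 | 8.7 | 7.8 |  | 8.7 | 0.2 |
| Gemini 2.5 Flash | 9.4 | 8.4 | 9.4 | 10.0 | 9.4 | 9.4 | · | 9.2 |  | · |
| MiniMax M2.5 | 8.4 | 5.8 | 8.8 | 9.0 | 8.8 | 9.0 | 8.3 | 7.2 | 8.4 |  |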
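
The page does not say how the headline figures are derived from the matrix. The sketch below is one plausible reading, not the evaluation's confirmed method: it assumes each row holds nine cells with the judge's own column skipped, that "·" means no judgment was recorded, that the winner score is the mean of the scores a respondent received, and that "matrix avg" is the mean of those per-respondent means. The names `MODELS`, `ROWS`, `respondent_means`, and `round2` are illustrative only. Under these assumptions the results agree with the 79 judgments, 9.28, and 7.70 reported above.

```python
from decimal import Decimal, ROUND_HALF_UP

MODELS = [
    "MiMo-V2-Flash", "Gemini 3.1 Pro", "DeepSeek V4", "Claude Opus 4.6",
    "GPT-5.4", "Grok 4.20", "Claude Sonnet 4.6", "GPT-OSS-120B",
    "Gemini 2.5 Flash", "MiniMax M2.5",
]

# Rows are judges, columns are respondents, both in MODELS order.
# None marks a cell without a score: the self-judgment diagonal
# (assumed skipped) or a "·" in the published matrix.
ROWS = [
    [None, "7.5", "9.0", "9.2", "9.0", "9.2", "8.0", "8.6", "8.6", None],
    ["9.8", None, "9.8", None, "9.8", "8.5", "7.7", "7.2", "7.5", None],
    ["8.8", "8.4", None, "9.4", None, "8.8", "9.0", "9.0", "8.8", None],
    ["9.2", "7.1", "8.0", None, "9.4", "9.8", "9.0", "8.4", "9.0", None],
    ["8.4", "5.1", "9.0", "9.2", None, "8.8", "5.2", "6.1", "7.7", None],
    ["8.7", "8.3", "8.6", "9.2", "9.0", None, "8.8", "8.7", "8.6", None],
    ["9.0", "7.3", "8.4", "9.6", "9.6", "9.6", None, "8.4", "8.8", None],
    ["8.2", "5.7", "7.8", "8.2", "9.2", "8.7", "7.8", None, "8.7", "0.2"],
    ["9.4", "8.4", "9.4", "10.0", "9.4", "9.4", None, "9.2", None, None],
    ["8.4", "5.8", "8.8", "9.0", "8.8", "9.0", "8.3", "7.2", "8.4", None],
]


def respondent_means(rows):
    """Mean score received by each respondent, ignoring empty cells."""
    means = {}
    for col, name in enumerate(MODELS):
        scores = [Decimal(row[col]) for row in rows if row[col] is not None]
        if scores:
            means[name] = sum(scores) / len(scores)
    return means


def round2(value):
    """Round half up to two decimals (Decimal avoids float rounding drift)."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


means = respondent_means(ROWS)
winner = max(means, key=means.get)
matrix_avg = sum(means.values()) / len(means)
judgments = sum(cell is not None for row in ROWS for cell in row)

print(f"judgments: {judgments}")
print(f"winner: {winner} ({round2(means[winner])})")
print(f"matrix avg: {round2(matrix_avg)}")
```

Scores are parsed as `Decimal` from strings because a mean such as 74.2 / 8 sits exactly on a rounding boundary, where binary floats can round the wrong way. Note that the single 0.2 in the MiniMax M2.5 column pulls the mean of per-respondent means well below the plain mean of all recorded scores.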