The Multivac — Ask any model, routed by evaluation

Evaluation EVAL-20260402-174539 · reasoning · REASON-022 · Apr 02, 2026

For each claim, determine if it's causal or correlational, and design an experiment to test causality: (1) 'Learning a musical instrument improves math scores.' (2) 'Countries with more Nobel laureates consume more chocolate per capita.' (3) 'Code reviews reduce bugs.' (4) 'Remote workers are more productive.' For each, identify at least two confounders.

Winner: GPT-5.4 (openrouter) · winner score: 9.28 · matrix avg: 7.70
10×10 Judgment Matrix · 79 judgments
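
| Judge ↓ / Respondent → | MiMo-V2-Flash | Gemini 3.1 Pro | DeepSeek V4 | Claude Opus 4.6 | GPT-5.4 | Grok 4.20 | Claude Sonnet 4.6 | GPT-OSS-120B | Gemini 2.5 Flash | MiniMax M2.5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MiMo-V2-Flash | — | 7.5 | 9.0 | 9.2 | 9.0 | 9.2 | 8.0 | 8.6 | 8.6 | · |
| Gemini 3.1 Pro | 9.8 | — | 9.8 | · | 9.8 | 8.5 | 7.7 | 7.2 | 7.5 | · |
| DeepSeek V4 | 8.8 | 8.4 | — | 9.4 | · | 8.8 | 9.0 | 9.0 | 8.8 | · |
| Claude Opus 4.6 | 9.2 | 7.1 | 8.0 | — | 9.4 | 9.8 | 9.0 | 8.4 | 9.0 | · |
| GPT-5.4 | 8.4 | 5.1 | 9.0 | 9.2 | — | 8.8 | 5.2 | 6.1 | 7.7 | · |
| Grok 4.20 | 8.7 | 8.3 | 8.6 | 9.2 | 9.0 | — | 8.8 | 8.7 | 8.6 | · |
| Claude Sonnet 4.6 | 9.0 | 7.3 | 8.4 | 9.6 | 9.6 | 9.6 | — | 8.4 | 8.8 | · |
| GPT-OSS-120B | 8.2 | 5.7 | 7.8 | 8.2 | 9.2 | 8.7 | 7.8 | — | 8.7 | 0.2 |
| Gemini 2.5 Flash | 9.4 | 8.4 | 9.4 | 10.0 | 9.4 | 9.4 | · | 9.2 | — | · |
| MiniMax M2.5 | 8.4 | 5.8 | 8.8 | 9.0 | 8.8 | 9.0 | 8.3 | 7.2 | 8.4 | — |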
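
The page does not say how the headline numbers are derived from the matrix above. The sketch below is one reading that happens to reproduce them: it assumes the winner score is the mean of the scores a respondent received (its column), that "matrix avg" is the mean of those per-respondent means, and that both the diagonal and the "·" cells count as no judgment. The names `MODELS`, `MATRIX`, `NA`, `respondent_means`, and `round2` are illustrative, not part of any Multivac API.

```python
from decimal import Decimal, ROUND_HALF_UP

MODELS = [
    "MiMo-V2-Flash", "Gemini 3.1 Pro", "DeepSeek V4", "Claude Opus 4.6",
    "GPT-5.4", "Grok 4.20", "Claude Sonnet 4.6", "GPT-OSS-120B",
    "Gemini 2.5 Flash", "MiniMax M2.5",
]

NA = None  # assumption: diagonal and "·" cells both mean "no judgment"

# Rows are judges, columns are respondents, both in MODELS order.
MATRIX = [
    [NA, 7.5, 9.0, 9.2, 9.0, 9.2, 8.0, 8.6, 8.6, NA],
    [9.8, NA, 9.8, NA, 9.8, 8.5, 7.7, 7.2, 7.5, NA],
    [8.8, 8.4, NA, 9.4, NA, 8.8, 9.0, 9.0, 8.8, NA],
    [9.2, 7.1, 8.0, NA, 9.4, 9.8, 9.0, 8.4, 9.0, NA],
    [8.4, 5.1, 9.0, 9.2, NA, 8.8, 5.2, 6.1, 7.7, NA],
    [8.7, 8.3, 8.6, 9.2, 9.0, NA, 8.8, 8.7, 8.6, NA],
    [9.0, 7.3, 8.4, 9.6, 9.6, 9.6, NA, 8.4, 8.8, NA],
    [8.2, 5.7, 7.8, 8.2, 9.2, 8.7, 7.8, NA, 8.7, 0.2],
    [9.4, 8.4, 9.4, 10.0, 9.4, 9.4, NA, 9.2, NA, NA],
    [8.4, 5.8, 8.8, 9.0, 8.8, 9.0, 8.3, 7.2, 8.4, NA],
]


def respondent_means(matrix):
    """Mean score received by each respondent (column), skipping empty cells."""
    means = []
    for col in range(len(matrix[0])):
        # Decimal(str(x)) keeps one-decimal scores exact, so ties like 9.275
        # round predictably instead of depending on float representation.
        scores = [Decimal(str(row[col])) for row in matrix if row[col] is not None]
        means.append(sum(scores) / len(scores))
    return means


def round2(value):
    """Round a Decimal to two places, halves rounding up."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


means = respondent_means(MATRIX)
winner_idx = max(range(len(MODELS)), key=lambda i: means[i])
winner = MODELS[winner_idx]
winner_score = round2(means[winner_idx])
matrix_avg = round2(sum(means) / len(means))
judgments = sum(cell is not None for row in MATRIX for cell in row)

print(winner, winner_score, matrix_avg, judgments)
# → GPT-5.4 9.28 7.70 79
```

Under this reading, the single 0.2 that MiniMax M2.5 received counts as a full column mean, which pulls the matrix average down to 7.70. A plain mean over all 79 judgments would be about 8.42.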