Evaluation EVAL-20260402-153453
reasoning
Jan 28, 2026 · REASON-003

Estimate how many piano tuners there are in Chicago. Show your reasoning step by step, including:
1. All assumptions you make
2. How you derived each number
3. Sensitivity analysis (what if your assumptions are wrong?)
4. Final estimate with confidence interval

Winner: GPT-5.4 (openrouter)
Winner score: 9.07 · matrix avg: 8.10
10×10 Judgment Matrix · 80 judgments
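| Judge ↓ / Respondent → | Gemini 3.1 Pro | DeepSeek V4 | Claude Opus 4.6 | GPT-5.4 | Grok 4.20 | Claude Sonnet 4.6 | MiMo-V2-Flash | GPT-OSS-120B | Gemini 2.5 Flash | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro |  | 8.8 | 10.0 | 9.8 | 8.8 | 8.3 | 10.0 | 9.2 | 8.8 | · |
| DeepSeek V4 | 6.0 |  | 9.1 | 8.7 | 8.7 | 8.7 | 9.2 | 8.7 | 8.7 | · |
| Claude Opus 4.6 | 1.6 | 7.8 |  | 8.9 | 8.9 | 8.9 | 8.6 | 9.2 | 8.9 | · |
| GPT-5.4 | 0.7 | 7.8 | 8.2 |  | 8.6 | 8.6 | 7.9 | 8.2 | 8.4 | · |
| Grok 4.20 | 4.0 | 8.4 | 8.7 | 8.7 |  | 8.7 | 8.7 | 9.2 | 8.7 | · |
| Claude Sonnet 4.6 | 1.4 | 8.2 | 9.2 | 9.2 | 8.6 |  | 8.9 | 9.2 | 8.6 | · |
| MiMo-V2-Flash | 2.2 | 9.0 | 9.4 | 9.4 | 9.4 | 9.4 |  | 9.4 | 9.0 | · |
| GPT-OSS-120B | 3.3 | 8.2 | 8.4 | 8.4 | 8.4 | 8.4 | 8.2 |  | 8.3 | · |
| Gemini 2.5 Flash | · | 9.2 | 8.8 | 9.2 | 9.0 | 9.1 | 8.8 | 10.0 |  | · |
| MiniMax M2.5 | 2.1 | 8.6 | 8.4 | 9.4 | 8.3 | 8.4 | 8.4 | 8.6 | 8.4 |  |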
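
The per-respondent scores can be rechecked from the matrix with a short script. This is a minimal sketch, not the site's actual scoring code. It assumes that a respondent's score is the plain mean of the scores it received from the other judges, that each row's omitted cell is the judge's own column, and that "·" marks a missing judgment. The names `MATRIX`, `RESPONDENTS`, `respondent_means`, and `count_judgments` are made up for illustration.

```python
from statistics import mean

RESPONDENTS = [
    "Gemini 3.1 Pro", "DeepSeek V4", "Claude Opus 4.6", "GPT-5.4", "Grok 4.20",
    "Claude Sonnet 4.6", "MiMo-V2-Flash", "GPT-OSS-120B", "Gemini 2.5 Flash",
    "MiniMax M2.5",
]

N = None  # no score: assumed self-judgment cell or a "·" (missing) cell

# Rows are judges, columns follow RESPONDENTS order (scores as displayed, rounded).
MATRIX = {
    "Gemini 3.1 Pro":    [N,   8.8, 10.0, 9.8, 8.8, 8.3, 10.0, 9.2,  8.8, N],
    "DeepSeek V4":       [6.0, N,   9.1,  8.7, 8.7, 8.7, 9.2,  8.7,  8.7, N],
    "Claude Opus 4.6":   [1.6, 7.8, N,    8.9, 8.9, 8.9, 8.6,  9.2,  8.9, N],
    "GPT-5.4":           [0.7, 7.8, 8.2,  N,   8.6, 8.6, 7.9,  8.2,  8.4, N],
    "Grok 4.20":         [4.0, 8.4, 8.7,  8.7, N,   8.7, 8.7,  9.2,  8.7, N],
    "Claude Sonnet 4.6": [1.4, 8.2, 9.2,  9.2, 8.6, N,   8.9,  9.2,  8.6, N],
    "MiMo-V2-Flash":     [2.2, 9.0, 9.4,  9.4, 9.4, 9.4, N,    9.4,  9.0, N],
    "GPT-OSS-120B":      [3.3, 8.2, 8.4,  8.4, 8.4, 8.4, 8.2,  N,    8.3, N],
    "Gemini 2.5 Flash":  [N,   9.2, 8.8,  9.2, 9.0, 9.1, 8.8,  10.0, N,   N],
    "MiniMax M2.5":      [2.1, 8.6, 8.4,  9.4, 8.3, 8.4, 8.4,  8.6,  8.4, N],
}


def count_judgments(matrix):
    """Count the cells that actually hold a score."""
    return sum(score is not None for row in matrix.values() for score in row)


def respondent_means(matrix, respondents):
    """Mean score received by each respondent, or None if nobody scored it."""
    means = {}
    for col, name in enumerate(respondents):
        received = [row[col] for row in matrix.values() if row[col] is not None]
        means[name] = mean(received) if received else None
    return means


if __name__ == "__main__":
    print("judgments:", count_judgments(MATRIX))
    for name, value in respondent_means(MATRIX, RESPONDENTS).items():
        print(f"{name:18s}", "n/a" if value is None else f"{value:.2f}")
```

Under these assumptions the script counts 80 filled cells, matching the page header. The displayed cells are rounded to one decimal, so recomputed means can differ from the reported figures in the second decimal place; for GPT-5.4 the rounded cells give roughly 9.08 against the reported 9.07.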