← Evaluations/EVAL-20260318-163754
reasoning
Mar 18, 2026EVAL-20260318-163754

During WWII, analysts studied bullet holes on returning bombers to decide where to add armor. They found most damage on the wings and fuselage, almost none on the engines. Their recommendation: armor the wings. Abraham Wald disagreed. (1) What was Wald's reasoning? (2) Give 5 modern examples of survivorship bias in business/tech. (3) 'We studied 100 successful startups and found they all did X.' Why is this analysis worthless without a control group?

Winner
GPT-5.4
openrouter
9.73
WINNER SCORE
matrix avg: 9.48
results.json report.mdFull dataset (CSV) →
10×10 Judgment Matrix · 56 judgments
OPEN DATA
Judge ↓ / Respondent →MiniMax M2.7MiniMax M2.5MiniMax M2.1MiniMax M2MiniMax M1MiniMax-01Claude Sonnet 4.6GPT-5.4
MiniMax M2.79.49.310.09.49.09.49.0
MiniMax M2.510.09.810.010.09.610.010.0
MiniMax M2.110.09.69.810.09.19.810.0
MiniMax M210.010.09.89.88.89.89.4
MiniMax M110.08.39.09.68.89.610.0
MiniMax-019.810.09.810.09.710.09.8
Claude Sonnet 4.69.49.39.79.49.88.39.8
GPT-5.48.49.37.79.09.08.38.7