reasoning
Mar 15, 2026 · EVAL-20260315-063934

During WWII, analysts studied bullet holes on returning bombers to decide where to add armor. They found most damage on the wings and fuselage, almost none on the engines. Their recommendation: armor the wings. Abraham Wald disagreed. (1) What was Wald's reasoning? (2) Give 5 modern examples of survivorship bias in business/tech. (3) 'We studied 100 successful startups and found they all did X.' Why is this analysis worthless without a control group?
Winner: Kimi K2.5 (openrouter)
Winner score: 9.63 · matrix avg: 9.11
10×10 Judgment Matrix · 85 judgments
| Judge ↓ / Respondent → | Qwen 3 32B | Kimi K2.5 | Devstral Small | Gemma 3 27B | Llama 4 Scout | Phi-4 14B | Granite 4.0 Micro | Qwen 3 8B | Mistral Nemo 12B | Llama 3.1 8B |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen 3 32B | — | 10.0 | 9.3 | 8.8 | 9.3 | 9.0 | 7.8 | 10.0 | 8.7 | 8.3 |
| Kimi K2.5 | · | — | · | · | · | · | 3.6 | 9.4 | 7.4 | 4.4 |
| Devstral Small | 10.0 | 9.8 | — | 10.0 | 9.6 | 9.6 | 9.1 | 9.8 | 9.8 | 9.3 |
| Gemma 3 27B | 9.8 | 10.0 | 9.4 | — | 9.8 | 9.8 | 9.4 | 9.8 | 9.8 | 9.4 |
| Llama 4 Scout | 10.0 | 10.0 | 9.4 | 10.0 | — | 10.0 | 8.8 | 9.4 | 10.0 | 8.3 |
| Phi-4 14B | 10.0 | 9.6 | 9.7 | 10.0 | 10.0 | — | 8.9 | 10.0 | 9.7 | 8.4 |
| Granite 4.0 Micro | 8.8 | 8.7 | 8.3 | 8.8 | 8.3 | 8.6 | — | 8.8 | 8.8 | 8.3 |
| Qwen 3 8B | 10.0 | 10.0 | 9.8 | 10.0 | 8.8 | 8.4 | 8.4 | — | 9.3 | 8.5 |
| Mistral Nemo 12B | 8.4 | 9.3 | 9.1 | 9.1 | 8.3 | 8.6 | 8.5 | 9.1 | — | 7.5 |
| Llama 3.1 8B | 9.4 | 9.4 | 9.1 | 9.4 | 9.3 | 9.1 | 9.1 | 9.3 | 9.3 | — |
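
The page does not say how the summary figures are derived from the matrix. As a rough sketch, assuming the matrix average is the plain mean of all filled cells and a respondent's score is the mean of its column, the table can be re-aggregated as below. The names `MATRIX`, `overall_mean`, and `received_mean` are illustrative only, and the inputs are the rounded values shown in the table, so small differences from the reported figures are expected.

```python
# Scores copied from the judgment matrix above (rows = judges, columns = respondents).
# None marks a cell with no judgment: the diagonal and the "·" cells in the Kimi K2.5 row.
MODELS = [
    "Qwen 3 32B", "Kimi K2.5", "Devstral Small", "Gemma 3 27B", "Llama 4 Scout",
    "Phi-4 14B", "Granite 4.0 Micro", "Qwen 3 8B", "Mistral Nemo 12B", "Llama 3.1 8B",
]
N = None
MATRIX = [
    [N, 10.0, 9.3, 8.8, 9.3, 9.0, 7.8, 10.0, 8.7, 8.3],
    [N, N, N, N, N, N, 3.6, 9.4, 7.4, 4.4],
    [10.0, 9.8, N, 10.0, 9.6, 9.6, 9.1, 9.8, 9.8, 9.3],
    [9.8, 10.0, 9.4, N, 9.8, 9.8, 9.4, 9.8, 9.8, 9.4],
    [10.0, 10.0, 9.4, 10.0, N, 10.0, 8.8, 9.4, 10.0, 8.3],
    [10.0, 9.6, 9.7, 10.0, 10.0, N, 8.9, 10.0, 9.7, 8.4],
    [8.8, 8.7, 8.3, 8.8, 8.3, 8.6, N, 8.8, 8.8, 8.3],
    [10.0, 10.0, 9.8, 10.0, 8.8, 8.4, 8.4, N, 9.3, 8.5],
    [8.4, 9.3, 9.1, 9.1, 8.3, 8.6, 8.5, 9.1, N, 7.5],
    [9.4, 9.4, 9.1, 9.4, 9.3, 9.1, 9.1, 9.3, 9.3, N],
]


def filled_cells():
    """Return every score that is actually present in the matrix."""
    return [score for row in MATRIX for score in row if score is not None]


def overall_mean():
    """Assumed definition of the matrix average: mean over all filled cells."""
    cells = filled_cells()
    return sum(cells) / len(cells)


def received_mean(model):
    """Assumed definition of a respondent's score: mean of the scores in its column."""
    col = MODELS.index(model)
    received = [row[col] for row in MATRIX if row[col] is not None]
    return sum(received) / len(received)


if __name__ == "__main__":
    print(f"judgments: {len(filled_cells())}")
    print(f"matrix mean: {overall_mean():.2f}")
    for name in sorted(MODELS, key=received_mean, reverse=True):
        print(f"{name:18s} {received_mean(name):.2f}")
```

Under these assumptions the cell count comes out at 85 and the overall mean at about 9.11, which agrees with the header. The Kimi K2.5 column averages about 9.64, slightly above the reported 9.63, which would be consistent with the site averaging unrounded scores.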