# The Multivac — Evaluation Report

**Evaluation ID:** EVAL-20260317-030901
**Date:** Mar 17, 2026
**Category:** reasoning
**Question ID:** EVAL-20260317-030901

---

## Question

During WWII, analysts studied bullet holes on returning bombers to decide where to add armor. They found most damage on the wings and fuselage, almost none on the engines. Their recommendation: armor the wings. Abraham Wald disagreed. (1) What was Wald's reasoning? (2) Give 5 modern examples of survivorship bias in business/tech. (3) 'We studied 100 successful startups and found they all did X.' Why is this analysis worthless without a control group?

---

## Winner

**Qwen 3.5 397B-A17B** (openrouter)
- Winner Score: 9.95
- Matrix Average: 9.02
- Total Judgments: 39 (across all models)

---

## Rankings

| Rank | Model | Provider | Avg Score | Judgments |
|------|-------|----------|-----------|----------|
| 1 | Qwen 3.5 397B-A17B | openrouter | 9.95 | 4 |
| 2 | Qwen 3 32B | openrouter | 9.90 | 4 |
| 3 | Qwen 3 Coder Next | openrouter | 9.85 | 4 |
| 4 | Qwen 3.5 122B-A10B | openrouter | 9.81 | 5 |
| 5 | Qwen 3 8B | openrouter | 9.80 | 5 |
| 6 | Qwen 3.5 35B-A3B | openrouter | 7.76 | 6 |
| 7 | Qwen 3.5 27B | openrouter | 7.62 | 5 |
| 8 | Qwen 3.5 9B | openrouter | 7.48 | 6 |
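The summary figures in the Winner section can be cross-checked against this table. The sketch below copies the rows above and computes two candidate aggregates. It suggests the reported Matrix Average (9.02) is the unweighted mean of the eight per-model averages, while a judgment-weighted mean comes out lower. This is an observation from the published numbers, not a documented formula; the variable names are illustrative.

```python
from statistics import mean

# (model, average score, judgments received), copied from the Rankings table.
rankings = [
    ("Qwen 3.5 397B-A17B", 9.95, 4),
    ("Qwen 3 32B", 9.90, 4),
    ("Qwen 3 Coder Next", 9.85, 4),
    ("Qwen 3.5 122B-A10B", 9.81, 5),
    ("Qwen 3 8B", 9.80, 5),
    ("Qwen 3.5 35B-A3B", 7.76, 6),
    ("Qwen 3.5 27B", 7.62, 5),
    ("Qwen 3.5 9B", 7.48, 6),
]

# Total number of judgments recorded across all respondents.
total_judgments = sum(count for _, _, count in rankings)

# Unweighted mean: every model's average counts equally.
unweighted_mean = round(mean(score for _, score, _ in rankings), 2)

# Judgment-weighted mean: every individual judgment counts equally.
weighted_mean = round(
    sum(score * count for _, score, count in rankings) / total_judgments, 2
)

print(f"total judgments:  {total_judgments}")
print(f"unweighted mean:  {unweighted_mean}")
print(f"weighted mean:    {weighted_mean}")
```

The two means differ because the lower-scoring models received more judgments than the top-ranked ones.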

---

## 10×10 Judgment Matrix

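Rows = Judge, Columns = Respondent. Self-judgments are excluded (—); a dot (·) marks a cell with no recorded judgment. Eight models appear in this evaluation, so the matrix shown is 8×8.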

| Judge ↓ / Resp → | Qwen 3 8B | Qwen 3 32B | Qwen 3 Coder Next | Qwen 3.5 35B-A3B | Qwen 3.5 27B | Qwen 3.5 122B-A10B | Qwen 3.5 397B-A17B | Qwen 3.5 9B |
|---|---|---|---|---|---|---|---|---|
| Qwen 3 8B | — | 9.8 | 9.6 | 8.7 | 8.2 | 9.3 | 9.8 | 9.0 |
| Qwen 3 32B | 9.6 | — | 10.0 | 8.6 | 8.4 | 10.0 | 10.0 | 8.4 |
| Qwen 3 Coder Next | 10.0 | 10.0 | — | 10.0 | 9.6 | 10.0 | 10.0 | 9.6 |
| Qwen 3.5 35B-A3B | 9.8 | · | · | — | · | · | · | · |
| Qwen 3.5 27B | 9.8 | · | · | 5.5 | — | 9.8 | · | 6.3 |
| Qwen 3.5 122B-A10B | · | 9.8 | 9.8 | 6.4 | 5.7 | — | 10.0 | 5.3 |
| Qwen 3.5 397B-A17B | 9.8 | 10.0 | 10.0 | 7.5 | 6.3 | 10.0 | — | 6.4 |
| Qwen 3.5 9B | · | · | · | · | · | · | · | — |
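Each respondent's average in the Rankings table can be approximated by averaging its column here and skipping empty cells. The sketch below does that. The full model labels are an inference: they come from matching each column's mean and judgment count to the Rankings table and from the position of the self-judgment diagonal. Small differences from the published averages (up to about 0.02) are expected because the cells are rounded to one decimal.

```python
# Column order, inferred by matching column means to the Rankings table.
labels = [
    "Qwen 3 8B", "Qwen 3 32B", "Qwen 3 Coder Next", "Qwen 3.5 35B-A3B",
    "Qwen 3.5 27B", "Qwen 3.5 122B-A10B", "Qwen 3.5 397B-A17B", "Qwen 3.5 9B",
]

NA = None  # stands for both "—" (self-judgment) and "·" (no judgment recorded)

# Rows are judges, columns are respondents, in the same order as `labels`.
matrix = [
    [NA,   9.8,  9.6,  8.7,  8.2,  9.3,  9.8,  9.0],
    [9.6,  NA,   10.0, 8.6,  8.4,  10.0, 10.0, 8.4],
    [10.0, 10.0, NA,   10.0, 9.6,  10.0, 10.0, 9.6],
    [9.8,  NA,   NA,   NA,   NA,   NA,   NA,   NA],
    [9.8,  NA,   NA,   5.5,  NA,   9.8,  NA,   6.3],
    [NA,   9.8,  9.8,  6.4,  5.7,  NA,   10.0, 5.3],
    [9.8,  10.0, 10.0, 7.5,  6.3,  10.0, NA,   6.4],
    [NA,   NA,   NA,   NA,   NA,   NA,   NA,   NA],
]

def column_stats(matrix, col):
    """Return (judgment count, mean score) for one respondent column."""
    scores = [row[col] for row in matrix if row[col] is not None]
    return len(scores), sum(scores) / len(scores)

counts = []
means = []
for col, label in enumerate(labels):
    count, avg = column_stats(matrix, col)
    counts.append(count)
    means.append(round(avg, 2))
    print(f"{label:<20} judgments={count} mean={avg:.2f}")
```

Sorting the printed means in descending order reproduces the ranking order shown in the Rankings table.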

---

## Methodology

- **10×10 Blind Peer Matrix:** All models answer the same question, then each model judges the other models' responses.
- **5 Criteria:** Correctness, completeness, clarity, depth, usefulness (each scored 1–10).
- **Self-judgments excluded:** Models do not judge their own responses.
- **Weighted Score:** Composite of all 5 criteria.
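
As a rough illustration of how five criterion scores could be combined into one composite, here is a minimal sketch. The report does not publish the actual weights, so the equal weights below are an assumption, and `composite_score`, `ASSUMED_WEIGHTS`, and the example scores are hypothetical names and values chosen for illustration only.

```python
CRITERIA = ("correctness", "completeness", "clarity", "depth", "usefulness")

# ASSUMPTION: equal weights. The real weighting is not stated in this report.
ASSUMED_WEIGHTS = {name: 0.2 for name in CRITERIA}

def composite_score(scores, weights=ASSUMED_WEIGHTS):
    """Combine per-criterion scores (each 1-10) into one weighted average."""
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for name in CRITERIA:
        if not 1 <= scores[name] <= 10:
            raise ValueError(f"{name} must be between 1 and 10")
    total_weight = sum(weights[name] for name in CRITERIA)
    weighted_sum = sum(scores[name] * weights[name] for name in CRITERIA)
    # Dividing by the total weight keeps the result on the 1-10 scale
    # even if the weights do not sum to exactly 1.
    return weighted_sum / total_weight

# Hypothetical judgment, not taken from the evaluation data.
example = {
    "correctness": 10,
    "completeness": 10,
    "clarity": 9,
    "depth": 9,
    "usefulness": 10,
}
example_score = round(composite_score(example), 2)
print(example_score)
```

With a different weighting, only `ASSUMED_WEIGHTS` would need to change; the normalization step keeps the output comparable to the scores in the matrix.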

---

## Citation

The Multivac (2026). Blind Peer Evaluation: EVAL-20260317-030901. app.themultivac.com

## License

Open data. Free to use, share, and build upon. Please cite The Multivac when using this data.

Download raw JSON: https://app.themultivac.com/api/evaluations/EVAL-20260317-030901/results
Full dataset: https://app.themultivac.com/dashboard/export
