# The Multivac — Evaluation Report

**Evaluation ID:** EVAL-20260402-203401
**Date:** Apr 02, 2026
**Category:** analysis
**Question ID:** ANALYSIS-022

---

## Question

A new respiratory virus has R0=3.5, IFR=0.5%, incubation 5 days, infectious period 10 days. (1) Estimate peak infections without intervention in a city of 1M. (2) What R0 do you need to achieve through interventions to avoid overwhelming hospitals (assume 3,000 ICU beds)? (3) Vaccines won't be ready for 12 months. Design an optimal mitigation strategy for those 12 months. (4) How does your model change if 30% of the population ignores interventions?

---

## Winner

**Gemini 3 Flash Preview** (Google)
- Winner Score: 7.72
- Matrix Average: 6.04
- Total Judgments: 78

---

## Rankings

| Rank | Model | Provider | Avg Score | Judgments |
|------|-------|----------|-----------|----------|
| 1 | Gemini 3 Flash Preview | Google | 7.72 | 9 |
| 2 | Claude Sonnet 4.6 | OpenRouter | 7.64 | 8 |
| 3 | GPT-5.4 | OpenRouter | 7.53 | 9 |
| 4 | MiMo-V2-Flash | Xiaomi | 7.21 | 9 |
| 5 | GPT-OSS-120B | OpenAI | 7.03 | 9 |
| 6 | DeepSeek V4 | OpenRouter | 6.71 | 9 |
| 7 | Claude Opus 4.6 | OpenRouter | 6.58 | 9 |
| 8 | Gemini 3.1 Pro | OpenRouter | 2.30 | 9 |
| 9 | Grok 4.20 | OpenRouter | 1.64 | 7 |

---

## 10×10 Judgment Matrix

Rows = Judge, Columns = Respondent. Self-judgments are excluded (—); a · marks a cell with no recorded judgment.

| Judge ↓ / Resp → | MiMo-V2-Flash | GPT-OSS-120B | Gemini 3.1 Pro | Claude Opus | GPT-5.4 | DeepSeek V4 | Claude Sonnet | Grok 4.20 | Gemini 3 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| MiMo-V2-Flash | — | 8.0 | 3.0 | 7.0 | 8.2 | 8.0 | 7.8 | 0.4 | 8.0 | · |
| GPT-OSS-120B | 7.0 | — | 2.0 | 4.7 | 6.7 | 6.6 | · | · | 7.5 | · |
| Gemini 3.1 Pro | 7.3 | 5.1 | — | 6.3 | 7.7 | 6.8 | 7.2 | 1.6 | 8.9 | · |
| Claude Opus | 5.6 | 6.3 | 1.6 | — | 6.8 | 5.0 | 7.0 | 1.2 | 7.2 | · |
| GPT-5.4 | 5.6 | 5.2 | 1.1 | 4.1 | — | 6.0 | 5.8 | 2.3 | 6.5 | · |
| DeepSeek V4 | 8.8 | 8.4 | 5.0 | 8.3 | 8.4 | — | 8.4 | · | 8.6 | · |
| Claude Sonnet | 7.0 | 7.5 | 1.6 | 7.7 | 7.0 | 6.8 | — | 2.0 | 7.8 | · |
| Grok 4.20 | 6.6 | 7.2 | 2.9 | 6.4 | 6.8 | 5.0 | 7.6 | — | 6.6 | · |
| Gemini 3 | 9.0 | 8.8 | 2.9 | 8.3 | 8.8 | 8.8 | 9.2 | 2.0 | — | · |
| MiniMax M2.5 | 8.0 | 6.7 | 0.6 | 6.6 | 7.2 | 7.7 | 8.1 | 2.0 | 8.4 | — |

---

## Methodology

- **10×10 Blind Peer Matrix:** All models answer the same question, and each model then judges every other model's response.
- **5 Criteria:** Correctness, completeness, clarity, depth, usefulness (each scored 1–10).
- **Self-judgments excluded:** Models do not judge their own responses.
- **Weighted Score:** Composite of all 5 criteria.
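
As a rough illustration of how a respondent's average can be recomputed from the matrix, the sketch below takes the mean of one column while skipping missing cells. It assumes the "Avg Score" is a plain mean of the composite scores in that column; the report does not state the criterion weights, so this is an assumption, and `received` and `average_score` are hypothetical names. Because the matrix cells are rounded to one decimal, recomputed values may differ from the Rankings table in the second decimal.

```python
from statistics import mean

# Scores received by two respondents, copied from their matrix columns.
# The self-judgment cell is left out; None stands for a missing cell ("·").
received = {
    "Gemini 3 Flash Preview": [8.0, 7.5, 8.9, 7.2, 6.5, 8.6, 7.8, 6.6, 8.4],
    "Claude Sonnet 4.6": [7.8, None, 7.2, 7.0, 5.8, 8.4, 7.6, 9.2, 8.1],
}


def average_score(scores):
    """Return (mean of available judgments rounded to 2 places, judgment count)."""
    valid = [s for s in scores if s is not None]
    return round(mean(valid), 2), len(valid)


# Assumption: unweighted mean over the available judgments.
results = {model: average_score(scores) for model, scores in received.items()}

# Print respondents from highest to lowest average.
for model, (avg, n) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{model}: {avg} over {n} judgments")
```

The judgment counts produced this way (9 and 8) match the "Judgments" column in the Rankings table for these two models.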

---

## Citation

The Multivac (2026). Blind Peer Evaluation: ANALYSIS-022. app.themultivac.com

## License

Open data. Free to use, share, and build upon. Please cite The Multivac when using this data.

Download raw JSON: https://app.themultivac.com/api/evaluations/EVAL-20260402-203401/results
Full dataset: https://app.themultivac.com/dashboard/export
