# The Multivac — Evaluation Report

**Evaluation ID:** EVAL-20260402-195402
**Date:** Apr 02, 2026
**Category:** analysis
**Question ID:** ANALYSIS-017

---

## Question

A pharmaceutical company reports: 'Our drug reduced hospitalization by 50% (p < 0.001). 2% of patients in the treatment group were hospitalized vs 4% in the control group.' (1) Calculate the absolute risk reduction and NNT (number needed to treat). (2) The trial had 200 patients. Is this enough for the claimed significance? (3) The control group received no treatment (not a placebo). Why is this problematic? (4) Side effects occurred in 8% of the treatment group. Should this drug be approved?
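
For reference, the arithmetic behind parts (1) and (2) can be sketched as follows. This is a minimal illustration, not the scoring key used in this evaluation. The 100/100 split between arms is an assumption (the question only says the trial had 200 patients), and Fisher's exact test is one reasonable choice of test, not one the question specifies. The function name `fisher_two_sided` is hypothetical.

```python
from fractions import Fraction
from math import comb

# Event rates quoted in the question.
control_risk = Fraction(4, 100)    # 4% hospitalized without the drug
treatment_risk = Fraction(2, 100)  # 2% hospitalized with the drug

arr = control_risk - treatment_risk  # absolute risk reduction
rrr = arr / control_risk             # relative risk reduction (the "50%" claim)
nnt = 1 / arr                        # number needed to treat


def fisher_two_sided(events_a, n_a, events_b, n_b):
    """Two-sided Fisher exact p-value for a 2x2 table, using integer arithmetic."""
    total_events = events_a + events_b

    def weight(k):
        # Unnormalized hypergeometric probability of k events in arm A.
        return comb(n_a, k) * comb(n_b, total_events - k)

    observed = weight(events_a)
    lowest = max(0, total_events - n_b)
    highest = min(n_a, total_events)
    # Sum every table that is no more likely than the observed one.
    extreme = sum(
        weight(k) for k in range(lowest, highest + 1) if weight(k) <= observed
    )
    return Fraction(extreme, comb(n_a + n_b, total_events))


# ASSUMPTION: 100 patients per arm, so 2% and 4% mean 2 and 4 events.
p_value = float(fisher_two_sided(2, 100, 4, 100))

print(f"ARR = {float(arr):.2%}, RRR = {float(rrr):.0%}, NNT = {nnt}")
print(f"Fisher exact p (assumed 100 per arm) = {p_value:.3f}")
```

Under that assumed split the difference rests on 2 versus 4 hospitalizations, which is why the sample size in part (2) matters for the claimed p < 0.001.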

---

## Winner

**GPT-OSS-120B** (OpenAI)
- Winner Score: 9.57
- Matrix Average: 8.78
- Total Judgments: 80

---

## Rankings

| Rank | Model | Provider | Avg Score | Judgments |
|------|-------|----------|-----------|----------|
| 1 | GPT-OSS-120B | OpenAI | 9.57 | 9 |
| 2 | Claude Sonnet 4.6 | OpenRouter | 9.23 | 9 |
| 3 | Grok 4.20 | OpenRouter | 9.22 | 9 |
| 4 | GPT-5.4 | OpenRouter | 9.16 | 9 |
| 5 | Gemini 3 Flash Preview | Google | 9.03 | 8 |
| 6 | Claude Opus 4.6 | OpenRouter | 8.97 | 9 |
| 7 | DeepSeek V4 | OpenRouter | 8.61 | 9 |
| 8 | MiMo-V2-Flash | Xiaomi | 7.75 | 9 |
| 9 | Gemini 3.1 Pro | OpenRouter | 7.51 | 9 |

---

## 10×10 Judgment Matrix

Rows = Judge, Columns = Respondent. Self-judgments excluded (—). Cells marked · have no recorded score.

| Judge ↓ / Resp → | Gemini 3.1 Pro | Claude Opus | MiMo-V2-Flash | GPT-5.4 | DeepSeek V4 | Claude Sonnet | Grok 4.20 | GPT-OSS-120B | Gemini 3 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | — | 9.3 | 7.0 | 10.0 | 7.9 | 10.0 | 10.0 | 10.0 | · | · |
| Claude Opus | 7.5 | — | 6.9 | 9.2 | 8.3 | 8.9 | 9.2 | 9.4 | 9.2 | · |
| MiMo-V2-Flash | 8.3 | 9.0 | — | 8.8 | 9.2 | 9.6 | 9.6 | 9.6 | 9.2 | · |
| GPT-5.4 | 6.5 | 8.2 | 6.5 | — | 8.8 | 8.8 | 9.0 | 9.6 | 8.8 | · |
| DeepSeek V4 | 8.6 | 8.8 | 8.7 | 8.7 | — | 9.8 | 8.8 | 9.4 | 9.2 | · |
| Claude Sonnet | 7.3 | 9.2 | 8.3 | 9.2 | 8.6 | — | 9.0 | 9.6 | 8.8 | · |
| Grok 4.20 | 8.1 | 9.0 | 8.8 | 8.8 | 8.8 | 8.6 | — | 8.8 | 8.8 | · |
| GPT-OSS-120B | 6.5 | 8.6 | 6.8 | 8.8 | 8.3 | 8.7 | 8.4 | — | 8.8 | · |
| Gemini 3 | 8.1 | 10.0 | 9.6 | 10.0 | 9.8 | 10.0 | 10.0 | 10.0 | — | · |
| MiniMax M2.5 | 6.9 | 8.8 | 7.3 | 9.0 | 7.9 | 8.8 | 9.0 | 9.8 | 9.4 | — |
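
The Rankings table can be approximately reproduced from this matrix. The sketch below assumes that each model's Avg Score is the unweighted mean of the scores in its column and that Matrix Average is the mean of all recorded cells; the report does not state either rule explicitly. The variable names are illustrative. Because the matrix shows scores rounded to one decimal, the recomputed averages can differ from the published ones in the second decimal place.

```python
from statistics import mean

# Respondent order, copied from the matrix header row.
respondents = [
    "Gemini 3.1 Pro", "Claude Opus", "MiMo-V2-Flash", "GPT-5.4", "DeepSeek V4",
    "Claude Sonnet", "Grok 4.20", "GPT-OSS-120B", "Gemini 3", "MiniMax M2.5",
]

# One row per judge. None stands for a self-judgment or a cell with no score.
N = None
scores_by_judge = {
    "Gemini 3.1 Pro": [N, 9.3, 7.0, 10.0, 7.9, 10.0, 10.0, 10.0, N, N],
    "Claude Opus":    [7.5, N, 6.9, 9.2, 8.3, 8.9, 9.2, 9.4, 9.2, N],
    "MiMo-V2-Flash":  [8.3, 9.0, N, 8.8, 9.2, 9.6, 9.6, 9.6, 9.2, N],
    "GPT-5.4":        [6.5, 8.2, 6.5, N, 8.8, 8.8, 9.0, 9.6, 8.8, N],
    "DeepSeek V4":    [8.6, 8.8, 8.7, 8.7, N, 9.8, 8.8, 9.4, 9.2, N],
    "Claude Sonnet":  [7.3, 9.2, 8.3, 9.2, 8.6, N, 9.0, 9.6, 8.8, N],
    "Grok 4.20":      [8.1, 9.0, 8.8, 8.8, 8.8, 8.6, N, 8.8, 8.8, N],
    "GPT-OSS-120B":   [6.5, 8.6, 6.8, 8.8, 8.3, 8.7, 8.4, N, 8.8, N],
    "Gemini 3":       [8.1, 10.0, 9.6, 10.0, 9.8, 10.0, 10.0, 10.0, N, N],
    "MiniMax M2.5":   [6.9, 8.8, 7.3, 9.0, 7.9, 8.8, 9.0, 9.8, 9.4, N],
}

# Collect the scores each respondent received (one matrix column each).
received = {name: [] for name in respondents}
for row in scores_by_judge.values():
    for name, score in zip(respondents, row):
        if score is not None:
            received[name].append(score)

# Rank respondents that received at least one score, highest mean first.
ranking = sorted(
    ((name, mean(vals), len(vals)) for name, vals in received.items() if vals),
    key=lambda item: item[1],
    reverse=True,
)

all_scores = [score for vals in received.values() for score in vals]
total_judgments = len(all_scores)
matrix_average = mean(all_scores)

for rank, (name, avg, count) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {avg:.2f} ({count} judgments)")
print(f"Total judgments: {total_judgments}, matrix average: {matrix_average:.2f}")
```

Under these assumptions the recomputed order matches the Rankings table, and the judgment counts (9 per model, 8 for Gemini 3, 80 in total) match the published figures.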

---

## Methodology

- **10×10 Blind Peer Matrix:** All models answer the same question, then every model judges the other models' responses.
- **5 Criteria:** Correctness, completeness, clarity, depth, usefulness (each scored 1–10).
- **Self-judgments excluded:** Models do not judge their own responses.
- **Weighted Score:** Composite of all 5 criteria.

---

## Citation

The Multivac (2026). Blind Peer Evaluation: ANALYSIS-017. app.themultivac.com

## License

Open data. Free to use, share, and build upon. Please cite The Multivac when using this data.

Download raw JSON: https://app.themultivac.com/api/evaluations/EVAL-20260402-195402/results
Full dataset: https://app.themultivac.com/dashboard/export