analysis
Apr 02, 2026 · ANALYSIS-017
A pharmaceutical company reports: 'Our drug reduced hospitalization by 50% (p < 0.001). 2% of patients in the treatment group were hospitalized vs 4% in the control group.'
(1) Calculate the absolute risk reduction and NNT (number needed to treat).
(2) The trial had 200 patients. Is this enough for the claimed significance?
(3) The control group received no treatment (not a placebo). Why is this problematic?
(4) Side effects occurred in 8% of the treatment group. Should this drug be approved?
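The arithmetic behind parts (1) and (2) of the prompt can be sketched as follows. This is an illustrative sketch, not any respondent's answer. It assumes the 200 patients were split evenly into two arms of 100 (the prompt does not state the allocation), so 2% and 4% correspond to 2 and 4 hospitalizations. The helper names `risk_summary` and `fisher_two_sided` are hypothetical, and Fisher's exact test is one reasonable choice of test, not one named in the prompt.

```python
from fractions import Fraction
from math import comb


def risk_summary(control_rate, treatment_rate):
    """Absolute risk reduction, relative risk reduction, and NNT."""
    arr = control_rate - treatment_rate
    return {"arr": arr, "rrr": arr / control_rate, "nnt": 1 / arr}


def fisher_two_sided(events_a, n_a, events_b, n_b):
    """Two-sided Fisher exact p-value for a 2x2 table (exact arithmetic)."""
    total = n_a + n_b
    events = events_a + events_b
    denom = comb(total, n_a)

    def prob(k):
        # Hypergeometric probability of k events landing in arm A.
        return Fraction(comb(events, k) * comb(total - events, n_a - k), denom)

    observed = prob(events_a)
    lo, hi = max(0, events - n_b), min(events, n_a)
    # Sum every table that is at most as likely as the observed one.
    tail = sum(p for p in (prob(k) for k in range(lo, hi + 1)) if p <= observed)
    return float(tail)


# Rates quoted in the prompt: 4% control, 2% treatment.
summary = risk_summary(Fraction(4, 100), Fraction(2, 100))

# ASSUMPTION: equal allocation, 100 patients per arm.
p_value = fisher_two_sided(2, 100, 4, 100)

print(f"ARR = {float(summary['arr']):.2%}, RRR = {float(summary['rrr']):.0%}, "
      f"NNT = {float(summary['nnt']):.0f}, p = {p_value:.3f}")
```

Under these assumptions the ARR is 2 percentage points, the relative reduction is the quoted 50%, and the NNT is 50. With only 2 vs 4 events, the exact two-sided p-value comes out near 0.68, far from the reported p < 0.001.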
Winner: GPT-OSS-120B (OpenAI)
Winner score: 9.57 · matrix avg: 8.78
10×10 Judgment Matrix · 80 judgments
| Judge ↓ / Respondent → | Gemini 3.1 Pro | Claude Opus 4.6 | MiMo-V2-Flash | GPT-5.4 | DeepSeek V4 | Claude Sonnet 4.6 | Grok 4.20 | GPT-OSS-120B | Gemini 3 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | — | 9.3 | 7.0 | 10.0 | 7.9 | 10.0 | 10.0 | 10.0 | · | · |
| Claude Opus 4.6 | 7.5 | — | 6.9 | 9.2 | 8.3 | 8.9 | 9.2 | 9.4 | 9.2 | · |
| MiMo-V2-Flash | 8.3 | 9.0 | — | 8.8 | 9.2 | 9.6 | 9.6 | 9.6 | 9.2 | · |
| GPT-5.4 | 6.5 | 8.2 | 6.5 | — | 8.8 | 8.8 | 9.0 | 9.6 | 8.8 | · |
| DeepSeek V4 | 8.6 | 8.8 | 8.7 | 8.7 | — | 9.8 | 8.8 | 9.4 | 9.2 | · |
| Claude Sonnet 4.6 | 7.3 | 9.2 | 8.3 | 9.2 | 8.6 | — | 9.0 | 9.6 | 8.8 | · |
| Grok 4.20 | 8.1 | 9.0 | 8.8 | 8.8 | 8.8 | 8.6 | — | 8.8 | 8.8 | · |
| GPT-OSS-120B | 6.5 | 8.6 | 6.8 | 8.8 | 8.3 | 8.7 | 8.4 | — | 8.8 | · |
| Gemini 3 | 8.1 | 10.0 | 9.6 | 10.0 | 9.8 | 10.0 | 10.0 | 10.0 | — | · |
| MiniMax M2.5 | 6.9 | 8.8 | 7.3 | 9.0 | 7.9 | 8.8 | 9.0 | 9.8 | 9.4 | — |
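The page does not say how the headline numbers are derived from the matrix. A minimal sketch, assuming the winner score is the plain mean of the scores a respondent received (its column), with `—` (self-judgment) and `·` (no judgment) both treated as missing. The names `MODELS`, `SCORES`, and `respondent_means` are hypothetical; the values are copied from the table above.

```python
from statistics import mean

MODELS = [
    "Gemini 3.1 Pro", "Claude Opus 4.6", "MiMo-V2-Flash", "GPT-5.4",
    "DeepSeek V4", "Claude Sonnet 4.6", "Grok 4.20", "GPT-OSS-120B",
    "Gemini 3", "MiniMax M2.5",
]

# Rows are judges, columns are respondents, in the order of MODELS.
# None marks a self-judgment or a missing judgment.
SCORES = [
    [None, 9.3, 7.0, 10.0, 7.9, 10.0, 10.0, 10.0, None, None],
    [7.5, None, 6.9, 9.2, 8.3, 8.9, 9.2, 9.4, 9.2, None],
    [8.3, 9.0, None, 8.8, 9.2, 9.6, 9.6, 9.6, 9.2, None],
    [6.5, 8.2, 6.5, None, 8.8, 8.8, 9.0, 9.6, 8.8, None],
    [8.6, 8.8, 8.7, 8.7, None, 9.8, 8.8, 9.4, 9.2, None],
    [7.3, 9.2, 8.3, 9.2, 8.6, None, 9.0, 9.6, 8.8, None],
    [8.1, 9.0, 8.8, 8.8, 8.8, 8.6, None, 8.8, 8.8, None],
    [6.5, 8.6, 6.8, 8.8, 8.3, 8.7, 8.4, None, 8.8, None],
    [8.1, 10.0, 9.6, 10.0, 9.8, 10.0, 10.0, 10.0, None, None],
    [6.9, 8.8, 7.3, 9.0, 7.9, 8.8, 9.0, 9.8, 9.4, None],
]


def respondent_means(models, scores):
    """Mean score received by each respondent; None if it was never judged."""
    result = {}
    for name, column in zip(models, zip(*scores)):
        received = [s for s in column if s is not None]
        result[name] = mean(received) if received else None
    return result


means = respondent_means(MODELS, SCORES)
judgment_count = sum(s is not None for row in SCORES for s in row)
top = max((m for m in means if means[m] is not None), key=lambda m: means[m])

print(judgment_count, top, round(means[top], 3))
```

Under that assumption the GPT-OSS-120B column averages about 9.578 over 9 judgments, in line with the displayed 9.57, and the filled cells count to the stated 80 judgments.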