EVAL-20260402-195402
analysis
Apr 02, 2026 · ANALYSIS-017

A pharmaceutical company reports: 'Our drug reduced hospitalization by 50% (p < 0.001). 2% of patients in the treatment group were hospitalized vs 4% in the control group.' (1) Calculate the absolute risk reduction and NNT (number needed to treat). (2) The trial had 200 patients. Is this enough for the claimed significance? (3) The control group received no treatment (not a placebo). Why is this problematic? (4) Side effects occurred in 8% of the treatment group. Should this drug be approved?
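The arithmetic behind parts (1) and (2) can be sketched as follows. This is a minimal illustration, not a reference answer: it assumes the 200 patients were split evenly (100 per arm), which the prompt does not state, and it uses a two-sided Fisher exact test as one reasonable choice of significance test. The function name `fisher_exact_two_sided` is a helper written for this sketch.

```python
from math import comb

# Rates reported in the prompt.
risk_treatment = 0.02
risk_control = 0.04

# ASSUMPTION: the 200 patients are split equally between arms.
# The prompt gives only the total, not the allocation.
n_treatment = 100
n_control = 100

# (1) Absolute risk reduction, relative risk reduction, number needed to treat.
arr = risk_control - risk_treatment   # 2 percentage points
rrr = arr / risk_control              # the headline "50%"
nnt = 1 / arr                         # patients treated per hospitalization avoided

# (2) Event counts implied by the rates under the equal-split assumption.
events_treatment = round(risk_treatment * n_treatment)
events_control = round(risk_control * n_control)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table (with the same
    margins) that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def pmf(k):
        return comb(row1, k) * comb(row2, col1 - k) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = pmf(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        pmf(k)
        for k in range(lowest, highest + 1)
        if pmf(k) <= p_observed * (1 + 1e-9)
    )


p_value = fisher_exact_two_sided(
    events_treatment, n_treatment - events_treatment,
    events_control, n_control - events_control,
)

print(f"ARR = {arr:.2%}, RRR = {rrr:.0%}, NNT = {nnt:.0f}")
print(f"Events: {events_treatment} vs {events_control}, Fisher p = {p_value:.2f}")
```

Under these assumptions the comparison is 2 events versus 4 events, and the exact test gives a p-value of roughly 0.68, nowhere near the reported p < 0.001. For part (4), a number needed to harm cannot be computed the same way, because the prompt gives the side-effect rate only for the treatment group.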

Winner: GPT-OSS-120B (OpenAI)
Winner score: 9.57 (matrix avg: 8.78)
10×10 Judgment Matrix · 80 judgments
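| Judge ↓ / Respondent → | Gemini 3.1 Pro | Claude Opus 4.6 | MiMo-V2-Flash | GPT-5.4 | DeepSeek V4 | Claude Sonnet 4.6 | Grok 4.20 | GPT-OSS-120B | Gemini 3 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | | 9.3 | 7.0 | 10.0 | 7.9 | 10.0 | 10.0 | 10.0 | · | · |
| Claude Opus 4.6 | 7.5 | | 6.9 | 9.2 | 8.3 | 8.9 | 9.2 | 9.4 | 9.2 | · |
| MiMo-V2-Flash | 8.3 | 9.0 | | 8.8 | 9.2 | 9.6 | 9.6 | 9.6 | 9.2 | · |
| GPT-5.4 | 6.5 | 8.2 | 6.5 | | 8.8 | 8.8 | 9.0 | 9.6 | 8.8 | · |
| DeepSeek V4 | 8.6 | 8.8 | 8.7 | 8.7 | | 9.8 | 8.8 | 9.4 | 9.2 | · |
| Claude Sonnet 4.6 | 7.3 | 9.2 | 8.3 | 9.2 | 8.6 | | 9.0 | 9.6 | 8.8 | · |
| Grok 4.20 | 8.1 | 9.0 | 8.8 | 8.8 | 8.8 | 8.6 | | 8.8 | 8.8 | · |
| GPT-OSS-120B | 6.5 | 8.6 | 6.8 | 8.8 | 8.3 | 8.7 | 8.4 | | 8.8 | · |
| Gemini 3 | 8.1 | 10.0 | 9.6 | 10.0 | 9.8 | 10.0 | 10.0 | 10.0 | | · |
| MiniMax M2.5 | 6.9 | 8.8 | 7.3 | 9.0 | 7.9 | 8.8 | 9.0 | 9.8 | 9.4 | |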