reasoning
Mar 18, 2026 · EVAL-20260318-162600

A disease affects 1 in 10,000 people. A test is 99% sensitive (true positive rate) and 99.5% specific (true negative rate). A patient tests positive. (1) What is the probability they have the disease? (2) If they test positive twice with independent tests, what is the probability? (3) A doctor says 'You tested positive, so you almost certainly have it.' Critique this reasoning. (4) Design a testing protocol that achieves >95% positive predictive value.
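The arithmetic behind parts (1), (2), and (4) of the prompt can be sketched with the odds form of Bayes' theorem. This is a minimal illustration, not any respondent's answer: the function names are hypothetical, and it assumes repeated tests are conditionally independent given disease status, as the prompt states for part (2).

```python
def posterior_after_positives(prevalence, sensitivity, specificity, n_positives):
    """P(disease | n independent positive results), using odds-form Bayes."""
    prior_odds = prevalence / (1.0 - prevalence)
    # Positive likelihood ratio: how much one positive result multiplies the odds.
    likelihood_ratio = sensitivity / (1.0 - specificity)
    posterior_odds = prior_odds * likelihood_ratio ** n_positives
    return posterior_odds / (1.0 + posterior_odds)


def positives_needed(prevalence, sensitivity, specificity, target_ppv):
    """Smallest number of consecutive independent positives with PPV above target."""
    if sensitivity <= 1.0 - specificity:
        raise ValueError("test is uninformative: likelihood ratio must exceed 1")
    n = 0
    while posterior_after_positives(prevalence, sensitivity, specificity, n) <= target_ppv:
        n += 1
    return n


if __name__ == "__main__":
    # Values taken from the prompt: 1 in 10,000 prevalence, 99% sensitivity, 99.5% specificity.
    p, sens, spec = 1 / 10_000, 0.99, 0.995
    for n in (1, 2, 3):
        ppv = posterior_after_positives(p, sens, spec, n)
        print(f"{n} positive(s): PPV = {ppv:.4f}")
    print("positives needed for PPV > 0.95:", positives_needed(p, sens, spec, 0.95))
```

Under these assumptions a single positive result gives a PPV of roughly 2%, two independent positives give roughly 80%, and a third pushes the PPV above 95%. Real repeat tests are rarely fully independent, so a protocol built on this calculation would overstate the PPV wherever errors are correlated.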

Winner: GPT-5.4 (openrouter)
Winner score: 9.92
Matrix avg: 9.55
10×10 Judgment Matrix · 21 judgments
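| Judge ↓ / Respondent → | MiniMax M2.7 | MiniMax M2.5 | MiniMax M2.1 | MiniMax M2 | MiniMax M1 | MiniMax-01 | Claude Sonnet 4.6 | GPT-5.4 |
|---|---|---|---|---|---|---|---|---|
| MiniMax M2.7 | | · | · | · | · | 9.8 | 10.0 | 10.0 |
| MiniMax M2.5 | · | | · | · | · | 10.0 | 9.8 | 10.0 |
| MiniMax M2.1 | · | · | | · | · | 10.0 | 9.3 | 10.0 |
| MiniMax M2 | · | · | · | | · | 9.4 | 9.3 | 10.0 |
| MiniMax M1 | · | · | · | · | | 9.1 | 10.0 | 10.0 |
| MiniMax-01 | · | · | · | · | · | | 10.0 | 9.7 |
| Claude Sonnet 4.6 | · | · | · | · | · | 8.3 | | 9.8 |
| GPT-5.4 | · | · | · | · | · | 8.3 | 7.8 | |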
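The summary figures can be cross-checked against the matrix with a short script. This is a sketch under stated assumptions: the `judgments` dictionary is a hand transcription of the 21 displayed scores, the placement of each score under MiniMax-01, Claude Sonnet 4.6, or GPT-5.4 is inferred from the column order with self-judgments excluded, and the plain averaging shown here is assumed rather than confirmed as the site's scoring method.

```python
# judge -> {respondent: score}, transcribed from the displayed matrix.
# Column placement is an assumption (self-judgments treated as absent).
judgments = {
    "MiniMax M2.7": {"MiniMax-01": 9.8, "Claude Sonnet 4.6": 10.0, "GPT-5.4": 10.0},
    "MiniMax M2.5": {"MiniMax-01": 10.0, "Claude Sonnet 4.6": 9.8, "GPT-5.4": 10.0},
    "MiniMax M2.1": {"MiniMax-01": 10.0, "Claude Sonnet 4.6": 9.3, "GPT-5.4": 10.0},
    "MiniMax M2": {"MiniMax-01": 9.4, "Claude Sonnet 4.6": 9.3, "GPT-5.4": 10.0},
    "MiniMax M1": {"MiniMax-01": 9.1, "Claude Sonnet 4.6": 10.0, "GPT-5.4": 10.0},
    "MiniMax-01": {"Claude Sonnet 4.6": 10.0, "GPT-5.4": 9.7},
    "Claude Sonnet 4.6": {"MiniMax-01": 8.3, "GPT-5.4": 9.8},
    "GPT-5.4": {"MiniMax-01": 8.3, "Claude Sonnet 4.6": 7.8},
}


def respondent_means(matrix):
    """Average score each respondent received across all judges."""
    received = {}
    for row in matrix.values():
        for respondent, score in row.items():
            received.setdefault(respondent, []).append(score)
    return {name: sum(vals) / len(vals) for name, vals in received.items()}


def matrix_mean(matrix):
    """Average over every individual judgment in the matrix."""
    scores = [s for row in matrix.values() for s in row.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for name, mean in sorted(respondent_means(judgments).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {mean:.2f}")
    print(f"matrix avg: {matrix_mean(judgments):.2f}")
```

With this transcription the overall mean agrees with the reported matrix avg of 9.55, and GPT-5.4 has the highest respondent mean. Its mean comes out near 9.93 rather than the reported 9.92, which may reflect rounding of the displayed cells or a different aggregation rule.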