Evaluation EVAL-20260402-162422
reasoning
Apr 02, 2026 · REASON-012

A disease affects 1 in 10,000 people. A test for the disease is 99% sensitive (true positive rate) and 99.5% specific (true negative rate). A patient tests positive. (1) What is the probability they have the disease? (2) If they test positive twice with independent tests, what is the probability? (3) A doctor says 'You tested positive, so you almost certainly have it.' Critique this reasoning. (4) Design a testing protocol that achieves >95% positive predictive value.
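
For reference, the arithmetic behind parts (1), (2), and (4) of the prompt can be sketched with Bayes' rule. This is a minimal sketch, not taken from any respondent's answer: the function name `posterior` and the constants are illustrative, and it assumes repeat tests are conditionally independent given disease status, as the prompt states for part (2).

```python
def posterior(prior, sensitivity, specificity, positives=1):
    """P(disease | `positives` independent positive results), via Bayes' rule.

    Each positive result updates the running probability; under the
    independence assumption this equals the joint-likelihood calculation.
    """
    p = prior
    for _ in range(positives):
        true_pos = sensitivity * p                      # P(+ and disease)
        false_pos = (1.0 - specificity) * (1.0 - p)     # P(+ and no disease)
        p = true_pos / (true_pos + false_pos)
    return p


# Values taken from the prompt.
PREVALENCE = 1 / 10_000
SENSITIVITY = 0.99
SPECIFICITY = 0.995

ppv_by_count = {
    n: posterior(PREVALENCE, SENSITIVITY, SPECIFICITY, positives=n)
    for n in (1, 2, 3)
}

for n, ppv in ppv_by_count.items():
    print(f"{n} positive result(s): PPV = {ppv:.4f}")
```

Under these assumptions a single positive gives a positive predictive value of roughly 1.9%, two positives roughly 79.7%, and three positives roughly 99.9%. One protocol that clears the 95% bar in part (4) is therefore to require three independent positive results; in practice, independence between repeat tests is a strong assumption.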

Winner: DeepSeek V4 (openrouter)
Winner score: 8.93 · matrix avg: 7.92
10×10 Judgment Matrix · 48 judgments
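| Judge ↓ / Respondent → | DeepSeek V4 | MiMo-V2-Flash | GPT-OSS-120B | Claude Opus 4.6 | GPT-5.4 | Grok 4.20 | Claude Sonnet 4.6 | Gemini 2.5 Flash | MiniMax M2.5 | Gemini 3.1 Pro |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek V4 |  | · | · | · | · | · | · | 8.1 | 8.1 | · |
| MiMo-V2-Flash | 9.4 |  | 8.6 | 10.0 | 9.8 | 8.0 | 10.0 | 8.6 | 8.3 | 2.6 |
| GPT-OSS-120B | 9.1 | · |  | 8.1 | 9.1 | · | 8.8 | · | · | 3.2 |
| Claude Opus 4.6 | 7.8 | · | · |  | · | · | 9.4 | 1.3 | · | · |
| GPT-5.4 | 8.6 | · | · | · |  | · | · | · | · | · |
| Grok 4.20 | 8.8 | 8.8 | 8.8 | 8.8 | 8.8 |  | 9.0 | 8.6 | 8.6 | 1.1 |
| Claude Sonnet 4.6 | 8.4 | · | · | 9.0 | 9.8 | · |  | · | · | 1.6 |
| Gemini 2.5 Flash | 10.0 | 9.0 | 9.0 | 9.4 | 10.0 | 9.0 | 10.0 |  | 9.0 | 8.4 |
| MiniMax M2.5 | 9.4 | · | · | 9.7 | 9.8 | 1.4 | 9.8 | · |  | 1.6 |
| Gemini 3.1 Pro | · | · | · | · | · | · | · | · | · |  |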
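
One plausible reading of the winner score is that it is the mean of the scores DeepSeek V4 received from the other judges. The page does not state how the score is computed, so the sketch below is an assumption: it takes the first value in each judge row of the matrix as the score given to DeepSeek V4, and the variable names are illustrative.

```python
from statistics import mean

# Scores given to DeepSeek V4, read from the first value of each judge row.
# Assumption: DeepSeek V4 does not judge itself and Gemini 3.1 Pro gave no score.
scores_for_deepseek_v4 = {
    "MiMo-V2-Flash": 9.4,
    "GPT-OSS-120B": 9.1,
    "Claude Opus 4.6": 7.8,
    "GPT-5.4": 8.6,
    "Grok 4.20": 8.8,
    "Claude Sonnet 4.6": 8.4,
    "Gemini 2.5 Flash": 10.0,
    "MiniMax M2.5": 9.4,
}

deepseek_mean = mean(scores_for_deepseek_v4.values())
print(f"Mean of {len(scores_for_deepseek_v4)} judgments: {deepseek_mean:.4f}")
```

The mean of these eight displayed values is about 8.94, close to the reported winner score of 8.93. The small gap may come from truncation or from unrounded underlying judgments; the page does not say which.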