reasoning
Mar 17, 2026 · EVAL-20260317-023315

A disease affects 1 in 10,000 people. A test is 99% sensitive (true positive rate) and 99.5% specific (true negative rate). A patient tests positive. (1) What is the probability they have the disease? (2) If they test positive twice with independent tests, what is the probability? (3) A doctor says 'You tested positive, so you almost certainly have it.' Critique this reasoning. (4) Design a testing protocol that achieves >95% positive predictive value.
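The arithmetic behind parts (1), (2), and (4) of the prompt can be sketched with Bayes' rule in odds form. This is a minimal illustration rather than any respondent's answer: the function names `posterior_after_positives` and `positives_needed` are hypothetical, and the sketch assumes that repeated tests are conditionally independent given disease status, which the prompt stipulates but real assays often violate.

```python
def posterior_after_positives(prevalence, sensitivity, specificity, n_positives=1):
    """Probability of disease after n independent positive results.

    Uses the odds form of Bayes' rule: posterior odds = prior odds * LR^n,
    where LR = sensitivity / (1 - specificity) is the positive likelihood ratio.
    """
    prior_odds = prevalence / (1.0 - prevalence)
    likelihood_ratio = sensitivity / (1.0 - specificity)
    posterior_odds = prior_odds * likelihood_ratio ** n_positives
    return posterior_odds / (1.0 + posterior_odds)


def positives_needed(prevalence, sensitivity, specificity, target_ppv=0.95):
    """Smallest number of consecutive independent positives whose PPV exceeds the target."""
    if sensitivity <= 1.0 - specificity:
        # A positive result carries no evidence for disease; the target is unreachable.
        raise ValueError("test has a positive likelihood ratio <= 1")
    n = 1
    while posterior_after_positives(prevalence, sensitivity, specificity, n) <= target_ppv:
        n += 1
    return n


# Values taken from the prompt: 1 in 10,000 prevalence, 99% sensitivity, 99.5% specificity.
PREVALENCE, SENSITIVITY, SPECIFICITY = 1 / 10_000, 0.99, 0.995

for n in (1, 2, 3):
    ppv = posterior_after_positives(PREVALENCE, SENSITIVITY, SPECIFICITY, n)
    print(f"{n} positive(s): PPV = {ppv:.4f}")

print("positives needed for >95% PPV:",
      positives_needed(PREVALENCE, SENSITIVITY, SPECIFICITY))
```

Under these assumptions a single positive gives a PPV of roughly 2%, two positives roughly 80%, and three are needed to exceed 95%. This is why the doctor's "almost certainly" in part (3) confuses the test's sensitivity with its positive predictive value.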

Winner: Qwen 3.5 397B-A17B (openrouter)
Winner score: 10.00 (matrix avg: 9.80)
10×10 Judgment Matrix · 24 judgments
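| Judge ↓ / Respondent → | Qwen 3 8B | Qwen 3 32B | Qwen 3 Coder Next | Qwen 3.5 27B | Qwen 3.5 122B-A10B | Qwen 3.5 397B-A17B | Qwen 3.5 35B-A3B |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen 3 8B | | · | 9.6 | 9.3 | · | 10.0 | 10.0 |
| Qwen 3 32B | 9.8 | | 10.0 | 10.0 | · | 10.0 | 10.0 |
| Qwen 3 Coder Next | 10.0 | · | | 10.0 | · | 10.0 | 10.0 |
| Qwen 3.5 27B | 10.0 | · | 9.2 | | · | · | 10.0 |
| Qwen 3.5 122B-A10B | 9.8 | · | 8.7 | 10.0 | | 10.0 | 9.3 |
| Qwen 3.5 397B-A17B | 10.0 | · | · | 10.0 | · | | 9.8 |
| Qwen 3.5 35B-A3B | · | · | · | · | · | · | |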