reasoning
Mar 15, 2026 · EVAL-20260315-055905

A disease affects 1 in 10,000 people. A test is 99% sensitive (true positive rate) and 99.5% specific (true negative rate). A patient tests positive. (1) What is the probability they have the disease? (2) If they test positive twice with independent tests, what is the probability? (3) A doctor says 'You tested positive, so you almost certainly have it.' Critique this reasoning. (4) Design a testing protocol that achieves >95% positive predictive value.
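
The arithmetic behind parts (1), (2) and (4) can be checked with a short Bayes' rule calculation. The sketch below is illustrative only and is not taken from any respondent's answer; the function names are hypothetical, and it assumes repeated tests are conditionally independent, as part (2) stipulates. Under those assumptions one positive result gives roughly a 1.9% chance of disease, two give roughly 80%, and a third is needed to pass 95%. Requiring several independent positives is only one possible protocol for part (4).

```python
def update(prior: float, sensitivity: float, specificity: float) -> float:
    """Posterior P(disease | one more positive result), by Bayes' rule."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


def ppv_after(n_positives: int,
              prior: float = 1 / 10_000,
              sensitivity: float = 0.99,
              specificity: float = 0.995) -> float:
    """Positive predictive value after n independent positive results."""
    p = prior
    for _ in range(n_positives):
        # Each positive result turns the previous posterior into the new prior.
        p = update(p, sensitivity, specificity)
    return p


def positives_needed(target: float,
                     prior: float = 1 / 10_000,
                     sensitivity: float = 0.99,
                     specificity: float = 0.995) -> int:
    """Smallest number of independent positives whose PPV exceeds target."""
    n, p = 0, prior
    while p <= target:
        p = update(p, sensitivity, specificity)
        n += 1
    return n


if __name__ == "__main__":
    print(f"PPV after one positive:  {ppv_after(1):.4f}")
    print(f"PPV after two positives: {ppv_after(2):.4f}")
    print(f"Positives needed for PPV > 0.95: {positives_needed(0.95)}")
```

The low single-test figure comes from the base rate: at a prevalence of 1 in 10,000, false positives from the healthy majority far outnumber true positives, which is the core of the critique asked for in part (3).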

Winner: Gemma 3 27B (openrouter)
Winner score: 9.59
Matrix avg: 8.79
10×10 Judgment Matrix · 65 judgments
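| Judge ↓ / Respondent → | Qwen 3 32B | Devstral Small | Gemma 3 27B | Llama 4 Scout | Phi-4 14B | Granite 4.0 Micro | Qwen 3 8B | Mistral Nemo 12B | Llama 3.1 8B | Kimi K2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen 3 32B |  | · | 10.0 | 8.4 | 9.0 | 7.8 | 10.0 | 7.5 | 7.6 | · |
| Devstral Small | · |  | 9.8 | 8.1 | 9.8 | 9.4 | 9.8 | 8.6 | 6.7 | · |
| Gemma 3 27B | · | 8.3 |  | 9.3 | 8.3 | 9.3 | 9.8 | 8.8 | 6.8 | · |
| Llama 4 Scout | · | 9.8 | 9.8 |  | 9.8 | 9.4 | 9.8 | 9.4 | 8.3 | · |
| Phi-4 14B | · | 10.0 | 10.0 | 8.9 |  | 10.0 | 10.0 | 9.0 | 9.6 | · |
| Granite 4.0 Micro | 8.3 | 8.8 | 8.8 | 8.8 | 8.8 |  | 8.8 | 8.8 | 8.2 | 8.3 |
| Qwen 3 8B | · | 7.8 | 9.4 | 9.3 | 8.8 | 7.4 |  | 7.4 | 7.5 | · |
| Mistral Nemo 12B | · | 9.7 | 9.1 | 8.3 | 8.8 | 8.7 | 9.2 |  | 8.7 | · |
| Llama 3.1 8B | · | 8.8 | 9.8 | 8.8 | 9.1 | 9.6 | 8.8 | 9.3 |  | · |
| Kimi K2.5 | · | · | · | · | · | · | · | · | · |  |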
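
The page does not say how the winner score is computed. One reading that is consistent with the matrix is that it is the mean of the scores a respondent received from the other judges. The sketch below checks that reading for Gemma 3 27B, using the eight values read off its column of the matrix; the variable and function names are hypothetical, and the formula itself is an assumption rather than a documented method.

```python
from statistics import mean

# Scores received by Gemma 3 27B, keyed by judge, as read from the
# Gemma 3 27B column of the judgment matrix. Kimi K2.5 recorded no
# judgment, and Gemma's own row is excluded.
gemma_received = {
    "Qwen 3 32B": 10.0,
    "Devstral Small": 9.8,
    "Llama 4 Scout": 9.8,
    "Phi-4 14B": 10.0,
    "Granite 4.0 Micro": 8.8,
    "Qwen 3 8B": 9.4,
    "Mistral Nemo 12B": 9.1,
    "Llama 3.1 8B": 9.8,
}


def respondent_score(scores: dict[str, float]) -> float:
    """Assumed scoring rule: plain mean of all judge scores received."""
    return mean(scores.values())


if __name__ == "__main__":
    print(f"Gemma 3 27B mean received score: {respondent_score(gemma_received):.2f}")
```

The result is about 9.59, which agrees with the reported winner score to two decimal places.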