code
Mar 15, 2026 · EVAL-20260315-043330

This distributed lock implementation has a subtle race condition...

Winner: Qwen 3 8B (openrouter)
Winner score: 9.33
Matrix avg: 7.59
10×10 Judgment Matrix · 75 judgments
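| Judge ↓ / Respondent → | Qwen 3 32B | Kimi K2.5 | Devstral Small | Gemma 3 27B | Llama 4 Scout | Phi-4 14B | Granite 4.0 Micro | Qwen 3 8B | Mistral Nemo 12B | Llama 3.1 8B |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen 3 32B |  | 10.0 | 8.8 | 9.0 | 7.8 | 8.3 | 8.3 | 10.0 | 6.5 | 8.5 |
| Kimi K2.5 | · |  | · | 8.8 | · | · | · | · | · | 4.8 |
| Devstral Small | · | 9.3 |  | 8.4 | 7.6 | 8.1 | 8.1 | 9.4 | 8.4 | 2.0 |
| Gemma 3 27B | 1.0 | 9.2 | 9.4 |  | 8.3 | · | 8.8 | 9.6 | 9.0 | 7.5 |
| Llama 4 Scout | · | 8.8 | 9.3 | 4.8 |  | 8.6 | 8.3 | 10.0 | 8.4 | 1.8 |
| Phi-4 14B | · | 9.8 | 8.6 | 8.7 | 8.8 |  | 8.6 | 9.4 | 8.4 | 7.5 |
| Granite 4.0 Micro | · | 8.8 | 8.8 | 8.6 | 7.8 | 8.8 |  | 8.8 | 8.4 | 6.5 |
| Qwen 3 8B | · | 9.6 | 9.6 | 8.7 | 8.8 | 9.1 | 8.1 |  | 8.4 | 6.0 |
| Mistral Nemo 12B | · | 8.7 | 8.1 | 8.4 | 7.8 | 8.1 | 8.4 | 8.6 |  | 5.7 |
| Llama 3.1 8B | · | 9.6 | 9.6 | 8.9 | 8.6 | 8.8 | 8.6 | 8.8 | 8.8 |  |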
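The page does not say how the winner score and matrix average are derived from the matrix. The sketch below is one plausible reading, not the site's documented method. It assumes that each judge row omits the judge's own column, that "·" marks a missing judgment, that a respondent's score is the mean of the judgments it received, and that the matrix average is the mean of those per-respondent scores. The names `MODELS`, `MATRIX`, and `respondent_scores` are illustrative. Under these assumptions the figures come out consistent with the reported 9.33 and 7.59.

```python
from statistics import mean

MODELS = [
    "Qwen 3 32B", "Kimi K2.5", "Devstral Small", "Gemma 3 27B",
    "Llama 4 Scout", "Phi-4 14B", "Granite 4.0 Micro", "Qwen 3 8B",
    "Mistral Nemo 12B", "Llama 3.1 8B",
]

_ = None  # a missing judgment ("·") or the judge's own column (assumed excluded)

# Rows are judges, columns are respondents, both in the order of MODELS.
MATRIX = [
    [_,   10.0, 8.8, 9.0, 7.8, 8.3, 8.3, 10.0, 6.5, 8.5],
    [_,   _,    _,   8.8, _,   _,   _,   _,    _,   4.8],
    [_,   9.3,  _,   8.4, 7.6, 8.1, 8.1, 9.4,  8.4, 2.0],
    [1.0, 9.2,  9.4, _,   8.3, _,   8.8, 9.6,  9.0, 7.5],
    [_,   8.8,  9.3, 4.8, _,   8.6, 8.3, 10.0, 8.4, 1.8],
    [_,   9.8,  8.6, 8.7, 8.8, _,   8.6, 9.4,  8.4, 7.5],
    [_,   8.8,  8.8, 8.6, 7.8, 8.8, _,   8.8,  8.4, 6.5],
    [_,   9.6,  9.6, 8.7, 8.8, 9.1, 8.1, _,    8.4, 6.0],
    [_,   8.7,  8.1, 8.4, 7.8, 8.1, 8.4, 8.6,  _,   5.7],
    [_,   9.6,  9.6, 8.9, 8.6, 8.8, 8.6, 8.8,  8.8, _],
]


def respondent_scores(matrix, models):
    """Mean score received by each respondent, skipping empty cells."""
    scores = {}
    for col, name in enumerate(models):
        received = [row[col] for row in matrix if row[col] is not None]
        if received:  # a respondent with no judgments gets no score
            scores[name] = mean(received)
    return scores


scores = respondent_scores(MATRIX, MODELS)
judgment_count = sum(cell is not None for row in MATRIX for cell in row)
winner = max(scores, key=scores.get)
matrix_avg = mean(scores.values())

print(f"judgments:    {judgment_count}")
print(f"winner:       {winner} ({scores[winner]:.3f})")
print(f"matrix avg:   {matrix_avg:.3f}")
```

Note that Qwen 3 32B's score rests on a single judgment (1.0 from Gemma 3 27B), so under this reading it pulls the matrix average down considerably; a plain mean over all 75 cells would give a different figure.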