Evaluation: EVAL-20260402-195829
analysis
Apr 02, 2026 · ANALYSIS-018

A 5-year-old codebase has: 45% test coverage, 200 known bugs (50 critical), 15 engineers maintaining it, average deployment takes 4 hours, 3 production incidents/month. A complete rewrite is estimated at 12 months with 8 engineers. (1) Should you rewrite or refactor incrementally? (2) Calculate the cost of technical debt using downtime and developer productivity. (3) Design a 6-month plan that reduces critical bugs by 80% without a rewrite. (4) When IS a rewrite justified?

Winner: Grok 4.20 (openrouter)
Winner score: 9.14
Matrix avg: 7.92
10×10 Judgment Matrix · 87 judgments
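| Judge ↓ / Respondent → | MiMo-V2-Flash | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.4 | DeepSeek V4 | Claude Sonnet 4.6 | Grok 4.20 | GPT-OSS-120B | Gemini 3 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| MiMo-V2-Flash |  | 7.8 | 8.2 | 8.6 | 8.6 | 8.1 | 9.3 | 8.6 | 8.6 | 9.3 |
| Gemini 3.1 Pro | 6.5 |  | 6.0 | 7.3 | 9.0 | 7.3 | 9.8 | 7.3 | 9.2 | 9.2 |
| Claude Opus 4.6 | 7.0 | 6.1 |  | 8.2 | 7.3 | 7.2 | 9.6 | 8.0 | 8.3 | 8.8 |
| GPT-5.4 | 5.8 | 3.5 | 4.5 |  | 6.8 | · | 8.6 | 7.0 | 8.0 | 8.0 |
| DeepSeek V4 | 9.0 | 8.6 | 8.7 | 9.0 |  | 8.8 | 9.0 | 9.0 | 8.8 | 8.8 |
| Claude Sonnet 4.6 | 7.8 | 6.4 | 8.3 | 8.8 | 8.2 |  | · | 8.2 | 8.6 | 8.8 |
| Grok 4.20 | 7.8 | 7.5 | 7.4 | 8.6 | 8.0 | 7.8 |  | 8.0 | 8.0 | 8.6 |
| GPT-OSS-120B | 6.3 | 4.5 | 5.3 | 8.6 | 7.8 | 3.4 | 8.6 |  | 7.8 | 8.7 |
| Gemini 3 | 9.0 | · | 8.8 | 9.6 | 9.2 | 9.0 | 9.8 | 9.6 |  | 9.6 |
| MiniMax M2.5 | 7.3 | 5.8 | 6.3 | 8.0 | 8.2 | 7.0 | 8.6 | 7.8 | 8.3 |  |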
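
The judgment matrix above can be aggregated with a few lines of Python. This is a minimal sketch, not the page's actual scoring method, which is not stated. It assumes each respondent's score is the plain mean of the scores it received, that the empty diagonal is an excluded self-judgment, and that "·" marks a missing judgment. The names `MODELS`, `MATRIX`, `respondent_means`, `judgment_count` and `overall_mean` are hypothetical. Because the displayed cells are rounded to one decimal, the recomputed figures (about 9.16 for Grok 4.20 and 7.94 overall) are close to, but not identical with, the 9.14 and 7.92 shown.

```python
from statistics import mean

MODELS = [
    "MiMo-V2-Flash", "Gemini 3.1 Pro", "Claude Opus 4.6", "GPT-5.4",
    "DeepSeek V4", "Claude Sonnet 4.6", "Grok 4.20", "GPT-OSS-120B",
    "Gemini 3", "MiniMax M2.5",
]

# Rows are judges, columns are respondents, both in MODELS order.
# None marks a cell without a score: the diagonal (assumed to be an
# excluded self-judgment) and the three cells shown as "·".
N = None
MATRIX = [
    [N,   7.8, 8.2, 8.6, 8.6, 8.1, 9.3, 8.6, 8.6, 9.3],
    [6.5, N,   6.0, 7.3, 9.0, 7.3, 9.8, 7.3, 9.2, 9.2],
    [7.0, 6.1, N,   8.2, 7.3, 7.2, 9.6, 8.0, 8.3, 8.8],
    [5.8, 3.5, 4.5, N,   6.8, N,   8.6, 7.0, 8.0, 8.0],
    [9.0, 8.6, 8.7, 9.0, N,   8.8, 9.0, 9.0, 8.8, 8.8],
    [7.8, 6.4, 8.3, 8.8, 8.2, N,   N,   8.2, 8.6, 8.8],
    [7.8, 7.5, 7.4, 8.6, 8.0, 7.8, N,   8.0, 8.0, 8.6],
    [6.3, 4.5, 5.3, 8.6, 7.8, 3.4, 8.6, N,   7.8, 8.7],
    [9.0, N,   8.8, 9.6, 9.2, 9.0, 9.8, 9.6, N,   9.6],
    [7.3, 5.8, 6.3, 8.0, 8.2, 7.0, 8.6, 7.8, 8.3, N],
]


def respondent_means(matrix):
    """Mean score each respondent received, skipping empty cells."""
    means = {}
    for col, name in enumerate(MODELS):
        scores = [row[col] for row in matrix if row[col] is not None]
        means[name] = mean(scores)
    return means


def judgment_count(matrix):
    """Number of cells that actually hold a score."""
    return sum(1 for row in matrix for cell in row if cell is not None)


def overall_mean(matrix):
    """Mean over every scored cell in the matrix."""
    return mean(cell for row in matrix for cell in row if cell is not None)


if __name__ == "__main__":
    means = respondent_means(MATRIX)
    # Print respondents from highest to lowest recomputed mean.
    for name, score in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<18} {score:.2f}")
    print(f"judgments: {judgment_count(MATRIX)}")
    print(f"overall mean: {overall_mean(MATRIX):.2f}")
```

Under these assumptions the cell count comes out at 87, matching the figure in the matrix title, and Grok 4.20 has the highest column mean, consistent with the winner shown above.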