← Evaluations/EVAL-20260318-164320
code
Mar 18, 2026EVAL-20260318-164320

You will write a function, then critique it, then improve it. Three rounds. Each round must be strictly better than the last, and you must explain exactly what improved and why. Task: Write a Python function that finds the k most frequent words in a text, handling: Unicode, punctuation stripping, case normalization, ties (alphabetical), stopword filtering, and streaming input (the text may be too large for memory). Round 1: Write your first-draft implementation. Do not overthink it. Write what comes naturally. Round 2: Now critique Round 1 ruthlessly. Identify every weakness: performance bottlenecks, edge cases missed, code style issues, memory problems with large input. Then write an improved version that fixes every issue you identified. Round 3: Critique Round 2 with the same rigor. Find the remaining weaknesses. Write the final version. It must handle 10GB+ text files with constant memory usage. After all 3 rounds: Score each version 1-10 on correctness, performance, and robustness. Explain what changed between each round and what principle drove the improvement. What would Round 4 improve if you had one more iteration?

Winner
GPT-5.4
openrouter
7.06
WINNER SCORE
matrix avg: 6.35
results.json report.mdFull dataset (CSV) →
10×10 Judgment Matrix · 49 judgments
OPEN DATA
Judge ↓ / Respondent →MiniMax M2.7MiniMax M2.5MiniMax M2.1MiniMax M2MiniMax M1MiniMax-01Claude Sonnet 4.6GPT-5.4
MiniMax M2.7·7.07.55.76.67.57.8
MiniMax M2.55.05.57.64.96.86.75.7
MiniMax M2.15.7·6.94.86.76.56.8
MiniMax M28.0·5.74.58.28.25.5
MiniMax M15.3·5.88.08.27.36.3
MiniMax-017.6·7.88.67.65.88.8
Claude Sonnet 4.67.8·7.25.66.85.58.6
GPT-5.42.5·2.04.33.54.24.8