code
Mar 18, 2026 · EVAL-20260318-164320

You will write a function, then critique it, then improve it. Three rounds. Each round must be strictly better than the last, and you must explain exactly what improved and why.

Task: Write a Python function that finds the k most frequent words in a text, handling: Unicode, punctuation stripping, case normalization, ties (alphabetical), stopword filtering, and streaming input (the text may be too large for memory).

Round 1: Write your first-draft implementation. Do not overthink it. Write what comes naturally.

Round 2: Now critique Round 1 ruthlessly. Identify every weakness: performance bottlenecks, edge cases missed, code style issues, memory problems with large input. Then write an improved version that fixes every issue you identified.

Round 3: Critique Round 2 with the same rigor. Find the remaining weaknesses. Write the final version. It must handle 10GB+ text files with constant memory usage.

After all 3 rounds: Score each version 1-10 on correctness, performance, and robustness. Explain what changed between each round and what principle drove the improvement. What would Round 4 improve if you had one more iteration?
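For orientation, the requirements listed in the task can be sketched roughly as follows. This is an illustrative sketch only, not any respondent's answer: the names `top_k_words`, `WORD_RE`, and `DEFAULT_STOPWORDS` are hypothetical, the stopword list is a tiny placeholder, and a "word" is assumed to be a run of Unicode letters. Memory here grows with the number of distinct words rather than with the size of the text, so it is bounded by vocabulary, not strictly constant.

```python
import heapq
import re
import unicodedata
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: a "word" is a run of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")

# Assumption: placeholder stopword list; a real one would be much larger.
DEFAULT_STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in"})


def top_k_words(
    lines: Iterable[str],
    k: int,
    stopwords: frozenset = DEFAULT_STOPWORDS,
) -> List[Tuple[str, int]]:
    """Return the k most frequent words, ties broken alphabetically."""
    if k <= 0:
        return []
    counts: Counter = Counter()
    # Consume the input one line at a time so the full text is never held in memory.
    for line in lines:
        # NFKC composes decomposed characters; casefold() is Unicode-aware lowercasing.
        text = unicodedata.normalize("NFKC", line).casefold()
        # Matching only letter runs drops punctuation as a side effect.
        for word in WORD_RE.findall(text):
            if word not in stopwords:
                counts[word] += 1
    # Sort key: higher count first, then alphabetical order for ties.
    return heapq.nsmallest(k, counts.items(), key=lambda item: (-item[1], item[0]))


sample = ["The cat and the dog.", "A DOG, a cat; an Élan!", "élan bird"]
result = top_k_words(sample, 3)
```

An open text file can be passed directly as `lines`, since iterating a file object yields one line at a time. Input with no line breaks at all would still arrive as a single huge line, so a chunked reader would be needed in that case.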
Winner: GPT-5.4 (openrouter)
Winner score: 7.06 · matrix avg: 6.35
10×10 Judgment Matrix · 49 judgments
| Judge ↓ / Respondent → | MiniMax M2.7 | MiniMax M2.5 | MiniMax M2.1 | MiniMax M2 | MiniMax M1 | MiniMax-01 | Claude Sonnet 4.6 | GPT-5.4 |
|---|---|---|---|---|---|---|---|---|
| MiniMax M2.7 | — | · | 7.0 | 7.5 | 5.7 | 6.6 | 7.5 | 7.8 |
| MiniMax M2.5 | 5.0 | — | 5.5 | 7.6 | 4.9 | 6.8 | 6.7 | 5.7 |
| MiniMax M2.1 | 5.7 | · | — | 6.9 | 4.8 | 6.7 | 6.5 | 6.8 |
| MiniMax M2 | 8.0 | · | 5.7 | — | 4.5 | 8.2 | 8.2 | 5.5 |
| MiniMax M1 | 5.3 | · | 5.8 | 8.0 | — | 8.2 | 7.3 | 6.3 |
| MiniMax-01 | 7.6 | · | 7.8 | 8.6 | 7.6 | — | 5.8 | 8.8 |
| Claude Sonnet 4.6 | 7.8 | · | 7.2 | 5.6 | 6.8 | 5.5 | — | 8.6 |
| GPT-5.4 | 2.5 | · | 2.0 | 4.3 | 3.5 | 4.2 | 4.8 | — |