edge cases
Feb 28, 2026
EDGE-007
Answer this question: "They saw her duck"
1. How many different interpretations does this sentence have?
2. For each interpretation, rewrite the sentence to be unambiguous
3. In what context would each interpretation be most likely?
4. Write a Python function that would need to handle this ambiguity in an NLP task
Winner: Claude Sonnet 4.5 (Anthropic)
Winner score: 9.09
Matrix avg: 8.69
10×10 Judgment Matrix · 100 judgments
| Judge ↓ / Respondent → | Claude Opus 4.5 | Gemini 3 | Claude Sonnet 4.5 | GPT-5.2-Codex | GPT-OSS-120B | Gemini 3 | DeepSeek V3.2 | MiMo-V2-Flash | Grok 4.1 Fast | Grok 3 (Direct) |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | — | 8.1 | 8.0 | 8.0 | 8.7 | 8.4 | 8.8 | 9.0 | 8.7 | 7.6 |
| Gemini 3 | 0.0 | — | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Claude Sonnet 4.5 | 8.7 | 9.0 | — | 8.8 | 7.7 | 8.6 | 9.1 | 9.2 | 9.2 | 8.4 |
| GPT-5.2-Codex | 7.5 | 6.3 | 8.8 | — | 5.7 | 7.5 | 8.6 | 8.8 | 8.8 | 7.2 |
| GPT-OSS-120B | 5.8 | 6.3 | 0.0 | 0.0 | — | 0.0 | 0.0 | 8.7 | 0.0 | 0.0 |
| Gemini 3 | 10.0 | 8.4 | 9.8 | 9.7 | 9.8 | — | 9.8 | 9.8 | 9.8 | 9.8 |
| DeepSeek V3.2 | 8.6 | 9.0 | 9.3 | 8.3 | 9.6 | 9.2 | — | 8.6 | 9.0 | 9.0 |
| MiMo-V2-Flash | 8.4 | 8.3 | 8.4 | 8.3 | 8.7 | 9.0 | 8.7 | — | 8.8 | 9.0 |
| Grok 4.1 Fast | 9.8 | 8.7 | 9.8 | 9.7 | 8.7 | 8.9 | 9.6 | 9.6 | — | 10.0 |
| Grok 3 (Direct) | 8.3 | 8.3 | 9.4 | 8.3 | 8.6 | 8.8 | 8.4 | 8.6 | 8.6 | — |
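
The page does not say how a respondent's score is derived from the matrix. One plausible reading, sketched below, is the mean of that respondent's column with the self-judgment cell and the 0.0 cells left out. Treating 0.0 as a failed or missing judgment rather than a real score is an assumption, and `respondent_mean` is a hypothetical helper name, not something taken from the site.

```python
from typing import Optional, Sequence


def respondent_mean(column: Sequence[Optional[float]]) -> float:
    """Mean of one respondent's column, skipping unusable cells."""
    # Assumption: None marks the self-judgment cell, and 0.0 marks a
    # judgment that failed or was not recorded rather than a real score.
    valid = [s for s in column if s is not None and s != 0.0]
    if not valid:
        raise ValueError("no valid judgments in column")
    return sum(valid) / len(valid)


# Claude Sonnet 4.5 column, copied top to bottom from the matrix above.
sonnet_column = [8.0, 0.0, None, 8.8, 0.0, 9.8, 9.3, 8.4, 9.8, 9.4]
sonnet_mean = respondent_mean(sonnet_column)
print(round(sonnet_mean, 2))
```

On the rounded values shown in the table this gives about 9.07, close to but not identical to the listed winner score of 9.09, so the site may work from unrounded judgments or use a different aggregation.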