Evaluation EVAL-20260207-154902
edge cases
Feb 28, 2026 · EDGE-007

Answer this question: "They saw her duck"
1. How many different interpretations does this sentence have?
2. For each interpretation, rewrite the sentence to be unambiguous.
3. In what context would each interpretation be most likely?
4. Write a Python function that would need to handle this ambiguity in an NLP task.

Winner: Claude Sonnet 4.5 (Anthropic)
Winner score: 9.09
Matrix avg: 8.69
10×10 Judgment Matrix · 100 judgments
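| Judge ↓ / Respondent → | Claude Opus 4.5 | Gemini 3 | Claude Sonnet 4.5 | GPT-5.2-Codex | GPT-OSS-120B | Gemini 3 | DeepSeek V3.2 | MiMo-V2-Flash | Grok 4.1 Fast | Grok 3 (Direct) |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | - | 8.1 | 8.0 | 8.0 | 8.7 | 8.4 | 8.8 | 9.0 | 8.7 | 7.6 |
| Gemini 3 | 0.0 | - | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Claude Sonnet 4.5 | 8.7 | 9.0 | - | 8.8 | 7.7 | 8.6 | 9.1 | 9.2 | 9.2 | 8.4 |
| GPT-5.2-Codex | 7.5 | 6.3 | 8.8 | - | 5.7 | 7.5 | 8.6 | 8.8 | 8.8 | 7.2 |
| GPT-OSS-120B | 5.8 | 6.3 | 0.0 | 0.0 | - | 0.0 | 0.0 | 8.7 | 0.0 | 0.0 |
| Gemini 3 | 10.0 | 8.4 | 9.8 | 9.7 | 9.8 | - | 9.8 | 9.8 | 9.8 | 9.8 |
| DeepSeek V3.2 | 8.6 | 9.0 | 9.3 | 8.3 | 9.6 | 9.2 | - | 8.6 | 9.0 | 9.0 |
| MiMo-V2-Flash | 8.4 | 8.3 | 8.4 | 8.3 | 8.7 | 9.0 | 8.7 | - | 8.8 | 9.0 |
| Grok 4.1 Fast | 9.8 | 8.7 | 9.8 | 9.7 | 8.7 | 8.9 | 9.6 | 9.6 | - | 10.0 |
| Grok 3 (Direct) | 8.3 | 8.3 | 9.4 | 8.3 | 8.6 | 8.8 | 8.4 | 8.6 | 8.6 | - |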
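
The page does not say how the winner score and matrix average are derived from the judgment matrix. One plausible reading, sketched below, is that each respondent's score is the mean of the scores in its column, skipping the self-judgment cell and treating 0.0 as a failed judgment rather than a real score. Both of those rules are assumptions, as are the helper names `valid_scores`, `respondent_means`, and `matrix_mean`. Under these assumptions the displayed (rounded) values give roughly 9.07 for Claude Sonnet 4.5 and 8.68 overall, close to but not exactly the reported 9.09 and 8.69.

```python
from statistics import mean

# Rows are judges, columns are respondents, in the order shown in the matrix.
# None marks the self-judgment cell (assumed to be the value missing from each row).
MATRIX = [
    [None, 8.1, 8.0, 8.0, 8.7, 8.4, 8.8, 9.0, 8.7, 7.6],
    [0.0, None, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [8.7, 9.0, None, 8.8, 7.7, 8.6, 9.1, 9.2, 9.2, 8.4],
    [7.5, 6.3, 8.8, None, 5.7, 7.5, 8.6, 8.8, 8.8, 7.2],
    [5.8, 6.3, 0.0, 0.0, None, 0.0, 0.0, 8.7, 0.0, 0.0],
    [10.0, 8.4, 9.8, 9.7, 9.8, None, 9.8, 9.8, 9.8, 9.8],
    [8.6, 9.0, 9.3, 8.3, 9.6, 9.2, None, 8.6, 9.0, 9.0],
    [8.4, 8.3, 8.4, 8.3, 8.7, 9.0, 8.7, None, 8.8, 9.0],
    [9.8, 8.7, 9.8, 9.7, 8.7, 8.9, 9.6, 9.6, None, 10.0],
    [8.3, 8.3, 9.4, 8.3, 8.6, 8.8, 8.4, 8.6, 8.6, None],
]


def valid_scores(values):
    """Keep real scores only: drop self-judgments (None) and 0.0 entries.

    Treating 0.0 as a failed judgment is an assumption, not something the page states.
    """
    return [v for v in values if v is not None and v > 0.0]


def respondent_means(matrix):
    """Mean score per respondent, i.e. per column of the judge-by-respondent matrix."""
    return [mean(valid_scores(column)) for column in zip(*matrix)]


def matrix_mean(matrix):
    """Mean over every valid score in the matrix."""
    return mean(valid_scores(v for row in matrix for v in row))


means = respondent_means(MATRIX)
best = max(range(len(means)), key=means.__getitem__)
print(f"best column index: {best}, mean: {means[best]:.2f}")
print(f"matrix mean: {matrix_mean(MATRIX):.2f}")
```

Column index 2 corresponds to Claude Sonnet 4.5, which matches the listed winner. The small gap from the reported figures would be consistent with the page averaging unrounded scores, though that is a guess.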