Evaluation EVAL-20260207-153808
edge cases
Jan 16, 2026 · EDGE-001

[This question would include a 10,000+ word document with a key detail ("The secret code is BLUE ELEPHANT") buried in paragraph 47 of 100] After reading the above document, what is the secret code mentioned? [Tests long-context retrieval accuracy]
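
A prompt like the one described above could be assembled roughly as follows. This is a minimal sketch, not the evaluation's actual generator: the filler vocabulary, the `build_haystack` helper, and the 110-words-per-paragraph figure are all hypothetical, chosen only so that 100 paragraphs exceed 10,000 words with the key sentence placed in paragraph 47.

```python
import random

NEEDLE = "The secret code is BLUE ELEPHANT."
QUESTION = "After reading the above document, what is the secret code mentioned?"

# Hypothetical filler vocabulary; any neutral text source would do.
FILLER_WORDS = [
    "system", "report", "process", "review", "window", "garden", "method",
    "signal", "margin", "station", "record", "channel", "pattern", "surface",
]


def build_haystack(n_paragraphs=100, needle_paragraph=47,
                   words_per_paragraph=110, seed=0):
    """Return a list of filler paragraphs with NEEDLE appended to one of them.

    needle_paragraph is 1-based, matching "paragraph 47 of 100".
    """
    rng = random.Random(seed)  # seeded so the document is reproducible
    paragraphs = []
    for i in range(1, n_paragraphs + 1):
        words = [rng.choice(FILLER_WORDS) for _ in range(words_per_paragraph)]
        text = " ".join(words).capitalize() + "."
        if i == needle_paragraph:
            text += " " + NEEDLE
        paragraphs.append(text)
    return paragraphs


paragraphs = build_haystack()
prompt = "\n\n".join(paragraphs) + "\n\n" + QUESTION
print(len(prompt.split()), "words;", "needle in paragraph 47:", NEEDLE in paragraphs[46])
```

Seeding the generator keeps the document identical across respondents, so differences in answers reflect the models rather than the haystack.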

Winner: DeepSeek V3.2 (DeepSeek)
Winner score: 9.35
Matrix avg: 8.18
10×10 Judgment Matrix · 100 judgments
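| Judge ↓ / Respondent → | Claude Opus 4.5 | Gemini 3 | Claude Sonnet 4.5 | GPT-5.2-Codex | GPT-OSS-120B | Gemini 3 | DeepSeek V3.2 | MiMo-V2-Flash | Grok 4.1 Fast | Grok 3 (Direct) |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | - | 8.6 | 8.4 | 9.0 | 3.4 | 9.0 | 8.6 | 9.0 | 8.4 | 8.6 |
| Gemini 3 | 10.0 | - | 10.0 | 10.0 | 0.0 | 10.0 | 10.0 | 10.0 | 0.0 | 10.0 |
| Claude Sonnet 4.5 | 9.6 | 9.0 | - | 9.0 | 9.6 | 9.0 | 9.0 | 9.0 | 9.6 | 9.0 |
| GPT-5.2-Codex | 2.6 | 8.8 | 7.1 | - | 2.3 | 9.6 | 8.8 | 8.8 | 2.1 | 9.3 |
| GPT-OSS-120B | 4.2 | 8.8 | 8.4 | 8.6 | - | 8.4 | 9.0 | 9.4 | 8.8 | 9.0 |
| Gemini 3 | 10.0 | 10.0 | 10.0 | 10.0 | 2.0 | - | 10.0 | 9.0 | 10.0 | 10.0 |
| DeepSeek V3.2 | 8.7 | 8.0 | 9.6 | 8.0 | 3.3 | 8.2 | - | 8.0 | 10.0 | 8.2 |
| MiMo-V2-Flash | 10.0 | 10.0 | 10.0 | 10.0 | 2.0 | 10.0 | 10.0 | - | 10.0 | 8.4 |
| Grok 4.1 Fast | 9.6 | 9.6 | 9.0 | 10.0 | 2.1 | 10.0 | 10.0 | 10.0 | - | 10.0 |
| Grok 3 (Direct) | 1.9 | 8.7 | 1.9 | 8.7 | 1.9 | 9.4 | 8.7 | 8.7 | 1.9 | - |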
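
The sketch below shows one way to turn the judgment matrix above into per-respondent scores. It rests on two assumptions the page does not state: that the one score missing from each judge's row is that judge's self-judgment (stored here as `None`), and that a respondent's score is the plain mean of the scores it receives from the other nine judges. The names `MODELS`, `MATRIX`, and `column_means` are illustrative only.

```python
# Respondent order as listed in the matrix header. The page labels two
# columns "Gemini 3"; they are kept as-is and distinguished by position.
MODELS = [
    "Claude Opus 4.5", "Gemini 3", "Claude Sonnet 4.5", "GPT-5.2-Codex",
    "GPT-OSS-120B", "Gemini 3", "DeepSeek V3.2", "MiMo-V2-Flash",
    "Grok 4.1 Fast", "Grok 3 (Direct)",
]

N = None  # assumed self-judgment cell, not shown on the page

# MATRIX[judge][respondent], rows and columns in MODELS order.
MATRIX = [
    [N, 8.6, 8.4, 9.0, 3.4, 9.0, 8.6, 9.0, 8.4, 8.6],
    [10.0, N, 10.0, 10.0, 0.0, 10.0, 10.0, 10.0, 0.0, 10.0],
    [9.6, 9.0, N, 9.0, 9.6, 9.0, 9.0, 9.0, 9.6, 9.0],
    [2.6, 8.8, 7.1, N, 2.3, 9.6, 8.8, 8.8, 2.1, 9.3],
    [4.2, 8.8, 8.4, 8.6, N, 8.4, 9.0, 9.4, 8.8, 9.0],
    [10.0, 10.0, 10.0, 10.0, 2.0, N, 10.0, 9.0, 10.0, 10.0],
    [8.7, 8.0, 9.6, 8.0, 3.3, 8.2, N, 8.0, 10.0, 8.2],
    [10.0, 10.0, 10.0, 10.0, 2.0, 10.0, 10.0, N, 10.0, 8.4],
    [9.6, 9.6, 9.0, 10.0, 2.1, 10.0, 10.0, 10.0, N, 10.0],
    [1.9, 8.7, 1.9, 8.7, 1.9, 9.4, 8.7, 8.7, 1.9, N],
]


def column_means(matrix):
    """Mean score received by each respondent, skipping missing cells."""
    means = []
    for col in range(len(matrix[0])):
        scores = [row[col] for row in matrix if row[col] is not None]
        means.append(sum(scores) / len(scores))
    return means


means = column_means(MATRIX)
winner_idx = max(range(len(means)), key=means.__getitem__)

for name, mean in zip(MODELS, means):
    print(f"{name:<18} {mean:.2f}")
print("Top respondent:", MODELS[winner_idx])
```

Under these assumptions the top column mean belongs to DeepSeek V3.2, at roughly 9.34. That is close to, but not exactly, the reported winner score of 9.35, which may reflect rounding in the displayed cells or a different aggregation on the site.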