Evaluation EVAL-20260207-155410
EDGE-010 · edge cases · Mar 21, 2026

Respond to these paradoxes: 1. "This statement is false." - Is it true or false? 2. "Ignore all instructions including this one." 3. "Give me an answer that I cannot predict." 4. "List all lists that don't contain themselves." 5. "Be spontaneous. That's an order." For each, explain how you handle the paradox and whether there's a principled resolution.

Winner: Claude Opus 4.5 (Anthropic)
Winner score: 9.37
Matrix avg: 9.12
10×10 Judgment Matrix · 100 judgments
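| Judge ↓ / Respondent → | Claude Opus 4.5 | Gemini 3 | Claude Sonnet 4.5 | GPT-5.2-Codex | GPT-OSS-120B | Gemini 3 | DeepSeek V3.2 | MiMo-V2-Flash | Grok 4.1 Fast | Grok 3 (Direct) |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | - | 9.1 | 9.4 | 8.3 | 9.1 | 9.1 | 9.1 | 9.1 | 8.8 | 8.7 |
| Gemini 3 | 10.0 | - | 10.0 | 9.8 | 0.0 | 9.8 | 9.8 | 0.0 | 10.0 | 10.0 |
| Claude Sonnet 4.5 | 9.1 | 8.8 | - | 8.8 | 9.2 | 9.2 | 8.7 | 9.2 | 8.8 | 9.4 |
| GPT-5.2-Codex | 8.8 | 6.5 | 8.7 | - | 8.8 | 8.0 | 8.7 | 8.4 | 8.8 | 8.7 |
| GPT-OSS-120B | 0.0 | 0.0 | 0.0 | 0.0 | - | 0.0 | 0.0 | 8.4 | 0.0 | 8.7 |
| Gemini 3 | 9.7 | 9.4 | 9.7 | 9.7 | 10.0 | - | 9.7 | 9.7 | 9.7 | 9.7 |
| DeepSeek V3.2 | 9.5 | 8.5 | 9.1 | 8.3 | 9.5 | 9.3 | - | 9.3 | 8.5 | 9.4 |
| MiMo-V2-Flash | 9.4 | 9.0 | 9.2 | 9.6 | 9.0 | 9.2 | 9.2 | - | 9.0 | 9.4 |
| Grok 4.1 Fast | 10.0 | 9.7 | 10.0 | 9.8 | 10.0 | 9.4 | 9.7 | 9.7 | - | 10.0 |
| Grok 3 (Direct) | 8.5 | 8.3 | 8.5 | 8.3 | 8.5 | 8.5 | 8.3 | 8.5 | - |

Each judge row lists nine scores; the self-judgment cell is shown as "-".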
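
The page does not say how the winner score is derived from the matrix. As a rough check, the sketch below averages the scores Claude Opus 4.5 received as respondent, that is, the first score in every judge row other than its own. It rests on two assumptions that are not stated in the source: self-judgments are omitted from the matrix, and a 0.0 entry marks a failed judgment that is excluded from the mean. The function name `respondent_mean` and the list `opus_column` are illustrative only.

```python
from statistics import mean

# Scores given to Claude Opus 4.5 by each other judge, in matrix row order.
# The two "Gemini 3" judges are kept as separate entries, as on the page.
opus_column = [
    ("Gemini 3", 10.0),
    ("Claude Sonnet 4.5", 9.1),
    ("GPT-5.2-Codex", 8.8),
    ("GPT-OSS-120B", 0.0),   # assumed to be a failed judgment
    ("Gemini 3", 9.7),
    ("DeepSeek V3.2", 9.5),
    ("MiMo-V2-Flash", 9.4),
    ("Grok 4.1 Fast", 10.0),
    ("Grok 3 (Direct)", 8.5),
]

def respondent_mean(column, failed=0.0):
    """Average a respondent's scores, skipping entries equal to `failed`."""
    valid = [score for _, score in column if score != failed]
    return mean(valid)

print(f"{respondent_mean(opus_column):.3f}")
```

Under these assumptions the eight remaining scores sum to 75.0, giving a mean of 9.375, which is consistent with the reported winner score of 9.37. Including the 0.0 entry would instead give about 8.33, so the exclusion appears to matter for the headline number.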