EVAL-20260207-154352
edge cases
Feb 07, 2026 · EDGE-004

Process these strings and describe any issues: 1. "Hello​World" (contains zero-width space) 2. "naïve" vs "naïve" (different Unicode normalizations) 3. "🇺🇸" (flag emoji - actually two code points) 4. "‮olleh" (contains right-to-left override) 5. "a]o[r6}s{4(u2)1*v+ni" (looks normal but check character codes) 6. "<script>alert('xss')</script>" For each: What might go wrong if this string is used as (a) a filename, (b) a database key, (c) displayed in HTML?
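The checks the prompt asks for can be sketched in Python. This is a minimal illustration, not part of the evaluation itself: the `samples` dictionary and the helper names `describe` and `hidden_format_chars` are hypothetical, and the two spellings of "naïve" are assumed to be the precomposed (NFC) and decomposed (NFD) forms. Escape sequences are used so the invisible characters remain visible in the source.

```python
import html
import unicodedata

# Hypothetical sample set mirroring the prompt; escapes keep hidden characters visible.
samples = {
    "zero-width space": "Hello\u200bWorld",
    "NFC form": "na\u00efve",          # assumed precomposed spelling
    "NFD form": "nai\u0308ve",         # assumed decomposed spelling
    "flag emoji": "\U0001F1FA\U0001F1F8",
    "right-to-left override": "\u202eolleh",
    "script tag": "<script>alert('xss')</script>",
}


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def hidden_format_chars(text):
    """Return characters in general category Cf (invisible format controls)."""
    return [ch for ch in text if unicodedata.category(ch) == "Cf"]


for label, text in samples.items():
    print(label, describe(text))

# Strings that render identically can still be different keys or filenames.
same_after_nfc = (
    unicodedata.normalize("NFC", samples["NFC form"])
    == unicodedata.normalize("NFC", samples["NFD form"])
)

# Escaping before HTML display turns markup characters into entities.
escaped = html.escape(samples["script tag"])
```

Normalizing to one form before using a string as a database key, rejecting or stripping `Cf` characters in filenames, and escaping on output are common mitigations; which one is appropriate depends on the application.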

Winner: MiMo-V2-Flash (Xiaomi)
Winner score: 9.44 (matrix avg: 8.69)
10×10 Judgment Matrix · 100 judgments
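| Judge ↓ / Respondent → | Claude Opus 4.5 | Gemini 3 | Claude Sonnet 4.5 | GPT-5.2-Codex | GPT-OSS-120B | Gemini 3 | DeepSeek V3.2 | MiMo-V2-Flash | Grok 4.1 Fast | Grok 3 (Direct) |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | - | 5.3 | 9.3 | 7.8 | 8.6 | 9.6 | 9.0 | 9.6 | 9.2 | 9.2 |
| Gemini 3 | 0.0 | - | 0.0 | 6.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Claude Sonnet 4.5 | 10.0 | 8.1 | - | 8.4 | 8.8 | 9.6 | 9.6 | 10.0 | 9.6 | 9.3 |
| GPT-5.2-Codex | 8.0 | 3.1 | 6.3 | - | 6.0 | 8.2 | 8.8 | 8.2 | 8.6 | 8.4 |
| GPT-OSS-120B | 7.8 | 4.5 | 0.0 | 0.0 | - | 0.0 | 8.8 | 0.0 | 8.8 | 9.2 |
| Gemini 3 | 9.8 | 7.1 | 9.6 | 8.1 | 9.0 | - | 9.8 | 9.8 | 9.8 | 9.2 |
| DeepSeek V3.2 | 10.0 | 8.4 | 9.3 | 6.0 | 8.8 | 10.0 | - | 9.3 | 9.3 | 9.8 |
| MiMo-V2-Flash | 9.6 | 7.8 | 8.8 | 7.8 | 9.0 | 9.0 | 9.0 | - | 8.6 | 9.0 |
| Grok 4.1 Fast | 10.0 | 7.2 | 9.8 | 7.8 | 8.8 | 10.0 | 9.6 | 10.0 | - | 9.6 |
| Grok 3 (Direct) | 9.3 | 7.3 | 8.8 | 8.0 | 8.8 | 8.8 | 8.8 | 9.2 | 8.8 | - |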
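One way to relate the matrix to the winner score is to average each respondent's column. The sketch below rests on two assumptions that the page does not state: each judge's own cell is empty (every row carries nine scores for ten respondents), and a score of 0.0 marks a judgment that produced no usable score and is left out of the average. Under those assumptions the MiMo-V2-Flash column averages to the reported 9.44; the sketch makes no attempt to reproduce the 8.69 matrix average. The names `rows` and `respondent_means` are illustrative.

```python
# Respondent order follows the matrix header. The two "Gemini 3" columns are
# kept apart by position because the source does not disambiguate them.
names = [
    "Claude Opus 4.5", "Gemini 3", "Claude Sonnet 4.5", "GPT-5.2-Codex",
    "GPT-OSS-120B", "Gemini 3", "DeepSeek V3.2", "MiMo-V2-Flash",
    "Grok 4.1 Fast", "Grok 3 (Direct)",
]

# rows[j][i] is the score judge j gave respondent i; None marks the judge's own cell.
rows = [
    [None, 5.3, 9.3, 7.8, 8.6, 9.6, 9.0, 9.6, 9.2, 9.2],
    [0.0, None, 0.0, 6.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [10.0, 8.1, None, 8.4, 8.8, 9.6, 9.6, 10.0, 9.6, 9.3],
    [8.0, 3.1, 6.3, None, 6.0, 8.2, 8.8, 8.2, 8.6, 8.4],
    [7.8, 4.5, 0.0, 0.0, None, 0.0, 8.8, 0.0, 8.8, 9.2],
    [9.8, 7.1, 9.6, 8.1, 9.0, None, 9.8, 9.8, 9.8, 9.2],
    [10.0, 8.4, 9.3, 6.0, 8.8, 10.0, None, 9.3, 9.3, 9.8],
    [9.6, 7.8, 8.8, 7.8, 9.0, 9.0, 9.0, None, 8.6, 9.0],
    [10.0, 7.2, 9.8, 7.8, 8.8, 10.0, 9.6, 10.0, None, 9.6],
    [9.3, 7.3, 8.8, 8.0, 8.8, 8.8, 8.8, 9.2, 8.8, None],
]


def respondent_means(matrix):
    """Average each column, skipping self-judgments and 0.0 scores (assumed invalid)."""
    means = []
    for col in range(len(matrix)):
        scores = [
            row[col] for row in matrix
            if row[col] is not None and row[col] > 0.0
        ]
        # A column with no usable judgments falls back to 0.0.
        means.append(sum(scores) / len(scores) if scores else 0.0)
    return means


means = respondent_means(rows)
best = max(range(len(means)), key=means.__getitem__)
print(f"{names[best]}: {means[best]:.2f}")
```

If 0.0 were instead treated as a real score, the column averages would drop sharply for every respondent, since the second judge row is almost entirely 0.0.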