Evaluation EVAL-20260207-155236
edge cases
Mar 14, 2026 · EDGE-009

Complete this task in a natural way: "Explique-moi comment function JavaScript 作为一个 developer, ¿qué consideras 最重要的 when writing надежный code?" This question mixes French, English, Japanese, Spanish, Chinese, and Russian. Respond in a way that demonstrates understanding of the full question.

Winner: GPT-OSS-120B (OpenAI)
Winner score: 9.39
Matrix avg: 9.04
10×10 Judgment Matrix · 100 judgments
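| Judge ↓ / Respondent → | Claude Opus 4.5 | Gemini 3 | Claude Sonnet 4.5 | GPT-5.2-Codex | GPT-OSS-120B | Gemini 3 | DeepSeek V3.2 | MiMo-V2-Flash | Grok 4.1 Fast | Grok 3 (Direct) |
| Claude Opus 4.5 | - | 8.3 | 7.6 | 8.3 | 9.0 | 9.0 | 8.8 | 9.0 | 8.8 | 8.8 |
| Gemini 3 | 0.0 | - | 8.8 | 9.4 | 0.0 | 9.8 | 0.0 | 9.8 | 9.4 | 9.4 |
| Claude Sonnet 4.5 | 8.6 | 8.3 | - | 8.3 | 9.0 | 9.0 | 9.0 | 9.8 | 8.8 | 8.8 |
| GPT-5.2-Codex | 8.8 | 8.4 | 8.3 | - | 8.8 | 8.8 | 8.8 | 8.8 | 8.8 | 8.8 |
| GPT-OSS-120B | 8.7 | 8.6 | 8.3 | 0.0 | - | 0.0 | 8.3 | 0.0 | 0.0 | 0.0 |
| Gemini 3 | 9.6 | 9.8 | 9.1 | 9.6 | 9.8 | - | 9.8 | 9.6 | 9.8 | 9.8 |
| DeepSeek V3.2 | 8.6 | 9.0 | 8.3 | 8.6 | 9.6 | 9.0 | - | 9.2 | 9.8 | 9.6 |
| MiMo-V2-Flash | 9.6 | 9.8 | 8.6 | 8.3 | 9.6 | 9.6 | 9.2 | - | 8.8 | 9.0 |
| Grok 4.1 Fast | 9.6 | 9.2 | 8.8 | 8.7 | 9.8 | 9.6 | 9.6 | 9.8 | - | 9.8 |
| Grok 3 (Direct) | 8.3 | 8.6 | 8.3 | 8.3 | 9.6 | 8.6 | 9.0 | 9.0 | 9.0 | - |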