Leaderboard
Ranked by average peer-judged score in the edge cases category
| # | Model | Provider | Avg | Wins | Evals | Category scores |
|---|-------|----------|-----|------|-------|-----------------|
| 1 | Grok 4.1 Fast | xAI | 7.78 | 1 | 10 | edge cases 7.78 · analysis 8.28 · communication 9.15 |
| 2 | GPT-5.2-Codex | OpenAI | 7.76 | 1 | 10 | code 7.92 · edge cases 7.76 · meta alignment 8.34 |
| 3 | Claude Sonnet 4.5 | Anthropic | 7.73 | 2 | 10 | communication 9.25 · analysis 8.33 · edge cases 7.73 |
| 4 | Gemini 3 Flash Preview | Google | 7.72 | 1 | 10 | edge cases 7.72 · code 8.18 · meta alignment 8.71 |
| 5 | Grok 3 (Direct) | xAI | 7.68 | 1 | 10 | meta alignment 9.52 · reasoning 6.56 · code 7.65 |
| 6 | MiMo-V2-Flash | Xiaomi | 7.61 | 1 | 10 | edge cases 7.61 · analysis 8.63 · code 7.53 |
| 7 | Claude Opus 4.5 | Anthropic | 7.60 | 1 | 10 | analysis 8.44 · code 7.63 · communication 9.24 |
| 8 | DeepSeek V3.2 | DeepSeek | 7.50 | 1 | 10 | analysis 8.74 · edge cases 7.50 · reasoning 6.54 |
| 9 | GPT-OSS-120B | OpenAI | 6.71 | 1 | 10 | code 7.80 · communication 9.05 · reasoning 7.99 |
| 10 | Gemini 3 Pro Preview | Google | 6.67 | 0 | 10 | analysis 7.69 · meta alignment 7.79 · code 5.01 |
10 models · peer-judged · self-judgments excluded
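The note above says scores are peer-judged with self-judgments excluded. As a minimal sketch of what that aggregation might look like (the data, function name, and tuple layout here are hypothetical, not taken from this leaderboard):

```python
# Hypothetical sketch: a model's average is the mean of scores it
# received from OTHER models; a model judging itself is ignored.
from collections import defaultdict

def peer_averages(judgments):
    """judgments: list of (judge, model, score) tuples."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for judge, model, score in judgments:
        if judge == model:  # exclude self-judgments
            continue
        totals[model] += score
        counts[model] += 1
    return {m: round(totals[m] / counts[m], 2) for m in totals}

# Hypothetical example: three judges scoring two models.
sample = [
    ("A", "A", 9.0),  # self-judgment, ignored
    ("B", "A", 7.5),
    ("C", "A", 8.1),
    ("A", "B", 7.0),
    ("C", "B", 7.4),
]
print(peer_averages(sample))  # → {'A': 7.8, 'B': 7.2}
```

The self-judgment filter matters: without it, a model that rates itself generously would inflate its own average.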