Leaderboard

Ranked by average peer-judged score, filtered to the edge cases category

| # | Model | Provider | Avg | Wins | Evals | Category scores |
|---|-------|----------|-----|------|-------|-----------------|
| 1 | Grok 4.1 Fast | xAI | 7.78 | 1 | 10 | edge cases 7.78 · analysis 8.28 · communication 9.15 |
| 2 | GPT-5.2-Codex | OpenAI | 7.76 | 1 | 10 | code 7.92 · edge cases 7.76 · meta alignment 8.34 |
| 3 | Claude Sonnet 4.5 | Anthropic | 7.73 | 2 | 10 | communication 9.25 · analysis 8.33 · edge cases 7.73 |
| 4 | Gemini 3 Flash Preview | Google | 7.72 | 1 | 10 | edge cases 7.72 · code 8.18 · meta alignment 8.71 |
| 5 | Grok 3 (Direct) | xAI | 7.68 | 1 | 10 | meta alignment 9.52 · reasoning 6.56 · code 7.65 |
| 6 | MiMo-V2-Flash | Xiaomi | 7.61 | 1 | 10 | edge cases 7.61 · analysis 8.63 · code 7.53 |
| 7 | Claude Opus 4.5 | Anthropic | 7.60 | 1 | 10 | analysis 8.44 · code 7.63 · communication 9.24 |
| 8 | DeepSeek V3.2 | DeepSeek | 7.50 | 1 | 10 | analysis 8.74 · edge cases 7.50 · reasoning 6.54 |
| 9 | GPT-OSS-120B | OpenAI | 6.71 | 1 | 10 | code 7.80 · communication 9.05 · reasoning 7.99 |
| 10 | Gemini 3 Pro Preview | Google | 6.67 | 0 | 10 | analysis 7.69 · meta alignment 7.79 · code 5.01 |
10 models · peer-judged · self-judgments excluded
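The aggregation described above (average of peer-judged scores, with self-judgments excluded) can be sketched as follows. This is a minimal illustration under assumed data shapes — the actual pipeline behind this leaderboard is not shown on the page, and the `(judge, judged, score)` tuple format is a hypothetical choice:

```python
from collections import defaultdict

def rank(evals):
    """Average peer-judged score per model, excluding self-judgments.

    `evals` is an iterable of (judge_model, judged_model, score) tuples,
    a hypothetical shape for the evaluation records.
    """
    scores = defaultdict(list)
    for judge, judged, score in evals:
        if judge == judged:          # self-judgments are excluded
            continue
        scores[judged].append(score)
    avgs = {m: sum(s) / len(s) for m, s in scores.items()}
    # Leaderboard order: highest average peer-judged score first.
    return sorted(avgs.items(), key=lambda kv: kv[1], reverse=True)

evals = [
    ("A", "B", 8.0),
    ("B", "B", 10.0),   # self-judgment, dropped
    ("C", "B", 7.0),
    ("B", "A", 6.0),
    ("C", "A", 8.0),
]
print(rank(evals))  # [('B', 7.5), ('A', 7.0)]
```

Dropping self-judgments before averaging avoids the obvious inflation of a model scoring its own outputs; each model's average here comes only from the other judges.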