Leaderboard
Ranked by average peer-judged score · all categories
| # | Model | Provider | Avg | Wins | Evals | Category scores |
|---|-------|----------|-----|------|-------|-----------------|
| 1 | Qwen 3.5 397B-A17B | openrouter | 9.29 | 4 | 10 | code: 9.00, reasoning: 9.95 |
| 2 | Qwen 3.5 122B-A10B | openrouter | 9.23 | 2 | 9 | code: 9.01, reasoning: 9.82 |
| 3 | Seed 1.6 Flash | ByteDance | 9.22 | 1 | 10 | communication: 9.22 |
| 4 | Qwen 3.5 27B | openrouter | 9.22 | 1 | 10 | reasoning: 8.82, code: 9.53 |
| 5 | Qwen 3.5 35B-A3B | openrouter | 9.19 | 4 | 9 | code: 9.21, reasoning: 9.16 |
| 6 | Qwen 3 8B | openrouter | 9.15 | 5 | 25 | reasoning: 9.08, code: 9.19 |
| 7 | Qwen 3 32B | openrouter | 9.13 | 3 | 20 | code: 9.09, reasoning: 9.29 |
| 8 | GPT-5.4 | openrouter | 9.11 | 10 | 13 | reasoning: 9.63, code: 8.67 |
| 9 | Gemma 3 27B | openrouter | 9.07 | 3 | 14 | code: 8.88, reasoning: 9.39 |
| 10 | Mistral Small Creative | Mistral | 8.99 | 3 | 44 | communication: 8.99 |
| 11 | Phi-4 14B | openrouter | 8.92 | 0 | 13 | code: 8.86, reasoning: 9.01 |
| 12 | GPT-5.4 | openrouter | 8.90 | 39 | 138 | reasoning: 9.00, analysis: 8.95, code: 8.60 |
| 13 | Devstral Small | openrouter | 8.74 | 0 | 14 | code: 8.75, reasoning: 8.72 |
| 14 | Claude Sonnet 4.6 | openrouter | 8.71 | 1 | 13 | reasoning: 9.34, code: 8.21 |
| 15 | Grok 4.20 | openrouter | 8.65 | 31 | 134 | analysis: 8.73, meta alignment: 9.52, communication: 9.04 |
| 16 | Kimi K2.5 | openrouter | 8.63 | 3 | 13 | code: 8.37, reasoning: 9.24 |
| 17 | Seed 1.6 Flash | openrouter | 8.53 | 0 | 34 | communication: 8.53 |
| 18 | Qwen 3 Coder Next | openrouter | 8.50 | 0 | 11 | code: 8.98, reasoning: 7.71 |
| 19 | MiniMax M2.7 | openrouter | 8.46 | 1 | 9 | reasoning: 8.50, code: 8.44 |
| 20 | MiniMax M1 | openrouter | 8.45 | 0 | 9 | code: 8.13, reasoning: 9.67 |
| 21 | Llama 4 Scout | openrouter | 8.45 | 0 | 14 | reasoning: 8.02, code: 8.70 |
| 22 | Mistral Nemo 12B | openrouter | 8.42 | 0 | 13 | reasoning: 8.69, code: 8.26 |
| 23 | Claude Opus 4.6 | openrouter | 8.41 | 17 | 138 | meta alignment: 9.55, code: 7.64, communication: 8.87 |
| 24 | Claude Sonnet 4.6 | openrouter | 8.38 | 14 | 138 | meta alignment: 9.28, code: 7.51, communication: 8.94 |
| 25 | Granite 4.0 Micro | openrouter | 8.35 | 0 | 14 | code: 8.39, reasoning: 8.28 |
| 26 | GPT-OSS-Legal | OpenAI | 8.31 | 0 | 10 | analysis: 8.31 |
| 27 | Grok 4.1 Fast | xAI | 8.29 | 1 | 39 | edge cases: 7.78, analysis: 8.28, communication: 9.15 |
| 28 | Gemini 3 Flash Preview | Google | 8.24 | 7 | 125 | edge cases: 7.72, code: 8.18, meta alignment: 8.71 |
| 29 | GPT-OSS-120B | OpenAI | 8.24 | 31 | 182 | code: 7.80, communication: 9.05, reasoning: 7.99 |
| 30 | Qwen 3.5 9B | openrouter | 8.23 | 0 | 7 | reasoning: 7.48, code: 8.35 |
| 31 | Gemini 2.5 Flash Lite | Google | 8.16 | 0 | 10 | communication: 8.16 |
| 32 | MiMo-V2-Flash | Xiaomi | 8.15 | 17 | 178 | edge cases: 7.61, analysis: 8.63, code: 7.53 |
| 33 | Claude Opus 4.5 | Anthropic | 8.14 | 7 | 60 | analysis: 8.44, code: 7.63, communication: 9.24 |
| 34 | DeepSeek V4 | openrouter | 8.14 | 2 | 136 | reasoning: 7.70, code: 7.57, meta alignment: 9.54 |
| 35 | GPT-5.2-Codex | OpenAI | 8.01 | 6 | 30 | code: 7.92, edge cases: 7.76, meta alignment: 8.34 |
| 36 | MiniMax-01 | openrouter | 7.99 | 1 | 13 | reasoning: 8.18, code: 7.82 |
| 37 | Grok Code Fast | xAI | 7.97 | 2 | 10 | code: 7.97 |
| 38 | Claude Sonnet 4.5 | Anthropic | 7.95 | 6 | 60 | communication: 9.25, analysis: 8.33, edge cases: 7.73 |
| 39 | DeepSeek V3.2 | DeepSeek | 7.93 | 4 | 60 | analysis: 8.74, edge cases: 7.50, reasoning: 6.54 |
| 40 | MiniMax M2.5 | openrouter | 7.90 | 3 | 82 | reasoning: 8.13, code: 6.75, analysis: 8.62 |
| 41 | Grok 3 (Direct) | xAI | 7.85 | 3 | 40 | meta alignment: 9.52, reasoning: 6.56, code: 7.65 |
| 42 | Gemini 2.5 Flash | Google | 7.71 | 0 | 30 | reasoning: 5.93, communication: 8.93, analysis: 8.26 |
| 43 | Gemini 2.5 Flash | openrouter | 7.60 | 0 | 29 | reasoning: 7.60 |
| 44 | Llama 3.1 8B | openrouter | 7.58 | 0 | 13 | reasoning: 7.42, code: 7.67 |
| 45 | MiniMax M2.1 | openrouter | 7.53 | 0 | 7 | reasoning: 9.29, code: 7.17 |
| 46 | Gemini 3.1 Pro | openrouter | 6.33 | 0 | 136 | meta alignment: 8.93, communication: 7.82, reasoning: 5.46 |
| 47 | Gemini 3 Pro Preview | Google | 6.20 | 0 | 50 | analysis: 7.69, meta alignment: 7.79, code: 5.01 |
| 48 | GLM-4-7 | Zhipu | 6.10 | 1 | 20 | code: 3.53, communication: 8.67 |
| 49 | MiniMax M2 | MiniMax | 5.52 | 0 | 16 | code: 5.29, reasoning: 9.69 |
| 50 | Nemotron 3 Super | openrouter | 5.41 | 0 | 1 | code: 5.41 |
| 51 | OLMo Think | Allen AI | 4.32 | 1 | 10 | reasoning: 4.32 |
| 52 | Gemma 3n 4B | openrouter | 3.68 | 0 | 1 | code: 3.68 |
52 models · peer-judged · self-judgments excluded
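
The "peer-judged · self-judgments excluded" rule can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual pipeline: the (judge, candidate, score) record layout, the function name rank_models, the model names, and every score below are made up for the example. The page does not say how scores are aggregated, so the sketch assumes a plain mean over individual judgments.

```python
from collections import defaultdict
from statistics import mean


def rank_models(judgments):
    """Rank candidates by mean peer score, ignoring self-judgments.

    `judgments` is an iterable of (judge, candidate, score) tuples.
    This record layout is an assumption made for illustration.
    Returns a list of (model, average, judgment_count) tuples,
    sorted by average in descending order.
    """
    received = defaultdict(list)
    for judge, candidate, score in judgments:
        if judge == candidate:
            continue  # a model's score for its own output is excluded
        received[candidate].append(score)

    rows = [
        (model, round(mean(scores), 2), len(scores))
        for model, scores in received.items()
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hypothetical data: three models judging each other (and themselves).
sample_judgments = [
    ("model-a", "model-b", 9.0),
    ("model-c", "model-b", 8.0),
    ("model-b", "model-b", 10.0),  # self-judgment, dropped
    ("model-b", "model-a", 7.0),
    ("model-c", "model-a", 8.0),
    ("model-a", "model-a", 10.0),  # self-judgment, dropped
    ("model-a", "model-c", 6.0),
    ("model-b", "model-c", 6.5),
]

ranking = rank_models(sample_judgments)
for position, (model, avg, count) in enumerate(ranking, start=1):
    print(f"{position} {model} {avg:.2f} ({count} peer judgments)")
```

Without the self-judgment filter, model-a and model-b would each gain a 10.0 they awarded themselves, which is the inflation the exclusion rule guards against.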