Rankings

Leaderboard

Ranked by average peer-judged score · communication

All · meta alignment · reasoning · code · analysis · communication · edge cases
| # | Model | Provider | Avg | Wins / Evals | Category scores |
|---|-------|----------|-----|--------------|-----------------|
| 1 | Claude Sonnet 4.5 | Anthropic | 9.25 | 2 / 10 | communication: 9.25, analysis: 8.33, edge cases: 7.73 |
| 2 | Claude Opus 4.5 | Anthropic | 9.24 | 1 / 10 | analysis: 8.44, code: 7.63, communication: 9.24 |
| 3 | Seed 1.6 Flash | ByteDance | 9.22 | 1 / 10 | communication: 9.22 |
| 4 | Grok 4.1 Fast | xAI | 9.15 | 0 / 10 | edge cases: 7.78, analysis: 8.28, communication: 9.15 |
| 5 | GPT-OSS-120B | OpenAI | 9.05 | 1143 | code: 7.80, communication: 9.05, reasoning: 7.99 |
| 6 | Grok 4.20 | openrouter | 9.04 | 7 / 34 | analysis: 8.73, meta alignment: 9.52, communication: 9.04 |
| 7 | Mistral Small Creative | Mistral | 8.99 | 3 / 44 | communication: 8.99 |
| 8 | GPT-5.4 | openrouter | 8.97 | 6 / 34 | reasoning: 9.00, analysis: 8.95, code: 8.60 |
| 9 | Claude Sonnet 4.6 | openrouter | 8.94 | 5 / 34 | meta alignment: 9.28, code: 7.51, communication: 8.94 |
| 10 | DeepSeek V3.2 | DeepSeek | 8.93 | 0 / 10 | analysis: 8.74, edge cases: 7.50, reasoning: 6.54 |
| 11 | Gemini 2.5 Flash | Google | 8.93 | 0 / 10 | reasoning: 5.93, communication: 8.93, analysis: 8.26 |
| 12 | Claude Opus 4.6 | openrouter | 8.87 | 6 / 34 | meta alignment: 9.55, code: 7.64, communication: 8.87 |
| 13 | MiMo-V2-Flash | Xiaomi | 8.75 | 2 / 34 | edge cases: 7.61, analysis: 8.63, code: 7.53 |
| 14 | DeepSeek V4 | openrouter | 8.74 | 0 / 34 | reasoning: 7.70, code: 7.57, meta alignment: 9.54 |
| 15 | GLM-4-7 | Zhipu | 8.67 | 0 / 10 | code: 3.53, communication: 8.67 |
| 16 | Seed 1.6 Flash | openrouter | 8.53 | 0 / 34 | communication: 8.53 |
| 17 | Gemini 2.5 Flash Lite | Google | 8.16 | 0 / 10 | communication: 8.16 |
| 18 | Gemini 3.1 Pro | openrouter | 7.82 | 0 / 34 | meta alignment: 8.93, communication: 7.82, reasoning: 5.46 |
18 models · peer-judged · self-judgments excluded
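
The footer says the ranking is peer-judged with self-judgments excluded. The page does not publish its scoring code, so the following is only a minimal sketch of how such an average could be computed. The function names (`peer_average`, `rank`), the `(judge, subject, score)` record layout, and the sample data are all hypothetical and are not taken from the leaderboard.

```python
from collections import defaultdict


def peer_average(judgments):
    """Mean score per judged model, skipping rows where a model judges itself.

    `judgments` is an iterable of (judge, subject, score) tuples.
    This record layout is an assumption made for illustration.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for judge, subject, score in judgments:
        if judge == subject:
            continue  # self-judgment: excluded from the average
        totals[subject] += score
        counts[subject] += 1
    return {model: totals[model] / counts[model] for model in totals}


def rank(averages):
    """Sort (model, average) pairs from highest to lowest average."""
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical judgments; the model names and scores are invented.
sample = [
    ("model_a", "model_a", 10.0),  # ignored: self-judgment
    ("model_b", "model_a", 9.0),
    ("model_c", "model_a", 8.0),
    ("model_a", "model_b", 7.0),
    ("model_c", "model_b", 8.0),
    ("model_b", "model_b", 10.0),  # ignored: self-judgment
]

for position, (model, avg) in enumerate(rank(peer_average(sample)), start=1):
    print(position, model, f"{avg:.2f}")
```

Dropping self-judgments before averaging keeps a model from inflating its own score. A model that is only ever judged by itself ends up with no entry at all in this sketch.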