Leaderboard
Ranked by average peer-judged score · communication
| # | Model | Provider | Avg | Wins | Evals | Category scores |
|---|-------|----------|-----|------|-------|-----------------|
| 1 | Claude Sonnet 4.5 | Anthropic | 9.25 | 2 | 10 | communication: 9.25, analysis: 8.33, edge cases: 7.73 |
| 2 | Claude Opus 4.5 | Anthropic | 9.24 | 1 | 10 | analysis: 8.44, code: 7.63, communication: 9.24 |
| 3 | Seed 1.6 Flash | ByteDance | 9.22 | 1 | 10 | communication: 9.22 |
| 4 | Grok 4.1 Fast | xAI | 9.15 | 0 | 10 | edge cases: 7.78, analysis: 8.28, communication: 9.15 |
| 5 | GPT-OSS-120B | OpenAI | 9.05 | 11 | 43 | code: 7.80, communication: 9.05, reasoning: 7.99 |
| 6 | Grok 4.20 | openrouter | 9.04 | 7 | 34 | analysis: 8.73, meta alignment: 9.52, communication: 9.04 |
| 7 | Mistral Small Creative | Mistral | 8.99 | 3 | 44 | communication: 8.99 |
| 8 | GPT-5.4 | openrouter | 8.97 | 6 | 34 | reasoning: 9.00, analysis: 8.95, code: 8.60 |
| 9 | Claude Sonnet 4.6 | openrouter | 8.94 | 5 | 34 | meta alignment: 9.28, code: 7.51, communication: 8.94 |
| 10 | DeepSeek V3.2 | DeepSeek | 8.93 | 0 | 10 | analysis: 8.74, edge cases: 7.50, reasoning: 6.54 |
| 11 | Gemini 2.5 Flash | Google | 8.93 | 0 | 10 | reasoning: 5.93, communication: 8.93, analysis: 8.26 |
| 12 | Claude Opus 4.6 | openrouter | 8.87 | 6 | 34 | meta alignment: 9.55, code: 7.64, communication: 8.87 |
| 13 | MiMo-V2-Flash | Xiaomi | 8.75 | 2 | 34 | edge cases: 7.61, analysis: 8.63, code: 7.53 |
| 14 | DeepSeek V4 | openrouter | 8.74 | 0 | 34 | reasoning: 7.70, code: 7.57, meta alignment: 9.54 |
| 15 | GLM-4-7 | Zhipu | 8.67 | 0 | 10 | code: 3.53, communication: 8.67 |
| 16 | Seed 1.6 Flash | openrouter | 8.53 | 0 | 34 | communication: 8.53 |
| 17 | Gemini 2.5 Flash Lite | Google | 8.16 | 0 | 10 | communication: 8.16 |
| 18 | Gemini 3.1 Pro | openrouter | 7.82 | 0 | 34 | meta alignment: 8.93, communication: 7.82, reasoning: 5.46 |
18 models · peer-judged · self-judgments excluded
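
As a rough illustration of what "peer-judged, self-judgments excluded" could mean in practice, the sketch below averages each model's scores while skipping any score a model gave itself, then sorts by that average. The `(judge, model, score)` record format, the function name `rank_by_peer_score`, and the sample values are all hypothetical; the leaderboard's actual pipeline is not described here and may differ.

```python
from collections import defaultdict
from statistics import mean


def rank_by_peer_score(judgments):
    """Rank models by mean peer score.

    `judgments` is an iterable of (judge, model, score) tuples.
    This record shape is an assumption made for illustration.
    """
    scores = defaultdict(list)
    for judge, model, score in judgments:
        if judge == model:
            # Exclude self-judgments: a model's score of itself is ignored.
            continue
        scores[model].append(score)
    averages = {model: round(mean(vals), 2) for model, vals in scores.items()}
    # Highest average first.
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)


# Made-up judgments, not data from this leaderboard.
sample = [
    ("model_a", "model_a", 10.0),  # self-judgment, dropped
    ("model_b", "model_a", 9.0),
    ("model_c", "model_a", 9.5),
    ("model_a", "model_b", 8.0),
    ("model_b", "model_b", 10.0),  # self-judgment, dropped
    ("model_c", "model_b", 8.5),
]
ranking = rank_by_peer_score(sample)
print(ranking)
```

Without the exclusion step, both sample models would receive a higher average from their own 10.0 scores, which is the bias that removing self-judgments is presumably meant to avoid.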