communication
Jan 22, 2026
COMM-002
Write three versions of this message for different audiences:

SITUATION: Your company's API had a 47-minute outage affecting payment processing. Root cause was a misconfigured deployment that bypassed health checks. 2,847 transactions failed. The issue has been resolved.

Write:
1. Internal Slack message to engineering team
2. Email to enterprise customers (B2B, technical audience)
3. Public status page update

Each should have appropriate detail level, tone, and next steps.
Winner: GPT-OSS-120B (OpenAI)
Winner score: 9.73 (matrix avg: 9.56)
10×10 Judgment Matrix · 100 judgments
| Judge ↓ / Respondent → | Seed 1.6 Flash | Gemini 2.5 | GLM-4-7 | GPT-OSS-120B | Gemini 2.5 Flash | Grok 4.1 Fast | DeepSeek V3.2 | Claude Sonnet 4.5 | Claude Opus 4.5 | Mistral Small |
|---|---|---|---|---|---|---|---|---|---|---|
| Seed 1.6 Flash | — | 9.3 | 9.0 | 9.0 | 9.2 | 9.0 | 0.0 | 9.4 | 8.8 | 9.2 |
| Gemini 2.5 | 9.8 | — | 9.8 | 10.0 | 9.8 | 9.8 | 0.0 | 9.8 | 9.8 | 10.0 |
| GLM-4-7 | 9.2 | 0.0 | — | 9.6 | 0.0 | 9.3 | 9.0 | 9.8 | 9.3 | 9.8 |
| GPT-OSS-120B | 8.8 | 8.8 | 8.6 | — | 8.8 | 8.8 | 9.3 | 9.1 | 9.3 | 8.8 |
| Gemini 2.5 Flash | 9.8 | 9.8 | 9.8 | 9.8 | — | 9.8 | 9.8 | 9.8 | 10.0 | 10.0 |
| Grok 4.1 Fast | 10.0 | 9.8 | 9.8 | 9.8 | 9.8 | — | 9.8 | 9.8 | 10.0 | 9.8 |
| DeepSeek V3.2 | 9.3 | 9.6 | 9.6 | 9.8 | 9.2 | 9.8 | — | 9.6 | 9.8 | 10.0 |
| Claude Sonnet 4.5 | 9.3 | 9.0 | 9.6 | 9.8 | 9.8 | 9.8 | 9.8 | — | 9.8 | 9.6 |
| Claude Opus 4.5 | 9.3 | 9.6 | 9.0 | 9.8 | 9.3 | 9.8 | 9.8 | 9.8 | — | 9.6 |
| Mistral Small | 9.8 | 9.8 | 9.8 | 10.0 | 9.8 | 9.8 | 9.8 | 9.8 | 9.8 | — |
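The page does not state how the winner score is derived. One reading that is consistent with the table is that it is the mean of the scores in the winner's column, with the self-judgment cell ("—") left out. The sketch below checks that reading for GPT-OSS-120B only; the variable and function names are illustrative, and the aggregation rule itself is an assumption rather than the benchmark's documented method.

```python
# Scores given to GPT-OSS-120B by each judge, copied from its column above.
# None marks the self-judgment cell shown as "—" in the matrix.
gpt_oss_column = {
    "Seed 1.6 Flash": 9.0,
    "Gemini 2.5": 10.0,
    "GLM-4-7": 9.6,
    "GPT-OSS-120B": None,
    "Gemini 2.5 Flash": 9.8,
    "Grok 4.1 Fast": 9.8,
    "DeepSeek V3.2": 9.8,
    "Claude Sonnet 4.5": 9.8,
    "Claude Opus 4.5": 9.8,
    "Mistral Small": 10.0,
}


def column_mean(column):
    """Average the scores in one respondent column, skipping the self cell.

    Assumption: a plain arithmetic mean over the nine other judges.
    """
    scores = [score for score in column.values() if score is not None]
    return sum(scores) / len(scores)


winner_score = round(column_mean(gpt_oss_column), 2)
print(winner_score)  # 9.73, matching the winner score shown on the page
```

This column contains no 0.0 entries, so the check says nothing about how 0.0 cells elsewhere in the matrix are treated when scores are aggregated.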