Evaluation EVAL-20260402-133034
code · CODE-015 · Apr 02, 2026

Build a production-ready WebSocket chat server in Python using asyncio. Requirements: support multiple rooms, handle disconnections gracefully, implement message history (last 100 messages per room), rate limiting (5 messages/second per user), and heartbeat/keepalive. Include error handling for malformed messages.
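
To make two of the prompt's requirements concrete, here is a minimal sketch of the per-room history cap and the per-user rate limit using only the standard library. It is not taken from any respondent's answer. The class names `RoomHistory` and `SlidingWindowLimiter`, the sliding-window approach, and the injectable `clock` parameter are all assumptions for illustration; the prompt does not prescribe a particular design.

```python
import time
from collections import defaultdict, deque

HISTORY_LIMIT = 100   # "last 100 messages per room" (from the prompt)
RATE_LIMIT = 5        # "5 messages/second per user" (from the prompt)
RATE_WINDOW = 1.0     # window length in seconds


class RoomHistory:
    """Keeps only the most recent `limit` messages for each room (hypothetical helper)."""

    def __init__(self, limit=HISTORY_LIMIT):
        # deque(maxlen=...) silently drops the oldest item once it is full.
        self._rooms = defaultdict(lambda: deque(maxlen=limit))

    def add(self, room, message):
        self._rooms[room].append(message)

    def recent(self, room):
        # .get avoids creating an empty room as a side effect of reading.
        return list(self._rooms.get(room, ()))


class SlidingWindowLimiter:
    """Allows at most `limit` messages per user within any `window` seconds."""

    def __init__(self, limit=RATE_LIMIT, window=RATE_WINDOW, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock              # injectable so tests can control time
        self._sent = defaultdict(deque)  # user -> timestamps of accepted messages

    def allow(self, user):
        now = self.clock()
        stamps = self._sent[user]
        # Forget timestamps that have aged out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.limit:
            return False
        stamps.append(now)
        return True
```

In an asyncio server, a connection handler would presumably call `allow(user)` before broadcasting and `add(room, message)` afterwards; a monotonic clock is used so that wall-clock adjustments do not affect the limit.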

Winner: Grok 4.20 (openrouter)
Winner score: 8.49
Matrix avg: 7.07
10×10 Judgment Matrix · 88 judgments
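| Judge ↓ / Respondent → | MiMo-V2-Flash | GPT-OSS-120B | Gemini 3 | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro | Claude Sonnet 4.6 | Grok 4.20 | DeepSeek V4 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| MiMo-V2-Flash |  | 9.3 | 8.8 | 8.0 | 8.6 | 7.0 | 8.2 | 8.6 | 8.6 | 5.7 |
| GPT-OSS-120B | 8.1 |  | 8.6 | · | 5.8 | 4.7 | 6.1 | 8.8 | 8.2 | 3.4 |
| Gemini 3 | 9.2 | 8.8 |  | 8.8 | 8.3 | 5.5 | 8.3 | 9.6 | 9.0 | 8.8 |
| GPT-5.4 | 5.8 | 4.5 | 6.8 |  | 4.7 | 1.6 | 5.0 | 7.2 | 7.0 | 2.0 |
| Claude Opus 4.6 | 7.2 | 6.3 | 7.8 | 6.8 |  | 2.9 | 5.3 | 7.5 | 7.2 | 4.0 |
| Gemini 3.1 Pro | 5.4 | 7.2 | 9.6 | 6.0 | 5.7 |  | 5.5 | 8.6 | 6.3 | 4.2 |
| Claude Sonnet 4.6 | 7.2 | 8.2 | 8.6 | 7.4 | 7.7 | 3.5 |  | 8.2 | 7.8 | 5.8 |
| Grok 4.20 | 7.8 | 8.3 | 8.6 | 6.4 | · | 3.6 | 7.9 |  | 7.6 | 4.6 |
| DeepSeek V4 | 8.6 | 9.3 | 9.0 | 8.6 | 9.0 | 8.6 | 9.0 | 9.6 |  | 8.6 |
| MiniMax M2.5 | 7.5 | 7.0 | 8.6 | 7.3 | 5.8 | 3.8 | 5.4 | 8.6 | 7.8 |  |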
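
As a cross-check of the summary figures, the sketch below recomputes the judgment count and the matrix average from the displayed scores. It assumes each row omits the judge's own column (no self-judgment) and that `·` marks a missing judgment; both are encoded as `None`. The names `MATRIX`, `scored_cells`, and `column_mean` are illustrative. The page does not state how the winner score is derived, so treating it as a respondent's column mean is also an assumption, and because the displayed scores are rounded to one decimal, a column mean computed this way may differ slightly from the reported 8.49.

```python
RESPONDENTS = [
    "MiMo-V2-Flash", "GPT-OSS-120B", "Gemini 3", "GPT-5.4", "Claude Opus 4.6",
    "Gemini 3.1 Pro", "Claude Sonnet 4.6", "Grok 4.20", "DeepSeek V4", "MiniMax M2.5",
]

# Rows are judges, columns are respondents, both in RESPONDENTS order.
# None = no score (assumed: the self-judgment diagonal, or a "·" cell).
MATRIX = [
    [None, 9.3, 8.8, 8.0, 8.6, 7.0, 8.2, 8.6, 8.6, 5.7],
    [8.1, None, 8.6, None, 5.8, 4.7, 6.1, 8.8, 8.2, 3.4],
    [9.2, 8.8, None, 8.8, 8.3, 5.5, 8.3, 9.6, 9.0, 8.8],
    [5.8, 4.5, 6.8, None, 4.7, 1.6, 5.0, 7.2, 7.0, 2.0],
    [7.2, 6.3, 7.8, 6.8, None, 2.9, 5.3, 7.5, 7.2, 4.0],
    [5.4, 7.2, 9.6, 6.0, 5.7, None, 5.5, 8.6, 6.3, 4.2],
    [7.2, 8.2, 8.6, 7.4, 7.7, 3.5, None, 8.2, 7.8, 5.8],
    [7.8, 8.3, 8.6, 6.4, None, 3.6, 7.9, None, 7.6, 4.6],
    [8.6, 9.3, 9.0, 8.6, 9.0, 8.6, 9.0, 9.6, None, 8.6],
    [7.5, 7.0, 8.6, 7.3, 5.8, 3.8, 5.4, 8.6, 7.8, None],
]


def scored_cells(matrix):
    """Return every score that is actually present in the matrix."""
    return [score for row in matrix for score in row if score is not None]


def column_mean(matrix, col):
    """Mean score received by the respondent in column `col`."""
    scores = [row[col] for row in matrix if row[col] is not None]
    return sum(scores) / len(scores)


scores = scored_cells(MATRIX)
judgment_count = len(scores)
matrix_avg = sum(scores) / judgment_count

# Respondent with the highest column mean under the assumptions above.
best = max(range(len(RESPONDENTS)), key=lambda c: column_mean(MATRIX, c))

print(judgment_count, round(matrix_avg, 2), RESPONDENTS[best])
```

Under these assumptions the count and the average agree with the reported 88 judgments and matrix avg of 7.07, and the top column belongs to Grok 4.20.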