Evaluation EVAL-20260207-133847
reasoning
Feb 18, 2026 · REASON-006

Schedule a one-day conference with these constraints:

TALKS: A (90min), B (60min), C (45min), D (30min), E (30min), F (45min)
ROOMS: Main Hall (capacity 500), Room 2 (capacity 100), Room 3 (capacity 50)
TIME: 9:00 AM - 5:00 PM, with mandatory lunch break 12:00-1:00 PM

CONSTRAINTS:
1. Talk A must be in Main Hall (expected attendance: 400)
2. Talk B and C cannot overlap (same speaker)
3. Talk D must be before Talk E (E builds on D's content)
4. Talk F requires Room 2's AV equipment
5. No room can have more than 3 talks total
6. At least one talk must be running at all times (except lunch)
7. Talk A cannot start before 10:00 AM (speaker arriving late)
8. Talk E must end by 3:00 PM (speaker leaving early)

Find a valid schedule or prove none exists.
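One quick feasibility check on the prompt is to compare the total talk time against the time that constraint 6 requires to be covered. The sketch below is only an illustration, not part of the evaluation. It assumes each talk is given exactly once (the prompt does not say talks may repeat), and the names `TALKS`, `WINDOWS`, and `coverage_gap` are hypothetical.

```python
# Durations in minutes, copied from the prompt.
TALKS = {"A": 90, "B": 60, "C": 45, "D": 30, "E": 30, "F": 45}


def minutes(hour, minute=0):
    """Convert a 24-hour clock time to minutes since midnight."""
    return hour * 60 + minute


# Windows in which constraint 6 requires at least one talk to be running:
# 9:00-12:00 and 13:00-17:00 (lunch is exempt).
WINDOWS = [(minutes(9), minutes(12)), (minutes(13), minutes(17))]


def coverage_gap(talks, windows):
    """Return required minutes minus total talk minutes.

    A positive result means the talks cannot cover the windows even if
    they are laid end to end with no overlap (assumption: one run each).
    """
    required = sum(end - start for start, end in windows)
    available = sum(talks.values())
    return required - available


gap = coverage_gap(TALKS, WINDOWS)
print(f"shortfall: {gap} minutes")
```

Under that assumption the talks total 300 minutes against 420 minutes to cover, so constraint 6 alone would rule out any schedule, regardless of room assignments.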

Winner: GPT-OSS-120B (OpenAI)
Winner score: 8.31 (matrix avg: 5.78)
10×10 Judgment Matrix · 100 judgments
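| Judge ↓ / Respondent → | OLMo Think | MiMo-V2-Flash | Gemini 3 | Grok 3 (Direct) | Claude Sonnet 4.5 | DeepSeek V3.2 | Claude Opus 4.5 | Gemini 3 | Gemini 2.5 Flash | GPT-OSS-120B |
|---|---|---|---|---|---|---|---|---|---|---|
| OLMo Think | - | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| MiMo-V2-Flash | 1.6 | - | 3.9 | 4.6 | 6.4 | 7.5 | 5.8 | 1.6 | 5.5 | 2.8 |
| Gemini 3 | 0.0 | 8.2 | - | 7.7 | 4.5 | 8.2 | 8.6 | 4.2 | 5.0 | 10.0 |
| Grok 3 (Direct) | 0.0 | 8.2 | 8.4 | - | 9.4 | 7.6 | 8.2 | 2.6 | 5.9 | 8.8 |
| Claude Sonnet 4.5 | 0.0 | 6.0 | 7.2 | 5.7 | - | 7.0 | 8.2 | 0.7 | 4.1 | 9.8 |
| DeepSeek V3.2 | 7.5 | 7.3 | 8.2 | 5.2 | 4.2 | - | 7.6 | 1.6 | 6.2 | 9.5 |
| Claude Opus 4.5 | 0.0 | 6.0 | 4.8 | 4.3 | 5.4 | 4.0 | - | 1.0 | 3.6 | 7.3 |
| Gemini 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | - | 2.0 | 0.0 |
| Gemini 2.5 Flash | 0.0 | 7.8 | 5.2 | 5.6 | 8.6 | 6.5 | 3.6 | 0.0 | - | 10.0 |
| GPT-OSS-120B | 0.0 | 4.3 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.9 | 2.6 | - |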
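The page does not state how the winner score is aggregated. One reading that reproduces the reported 8.31 is the mean of the non-zero scores in the GPT-OSS-120B respondent column, treating 0.0 as a missing judgment. The sketch below illustrates that reading only. The helper `mean_nonzero` is hypothetical, and the column values assume the blank self-judgment cell in each row sits on the matrix diagonal.

```python
# Scores given to GPT-OSS-120B by the other nine judges, in row order
# (OLMo Think ... Gemini 2.5 Flash). Assumption: each row's missing
# self-judgment cell falls on the diagonal of the matrix.
gpt_oss_column = [0.0, 2.8, 10.0, 8.8, 9.8, 9.5, 7.3, 0.0, 10.0]


def mean_nonzero(scores):
    """Average the scores, skipping 0.0 entries.

    Assumption: 0.0 marks a failed or missing judgment rather than a
    genuine score of zero.
    """
    valid = [s for s in scores if s > 0]
    return sum(valid) / len(valid) if valid else 0.0


print(round(mean_nonzero(gpt_oss_column), 2))  # 8.31
```

Including the two 0.0 entries would give a lower mean of about 6.47, so the reported score appears to exclude them. This is an inference from the numbers, not something the page documents.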