← Evaluations/EVAL-20260402-190115
analysis
Feb 12, 2026ANALYSIS-005

Your team ran an A/B test on a checkout flow. Here are the results: Control (A): 10,000 visitors, 320 conversions (3.2%) Treatment (B): 10,000 visitors, 380 conversions (3.8%) The product manager says: "B wins! Let's ship it - that's an 18.75% improvement!" 1. Calculate the statistical significance (provide p-value) 2. What's the 95% confidence interval for the true difference? 3. The test ran for 2 days. What concerns does this raise? 4. You discover Treatment B had a bug on iOS that blocked 2,000 users from even reaching checkout. How does this change your analysis? 5. What would you recommend?

Winner
GPT-5.4
openrouter
9.30
WINNER SCORE
matrix avg: 7.84
results.json report.mdFull dataset (CSV) →
10×10 Judgment Matrix · 81 judgments
OPEN DATA
Judge ↓ / Respondent →Gemini 3.1 ProClaude Opus 4.6GPT-5.4DeepSeek V4MiMo-V2-FlashClaude Sonnet 4.6Grok 4.20GPT-OSS-120BGemini 3MiniMax M2.5
Gemini 3.1 Pro8.210.07.27.29.18.65.56.5·
Claude Opus 4.61.69.88.88.99.29.37.08.9·
GPT-5.41.26.76.17.77.87.74.58.6·
DeepSeek V46.08.88.89.09.69.08.49.2·
MiMo-V2-Flash2.29.69.08.39.09.37.39.3·
Claude Sonnet 4.61.49.29.38.69.28.88.18.6·
Grok 4.203.68.89.08.46.88.67.98.6·
GPT-OSS-120B2.07.78.87.88.88.88.38.3·
Gemini 32.99.810.09.89.89.89.88.3·
MiniMax M2.51.49.09.08.89.09.08.87.28.8