← Evaluations/EVAL-20260207-144351
analysis
Feb 12, 2026 · ANALYSIS-005

Your team ran an A/B test on a checkout flow. Here are the results:

Control (A): 10,000 visitors, 320 conversions (3.2%)
Treatment (B): 10,000 visitors, 380 conversions (3.8%)

The product manager says: "B wins! Let's ship it - that's an 18.75% improvement!"

1. Calculate the statistical significance (provide p-value)
2. What's the 95% confidence interval for the true difference?
3. The test ran for 2 days. What concerns does this raise?
4. You discover Treatment B had a bug on iOS that blocked 2,000 users from even reaching checkout. How does this change your analysis?
5. What would you recommend?
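As a reference point for questions 1 and 2, the headline numbers can be checked with a standard two-proportion z-test. The sketch below is one common approach, not a model answer from this evaluation: it assumes a two-sided test with a pooled standard error for the p-value and an unpooled (Wald) standard error for the confidence interval, and it takes the reported counts at face value, so it ignores the 2-day duration and the iOS bug raised in questions 3 and 4. The function name `two_proportion_ztest` is illustrative.

```python
import math

def two_proportion_ztest(x_a, n_a, x_b, n_b, z_crit=1.959964):
    """Two-sided z-test and Wald CI for the difference p_b - p_a."""
    p_a, p_b = x_a / n_a, x_b / n_b
    diff = p_b - p_a

    # Pooled standard error, used under the null hypothesis p_a == p_b.
    pooled = (x_a + x_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled

    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Unpooled standard error for the confidence interval.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, ci

# Counts from the prompt: Control 320/10,000, Treatment 380/10,000.
z, p_value, ci = two_proportion_ztest(320, 10_000, 380, 10_000)
relative_lift = (380 / 10_000) / (320 / 10_000) - 1  # the PM's 18.75%

print(f"z = {z:.2f}, p = {p_value:.3f}")          # z = 2.31, p = 0.021
print(f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")     # 95% CI = (0.0009, 0.0111)
print(f"relative lift = {relative_lift:.2%}")     # relative lift = 18.75%
```

Under these assumptions the difference is nominally significant at the 5% level, but the interval runs from about 0.09 to 1.11 percentage points, so the relative lift is far less certain than the quoted 18.75% suggests.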

Winner: MiMo-V2-Flash (Xiaomi)
Winner score: 9.69
Matrix avg: 8.47
10×10 Judgment Matrix · 100 judgments
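| Judge ↓ / Respondent → | MiMo-V2-Flash | Gemini 3 | GPT-OSS-Legal | Gemini 2.5 Flash | GPT-OSS-120B | DeepSeek V3.2 | Claude Sonnet 4.5 | Claude Opus 4.5 | Gemini 3 | Grok 4.1 Fast |
|---|---|---|---|---|---|---|---|---|---|---|
| MiMo-V2-Flash | | 8.6 | 7.3 | 8.8 | 8.6 | 9.0 | 9.3 | 9.3 | 2.6 | 9.6 |
| Gemini 3 | 10.0 | | 8.1 | 9.4 | 8.8 | 9.8 | 10.0 | 10.0 | 4.4 | 9.6 |
| GPT-OSS-Legal | 9.0 | 8.8 | | 7.0 | 0.0 | 8.8 | 8.8 | 8.8 | 2.0 | 0.0 |
| Gemini 2.5 Flash | 10.0 | 10.0 | 9.0 | | 9.0 | 10.0 | 10.0 | 9.0 | 2.9 | 9.0 |
| GPT-OSS-120B | 0.0 | 0.0 | 5.8 | 6.5 | | 0.0 | 0.0 | 0.0 | 2.5 | 0.0 |
| DeepSeek V3.2 | 9.6 | 9.6 | 8.6 | 9.3 | 8.8 | | 9.6 | 10.0 | 1.5 | 9.3 |
| Claude Sonnet 4.5 | 9.8 | 9.6 | 7.9 | 9.2 | 8.7 | 9.8 | | 9.6 | 2.1 | 9.6 |
| Claude Opus 4.5 | 9.6 | 9.6 | 6.7 | 8.0 | 8.1 | 9.6 | 9.6 | | 1.6 | 9.3 |
| Gemini 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | 0.0 |
| Grok 4.1 Fast | 10.0 | 10.0 | 8.2 | 8.4 | 8.4 | 10.0 | 10.0 | 10.0 | 5.8 | |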