analysis
Feb 12, 2026 · ANALYSIS-005

Your team ran an A/B test on a checkout flow. Here are the results:

- Control (A): 10,000 visitors, 320 conversions (3.2%)
- Treatment (B): 10,000 visitors, 380 conversions (3.8%)

The product manager says: "B wins! Let's ship it - that's an 18.75% improvement!"

1. Calculate the statistical significance (provide p-value)
2. What's the 95% confidence interval for the true difference?
3. The test ran for 2 days. What concerns does this raise?
4. You discover Treatment B had a bug on iOS that blocked 2,000 users from even reaching checkout. How does this change your analysis?
5. What would you recommend?
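For reference, the arithmetic behind questions 1 and 2 can be sketched as follows. This is a minimal illustration, not the grading rubric or any respondent's answer. It assumes a two-sided pooled two-proportion z-test for the p-value and an unpooled (Wald) interval for the difference; the function name `two_proportion_test` is hypothetical. It uses the headline counts as given and does not adjust for the iOS bug raised in question 4.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Hypothetical helper: pooled z-test and Wald interval for p_b - p_a."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error, valid under the null hypothesis p_a == p_b.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled

    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval on the difference.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, ci


# Counts taken from the prompt: control 320/10,000, treatment 380/10,000.
z, p_value, ci = two_proportion_test(320, 10_000, 380, 10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
print(f"95% CI for the difference: {ci[0]:.4f} to {ci[1]:.4f}")
```

On these counts the test gives z of about 2.31, a two-sided p-value of about 0.021, and a 95% interval of roughly 0.09 to 1.11 percentage points. The result is nominally significant, but the lower bound sits close to zero, and the calculation assumes clean randomization over a representative period, which questions 3 and 4 call into doubt.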
Winner: MiMo-V2-Flash (Xiaomi)
Winner score: 9.69 (matrix avg: 8.47)
10×10 Judgment Matrix · 100 judgments
| Judge ↓ / Respondent → | MiMo-V2-Flash | Gemini 3 | GPT-OSS-Legal | Gemini 2.5 Flash | GPT-OSS-120B | DeepSeek V3.2 | Claude Sonnet 4.5 | Claude Opus 4.5 | Gemini 3 | Grok 4.1 Fast |
|---|---|---|---|---|---|---|---|---|---|---|
| MiMo-V2-Flash | — | 8.6 | 7.3 | 8.8 | 8.6 | 9.0 | 9.3 | 9.3 | 2.6 | 9.6 |
| Gemini 3 | 10.0 | — | 8.1 | 9.4 | 8.8 | 9.8 | 10.0 | 10.0 | 4.4 | 9.6 |
| GPT-OSS-Legal | 9.0 | 8.8 | — | 7.0 | 0.0 | 8.8 | 8.8 | 8.8 | 2.0 | 0.0 |
| Gemini 2.5 Flash | 10.0 | 10.0 | 9.0 | — | 9.0 | 10.0 | 10.0 | 9.0 | 2.9 | 9.0 |
| GPT-OSS-120B | 0.0 | 0.0 | 5.8 | 6.5 | — | 0.0 | 0.0 | 0.0 | 2.5 | 0.0 |
| DeepSeek V3.2 | 9.6 | 9.6 | 8.6 | 9.3 | 8.8 | — | 9.6 | 10.0 | 1.5 | 9.3 |
| Claude Sonnet 4.5 | 9.8 | 9.6 | 7.9 | 9.2 | 8.7 | 9.8 | — | 9.6 | 2.1 | 9.6 |
| Claude Opus 4.5 | 9.6 | 9.6 | 6.7 | 8.0 | 8.1 | 9.6 | 9.6 | — | 1.6 | 9.3 |
| Gemini 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | — | 0.0 |
| Grok 4.1 Fast | 10.0 | 10.0 | 8.2 | 8.4 | 8.4 | 10.0 | 10.0 | 10.0 | 5.8 | — |