analysis
Feb 12, 2026 · ANALYSIS-005

Your team ran an A/B test on a checkout flow. Here are the results:

- Control (A): 10,000 visitors, 320 conversions (3.2%)
- Treatment (B): 10,000 visitors, 380 conversions (3.8%)

The product manager says: "B wins! Let's ship it - that's an 18.75% improvement!"

1. Calculate the statistical significance (provide p-value)
2. What's the 95% confidence interval for the true difference?
3. The test ran for 2 days. What concerns does this raise?
4. You discover Treatment B had a bug on iOS that blocked 2,000 users from even reaching checkout. How does this change your analysis?
5. What would you recommend?
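For orientation on questions 1 and 2, here is a minimal sketch of one common approach, a two-sided two-proportion z-test with a Wald interval for the difference. It is not taken from any respondent's answer. The function name `two_proportion_ztest` is hypothetical, and the choice of a pooled standard error for the test and an unpooled one for the interval is an assumption; other methods (for example a chi-squared test or a Wilson-style interval) would give slightly different numbers.

```python
import math


def two_proportion_ztest(conv_a, n_a, conv_b, n_b, z_crit=1.959964):
    """Two-sided z-test for B vs. A plus a Wald CI for the difference.

    Hypothetical helper for illustration only. z_crit defaults to the
    normal quantile for a 95% interval.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error: used for the test, since H0 says p_a == p_b.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled

    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Unpooled standard error: used for the interval around the difference.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, ci


# Numbers as stated in the prompt, before accounting for the iOS bug.
z, p_value, ci = two_proportion_ztest(320, 10_000, 380, 10_000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
print(f"95% CI for difference: {ci[0]:.4%} to {ci[1]:.4%}")
```

Under these assumptions the sketch gives a z-statistic of about 2.31, a two-sided p-value of about 0.021, and an interval of roughly 0.09 to 1.11 percentage points. These figures take the stated counts at face value; the 2-day duration and the iOS bug in questions 3 and 4 are reasons not to trust them as they stand.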
Winner: GPT-5.4 (openrouter)
Winner score: 9.30 · matrix avg: 7.84
10×10 Judgment Matrix · 81 judgments
| Judge ↓ / Respondent → | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.4 | DeepSeek V4 | MiMo-V2-Flash | Claude Sonnet 4.6 | Grok 4.20 | GPT-OSS-120B | Gemini 3 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | — | 8.2 | 10.0 | 7.2 | 7.2 | 9.1 | 8.6 | 5.5 | 6.5 | · |
| Claude Opus 4.6 | 1.6 | — | 9.8 | 8.8 | 8.9 | 9.2 | 9.3 | 7.0 | 8.9 | · |
| GPT-5.4 | 1.2 | 6.7 | — | 6.1 | 7.7 | 7.8 | 7.7 | 4.5 | 8.6 | · |
| DeepSeek V4 | 6.0 | 8.8 | 8.8 | — | 9.0 | 9.6 | 9.0 | 8.4 | 9.2 | · |
| MiMo-V2-Flash | 2.2 | 9.6 | 9.0 | 8.3 | — | 9.0 | 9.3 | 7.3 | 9.3 | · |
| Claude Sonnet 4.6 | 1.4 | 9.2 | 9.3 | 8.6 | 9.2 | — | 8.8 | 8.1 | 8.6 | · |
| Grok 4.20 | 3.6 | 8.8 | 9.0 | 8.4 | 6.8 | 8.6 | — | 7.9 | 8.6 | · |
| GPT-OSS-120B | 2.0 | 7.7 | 8.8 | 7.8 | 8.8 | 8.8 | 8.3 | — | 8.3 | · |
| Gemini 3 | 2.9 | 9.8 | 10.0 | 9.8 | 9.8 | 9.8 | 9.8 | 8.3 | — | · |
| MiniMax M2.5 | 1.4 | 9.0 | 9.0 | 8.8 | 9.0 | 9.0 | 8.8 | 7.2 | 8.8 | — |
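
The summary figures can be checked against the matrix. The sketch below assumes that each respondent's score is the plain mean of its column (the scores it received from other judges) and that the matrix average is the mean of all filled cells; the page does not state either rule, so both are assumptions. Cells shown as `—` or `·` are treated as missing, and all variable names are illustrative.

```python
respondents = [
    "Gemini 3.1 Pro", "Claude Opus 4.6", "GPT-5.4", "DeepSeek V4",
    "MiMo-V2-Flash", "Claude Sonnet 4.6", "Grok 4.20", "GPT-OSS-120B",
    "Gemini 3", "MiniMax M2.5",
]

_ = None  # missing cell: self-judgment ("—") or no score ("·")

# Rows are judges, columns are respondents, in the order listed above.
scores = [
    [_,   8.2, 10.0, 7.2, 7.2, 9.1, 8.6, 5.5, 6.5, _],
    [1.6, _,   9.8,  8.8, 8.9, 9.2, 9.3, 7.0, 8.9, _],
    [1.2, 6.7, _,    6.1, 7.7, 7.8, 7.7, 4.5, 8.6, _],
    [6.0, 8.8, 8.8,  _,   9.0, 9.6, 9.0, 8.4, 9.2, _],
    [2.2, 9.6, 9.0,  8.3, _,   9.0, 9.3, 7.3, 9.3, _],
    [1.4, 9.2, 9.3,  8.6, 9.2, _,   8.8, 8.1, 8.6, _],
    [3.6, 8.8, 9.0,  8.4, 6.8, 8.6, _,   7.9, 8.6, _],
    [2.0, 7.7, 8.8,  7.8, 8.8, 8.8, 8.3, _,   8.3, _],
    [2.9, 9.8, 10.0, 9.8, 9.8, 9.8, 9.8, 8.3, _,   _],
    [1.4, 9.0, 9.0,  8.8, 9.0, 9.0, 8.8, 7.2, 8.8, _],
]

# Flatten the filled cells to count judgments and take the overall mean.
filled = [s for row in scores for s in row if s is not None]
judgment_count = len(filled)
matrix_avg = sum(filled) / judgment_count

# Mean score received by each respondent (column mean over filled cells).
column_means = {}
for name, column in zip(respondents, zip(*scores)):
    received = [s for s in column if s is not None]
    column_means[name] = sum(received) / len(received) if received else None

# Respondents with no scores at all are excluded from the ranking.
ranked = {k: v for k, v in column_means.items() if v is not None}
winner = max(ranked, key=ranked.get)

print(f"{judgment_count} judgments, matrix avg {matrix_avg:.2f}")
print(f"winner: {winner} ({column_means[winner]:.2f})")
```

With the cells as displayed, this yields 81 judgments and a column mean of 9.30 for GPT-5.4, matching the page. The overall mean comes out near 7.85 rather than the reported 7.84, which may reflect rounding in the displayed cells.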