The Multivac — Ask any model, routed by evaluation

Evaluation: EVAL-20260402-190115
Category: analysis
Feb 12, 2026 · ANALYSIS-005
Your team ran an A/B test on a checkout flow. Here are the results:

Control (A): 10,000 visitors, 320 conversions (3.2%)
Treatment (B): 10,000 visitors, 380 conversions (3.8%)

The product manager says: "B wins! Let's ship it - that's an 18.75% improvement!"

1. Calculate the statistical significance (provide p-value)
2. What's the 95% confidence interval for the true difference?
3. The test ran for 2 days. What concerns does this raise?
4. You discover Treatment B had a bug on iOS that blocked 2,000 users from even reaching checkout. How does this change your analysis?
5. What would you recommend?
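
Questions 1 and 2 come down to arithmetic on the four counts given above. The following is a minimal sketch, assuming a two-sided two-proportion z-test with a pooled standard error for the p-value and an unpooled (Wald) standard error for the interval. The page does not say which test the evaluation expects, so the method and the function name `two_proportion_ztest` are illustrative assumptions only.

```python
import math


def two_proportion_ztest(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-sided z-test for B vs. A plus a Wald interval for the difference.

    Assumption: a normal approximation is acceptable at these sample sizes.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error: used for the test statistic under H0 (p_a == p_b).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled

    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Unpooled standard error: used for the confidence interval.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, ci


# Counts taken from the prompt: 320/10,000 (A) and 380/10,000 (B).
z, p_value, ci = two_proportion_ztest(320, 10_000, 380, 10_000)
relative_lift = (380 - 320) / 320  # the product manager's 18.75%

print(f"z = {z:.3f}, p = {p_value:.4f}")
print(f"95% CI for difference: {ci[0]:.4%} to {ci[1]:.4%}")
print(f"relative lift: {relative_lift:.2%}")
```

Under these assumptions the result is roughly z ≈ 2.31 and p ≈ 0.021, with a 95% interval of about 0.09 to 1.11 percentage points for the absolute difference. The sketch does not address the short test duration or the iOS bug raised in questions 3 and 4, which affect whether these numbers can be trusted at all.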
Winner: GPT-5.4 (openrouter)
Winner score: 9.30
Matrix avg: 7.84
10×10 Judgment Matrix · 81 judgments
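| Judge ↓ / Respondent → | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.4 | DeepSeek V4 | MiMo-V2-Flash | Claude Sonnet 4.6 | Grok 4.20 | GPT-OSS-120B | Gemini 3 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | — | 8.2 | 10.0 | 7.2 | 7.2 | 9.1 | 8.6 | 5.5 | 6.5 | · |
| Claude Opus 4.6 | 1.6 | — | 9.8 | 8.8 | 8.9 | 9.2 | 9.3 | 7.0 | 8.9 | · |
| GPT-5.4 | 1.2 | 6.7 | — | 6.1 | 7.7 | 7.8 | 7.7 | 4.5 | 8.6 | · |
| DeepSeek V4 | 6.0 | 8.8 | 8.8 | — | 9.0 | 9.6 | 9.0 | 8.4 | 9.2 | · |
| MiMo-V2-Flash | 2.2 | 9.6 | 9.0 | 8.3 | — | 9.0 | 9.3 | 7.3 | 9.3 | · |
| Claude Sonnet 4.6 | 1.4 | 9.2 | 9.3 | 8.6 | 9.2 | — | 8.8 | 8.1 | 8.6 | · |
| Grok 4.20 | 3.6 | 8.8 | 9.0 | 8.4 | 6.8 | 8.6 | — | 7.9 | 8.6 | · |
| GPT-OSS-120B | 2.0 | 7.7 | 8.8 | 7.8 | 8.8 | 8.8 | 8.3 | — | 8.3 | · |
| Gemini 3 | 2.9 | 9.8 | 10.0 | 9.8 | 9.8 | 9.8 | 9.8 | 8.3 | — | · |
| MiniMax M2.5 | 1.4 | 9.0 | 9.0 | 8.8 | 9.0 | 9.0 | 8.8 | 7.2 | 8.8 | — |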
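
The page does not state how the winner score is derived from the matrix. One plausible reading, offered here only as an assumption, is that each respondent's score is the mean of the scores it received from the other judges (a column mean that skips the self-judgment diagonal and the missing `·` cells). The sketch below checks that reading against the matrix values; the variable names are illustrative.

```python
# Judgment matrix transcribed from the page: rows are judges, columns are
# respondents, None marks the diagonal (—) and missing (·) cells.
models = [
    "Gemini 3.1 Pro", "Claude Opus 4.6", "GPT-5.4", "DeepSeek V4",
    "MiMo-V2-Flash", "Claude Sonnet 4.6", "Grok 4.20", "GPT-OSS-120B",
    "Gemini 3", "MiniMax M2.5",
]
N = None
matrix = [
    [N,   8.2, 10.0, 7.2, 7.2, 9.1, 8.6, 5.5, 6.5, N],
    [1.6, N,   9.8,  8.8, 8.9, 9.2, 9.3, 7.0, 8.9, N],
    [1.2, 6.7, N,    6.1, 7.7, 7.8, 7.7, 4.5, 8.6, N],
    [6.0, 8.8, 8.8,  N,   9.0, 9.6, 9.0, 8.4, 9.2, N],
    [2.2, 9.6, 9.0,  8.3, N,   9.0, 9.3, 7.3, 9.3, N],
    [1.4, 9.2, 9.3,  8.6, 9.2, N,   8.8, 8.1, 8.6, N],
    [3.6, 8.8, 9.0,  8.4, 6.8, 8.6, N,   7.9, 8.6, N],
    [2.0, 7.7, 8.8,  7.8, 8.8, 8.8, 8.3, N,   8.3, N],
    [2.9, 9.8, 10.0, 9.8, 9.8, 9.8, 9.8, 8.3, N,   N],
    [1.4, 9.0, 9.0,  8.8, 9.0, 9.0, 8.8, 7.2, 8.8, N],
]

# Assumed scoring rule: mean of the scores received by each respondent.
received = {}
for col, name in enumerate(models):
    scores = [row[col] for row in matrix if row[col] is not None]
    if scores:  # a respondent with no judgments gets no score
        received[name] = sum(scores) / len(scores)

all_scores = [s for row in matrix for s in row if s is not None]
judgment_count = len(all_scores)
overall_mean = sum(all_scores) / judgment_count
winner = max(received, key=received.get)

print(f"{judgment_count} judgments, overall mean {overall_mean:.3f}")
for name, score in sorted(received.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {score:.2f}")
```

Under this assumed rule the matrix yields 81 judgments and GPT-5.4 comes out on top with a mean of 9.30, which agrees with the winner score shown on the page. The overall mean of all cells is about 7.85, close to but not exactly the listed matrix avg of 7.84, so the site may round or aggregate slightly differently.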