The Multivac — Ask any model, routed by evaluation

Evaluation EVAL-20260207-144351
analysis
Feb 12, 2026 · ANALYSIS-005
Your team ran an A/B test on a checkout flow. Here are the results:

Control (A): 10,000 visitors, 320 conversions (3.2%)
Treatment (B): 10,000 visitors, 380 conversions (3.8%)

The product manager says: "B wins! Let's ship it - that's an 18.75% improvement!"

1. Calculate the statistical significance (provide p-value)
2. What's the 95% confidence interval for the true difference?
3. The test ran for 2 days. What concerns does this raise?
4. You discover Treatment B had a bug on iOS that blocked 2,000 users from even reaching checkout. How does this change your analysis?
5. What would you recommend?
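
Questions 1 and 2 come down to arithmetic on the two conversion rates. The sketch below shows one conventional way to do it: a pooled two-proportion z-test for the p-value and an unpooled (Wald) normal-approximation interval for the difference. The function name `two_proportion_ztest` and its return keys are illustrative assumptions, not part of the evaluation, and this is not necessarily the method the judges scored against. It also ignores the iOS bug from question 4, which would change the denominators.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Two-sided z-test for B vs A, plus a Wald interval for the difference.

    Hypothetical helper for illustration; assumes independent visitors and
    samples large enough for the normal approximation.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error: used for the test, under H0 that both rates are equal.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area

    # Unpooled standard error: used for the confidence interval.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * se_unpooled

    return {
        "diff": diff,
        "z": z,
        "p_value": p_value,
        "ci": (diff - margin, diff + margin),
    }


if __name__ == "__main__":
    # Figures taken from the prompt: 320/10,000 vs 380/10,000.
    result = two_proportion_ztest(320, 10_000, 380, 10_000)
    print(f"z = {result['z']:.3f}, p = {result['p_value']:.4f}")
    print(f"95% CI for difference: {result['ci'][0]:.4f} to {result['ci'][1]:.4f}")
```

With the prompt's figures this gives z of roughly 2.31, a two-sided p-value of roughly 0.021, and an interval for the absolute difference of roughly 0.09 to 1.11 percentage points. That is nominally significant at the 5% level, but the interval's lower bound sits close to zero, which is why the two-day duration and the iOS bug matter so much to the recommendation.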
Winner: MiMo-V2-Flash (Xiaomi)
Winner score: 9.69
Matrix avg: 8.47
10×10 Judgment Matrix · 100 judgments
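| Judge ↓ / Respondent → | MiMo-V2-Flash | Gemini 3 | GPT-OSS-Legal | Gemini 2.5 Flash | GPT-OSS-120B | DeepSeek V3.2 | Claude Sonnet 4.5 | Claude Opus 4.5 | Gemini 3 | Grok 4.1 Fast |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MiMo-V2-Flash | — | 8.6 | 7.3 | 8.8 | 8.6 | 9.0 | 9.3 | 9.3 | 2.6 | 9.6 |
| Gemini 3 | 10.0 | — | 8.1 | 9.4 | 8.8 | 9.8 | 10.0 | 10.0 | 4.4 | 9.6 |
| GPT-OSS-Legal | 9.0 | 8.8 | — | 7.0 | 0.0 | 8.8 | 8.8 | 8.8 | 2.0 | 0.0 |
| Gemini 2.5 Flash | 10.0 | 10.0 | 9.0 | — | 9.0 | 10.0 | 10.0 | 9.0 | 2.9 | 9.0 |
| GPT-OSS-120B | 0.0 | 0.0 | 5.8 | 6.5 | — | 0.0 | 0.0 | 0.0 | 2.5 | 0.0 |
| DeepSeek V3.2 | 9.6 | 9.6 | 8.6 | 9.3 | 8.8 | — | 9.6 | 10.0 | 1.5 | 9.3 |
| Claude Sonnet 4.5 | 9.8 | 9.6 | 7.9 | 9.2 | 8.7 | 9.8 | — | 9.6 | 2.1 | 9.6 |
| Claude Opus 4.5 | 9.6 | 9.6 | 6.7 | 8.0 | 8.1 | 9.6 | 9.6 | — | 1.6 | 9.3 |
| Gemini 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | — | 0.0 |
| Grok 4.1 Fast | 10.0 | 10.0 | 8.2 | 8.4 | 8.4 | 10.0 | 10.0 | 10.0 | 5.8 | — |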