# ANALYSIS-012 · Apr 02, 2026

**Prompt:** A bank uses an ML model for loan approvals. Overall accuracy is 92%, but analysis shows an approval rate of 78% for Group A and 45% for Group B. The bank's defense: "the model doesn't use race as a feature."

1. Explain how the model can be discriminatory without using race directly.
2. What proxy variables might cause this?
3. Is equalizing approval rates the right fix? What are the tradeoffs?
4. Design an audit procedure.
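A minimal sketch of how question (1) is typically screened in practice: the approval rates in the prompt already fail the "four-fifths rule" (disparate impact ratio), a common adverse-impact heuristic that applies regardless of whether race is an explicit feature. The 0.8 threshold is the standard EEOC rule of thumb; the rates are taken directly from the prompt.

```python
# Disparate impact ratio: approval rate of the disadvantaged group divided
# by approval rate of the advantaged group. Rates come from the prompt.
rate_a = 0.78  # approval rate, Group A
rate_b = 0.45  # approval rate, Group B

impact_ratio = rate_b / rate_a
print(f"disparate impact ratio: {impact_ratio:.3f}")  # ≈ 0.577

# A ratio below 0.8 is commonly treated as prima facie evidence of adverse
# impact, even for a model that never sees a protected attribute directly.
flagged = impact_ratio < 0.8
print("adverse-impact flag:", flagged)
```

A ratio of roughly 0.58 is far below the 0.8 screening line, which is why "we don't use race as a feature" is not, by itself, a defense: proxies (ZIP code, credit history length, employment type) can reconstruct the protected attribute.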
## Winner

**Grok 4.20** (via openrouter) — winner score: **9.47** · matrix avg: 8.81
### 10×10 Judgment Matrix · 90 judgments
| Judge ↓ / Respondent → | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.4 | DeepSeek V4 | MiMo-V2-Flash | Claude Sonnet 4.6 | Grok 4.20 | GPT-OSS-120B | Gemini 3 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | — | 6.8 | 9.6 | 9.8 | 10.0 | 7.5 | 10.0 | 7.9 | 9.8 | 9.2 |
| Claude Opus 4.6 | 8.3 | — | 9.2 | 9.0 | 9.2 | 7.7 | 10.0 | 8.7 | 9.2 | 9.2 |
| GPT-5.4 | 7.5 | 6.8 | — | 8.8 | 9.0 | 6.5 | 9.2 | 6.7 | 8.6 | 9.0 |
| DeepSeek V4 | 9.0 | 8.7 | 9.8 | — | 9.2 | 8.8 | 9.8 | 9.2 | 9.0 | 9.8 |
| MiMo-V2-Flash | 8.8 | 8.7 | 9.0 | 9.0 | — | 8.7 | 9.0 | 9.2 | 9.2 | 9.2 |
| Claude Sonnet 4.6 | 8.3 | 9.2 | 9.8 | 8.8 | 9.0 | — | 9.8 | 8.8 | 8.8 | 8.8 |
| Grok 4.20 | 8.3 | 8.7 | 8.8 | 8.7 | 8.8 | 8.7 | — | 8.8 | 8.7 | 8.8 |
| GPT-OSS-120B | 7.3 | 6.8 | 8.8 | 8.6 | 8.8 | 7.7 | 8.7 | — | 8.4 | 8.7 |
| Gemini 3 | 9.4 | 9.4 | 9.8 | 9.8 | 9.8 | 9.8 | 9.8 | 9.4 | — | 9.8 |
| MiniMax M2.5 | 8.1 | 7.4 | 9.2 | 8.8 | 9.2 | 8.1 | 9.0 | 7.7 | 8.8 | — |
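How the headline numbers relate to the matrix can be checked directly: each respondent's score is (apparently) the mean of its column, with the self-judgment diagonal excluded, and the matrix average is the mean of all 90 off-diagonal cells. The sketch below recomputes both from the table values as transcribed; note the rounded table cells reproduce the displayed 9.47 and 8.81 only to within about ±0.01, presumably because the site averages unrounded judgments.

```python
# Respondents in table order; rows[i][j] = score judge i gave respondent j,
# with None on the diagonal (no self-judgment). Values transcribed above.
models = ["Gemini 3.1 Pro", "Claude Opus 4.6", "GPT-5.4", "DeepSeek V4",
          "MiMo-V2-Flash", "Claude Sonnet 4.6", "Grok 4.20", "GPT-OSS-120B",
          "Gemini 3", "MiniMax M2.5"]
rows = [
    [None, 6.8, 9.6, 9.8, 10.0, 7.5, 10.0, 7.9, 9.8, 9.2],
    [8.3, None, 9.2, 9.0, 9.2, 7.7, 10.0, 8.7, 9.2, 9.2],
    [7.5, 6.8, None, 8.8, 9.0, 6.5, 9.2, 6.7, 8.6, 9.0],
    [9.0, 8.7, 9.8, None, 9.2, 8.8, 9.8, 9.2, 9.0, 9.8],
    [8.8, 8.7, 9.0, 9.0, None, 8.7, 9.0, 9.2, 9.2, 9.2],
    [8.3, 9.2, 9.8, 8.8, 9.0, None, 9.8, 8.8, 8.8, 8.8],
    [8.3, 8.7, 8.8, 8.7, 8.8, 8.7, None, 8.8, 8.7, 8.8],
    [7.3, 6.8, 8.8, 8.6, 8.8, 7.7, 8.7, None, 8.4, 8.7],
    [9.4, 9.4, 9.8, 9.8, 9.8, 9.8, 9.8, 9.4, None, 9.8],
    [8.1, 7.4, 9.2, 8.8, 9.2, 8.1, 9.0, 7.7, 8.8, None],
]

# Per-respondent score: mean of the column, skipping the diagonal.
col_means = {}
for j, name in enumerate(models):
    col = [r[j] for r in rows if r[j] is not None]
    col_means[name] = sum(col) / len(col)

for name, m in sorted(col_means.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {m:.2f}")

# Matrix average: mean of all 90 off-diagonal judgments.
all_scores = [v for r in rows for v in r if v is not None]
print(f"matrix avg: {sum(all_scores) / len(all_scores):.2f}")
```

Grok 4.20 indeed tops the ranking from these values (column mean ≈ 9.48 vs. the displayed 9.47), with GPT-5.4 second at ≈ 9.33.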