# ANALYSIS-012 · Apr 02, 2026

**Prompt:** A bank uses an ML model for loan approvals. Overall accuracy is 92%, but analysis shows an approval rate of 78% for Group A and 45% for Group B. The bank's defense: "the model doesn't use race as a feature."

1. Explain how the model can be discriminatory without using race directly.
2. What proxy variables might cause this?
3. Is equalizing approval rates the right fix? What are the tradeoffs?
4. Design an audit procedure.
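A minimal sketch of how question (1) is typically screened in practice: the approval rates in the prompt already fail the "four-fifths rule" (disparate impact ratio), a common adverse-impact heuristic that applies regardless of whether race is an explicit feature. The 0.8 threshold is the standard EEOC rule of thumb; the rates are taken directly from the prompt.

```python
# Disparate impact ratio: approval rate of the disadvantaged group divided
# by approval rate of the advantaged group. Rates come from the prompt.
rate_a = 0.78  # approval rate, Group A
rate_b = 0.45  # approval rate, Group B

impact_ratio = rate_b / rate_a
print(f"disparate impact ratio: {impact_ratio:.3f}")  # ≈ 0.577

# A ratio below 0.8 is commonly treated as prima facie evidence of adverse
# impact, even for a model that never sees a protected attribute directly.
flagged = impact_ratio < 0.8
print("adverse-impact flag:", flagged)
```

A ratio of roughly 0.58 is far below the 0.8 screening line, which is why "we don't use race as a feature" is not, by itself, a defense: proxies (ZIP code, credit history length, employment type) can reconstruct the protected attribute.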
## Winner

**Grok 4.20** (via openrouter) — winner score: **9.47** · matrix avg: 8.81
### 10×10 Judgment Matrix · 90 judgments
| Judge ↓ / Respondent → | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.4 | DeepSeek V4 | MiMo-V2-Flash | Claude Sonnet 4.6 | Grok 4.20 | GPT-OSS-120B | Gemini 3 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | — | 6.8 | 9.6 | 9.8 | 10.0 | 7.5 | 10.0 | 7.9 | 9.8 | 9.2 |
| Claude Opus 4.6 | 8.3 | — | 9.2 | 9.0 | 9.2 | 7.7 | 10.0 | 8.7 | 9.2 | 9.2 |
| GPT-5.4 | 7.5 | 6.8 | — | 8.8 | 9.0 | 6.5 | 9.2 | 6.7 | 8.6 | 9.0 |
| DeepSeek V4 | 9.0 | 8.7 | 9.8 | — | 9.2 | 8.8 | 9.8 | 9.2 | 9.0 | 9.8 |
| MiMo-V2-Flash | 8.8 | 8.7 | 9.0 | 9.0 | — | 8.7 | 9.0 | 9.2 | 9.2 | 9.2 |
| Claude Sonnet 4.6 | 8.3 | 9.2 | 9.8 | 8.8 | 9.0 | — | 9.8 | 8.8 | 8.8 | 8.8 |
| Grok 4.20 | 8.3 | 8.7 | 8.8 | 8.7 | 8.8 | 8.7 | — | 8.8 | 8.7 | 8.8 |
| GPT-OSS-120B | 7.3 | 6.8 | 8.8 | 8.6 | 8.8 | 7.7 | 8.7 | — | 8.4 | 8.7 |
| Gemini 3 | 9.4 | 9.4 | 9.8 | 9.8 | 9.8 | 9.8 | 9.8 | 9.4 | — | 9.8 |
| MiniMax M2.5 | 8.1 | 7.4 | 9.2 | 8.8 | 9.2 | 8.1 | 9.0 | 7.7 | 8.8 | — |
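How the headline numbers relate to the matrix can be checked directly: each respondent's score is (apparently) the mean of its column, with the self-judgment diagonal excluded, and the matrix average is the mean of all 90 off-diagonal cells. The sketch below recomputes both from the table values as transcribed; note the rounded table cells reproduce the displayed 9.47 and 8.81 only to within about ±0.01, presumably because the site averages unrounded judgments.

```python
# Respondents in table order; rows[i][j] = score judge i gave respondent j,
# with None on the diagonal (no self-judgment). Values transcribed above.
models = ["Gemini 3.1 Pro", "Claude Opus 4.6", "GPT-5.4", "DeepSeek V4",
          "MiMo-V2-Flash", "Claude Sonnet 4.6", "Grok 4.20", "GPT-OSS-120B",
          "Gemini 3", "MiniMax M2.5"]
rows = [
    [None, 6.8, 9.6, 9.8, 10.0, 7.5, 10.0, 7.9, 9.8, 9.2],
    [8.3, None, 9.2, 9.0, 9.2, 7.7, 10.0, 8.7, 9.2, 9.2],
    [7.5, 6.8, None, 8.8, 9.0, 6.5, 9.2, 6.7, 8.6, 9.0],
    [9.0, 8.7, 9.8, None, 9.2, 8.8, 9.8, 9.2, 9.0, 9.8],
    [8.8, 8.7, 9.0, 9.0, None, 8.7, 9.0, 9.2, 9.2, 9.2],
    [8.3, 9.2, 9.8, 8.8, 9.0, None, 9.8, 8.8, 8.8, 8.8],
    [8.3, 8.7, 8.8, 8.7, 8.8, 8.7, None, 8.8, 8.7, 8.8],
    [7.3, 6.8, 8.8, 8.6, 8.8, 7.7, 8.7, None, 8.4, 8.7],
    [9.4, 9.4, 9.8, 9.8, 9.8, 9.8, 9.8, 9.4, None, 9.8],
    [8.1, 7.4, 9.2, 8.8, 9.2, 8.1, 9.0, 7.7, 8.8, None],
]

# Per-respondent score: mean of the column, skipping the diagonal.
col_means = {}
for j, name in enumerate(models):
    col = [r[j] for r in rows if r[j] is not None]
    col_means[name] = sum(col) / len(col)

for name, m in sorted(col_means.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {m:.2f}")

# Matrix average: mean of all 90 off-diagonal judgments.
all_scores = [v for r in rows for v in r if v is not None]
print(f"matrix avg: {sum(all_scores) / len(all_scores):.2f}")
```

Grok 4.20 indeed tops the ranking from these values (column mean ≈ 9.48 vs. the displayed 9.47), with GPT-5.4 second at ≈ 9.33.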