# The Multivac — Evaluation Report

**Evaluation ID:** EVAL-20260402-193016
**Date:** Apr 02, 2026
**Category:** analysis
**Question ID:** ANALYSIS-012

---

## Question

A bank uses an ML model for loan approvals. The model's accuracy is 92%, but analysis shows an approval rate of 78% for Group A and only 45% for Group B. The bank says "the model doesn't use race as a feature." (1) Explain how the model can be discriminatory without using race directly. (2) What proxy variables might cause this? (3) Is equalizing approval rates the right fix? What are the tradeoffs? (4) Design an audit procedure.
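One way to make the disparity in the question concrete is the four-fifths (80%) rule, a common regulatory heuristic under which a group's selection rate below 80% of the highest group's rate flags potential adverse impact. A minimal sketch, using only the two approval rates stated in the question (the threshold is a heuristic, not a legal determination):

```python
# Disparate-impact check via the four-fifths (80%) rule.
# Approval rates are the ones stated in the question above.

approval_rates = {"Group A": 0.78, "Group B": 0.45}

highest = max(approval_rates.values())
for group, rate in approval_rates.items():
    ratio = rate / highest  # selection-rate ratio vs. the most-approved group
    flagged = ratio < 0.8   # below 0.8 => potential adverse impact
    print(f"{group}: rate={rate:.0%}, ratio={ratio:.2f}, flagged={flagged}")
```

Here Group B's ratio is 0.45 / 0.78 ≈ 0.58, well under the 0.8 threshold, which is why the disparity is worth auditing even though race is not an input feature.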

---

## Winner

**Grok 4.20** (openrouter)
- Winner Score: 9.47
- Matrix Average: 8.81
- Total Judgments: 90

---

## Rankings

| Rank | Model | Provider | Avg Score | Judgments |
|------|-------|----------|-----------|----------|
| 1 | Grok 4.20 | openrouter | 9.47 | 9 |
| 2 | GPT-5.4 | openrouter | 9.34 | 9 |
| 3 | MiMo-V2-Flash | Xiaomi | 9.22 | 9 |
| 4 | MiniMax M2.5 | openrouter | 9.16 | 9 |
| 5 | DeepSeek V4 | openrouter | 9.03 | 9 |
| 6 | Gemini 3 Flash Preview | Google | 8.94 | 9 |
| 7 | GPT-OSS-120B | OpenAI | 8.48 | 9 |
| 8 | Gemini 3.1 Pro | openrouter | 8.31 | 9 |
| 9 | Claude Sonnet 4.6 | openrouter | 8.15 | 9 |
| 10 | Claude Opus 4.6 | openrouter | 8.04 | 9 |

---

## 10×10 Judgment Matrix

Rows = Judge, Columns = Respondent. Self-judgments excluded (—).

| Judge ↓ / Resp → | Gemini 3.1 Pro | Claude Opus | GPT-5.4 | DeepSeek V4 | MiMo-V2-Flash | Claude Sonnet | Grok 4.20 | GPT-OSS-120B | Gemini 3 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | — | 6.8 | 9.6 | 9.8 | 10.0 | 7.5 | 10.0 | 7.9 | 9.8 | 9.2 |
| Claude Opus | 8.3 | — | 9.2 | 9.0 | 9.2 | 7.7 | 10.0 | 8.7 | 9.2 | 9.2 |
| GPT-5.4 | 7.5 | 6.8 | — | 8.8 | 9.0 | 6.5 | 9.2 | 6.7 | 8.6 | 9.0 |
| DeepSeek V4 | 9.0 | 8.7 | 9.8 | — | 9.2 | 8.8 | 9.8 | 9.2 | 9.0 | 9.8 |
| MiMo-V2-Flash | 8.8 | 8.7 | 9.0 | 9.0 | — | 8.7 | 9.0 | 9.2 | 9.2 | 9.2 |
| Claude Sonnet | 8.3 | 9.2 | 9.8 | 8.8 | 9.0 | — | 9.8 | 8.8 | 8.8 | 8.8 |
| Grok 4.20 | 8.3 | 8.7 | 8.8 | 8.7 | 8.8 | 8.7 | — | 8.8 | 8.7 | 8.8 |
| GPT-OSS-120B | 7.3 | 6.8 | 8.8 | 8.6 | 8.8 | 7.7 | 8.7 | — | 8.4 | 8.7 |
| Gemini 3 | 9.4 | 9.4 | 9.8 | 9.8 | 9.8 | 9.8 | 9.8 | 9.4 | — | 9.8 |
| MiniMax M2.5 | 8.1 | 7.4 | 9.2 | 8.8 | 9.2 | 8.1 | 9.0 | 7.7 | 8.8 | — |

---

## Methodology

- **10×10 Blind Peer Matrix:** All models answer the same question, then all models judge all responses.
- **5 Criteria:** Correctness, completeness, clarity, depth, usefulness (each scored 1–10).
- **Self-judgments excluded:** Models do not judge their own responses.
- **Weighted Score:** Composite of all 5 criteria.
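The ranking computation described above can be sketched as follows: each model's average score is the mean of the scores it receives from all other judges, with the diagonal (self-judgments) excluded. The matrix below is illustrative toy data, not this report's actual results:

```python
# Per-respondent average from a judge x respondent score matrix,
# excluding self-judgments (the diagonal), as in the methodology above.
# The example matrix is illustrative, not the report's real data.

def avg_scores(matrix: dict[str, dict[str, float]]) -> dict[str, float]:
    models = list(matrix)
    averages = {}
    for resp in models:
        received = [matrix[judge][resp] for judge in models if judge != resp]
        averages[resp] = sum(received) / len(received)
    return averages

example = {
    "A": {"A": 0.0, "B": 9.0, "C": 8.0},
    "B": {"A": 7.0, "B": 0.0, "C": 9.0},
    "C": {"A": 8.0, "B": 8.0, "C": 0.0},
}
print(avg_scores(example))  # {'A': 7.5, 'B': 8.5, 'C': 8.5}
```

With 10 models and the diagonal excluded, each respondent receives 9 judgments, matching the "Judgments" column in the rankings table.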

---

## Citation

The Multivac (2026). Blind Peer Evaluation: ANALYSIS-012. app.themultivac.com

## License

Open data. Free to use, share, and build upon. Please cite The Multivac when using this data.

Download raw JSON: https://app.themultivac.com/api/evaluations/EVAL-20260402-193016/results
Full dataset: https://app.themultivac.com/dashboard/export
