# The Multivac — Evaluation Report

**Evaluation ID:** EVAL-20260402-232602
**Date:** Apr 02, 2026
**Category:** communication
**Question ID:** COMM-019

---

## Question

Your company's AI product generated offensive content that went viral. Write: (1) An immediate public statement (first 2 hours — acknowledge, no excuses), (2) A detailed follow-up 24 hours later (root cause, what you're doing about it), (3) An internal all-hands message to employees who are demoralized. Each must be genuine, take responsibility, and not use passive voice or the phrase 'we take this seriously' (which everyone uses and nobody believes).

---

## Winner

**GPT-5.4** (openrouter)
- Winner Score: 9.38
- Matrix Average: 8.80
- Total Judgments (full matrix, 9 per model): 90

---

## Rankings

| Rank | Model | Provider | Avg Score | Judgments |
|------|-------|----------|-----------|----------|
| 1 | GPT-5.4 | openrouter | 9.38 | 9 |
| 2 | Grok 4.20 | openrouter | 9.33 | 9 |
| 3 | Mistral Small Creative | Mistral | 9.21 | 9 |
| 4 | MiMo-V2-Flash | Xiaomi | 9.17 | 9 |
| 5 | Claude Sonnet 4.6 | openrouter | 9.10 | 9 |
| 6 | Seed 1.6 Flash | openrouter | 8.98 | 9 |
| 7 | Claude Opus 4.6 | openrouter | 8.98 | 9 |
| 8 | GPT-OSS-120B | OpenAI | 8.86 | 9 |
| 9 | DeepSeek V4 | openrouter | 8.79 | 9 |
| 10 | Gemini 3.1 Pro | openrouter | 6.22 | 9 |

---

## 10×10 Judgment Matrix

Rows = Judge, Columns = Respondent. Self-judgments excluded (—).

| Judge ↓ / Resp → | MiMo-V2-Flash | Claude Opus | GPT-5.4 | Claude Sonnet | Grok 4.20 | Gemini 3.1 Pro | DeepSeek V4 | GPT-OSS-120B | Mistral Small | Seed 1.6 Flash |
|---|---|---|---|---|---|---|---|---|---|---|
| MiMo-V2-Flash | — | 9.2 | 9.6 | 9.6 | 9.6 | 6.0 | 9.0 | 8.8 | 9.6 | 9.6 |
| Claude Opus | 9.6 | — | 10.0 | 9.3 | 9.6 | 4.3 | 8.6 | 9.2 | 9.6 | 9.2 |
| GPT-5.4 | 8.8 | 7.2 | — | 8.4 | 9.2 | 4.7 | 8.6 | 8.8 | 9.0 | 8.2 |
| Claude Sonnet | 9.6 | 9.6 | 9.6 | — | 9.6 | 5.3 | 8.8 | 8.8 | 9.3 | 8.8 |
| Grok 4.20 | 8.8 | 8.8 | 8.8 | 8.8 | — | 6.0 | 8.4 | 8.8 | 9.0 | 8.8 |
| Gemini 3.1 Pro | 9.1 | 8.1 | 9.1 | 8.6 | 9.2 | — | 9.1 | 7.0 | 8.8 | 7.9 |
| DeepSeek V4 | 9.8 | 10.0 | 9.8 | 9.8 | 9.3 | 8.3 | — | 9.8 | 10.0 | 9.8 |
| GPT-OSS-120B | 8.3 | 9.2 | 8.8 | 8.8 | 8.8 | 4.9 | 8.3 | — | 8.8 | 8.8 |
| Mistral Small | 9.8 | 10.0 | 10.0 | 10.0 | 10.0 | 8.7 | 9.6 | 9.8 | — | 9.8 |
| Seed 1.6 Flash | 8.8 | 8.8 | 8.8 | 8.6 | 8.8 | 7.8 | 8.8 | 8.8 | 8.8 | — |
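Each model's average in the Rankings table can be reproduced (to rounding error) from its column in the matrix above. A minimal sketch using the GPT-5.4 column as printed, noting that the displayed cells are rounded to one decimal, so the recomputed mean may differ from the published 9.38 by a small amount:

```python
# Scores given to GPT-5.4 by the nine other judges, read top to
# bottom from the GPT-5.4 column above (self-judgment excluded).
gpt54_column = [9.6, 10.0, 9.6, 8.8, 9.1, 9.8, 8.8, 10.0, 8.8]

# Column mean = the model's average score across its nine judges.
avg = sum(gpt54_column) / len(gpt54_column)
print(round(avg, 2))  # close to the published 9.38; the one-decimal
                      # rounding of the matrix cells explains any gap
```

The same computation over every column yields the full ranking order shown above.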

---

## Methodology

- **10×10 Blind Peer Matrix:** All models answer the same question, then all models judge all responses.
- **5 Criteria:** Correctness, completeness, clarity, depth, usefulness (each scored 1–10).
- **Self-judgments excluded:** Models do not judge their own responses.
- **Weighted Score:** Composite of all 5 criteria.
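The report does not publish the per-criterion weights, so as an illustration the composite can be sketched under the assumption of an unweighted mean of the five criteria (the criterion names come from the list above; the scores here are hypothetical):

```python
# Hypothetical single judgment: one judge's 1-10 scores for one
# response on the five published criteria. Equal weighting is an
# assumption, not a documented detail of The Multivac's scoring.
criteria = {
    "correctness": 9,
    "completeness": 10,
    "clarity": 9,
    "depth": 8,
    "usefulness": 9,
}

# Composite = mean of the five criterion scores under equal weights.
composite = sum(criteria.values()) / len(criteria)
print(composite)  # 9.0
```

Each matrix cell above would then be one such composite, and a model's ranking score is the mean of its nine cells.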

---

## Citation

The Multivac (2026). Blind Peer Evaluation: COMM-019. app.themultivac.com

## License

Open data. Free to use, share, and build upon. Please cite The Multivac when using this data.

Download raw JSON: https://app.themultivac.com/api/evaluations/EVAL-20260402-232602/results
Full dataset: https://app.themultivac.com/dashboard/export
