# The Multivac — Evaluation Report

**Evaluation ID:** EVAL-20260402-235645
**Date:** Apr 02, 2026
**Category:** communication
**Question ID:** COMM-026

---

## Question

Your cloud service had a 6-hour outage affecting 10,000 customers. Write a customer-facing FAQ that covers: (1) What happened (plain English, no blame-shifting), (2) What data was affected, (3) What you're doing to prevent recurrence, (4) What customers should do right now, (5) How to get support, (6) Whether there will be service credits. Anticipate the angry questions and address them proactively.

---

## Winner

**GPT-OSS-120B** (OpenAI)
- Winner Score: 9.43
- Matrix Average: 9.07
- Total Judgments (full matrix): 90

---

## Rankings

| Rank | Model | Provider | Avg Score | Judgments |
|------|-------|----------|-----------|----------|
| 1 | GPT-OSS-120B | OpenAI | 9.43 | 9 |
| 2 | GPT-5.4 | OpenRouter | 9.28 | 9 |
| 3 | MiMo-V2-Flash | Xiaomi | 9.22 | 9 |
| 4 | Grok 4.20 | OpenRouter | 9.21 | 9 |
| 5 | Mistral Small Creative | Mistral | 9.21 | 9 |
| 6 | Seed 1.6 Flash | OpenRouter | 9.20 | 9 |
| 7 | Claude Opus 4.6 | OpenRouter | 9.07 | 9 |
| 8 | Claude Sonnet 4.6 | OpenRouter | 9.06 | 9 |
| 9 | DeepSeek V4 | OpenRouter | 8.94 | 9 |
| 10 | Gemini 3.1 Pro | OpenRouter | 8.06 | 9 |

---

## 10×10 Judgment Matrix

Rows = Judge, Columns = Respondent. Self-judgments are excluded (marked —).

| Judge ↓ / Resp → | Claude Opus | GPT-5.4 | Claude Sonnet | Mistral Small | Gemini 3.1 Pro | Grok 4.20 | DeepSeek V4 | GPT-OSS-120B | MiMo-V2-Flash | Seed 1.6 Flash |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus | — | 9.6 | 9.6 | 9.6 | 8.2 | 9.0 | 9.0 | 9.6 | 9.3 | 9.6 |
| GPT-5.4 | 8.3 | — | 7.3 | 8.9 | 5.5 | 8.6 | 7.4 | 8.8 | 8.0 | 8.2 |
| Claude Sonnet | 9.6 | 9.6 | — | 9.3 | 8.2 | 8.8 | 8.6 | 9.6 | 9.3 | 8.6 |
| Mistral Small | 9.8 | 9.8 | 10.0 | — | 9.6 | 9.8 | 9.8 | 9.8 | 9.8 | 9.8 |
| Gemini 3.1 Pro | 8.8 | 9.4 | 8.8 | 10.0 | — | 9.8 | 8.8 | 10.0 | 9.3 | 9.8 |
| Grok 4.20 | 8.8 | 8.8 | 9.2 | 9.2 | 7.8 | — | 8.8 | 9.2 | 9.2 | 8.8 |
| DeepSeek V4 | 9.2 | 9.8 | 9.6 | 9.8 | 9.0 | 9.8 | — | 9.4 | 9.8 | 9.8 |
| GPT-OSS-120B | 8.8 | 8.4 | 8.8 | 8.8 | 7.7 | 8.8 | 9.0 | — | 9.0 | 8.8 |
| MiMo-V2-Flash | 9.6 | 9.4 | 9.6 | 9.0 | 8.6 | 9.6 | 9.6 | 9.6 | — | 9.6 |
| Seed 1.6 Flash | 8.8 | 8.8 | 8.8 | 8.2 | 8.0 | 8.8 | 9.6 | 9.0 | 9.2 | — |

---

## Methodology

- **10×10 Blind Peer Matrix:** All models answer the same question, then each model judges every other model's response.
- **5 Criteria:** Correctness, completeness, clarity, depth, usefulness (each scored 1–10).
- **Self-judgments excluded:** Models do not judge their own responses.
- **Weighted Score:** Each judgment is a weighted composite of the five criterion scores; a model's matrix average is the mean of the composites it received.
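The aggregation described above can be sketched in a few lines. This is an illustrative reconstruction, not the official Multivac pipeline: each matrix cell is treated as one judge's composite score for a respondent, self-judgments are skipped, and a respondent's matrix average is the mean of its column. The function name and the toy scores are hypothetical.

```python
def matrix_averages(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """scores[judge][respondent] -> composite score (1-10).

    Returns the mean score each respondent received from the other
    judges, excluding any self-judgment (judge == respondent).
    """
    respondents = {r for row in scores.values() for r in row}
    averages = {}
    for resp in respondents:
        received = [row[resp] for judge, row in scores.items()
                    if judge != resp and resp in row]
        averages[resp] = sum(received) / len(received)
    return averages


# Toy 3-model example (hypothetical scores, not taken from this report):
toy = {
    "A": {"B": 9.0, "C": 8.0},
    "B": {"A": 8.5, "C": 9.5},
    "C": {"A": 9.5, "B": 9.0},
}
print(matrix_averages(toy))  # A receives (8.5 + 9.5) / 2 = 9.0, etc.
```

Applying the same column-mean to the 10×10 matrix above reproduces the ranking-table averages to within rounding.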

---

## Citation

The Multivac (2026). Blind Peer Evaluation: COMM-026. app.themultivac.com

## License

Open data. Free to use, share, and build upon. Please cite The Multivac when using this data.

Download raw JSON: https://app.themultivac.com/api/evaluations/EVAL-20260402-235645/results
Full dataset: https://app.themultivac.com/dashboard/export
