# The Multivac — Evaluation Report

**Evaluation ID:** EVAL-20260402-232850
**Date:** Apr 02, 2026
**Category:** communication
**Question ID:** COMM-020

---

## Question

Write day-one onboarding documentation for a new engineer joining your team. Include: (1) How to set up the dev environment (step by step, assume macOS), (2) Architecture overview (what talks to what and why), (3) Deployment process (how code gets to production), (4) Where to ask for help (and what NOT to do), (5) First task assignment. Make it warm, practical, and impossible to get stuck on.

---

## Winner

**Grok 4.20** (openrouter)
- Winner Score: 9.19
- Matrix Average: 8.29
- Total Judgments: 90

---

## Rankings

| Rank | Model | Provider | Avg Score | Judgments |
|------|-------|----------|-----------|----------|
| 1 | Grok 4.20 | openrouter | 9.19 | 9 |
| 2 | Mistral Small Creative | Mistral | 8.80 | 9 |
| 3 | GPT-OSS-120B | OpenAI | 8.57 | 9 |
| 4 | GPT-5.4 | openrouter | 8.38 | 9 |
| 5 | DeepSeek V4 | openrouter | 8.27 | 9 |
| 6 | MiMo-V2-Flash | Xiaomi | 8.23 | 9 |
| 7 | Seed 1.6 Flash | openrouter | 8.11 | 9 |
| 8 | Claude Sonnet 4.6 | openrouter | 7.87 | 9 |
| 9 | Claude Opus 4.6 | openrouter | 7.76 | 9 |
| 10 | Gemini 3.1 Pro | openrouter | 7.71 | 9 |

---

## 10×10 Judgment Matrix

Rows = Judge, Columns = Respondent. Self-judgments excluded (—).

| Judge ↓ / Resp → | MiMo-V2-Flash | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.4 | Claude Sonnet 4.6 | Grok 4.20 | DeepSeek V4 | GPT-OSS-120B | Mistral Small Creative | Seed 1.6 Flash |
|---|---|---|---|---|---|---|---|---|---|---|
| MiMo-V2-Flash | — | 8.6 | 8.8 | 9.2 | 8.8 | 9.2 | 8.6 | 9.0 | 9.3 | 9.2 |
| Gemini 3.1 Pro | 7.3 | — | 6.2 | 6.0 | 6.6 | 9.8 | 9.0 | 7.3 | 8.8 | 7.3 |
| Claude Opus 4.6 | 8.3 | 7.3 | — | 8.8 | 8.0 | 9.2 | 8.0 | 8.8 | 8.6 | 7.5 |
| GPT-5.4 | 6.6 | 5.7 | 5.3 | — | 5.0 | 8.8 | 6.8 | 6.8 | 7.7 | 6.0 |
| Claude Sonnet 4.6 | 8.2 | 7.6 | 8.6 | 8.6 | — | 9.2 | 8.3 | 8.6 | 8.9 | 8.0 |
| Grok 4.20 | 8.8 | 8.4 | 9.0 | 9.0 | 9.0 | — | 8.6 | 8.8 | 9.0 | 8.8 |
| DeepSeek V4 | 8.8 | 9.2 | 9.6 | 9.8 | 9.8 | 9.2 | — | 9.6 | 9.2 | 9.0 |
| GPT-OSS-120B | 7.6 | 5.7 | 7.0 | 5.3 | 6.5 | 8.8 | 8.0 | — | 9.0 | 7.6 |
| Mistral Small Creative | 9.8 | 9.2 | 9.8 | 9.8 | 9.6 | 9.8 | 9.2 | 9.8 | — | 9.6 |
| Seed 1.6 Flash | 8.7 | 7.8 | 5.6 | 9.0 | 7.5 | 9.0 | 8.0 | 8.6 | 8.8 | — |

---

## Methodology

- **10×10 Blind Peer Matrix:** All models answer the same question, then all models judge all responses.
- **5 Criteria:** Correctness, completeness, clarity, depth, usefulness (each scored 1–10).
- **Self-judgments excluded:** Models do not judge their own responses.
- **Weighted Score:** Composite of all 5 criteria.
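
The aggregation above can be sanity-checked directly from the judgment matrix. The sketch below recomputes each respondent's average as the simple mean of its column, skipping self-judgments. Note this is an assumption about the aggregation: the published averages are weighted composites over the five criteria, so small rounding differences from these simple column means are expected.

```python
# Rows = judge, columns = respondent, matching the 10x10 matrix above.
# None marks the excluded self-judgment on the diagonal.
models = [
    "MiMo-V2-Flash", "Gemini 3.1 Pro", "Claude Opus 4.6", "GPT-5.4",
    "Claude Sonnet 4.6", "Grok 4.20", "DeepSeek V4", "GPT-OSS-120B",
    "Mistral Small Creative", "Seed 1.6 Flash",
]

matrix = [
    [None, 8.6, 8.8, 9.2, 8.8, 9.2, 8.6, 9.0, 9.3, 9.2],
    [7.3, None, 6.2, 6.0, 6.6, 9.8, 9.0, 7.3, 8.8, 7.3],
    [8.3, 7.3, None, 8.8, 8.0, 9.2, 8.0, 8.8, 8.6, 7.5],
    [6.6, 5.7, 5.3, None, 5.0, 8.8, 6.8, 6.8, 7.7, 6.0],
    [8.2, 7.6, 8.6, 8.6, None, 9.2, 8.3, 8.6, 8.9, 8.0],
    [8.8, 8.4, 9.0, 9.0, 9.0, None, 8.6, 8.8, 9.0, 8.8],
    [8.8, 9.2, 9.6, 9.8, 9.8, 9.2, None, 9.6, 9.2, 9.0],
    [7.6, 5.7, 7.0, 5.3, 6.5, 8.8, 8.0, None, 9.0, 7.6],
    [9.8, 9.2, 9.8, 9.8, 9.6, 9.8, 9.2, 9.8, None, 9.6],
    [8.7, 7.8, 5.6, 9.0, 7.5, 9.0, 8.0, 8.6, 8.8, None],
]

def column_averages(matrix, models):
    """Mean score each respondent received, excluding self-judgments."""
    avgs = {}
    for col, model in enumerate(models):
        scores = [row[col] for row in matrix if row[col] is not None]
        avgs[model] = sum(scores) / len(scores)
    return avgs

averages = column_averages(matrix, models)
winner = max(averages, key=averages.get)
print(winner, round(averages[winner], 2))
```

Running this reproduces the ranking's winner (Grok 4.20), with a simple column mean a few hundredths above the published weighted score of 9.19.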

---

## Citation

The Multivac (2026). Blind Peer Evaluation: COMM-020. app.themultivac.com

## License

Open data. Free to use, share, and build upon. Please cite The Multivac when using this data.

Download raw JSON: https://app.themultivac.com/api/evaluations/EVAL-20260402-232850/results
Full dataset: https://app.themultivac.com/dashboard/export
