# The Multivac — Evaluation Report

**Evaluation ID:** EVAL-20260402-201915
**Date:** Apr 02, 2026
**Category:** analysis
**Question ID:** ANALYSIS-020

---

## Question

Estimate the total energy cost and carbon footprint of training a frontier AI model (like GPT-5). Include: GPU hours, electricity cost, cooling overhead, water usage, and embodied carbon of hardware. (1) Compare to: one year of Netflix streaming for all users, one transatlantic flight, and one Bitcoin transaction. (2) Inference costs are growing faster than training costs. Why? (3) What changes would reduce AI's environmental impact by 10x?

---

## Winner

**Grok 4.20** (openrouter)
- Winner Score: 8.62
- Matrix Average: 6.99
- Total Judgments: 89

---

## Rankings

| Rank | Model | Provider | Avg Score | Judgments |
|------|-------|----------|-----------|----------|
| 1 | Grok 4.20 | openrouter | 8.62 | 9 |
| 2 | Gemini 3 Flash Preview | Google | 8.24 | 9 |
| 3 | MiMo-V2-Flash | Xiaomi | 8.02 | 9 |
| 4 | MiniMax M2.5 | openrouter | 7.91 | 9 |
| 5 | GPT-5.4 | openrouter | 7.58 | 9 |
| 6 | Claude Opus 4.6 | openrouter | 7.21 | 8 |
| 7 | Claude Sonnet 4.6 | openrouter | 6.93 | 9 |
| 8 | DeepSeek V4 | openrouter | 6.43 | 9 |
| 9 | GPT-OSS-120B | OpenAI | 6.03 | 9 |
| 10 | Gemini 3.1 Pro | openrouter | 2.93 | 9 |

---

## 10×10 Judgment Matrix

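Rows = Judge, Columns = Respondent. Self-judgments are excluded (—); a dot (·) marks a missing judgment, which is why Claude Opus 4.6 has 8 judgments in the Rankings table rather than 9.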

| Judge ↓ / Resp → | Gemini 3.1 Pro | Claude Opus | GPT-5.4 | DeepSeek V4 | MiMo-V2-Flash | Claude Sonnet | GPT-OSS-120B | Grok 4.20 | Gemini 3 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | — | · | 5.8 | 5.0 | 8.7 | 6.3 | 2.6 | 8.9 | 9.7 | 7.9 |
| Claude Opus | 1.6 | — | 7.6 | 5.7 | 7.5 | 6.8 | 4.8 | 8.2 | 7.7 | 7.9 |
| GPT-5.4 | 0.7 | 5.0 | — | 5.3 | 7.0 | 4.5 | 3.5 | 8.2 | 7.5 | 6.5 |
| DeepSeek V4 | 6.2 | 8.3 | 8.7 | — | 9.0 | 8.4 | 8.4 | 9.0 | 9.0 | 8.8 |
| MiMo-V2-Flash | 5.8 | 7.4 | 8.2 | 8.0 | — | 7.5 | 8.6 | 8.6 | 8.4 | 8.6 |
| Claude Sonnet | 3.3 | 8.0 | 8.4 | 7.2 | 8.0 | — | 6.3 | 8.6 | 8.0 | 8.2 |
| GPT-OSS-120B | 1.4 | 5.9 | 5.9 | 6.7 | 8.0 | 5.9 | — | 8.2 | 7.5 | 7.5 |
| Grok 4.20 | 3.6 | 6.8 | 7.8 | 4.5 | 7.0 | 7.8 | 6.0 | — | 7.6 | 6.8 |
| Gemini 3 | 2.0 | 9.0 | 9.2 | 8.3 | 9.0 | 8.3 | 7.3 | 9.6 | — | 9.0 |
| MiniMax M2.5 | 1.6 | 7.3 | 6.6 | 7.3 | 8.2 | 6.8 | 6.7 | 8.6 | 8.8 | — |
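
The per-respondent averages can be recomputed from the matrix by averaging each column over the cells that hold a score. The following is a minimal sketch, not the evaluation pipeline itself: the names `MODELS`, `MATRIX`, and `column_stats` are illustrative, and it assumes a plain unweighted mean per column, with `None` standing in for both the excluded diagonal and the single `·` cell.

```python
# Scores transcribed from the matrix above (rows = judges, columns = respondents).
# None marks a cell with no score: the diagonal and the one "·" cell.
MODELS = [
    "Gemini 3.1 Pro", "Claude Opus", "GPT-5.4", "DeepSeek V4", "MiMo-V2-Flash",
    "Claude Sonnet", "GPT-OSS-120B", "Grok 4.20", "Gemini 3", "MiniMax M2.5",
]
MATRIX = [
    [None, None, 5.8, 5.0, 8.7, 6.3, 2.6, 8.9, 9.7, 7.9],
    [1.6, None, 7.6, 5.7, 7.5, 6.8, 4.8, 8.2, 7.7, 7.9],
    [0.7, 5.0, None, 5.3, 7.0, 4.5, 3.5, 8.2, 7.5, 6.5],
    [6.2, 8.3, 8.7, None, 9.0, 8.4, 8.4, 9.0, 9.0, 8.8],
    [5.8, 7.4, 8.2, 8.0, None, 7.5, 8.6, 8.6, 8.4, 8.6],
    [3.3, 8.0, 8.4, 7.2, 8.0, None, 6.3, 8.6, 8.0, 8.2],
    [1.4, 5.9, 5.9, 6.7, 8.0, 5.9, None, 8.2, 7.5, 7.5],
    [3.6, 6.8, 7.8, 4.5, 7.0, 7.8, 6.0, None, 7.6, 6.8],
    [2.0, 9.0, 9.2, 8.3, 9.0, 8.3, 7.3, 9.6, None, 9.0],
    [1.6, 7.3, 6.6, 7.3, 8.2, 6.8, 6.7, 8.6, 8.8, None],
]


def column_stats(matrix):
    """Return a (mean, count) pair per respondent column, skipping None cells."""
    stats = []
    for col in range(len(matrix[0])):
        scores = [row[col] for row in matrix if row[col] is not None]
        stats.append((sum(scores) / len(scores), len(scores)))
    return stats


stats = column_stats(MATRIX)
total_judgments = sum(count for _, count in stats)
# Overall mean across every scored cell, weighted by each column's cell count.
matrix_average = sum(mean * count for mean, count in stats) / total_judgments

# Rank respondents by their column mean, highest first.
ranking = sorted(zip(MODELS, stats), key=lambda item: item[1][0], reverse=True)
for rank, (name, (mean, count)) in enumerate(ranking, start=1):
    print(f"{rank:>2}  {name:<16} {mean:.2f}  ({count} judgments)")
print(f"Total judgments: {total_judgments}, matrix average: {matrix_average:.2f}")
```

Run on the values above, this gives the same ordering, judgment counts, and matrix average as the report. Individual column means can differ from the Rankings table by a few hundredths, presumably because the matrix cells are shown rounded to one decimal place.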

---

## Methodology

- **10×10 Blind Peer Matrix:** All models answer the same question, then every model judges every other model's response.
- **5 Criteria:** Correctness, completeness, clarity, depth, usefulness (each scored 1–10).
- **Self-judgments excluded:** Models do not judge their own responses.
- **Weighted Score:** Composite of all 5 criteria.
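
The report does not state how the five criteria are weighted, so the following is only a sketch of one way a composite could be formed. The function name `composite_score`, the equal default weights, and the example scores are all assumptions made for illustration, not values taken from this evaluation.

```python
# The five criteria named in the methodology.
CRITERIA = ("correctness", "completeness", "clarity", "depth", "usefulness")


def composite_score(scores, weights=None):
    """Weighted mean of the five criterion scores (each expected in 1..10).

    ASSUMPTION: equal weights unless `weights` is given; the report does not
    publish the actual weighting.
    """
    if weights is None:
        weights = {name: 1.0 for name in CRITERIA}
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for name in CRITERIA:
        if not 1 <= scores[name] <= 10:
            raise ValueError(f"{name} score {scores[name]} is outside 1..10")
    total_weight = sum(weights[name] for name in CRITERIA)
    return sum(scores[name] * weights[name] for name in CRITERIA) / total_weight


# Hypothetical scores from one judge for one response.
example = {"correctness": 9, "completeness": 8, "clarity": 8, "depth": 7, "usefulness": 9}
print(round(composite_score(example), 2))
```

Passing a different `weights` mapping (for example, doubling `correctness`) changes the composite without touching the per-criterion scores, which is the main reason to keep the weighting separate from the scoring.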

---

## Citation

The Multivac (2026). Blind Peer Evaluation: ANALYSIS-020. app.themultivac.com

## License

Open data. Free to use, share, and build upon. Please cite The Multivac when using this data.

Download raw JSON: https://app.themultivac.com/api/evaluations/EVAL-20260402-201915/results
Full dataset: https://app.themultivac.com/dashboard/export
