# The Multivac — Evaluation Report

**Evaluation ID:** EVAL-20260402-175350
**Date:** Apr 02, 2026
**Category:** reasoning
**Question ID:** REASON-023

---

## Question

Achilles gives a tortoise a 100-meter head start. Achilles runs at 10 m/s, the tortoise at 1 m/s. Zeno argues Achilles can never catch the tortoise because he must first reach where the tortoise was, but by then the tortoise has moved. (1) Resolve the paradox using limits. (2) Resolve it without calculus — using only physical reasoning. (3) Is there a version of Zeno's paradox that modern physics cannot fully resolve? (Hint: consider Planck length.)

---

## Winner

**GPT-OSS-120B** (OpenAI)
- Winner Score: 9.38
- Matrix Average: 8.35 (mean of the ten per-model average scores)
- Total Judgments: 88

---

## Rankings

| Rank | Model | Provider | Avg Score | Judgments |
|------|-------|----------|-----------|----------|
| 1 | GPT-OSS-120B | OpenAI | 9.38 | 9 |
| 2 | Claude Sonnet 4.6 | openrouter | 9.29 | 9 |
| 3 | MiMo-V2-Flash | Xiaomi | 9.19 | 9 |
| 4 | Claude Opus 4.6 | openrouter | 9.10 | 9 |
| 5 | GPT-5.4 | openrouter | 8.99 | 9 |
| 6 | Gemini 2.5 Flash | openrouter | 8.90 | 9 |
| 7 | Grok 4.20 | openrouter | 8.89 | 9 |
| 8 | DeepSeek V4 | openrouter | 8.80 | 9 |
| 9 | Gemini 3.1 Pro | openrouter | 6.71 | 8 |
| 10 | MiniMax M2.5 | openrouter | 4.29 | 8 |

---

## 10×10 Judgment Matrix

Rows = Judge, Columns = Respondent. Self-judgments excluded (—). A dot (·) marks a missing judgment, which is why two respondents have 8 judgments instead of 9.

| Judge ↓ / Resp → | MiMo-V2-Flash | Gemini 3.1 Pro | DeepSeek V4 | Claude Opus | GPT-5.4 | Grok 4.20 | Claude Sonnet | GPT-OSS-120B | Gemini 2.5 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| MiMo-V2-Flash | — | 8.3 | 9.0 | 9.4 | 9.2 | 9.2 | 9.2 | 9.0 | 9.2 | 8.4 |
| Gemini 3.1 Pro | 10.0 | — | 9.7 | 8.5 | 10.0 | 9.2 | 10.0 | 10.0 | 9.7 | 1.6 |
| DeepSeek V4 | 9.3 | 8.1 | — | 9.7 | 9.3 | 8.7 | 9.7 | 9.3 | 9.3 | 7.9 |
| Claude Opus | 9.2 | 7.3 | 8.0 | — | 9.0 | 9.2 | 9.2 | 9.4 | 9.0 | 2.5 |
| GPT-5.4 | 8.6 | 3.6 | 8.8 | 8.8 | — | 8.8 | 8.6 | 9.0 | 8.4 | 1.9 |
| Grok 4.20 | 8.7 | 7.5 | 8.4 | 8.8 | 8.7 | — | 8.7 | 8.7 | 8.7 | 3.6 |
| Claude Sonnet | 9.0 | 7.8 | 8.6 | 9.4 | 8.8 | 9.0 | — | 9.4 | 8.8 | 3.1 |
| GPT-OSS-120B | 8.8 | 5.8 | 8.8 | 8.7 | 8.7 | 8.4 | 8.4 | — | 8.4 | · |
| Gemini 2.5 | 9.3 | · | 9.1 | 9.8 | 8.8 | 9.1 | 9.8 | 9.8 | — | 5.4 |
| MiniMax M2.5 | 10.0 | 5.3 | 8.8 | 8.8 | 8.4 | 8.4 | 10.0 | 9.8 | 8.7 | — |
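The figures in the Rankings section can be approximately reproduced from this matrix. The following is a minimal sketch, assuming that each model's average is the plain mean of its column with self-judgments and missing cells skipped, and that the matrix average is the mean of those per-model averages. The names `MODELS`, `MATRIX`, and `column_averages` are illustrative and not part of The Multivac's tooling. Because the cells are rounded to one decimal, recomputed averages can differ from the published ones in the second decimal place.

```python
from statistics import mean

# Model order matches the matrix header (columns) and the row order (judges).
MODELS = [
    "MiMo-V2-Flash", "Gemini 3.1 Pro", "DeepSeek V4", "Claude Opus", "GPT-5.4",
    "Grok 4.20", "Claude Sonnet", "GPT-OSS-120B", "Gemini 2.5", "MiniMax M2.5",
]

N = None  # stands for both "—" (self-judgment) and "·" (missing judgment)

# MATRIX[judge][respondent], transcribed from the table above.
MATRIX = [
    [N,    8.3, 9.0, 9.4, 9.2,  9.2, 9.2,  9.0,  9.2, 8.4],
    [10.0, N,   9.7, 8.5, 10.0, 9.2, 10.0, 10.0, 9.7, 1.6],
    [9.3,  8.1, N,   9.7, 9.3,  8.7, 9.7,  9.3,  9.3, 7.9],
    [9.2,  7.3, 8.0, N,   9.0,  9.2, 9.2,  9.4,  9.0, 2.5],
    [8.6,  3.6, 8.8, 8.8, N,    8.8, 8.6,  9.0,  8.4, 1.9],
    [8.7,  7.5, 8.4, 8.8, 8.7,  N,   8.7,  8.7,  8.7, 3.6],
    [9.0,  7.8, 8.6, 9.4, 8.8,  9.0, N,    9.4,  8.8, 3.1],
    [8.8,  5.8, 8.8, 8.7, 8.7,  8.4, 8.4,  N,    8.4, N],
    [9.3,  N,   9.1, 9.8, 8.8,  9.1, 9.8,  9.8,  N,   5.4],
    [10.0, 5.3, 8.8, 8.8, 8.4,  8.4, 10.0, 9.8,  8.7, N],
]


def column_averages(models, matrix):
    """Return {model: (average score received, number of judgments)}."""
    result = {}
    for col, model in enumerate(models):
        # Collect every score this respondent received, skipping empty cells.
        received = [row[col] for row in matrix if row[col] is not None]
        result[model] = (mean(received), len(received))
    return result


averages = column_averages(MODELS, MATRIX)
total_judgments = sum(count for _, count in averages.values())
winner = max(averages, key=lambda m: averages[m][0])
# Assumption: "Matrix Average" is the unweighted mean of per-model averages.
matrix_average = mean(avg for avg, _ in averages.values())

for model, (avg, count) in sorted(averages.items(), key=lambda kv: -kv[1][0]):
    print(f"{model:<16} {avg:5.2f}  ({count} judgments)")
```

Averaging over columns rather than rows is what makes this a score received by each respondent; row means would instead describe how strict each judge was.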

---

## Methodology

- **10×10 Blind Peer Matrix:** All ten models answer the same question, then each model judges the other nine responses.
- **5 Criteria:** Correctness, completeness, clarity, depth, usefulness (each scored 1–10).
- **Self-judgments excluded:** Models do not judge their own responses.
- **Weighted Score:** Composite of all 5 criteria.
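The report does not state how the five criteria are weighted. As a purely illustrative sketch, the snippet below shows one way a single judgment could be collapsed into a composite score. The equal default weights, the `composite_score` function, and the example scores are assumptions made for illustration, not The Multivac's actual formula or data.

```python
# The five criteria named in the methodology.
CRITERIA = ("correctness", "completeness", "clarity", "depth", "usefulness")


def composite_score(scores, weights=None):
    """Weighted mean of the five criterion scores (each expected in 1-10).

    ASSUMPTION: equal weights are used when none are given; the real
    weighting used by the evaluation is not documented in this report.
    """
    if weights is None:
        weights = {criterion: 1.0 for criterion in CRITERIA}
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criterion scores: {missing}")
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight


# Hypothetical judgment, not taken from the dataset.
example = {
    "correctness": 10,
    "completeness": 8,
    "clarity": 8,
    "depth": 8,
    "usefulness": 8,
}
print(round(composite_score(example), 2))
```

Dividing by the total weight keeps the composite on the same 1–10 scale as the individual criteria, whatever weights are supplied.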

---

## Citation

The Multivac (2026). Blind Peer Evaluation: REASON-023. app.themultivac.com

## License

Open data. Free to use, share, and build upon. Please cite The Multivac when using this data.

Download raw JSON: https://app.themultivac.com/api/evaluations/EVAL-20260402-175350/results
Full dataset: https://app.themultivac.com/dashboard/export
