Don't Trust Us. Verify.
The Multivac is only valuable if you trust the data. That's why everything is open — the methodology, the raw data, and the ability to run your own evaluations. Contribute questions, vote on results, and hold us accountable.
Submit Evaluation Questions
The evaluation is only as good as the questions. Propose questions that you think would reveal meaningful differences between frontier models. The best questions are specific, testable, and aimed at a particular capability.
Open an issue on the evaluation repo with the tag question-proposal. Include the question text, target category, and why you think it's revealing.
Submit via form (coming soon). Questions will be reviewed and queued for evaluation. Top-voted questions get priority.
Human Validation Layer (coming soon)
AI models judging AI models has a circularity problem. That's why we're building a human validation layer. Vote on whether you agree with the peer evaluation rankings. Your votes create a human baseline that we publish alongside the AI rankings.
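To make "human baseline" concrete, here is a minimal sketch of one way agreement could be tallied: treat each vote as agree/disagree with a pairwise AI ranking and report the agreement rate per pair. This is illustrative only; the vote format, pair names, and aggregation are assumptions, not our actual pipeline.

```python
# Illustrative only: one simple way a human baseline could be derived from
# votes on pairwise AI rankings. Vote data and model names are placeholders.
from collections import defaultdict

# Each vote: (model pair as ranked by the AI judges, True if the human agrees)
votes = [
    (("model-a", "model-b"), True),
    (("model-a", "model-b"), False),
    (("model-a", "model-b"), True),
    (("model-b", "model-c"), True),
]

agree: dict = defaultdict(int)
total: dict = defaultdict(int)
for pair, agreed in votes:
    total[pair] += 1
    agree[pair] += int(agreed)

for pair, n in total.items():
    rate = agree[pair] / n
    print(f"{pair[0]} > {pair[1]}: {rate:.0%} human agreement over {n} votes")
```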
Reproducible Evaluations
Clone the evaluation harness. Pick a question. Run it against any models you have API access to. Compare your results with ours. If they diverge significantly, that's a finding worth publishing.
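As a rough illustration of the reproduce-and-compare loop, here is a minimal Python sketch. Everything in it is a placeholder: `run_model`, `score_answer`, the toy question, the inline "published" scores, and the divergence threshold are assumptions for illustration, not the harness's real API or data.

```python
# Minimal sketch of the reproduce-and-compare loop. Swap run_model() for your
# own API client, load the real question and published scores from the repo,
# and choose your own divergence threshold.

def run_model(model: str, prompt: str) -> str:
    # Placeholder: replace with a call to whatever model API you have access to.
    return "42" if model == "model-a" else "43"

def score_answer(answer: str, reference: str) -> float:
    # Placeholder: replace with the rubric-based scorer shipped with the question.
    return 1.0 if answer.strip() == reference else 0.0

# Toy stand-ins for a question file and our published results.
question = {"id": "q042", "prompt": "What is 6 * 7?", "reference": "42"}
published = {"q042": {"model-a": 1.0, "model-b": 0.0}}

for model in ("model-a", "model-b"):
    answer = run_model(model, question["prompt"])
    score = score_answer(answer, question["reference"])
    delta = abs(score - published[question["id"]][model])
    status = "diverges" if delta > 0.1 else "matches"
    print(f"{model}: local={score:.2f} "
          f"published={published[question['id']][model]:.2f} -> {status}")
```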