Open & Community-Driven

Don't Trust Us. Verify.

The Multivac is only valuable if you trust the data. That's why everything is open — the methodology, the raw data, and the ability to run your own evaluations. Contribute questions, vote on results, and hold us accountable.

Already Open
Raw Data (CSV)
LIVE
Download every evaluation, judgment, and leaderboard ranking. 5,391 rows of peer judgment data.
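As a minimal sketch of working with the export, here is how you might aggregate per-model averages from the CSV. The column names (`judge`, `subject`, `score`) and sample rows are assumptions for illustration; check the actual export for its schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows -- the real export's columns may differ.
SAMPLE = """judge,subject,score
model-x,model-y,8
model-x,model-z,7
model-y,model-x,9
"""

# Collect every score each model received as a judged subject.
received = defaultdict(list)
for row in csv.DictReader(io.StringIO(SAMPLE)):
    received[row["subject"]].append(int(row["score"]))

# Average the scores per model.
averages = {model: sum(s) / len(s) for model, s in received.items()}
print(averages)  # {'model-y': 8.0, 'model-z': 7.0, 'model-x': 9.0}
```

The same loop works on the full download by swapping the `io.StringIO` for an open file handle.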
Routing API
LIVE
Free, no auth. GET /api/route-rec returns ranked models with confidence scores.
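A sketch of consuming the endpoint's output. The JSON shape below (a `ranked` list of objects with `model` and `confidence` fields) is an assumption, not the documented schema; hit the live endpoint to confirm the actual field names.

```python
import json

# Hypothetical response body from GET /api/route-rec -- field names
# are assumptions for illustration only.
body = """
{"ranked": [
  {"model": "model-a", "confidence": 0.91},
  {"model": "model-b", "confidence": 0.74}
]}
"""

response = json.loads(body)

# Pick the recommendation with the highest confidence score.
best = max(response["ranked"], key=lambda m: m["confidence"])
print(best["model"])  # model-a
```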
Methodology
LIVE
10×10 blind peer matrix. 5 criteria. Self-judgments excluded. Full transparency.
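The self-judgment exclusion can be sketched in a few lines. This uses an invented 3×3 matrix rather than the real 10×10 one, and made-up scores, purely to show the diagonal being skipped when each model's peer average is computed.

```python
# Illustrative peer-judgment matrix: scores[i][j] is model i's score
# for model j's answer. Values are invented; the real matrix is 10x10.
models = ["a", "b", "c"]
scores = [
    [10, 7, 6],
    [9, 10, 5],
    [8, 6, 10],
]

# Average each column while skipping the diagonal (self-judgments).
peer_avg = {}
for j, model in enumerate(models):
    col = [scores[i][j] for i in range(len(models)) if i != j]
    peer_avg[model] = sum(col) / len(col)

print(peer_avg)  # {'a': 8.5, 'b': 6.5, 'c': 5.5}
```

Note how each model's inflated self-score (the 10s on the diagonal) never enters its own average.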
GitHub Repo
PUBLIC
Evaluation harness, model configs, question sets. Fork it, run it yourself, submit PRs.
Contribute

Submit Evaluation Questions

The evaluation is only as good as the questions. Propose questions that you think would reveal meaningful differences between frontier models. The best questions are specific, testable, and target a particular capability.

Option 1: GitHub Issues

Open an issue on the evaluation repo with the tag question-proposal. Include the question text, target category, and why you think it's revealing.

Open issue →
Option 2: Community Form

Submit via a web form. Questions will be reviewed and queued for evaluation. Top-voted questions get priority.

Coming soon
Vote & Validate

Human Validation Layer

AI models judging AI models has a circularity problem. That's why we're building a human validation layer. Vote on whether you agree with the peer evaluation rankings. Your votes create a human baseline that we publish alongside the AI rankings.

Agree / Disagree
BUILDING
For each evaluation, vote whether the AI peer ranking matches your human judgment.
Blind Response Ranking
PLANNED
Read anonymized responses and rank them yourself. We compare your ranking to the AI consensus.
Correlation Dashboard
PLANNED
How well does AI peer evaluation correlate with human judgment? Public, updated live.
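One standard way to measure that correlation is Spearman's rank coefficient over matched rankings. Whether the dashboard will use Spearman specifically is an assumption; the rankings below are invented for illustration.

```python
# AI peer ranking vs. human ranking of the same four models
# (1 = best). Values are invented for illustration.
ai_rank    = {"a": 1, "b": 2, "c": 3, "d": 4}
human_rank = {"a": 2, "b": 1, "c": 3, "d": 4}

# Spearman's rho for tie-free rankings: 1 - 6*sum(d^2) / (n*(n^2-1)).
n = len(ai_rank)
d2 = sum((ai_rank[m] - human_rank[m]) ** 2 for m in ai_rank)
rho = 1 - (6 * d2) / (n * (n ** 2 - 1))
print(rho)  # 0.8
```

A rho near 1 would mean the AI peer rankings track human judgment closely; a rho near 0 would be the finding worth publishing.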
Run Your Own

Reproducible Evaluations

Clone the evaluation harness. Pick a question. Run it against any models you have API access to. Compare your results with ours. If they diverge significantly, that's a finding worth publishing.

# clone the evaluation harness
$ git clone https://github.com/themultivac/multivac-evaluation.git
$ cd multivac-evaluation
$ pip install -r requirements.txt
# run a single evaluation
$ python multivac.py --question "Your question here" --category code
View evaluation harness on GitHub →
Trust Principles
No Affiliations
We don't work for any AI lab. We pay for API access like everyone else.
Full Transparency
Exact prompts, scoring criteria, and raw outputs published. Every judgment downloadable as CSV.
Reproducibility
Anyone can verify results using the same prompts. The evaluation harness is open source.
Fresh Questions
New questions daily that can't be pre-memorized or gamed. No recycled benchmarks.
Human Accountability
Community votes create a human baseline. If AI rankings diverge from human judgment, we publish that too.
The Multivac is only as trustworthy as the community that verifies it.