Don't Trust Us. Verify.
The Multivac is only valuable if you trust the data. That's why everything is open — the methodology, the raw data, and the ability to run your own evaluations. Contribute questions, vote on results, and hold us accountable.
Submit Evaluation Questions
The evaluation is only as good as the questions. Propose questions that you think would reveal meaningful differences between frontier models. The best questions are specific, testable, and aimed at a particular capability.
Open an issue on the evaluation repo with the tag question-proposal. Include the question text, target category, and why you think it's revealing.
Submit via form (coming soon). Questions will be reviewed and queued for evaluation. Top-voted questions get priority.
Human Validation Layer (coming soon)
AI models judging AI models has a circularity problem. That's why we're building a human validation layer. Vote on whether you agree with the peer evaluation rankings. Your votes create a human baseline that we publish alongside the AI rankings.
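To make "human baseline" concrete, here is a minimal sketch of one way agreement could be tallied: treat each vote as agree/disagree with a pairwise AI ranking and report the agreement rate per pair. This is illustrative only; the vote format, pair names, and aggregation are assumptions, not our actual pipeline.

```python
# Illustrative only: one simple way a human baseline could be derived from
# votes on pairwise AI rankings. Vote data and model names are placeholders.
from collections import defaultdict

# Each vote: (model pair as ranked by the AI judges, True if the human agrees)
votes = [
    (("model-a", "model-b"), True),
    (("model-a", "model-b"), False),
    (("model-a", "model-b"), True),
    (("model-b", "model-c"), True),
]

agree: dict = defaultdict(int)
total: dict = defaultdict(int)
for pair, agreed in votes:
    total[pair] += 1
    agree[pair] += int(agreed)

for pair, n in total.items():
    rate = agree[pair] / n
    print(f"{pair[0]} > {pair[1]}: {rate:.0%} human agreement over {n} votes")
```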
Reproducible Evaluations
Clone the evaluation harness. Pick a question. Run it against any models you have API access to. Compare your results with ours. If they diverge significantly, that's a finding worth publishing.
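As a rough illustration of the reproduce-and-compare loop, here is a minimal Python sketch. Everything in it is a placeholder: `run_model`, `score_answer`, the toy question, the inline "published" scores, and the divergence threshold are assumptions for illustration, not the harness's real API or data.

```python
# Minimal sketch of the reproduce-and-compare loop. Swap run_model() for your
# own API client, load the real question and published scores from the repo,
# and choose your own divergence threshold.

def run_model(model: str, prompt: str) -> str:
    # Placeholder: replace with a call to whatever model API you have access to.
    return "42" if model == "model-a" else "43"

def score_answer(answer: str, reference: str) -> float:
    # Placeholder: replace with the rubric-based scorer shipped with the question.
    return 1.0 if answer.strip() == reference else 0.0

# Toy stand-ins for a question file and our published results.
question = {"id": "q042", "prompt": "What is 6 * 7?", "reference": "42"}
published = {"q042": {"model-a": 1.0, "model-b": 0.0}}

for model in ("model-a", "model-b"):
    answer = run_model(model, question["prompt"])
    score = score_answer(answer, question["reference"])
    delta = abs(score - published[question["id"]][model])
    status = "diverges" if delta > 0.1 else "matches"
    print(f"{model}: local={score:.2f} "
          f"published={published[question['id']][model]:.2f} -> {status}")
```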