The Multivac — Ask any model, routed by evaluation

EVAL-20260402-204743
analysis · Apr 02, 2026 · ANALYSIS-024
A quantitative trading firm backtests a strategy: 15% annual return, Sharpe ratio 2.1, max drawdown 8%. They want to go live. (1) What could go wrong between backtest and live trading? List at least 5 risks. (2) The backtest used 5 years of data and tested 200 parameter combinations. Calculate the probability this outperformance is due to overfitting. (3) Design a live testing protocol that minimizes capital at risk while validating the strategy.
Winner: GPT-5.4 (openrouter)
Winner score: 9.05 · matrix avg: 6.93
10×10 Judgment Matrix · 78 judgments
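| Judge ↓ / Respondent → | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.4 | DeepSeek V4 | MiMo-V2-Flash | Claude Sonnet 4.6 | Grok 4.20 | GPT-OSS-120B | Gemini 3 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | — | · | 9.8 | 5.3 | 7.5 | 6.8 | 8.3 | 7.5 | 8.2 | · |
| Claude Opus 4.6 | 2.3 | — | 9.6 | 6.2 | 6.5 | 6.8 | 8.6 | 8.2 | 7.5 | 0.5 |
| GPT-5.4 | 1.4 | · | — | · | 6.5 | 3.2 | 6.8 | 6.1 | 6.3 | · |
| DeepSeek V4 | 5.0 | 8.1 | 9.2 | — | 9.0 | 9.4 | 9.2 | 9.2 | 8.8 | 8.1 |
| MiMo-V2-Flash | 4.2 | 7.8 | 9.3 | 8.3 | — | 8.8 | 9.0 | 8.6 | 8.6 | 8.3 |
| Claude Sonnet 4.6 | 2.0 | · | 8.6 | 6.5 | 7.8 | — | 8.6 | 8.8 | 8.0 | · |
| Grok 4.20 | 3.0 | 6.0 | 8.6 | 6.8 | 7.3 | 8.2 | — | 8.8 | 7.8 | 6.0 |
| GPT-OSS-120B | · | · | 8.4 | 6.0 | 7.5 | 3.4 | 7.0 | — | 7.3 | · |
| Gemini 3 | 2.2 | · | 9.6 | 9.0 | 9.0 | 8.7 | 9.2 | 8.8 | — | · |
| MiniMax M2.5 | 2.5 | 0.5 | 8.5 | 7.8 | 8.2 | 7.9 | 8.6 | 8.8 | 8.4 | — |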
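
To make the matrix easier to work with, here is a minimal Python sketch that loads the displayed scores and averages each respondent's column, skipping the diagonal (—) and the missing judgments (·). The names `MODELS`, `MATRIX`, and `respondent_means` are illustrative and not part of any Multivac API. The plain column mean is an assumption about how scores are aggregated: on these rounded cells it gives about 9.07 for GPT-5.4, close to but not equal to the published 9.05, so the site may use unrounded scores or a different weighting.

```python
# Scores copied from the judgment matrix above.
# None marks a self-judgment (diagonal) or a missing judgment.
MODELS = [
    "Gemini 3.1 Pro", "Claude Opus 4.6", "GPT-5.4", "DeepSeek V4",
    "MiMo-V2-Flash", "Claude Sonnet 4.6", "Grok 4.20", "GPT-OSS-120B",
    "Gemini 3", "MiniMax M2.5",
]

# Rows are judges, columns are respondents, both in MODELS order.
MATRIX = [
    [None, None, 9.8, 5.3, 7.5, 6.8, 8.3, 7.5, 8.2, None],
    [2.3, None, 9.6, 6.2, 6.5, 6.8, 8.6, 8.2, 7.5, 0.5],
    [1.4, None, None, None, 6.5, 3.2, 6.8, 6.1, 6.3, None],
    [5.0, 8.1, 9.2, None, 9.0, 9.4, 9.2, 9.2, 8.8, 8.1],
    [4.2, 7.8, 9.3, 8.3, None, 8.8, 9.0, 8.6, 8.6, 8.3],
    [2.0, None, 8.6, 6.5, 7.8, None, 8.6, 8.8, 8.0, None],
    [3.0, 6.0, 8.6, 6.8, 7.3, 8.2, None, 8.8, 7.8, 6.0],
    [None, None, 8.4, 6.0, 7.5, 3.4, 7.0, None, 7.3, None],
    [2.2, None, 9.6, 9.0, 9.0, 8.7, 9.2, 8.8, None, None],
    [2.5, 0.5, 8.5, 7.8, 8.2, 7.9, 8.6, 8.8, 8.4, None],
]


def respondent_means(matrix):
    """Return the mean score each respondent (column) received.

    Empty cells (None) are skipped. This is an assumed aggregation,
    not necessarily the one the site uses.
    """
    means = []
    for col in range(len(matrix[0])):
        scores = [row[col] for row in matrix if row[col] is not None]
        means.append(sum(scores) / len(scores))
    return means


means = respondent_means(MATRIX)

# Number of filled cells; the page header reports 78 judgments.
judgment_count = sum(cell is not None for row in MATRIX for cell in row)

# Respondent with the highest plain column mean.
best = MODELS[means.index(max(means))]

if __name__ == "__main__":
    print(f"judgments: {judgment_count}")
    for name, mean in sorted(zip(MODELS, means), key=lambda p: -p[1]):
        print(f"{name:<18} {mean:.2f}")
```

The filled-cell count comes out to 78, matching the page header, and the top column under this simple mean is GPT-5.4, consistent with the listed winner.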