Evaluation EVAL-20260403-103635
analysis
Apr 03, 2026 · ANALYSIS-024

A quantitative trading firm backtests a strategy: 15% annual return, Sharpe ratio 2.1, max drawdown 8%. They want to go live. (1) What could go wrong between backtest and live trading? List at least 5 risks. (2) The backtest used 5 years of data and tested 200 parameter combinations. Calculate the probability this outperformance is due to overfitting. (3) Design a live testing protocol that minimizes capital at risk while validating the strategy.

Winner: GPT-5.4 (openrouter)
Winner score: 9.29
Matrix avg: 7.10
10×10 Judgment Matrix · 74 judgments
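| Judge ↓ / Respondent → | GPT-OSS-120B | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.4 | DeepSeek V4 | MiMo-V2-Flash | Claude Sonnet 4.6 | Grok 4.20 | Gemini 3 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-OSS-120B |  | 1.9 | 5.6 | 8.8 | 7.8 | 6.8 | 6.2 | 8.1 | · | · |
| Gemini 3.1 Pro | · |  | 7.3 | 9.8 | 7.0 | 7.7 | 6.0 | 7.3 | · | · |
| Claude Opus 4.6 | 7.2 | 0.8 |  | 9.3 | 7.0 | 6.8 | 7.0 | 8.8 | 0.5 | · |
| GPT-5.4 | 4.4 | 1.2 | 5.0 |  | 6.0 | 5.1 | 3.8 | 7.0 | · | · |
| DeepSeek V4 | 9.2 | 5.5 | 9.2 | 9.2 |  | 9.0 | 9.2 | 9.2 | 8.0 | · |
| MiMo-V2-Flash | 8.3 | 5.3 | 8.4 | 9.0 | 8.8 |  | · | 9.0 | 8.2 | · |
| Claude Sonnet 4.6 | 8.0 | 1.4 | 8.0 | 9.3 | 7.3 | 8.2 |  | 8.8 | · | · |
| Grok 4.20 | 8.4 | 3.3 | 7.5 | 9.0 | 6.8 | · | 8.6 |  | 6.0 | · |
| Gemini 3 | 9.0 | 2.0 | 9.2 | 9.8 | 9.0 | 9.0 | 9.0 | 9.6 |  | · |
| MiniMax M2.5 | 5.5 | 2.5 | 8.8 | 9.3 | 7.8 | 7.8 | 6.8 | 9.0 | 8.0 |  |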
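
The headline figures can be cross-checked against the matrix above. The following is a minimal sketch, not the site's actual scoring code. It assumes that rows are judges and columns are respondents, that the blank diagonal means self-judgments are excluded, that "·" marks a missing judgment, that the winner score is the mean of the scores a respondent received, and that "matrix avg" is the mean of those per-respondent means. The names MODELS, SCORES and respondent_means are illustrative.

```python
from statistics import mean

MODELS = [
    "GPT-OSS-120B", "Gemini 3.1 Pro", "Claude Opus 4.6", "GPT-5.4", "DeepSeek V4",
    "MiMo-V2-Flash", "Claude Sonnet 4.6", "Grok 4.20", "Gemini 3", "MiniMax M2.5",
]

N = None  # self-judgment (diagonal) or missing judgment ("·")

# Rows are judges, columns are respondents, both in MODELS order.
SCORES = [
    [N,   1.9, 5.6, 8.8, 7.8, 6.8, 6.2, 8.1, N,   N],
    [N,   N,   7.3, 9.8, 7.0, 7.7, 6.0, 7.3, N,   N],
    [7.2, 0.8, N,   9.3, 7.0, 6.8, 7.0, 8.8, 0.5, N],
    [4.4, 1.2, 5.0, N,   6.0, 5.1, 3.8, 7.0, N,   N],
    [9.2, 5.5, 9.2, 9.2, N,   9.0, 9.2, 9.2, 8.0, N],
    [8.3, 5.3, 8.4, 9.0, 8.8, N,   N,   9.0, 8.2, N],
    [8.0, 1.4, 8.0, 9.3, 7.3, 8.2, N,   8.8, N,   N],
    [8.4, 3.3, 7.5, 9.0, 6.8, N,   8.6, N,   6.0, N],
    [9.0, 2.0, 9.2, 9.8, 9.0, 9.0, 9.0, 9.6, N,   N],
    [5.5, 2.5, 8.8, 9.3, 7.8, 7.8, 6.8, 9.0, 8.0, N],
]


def respondent_means(scores):
    """Mean score each respondent received, skipping empty cells."""
    result = {}
    for col, name in enumerate(MODELS):
        received = [row[col] for row in scores if row[col] is not None]
        if received:  # respondents with no judgments are left out
            result[name] = mean(received)
    return result


means = respondent_means(SCORES)
winner = max(means, key=means.get)
judgments = sum(value is not None for row in SCORES for value in row)
matrix_avg = mean(list(means.values()))

print(f"judgments: {judgments}")
print(f"winner: {winner} ({means[winner]:.2f})")
print(f"matrix avg (mean of respondent means): {matrix_avg:.2f}")
```

Under these assumptions the sketch counts 74 judgments and ranks GPT-5.4 first with a mean of about 9.28, close to the reported 9.29; the small gap is consistent with the displayed cells being rounded to one decimal. The mean of per-respondent means comes out at about 7.10, whereas a plain average over all 74 cells gives about 7.14, which suggests (but does not confirm) that the page uses the former.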