Evaluation EVAL-20260402-135609
code · Apr 02, 2026 · CODE-019

Implement a Bloom filter from scratch (no libraries) with the following: configurable false positive rate, optimal hash function count calculation, serialization/deserialization, a counting variant that supports deletion, and memory usage statistics. Include mathematical proof of your false positive rate formula.
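
For orientation, the sizing the prompt asks for can be sketched with the textbook Bloom filter formulas: for n items and target false positive rate p, m = -n ln p / (ln 2)^2 bits and k = (m/n) ln 2 hash functions, with an expected rate of roughly (1 - e^(-kn/m))^k. The function names below are hypothetical and chosen for illustration; this is a minimal sketch under those standard approximations, not any respondent's submission.

```python
import math


def bloom_parameters(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Return (m_bits, k_hashes) for n_items and a target false positive rate."""
    if n_items <= 0 or not 0.0 < fp_rate < 1.0:
        raise ValueError("n_items must be positive and fp_rate must be in (0, 1)")
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits
    m_bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest integer, at least 1
    k_hashes = max(1, round(m_bits / n_items * math.log(2)))
    return m_bits, k_hashes


def expected_fp_rate(m_bits: int, k_hashes: int, n_items: int) -> float:
    """Approximate false positive rate after inserting n_items: (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


m, k = bloom_parameters(1000, 0.01)
print(m, k)  # 9586 7
print(round(expected_fp_rate(m, k, 1000), 4))  # 0.01
```

Rounding k to an integer moves the achieved rate slightly away from the target, which is why the sketch reports the expected rate separately instead of assuming it equals p.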

Winner: Grok 4.20 (openrouter)
Winner score: 8.68 (matrix avg: 6.81)
10×10 Judgment Matrix · 80 judgments
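| Judge ↓ / Respondent → | MiMo-V2-Flash | GPT-OSS-120B | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro | Claude Sonnet 4.6 | Grok 4.20 | DeepSeek V4 | Gemini 3 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| MiMo-V2-Flash | | 9.0 | 8.6 | 8.7 | 2.8 | 7.7 | 8.6 | 6.2 | 8.6 | · |
| GPT-OSS-120B | 7.0 | | 5.8 | 6.0 | 3.4 | 5.2 | 9.0 | 6.3 | 8.8 | · |
| GPT-5.4 | 4.2 | 4.0 | | 3.9 | 0.7 | 4.6 | 7.3 | 4.2 | 6.8 | · |
| Claude Opus 4.6 | 6.8 | 6.5 | 6.5 | | 1.2 | 7.5 | 7.2 | 4.5 | 7.4 | · |
| Gemini 3.1 Pro | 6.4 | 5.7 | 6.0 | 5.8 | | 6.0 | 9.4 | 4.2 | 8.4 | · |
| Claude Sonnet 4.6 | 6.8 | 7.0 | 7.8 | 8.3 | 1.2 | | 8.6 | 5.8 | 7.8 | · |
| Grok 4.20 | 7.2 | 8.7 | 8.1 | 8.7 | 3.3 | 7.9 | | 6.0 | 7.8 | · |
| DeepSeek V4 | 9.4 | 8.8 | 8.8 | 8.4 | 5.8 | 8.6 | 9.4 | | 8.8 | · |
| Gemini 3 | 8.6 | 9.4 | 8.8 | 9.6 | 2.6 | 8.6 | 9.8 | 6.3 | | · |
| MiniMax M2.5 | 7.8 | 7.5 | 7.5 | 6.5 | · | 6.8 | 8.8 | 6.6 | 8.0 | |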
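
The headline numbers can be checked against the matrix with a short script. The sketch below rests on assumptions the page does not state: that blank diagonal cells are excluded self-judgments, that "·" marks a judgment that was not recorded, that the winner score is the mean of the scores a respondent received, and that "matrix avg" is the mean of those per-respondent means. Under those assumptions the figures reproduce; a plain mean over all 80 cells would instead give about 6.86.

```python
from statistics import mean

MODELS = [
    "MiMo-V2-Flash", "GPT-OSS-120B", "GPT-5.4", "Claude Opus 4.6",
    "Gemini 3.1 Pro", "Claude Sonnet 4.6", "Grok 4.20", "DeepSeek V4",
    "Gemini 3", "MiniMax M2.5",
]

NA = None  # blank diagonal cell or "·" cell: no score recorded

# One row per judge, one column per respondent, both in MODELS order.
SCORES = [
    [NA, 9.0, 8.6, 8.7, 2.8, 7.7, 8.6, 6.2, 8.6, NA],
    [7.0, NA, 5.8, 6.0, 3.4, 5.2, 9.0, 6.3, 8.8, NA],
    [4.2, 4.0, NA, 3.9, 0.7, 4.6, 7.3, 4.2, 6.8, NA],
    [6.8, 6.5, 6.5, NA, 1.2, 7.5, 7.2, 4.5, 7.4, NA],
    [6.4, 5.7, 6.0, 5.8, NA, 6.0, 9.4, 4.2, 8.4, NA],
    [6.8, 7.0, 7.8, 8.3, 1.2, NA, 8.6, 5.8, 7.8, NA],
    [7.2, 8.7, 8.1, 8.7, 3.3, 7.9, NA, 6.0, 7.8, NA],
    [9.4, 8.8, 8.8, 8.4, 5.8, 8.6, 9.4, NA, 8.8, NA],
    [8.6, 9.4, 8.8, 9.6, 2.6, 8.6, 9.8, 6.3, NA, NA],
    [7.8, 7.5, 7.5, 6.5, NA, 6.8, 8.8, 6.6, 8.0, NA],
]

# Mean score received by each respondent (column mean over recorded cells).
respondent_means = {}
for col, name in enumerate(MODELS):
    received = [row[col] for row in SCORES if row[col] is not None]
    if received:  # skip respondents with no recorded scores
        respondent_means[name] = mean(received)

winner = max(respondent_means, key=respondent_means.get)
matrix_avg = mean(list(respondent_means.values()))
judgments = sum(value is not None for row in SCORES for value in row)

print(winner, round(respondent_means[winner], 2))  # Grok 4.20 8.68
print(round(matrix_avg, 2))                        # 6.81
print(judgments)                                   # 80
```

MiniMax M2.5 has no recorded scores as a respondent, so it drops out of the per-respondent means, and Gemini 3.1 Pro is averaged over eight scores where the other respondents have nine.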