Evaluation EVAL-20260402-112833
code
Jan 13, 2026 · CODE-001

This Python async function has 3 bugs: a race condition, an unhandled exception, and a resource leak. Find all three and explain why each is problematic.

```python
import asyncio
import aiohttp

class DataFetcher:
    def __init__(self):
        self.cache = {}
        self.session = aiohttp.ClientSession()

    async def fetch_data(self, urls):
        results = []
        for url in urls:
            if url in self.cache:
                results.append(self.cache[url])
            else:
                async with self.session.get(url) as response:
                    data = await response.json()
                    self.cache[url] = data
                    results.append(data)
        return results

    async def fetch_parallel(self, urls):
        tasks = [self.fetch_single(url) for url in urls]
        return await asyncio.gather(*tasks)

    async def fetch_single(self, url):
        if url in self.cache:
            return self.cache[url]
        async with self.session.get(url) as response:
            data = await response.json()
            self.cache[url] = data
            return data
```
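For orientation, here is a minimal sketch of one plausible reading of the three bugs, not a reference answer from the evaluation. It assumes the race is concurrent `fetch_single` calls for the same URL each missing the cache, the unhandled exception is a failed request aborting `asyncio.gather`, and the leak is the session never being closed. To stay runnable without a network, the hypothetical `SafeFetcher` takes an injected `fetch` coroutine in place of `session.get(url)` plus `response.json()`. `fake_fetch` and `demo` are likewise illustrative names.

```python
import asyncio


class SafeFetcher:
    """Sketch only: `fetch` is a hypothetical stand-in for the aiohttp request."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}     # url -> completed result
        self._inflight = {}  # url -> Task for a request still in progress
        self.closed = False

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # In an aiohttp version, `await self.session.close()` would go here,
        # so the session is released even if the body raised.
        self.closed = True

    async def fetch_single(self, url):
        if url in self._cache:
            return self._cache[url]
        # No await between this lookup and the assignment below, so only
        # one task per URL can ever be created.
        task = self._inflight.get(url)
        if task is None:
            task = asyncio.ensure_future(self._fetch(url))
            self._inflight[url] = task
        try:
            data = await task
        finally:
            self._inflight.pop(url, None)
        self._cache[url] = data
        return data

    async def fetch_parallel(self, urls):
        # return_exceptions=True keeps one failed URL from discarding the rest.
        return await asyncio.gather(
            *(self.fetch_single(u) for u in urls), return_exceptions=True
        )


async def demo():
    calls = []

    async def fake_fetch(url):
        calls.append(url)
        await asyncio.sleep(0)  # yield so concurrent callers can interleave
        if url == "bad":
            raise ValueError("simulated request failure")
        return {"url": url}

    async with SafeFetcher(fake_fetch) as fetcher:
        results = await fetcher.fetch_parallel(["a", "a", "bad"])
    return calls, results, fetcher.closed
```

Running `asyncio.run(demo())` issues a single request for the duplicated URL, returns the failure as a value in the result list instead of raising, and marks the fetcher closed on exit.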

Winner: Grok 4.20 (openrouter)
Winner score: 9.61
Matrix avg: 8.96
10×10 Judgment Matrix · 81 judgments
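| Judge ↓ / Respondent → | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro | Claude Sonnet 4.6 | Grok 4.20 | DeepSeek V4 | GPT-OSS-120B | Gemini 3 | MiniMax M2.5 | MiMo-V2-Flash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.4 | | 9.2 | 7.3 | 8.6 | 9.3 | 8.2 | 8.6 | 8.2 | · | 6.3 |
| Claude Opus 4.6 | 9.2 | | 8.0 | 8.9 | 9.2 | 7.5 | 9.8 | 7.9 | · | 7.0 |
| Gemini 3.1 Pro | 10.0 | 10.0 | | 10.0 | 9.8 | 6.1 | 6.0 | 10.0 | · | 6.1 |
| Claude Sonnet 4.6 | 9.0 | 9.6 | 8.6 | | 9.2 | 7.8 | 9.6 | 8.6 | · | 8.0 |
| Grok 4.20 | 9.2 | 9.2 | 8.6 | 8.8 | | 6.3 | 9.2 | 8.0 | · | 6.2 |
| DeepSeek V4 | 9.6 | 9.6 | 9.4 | 9.6 | 9.6 | | 9.8 | 8.8 | · | 9.6 |
| GPT-OSS-120B | 8.7 | 8.8 | 8.5 | 8.8 | 9.7 | 8.8 | | 9.2 | · | 8.8 |
| Gemini 3 | 10.0 | 10.0 | 9.4 | 10.0 | 10.0 | 9.8 | 10.0 | | · | 9.6 |
| MiniMax M2.5 | 9.6 | 10.0 | 7.9 | 10.0 | 9.8 | 9.4 | 9.8 | 10.0 | | 8.8 |
| MiMo-V2-Flash | 10.0 | 9.3 | 10.0 | 9.3 | 10.0 | 9.0 | 10.0 | 10.0 | · | |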
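
The headline numbers can be cross-checked against the matrix. The sketch below assumes, without confirmation from the page, that the winner score is the mean of the scores a respondent received from the other judges and that the matrix average is the mean over all scored cells. Each row is read as nine entries with the judge's own column skipped, and `None` stands for a blank or `·` cell. The names `MODELS`, `SCORES`, `column_means`, and `matrix_mean` are illustrative.

```python
MODELS = [
    "GPT-5.4", "Claude Opus 4.6", "Gemini 3.1 Pro", "Claude Sonnet 4.6",
    "Grok 4.20", "DeepSeek V4", "GPT-OSS-120B", "Gemini 3",
    "MiniMax M2.5", "MiMo-V2-Flash",
]

# SCORES[judge][respondent], transcribed from the displayed (rounded) cells.
SCORES = [
    [None, 9.2, 7.3, 8.6, 9.3, 8.2, 8.6, 8.2, None, 6.3],
    [9.2, None, 8.0, 8.9, 9.2, 7.5, 9.8, 7.9, None, 7.0],
    [10.0, 10.0, None, 10.0, 9.8, 6.1, 6.0, 10.0, None, 6.1],
    [9.0, 9.6, 8.6, None, 9.2, 7.8, 9.6, 8.6, None, 8.0],
    [9.2, 9.2, 8.6, 8.8, None, 6.3, 9.2, 8.0, None, 6.2],
    [9.6, 9.6, 9.4, 9.6, 9.6, None, 9.8, 8.8, None, 9.6],
    [8.7, 8.8, 8.5, 8.8, 9.7, 8.8, None, 9.2, None, 8.8],
    [10.0, 10.0, 9.4, 10.0, 10.0, 9.8, 10.0, None, None, 9.6],
    [9.6, 10.0, 7.9, 10.0, 9.8, 9.4, 9.8, 10.0, None, 8.8],
    [10.0, 9.3, 10.0, 9.3, 10.0, 9.0, 10.0, 10.0, None, None],
]


def column_means(scores):
    """Mean score each respondent received, skipping unscored cells."""
    means = {}
    for col, name in enumerate(MODELS):
        received = [row[col] for row in scores if row[col] is not None]
        if received:  # a respondent with no scores gets no mean
            means[name] = sum(received) / len(received)
    return means


def matrix_mean(scores):
    """Mean over every scored cell, plus the number of scored cells."""
    cells = [v for row in scores for v in row if v is not None]
    return sum(cells) / len(cells), len(cells)


means = column_means(SCORES)
winner = max(means, key=means.get)
overall, n_judgments = matrix_mean(SCORES)

print(winner, round(means[winner], 2))  # Grok 4.20 9.62
print(round(overall, 2), n_judgments)   # 8.96 81
```

Under these assumptions the cell count (81) and the matrix average (8.96) match the page, and Grok 4.20 has the highest column mean. The computed 9.62 differs slightly from the displayed 9.61, which may be due to rounding in the displayed cells.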