SWE-bench Benchmark Contamination — When the Test Answers Are in the Training Data
Research indicates that the SWE-bench scores of top AI coding models may be inflated by git history leaks, raising fundamental questions about how we evaluate AI coding capabilities.