Two fundamental problems plague the benchmark ecosystem: rapid saturation and prohibitive manual cost.
Model accuracy on a released benchmark climbs quickly, often exceeding 80% within a year, which leaves little room to differentiate state-of-the-art (SoTA) models. A new benchmark becomes necessary almost as soon as the previous one is published.
Existing pipelines are largely human-driven, requiring substantial effort in task design, data collection, manual annotation, and quality control. Each benchmark is effectively built from scratch, making the process slow, costly, and impossible to reuse.
Real samples generated by BenchmarkAgent.