BenchmarkAgent

A fully autonomous agentic system for building customized, high-quality benchmarks for LLMs & MLLMs — at scale, on demand.
 Abstract

Benchmark Everything, Everywhere, All at Once

[Figure: BenchmarkAgent system overview]
Benchmarks are crucial for evaluating LLMs and MLLMs, yet they are labor-intensive to construct, difficult to scale, and prone to rapid performance saturation.

We present BenchmarkAgent, a fully autonomous agentic system that standardizes and automates benchmark construction, covering user requirement analysis, subtask design, data annotation, and quality control.
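
The four stages named above can be pictured as a simple sequential pipeline. The sketch below is illustrative only and is not BenchmarkAgent's actual implementation: every class and function name (`BenchmarkDraft`, `analyze_requirement`, `design_subtasks`, `annotate_data`, `quality_control`, `build_benchmark`) is hypothetical, and each stage body is a trivial placeholder standing in for what would be an agent or model call.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkDraft:
    """Hypothetical container passed from stage to stage."""
    requirement: str
    topics: list = field(default_factory=list)
    subtasks: list = field(default_factory=list)
    samples: list = field(default_factory=list)


def analyze_requirement(draft):
    # Placeholder: treat the request as a comma-separated list of topics.
    draft.topics = [t.strip() for t in draft.requirement.split(",") if t.strip()]
    return draft


def design_subtasks(draft):
    # Placeholder: one multiple-choice subtask per topic.
    draft.subtasks = [f"{topic}: multiple choice" for topic in draft.topics]
    return draft


def annotate_data(draft):
    # Placeholder: one dummy sample per subtask (a real system would generate these).
    draft.samples = [
        {"subtask": s, "question": f"Sample question for {s}", "answer": "A"}
        for s in draft.subtasks
    ]
    return draft


def quality_control(draft):
    # Placeholder filter: keep only samples with a question and an answer.
    draft.samples = [s for s in draft.samples if s["question"] and s["answer"]]
    return draft


# Stage order follows the abstract: requirement analysis, subtask design,
# data annotation, quality control.
STAGES = [analyze_requirement, design_subtasks, annotate_data, quality_control]


def build_benchmark(requirement):
    draft = BenchmarkDraft(requirement)
    for stage in STAGES:
        draft = stage(draft)
    return draft


if __name__ == "__main__":
    result = build_benchmark("algebra, geometry")
    for sample in result.samples:
        print(sample["subtask"], "|", sample["question"])
```

Keeping the stages as a plain list makes the ordering explicit and lets a single stage be swapped out without touching the others, which is one way to read "standardized" here.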

Across 15 representative benchmarks spanning text, multimodal, and domain-specific reasoning, BenchmarkAgent produces high-quality samples with minimal human involvement, as validated by human evaluation, LLM-as-a-judge assessment, and consistency checks.

Two fundamental problems plague the benchmark ecosystem — rapid saturation and prohibitive manual cost.

📉

Performance Saturation

Model accuracy on released benchmarks climbs rapidly — often exceeding 80% within a year — leaving little room to differentiate SoTA models. A new benchmark becomes necessary almost as soon as the previous one is published.

Qwen series: accuracy on a representative benchmark
2022: 42%
2023: 64%
2024: 82%
2025: 94% ⚠️
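
To make the saturation concrete, the short sketch below computes the remaining headroom (100 minus accuracy) and the year-over-year gain from the four chart values above. The accuracy figures are taken from the chart; "headroom" is an illustrative measure introduced here, not a metric defined by the authors, and the function names are hypothetical.

```python
# Accuracy values (percent) as shown in the chart above.
accuracy_by_year = {2022: 42, 2023: 64, 2024: 82, 2025: 94}


def headroom(accuracy_pct):
    """Percentage points left before a benchmark is fully solved."""
    return 100 - accuracy_pct


def yearly_gains(series):
    """Gain in percentage points for each year relative to the previous one."""
    years = sorted(series)
    return {later: series[later] - series[earlier]
            for earlier, later in zip(years, years[1:])}


if __name__ == "__main__":
    for year in sorted(accuracy_by_year):
        print(year, "headroom:", headroom(accuracy_by_year[year]))
    print("gains:", yearly_gains(accuracy_by_year))
    # Headroom falls 58 -> 36 -> 18 -> 6; gains are 22, 18, 12 points.
```

By 2025 only 6 points of headroom remain, so differences between strong models have to fit inside a very narrow band.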

Labor-Intensive Construction

Existing pipelines are largely human-driven, requiring substantial effort in task design, data collection, manual annotation, and quality control. Each benchmark is effectively built from scratch — slow, costly, and impossible to reuse.

Traditional benchmark construction pipeline
1. Task Design (days to weeks)
2. Data Collection (weeks)
3. Manual Annotation (weeks to months)
4. Quality Control (weeks)
Months of expert labor, per benchmark
🚀

Our Solution: BenchmarkAgent

A fully autonomous agentic system that turns benchmark construction into a standardized, automated workflow — enabling scalable, high-quality, low-cost, and continuously renewable evaluation tailored to any user requirement.

 Examples

Try the Benchmarks Yourself

Real samples generated by BenchmarkAgent.
