BenchmarkAgent: Autonomous Benchmark Construction

Abstract

Benchmark Everything, Everywhere, All at Once

Benchmarks are crucial for evaluating LLMs and MLLMs, yet their construction is labor-intensive, difficult to scale, and often prone to rapid performance saturation.

We present BenchmarkAgent, a fully autonomous agentic system that standardizes and automates benchmark construction, covering user requirement analysis, subtask design, data annotation, and quality control.

Across 15 representative benchmarks spanning text, multimodal, and domain-specific reasoning, BenchmarkAgent produces high-quality samples with minimal human involvement, as validated by human evaluation, LLM-as-a-judge assessment, and consistency checks.

Two fundamental problems plague the benchmark ecosystem — rapid saturation and prohibitive manual cost.


Performance Saturation

Model accuracy on released benchmarks climbs rapidly — often exceeding 80% within a year — leaving little room to differentiate SoTA models. A new benchmark becomes necessary almost as soon as the previous one is published.

Qwen series, accuracy on a representative benchmark:

2022: 42%
2023: 64%
2024: 82%
2025: 94% ⚠️
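To make the saturation trend concrete, the short Python sketch below works through the arithmetic on the four accuracy figures listed above. The function names (`headroom`, `yearly_gains`, `first_year_above`) and the 80% threshold argument are illustrative choices for this example, not part of BenchmarkAgent.

```python
# Accuracy (%) per year, copied from the figures listed above.
ACCURACY = {2022: 42, 2023: 64, 2024: 82, 2025: 94}


def headroom(accuracy, ceiling=100):
    """Points left between each year's accuracy and a perfect score."""
    return {year: ceiling - acc for year, acc in accuracy.items()}


def yearly_gains(accuracy):
    """Year-over-year accuracy gain, keyed by the later year."""
    years = sorted(accuracy)
    return {
        later: accuracy[later] - accuracy[earlier]
        for earlier, later in zip(years, years[1:])
    }


def first_year_above(accuracy, threshold):
    """First year whose accuracy exceeds the threshold, or None."""
    for year in sorted(accuracy):
        if accuracy[year] > threshold:
            return year
    return None


if __name__ == "__main__":
    print(headroom(ACCURACY))              # {2022: 58, 2023: 36, 2024: 18, 2025: 6}
    print(yearly_gains(ACCURACY))          # {2023: 22, 2024: 18, 2025: 12}
    print(first_year_above(ACCURACY, 80))  # 2024
```

By 2025 only 6 points of headroom remain, which is the sense in which the benchmark stops differentiating strong models.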


Labor-Intensive Construction

Existing pipelines are largely human-driven, requiring substantial effort in task design, data collection, manual annotation, and quality control. Each benchmark is effectively built from scratch — slow, costly, and impossible to reuse.

Traditional benchmark construction pipeline:

1. Task Design (days to weeks)
2. Data Collection (weeks)
3. Manual Annotation (weeks to months)
4. Quality Control (weeks)

Result: months of expert labor, per benchmark.

