olmo-eval

Overview

This project provides a unified workbench for evaluating language models throughout the model development loop.

Features:

Registry of benchmark tasks and composable suites, with named variants for few-shot settings, formatting, and scoring (e.g. humaneval:3shot:bpb).
Inference via vLLM, via LiteLLM for commercial APIs, or via a mock provider for dry runs and debugging.
Harness abstraction that separates execution policy from task definition, so any task can be run baseline or tool-augmented without modification.
Multi-turn agentic evaluation with tool calling, scaffolds, and sandboxed environments via Docker, Podman, or Modal.
LLM-as-judge scoring with auxiliary providers, including locally served judge models.
Aggregate and instance-level prediction storage.
Inspection tooling for viewing instances, formatted prompts, token arrays, and model responses.

Quick Start

This project uses uv with a checked-in uv.lock for reproducible builds. To get started, sync the repo with uv, browse the available tasks and suites, and preview a run with the built-in mock provider.

Run Your First Eval

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install Python 3.12 if your machine does not already have it
uv python install 3.12

# Install dependencies + the package (editable) from the lockfile.
# The default groups (`dev` + `vllm`) are installed automatically, which
# pulls in storage, beaker, hf, and the vLLM inference provider. vLLM
# deps are marked Linux-only via PEP 508 markers, so this works on macOS
# too — no extra flags needed.
uv sync --frozen

# Install pre-commit hooks
make setup

# To update the lockfile after changing pyproject.toml
uv lock

# Add an optional extra on top of the defaults (e.g. agents, litellm)
uv sync --frozen --extra agents

# `openhands` conflicts with vllm — opt out of the vllm group when using it
uv sync --frozen --no-group vllm --extra openhands

# Browse a few suites
uv run olmo-eval suite inspect mmlu
uv run olmo-eval suite inspect gpqa
uv run olmo-eval suite inspect olmobase:code

# Preview a run without loading a model
uv run olmo-eval run -m mock -t gsm8k --dry-run

# Preview another run with a different task spec
uv run olmo-eval run -m mock -t humaneval:3shot:bpb --dry-run

Key Concepts

The evaluation framework is built around these core abstractions:

Abstraction	Description
Task	Benchmark specification defining dataset slice, request construction, and scoring logic
Suite	Benchmark collection that composes tasks and/or nested suites and defines result aggregation
Harness	Execution runtime around the inference provider, tools, scaffolds, and runtime behavior
Formatter	Prompt renderer from an instance and few-shot context to an LM request
Scorer	Per-example evaluator from model output to raw score or judgment
Metric	Dataset-level aggregator over per-example scores

Tasks

Tasks define how to load data, format prompts, and score outputs. Register a task class with the @register decorator:

from olmo_eval.evals.tasks.common import Task, register
from olmo_eval.data import DataSource

@register("my_task")
class MyTask(Task):
    # DataSource specifies path, subset (optional), and split
    data_source = DataSource(path="cais/mmlu", subset="abstract_algebra", split="test")
    ...

Variants are named evaluation presets attached to a registered task (for example, few-shot settings):

from olmo_eval.evals.tasks.common import register_variant

register_variant("my_task", "3shot", num_fewshot=3, fewshot_seed=42)
# Built-in example: uv run olmo-eval run -m llama3.1-8b -t humaneval:3shot:bpb

Tasks can declare runtime dependencies, which are installed at job startup:

@register("code_eval")
class CodeEvalTask(Task):
    data_source = DataSource(path="my-org/code-dataset", split="test")
    dependencies = ["code-sandbox==1.0", "git+https://github.com/user/<repo>@<ref>"]
    ...

Suites

Suites group multiple tasks for batch evaluation:

from olmo_eval.evals.suites import Suite, register

register(Suite(
    name="my_suite",
    tasks=("task_a:3shot", "task_b:3shot", "task_c:3shot"),
))

Aggregation

Suites support different strategies for combining task results:

Strategy	Description
`AVERAGE`	Simple average of all task scores (default)
`AVERAGE_OF_AVERAGES`	Average over child suite averages (equal weight per child)
`DISPLAY_ONLY`	Display child results without computing suite average
`NONE`	No aggregation - just collect individual task results

Average of Averages Example:

from olmo_eval.evals.suites import Suite, AggregationStrategy, register

# Nested suite with 3 tasks
multilingual_code = Suite(
    name="multilingual_code",
    tasks=("mbpp_python", "mbpp_java", "mbpp_rust"),
    aggregation=AggregationStrategy.AVERAGE,
)

# Parent suite using average of averages
register(Suite(
    name="code_eval",
    tasks=(
        "humaneval",        # Single task (score: 0.80)
        multilingual_code,  # Nested suite with 3 tasks (scores: 0.40, 0.50, 0.60)
    ),
    aggregation=AggregationStrategy.AVERAGE_OF_AVERAGES,
))

# Results:
# - humaneval: 0.80
# - multilingual_code average: (0.40 + 0.50 + 0.60) / 3 = 0.50
#
# AVERAGE_OF_AVERAGES: (0.80 + 0.50) / 2 = 0.65
# vs AVERAGE:          (0.80 + 0.40 + 0.50 + 0.60) / 4 = 0.575

Note: Currently AVERAGE_OF_AVERAGES gives each child equal weight regardless of how many tasks it contains. Custom weighting may be supported in the future.
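
The arithmetic behind the two strategies can be checked with a few lines of plain Python. This is only a sketch of the calculation using the scores from the example above; the helper names `average_of_averages` and `flat_average` are made up here and are not part of olmo_eval.

```python
from statistics import fmean

# Scores copied from the example above. Each inner list is one child of the
# parent suite: a single task is a child with one score.
children = [
    [0.80],              # humaneval
    [0.40, 0.50, 0.60],  # multilingual_code (nested suite)
]


def average_of_averages(children: list[list[float]]) -> float:
    """Average each child first, then average the child means."""
    return fmean(fmean(child) for child in children)


def flat_average(children: list[list[float]]) -> float:
    """Pool every leaf task score and take a single mean."""
    return fmean(score for child in children for score in child)


print(round(average_of_averages(children), 3))  # 0.65
print(round(flat_average(children), 3))         # 0.575
```

The nested suite pulls the flat average down because it contributes three of the four leaf scores, while under average of averages it counts once.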

Formatters

Formatters convert instances into LM requests. See olmo_eval.common.formatters for available options.

from olmo_eval.common.formatters import MultipleChoiceFormatter, ChatFormatter

# Multiple choice with logprob scoring
formatter = MultipleChoiceFormatter(template="Q: {question}\n\nA:")

# Chat-based formatting
formatter = ChatFormatter(system_prompt="You are a helpful assistant.")

Scorers

Scorers compute a score for each instance/output pair. See olmo_eval.common.scorers for available options.

from olmo_eval.common.scorers import ExactMatchScorer, MultipleChoiceScorer

# Exact string match
scorer = ExactMatchScorer()

# Multiple choice comparison
scorer = MultipleChoiceScorer()

Metrics

Metrics aggregate scores across responses. See olmo_eval.common.metrics for available options.

from olmo_eval.common.metrics import AccuracyMetric, F1Metric
from olmo_eval.common.scorers import ExactMatchScorer, F1Scorer

# Mean accuracy
metric = AccuracyMetric(scorer=ExactMatchScorer)

# Mean F1 score
metric = F1Metric(scorer=F1Scorer)
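
To make the scorer/metric split concrete, here is a standalone sketch that does not use the olmo_eval classes. `exact_match` and `accuracy` are hypothetical stand-ins for a per-example scorer and a dataset-level metric, and the real implementations may normalize text differently.

```python
from statistics import fmean


def exact_match(prediction: str, gold: str) -> float:
    """Per-example score: 1.0 if the stripped strings are equal, else 0.0."""
    return 1.0 if prediction.strip() == gold.strip() else 0.0


def accuracy(scores: list[float]) -> float:
    """Dataset-level metric: mean of the per-example scores."""
    return fmean(scores) if scores else 0.0


# Illustrative data only.
predictions = ["4", " Paris", "blue"]
golds = ["4", "Paris", "red"]

scores = [exact_match(p, g) for p, g in zip(predictions, golds)]
print(scores)            # [1.0, 1.0, 0.0]
print(accuracy(scores))  # mean of the three scores, about 0.667
```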

Model Presets

Pre-configured model settings in olmo_eval/common/constants/models.py:

from olmo_eval.common.constants import get_model_presets

# Returns dict of preset name -> ModelConfig
presets = get_model_presets()
# {
#     "llama3.1-8b": ModelConfig(model="meta-llama/Meta-Llama-3.1-8B"),
#     "olmo-2-7b": ModelConfig(model="allenai/OLMo-2-1124-7B"),
#     ...
# }

Harness

A Harness is the runtime orchestration layer for an evaluation run. It combines the primary inference provider with execution policy such as system prompts, tools, auxiliary providers, sandboxing, metrics collection, and an optional scaffold for multi-turn control. This lets the same task run in plain, tool-using, or scaffolded modes without changing the task definition.

Key concept: whether a task runs with or without tools is determined by the Harness configuration, not by the task definition. This allows comparing baseline and tool-augmented performance on the same task.

Using Harness via CLI

# Run task without tools or a scaffold (baseline)
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge

# Run task with search tools via harness preset
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness dr_tulu

# Use a custom harness config file
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness-config ./my_harness.yaml

HarnessConfig

Configuration for a harness:

from olmo_eval.harness import HarnessConfig, ProviderConfig, get_harness_preset
from olmo_eval.harness.tools.search import (
    semantic_scholar_search,
    serper_web_search,
    serper_fetch_page,
)

# Get a preset
config = get_harness_preset("dr_tulu")

# Or create custom config with tools
config = HarnessConfig(
    name="my_harness",
    provider=ProviderConfig(model="gpt-4o", kind="litellm"),
    tools=(semantic_scholar_search, serper_web_search, serper_fetch_page),
    system_prompt="You are a helpful assistant with search tools.",
    max_turns=10,
    max_concurrency=8,
    scaffold="openai_agents",
    required_secrets=("S2_API_KEY", "SERPER_API_KEY"),
)

Field	Type	Default	Description
`name`	`str`	Required	Harness identifier
`provider`	`ProviderConfig`	`ProviderConfig()`	Model provider configuration
`tools`	`tuple[Tool \| str, ...]`	`()`	Tool instances or registered tool names
`system_prompt`	`str \| None`	`None`	System prompt to inject
`tool_choice`	`str`	`"auto"`	Tool selection mode (`auto`, `none`, `required`)
`scaffold`	`str \| None`	`None`	Execution scaffold (e.g., `openai_agents`)
`max_turns`	`int \| None`	`None`	Max turns for multi-turn execution
`max_concurrency`	`int \| None`	`None`	Concurrent executions
`scoring_concurrency`	`int \| None`	`None`	Max concurrent scoring operations
`sandboxes`	`tuple[SandboxConfig, ...]`	`()`	Sandbox configurations for isolated tool execution
`scaffold_kwargs`	`dict[str, Any]`	`{}`	Scaffold-specific options (e.g., `enable_compaction`)
`metrics`	`MetricsConfig \| None`	`None`	Inference metrics collection config
`batching`	`BatchConfig \| None`	`None`	Batching strategy configuration
`required_secrets`	`tuple[str, ...]`	`()`	Required environment variables

Scaffolds

Scaffolds define how the Harness executes multi-turn requests with tool calling. A scaffold handles the agentic loop: calling the model, executing tools, and feeding results back.

# List available scaffolds
uv run olmo-eval scaffolds

When to use a scaffold:

For multi-turn execution with harness.run(), you must specify a scaffold
For single-turn generation with harness.generate(), no scaffold is needed

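from olmo_eval.harness import Harness, HarnessConfig, ProviderConfig
from olmo_eval.harness.tools.search import semantic_scholar_search, serper_web_search

# Multi-turn execution requires a scaffold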
config = HarnessConfig(
    name="my_agent",
    provider=ProviderConfig(model="gpt-4o", kind="litellm"),
    tools=(semantic_scholar_search, serper_web_search),
    scaffold="openai_agents",  # Required for run()
)
harness = Harness(config)
result = await harness.run(request)  # Uses the scaffold

# Single-turn generation works without a scaffold
config = HarnessConfig(
    name="simple",
    provider=ProviderConfig(model="gpt-4o", kind="litellm"),
)
harness = Harness(config)
outputs = harness.generate(requests)  # No scaffold needed

Inference Metrics

Harness configurations can include MetricsConfig to collect inference performance metrics during evaluation:

from olmo_eval.harness import HarnessConfig, ProviderConfig
from olmo_eval.inference.metrics import MetricsConfig

config = HarnessConfig(
    name="with_metrics",
    provider=ProviderConfig(model="llama3.1-8b", kind="vllm_server"),
    metrics=MetricsConfig(
        enabled=True,
        reporters=("file", "db"),  # Save to file and database
        collect_vllm_server=True,  # Poll vLLM server /metrics endpoint
    ),
)

Visualizing Metrics:

# Plot metrics from database (requires at least one filter)
uv run olmo-eval metrics plot -G my-benchmark-group
uv run olmo-eval metrics plot -m OLMo-3 --metric throughput

# Show statistics table without interactive plots
uv run olmo-eval metrics plot -e experiment_123 --stats-only

When using the db reporter, metrics are stored in a PostgreSQL database (default name: olmo_eval_metrics). You must configure your own database connection using the OLMO_EVAL_DB_* environment variables (see Database Configuration).

Auxiliary Providers and Local Judge Models

Some tasks or custom scorers use LLM-as-judge scoring, where a separate model evaluates responses. The auxiliary_providers configuration lets you specify additional inference providers for scoring or judging. Override placement matters: harness overrides (-o flags) must come immediately after --harness, while task overrides such as -o limit=... must come after -t.

Local example with uv run olmo-eval run:

uv run olmo-eval run \
    --harness default \
    -o provider.max_model_len=16384 \
    -o provider.num_instances=1 \
    -o 'metrics.reporters=[file]' \
    -o 'metrics.collect_gpu=true' \
    -o 'provider.kwargs.timeout=300' \
    -o auxiliary_providers.judge.kind=vllm_server \
    -o auxiliary_providers.judge.model=Qwen/Qwen3-8B \
    -o auxiliary_providers.judge.num_instances=1 \
    -o scoring_concurrency=4 \
    -m Qwen/Qwen3-8B \
    -t simpleqa:judge \
    -o limit=10

Key configuration options:

Option	Description
`auxiliary_providers.judge.kind`	Provider type: `vllm_server`, `litellm`, etc.
`auxiliary_providers.judge.model`	Model to use for judging
`auxiliary_providers.judge.num_instances`	Number of parallel vLLM instances
`auxiliary_providers.judge.base_url`	URL for external servers (when not spawning locally)
`scoring_concurrency`	Number of concurrent scoring requests
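
One way to read the dotted `-o key=value` overrides is as paths into a nested configuration. The sketch below illustrates that reading only; `apply_override` is a hypothetical helper, and the actual olmo-eval parser may coerce types, parse list syntax such as `[file]`, or validate keys, none of which is modeled here.

```python
from typing import Any


def apply_override(config: dict[str, Any], assignment: str) -> None:
    """Set `a.b.c=value` inside a nested dict, creating levels as needed."""
    dotted_key, _, raw_value = assignment.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = raw_value  # values stay strings in this sketch


config: dict[str, Any] = {}
for item in (
    "auxiliary_providers.judge.kind=vllm_server",
    "auxiliary_providers.judge.model=Qwen/Qwen3-8B",
    "scoring_concurrency=4",
):
    apply_override(config, item)

print(config["auxiliary_providers"]["judge"])
# {'kind': 'vllm_server', 'model': 'Qwen/Qwen3-8B'}
```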

Defining Tools

Tools combine schema (for the LLM) and implementation (for execution) in a single definition:

from olmo_eval.harness import tool, registered_tool

# Option 1: @tool decorator (local use)
@tool(description="Search the web for information")
async def web_search(query: str) -> str:
    """Search implementation."""
    return await search_api(query)

# Option 2: @registered_tool decorator (global registry, for cross-process use)
@registered_tool(description="Fetch a webpage")
async def fetch_page(url: str) -> str:
    """Fetch implementation."""
    return await fetch_url(url)

Tools are automatically registered when using @registered_tool, making them available by name in HarnessConfig.
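
As a rough illustration of how one definition can carry both the schema and the implementation, the sketch below derives a schema from a function signature. `sketch_tool` is a hypothetical decorator written for this example; it is not the olmo_eval `@tool` implementation, whose schema format is not shown in this document.

```python
import inspect
from typing import Any, Callable

# Minimal mapping from Python annotations to JSON-schema-style type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def sketch_tool(description: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Hypothetical decorator: attach a schema built from the function signature."""

    def wrap(fn: Callable[..., Any]) -> Callable[..., Any]:
        params = inspect.signature(fn).parameters
        fn.schema = {
            "name": fn.__name__,
            "description": description,
            "parameters": {
                name: _JSON_TYPES.get(param.annotation, "string")
                for name, param in params.items()
            },
        }
        return fn  # the function itself remains the implementation

    return wrap


@sketch_tool(description="Search the web for information")
async def web_search(query: str) -> str:
    return f"results for {query}"


print(web_search.schema["name"])        # web_search
print(web_search.schema["parameters"])  # {'query': 'string'}
```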

Custom Harness Config File

Create a YAML file for custom harness configurations:

# my_harness.yaml
name: custom_search
tool_names:
  - semantic_scholar_snippet_search
  - serper_google_webpage_search
system_prompt: |
  You are a research assistant with web search capabilities.
  Use search tools to find accurate information before answering.
max_turns: 15
max_concurrency: 4
required_secrets:
  - S2_API_KEY
  - SERPER_API_KEY

uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness-config my_harness.yaml

Programmatic Usage

from olmo_eval.harness import Harness, HarnessConfig, ProviderConfig, get_harness_preset
from olmo_eval.harness.tools.search import (
    semantic_scholar_search,
    serper_web_search,
)

# Create harness with preset and provider override
config = get_harness_preset("dr_tulu").with_provider(
    ProviderConfig(model="meta-llama/Llama-3.1-8B-Instruct", kind="vllm")
)
harness = Harness(config)

# Or create from scratch
config = HarnessConfig(
    name="my_harness",
    provider=ProviderConfig(model="gpt-4o", kind="litellm"),
    tools=(semantic_scholar_search, serper_web_search),
    system_prompt="You are a helpful assistant.",
    scaffold="openai_agents",
)
harness = Harness(config)

# Multi-turn execution with tool calling
result = await harness.run(request, sampling_params)
print(result.trajectory)  # Shows all turns including tool calls
print(result.final_output)  # Final model response

Adding New Tasks

This section explains how to create new evaluation tasks.

Quick Start: Minimal Task Example

"""Example: Minimal task implementation."""
from collections.abc import Iterator
from typing import Any

from olmo_eval.common.types import Instance, LMOutput, LMRequest, RequestType
from olmo_eval.data import DataLoader, DataSource
from olmo_eval.evals.tasks.common import Task, register


@register("my_task")
class MyTask(Task):
    """My task implementation."""

    # DataSource arguments:
    #   path: HuggingFace dataset path (e.g., "cais/mmlu")
    #   subset: Dataset subset/config (e.g., "abstract_algebra")
    #   split: Dataset split (e.g., "test", "validation")
    data_source = DataSource(path="cais/mmlu", subset="abstract_algebra", split="test")

    @property
    def instances(self) -> Iterator[Instance]:
        """Load and yield instances from the dataset."""
        if self._instances_cache is None:
            self._instances_cache = []
            loader = DataLoader()
            source = self.config.get_data_source()
            for doc in loader.load(source):
                self._instances_cache.append(self.process_doc(doc))
        yield from self._instances_cache

    def process_doc(self, doc: dict[str, Any]) -> Instance:
        """Convert a dataset document to an Instance."""
        return Instance(
            question=doc["question"],
            gold_answer=doc["answer"],
            choices=tuple(doc["choices"]),  # For MC tasks
            metadata={"id": doc["id"]},
        )

    def format_request(self, instance: Instance) -> LMRequest:
        """Format instance for the language model."""
        if self.config.formatter is not None:
            return self.config.formatter.format(instance, self.get_fewshot())
        # Fallback formatting
        return LMRequest(request_type=RequestType.COMPLETION, prompt=instance.question)

    def extract_answer(self, output: LMOutput) -> str | None:
        """Extract the answer from model output."""
        return output.text.strip()

Task Class Overview

Method	Required	Purpose
`instances`	Yes	Property that yields `Instance` objects from the dataset
`process_doc(doc)`	Yes	Converts a raw document dict into an `Instance`
`format_request(instance)`	Yes	Converts an `Instance` into an `LMRequest` for the model
`extract_answer(output)`	Yes	Extracts the answer string from `LMOutput`
`_build_fewshot()`	No	Override to customize few-shot example loading
`score_responses(...)`	No	Override to customize scoring logic
`compute_metrics(...)`	No	Override to customize metric computation

TaskConfig Reference

Field	Type	Default	Description
`name`	`str`	Required	Task identifier used in CLI
`data_source`	`DataSource \| str`	`None`	Dataset source (HuggingFace, S3, GCS, local, or URI string)
`fewshot_source`	`DataSource \| str`	`None`	Optional separate source for few-shot examples
`formatter`	`Formatter`	`None`	Request formatter
`metrics`	`tuple[Metric, ...]`	`()`	Evaluation metrics (scorers are inferred from metrics)
`num_fewshot`	`int`	`0`	Number of few-shot examples
`fewshot_seed`	`int`	`42`	Random seed for few-shot selection
`seed`	`int`	`42`	General random seed for task
`limit`	`int \| None`	`None`	Max instances to evaluate
`split`	`Split`	`Split.TEST`	Dataset split to use
`primary_metric`	`MetricName \| Metric \| None`	`None`	Primary metric for ranking (defaults to single metric if only one)
`sampling_params`	`SamplingParams \| None`	`None`	Default sampling parameters for this task
`dependencies`	`list[str] \| None`	`None`	Runtime packages to install (e.g., `["pkg==1.0"]`)

Data Sources

Tasks can load data from multiple sources using DataSource:

from olmo_eval.data import DataSource

# HuggingFace datasets - specify path, subset, and split
DataSource(path="cais/mmlu", subset="abstract_algebra", split="test")

# Using URI string (alternative syntax)
DataSource.from_uri("hf://cais/mmlu?subset=abstract_algebra&split=test")

# Without subset (for datasets that don't have subsets)
DataSource(path="openai_humaneval", split="test")

# With specific data files and revision
DataSource(path="my-org/dataset", data_files="data/test.jsonl", revision="v1.0")

# Local JSONL files
DataSource(path="/path/to/dataset.jsonl")

# S3
DataSource(path="s3://my-bucket/datasets/data.jsonl")

# GCS
DataSource(path="gs://my-bucket/datasets/data.parquet")

DataSource Fields:

Field	Type	Default	Description
`path`	`str`	Required	Dataset path (HuggingFace repo, S3/GCS URI, or local path)
`subset`	`str \| None`	`None`	Dataset subset/config name
`split`	`str`	`"test"`	Dataset split
`data_files`	`str \| None`	`None`	Specific data files to load
`revision`	`str \| None`	`None`	Dataset revision/version

Common Patterns

Multiple Choice Tasks:

formatter=MultipleChoiceFormatter(template="Question: {question}\n\nAnswer:")
metrics=(AccuracyMetric(scorer=MultipleChoiceScorer),)

Generation Tasks (exact match):

formatter=CompletionFormatter(template="{question}")
metrics=(AccuracyMetric(scorer=ExactMatchScorer),)

Tasks with Multiple Subsets (like MMLU with 57 subjects):

# Base class with shared logic
class MMLUTask(Task):
    ...

# Register each subset - the subset is specified in DataSource
@register("mmlu_anatomy")
class MMLUAnatomy(MMLUTask):
    data_source = DataSource(path="cais/mmlu", subset="anatomy", split="test")

@register("mmlu_physics")
class MMLUPhysics(MMLUTask):
    data_source = DataSource(path="cais/mmlu", subset="high_school_physics", split="test")

Adding Variants

Variants modify how a task is formatted/scored (e.g., :mc, :bpb):

from olmo_eval.evals.tasks.common import register_variant

# Register after task is defined
register_variant("my_task", "bpb", formatter=PPLFormatter(), metrics=(BPBMetricByteAvg(scorer=BitsPerByteScorer),))

Variants can also encode configuration presets (e.g., :3shot, :zero):

from olmo_eval.evals.tasks.common import register_variant

register_variant("my_task", "3shot", num_fewshot=3, fewshot_seed=1234)
register_variant("my_task", "zero", num_fewshot=0)

Usage: uv run olmo-eval run -m llama3.1-8b -t humaneval:3shot:bpb
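
A task spec such as humaneval:3shot:bpb reads as a base task name followed by variant names. The sketch below splits a spec on that assumption; `split_task_spec` is a hypothetical helper, and the real resolver may treat colons differently (suite names such as olmobase:code also contain one).

```python
def split_task_spec(spec: str) -> tuple[str, tuple[str, ...]]:
    """Split `name:variant1:variant2` into the base name and its variants."""
    name, *variants = spec.split(":")
    return name, tuple(variants)


print(split_task_spec("humaneval:3shot:bpb"))  # ('humaneval', ('3shot', 'bpb'))
print(split_task_spec("gsm8k"))                # ('gsm8k', ())
```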

Tool-Augmented Evaluation

olmo-eval supports evaluating models with tool use through the Harness abstraction. This enables comparing baseline model performance against tool-augmented performance on the same tasks.

The Harness is the preferred way to add tools to evaluations. It separates tool configuration from task definition, allowing any task to be run with or without tools:

# Baseline evaluation (no tools)
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge

# Same task with search tools
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness dr_tulu

See the Harness section above for full documentation on:

Creating custom harness configurations
Defining tools with the @tool decorator
Programmatic usage

Querying Results

Evaluation results can be stored in PostgreSQL and queried via the CLI.

Basic Queries

# Query by experiment ID
uv run olmo-eval results query --experiment exp_001

# Query by model
uv run olmo-eval results query --model llama3.1-8b

# Query by task (shows comparison matrix)
uv run olmo-eval results query --task mmlu --task gsm8k

# Query by experiment group
uv run olmo-eval results query -G my-benchmark-group --format json

# Combine filters
uv run olmo-eval results query --model llama3.1-8b --task mmlu --format json

Instance-Level Predictions

Include --instances to retrieve instance-level predictions:

# Get instances for an experiment
uv run olmo-eval results query --experiment exp_001 --task mmlu --instances --format json

# Paginate through large result sets using keyset pagination
uv run olmo-eval results query --task mmlu --instances --limit 1000 --format json

# Get the next page by passing pagination.last_id from the previous response as --after-id
uv run olmo-eval results query --task mmlu --instances --limit 1000 --after-id 1000 --format json

JSON output includes pagination metadata:

{
  "experiments": [...],
  "pagination": {
    "last_id": 12345,
    "has_more": true
  }
}

Output Formats

Format	Flag	Description
Table	`--format table`	Rich terminal tables (default)
JSON	`--format json`	Structured JSON with pagination metadata
CSV	`--format csv`	CSV output to stdout

Database Configuration

AI2 Users (Recommended)

Set these two environment variables to connect to the shared database:

export OLMO_EVAL_DB_HOST="<database-host>"
export OLMO_EVAL_DB_SECRET_ARN="arn:aws:secretsmanager:us-west-2:..."

The password is automatically fetched from AWS Secrets Manager on first connection. This requires AWS credentials configured (via ~/.aws/credentials or environment variables).

All Environment Variables

Variable	Default	Description
`OLMO_EVAL_DB_HOST`	`localhost`	Database host
`OLMO_EVAL_DB_PORT`	`5432`	Database port
`OLMO_EVAL_DB_NAME`	`olmo_eval`	Database name
`OLMO_EVAL_DB_USER`	`postgres`	Database user
`OLMO_EVAL_DB_PASSWORD`	-	Database password (use this OR `OLMO_EVAL_DB_SECRET_ARN`)
`OLMO_EVAL_DB_SECRET_ARN`	-	AWS Secrets Manager ARN for password (fetched on auth failure)

Advanced Usage

Multi-GPU and Tool-Augmented Evaluation

# Basic evaluation
uv run olmo-eval run -m llama3.1-8b -t mmlu -t gsm8k -t arc_easy

# Large models with multi-GPU tensor parallelism
uv run olmo-eval run -m llama3.1-70b -t mmlu --num-gpus 4

# Refresh Hugging Face cache before loading a remote model
uv run olmo-eval run -m allenai/OLMo-2-1124-7B -t mmlu --force-download-model

# Tool-augmented evaluation with harness
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness dr_tulu

Debugging and Inspection

olmo-eval provides tools for inspecting tasks, requests, and responses at various stages of evaluation.

Task Inspection (`uv run olmo-eval task inspect`)

Inspect task instances without running evaluation:

# View raw instance data
uv run olmo-eval task inspect arc_easy

# View multiple instances
uv run olmo-eval task inspect arc_easy -n 5 --skip 10

# View the LM request that will be sent to the model
uv run olmo-eval task inspect arc_easy:mc --request

# View formatted prompt with chat template applied
uv run olmo-eval task inspect humaneval -T meta-llama/Llama-3.1-8B-Instruct --formatted

# View tokenized representation
uv run olmo-eval task inspect humaneval -T meta-llama/Llama-3.1-8B-Instruct --tokens

# Export as JSON for programmatic use
uv run olmo-eval task inspect arc_easy --json

Option	Description
`-n, --count`	Number of instances to display
`-s, --skip`	Number of instances to skip
`--instance`	Show instance details (default if no other flags)
`--request`	Show the LM request
`-T, --tokenizer`	Tokenizer for formatting/tokenization
`--formatted`	Show prompt after template applied (requires `-T`)
`--tokens`	Show token array (requires `-T`)
`--max-tokens`	Max tokens to display (0 for no limit)
`--max-chars`	Max chars for formatted prompt (0 for no limit)
`--max-string-length`	Max chars for instance field values (0 for no limit)
`--json`	Output as JSON

Runtime Inspection Flags

Inspect data during evaluation runs with uv run olmo-eval run:

# Enable all inspection flags at once
uv run olmo-eval run -m llama3.1-8b -t mmlu --inspect

# Or use individual flags for specific inspection
uv run olmo-eval run -m llama3.1-8b -t mmlu --inspect-instance --inspect-request

# Inspect the response after model generation
uv run olmo-eval run -m llama3.1-8b -t mmlu --inspect-response

# Combine multiple inspection flags
uv run olmo-eval run -m llama3.1-8b -t mmlu \
    --inspect-instance \
    --inspect-request \
    --inspect-response

Flag	Description
`--inspect`	Enable all inspection flags below
`--inspect-instance`	Print the first instance of each task before running
`--inspect-request`	Print the first LM request before model generation
`--inspect-formatted`	Show formatted prompt (after chat template applied)
`--inspect-tokens`	Show token array before evaluation
`--inspect-response`	Print the first response after model generation

Mock Provider for Testing

Use the mock provider to test inspection tools without loading a real model:

# Quick inspection without vLLM or PyTorch
uv run olmo-eval run -m mock -t humaneval:3shot:bpb --inspect-request

# Dry run with mock to preview configuration
uv run olmo-eval run -m mock -t mmlu --dry-run

External Evals

External evals are standalone evaluations that run outside the normal task pipeline. Use them when a benchmark already comes with its own harness, verifier, or environment and does not fit cleanly into the usual task formatter/scorer flow. They are a good fit for agent-style benchmarks like terminal_bench_2, tau2_bench, and asta_bench that need sandbox orchestration, benchmark-specific setup, or end-to-end execution against an external repo or runner.

Defining an External Eval

from typing import Any

from olmo_eval.evals.external import SandboxedExternalEval, ExternalEvalResult, register_external_eval

class MyBenchmarkExternalEval(SandboxedExternalEval):
    """My benchmark evaluation."""

    name = "my_benchmark"
    description = "Evaluates model on my benchmark"
    timeout_seconds = 3600
    required_secrets = ("MY_API_KEY",)

    @property
    def sandbox_image(self) -> str:
        return "my-benchmark:latest"

    @property
    def working_dir(self) -> str:
        return "/workspace"

    @property
    def setup_command(self) -> tuple[str, ...]:
        return ("pip install -r requirements.txt",)

    @property
    def arguments(self) -> dict[str, tuple[str, Any | None]]:
        # Returns dict of arg_name -> (description, default_value)
        return {"subset": ("Which subset to evaluate", "default")}

    async def execute(self, provider, args, output_dir, container_runtime):
        # Run benchmark in sandbox
        result = await self.run_in_sandbox(provider, args, output_dir)
        return ExternalEvalResult(
            name=self.name,
            metrics={"accuracy": result.score},
            success=True,
        )

# Register the eval
register_external_eval(MyBenchmarkExternalEval())

Running External Evals

# List available external evals
uv run olmo-eval external-evals

# Run a built-in external eval
uv run olmo-eval run-external -e tau2_bench --model llama3.1-8b -a domain=airline -a num_tasks=1

ExternalEvalResult

External evals return structured results:

Field	Type	Description
`name`	`str`	Eval identifier
`metrics`	`dict[str, float]`	Evaluation metrics
`metadata`	`dict`	Additional metadata
`success`	`bool`	Whether the eval completed successfully
`error`	`str \| None`	Error message if failed
`duration_seconds`	`float`	Execution time
`raw_output`	`str \| None`	Raw stdout/stderr from the evaluation
`predictions`	`list`	Instance-level predictions

Sandboxes

Sandboxes provide isolated execution environments for code execution, tool use, and external evals.

Configuration

from olmo_eval.harness.sandbox import SandboxConfig, SandboxMode, Capability

config = SandboxConfig(
    image="python:3.12-slim",
    mode=SandboxMode.DOCKER,
    command_timeout=30.0,
    startup_timeout=60.0,
    instances=4,  # Run 4 parallel executors
    working_dir="/workspace",
    environment=(("MY_VAR", "value"),),
    volumes=(("/host/path", "/container/path"),),
    capabilities=Capability.BASH | Capability.PYTHON,  # Union of frozensets
)

SandboxConfig Fields

Field	Type	Default	Description
`image`	`str`	Required	Container image
`mode`	`SandboxMode`	`DOCKER`	`LOCAL`, `DOCKER`, or `MODAL`
`container_runtime`	`str`	`"podman"`	`"docker"` or `"podman"`
`command_timeout`	`float`	`30.0`	Timeout per command (seconds)
`startup_timeout`	`float`	`60.0`	Container startup timeout
`instances`	`int`	`1`	Number of parallel executors
`working_dir`	`str`	`"/workspace"`	Working directory in container
`environment`	`tuple`	`()`	Environment variables
`volumes`	`tuple`	`()`	Volume mounts (host, container)
`capabilities`	`frozenset[str]`	`Capability.DEFAULT`	Capabilities like `Capability.BASH`, `Capability.PYTHON`
`remove_container`	`bool`	`True`	Remove container after use
`docker_args`	`tuple[str, ...]`	`()`	Additional Docker/Podman arguments
`log_dir`	`str \| None`	`None`	Directory for container logs
`exec_shell`	`tuple[str, ...] \| None`	`None`	Custom shell for command execution
`enable_diagnostics`	`bool`	`True`	Run background diagnostics monitor

Using SandboxManager

The SandboxManager manages multiple executors with capability-based routing:

import asyncio

from olmo_eval.harness.sandbox import SandboxConfig, SandboxManager, SandboxMode, Capability

configs = [
    SandboxConfig(image="python:3.12", mode=SandboxMode.DOCKER, capabilities=Capability.PYTHON, instances=2),
    SandboxConfig(image="ubuntu:22.04", mode=SandboxMode.DOCKER, capabilities=Capability.BASH),
]

manager = SandboxManager(configs, owner="my-scorer")
await manager.start()

# Execute with specific capability - routes to matching executor
result = await manager.execute_with_capabilities(
    "print('hello')",
    Capability.PYTHON
)

# Round-robin across matching executors
results = await asyncio.gather(*[
    manager.execute_with_capabilities(cmd, Capability.PYTHON)
    for cmd in commands
])

await manager.stop()
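
To make capability-based routing and round-robin selection concrete, here is a minimal standalone sketch. It does not import olmo_eval: `Executor` and `Router` are hypothetical names, and the real SandboxManager may choose executors differently. The sketch assumes an executor matches when its capability set covers the requested one, and that matches are cycled through in order.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class Executor:
    """Hypothetical stand-in for one sandbox executor."""
    name: str
    capabilities: frozenset[str]


@dataclass
class Router:
    """Picks executors whose capabilities cover the request, round-robin."""
    executors: list[Executor]
    _counters: dict[frozenset[str], count] = field(default_factory=dict)

    def pick(self, required: frozenset[str]) -> Executor:
        # An executor matches if it offers every requested capability.
        matching = [e for e in self.executors if required <= e.capabilities]
        if not matching:
            raise LookupError(f"no executor provides {sorted(required)}")
        # One counter per requested capability set drives the rotation.
        counter = self._counters.setdefault(required, count())
        return matching[next(counter) % len(matching)]


PYTHON = frozenset({"python"})
BASH = frozenset({"bash"})

router = Router([
    Executor("py-0", PYTHON),
    Executor("py-1", PYTHON),
    Executor("sh-0", BASH),
])

picked = [router.pick(PYTHON).name for _ in range(3)]
print(picked)  # ['py-0', 'py-1', 'py-0']
```

With `instances=2` on the Python config above, two Python executors would exist, so successive Python commands alternate between them while Bash commands always land on the single Bash executor.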

Sandbox Modes

Mode	Description
`LOCAL`	Run commands locally (development only)
`DOCKER`	Run in Docker/Podman containers
`MODAL`	Run on Modal cloud platform

Launching on Beaker

olmo-eval includes built-in support for launching evaluation jobs on Beaker.

Installation

The beaker extra is included in the default dev group, so a plain uv sync --frozen is enough. If you previously opted out of the default groups, re-enable it with:

uv sync --frozen --extra beaker

CLI Usage

Launch an evaluation job:

# Basic evaluation
uv run olmo-eval beaker launch -n "eval-llama3-mmlu" -m llama3.1-8b -t mmlu \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Multiple tasks
uv run olmo-eval beaker launch -n "eval-llama3-suite" \
    -m llama3.1-8b \
    -t mmlu -t gsm8k -t hellaswag \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Large model with multiple GPUs
uv run olmo-eval beaker launch \
    --name "eval-70b-full" \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --task mmlu --task gsm8k --task arc_easy \
    --cluster h100 \
    --workspace "ai2/olmo-eval-debug" \
    --budget "ai2/oe-base" \
    --gpus 4 \
    --timeout 48h

# Preview the Beaker spec without launching
uv run olmo-eval beaker launch -n "test" -m llama3.1-8b -t arc_easy \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base" \
    --dry-run

# With a harness preset for tool-augmented evaluation
uv run olmo-eval beaker launch -n "eval-with-tools" \
    -m llama3.1-8b \
    -t simpleqa:judge \
    --harness dr_tulu \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# With inspection flags for debugging
uv run olmo-eval beaker launch -n "debug-eval" \
    -m llama3.1-8b \
    -t mmlu -o limit=10 \
    --inspect-request \
    --inspect-response \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Run external evaluations
uv run olmo-eval beaker launch -n "external-eval" \
    -E tau2_bench \
    -m llama3.1-8b \
    -A domain=airline \
    -A num_tasks=1 \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

Advanced Usage

Local Judge Models

For tasks or custom scorers that use a named auxiliary judge provider, you can run a local judge model alongside the main model. Put harness overrides immediately after --harness, then put task overrides after -t.

uv run olmo-eval beaker launch \
    --harness default \
    -o provider.max_model_len=16384 \
    -o provider.num_instances=1 \
    -o 'metrics.reporters=[file]' \
    -o 'metrics.collect_gpu=true' \
    -o 'provider.kwargs.timeout=300' \
    -o auxiliary_providers.judge.kind=vllm_server \
    -o auxiliary_providers.judge.model=Qwen/Qwen3-8B \
    -o auxiliary_providers.judge.num_instances=1 \
    -o scoring_concurrency=4 \
    -m Qwen/Qwen3-8B \
    -t "simpleqa:judge@urgent" \
    -o limit=10 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base" \
    --cluster h100 \
    --inspect \
    --group olmo-eval-local-judge-2 -y

Per-Task Priorities

Tasks can include an optional @priority suffix to set different priorities per task. Tasks with different priorities will be launched as separate Beaker experiments:

# Mixed priorities - creates separate experiments per priority level
uv run olmo-eval beaker launch -n "eval-suite" -m llama3.1-8b \
    -t "mmlu@high" \
    -t "gsm8k@normal" \
    -t "arc_easy@low" \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Creates 3 experiments:
#   eval-suite-high:   runs mmlu at high priority
#   eval-suite-normal: runs gsm8k at normal priority
#   eval-suite-low:    runs arc_easy at low priority

# With task variants (@ comes after the task spec)
uv run olmo-eval beaker launch -n "eval" -m llama3.1-8b -t "arc_easy:mc@high" \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Tasks without @priority use the config file priority (default: normal)
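
The splitting described above can be sketched in a few lines of standalone Python. This is an illustration only, not olmo-eval's implementation: `split_priority` and `group_by_priority` are hypothetical helpers, and the `<name>-<priority>` naming is taken from the example output above.

```python
from collections import defaultdict

# Priority names as listed for --priority in the CLI options.
VALID_PRIORITIES = ("low", "normal", "high", "urgent")


def split_priority(spec: str, default: str = "normal") -> tuple[str, str]:
    """Split 'task[:variant]@priority' into (task_spec, priority)."""
    task, sep, priority = spec.rpartition("@")
    if not sep:
        # No @ suffix: fall back to the default priority.
        return spec, default
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"unknown priority {priority!r} in {spec!r}")
    return task, priority


def group_by_priority(name: str, specs: list[str]) -> dict[str, list[str]]:
    """Map '<name>-<priority>' to the task specs that run at that priority."""
    groups: dict[str, list[str]] = defaultdict(list)
    for spec in specs:
        task, priority = split_priority(spec)
        groups[f"{name}-{priority}"].append(task)
    return dict(groups)


experiments = group_by_priority(
    "eval-suite", ["mmlu@high", "gsm8k@normal", "arc_easy:mc@low"]
)
for experiment, tasks in experiments.items():
    print(experiment, tasks)
```

Because the split happens on the last `@`, a variant such as `arc_easy:mc` stays attached to the task spec.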

Experiment Groups

Groups logically organize experiments for management and result retrieval:

# Launch with grouping
uv run olmo-eval beaker launch -n "benchmark-v1" --group "benchmark-2024" \
    -m llama3.1-8b -m olmo-2-7b \
    -t mmlu -t gsm8k -t hellaswag \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Creates experiment and adds it to "benchmark-2024" group

# Check group status and results
uv run olmo-eval beaker group info benchmark-2024

# Show detailed task info
uv run olmo-eval beaker group info benchmark-2024 --verbose

# Wait for completion and export as CSV
uv run olmo-eval beaker group info benchmark-2024 --wait --format csv > results.csv

# Export as JSON
uv run olmo-eval beaker group info benchmark-2024 --format json

# Watch experiment logs
uv run olmo-eval beaker watch -e <experiment-id>

# Cancel all experiments in a group
uv run olmo-eval beaker group cancel benchmark-2024

# List groups in a workspace
uv run olmo-eval beaker group list -w <workspace>

Inference Provider Configuration

Docker images do NOT include inference providers (vllm, transformers, litellm) by default. Each model must resolve to a provider configuration, either from a built-in model preset or from harness overrides.

Via config file (recommended):

name: eval-mixed-providers
models:
  - llama3.1-8b
  - gpt-4o
tasks:
  - mmlu
cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base

CLI Options

Option	Short	Default	Description
`--config`	`-f`	none	YAML config file (CLI args override config values)
`--name`	`-n`	auto	Experiment name (auto-generated from model/tasks if not provided)
`--model`	`-m`	required	Model name or HuggingFace path (can specify multiple)
`--task`	`-t`	required	Task name with optional `@priority` suffix (can specify multiple)
`--harness`	`-H`	none	Harness preset name
`--override`	`-o`	none	Override for preceding `-t` or `-H` (can specify multiple)
`--cluster`	`-c`	required	Cluster alias (`h100`, `a100`, `aus`) or full name
`--gpus`	`-G`	auto	Number of GPUs (defaults to 1 for GPU providers, 0 otherwise)
`--max-gpus-per-node`		`8`	Maximum GPUs per node (tasks split if exceeded)
`--priority`	`-p`	`normal`	Job priority (`low`, `normal`, `high`, `urgent`)
`--preemptible`		`true`	Allow preemption
`--timeout`	`-T`	`24h`	Job timeout (e.g., `24h`, `30m`)
`--retries`	`-r`	none	Number of retries on failure
`--workspace`	`-w`	required	Beaker workspace
`--budget`	`-B`	required	Beaker budget
`--image`	`-I`	default	Custom Beaker image
`--group`	`-g`	auto	Add experiments to Beaker group(s) (auto-generated if not specified)
`--external-eval`	`-E`	none	External evaluation name(s) to run instead of tasks
`--eval-arg`	`-A`	none	Arguments for external evals (`key=value`)
`--provider-kwarg`	`-K`	none	Provider kwargs for external evals (`key=value`)
`--force-download-model`		`false`	Refresh Hugging Face model/tokenizer cache before loading
`--uv-cache-dir`		default	UV cache directory for package downloads
`--dry-run`	`-d`	`false`	Print spec without launching
`--yes`	`-y`	`false`	Skip confirmation prompt
`--follow/--no-follow`		`true`	Follow logs after launch
`--secret-env`		none	Map Beaker secret to env var (`SECRET:VAR`)
`--aws-credentials`		auto	Inject AWS credentials (auto-detected from s3:// paths)
`--gcp-credentials`		auto	Inject GCP credentials (auto-detected from gs:// model paths)
`--store`		`false`	Persist results to configured database
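
The `--timeout` values in the table (`24h`, `30m`) are compact duration strings. A minimal sketch of how such a value could be converted to a `timedelta` follows. `parse_timeout` is a hypothetical helper, not part of olmo-eval, and only the `h` and `m` suffixes appear in this README; `s` and `d` are assumptions added for illustration.

```python
import re
from datetime import timedelta

# Only "h" and "m" appear in the documented examples; "s" and "d" are assumptions.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_timeout(value: str) -> timedelta:
    """Convert a string like '24h' or '30m' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"unrecognized timeout {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


print(parse_timeout("24h"))  # 1 day, 0:00:00
print(parse_timeout("30m"))  # 0:30:00
```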

Per-Task Overrides

Use the -o/--override flag to apply configuration overrides to the preceding -t:

# Task overrides (apply to the preceding -t)
uv run olmo-eval beaker launch -n "eval" \
    -m llama3.1-8b \
    -t mmlu -o limit=100 -o num_fewshot=5 \
    -t gsm8k -o limit=50 \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"
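
The "preceding `-t`" rule is positional: each `-o` attaches to the most recent `-t` (or `-H`) seen on the command line. The sketch below illustrates that idea in plain Python. `attach_overrides` is a hypothetical helper, not the CLI's actual parser, and it simply ignores every other flag.

```python
def attach_overrides(argv: list[str]) -> dict[str, list[str]]:
    """Group each -o value under the most recent -t or -H target."""
    targets: dict[str, list[str]] = {}
    current = None  # key of the target that the next -o applies to
    args = iter(argv)
    for arg in args:
        if arg in ("-t", "--task"):
            current = f"task:{next(args)}"
            targets[current] = []
        elif arg in ("-H", "--harness"):
            current = f"harness:{next(args)}"
            targets[current] = []
        elif arg in ("-o", "--override"):
            value = next(args)
            if current is None:
                raise ValueError(f"override {value!r} has no preceding -t or -H")
            targets[current].append(value)
        # Any other flag or value is skipped in this sketch.
    return targets


argv = [
    "-m", "llama3.1-8b",
    "-t", "mmlu", "-o", "limit=100", "-o", "num_fewshot=5",
    "-t", "gsm8k", "-o", "limit=50",
    "--cluster", "h100",
]
grouped = attach_overrides(argv)
print(grouped)
```

For the command above, `limit=100` and `num_fewshot=5` end up under `mmlu`, and `limit=50` under `gsm8k`.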

The -o flag uses OmegaConf dotlist syntax, supporting:

Type	Syntax	Example
String	`key=value`	`-o formatter.template="Q: {q}"`
Number	`key=123`	`-o limit=100`
Boolean	`key=true`	`-o preemptible=false`
Nested	`a.b.c=val`	`-o scorer.normalize=true`
List	`key=[a,b]`	`-o 'dependencies=[pkg1, pkg2]'`

Note: Quote complex values to prevent shell interpretation:

# Good - single quotes protect the value
-o 'extra_config={key: value, nested: {a: 1}}'
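
To show what dotted keys expand to, here is a simplified standard-library sketch. It is a stand-in for illustration, not OmegaConf's parser: `expand_dotlist` and `parse_scalar` are hypothetical helpers, scalars are interpreted with JSON rules, and unquoted lists or mappings such as `[a,b]` are left as plain strings.

```python
import json
from typing import Any


def parse_scalar(text: str) -> Any:
    """Interpret numbers, booleans and null via JSON; keep anything else as a string."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def expand_dotlist(items: list[str]) -> dict[str, Any]:
    """Turn ['a.b=1', 'c=x'] into {'a': {'b': 1}, 'c': 'x'}."""
    result: dict[str, Any] = {}
    for item in items:
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = result
        for part in parents:
            # Walk (or create) one nested dict per dotted segment.
            node = node.setdefault(part, {})
        node[leaf] = parse_scalar(raw)
    return result


overrides = expand_dotlist(
    ["limit=100", "scorer.normalize=true", "formatter.template=Q: {q}"]
)
print(overrides)
```

The shell quoting advice above still applies: the sketch only sees whatever string the shell passes through after quoting.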

Secret Environment Overrides

By default, Beaker secrets are mapped using the pattern {username}_{ENV_VAR} (e.g., ai2-tylerm_OPENAI_API_KEY). Use --secret-env to override this with a custom Beaker secret name:

# Use a team-shared secret instead of your personal secret
uv run olmo-eval beaker launch -n "eval" -m gpt-4o -t mmlu \
    --secret-env team-openai-key:OPENAI_API_KEY \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Multiple secret overrides
uv run olmo-eval beaker launch -n "eval" -m gpt-4o -t simpleqa:judge \
    --harness dr_tulu \
    --secret-env team-openai-key:OPENAI_API_KEY \
    --secret-env shared-serper-key:SERPER_API_KEY \
    --secret-env shared-s2-key:S2_API_KEY \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

Format: BEAKER_SECRET_NAME:ENV_VAR_NAME

This is useful for:

Using team-shared API keys instead of personal secrets
Testing with different credential sets
Sharing jobs that use organization-level secrets
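
The default naming pattern and the override format can be combined as in the sketch below. It is a plain-Python illustration under the rules stated above, not the launcher's code; `default_secret_name` and `resolve_secrets` are hypothetical names.

```python
def default_secret_name(username: str, env_var: str) -> str:
    """Default pattern from the text: {username}_{ENV_VAR}."""
    return f"{username}_{env_var}"


def resolve_secrets(
    username: str, env_vars: list[str], overrides: list[str]
) -> dict[str, str]:
    """Return env var -> Beaker secret name, applying SECRET:VAR overrides."""
    mapping = {var: default_secret_name(username, var) for var in env_vars}
    for item in overrides:
        secret, sep, env_var = item.partition(":")
        if not sep or not secret or not env_var:
            raise ValueError(
                f"expected BEAKER_SECRET_NAME:ENV_VAR_NAME, got {item!r}"
            )
        # An explicit override replaces the default for that variable.
        mapping[env_var] = secret
    return mapping


secrets = resolve_secrets(
    "ai2-tylerm",
    ["OPENAI_API_KEY", "SERPER_API_KEY"],
    ["team-openai-key:OPENAI_API_KEY"],
)
print(secrets)
```

Here `OPENAI_API_KEY` resolves to the shared `team-openai-key` secret, while `SERPER_API_KEY` keeps the per-user default.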

YAML Configuration

For complex or reusable configurations, use YAML config files with the --config/-f option. CLI arguments override values from the config file.

Basic config file (eval_config.yaml):

name: eval-llama3-core
models:
  - llama3.1-8b
tasks:
  - mmlu
  - gsm8k
  - hellaswag
  - arc_challenge

cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base
gpus: 1
priority: normal
timeout: 24h

Usage:

# Run from config file
uv run olmo-eval beaker launch -f eval_config.yaml --dry-run

# Override specific values
uv run olmo-eval beaker launch -f eval_config.yaml --gpus 4

# Add additional models via CLI
uv run olmo-eval beaker launch -f eval_config.yaml -m olmo-2-7b

Multi-model comparison config:

name: eval-model-comparison
models:
  - llama3.1-8b
  - olmo-2-7b
  - mistral-7b
tasks:
  - mmlu
  - gsm8k
  - hellaswag
cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base
gpus: 1

Per-task priorities in config (examples/beaker/configs/prioritized_tasks.yaml):

Use @priority suffix on tasks to run different tasks at different priority levels. Tasks with different priorities create separate Beaker experiments:

name: eval-prioritized
models:
  - llama3.1-8b
  - olmo-2-7b
tasks:
  # High priority - run first
  - mmlu@high
  - gsm8k@high
  # Normal priority
  - hellaswag@normal
  - arc_challenge@normal
  # Low priority - run when resources available
  - winogrande@low
  - arc_easy@low
cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base
gpus: 1
timeout: 24h

This creates 3 experiments (one per priority level, with both models in each):

eval-prioritized-high:   models=[llama3.1-8b, olmo-2-7b], tasks=[mmlu, gsm8k]
eval-prioritized-normal: models=[llama3.1-8b, olmo-2-7b], tasks=[hellaswag, arc_challenge]
eval-prioritized-low:    models=[llama3.1-8b, olmo-2-7b], tasks=[winogrande, arc_easy]

Large model config:

name: eval-70b-full
models:
  - meta-llama/Llama-3.1-70B-Instruct
tasks:
  - mmlu
  - gsm8k
  - hellaswag
cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base
gpus: 4
priority: high
preemptible: false
timeout: 48h
retries: 2
description: "Full evaluation suite for Llama 70B"

Config file fields:

Field	Type	Required	Description
`name`	string	yes	Experiment name
`models`	list	yes	List of model names or presets
`tasks`	list	yes	List of task specs (with optional `@priority`)
`cluster`	string	yes	Cluster alias or full name
`gpus`	int	no	Default GPUs per model instance (auto-detected based on provider)
`max_gpus_per_node`	int	no	Max GPUs per node, splits tasks if exceeded (default: `8`)
`priority`	string	no	Default priority (default: `normal`)
`preemptible`	bool	no	Allow preemption (default: `true`)
`timeout`	string	no	Job timeout (default: `24h`)
`retries`	int	no	Retry count on failure
`workspace`	string	yes	Beaker workspace
`budget`	string	yes	Beaker budget
`beaker_image`	string	no	Container image to use (config-only)
`description`	string	no	Optional Beaker description
`groups`	list	no	Beaker groups to add experiments to

See examples/beaker/configs/ for more configuration examples.
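
The required fields and defaults in the table can be summarized as a small check. The sketch below operates on an already-loaded config dictionary and is only an illustration: `normalize_config` is a hypothetical helper, and the launcher's real validation may be stricter or differ in detail.

```python
from typing import Any

# Fields marked "yes" in the Required column.
REQUIRED = ("name", "models", "tasks", "cluster", "workspace", "budget")

# Defaults stated in the Description column.
DEFAULTS = {
    "max_gpus_per_node": 8,
    "priority": "normal",
    "preemptible": True,
    "timeout": "24h",
}


def normalize_config(raw: dict[str, Any]) -> dict[str, Any]:
    """Reject configs with missing required fields, then fill in defaults."""
    missing = [key for key in REQUIRED if not raw.get(key)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return {**DEFAULTS, **raw}


config = normalize_config({
    "name": "eval-llama3-core",
    "models": ["llama3.1-8b"],
    "tasks": ["mmlu", "gsm8k"],
    "cluster": "h100",
    "workspace": "ai2/olmo-eval-debug",
    "budget": "ai2/oe-base",
    "timeout": "48h",
})
print(config["priority"], config["timeout"])  # normal 48h
```

Values present in the file win over the defaults, which mirrors how the optional fields behave in the examples above.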

Cluster Aliases

# List available cluster aliases
uv run olmo-eval beaker clusters

Programmatic API

from olmo_eval.launch import BeakerJobConfig, BeakerLauncher

config = BeakerJobConfig(
    name="eval-llama3-mmlu",
    command=["uv", "run", "olmo-eval", "run", "-m", "llama3.1-8b", "-t", "mmlu"],
    cluster="h100",
    num_gpus=1,
)

launcher = BeakerLauncher()
experiment = launcher.launch(config)
print(f"Launched: {launcher.beaker.experiment.url(experiment)}")

Docker Image Management

Docker images provide the runtime environment (Python, PyTorch, CUDA) but do NOT include:

Source code - Gantry mounts your git repository at runtime
Inference providers - Installed at job startup from each model's resolved provider config

This approach allows you to:

Use any git commit without rebuilding images
Keep images small and cacheable

Building Images

Images are tagged with CUDA and PyTorch versions: cu{version}-trc{version}-{arch}

# Build with defaults
./scripts/build_image.sh

# Specific CUDA + PyTorch version
./scripts/build_image.sh --cuda-version 12.8.1 --torch-version 2.9.0

# Production build
./scripts/build_image.sh --platform linux/amd64

# See supported CUDA+PyTorch pairs
./scripts/build_image.sh --help

Supported CUDA versions: 12.6.1, 12.8.0, 12.8.1, 12.9.1
PyTorch version: Configurable via --torch-version
Configuration: See scripts/build_config.sh

Pushing Images

# Push most recent build
./scripts/beaker/push_beaker_image.sh

# Preview without pushing
./scripts/beaker/push_beaker_image.sh --dry-run

The script auto-detects the image name from the tag (e.g., olmo-eval-cu128-trc291-amd64).

What's in the Image

The image contains:

Python 3.12 (via uv)
PyTorch with CUDA support
System dependencies (git, uv, ca-certificates)

The image does NOT contain:

olmo-eval source code (provided by Gantry at runtime)
olmo-eval dependencies like click, datasets, rich, etc. (installed at job startup)
Storage backends like boto3, psycopg (installed at job startup if needed)
Inference providers like vllm, transformers, litellm (installed at job startup)

Installing Inference Providers at Runtime

Inference providers are NOT baked into images. They are installed at job startup from the resolved provider configuration for each model:

# In config file
models:
  - llama3.1-8b
  - gpt-4o

# Or force the provider kind via a harness override
uv run olmo-eval beaker launch -n "eval" \
    --harness default \
    -o provider.kind=vllm_server \
    -m llama3.1-8b \
    -t mmlu \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Manual installation inside container
uv pip install -e '.[vllm]'  # includes vllm[runai]

Task-Specific Dependencies

Tasks can declare runtime dependencies that are installed at job startup (see Tasks). Dependencies are automatically merged, deduplicated, and installed after the inference provider.

You can also add or override dependencies via the CLI:

# Add dependencies to a task via -o flag
uv run olmo-eval beaker launch -n "eval" -m llama3.1-8b \
    -t humaneval:3shot:bpb -o 'dependencies=["code-sandbox==1.0", "git+https://github.com/user/[email protected]"]' \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Dependencies from multiple tasks are merged
uv run olmo-eval beaker launch -n "eval" -m llama3.1-8b \
    -t humaneval:3shot:bpb -o 'dependencies=["pkg1"]' \
    -t mbpp:3shot:bpb -o 'dependencies=["pkg2"]' \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"
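
Merging and deduplicating dependencies can be pictured with the sketch below. It is an assumption-laden illustration, not the launcher's logic: `merge_dependencies` is a hypothetical helper that drops exact duplicate strings and keeps first-seen order, and it makes no attempt to reconcile conflicting version pins.

```python
def merge_dependencies(*dependency_lists: list[str]) -> list[str]:
    """Concatenate dependency lists, dropping exact duplicates, keeping first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for deps in dependency_lists:
        for dep in deps:
            if dep not in seen:
                seen.add(dep)
                merged.append(dep)
    return merged


merged = merge_dependencies(["pkg1", "shared==1.0"], ["pkg2", "shared==1.0"])
print(merged)  # ['pkg1', 'shared==1.0', 'pkg2']
```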

Development

This repo uses uv with a checked-in uv.lock for reproducible installs. The default dependency groups (dev + vllm) are installed automatically, which covers storage, beaker, hf, and the vLLM inference provider.

# Install dependencies from the lockfile
uv sync --frozen

# Install pre-commit hooks
make setup

# Run linter / formatter
make lint
make fix    # auto-fix

# Run tests (and type checks)
make test
make verify

# Update the lockfile after editing pyproject.toml
uv lock

CI runs uv sync --frozen and uv run --frozen ..., so any change to pyproject.toml must be accompanied by a refreshed uv.lock.
