Repo2RLEnv

Turn any GitHub repository into a verifiable RL environment for training and evaluation.

Repo2RLEnv synthesizes verifiable training and evaluation data from existing repositories, exports it into a uniform spec, and pushes it straight to the Hugging Face Hub. The output spec is Harbor's, so every dataset you produce drops directly into any Harbor-compatible runtime — no glue code.

Quickstart

# Install (pick one)
uv add repo2rlenv                                 # add to a uv-managed project
uvx repo2rlenv --help                             # one-shot, no install
pip install repo2rlenv                            # classic

# Auth: nothing to set up if you've done `gh auth login` and `huggingface-cli login`.
# Otherwise:  export GITHUB_TOKEN=... ; export HF_TOKEN=...

# Generate a dataset locally
repo2rlenv generate \
  --repo <owner>/<repo> \
  --pipeline pr_runtime \
  --pipeline-opt limit=5 \
  --llm anthropic/claude-sonnet-4-6 \
  --out ./datasets/<dataset-name>

# Validate (fast structural check) and publish
repo2rlenv validate ./datasets/<dataset-name>
repo2rlenv push ./datasets/<dataset-name> <your-org>/<dataset-name>

# Anyone can pull + run a published dataset on a fresh machine
repo2rlenv pull <your-org>/<dataset-name> ./datasets/<dataset-name>
harbor run -p ./datasets/<dataset-name> -a oracle --env docker

→ Explore and visualize any Harbor dataset pushed to the Hub: Harbor Visualizer

Full walkthrough in docs/quickstart.md.

How it works

Repo2RLEnv runs synthesis pipelines that read real repositories — source code, merged PRs, commits, CVEs — and use them as a seed to generate RL environments: tasks with a concrete, solvable objective and a programmatic reward (no human grading).

Input: any repo. Output: a runnable RL environment you can point any LLM or coding agent at.

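# every pipeline shares one contract: read a repo, emit verifiable tasks
from pathlib import Path
from typing import ClassVar, Protocol

# PipelineName and PipelineResult are defined elsewhere in the repo2rlenv package
class Pipeline(Protocol):
    name: ClassVar[PipelineName]
    def run(self, out_dir: Path) -> PipelineResult: ...   # writes tasks/<id>/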
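
A minimal sketch of what satisfying this contract could look like. The `PipelineResult` dataclass below is a hypothetical stand-in (the real type and its fields live in the package and are not shown here), `name` is typed as plain `str` instead of `PipelineName`, and `ToyPipeline` with its placeholder `task.json` file is illustrative only, not a real Harbor task layout.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import ClassVar, Protocol


@dataclass
class PipelineResult:
    """Hypothetical stand-in for the package's result type."""
    task_ids: list[str] = field(default_factory=list)


class Pipeline(Protocol):
    name: ClassVar[str]  # assumption: the real contract uses PipelineName

    def run(self, out_dir: Path) -> PipelineResult: ...


class ToyPipeline:
    """Illustrative pipeline that writes one placeholder task directory."""
    name: ClassVar[str] = "toy"

    def run(self, out_dir: Path) -> PipelineResult:
        task_id = "toy-0001"
        task_dir = out_dir / "tasks" / task_id
        task_dir.mkdir(parents=True, exist_ok=True)
        # Placeholder payload; a real pipeline would emit a full task here.
        (task_dir / "task.json").write_text(json.dumps({"id": task_id}))
        return PipelineResult(task_ids=[task_id])


def run_pipeline(pipeline: Pipeline, out_dir: Path) -> PipelineResult:
    """Accepts any object that structurally matches the protocol."""
    return pipeline.run(out_dir)


with tempfile.TemporaryDirectory() as tmp:
    result = run_pipeline(ToyPipeline(), Path(tmp))
    written = sorted(p.name for p in (Path(tmp) / "tasks").iterdir())
```

Because `Pipeline` is a `Protocol`, `ToyPipeline` does not need to inherit from it; matching the attribute and method shape is enough for a type checker.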

Generate from a repo, then run any agent against the result — the environment is scored automatically:

# 1. synthesize an environment from a repo
repo2rlenv generate --repo pallets/click --pipeline pr_runtime \
  --pipeline-opt limit=10 --llm anthropic/claude-sonnet-4-6 --out ./env-click

# 2. run an agent inside the sandbox (swap -a / -m for any of 25+ harnesses)
export ANTHROPIC_API_KEY=...   OPENAI_API_KEY=...
harbor run -p ./env-click -a claude-code -m anthropic/claude-sonnet-4-6 --ae ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY --env docker
harbor run -p ./env-click -a openhands   -m openai/gpt-4o               --ae OPENAI_API_KEY=$OPENAI_API_KEY     --env docker
harbor run -p ./env-click -a codex       -m openai/o3                   --ae OPENAI_API_KEY=$OPENAI_API_KEY     --env docker
harbor run -p ./env-click -a hermes      -m anthropic/claude-sonnet-4-6 --ae ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY --env docker

Each agent's per-task reward lands in /logs/verifier/reward.json, ready for training or eval.
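
Once a run finishes, the per-task rewards can be aggregated for a quick comparison between agents. The sketch below is illustrative only: it assumes each `reward.json` is a JSON object with a numeric `reward` field and that the files have been copied out of the sandbox into a hypothetical `<run_dir>/<task_id>/reward.json` layout. The actual fields are specified in REWARD_SCHEMA.md.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean


def load_rewards(run_dir: Path) -> dict[str, float]:
    """Map task id -> reward, assuming <run_dir>/<task_id>/reward.json (hypothetical layout)."""
    rewards = {}
    for path in sorted(run_dir.glob("*/reward.json")):
        data = json.loads(path.read_text())
        rewards[path.parent.name] = float(data["reward"])  # "reward" key is an assumption
    return rewards


# Demo data, made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    for task_id, value in {"task-a": 1.0, "task-b": 0.0}.items():
        task_dir = Path(tmp) / task_id
        task_dir.mkdir()
        (task_dir / "reward.json").write_text(json.dumps({"reward": value}))
    rewards = load_rewards(Path(tmp))
    mean_reward = mean(rewards.values())
```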

Pipelines

A pipeline turns a repo into Harbor tasks. Two are stable and recommended for production; six are experimental — usable today (the CLI prints a warning before they run), with interfaces and output quality still evolving.

Stable

pr_diff mines merged pull-request diffs into lightweight, text-only tasks. The agent proposes an edit, and a verifier scores it against the real merged diff — on format, the files it touched, how much it changed, and (via an LLM judge) whether it's semantically right. No per-repo setup: every task ships a thin python:3.12-slim image. → Reference dataset: AdithyaSK/repo2rlenv-pr-diff (100 oracle-verified tasks).
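
Two of the components named above, which files were touched and how much was changed, can be illustrated with simple set and ratio arithmetic over unified diffs. This is a sketch under assumed formulas (Jaccard overlap and a min/max ratio), not the project's actual scoring function; the function names are hypothetical.

```python
def touched_files(diff: str) -> set[str]:
    """Collect target paths from the '+++ b/<path>' headers of a unified diff."""
    prefix = "+++ b/"
    return {
        line[len(prefix):].strip()
        for line in diff.splitlines()
        if line.startswith(prefix)
    }


def changed_line_count(diff: str) -> int:
    """Count added/removed lines, skipping '+++' / '---' file headers (naive)."""
    return sum(
        1
        for line in diff.splitlines()
        if (line.startswith("+") and not line.startswith("+++"))
        or (line.startswith("-") and not line.startswith("---"))
    )


def file_overlap(agent_diff: str, oracle_diff: str) -> float:
    """Jaccard overlap of touched files; 1.0 means exactly the same files."""
    a, b = touched_files(agent_diff), touched_files(oracle_diff)
    return len(a & b) / len(a | b) if a | b else 1.0


def size_ratio(agent_diff: str, oracle_diff: str) -> float:
    """Smaller change count divided by larger; 1.0 means equal size."""
    a, b = changed_line_count(agent_diff), changed_line_count(oracle_diff)
    return min(a, b) / max(a, b) if max(a, b) else 1.0


ORACLE = """--- a/src/app.py
+++ b/src/app.py
@@ -1,2 +1,2 @@
-x = 1
+x = 2
"""

AGENT = """--- a/src/app.py
+++ b/src/app.py
@@ -1,2 +1,3 @@
-x = 1
+x = 2
+y = 3
--- a/README.md
+++ b/README.md
@@ -1 +1 @@
-old
+new
"""

overlap = file_overlap(AGENT, ORACLE)  # one shared file out of two touched
ratio = size_ratio(AGENT, ORACLE)      # 2 oracle changes vs 5 agent changes
```

Here the agent edits an extra file and changes more lines than the merged diff, so both components fall below 1.0 even though the core edit matches.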

pr_runtime is the SWE-bench-style flagship. It mines merged PRs and actually runs the repo's test suite inside a Docker sandbox: the tests the PR fixed must go from failing to passing under the gold patch, while the rest keep passing. That makes it the strongest, least-gameable signal of the set. → Reference dataset: AdithyaSK/repo2rlenv-pr-runtime (100 oracle-verified tasks).
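
The fail-to-pass / pass-to-pass rule can be sketched in a few lines. The helper names and the dictionary representation of test outcomes are assumptions made for illustration; the real pipeline runs the repo's test suite in Docker and is not reproduced here.

```python
def split_tests(before: dict[str, bool], after: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Classify tests by their outcome before and after the gold patch.

    Tests missing from `before` (e.g. added by the PR) count as failing before.
    """
    f2p = sorted(t for t, ok in after.items() if ok and not before.get(t, False))
    p2p = sorted(t for t, ok in after.items() if ok and before.get(t, False))
    return f2p, p2p


def resolved(results: dict[str, bool], f2p: list[str], p2p: list[str]) -> bool:
    """A candidate patch resolves the task only if every F2P and P2P test passes."""
    return bool(f2p) and all(results.get(t, False) for t in f2p + p2p)


# Outcomes on the base commit and with the gold patch applied (made-up data).
before = {"test_parse": False, "test_help": True, "test_cli": True}
gold_after = {"test_parse": True, "test_help": True, "test_cli": True}
f2p, p2p = split_tests(before, gold_after)

# One candidate fixes the bug cleanly; another fixes it but breaks test_help.
clean_fix = resolved({"test_parse": True, "test_help": True, "test_cli": True}, f2p, p2p)
regression = resolved({"test_parse": True, "test_help": False, "test_cli": True}, f2p, p2p)
```

Requiring a non-empty fail-to-pass set is what makes the signal hard to game: a patch that changes nothing cannot score, and a patch that fixes the target test by breaking others cannot either.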

→ All reference datasets: Verifiable RL Environments collection

Experimental

These run normally but emit a warning first — pin a release if you depend on them. Each links to its own page; the gist:

commit_runtime — mines commit history directly, catching fixes that never went through a PR. Reference dataset: AdithyaSK/repo2rlenv-commit-runtime (52 oracle-verified envs).
cve_patches — security tasks from public CVEs, mapped to their fix commits.
mutation_bugs — injects synthetic bugs into real code; the agent must restore the tests to green.
code_instruct — generates a problem + executable verifier from a real source file.
equivalence_tests — the agent reimplements a real function; generated tests check it matches the original.
refactor_synthesis — mines refactor commits and verifies behavior is preserved.

At a glance

| Pipeline | Stability | Reward signal | Sandbox | LLM use | Languages |
| --- | --- | --- | --- | --- | --- |
| `pr_diff` | stable | `diff_similarity` | thin | at verify: judges the solution | any |
| `pr_runtime` | stable | `test_execution` + `diff_similarity` | ✅ | at env build: one-time, cached | Py · Go · Node · Rust |
| `commit_runtime` | experimental | `test_execution` + `diff_similarity` | ✅ | at env build: one-time, cached | Py · Go · Node · Rust |
| `cve_patches` | experimental | `test_execution` + `diff_similarity` | ✅ | at env build: one-time, cached | Py · Go · Node · Rust |
| `mutation_bugs` | experimental | `test_execution` | ✅ | at synthesis: writes the task | Py |
| `code_instruct` | experimental | `test_execution` | ✅ | at synthesis: writes the task | Py |
| `equivalence_tests` | experimental | `test_execution` | ✅ | at synthesis: writes the task | Py |
| `refactor_synthesis` | experimental | `test_execution` + `diff_similarity` | ✅ | at env build: one-time, cached | Py |

What the columns mean

Reward signal — the verifiable signal emitted per task. test_execution = the repo's own tests gate the reward (F2P/P2P or pytest pass rate); diff_similarity = the agent's output is scored against the oracle diff (format, file targeting, region overlap, LLM judge). Pipelines that emit both use test_execution as the primary training signal.
Sandbox — whether the task runs inside Docker. ✅ = a per-repo image is built once by the bootstrap phase and cached; thin = no bootstrap, just a generic python:3.12-slim image.
LLM use — when a language model is invoked, which sets where your API cost goes:
- at env build — only during bootstrap (constructing the Docker image); cached, so generation itself is LLM-free.
- at synthesis — the model authors the task (problem + verifier) for every task generated.
- at verify — the model judges the agent's solution at scoring time (one reward component), and degrades gracefully when no key is set.
Languages — source languages the pipeline supports.

→ Full reference — per-pipeline options, reward design, and dataset cards: docs/pipelines/.

Bootstrap

Sandbox pipelines need a working Docker environment for the target repo. Repo2RLEnv's bootstrap phase builds it automatically — an LLM agent iterates shell commands inside a fresh container until the repo builds and its test suite collects, then commits and content-addresses the image. The expensive step runs once per (repo, ref); every downstream task reuses the cache. pr_diff skips it entirely.
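
The "runs once per (repo, ref)" behavior amounts to a keyed cache in front of an expensive build. The sketch below illustrates that idea only: the `ImageCache` class, the key derivation, and `fake_build` are hypothetical, and the real cache layout is described in docs/reference/BOOTSTRAP.md.

```python
import hashlib


class ImageCache:
    """In-memory stand-in for a per-(repo, ref) image cache."""

    def __init__(self, build_fn):
        self._build_fn = build_fn
        self._images = {}
        self.builds = 0  # how many times the expensive step actually ran

    @staticmethod
    def key(repo: str, ref: str) -> str:
        # Assumed key derivation; the real scheme may differ.
        return hashlib.sha256(f"{repo}@{ref}".encode()).hexdigest()[:16]

    def get(self, repo: str, ref: str) -> str:
        k = self.key(repo, ref)
        if k not in self._images:
            self._images[k] = self._build_fn(repo, ref)
            self.builds += 1
        return self._images[k]


def fake_build(repo: str, ref: str) -> str:
    """Placeholder for the LLM-driven bootstrap; returns a fake image tag."""
    return f"image-for-{repo}@{ref}"


cache = ImageCache(fake_build)
first = cache.get("pallets/click", "abc123")
second = cache.get("pallets/click", "abc123")  # cache hit, no rebuild
other = cache.get("pallets/click", "def456")   # new ref, one more build
```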

repo2rlenv bootstrap --repo <owner>/<repo> --llm anthropic/claude-sonnet-4-6

Design, cache layout, cost tracking: docs/reference/BOOTSTRAP.md.

What you get out

A dataset that:

Is verifiable — every task carries an executable test (test_execution) or a stored oracle diff (diff_similarity); your trainer picks the reward type.
Is content-addressed — a content_hash over each task; identical artifacts ⇒ identical hash.
Trains anywhere via Harbor — TRL, SkyRL, Prime-RL, Tinker, Miles, Slime, harbor.rl.
Evaluates with any agent harness — Claude Code, OpenHands, Codex CLI, Gemini CLI, …
Is language-agnostic by spec — runtime pipelines emit a Dockerfile + shell verifier; pr_diff is pure text and works for any language.
Publishes natively to the Hub — repo2rlenv push writes a Harbor-compatible registry.json so consumers harbor download (or repo2rlenv pull) with zero glue.
Supports private repos end-to-end — gh auth token resolved automatically; build secrets declared by name; verifier-time secrets forbidden by spec.
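
The content-addressing property ("identical artifacts ⇒ identical hash") can be illustrated with a deterministic hash over a task directory. This is a sketch, assuming SHA-256 over sorted relative paths and file bytes; the project's actual `content_hash` definition is not given in this README and may cover different inputs. The file names are made up.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(task_dir: Path) -> str:
    """Hash every file under task_dir in sorted order so the result is reproducible."""
    digest = hashlib.sha256()
    for path in sorted(p for p in task_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(task_dir).as_posix().encode())
        digest.update(b"\0")  # separator between path and contents
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def write_task(root: Path, files: dict[str, str]) -> Path:
    """Write a {relative path: text} mapping under root."""
    for rel, text in files.items():
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text)
    return root


files = {"instruction.md": "Fix the bug.\n", "tests/test.sh": "pytest -q\n"}
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    hash_a = content_hash(write_task(Path(a), files))
    # Same files written in a different order and location: same hash.
    hash_b = content_hash(write_task(Path(b), dict(reversed(list(files.items())))))
    # Changing one file's contents changes the hash.
    hash_c = content_hash(write_task(Path(b), {"instruction.md": "Fix another bug.\n"}))
```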

Under the hood

Our focus is synthesis — we don't reimplement sandboxes, agent harnesses, or a registry. Tasks are emitted in the Harbor format (with a small [metadata.repo2env] block for provenance: pipeline, base commit, PR URL, content hash, reward kinds), so they run on Harbor's existing stack — Local Docker / Modal / Daytona / E2B / Runloop, 25+ agent harnesses, parallel execution, and the publishing CLI.

Contributing a pipeline

Pipelines are pluggable by design — adding a synthesis strategy is the main way to extend Repo2RLEnv:

Implement the Pipeline protocol (name + run() -> PipelineResult) in src/repo2rlenv/pipelines/.
Register it in PIPELINES and add its options model; new pipelines start experimental = True.
Verify the contract with uv run pytest tests/test_pipeline_contract.py.
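
The registration and contract-check steps above can be sketched as follows. Everything here is hypothetical: the shape of the `PIPELINES` registry, the `experimental` class attribute, and the `contract_violations` helper are assumptions for illustration, and the real checks live in tests/test_pipeline_contract.py.

```python
from pathlib import Path
from typing import ClassVar


class EchoPipeline:
    """Hypothetical new pipeline; starts out experimental."""
    name: ClassVar[str] = "echo"
    experimental: ClassVar[bool] = True

    def run(self, out_dir: Path):
        (out_dir / "tasks").mkdir(parents=True, exist_ok=True)


class Broken:
    """Missing run(), so it should fail the check."""
    name = "broken"


def contract_violations(cls) -> list[str]:
    """List the ways a class fails the name + run() shape."""
    problems = []
    if not isinstance(getattr(cls, "name", None), str):
        problems.append("missing string `name`")
    if not callable(getattr(cls, "run", None)):
        problems.append("missing callable `run`")
    return problems


# Assumed registry shape: pipeline name -> class.
PIPELINES = {EchoPipeline.name: EchoPipeline}

ok = contract_violations(PIPELINES["echo"])
bad = contract_violations(Broken)
```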

Full cookbook (oracle invariant, reward design, QA gate): docs/contributing/ADDING_A_PIPELINE.md. Issues and PRs welcome — see CONTRIBUTING.md.

Documentation

🚀 docs/quickstart.md — install → first dataset → push, in 10 minutes
📖 docs/pipelines/ — one page per pipeline (when to use, oracle shape, options)
📚 Reference contracts:
- REWARD_SCHEMA.md — reward.txt + reward.json fields for every pipeline
- SPEC.md — input/output contract
- API.md — Python API for src/repo2rlenv/
- AUTH.md — GitHub / HF / LLM auth resolution
- ENV.md — every environment variable the tool reads, in one place
- BOOTSTRAP.md — LLM-iterated per-repo Docker image
- AGENTS.md — Harbor agent harnesses + RL trace plumbing
🛠 CONTRIBUTING.md — dev setup, PR conventions, release flow
🧪 ADDING_A_PIPELINE.md — cookbook for shipping a new pipeline
🔭 Harbor Visualizer — explore and inspect any Harbor dataset pushed to the Hub

Adjacent projects

Harbor — the task format + runtime we adopt as our output spec
RepoLaunch (Microsoft) — LLM-agent env setup; our bootstrap is an independent reimplementation
OpenReward — ORS protocol + extra trainer integrations above Harbor
SWE-Gym — RL-environment framing for SWE-bench-style tasks
verifiers (Prime Intellect), OpenEnv (Meta + HF) — adjacent standardization efforts

Every pipeline that draws from external work carries an Acknowledgment block in its .py file. No code is copied — implementations are independent and Apache-2.0 licensed.

License

Apache 2.0. The original PR/commit contents remain under their respective source-repo licenses; datasets redistribute public commits for ML research under fair use.
