Turn any GitHub repository into a verifiable RL environment for training and evaluation.
Repo2RLEnv synthesizes verifiable training and evaluation data from existing repositories, exports it into a uniform spec, and pushes it straight to the Hugging Face Hub. The output spec is Harbor's, so every dataset you produce drops directly into any Harbor-compatible runtime — no glue code.
```bash
# Install (pick one)
uv add repo2rlenv        # add to a uv-managed project
uvx repo2rlenv --help    # one-shot, no install
pip install repo2rlenv   # classic

# Auth: nothing to set up if you've done `gh auth login` and `huggingface-cli login`.
# Otherwise: export GITHUB_TOKEN=... ; export HF_TOKEN=...

# Generate a dataset locally
repo2rlenv generate \
  --repo <owner>/<repo> \
  --pipeline pr_runtime \
  --pipeline-opt limit=5 \
  --llm anthropic/claude-sonnet-4-6 \
  --out ./datasets/<dataset-name>

# Validate (fast structural check) and publish
repo2rlenv validate ./datasets/<dataset-name>
repo2rlenv push ./datasets/<dataset-name> <your-org>/<dataset-name>

# Anyone can pull + run a published dataset on a fresh machine
repo2rlenv pull <your-org>/<dataset-name> ./datasets/<dataset-name>
harbor run -p ./datasets/<dataset-name> -a oracle --env docker
```

→ Explore and visualize any Harbor dataset pushed to the Hub: Harbor Visualizer
Full walkthrough in docs/quickstart.md.
Repo2RLEnv runs synthesis pipelines that read real repositories — source code, merged PRs, commits, CVEs — and use them as a seed to generate RL environments: tasks with a concrete, solvable objective and a programmatic reward (no human grading).
Input: any repo. Output: a runnable RL environment you can point any LLM or coding agent at.
```python
# every pipeline shares one contract: read a repo, emit verifiable tasks
class Pipeline(Protocol):
    name: ClassVar[PipelineName]

    def run(self, out_dir: Path) -> PipelineResult: ...  # writes tasks/<id>/
```

Generate from a repo, then run any agent against the result; the environment is scored automatically:
```bash
# 1. synthesize an environment from a repo
repo2rlenv generate --repo pallets/click --pipeline pr_runtime \
  --pipeline-opt limit=10 --llm anthropic/claude-sonnet-4-6 --out ./env-click

# 2. run an agent inside the sandbox (swap -a / -m for any of 25+ harnesses)
export ANTHROPIC_API_KEY=... OPENAI_API_KEY=...
harbor run -p ./env-click -a claude-code -m anthropic/claude-sonnet-4-6 --ae ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY --env docker
harbor run -p ./env-click -a openhands -m openai/gpt-4o --ae OPENAI_API_KEY=$OPENAI_API_KEY --env docker
harbor run -p ./env-click -a codex -m openai/o3 --ae OPENAI_API_KEY=$OPENAI_API_KEY --env docker
harbor run -p ./env-click -a hermes -m anthropic/claude-sonnet-4-6 --ae ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY --env docker
```

Each agent's per-task reward lands in `/logs/verifier/reward.json`, ready for training or eval.
A pipeline turns a repo into Harbor tasks. Two are stable and recommended for production; six are experimental — usable today (the CLI prints a warning before they run), with interfaces and output quality still evolving.
pr_diff mines merged pull-request diffs into lightweight, text-only tasks. The agent proposes an edit, and a verifier scores it against the real merged diff — on format, the files it touched, how much it changed, and (via an LLM judge) whether it's semantically right. No per-repo setup: every task ships a thin python:3.12-slim image.
→ Reference dataset: AdithyaSK/repo2rlenv-pr-diff (100 oracle-verified tasks).
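To make the "files it touched" component concrete, here is a minimal sketch. It is not Repo2RLEnv's actual verifier: it assumes both diffs are in unified format and scores file overlap with a Jaccard ratio, and the helper names `touched_files` and `file_targeting_score` are hypothetical.

```python
import re

# Matches the "+++ b/<path>" header that opens each file section of a unified diff.
_NEW_FILE_HEADER = re.compile(r"^\+\+\+ b/(.+)$", re.MULTILINE)


def touched_files(diff_text: str) -> set[str]:
    """Return the set of file paths a unified diff modifies."""
    return {path.strip() for path in _NEW_FILE_HEADER.findall(diff_text)}


def file_targeting_score(agent_diff: str, oracle_diff: str) -> float:
    """Jaccard overlap between files touched by the agent and by the oracle (assumed metric)."""
    agent, oracle = touched_files(agent_diff), touched_files(oracle_diff)
    if not agent and not oracle:
        return 1.0  # both diffs touch nothing: trivially identical
    return len(agent & oracle) / len(agent | oracle)


oracle = "--- a/src/core.py\n+++ b/src/core.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n"
agent = (
    "--- a/src/core.py\n+++ b/src/core.py\n@@ -1 +1 @@\n-x = 1\n+x = 3\n"
    "--- a/README.md\n+++ b/README.md\n@@ -1 +1 @@\n-old\n+new\n"
)
print(file_targeting_score(agent, oracle))  # 0.5
```

The agent here edits the right file plus one unrelated file, so it shares one of two touched files with the oracle. A real verifier would combine a component like this with format, change-size, and LLM-judge scores.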
pr_runtime is the SWE-bench-style flagship. It mines merged PRs and actually runs the repo's test suite inside a Docker sandbox: the tests the PR fixed must go from failing to passing under the gold patch, while the rest keep passing. That makes it the strongest, least-gameable signal of the set.
→ Reference dataset: AdithyaSK/repo2rlenv-pr-runtime (100 oracle-verified tasks).
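The fail-to-pass / pass-to-pass rule can be sketched as a small predicate. This is an illustration under assumptions, not the pipeline's code: test outcomes are modeled as a plain mapping from test ID to pass/fail, and `is_valid_task` is a hypothetical helper.

```python
def is_valid_task(
    before: dict[str, bool],
    after: dict[str, bool],
    fail_to_pass: set[str],
    pass_to_pass: set[str],
) -> bool:
    """Check the gold-patch invariant for one task.

    `before` and `after` map test IDs to True (passed) or False (failed),
    measured without and with the gold patch applied. A test missing from
    a mapping is treated as failing (assumption for this sketch).
    """
    # Tests the PR fixed must fail before the patch and pass after it.
    f2p_ok = all(not before.get(t, False) and after.get(t, False) for t in fail_to_pass)
    # The remaining selected tests must pass both before and after.
    p2p_ok = all(before.get(t, False) and after.get(t, False) for t in pass_to_pass)
    # Require at least one fail-to-pass test so the task carries a real signal.
    return bool(fail_to_pass) and f2p_ok and p2p_ok


before = {"test_parse": False, "test_help": True}
after_good = {"test_parse": True, "test_help": True}
after_regressed = {"test_parse": True, "test_help": False}

print(is_valid_task(before, after_good, {"test_parse"}, {"test_help"}))       # True
print(is_valid_task(before, after_regressed, {"test_parse"}, {"test_help"}))  # False
```

The second call fails because the patch fixes `test_parse` but breaks a test that previously passed.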
→ All reference datasets: Verifiable RL Environments collection
The six experimental pipelines run normally but emit a warning first; pin a release if you depend on them. Each links to its own page; the gist:
- `commit_runtime`: mines commit history directly, catching fixes that never went through a PR. Reference dataset: `AdithyaSK/repo2rlenv-commit-runtime` (52 oracle-verified envs).
- `cve_patches`: security tasks from public CVEs, mapped to their fix commits.
- `mutation_bugs`: injects synthetic bugs into real code; the agent must restore the tests to green.
- `code_instruct`: generates a problem + executable verifier from a real source file.
- `equivalence_tests`: the agent reimplements a real function; generated tests check it matches the original.
- `refactor_synthesis`: mines refactor commits and verifies behavior is preserved.
| Pipeline | Stability | Reward signal | Sandbox | LLM use | Languages |
|---|---|---|---|---|---|
| `pr_diff` | stable | `diff_similarity` | thin | at verify (judges the solution) | any |
| `pr_runtime` | stable | `test_execution` + `diff_similarity` | ✅ | at env build (one-time, cached) | Py · Go · Node · Rust |
| `commit_runtime` | experimental | `test_execution` + `diff_similarity` | ✅ | at env build (one-time, cached) | Py · Go · Node · Rust |
| `cve_patches` | experimental | `test_execution` + `diff_similarity` | ✅ | at env build (one-time, cached) | Py · Go · Node · Rust |
| `mutation_bugs` | experimental | `test_execution` | ✅ | at synthesis (writes the task) | Py |
| `code_instruct` | experimental | `test_execution` | ✅ | at synthesis (writes the task) | Py |
| `equivalence_tests` | experimental | `test_execution` | ✅ | at synthesis (writes the task) | Py |
| `refactor_synthesis` | experimental | `test_execution` + `diff_similarity` | ✅ | at env build (one-time, cached) | Py |
What the columns mean
- Reward signal: the verifiable signal emitted per task. `test_execution` = the repo's own tests gate the reward (fail-to-pass/pass-to-pass, abbreviated F2P/P2P, or pytest pass rate); `diff_similarity` = the agent's output is scored against the oracle diff (format, file targeting, region overlap, LLM judge). Pipelines that emit both use `test_execution` as the primary training signal.
- Sandbox: whether the task runs inside Docker. ✅ = a per-repo image is built once by the bootstrap phase and cached; `thin` = no bootstrap, just a generic `python:3.12-slim` image.
- LLM use: when a language model is invoked, which determines where your API cost goes:
  - at env build: only during bootstrap (constructing the Docker image); cached, so generation itself is LLM-free.
  - at synthesis: the model authors the task (problem + verifier) for every task generated.
  - at verify: the model judges the agent's solution at scoring time (one reward component), and degrades gracefully when no key is set.
- Languages: source languages the pipeline supports.
→ Full reference — per-pipeline options, reward design, and dataset cards: docs/pipelines/.
Sandbox pipelines need a working Docker environment for the target repo. Repo2RLEnv's bootstrap phase builds it automatically — an LLM agent iterates shell commands inside a fresh container until the repo builds and its test suite collects, then commits and content-addresses the image. The expensive step runs once per (repo, ref); every downstream task reuses the cache. pr_diff skips it entirely.
```bash
repo2rlenv bootstrap --repo <owner>/<repo> --llm anthropic/claude-sonnet-4-6
```

Design, cache layout, cost tracking: docs/reference/BOOTSTRAP.md.
A dataset that:
- Is verifiable: every task carries an executable test (`test_execution`) or a stored oracle diff (`diff_similarity`); your trainer picks the reward type.
- Is content-addressed: a `content_hash` over each task; identical artifacts ⇒ identical hash.
- Trains anywhere via Harbor: TRL, SkyRL, Prime-RL, Tinker, Miles, Slime, harbor.rl.
- Evaluates with any agent harness: Claude Code, OpenHands, Codex CLI, Gemini CLI, …
- Is language-agnostic by spec: runtime pipelines emit a Dockerfile + shell verifier; `pr_diff` is pure text and works for any language.
- Publishes natively to the Hub: `repo2rlenv push` writes a Harbor-compatible `registry.json` so consumers can `harbor download` (or `repo2rlenv pull`) with zero glue.
- Supports private repos end-to-end: `gh auth token` is resolved automatically; build secrets are declared by name; verifier-time secrets are forbidden by spec.
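The "identical artifacts ⇒ identical hash" property can be illustrated with a deterministic directory hash. This is a sketch under assumptions (SHA-256 over sorted relative paths and file bytes); `task_content_hash` and the file names are hypothetical, and the actual `content_hash` algorithm may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def task_content_hash(task_dir: Path) -> str:
    """Hash every file under task_dir in sorted order so the result is reproducible."""
    digest = hashlib.sha256()
    for path in sorted(p for p in task_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(task_dir).as_posix()
        data = path.read_bytes()
        # Length-prefix the path and the contents so field boundaries are unambiguous.
        digest.update(f"{len(rel)}:{rel}:{len(data)}:".encode("utf-8"))
        digest.update(data)
    return digest.hexdigest()


def write_task(root: Path, instruction: str) -> Path:
    """Write a tiny placeholder task (file names are illustrative only)."""
    (root / "tests").mkdir(parents=True)
    (root / "instruction.md").write_text(instruction)
    (root / "tests" / "test.sh").write_text("pytest -q\n")
    return root


with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    hash_a = task_content_hash(write_task(Path(a) / "t", "Fix the parser bug.\n"))
    hash_b = task_content_hash(write_task(Path(b) / "t", "Fix the parser bug.\n"))

print(hash_a == hash_b)  # True
```

Because only relative paths and file bytes feed the digest, the same task written to two different locations hashes identically, while any change to a file's contents produces a different hash.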
Our focus is synthesis — we don't reimplement sandboxes, agent harnesses, or a registry. Tasks are emitted in the Harbor format (with a small [metadata.repo2env] block for provenance: pipeline, base commit, PR URL, content hash, reward kinds), so they run on Harbor's existing stack — Local Docker / Modal / Daytona / E2B / Runloop, 25+ agent harnesses, parallel execution, and the publishing CLI.
Pipelines are pluggable by design — adding a synthesis strategy is the main way to extend Repo2RLEnv:
- Implement the `Pipeline` protocol (`name` + `run() -> PipelineResult`) in `src/repo2rlenv/pipelines/`.
- Register it in `PIPELINES` and add its options model; new pipelines start `experimental = True`.
- `uv run pytest tests/test_pipeline_contract.py` enforces the contract.
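As a rough illustration of the steps above, here is a toy pipeline that satisfies a protocol of the same shape. `PipelineResult`, the `PIPELINES` dict, and `EchoPipeline` are stand-ins defined locally for this sketch; the real types live in `src/repo2rlenv/` and their fields are not shown here, so treat every name below as an assumption.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import ClassVar, Protocol


@dataclass
class PipelineResult:
    """Stand-in result type; the real one may carry different fields."""
    task_ids: list[str] = field(default_factory=list)


class Pipeline(Protocol):
    """Local copy of the contract's shape: a name plus a run() method."""
    name: ClassVar[str]

    def run(self, out_dir: Path) -> PipelineResult: ...


class EchoPipeline:
    """Toy pipeline that writes one placeholder task directory."""
    name: ClassVar[str] = "echo"
    experimental: ClassVar[bool] = True  # new pipelines start experimental

    def run(self, out_dir: Path) -> PipelineResult:
        task_dir = out_dir / "tasks" / "echo-0001"
        task_dir.mkdir(parents=True, exist_ok=True)
        (task_dir / "instruction.md").write_text("Placeholder task.\n")
        return PipelineResult(task_ids=[task_dir.name])


# Stand-in registry keyed by pipeline name.
PIPELINES: dict[str, type] = {EchoPipeline.name: EchoPipeline}

with tempfile.TemporaryDirectory() as tmp:
    pipeline: Pipeline = PIPELINES["echo"]()
    result = pipeline.run(Path(tmp))
    wrote_task = (Path(tmp) / "tasks" / "echo-0001" / "instruction.md").exists()

print(result.task_ids, wrote_task)  # ['echo-0001'] True
```

A real pipeline would also declare an options model and emit full Harbor task directories; the cookbook referenced below covers those details.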
Full cookbook (oracle invariant, reward design, QA gate): docs/contributing/ADDING_A_PIPELINE.md. Issues and PRs welcome — see CONTRIBUTING.md.
- 🚀 `docs/quickstart.md`: install → first dataset → push, in 10 minutes
- 📖 `docs/pipelines/`: one page per pipeline (when to use, oracle shape, options)
- 📚 Reference contracts:
  - `REWARD_SCHEMA.md`: `reward.txt` + `reward.json` fields for every pipeline
  - `SPEC.md`: input/output contract
  - `API.md`: Python API for `src/repo2rlenv/`
  - `AUTH.md`: GitHub / HF / LLM auth resolution
  - `ENV.md`: every environment variable the tool reads, in one place
  - `BOOTSTRAP.md`: LLM-iterated per-repo Docker image
  - `AGENTS.md`: Harbor agent harnesses + RL trace plumbing
- 🛠 `CONTRIBUTING.md`: dev setup, PR conventions, release flow
- 🧪 `ADDING_A_PIPELINE.md`: cookbook for shipping a new pipeline
- 🔭 Harbor Visualizer: explore and inspect any Harbor dataset pushed to the Hub
- Harbor: the task format + runtime we adopt as our output spec
- RepoLaunch (Microsoft): LLM-agent env setup; our `bootstrap` is an independent reimplementation
- OpenReward: ORS protocol + extra trainer integrations above Harbor
- SWE-Gym: RL-environment framing for SWE-bench-style tasks
- verifiers (Prime Intellect), OpenEnv (Meta + HF): adjacent standardization efforts
Every pipeline that draws from external work carries an Acknowledgment block in its .py file. No code is copied — implementations are independent and Apache-2.0 licensed.
Apache 2.0. The original PR/commit contents remain under their respective source-repo licenses; datasets redistribute public commits for ML research under fair use.
