Replication artifact for the SWE 699 final report (George Mason University, Spring 2026).
Stdlib-only script (Python 3.8+, no uv or LLM key required) that computes the paper's result numbers from data/ in a single run.

python3 compute_all_results.py

The expected stdout is documented in EXPECTED-OUTPUT.md. The Data: line near the top shows the local clone path and therefore varies between machines.
- uv (Python version, package, and venv manager): `curl -LsSf https://astral.sh/uv/install.sh | sh` (macOS/Linux), `brew install uv`, or see the uv docs for Windows.
- git 2.5+.

uv sync

To revert any local changes after a run, run `git restore .` or re-clone.
Copy the whole code block to run it as one command, or copy individual lines to run each step on its own. The block stops on the first failure.
(
set -e
# Skip `relo collect`: it needs GITLAB_TOKEN, and its previous outputs are in data/ase/1_collected/.
# uv run relo collect ase --merged-before 2026-04-17
uv run relo filter ase
uv run relo label ase
uv run relox split ase
uv run relo optimize ase --mr-range most_recent_100 -y
uv run relo optimize ase --mr-range all -y
uv run relo categorize ase --snapshot tuned_100
uv run relox review ase
uv run relox evaluate ase
uv run python scripts/pooled_stats.py # paired McNemar / Holm-Bonferroni / CMH
uv run python scripts/pooled_aggregates.py # weighted ROUGE-L / ChrF++
uv run python scripts/rule_counts.py # per-dimension rule counts
)

Compare the `relox evaluate ase` stdout to EXPECTED-OUTPUT.md. Per-condition recall must match exactly; ROUGE-L / ChrF++ may differ in the 4th decimal. EXPECTED-OUTPUT.md also lists headline numbers for mermaid and storybook (the shipped data covers all 3 subjects; swap ase for mermaid or storybook to re-run the per-project commands on those).
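The comparison rule above (exact recall, small slack on the similarity metrics) can be sketched as follows. This is a minimal illustration, not the artifact's own checker: the function name `metric_mismatches`, the metric keys, the tolerance of `1e-3`, and the sample numbers are all assumptions chosen for the example.

```python
import math

def metric_mismatches(expected, actual,
                      fuzzy_keys=("rouge_l", "chrf_pp"), abs_tol=1e-3):
    """Return the metric names whose values disagree.

    Keys in fuzzy_keys are compared with an absolute tolerance
    (hypothetical stand-in for "may differ in the 4th decimal");
    every other key must match exactly.
    """
    mismatches = []
    for key, want in expected.items():
        got = actual.get(key)
        if got is None:
            mismatches.append(key)  # metric missing from the new run
        elif key in fuzzy_keys:
            if not math.isclose(want, got, rel_tol=0.0, abs_tol=abs_tol):
                mismatches.append(key)
        elif want != got:
            mismatches.append(key)
    return mismatches

# Placeholder numbers for illustration only; they are not the paper's results.
expected = {"recall": 0.25, "rouge_l": 0.1234, "chrf_pp": 0.2000}
rerun = {"recall": 0.25, "rouge_l": 0.1236, "chrf_pp": 0.2000}
print(metric_mismatches(expected, rerun))  # empty list: the run is accepted
```

A run whose recall differs at all would be reported, while a ROUGE-L drift of 0.0002 would not.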
uv run pytest tests/

# Install Claude Code: https://docs.claude.com/en/docs/claude-code
claude /login # one-time OAuth setup
export REVIEWLORE_BACKEND=claude-code
uv run relox review ase --condition tuned_100 --force # makes real LLM calls
uv run relox evaluate ase
unset REVIEWLORE_BACKEND # back to replay

Backend precedence (highest wins): --backend flag > REVIEWLORE_BACKEND env var > reviewlore.yaml (backend.name).
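The precedence order can be pictured with a short sketch. It assumes the YAML file has already been parsed into a dict; `resolve_backend` is a hypothetical helper written for this example and is not taken from the artifact's source.

```python
from typing import Mapping, Optional

def resolve_backend(cli_flag: Optional[str],
                    env: Mapping[str, str],
                    yaml_config: Mapping[str, dict],
                    default: str = "replay") -> str:
    """Pick the backend name; the highest-precedence source that is set wins."""
    if cli_flag:                                 # 1. --backend flag
        return cli_flag
    env_value = env.get("REVIEWLORE_BACKEND")    # 2. environment variable
    if env_value:
        return env_value
    # 3. reviewlore.yaml (backend.name), then the assumed default
    return yaml_config.get("backend", {}).get("name") or default

config = {"backend": {"name": "replay"}}
print(resolve_backend(None, {}, config))                                         # replay
print(resolve_backend(None, {"REVIEWLORE_BACKEND": "claude-code"}, config))      # claude-code
print(resolve_backend("replay", {"REVIEWLORE_BACKEND": "claude-code"}, config))  # replay
```

Passing the environment as a mapping, instead of reading `os.environ` inside the function, keeps the sketch easy to test.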
Live mode overwrites the shipped envelopes under data/<project>/{6_reviews,7_judgments}/; run `git restore data/` to revert. To target SHAs outside the original test set, run `git -C subject-repos/<project> fetch origin <sha>` first. A full 3-project x 3-condition live re-run takes ~25 hours of wall-clock time and ~$480 in Claude Code subscription model usage.
Every file and directory in this artifact, with a 1-2 sentence description.
- `.gitattributes` -- marks bundled bare-repo pack and index files as binary so git does not try to diff them.
- `.gitignore` -- excludes local runtime scratch (caches, virtualenvs), regenerated sample worktrees, trial / live output trees, and platform metadata.
- `README.md` -- this file.
- `EXPECTED-OUTPUT.md` -- expected stdout for `python3 compute_all_results.py` (saved data verification) and `relox evaluate ase` (full-pipeline), plus per-subject headline recall numbers for storybook, mermaid, and ase.
- `compute_all_results.py` -- no-dependency saved-data verification; walks `data/` and prints the paper's result numbers with labels. Stdlib only, runs on Python 3.8+.
- `pyproject.toml` -- Python project metadata: Python 3.13 floor, runtime deps (httpx, click, pyyaml, NLTK, sacrebleu, etc.), dev deps (pytest), and the `[project.scripts]` entries that expose `relo` (tuning CLI) and `relox` (experiment CLI).
- `uv.lock` -- fully-pinned dependency lockfile consumed by `uv sync` for bit-identical environments.
- `reviewlore.yaml` -- single source of truth for runtime config: default backend (replay), per-stage models / budgets / prompt paths, optimizer batch size, concurrency limits, filter rules, conditions list, and per-project (platform, repo, language, subject_repo) entries.
- `prompts/reviewer_system.md` (v3) -- generic reviewer system prompt. The tuned conditions append the project's tuned snapshot (`tuned_100.md` or `tuned_200.md`) at runtime.
- `prompts/judge_system.md` (v3) -- judge system prompt. v3 implements coverage-based matching (many-to-one and one-to-many both allowed) and emits `verdicts.json` with per-comment `match`/`no_match` decisions plus reasoning.
- `prompts/optimizer_extract.md` (v2) -- per-MR extraction prompt; one call per tuning MR produces candidate rules from human review comments + diff.
- `prompts/optimizer_consolidate.md` (v2) -- batch consolidation prompt; merges, dedupes, and prunes candidate rules in chronological batches of 10 to produce the running rule set.
- `prompts/filter_actionability.md` (v1) -- Stage 3 actionability filter prompt run by `relo filter`; classifies each surviving comment as actionable or not.
- `prompts/label_bitsai.md` (v1) -- BitsAI-CR taxonomy labeling prompt; tags each filtered comment with one of {Code Defect, Code Style, Maintenance & Readability, Performance}.
- `prompts/categorize_rules.md` (v1) -- post-hoc categorizer prompt; assigns BitsAI-CR sub-category, generalizability (project-specific vs generalizable), and lintability (lintable / partially_lintable / requires_llm) to each rule in a tuned snapshot.
- `__init__.py` -- exposes `__version__`.
- `backends/__init__.py` -- empty package marker.
- `backends/base.py` -- agent backend abstraction: `AgentResult` dataclass, `AgentBackend` protocol, JSON-from-text extraction (envelope + 4-strategy fallback parser), and the in-process backend registry (`register_backend`, `get_backend`).
- `backends/claude_code.py` -- live `claude-code` backend: spawns the Claude Code CLI as a subprocess, passes prompts via stdin, captures the `--output-format json` envelope, and returns an `AgentResult`.
- `backends/replay.py` -- replay backend used by the artifact: reads saved `raw.json` envelopes, runs the same parser the live backend runs, returns indistinguishable `AgentResult`s, raises `ReplayMissingError` on cache miss when `strict=true` (the artifact default).
- `experiment/__init__.py` -- empty package marker.
- `experiment/cli.py` -- `relox` Click entry point (review, judge, evaluate, validate, status, case-studies).
- `experiment/review.py` -- reviewer agent driver: condition-aware prompt assembly (generic vs tuned with the rule snapshot appended) and per-revision invocation with retry/recovery.
- `experiment/judge.py` -- judge agent driver: AI-comment formatting with explicit `index` fields, output validation, and per-revision invocation with retry.
- `experiment/review_judge.py` -- review + judge orchestrator: per-revision worktree lifecycle, in-process worktree lock, thread-pooled parallel execution, skip-existing semantics, force-rerun support.
- `experiment/evaluate.py` -- recall computation, ROUGE-L / ChrF++ scoring, threshold calibration against judge agreement, McNemar's exact tests, paired-overlap tables, per-category recall, cost + token aggregation. Includes the shallow-bundle two-dot fallback for `_load_touched_files_per_rev` so the artifact's bare repos resolve reachability.
- `experiment/case_studies.py` -- mines `recall_vs_similarity.json` (cases where judge verdict and ROUGE-L disagree, bucketed A/B/C/D) and `lost_comments.json` (the c-cell of the McNemar table: comments that generic caught but tuned-100 missed).
- `experiment/validate.py` -- verdict-level case-cohort sampling for blind judge validation: stratifies on (project, judge_verdict), seeded-shuffles each pool, builds the manifest, supports immutable-floor extension and census fallback for rare classes.
- `experiment/validate_audit.py` -- judge-validation audit helpers: same-line no-match auto-confirm, rater-label rollup, per-class agreement, Cohen's kappa, M-of-N counts.
- `experiment/split.py` -- temporal split into tuning/test sets with per-revision testability assessment (non-squash merge + reachable commit + non-final-revision comments).
- `experiment/status.py` -- pipeline status reporting for `relox status`; combines tuner status with experiment-specific stages (split, review, judge, evaluate).
- `experiment/survey.py` -- stub for the project-discovery survey step; CP4 deferred this in favor of the existing 298-repo survey results documented in the paper.
- `tuner/__init__.py` -- empty package marker.
- `tuner/cli.py` -- `relo` Click entry point (collect, filter, label, split, optimize, categorize, status).
- `tuner/config.py` -- YAML config loader, dataclass models for `Config`/`BackendConfig`/`PromptConfig`/`ProjectConfig`, `resolve_output_dir` (CLI flag > `RELO_OUT` env var > yaml `output_dir`), `resolve_project`, `get_prompt`.
- `tuner/dirs.py` -- single source of truth for the pipeline's numbered output directory names (`1_collected`, ..., `9_evaluation`).
- `tuner/state.py` -- atomic JSON writes (write-temp-then-rename), metadata read/write, staleness detection.
- `tuner/collect.py` -- MR/PR collection from GitHub GraphQL and GitLab REST APIs; pins `originalLine` in the GraphQL query so outdated review threads keep their anchor line numbers.
- `tuner/filter.py` -- three-stage comment filter: bot/structural removal, resolution-state tagging, and the LLM Stage 3 actionability classifier.
- `tuner/label.py` -- BitsAI-CR taxonomy labeling pass over the filtered comments.
- `tuner/optimize.py` -- two-phase rule induction: parallel per-MR extraction (Phase 2a) followed by sequential, batch-of-10 consolidation (Phase 2b) that emits the tuned snapshots.
- `tuner/categorize.py` -- post-hoc categorizer: per-rule BitsAI sub-category + generalizability + lintability axes, run once per project per snapshot for the CP6 RQ2 expansion and lint-vs-LLM Discussion section.
- `tuner/status.py` -- pipeline status reporting for `relo status`.
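The write-temp-then-rename pattern mentioned for `tuner/state.py` is a common way to avoid leaving a half-written JSON file behind if a run is interrupted. The sketch below shows the general technique only; `write_json_atomic` is a hypothetical name, and the artifact's actual helper may differ in naming and details.

```python
import json
import os
import tempfile
from pathlib import Path

def write_json_atomic(path: Path, payload: dict) -> None:
    """Write payload to path so readers see either the old or the new file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem, which is what makes it atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent,
                                    prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())   # push bytes to disk before renaming
        os.replace(tmp_name, path)      # atomic swap on POSIX and Windows
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)         # do not leave a stray temp file
        raise

with tempfile.TemporaryDirectory() as scratch:
    target = Path(scratch) / "metadata.json"
    write_json_atomic(target, {"stage": "example"})
    print(json.loads(target.read_text(encoding="utf-8")))
```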
- `tests/__init__.py`, `tests/backends/__init__.py`, `tests/experiment/__init__.py`, `tests/tuner/__init__.py` -- empty package markers.
- `tests/backends/test_replay.py` -- replay backend round-trip, `ReplayMissingError` on cache miss, permissive (`strict=false`) mode, registry presence, envelope-and-marker parity with the live backend.
- `tests/experiment/conftest.py` -- shared fixtures: synthetic data/ trees and synthetic git repos so validate.py tests run without touching real subject clones.
- `tests/experiment/test_evaluate_reachability.py` -- regression test for the shallow-bundle two-dot fallback in `_load_touched_files_per_rev`; reproduces the orphan-history scenario the artifact's bare bundles trigger.
- `tests/experiment/test_evaluate_tokens.py` -- pins the token aggregation fallback that re-derives `input_tokens`/`output_tokens` from the nested envelope `usage` block (the CP4/CP5 top-level fields were always 0).
- `tests/experiment/test_judge.py` -- judge output validation / normalization and index-aware comment formatting.
- `tests/experiment/test_validate.py` -- 18 tests covering case-cohort sampling: seed determinism, verdict-pool shuffle stability, census fallback, target-raise extension semantics, drift guards, manifest atomic write, CLI smoke test.
- `tests/tuner/test_collect_normalize.py` -- pins `originalLine` into the GitHub GraphQL query and asserts the `line -> originalLine` fallback in the comment normalizer; a regression here would resurface the outdated-line-number bug the judge-fix track addressed.
- `scripts/pooled_stats.py` -- pooled and stratified statistical tests for paired conditions: per-project pairwise McNemar with Holm-Bonferroni correction, pooled McNemar, Cochran-Mantel-Haenszel test, ensemble recall promotion thresholds.
- `scripts/pooled_aggregates.py` -- pooled, weighted non-LLM metric aggregates (ROUGE-L, ChrF++) across projects, weighted by per-stratum `n_comments`.
- `scripts/rule_counts.py` -- per-BitsAI-dimension rule counts per project per volume; reads tuned snapshots under `5_optimization/snapshots/`.
- `scripts/reconstruct_sample_worktrees.sh` -- bash helper that provisions each `data/judge_validation/samples/sample_NN/worktree/` directory from the corresponding bundled bare subject-repo at the recorded `head_sha`; also patches each `workspace.code-workspace` to re-add the worktree folder entry. Run from the artifact root.
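For readers unfamiliar with the paired tests named above, here is a stdlib-only sketch of an exact two-sided McNemar test followed by a Holm-Bonferroni adjustment. It illustrates the standard textbook procedures and is not copied from `scripts/pooled_stats.py`; the function names and the discordant-pair counts are made up for the example.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts b and c.

    Under the null hypothesis each discordant pair falls in either cell
    with probability 0.5, so the test is a binomial test on b + c trials.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm_bonferroni(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted

# Hypothetical (b, c) counts for three pairwise comparisons; not paper data.
pairs = [(1, 9), (5, 5), (2, 12)]
raw = [mcnemar_exact(b, c) for b, c in pairs]
print(raw)
print(holm_bonferroni(raw))
```

The exact form is usually preferred over the chi-square approximation when the number of discordant pairs is small.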
For each project <P> in {storybook, mermaid, ase}:
- `data/<P>/1_collected/` -- raw GitHub/GitLab MR/PR data, one `pr_<N>.json` per merged PR (250 per project; 220 for ASE, since the project did not have 250 qualifying merged MRs in the collection window). `metadata.json` records the collection timestamp + collection config.
- `data/<P>/2_filtered/` -- output of the three-stage filter (bot / resolution / actionability), one `pr_<N>.json` per surviving PR.
- `data/<P>/3_labeled/` -- BitsAI-CR taxonomy labels per surviving thread, mirroring the `2_filtered/` layout.
- `data/<P>/4_split/` -- temporal tuning/test split: `tuning.txt` / `test.txt` are flat PR-number lists, `testable_revisions.json` lists the per-revision testability metadata (head_sha, merge_base_sha, comment counts) the reviewer + judge pipeline iterates over, `metadata.json` records the split timestamp and cutoff.
- `data/<P>/5_optimization/` -- optimizer state and outputs:
  - `extraction/` -- per-MR candidate rules from Phase 2a (one `pr_<N>.json` per tuning MR plus `metadata.json`).
  - `consolidation_100/` -- Phase 2b batch outputs for the tuned-100 snapshot (`batch_01.json` through `batch_NN.json` plus `metadata.json`).
  - `consolidation_200/` -- Phase 2b batch outputs for the tuned-200 snapshot.
  - `snapshots/tuned_100.{json,md}`, `snapshots/tuned_200.{json,md}` -- final consolidated rule sets in machine and human form.
  - `snapshots/tuned_100_categorized.{json,md}` plus `_metadata.json` and `_raw.json` -- output of `relo categorize` with the lintability / generalizability axes.
  - `snapshots/full_pass/snapshot_NNN.{json,md}` -- intermediate snapshots taken every 25 MRs during the full-200 consolidation pass.
- `data/<P>/6_reviews/<condition>/pr_<N>/rev_<sha>/` -- reviewer outputs per (condition, PR, revision):
  - `raw.json` -- the saved Claude Code envelope (input to the replay backend; never modified by replication runs).
  - `review.json` -- parsed reviewer output: comments with `file_path`, `line_number`, `severity`, `message`, and the explicit `index` field added during judge-fix-2.
  - `context.json` -- the full reviewer prompt context (file diff, existing human comments, etc.).
  - `system_prompt.md` -- the exact reviewer system prompt used for this run (generic or tuned snapshot).
- `data/<P>/7_judgments/<condition>/pr_<N>/rev_<sha>/` -- judge outputs per (condition, PR, revision):
  - `raw.json` -- saved judge envelope.
  - `verdicts.json` -- per-human-comment verdict (`match` or `no_match`), `matched_ai_comment_index`, `reasoning`.
- `data/<P>/9_evaluation/` -- aggregated metrics:
  - `metrics.json` -- recall, ROUGE-L, ChrF++, McNemar, threshold calibration, paired overlap, per-category recall, cost, tokens. This is the file the quick-start regenerates.
  - `comparison.md` -- Markdown rendering of the same numbers.
  - `metadata.json` -- evaluate run timestamp + provenance.
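As a rough picture of how per-comment verdicts roll up into recall, the sketch below counts `match` verdicts over all human comments. The JSON shape (a flat list of objects with a `verdict` key) is an assumption made for this example; the real `verdicts.json` layout and the aggregation in `experiment/evaluate.py` may differ.

```python
import json

def recall_from_verdicts(verdicts) -> float:
    """Fraction of human comments the judge marked as matched."""
    if not verdicts:
        return 0.0
    matched = sum(1 for v in verdicts if v["verdict"] == "match")
    return matched / len(verdicts)

# Invented sample in the assumed shape; not taken from the shipped data.
sample = json.loads("""
[
  {"verdict": "match", "matched_ai_comment_index": 0, "reasoning": "same issue"},
  {"verdict": "no_match", "matched_ai_comment_index": null, "reasoning": "not covered"},
  {"verdict": "match", "matched_ai_comment_index": 2, "reasoning": "same fix"}
]
""")
print(round(recall_from_verdicts(sample), 4))  # 0.6667
```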
Plus the cross-project judge validation tree:
- `data/judge_validation/manifest.json` -- sampling manifest: 18 sample folders, 50 verdicts across 6 strata, one per (project, judge_verdict) pair.
- `data/judge_validation/agreement.json`, `data/judge_validation/agreement_report.md` -- rater-label rollup: per-class agreement, Cohen's kappa, M-of-N counts.
- `data/judge_validation/samples/sample_NN/` -- one folder per sampled revision:
  - `README.md` -- per-sample human-rater instructions, prefixed with the artifact-note banner explaining the stripped `worktree/`.
  - `sample_info.json` -- which (project, judge_verdict) cells this sample contributes to and the inclusion-driving sample keys.
  - `context.json` -- copy of the reviewer context for this revision.
  - `review.json` -- copy of the reviewer output (with the explicit `index` field) for this revision.
  - `verdicts.json` -- the rater-fillable verdicts file, frozen at `judge_verdict_at_sampling`.
  - `judge_system_prompt.md` -- copy of the judge prompt used at sampling time.
  - `workspace.code-workspace` -- VS Code multi-root workspace pre-configured for the rater task; the artifact strips the worktree folder entry, so run `scripts/reconstruct_sample_worktrees.sh` first if you want the SCM diff view.
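Cohen's kappa, reported in the agreement files above, corrects raw agreement for the agreement two labelers would reach by chance. The following is a minimal sketch of the standard two-rater formula, with invented labels; it is not the code in `experiment/validate_audit.py`.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two raters' marginal label rates.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Invented labels for illustration; not drawn from the validation samples.
judge = ["match", "match", "no_match", "no_match"]
rater = ["match", "no_match", "no_match", "no_match"]
print(cohens_kappa(judge, rater))  # 0.5
```

Here raw agreement is 0.75 and chance agreement is 0.5, so kappa is 0.25 / 0.5 = 0.5.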
`subject-repos/storybook/`, `subject-repos/mermaid/`, `subject-repos/ase/` -- shallow bare git clones containing only the head and merge-base SHAs of each test revision. Each SHA is reachable via `refs/replication/<sha>` so `git gc` cannot prune it. The `origin` remote points at the upstream URL (`https://github.com/storybookjs/storybook.git`, `https://github.com/mermaid-js/mermaid.git`, `https://gitlab.com/ase/ase.git`) so live-mode reruns can `git fetch` additional history on demand.
- `ReplayMissingError`: a step tried to replay a saved LLM output that does not exist on disk. This usually means the command is being run on a project, condition, or revision that was not part of the original experiment. Use the canned commands in this README (which target only revisions present in the test set), or switch to live mode.
- `uv: command not found`: install uv via the curl/powershell/brew instructions in the Prerequisites section, then re-open the shell.
- Subject repo missing SHA in live mode: the bundled bare repositories contain only the test SHAs. To target a SHA outside the original test set, run `git -C subject-repos/<project> fetch origin <sha>` first.
- `relox evaluate` reports `Reachability: 0 revs mapped`: the shipped artifact uses a two-dot diff fallback in `_load_touched_files_per_rev` so the shallow bare bundles resolve; if you see zero mapped revs, you are running the artifact against a modified `evaluate.py` that has lost the fallback. Run `uv run pytest tests/experiment/test_evaluate_reachability.py` to confirm.
- VS Code workspace flashes "missing folder" on a sample folder: the artifact strips `data/judge_validation/samples/sample_NN/worktree/` and removes the corresponding entry from `workspace.code-workspace`. Run `bash scripts/reconstruct_sample_worktrees.sh` from the artifact root before opening the workspace.