Boundary-Guided Replay

This repository contains the anonymous AAAI-27 submission package for:

Boundary-Guided Replay: A Mechanism Study of Success-Failure Boundary Learning

The included artifact contains the reusable BGR core, versioned experiment configs, per-seed result summaries, generated paper tables/figures, OpenVLA audit scripts, environment snapshots, and a SHA-256 submission manifest. The main evidence includes a 30-seed synthetic mechanism check, active-estimator validation, a completed 30-seed procedural grid-margin full-baseline comparison, a held-out grid replication, replicated 30-seed OpenML diabetes, blood-transfusion, and phoneme supervised margin-replay checks, a 30-seed robot-suffix coverage comparison, a held-out suffix full-baseline replication, and a held-out suffix BGR-vs-uniform replication. The package also includes a 30-seed suffix stress sweep over teacher quality, clutter, feasibility, and boundary sharpness. OpenVLA/LIBERO results are included as recovery-curve, selection, and data-plumbing audits rather than robotics fine-tuning claims. The packaged action-label/TFDS plumbing audit validates 2,048-transition matched BGR/random exports with 7D actions and 8D state, but does not claim a stable policy gain.

The anonymous submission archive contains submission_manifest.json plus the files it declares. Only those archive entries should be treated as the anonymous submission artifact. The manifest hashes the declared payload files; the manifest itself is packaged as the verifier entry point and is intentionally not self-hashed. Mirrored raw Slurm logs are diagnostic run records and are not part of the anonymous submission artifact. Cluster commands below are provenance recipes; any remote input paths they mention are not reviewer evidence unless their generated summaries are declared in submission_manifest.json.
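A minimal sketch of the hash check described above. It assumes a hypothetical manifest layout in which a top-level "files" object maps relative paths to SHA-256 hex digests; the real schema of submission_manifest.json may differ, and scripts/check_submission_package.py remains the authoritative verifier. The helper names sha256_of and verify_manifest are illustrative and are not part of the repository.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest_name: str = "submission_manifest.json") -> list[str]:
    """Return a list of problems; an empty list means every declared file matches.

    ASSUMPTION: the manifest looks like {"files": {"relative/path": "<sha256 hex>"}}.
    The manifest is the entry point and is not expected to list itself.
    """
    manifest = json.loads((root / manifest_name).read_text(encoding="utf-8"))
    problems = []
    for rel_path, expected in sorted(manifest["files"].items()):
        target = root / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(target) != expected:
            problems.append(f"hash mismatch: {rel_path}")
    return problems
```

Under that assumed layout, calling verify_manifest(Path(".")) from the archive root would return an empty list when every declared payload file is present and unmodified.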

To assemble only those files after the package gate passes:

PYTHONPATH=src:. python3 scripts/check_submission_package.py --root . --write-submission-zip bgr-aaai27-anonymous.zip

Reviewer Navigation

Start with paper/main.pdf for the anonymous manuscript, then use results/README.md#submission-evidence-index for the evidence map. The primary evidence is the 30-seed synthetic mechanism check, the active-estimator validation, the 30-seed grid-margin comparison, the held-out grid replication, the replicated OpenML diabetes, blood-transfusion, and phoneme margin-replay checks, the 30-seed robot-suffix coverage comparison, the held-out suffix full-baseline replication, the held-out suffix BGR-vs-uniform replication, the suffix stress sweep, and paper/figures/significance_tests.csv. OpenVLA/LIBERO entries are scoped audits and should not be read as robotics fine-tuning claims.

For the primary paired comparisons, competing methods share experiment configs, replayable-state pools, evaluation radius grids, learner/update budgets, and paired seeds; the intended intervention is the replay state/radius selection rule. Baseline rows in the generated tables come from the same scripts and summary artifacts listed in the evidence index.
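A minimal sketch of how such a paired comparison can be tallied. It assumes per-seed rows with hypothetical column names method, seed, and a metric column; the packaged summary.csv files may use a different layout. The values in TOY_ROWS are made-up toy numbers, not results from this artifact.

```python
def paired_wins(rows, metric, method_a, method_b):
    """Count per-seed wins for method_a, wins for method_b, and ties.

    ASSUMPTION: each row is a dict with "method", "seed", and `metric` keys.
    Higher metric values are treated as better.
    """
    by_seed = {}
    for row in rows:
        by_seed.setdefault(row["seed"], {})[row["method"]] = float(row[metric])

    wins_a = wins_b = ties = 0
    for seed, values in sorted(by_seed.items()):
        # A paired comparison needs both methods on every seed.
        if method_a not in values or method_b not in values:
            raise ValueError(f"seed {seed} is missing a method; comparison is not paired")
        if values[method_a] > values[method_b]:
            wins_a += 1
        elif values[method_a] < values[method_b]:
            wins_b += 1
        else:
            ties += 1
    return wins_a, wins_b, ties


# Toy values for illustration only.
TOY_ROWS = [
    {"method": "bgr", "seed": 0, "rauc": "0.40"},
    {"method": "uniform", "seed": 0, "rauc": "0.35"},
    {"method": "bgr", "seed": 1, "rauc": "0.42"},
    {"method": "uniform", "seed": 1, "rauc": "0.42"},
    {"method": "bgr", "seed": 2, "rauc": "0.39"},
    {"method": "uniform", "seed": 2, "rauc": "0.36"},
]
print(paired_wins(TOY_ROWS, "rauc", "bgr", "uniform"))  # (2, 0, 1)
```

Raising on a missing method keeps an unpaired seed from silently shrinking the comparison.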

Claim-Evidence Map

Paper claim	Primary artifact evidence	Verification hook
Controlled synthetic recovery-margin training validates the intended BGR sampler before higher-cost runs.	`results/toy_30seed_v1/summary.csv`, `results/toy_15seed_v1/summary.csv`, `paper/figures/significance_tests.csv`	`scripts/check_paper_claims.py` checks the synthetic RAUC, AULC, clean-success, and sign-test prose; `scripts/check_submission_package.py` checks the 30-seed synthetic comparison.
Boundary-centered replay expands recovery margins in the main procedural setting.	`results/grid_margin_full_30seed_v1/summary.csv`, `results/grid_margin_full_replication_30seed_v1/summary.csv`, `paper/figures/grid_margin_full_table.tex`, `paper/figures/grid_margin_learning_curve.pdf`, `paper/figures/significance_tests.csv`	`scripts/check_paper_claims.py` checks the numeric prose; `scripts/check_submission_package.py` checks paired seeds, ledgers, generated tables, and manifest hashes.
Boundary-radius replay gives scoped positive results on pre-existing supervised datasets.	`results/openml_margin_scout_v0/summary.csv`, `results/openml_diabetes_margin_30seed_v1/summary.csv`, `results/openml_diabetes_margin_replication_30seed_v1/summary.csv`, `results/openml_numeric_external_fixed_target2_30seed_v1/summary.csv`, `results/openml_numeric_external_fixed_target2_replication_30seed_v1/summary.csv`, `results/openml_positive_target_sensitivity_30seed_v1/summary.csv`, `results/openml_positive_target_sensitivity_replication_30seed_v1/summary.csv`, `results/openml_blood_transfusion_margin_replication_30seed_v1/summary.csv`, `results/openml_phoneme_margin_replication_30seed_v1/summary.csv`	`scripts/check_paper_claims.py` checks the OpenML diabetes, blood-transfusion, phoneme, external-suite macro, and target-sensitivity caveat prose; `scripts/check_acceptance_readiness.py` checks the original and held-out gates; `scripts/check_submission_package.py` verifies packaged artifact references.
The feasibility witness is a scoped interface assumption rather than free supervision.	`results/grid_margin_witness_sensitivity_30seed_v1/summary.csv`, `results/suffix_stress_sensitivity_30seed_v1/summary.csv`	`scripts/check_paper_claims.py` checks the grid-margin witness-noise prose; `scripts/check_submission_package.py` keeps witness-scope language required in the manuscript.
Active boundary probing estimates useful critical radii at a small fixed rollout budget.	`paper/figures/estimator_stats.csv`, `paper/figures/estimator_table.tex`, `results/estimator_pair_30seed_v1/summary.csv`, `paper/figures/significance_tests.csv`	`scripts/check_paper_claims.py` checks the estimator prose; `scripts/check_submission_package.py` checks the generated estimator table and the 30-seed estimator confirmation.
Radius-level boundary sampling is the important BGR ablation in the grid-margin benchmark.	`results/grid_margin_ablation_30seed_v1/summary.csv`, `results/grid_margin_ablation_replication_30seed_v1/summary.csv`, `results/grid_margin_ablation_15seed_v1/summary.csv`, `paper/figures/grid_margin_ablation_table.tex`, `paper/figures/significance_tests.csv`	`scripts/check_paper_claims.py` checks the ablation prose against the original and held-out 30-seed summaries; `scripts/check_submission_package.py` checks the 30-seed mechanism confirmation and regenerates aggregate tables and significance artifacts exactly.
Coverage-aware BGR-Suffix is positive manipulation-style evidence but not a final robotics claim.	`results/suffix_coverage_full_30seed_v1/summary.csv`, `results/suffix_coverage_full_replication_30seed_v1/summary.csv`, `results/suffix_strategy_coverage_30seed_v1/summary.csv`, `results/suffix_strategy_coverage_replication_30seed_v1/summary.csv`, `results/suffix_strategy_ablation_30seed_v1/summary.csv`, `results/suffix_stress_sensitivity_30seed_v1/summary.csv`, `paper/figures/suffix_stress_sensitivity_stats.csv`, `paper/figures/significance_tests.csv`	`scripts/check_paper_claims.py` checks the full-baseline RAUC rows, strategy ablation, stress sweep, clean, transfer, AULC, and median-r80 caveat prose.
The independent standard-environment diagnostic is a limitation, not a support claim.	`results/frozenlake_recovery_focused_30seed_v1/summary.csv`, `results/frozenlake_recovery_focused_30seed_v1/results.json`, `results/lunarlander_recovery_probe_4seed_v1/summary.csv`, `results/bsuite_deepsea_recovery_probe_4seed_v1/summary.csv`, `results/bsuite_catch_recovery_probe_30seed_v1/summary.csv`, `results/bsuite_mountaincar_recovery_probe_4seed_v1/summary.csv`, `results/bsuite_cartpole_recovery_probe_4seed_v1/summary.csv`, `results/reacher_recovery_probe_12seed_v1/summary.csv`	`scripts/check_paper_claims.py` checks the FrozenLake, LunarLander, compressed bsuite DeepSea/Catch/Cartpole scope row, and Reacher limitation prose and verifies that paired signs, ablations, or robustness metrics do not support a clean BGR win.
The learned-policy OpenVLA/LIBERO path is an audit, not a robotics fine-tuning claim.	`results/libero_probe_v2/summary.csv`, `results/openvla_teacher_replay_manifest_v1/summary.json`, `results/openvla_action_tfds_validation_v1/summary.json`, and the packaged OpenVLA-OFT audit summaries listed below.	`scripts/check_submission_package.py` enforces paper-facing audit wording and keeps paper-negative scale-diagnostic outputs out of the anonymous manifest.

Grid-margin robustness/scope diagnostic artifacts:

30-seed radius-level ablation/table: results/grid_margin_ablation_30seed_v1/summary.csv, paper/figures/grid_margin_ablation_table.tex
held-out 30-seed radius-level ablation replication: results/grid_margin_ablation_replication_30seed_v1/summary.csv
original 15-seed radius-level ablation: results/grid_margin_ablation_15seed_v1/summary.csv
learning-curve stats/figure/source: paper/figures/grid_margin_learning_curve_stats.csv, paper/figures/grid_margin_learning_curve.pdf, results/grid_margin_full_15seed_v1/results.json
30-seed target-margin sweep table/source: paper/figures/grid_margin_target_sensitivity_stats.csv, results/grid_margin_target_sensitivity_30seed_v1/summary.csv; 15-seed provenance: results/grid_margin_target_sensitivity_15seed_v1/summary.csv
30-seed learning-rate sweep table/source: paper/figures/grid_margin_learning_rate_sensitivity_stats.csv, results/grid_margin_learning_rate_sensitivity_30seed_v1/summary.csv; 15-seed provenance: results/grid_margin_learning_rate_sensitivity_15seed_v1/summary.csv
30-seed regime sweep table/source: paper/figures/grid_margin_regime_sensitivity_stats.csv, results/grid_margin_regime_sensitivity_30seed_v1/summary.csv; 15-seed provenance: results/grid_margin_regime_sensitivity_15seed_v1/summary.csv
30-seed stress sweep table/source: paper/figures/grid_margin_stress_sensitivity_stats.csv, results/grid_margin_stress_sensitivity_30seed_v1/summary.csv; 15-seed provenance: results/grid_margin_stress_sensitivity_15seed_v1/summary.csv
30-seed witness sensitivity diagnostic: results/grid_margin_witness_sensitivity_30seed_v1/summary.csv
30-seed suffix stress sweep table/source: paper/figures/suffix_stress_sensitivity_stats.csv, results/suffix_stress_sensitivity_30seed_v1/summary.csv; 15-seed provenance: results/suffix_stress_sensitivity_15seed_v1/summary.csv

OpenVLA-OFT packaged audit summaries:

OpenVLA selection/audit stats: paper/figures/openvla_stats.csv
OpenVLA recovery audit source: results/libero_openvla_recovery_v1/summary.csv
OpenVLA selection audit source: results/libero_openvla_boundary_selection_balanced_v1/aggregate.csv
OpenVLA action-label/TFDS validation source: results/openvla_action_tfds_validation_v1/summary.json
official-checkpoint sanity audit: results/openvla_oft_sanity_eval_sanity_v1/summary.csv
1,000-step balanced2048 data-plumbing audit: results/openvla_oft_eval_balanced2048_step1000_v1/summary.csv
p1024 clean adaptation audit: results/openvla_oft_goal_adapt_eval_cleanmix_p1024_step50100_lr1em6_identitylora_officialtrainstats_v1/summary.csv
p1024 original perturbation audit: results/openvla_oft_perturb_eval_cleanmix_p1024_step50100_lr1em6_identitylora_officialtrainstats_v1/summary.csv
p1024 offset-3 perturbation audit: results/openvla_oft_perturb_eval_cleanmix_p1024_step50100_lr1em6_identitylora_officialtrainstats_offset3_7trials_v1/summary.csv
p2048 clean adaptation audit: results/openvla_oft_goal_adapt_eval_cleanmix_p2048_step50100_lr1em6_identitylora_officialtrainstats_v1/summary.csv
p2048 original perturbation audit: results/openvla_oft_perturb_eval_cleanmix_p2048_step50100_lr1em6_identitylora_officialtrainstats_v1/summary.csv
p2048 offset-3 perturbation audit: results/openvla_oft_perturb_eval_cleanmix_p2048_step50100_lr1em6_identitylora_officialtrainstats_offset3_7trials_v1/summary.csv
p2048 10-trial perturbation variance audit: results/openvla_oft_perturb_eval_cleanmix_p2048_step50100_lr1em6_identitylora_officialtrainstats_10trials_v1/summary.csv
p2048 full-goal clean identity audit: results/openvla_oft_clean_eval_cleanmix_p2048_step50100_lr1em6_identitylora_officialtrainstats_fullgoal10x10_v1/summary.csv
p2048 full-goal visual perturbation audit: results/openvla_oft_perturb_eval_cleanmix_p2048_step50100_lr1em6_identitylora_officialtrainstats_fullgoal10x10_v1/summary.csv
p2048 300-step image-augmentation continuation audit: results/openvla_oft_perturb_eval_cleanmix_p2048_step50300_lr5em7_identitylora_imageaug_officialtrainstats_fullgoal10x10_v1/summary.csv
p2048 1,000-step low-LR image-augmentation continuation audit: results/openvla_oft_perturb_eval_cleanmix_p2048_step51000_lr1em7_identitylora_imageaug_officialtrainstats_fullgoal10x10_v1/summary.csv
p2048 weighted perturbation curriculum audit: results/openvla_oft_perturb_eval_cleanmix_p2048unique_perturbrepeat3_prereg_step50500_lr5em7_identitylora_imageaug_officialtrainstats_fullgoal10x10_perturb_v1/summary.csv
p2048 perturb-only anchored audit: results/openvla_oft_perturb_eval_p2048unique_perturbonly_anchor_prereg_perturbonly_proxanchor_l2_5em0_step50300_lr2em7_identitylora_imageaug_officialtrainstats_fullgoal10x10_perturb_v1/summary.csv

Repository Layout

src/bgr/                 Core BGR data structures, estimators, metrics, and samplers
scripts/                 Experiment and plotting entry points
configs/                 Versioned experiment configs
tests/                   Unit tests for estimator/priority/metrics behavior
paper/                   AAAI-27 manuscript, generated tables/figures, and official AuthorKit27

Verification Commands

python3 -m pip install -e .
PYTHONPATH=src:. python3 -m unittest discover -s tests
PYTHONPATH=src:. python3 scripts/check_paper_claims.py --paper paper/main.tex --results-dir results --figures-dir paper/figures
PYTHONPATH=src:. python3 scripts/check_submission_package.py --root .

Reproducibility Metadata

The repository is MIT licensed. Runtime environment snapshots can be collected on the cluster with the remote_srun.sh commands at the end of this section and checked against results/environment_v1.

The included submission artifact records required-file integrity in submission_manifest.json. The package gate verifies rendered PDFs, PDF metadata hygiene, required artifacts, generated table synchronization, double-blind hygiene, README framing, and the SHA-256 manifest with:

PYTHONPATH=src:. python3 scripts/check_submission_package.py --root .

After an intentional update to any required artifact, regenerate the manifest before rerunning the package gate:

PYTHONPATH=src:. python3 scripts/check_submission_package.py --root . --write-required-manifest

~/remote_srun.sh --dry-run --partition compute --gres '' --cpus 2 --mem 8G --time 00:10:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/collect_environment.py --out runs/environment_v1/compute_environment.json
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 2 --mem 8G --time 00:10:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/collect_environment.py --out runs/environment_v1/compute_environment.json

~/remote_srun.sh --dry-run --partition gpu --gres gpu:1 --cpus 4 --mem 16G --time 00:10:00 /work/anonymous/bgr env MUJOCO_GL=egl PYOPENGL_PLATFORM=egl PYTHONPATH=src:. python scripts/collect_environment.py --out runs/environment_v1/gpu_environment.json
~/remote_srun.sh --github-test --git-pull --log --partition gpu --gres gpu:1 --cpus 4 --mem 16G --time 00:10:00 /work/anonymous/bgr env MUJOCO_GL=egl PYOPENGL_PLATFORM=egl PYTHONPATH=src:. python scripts/collect_environment.py --out runs/environment_v1/gpu_environment.json

Tier-0 Experiment

Heavy or repeated experiments should run on the cluster through ~/remote_srun.sh; use /work/anonymous/bgr as the remote project directory to avoid home-directory disk pressure.

Local smoke run, once python3 -m pip install -e . has completed:

PYTHONPATH=src:. python3 scripts/run_toy_experiment.py --config configs/toy_smoke.yaml --out runs/toy_smoke

Dry run:

~/remote_srun.sh --dry-run /work/anonymous/bgr python scripts/run_toy_experiment.py --config configs/toy_bgr_15seed.yaml --out runs/toy_15seed_v1

Real run:

~/remote_srun.sh --github-test --git-pull --log /work/anonymous/bgr python scripts/run_toy_experiment.py --config configs/toy_bgr_15seed.yaml --out runs/toy_15seed_v1

Active Estimator Validation

This run isolates the recovery-curve estimator from policy training by comparing fixed-budget probes against dense reference curves.

~/remote_srun.sh --dry-run --partition compute --gres '' --cpus 4 --mem 12G --time 01:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_estimator_experiment.py --config configs/estimator_bgr_full.yaml --out runs/estimator_full_v1
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 01:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_estimator_experiment.py --config configs/estimator_bgr_full.yaml --out runs/estimator_full_v1
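The comparison this section describes can be illustrated on a synthetic curve. The sketch below is an illustration only, not the estimator in src/bgr: it assumes a logistic recovery curve, treats the 50% success crossing as the critical radius, and uses bisection as a stand-in fixed-budget probe. All function names and parameter values are hypothetical.

```python
import math
import random


def success_prob(radius, critical, sharpness):
    """Synthetic recovery curve: success falls off smoothly around `critical`."""
    return 1.0 / (1.0 + math.exp((radius - critical) / sharpness))


def rollout_success_rate(radius, critical, sharpness, n_rollouts, rng):
    """Simulate n_rollouts Bernoulli rollouts at one perturbation radius."""
    p = success_prob(radius, critical, sharpness)
    return sum(rng.random() < p for _ in range(n_rollouts)) / n_rollouts


def dense_reference(critical, sharpness, rng, n_radii=101, n_rollouts=200):
    """Dense sweep: first radius on a fine grid whose success rate drops below 0.5."""
    for i in range(n_radii):
        radius = i / (n_radii - 1)
        if rollout_success_rate(radius, critical, sharpness, n_rollouts, rng) < 0.5:
            return radius
    return 1.0


def bisection_probe(critical, sharpness, rng, budget=32, rollouts_per_probe=4):
    """Fixed-budget probe: bisect on [0, 1] using a few rollouts per probed radius."""
    lo, hi = 0.0, 1.0
    for _ in range(budget // rollouts_per_probe):
        mid = 0.5 * (lo + hi)
        if rollout_success_rate(mid, critical, sharpness, rollouts_per_probe, rng) >= 0.5:
            lo = mid  # still recovering at mid, so the boundary lies further out
        else:
            hi = mid  # failing at mid, so the boundary lies closer in
    return 0.5 * (lo + hi)


rng = random.Random(0)
reference = dense_reference(critical=0.6, sharpness=0.05, rng=rng)
estimate = bisection_probe(critical=0.6, sharpness=0.05, rng=rng)
print(f"dense reference: {reference:.3f}, fixed-budget estimate: {estimate:.3f}")
```

The dense sweep here spends 101 x 200 rollouts while the probe spends 32, which is the kind of budget gap a fixed-budget estimator validation is meant to quantify.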

LIBERO Simulator Probe

The cluster has LIBERO and robosuite available. This probe validates resettable LIBERO task states and object-pose perturbations on GPU/EGL; it is not a policy-success experiment.

~/remote_srun.sh --dry-run --github-test --git-pull --log --partition gpu --gres gpu:1 --cpus 4 --mem 16G --time 01:00:00 /work/anonymous/bgr env MUJOCO_GL=egl PYOPENGL_PLATFORM=egl PYTHONPATH=src:. python scripts/probe_libero_suffix_states.py --suite libero_goal --task-ids 0,1,2,3,4 --init-state-ids 0,1,2 --radii 0.0,0.25,0.5,0.75,1.0 --trials-per-radius 4 --settle-steps 5 --image-size 64 --out runs/libero_probe_v2
~/remote_srun.sh --github-test --git-pull --log --partition gpu --gres gpu:1 --cpus 4 --mem 16G --time 01:00:00 /work/anonymous/bgr env MUJOCO_GL=egl PYOPENGL_PLATFORM=egl PYTHONPATH=src:. python scripts/probe_libero_suffix_states.py --suite libero_goal --task-ids 0,1,2,3,4 --init-state-ids 0,1,2 --radii 0.0,0.25,0.5,0.75,1.0 --trials-per-radius 4 --settle-steps 5 --image-size 64 --out runs/libero_probe_v2

Existing closed-loop OpenVLA/LIBERO object-task rollouts can be converted into BGR-style recovery curves:

~/remote_srun.sh --dry-run --partition compute --gres '' --cpus 2 --mem 8G --time 00:10:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/summarize_libero_openvla_recovery.py --input-dir /work/anonymous/dreamaudit_jobs/artifacts/libero_openvla_observation_object3_h220_bash --out runs/libero_openvla_recovery_v1 --source-name libero_openvla_observation_object3_h220_bash
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 2 --mem 8G --time 00:10:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/summarize_libero_openvla_recovery.py --input-dir /work/anonymous/dreamaudit_jobs/artifacts/libero_openvla_observation_object3_h220_bash --out runs/libero_openvla_recovery_v1 --source-name libero_openvla_observation_object3_h220_bash

Existing OpenVLA perturbation-selection artifacts can also be summarized as a boundary-discovery diagnostic:

~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 2 --mem 8G --time 00:10:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/summarize_openvla_boundary_selection.py --proposal-dir /work/anonymous/dreamaudit_jobs/artifacts/libero_openvla_observation_proposal_guided_h160 --proposal-dir /work/anonymous/dreamaudit_jobs/artifacts/libero_openvla_observation_proposal_guided_seed2_h160 --proposal-dir /work/anonymous/dreamaudit_jobs/artifacts/libero_openvla_observation_proposal_guided_seed3_h160 --random-dir /work/anonymous/dreamaudit_jobs/artifacts/libero_openvla_observation_random_balanced_seed1b_skip_lp2_h160 --random-dir /work/anonymous/dreamaudit_jobs/artifacts/libero_openvla_observation_random_balanced_seed2b_skip_lp2_h160 --random-dir /work/anonymous/dreamaudit_jobs/artifacts/libero_openvla_observation_random_balanced_seed3b_skip_lp2_h160 --random-dir /work/anonymous/dreamaudit_jobs/artifacts/libero_openvla_observation_random_balanced_seed4b_skip_lp2_h160 --random-dir /work/anonymous/dreamaudit_jobs/artifacts/libero_openvla_observation_random_balanced_seed5b_skip_lp2_h160 --out runs/libero_openvla_boundary_selection_v1

Robot Suffix Strategy Comparison

This diagnostic compares BGR-Suffix radius distributions while keeping the same replay-state estimator.

~/remote_srun.sh --dry-run --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_suffix_experiment.py --config configs/suffix_full_15seed.yaml --out runs/suffix_full_15seed_v1
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_suffix_experiment.py --config configs/suffix_full_15seed.yaml --out runs/suffix_full_15seed_v1
~/remote_srun.sh --dry-run --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_suffix_experiment.py --config configs/suffix_strategy_coverage_15seed.yaml --out runs/suffix_strategy_coverage_15seed_v1
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_suffix_experiment.py --config configs/suffix_strategy_coverage_15seed.yaml --out runs/suffix_strategy_coverage_15seed_v1
~/remote_srun.sh --dry-run --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_suffix_experiment.py --config configs/suffix_strategy_coverage_30seed.yaml --out runs/suffix_strategy_coverage_30seed_v1
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_suffix_experiment.py --config configs/suffix_strategy_coverage_30seed.yaml --out runs/suffix_strategy_coverage_30seed_v1
~/remote_srun.sh --dry-run --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 02:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_suffix_experiment.py --config configs/suffix_strategy.yaml --out runs/suffix_strategy_v1
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 02:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_suffix_experiment.py --config configs/suffix_strategy.yaml --out runs/suffix_strategy_v1

Procedural Grid Recovery

The grid benchmarks are dependency-light procedural decision benchmarks with generated obstacle maps, replayable mid-path states, Manhattan-radius perturbations, and an exact shortest-path feasibility witness.

~/remote_srun.sh --dry-run --partition compute --gres '' --cpus 2 --mem 8G --time 01:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_experiment.py --config configs/grid_bgr.yaml --out runs/grid_fast
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 2 --mem 8G --time 01:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_experiment.py --config configs/grid_bgr.yaml --out runs/grid_fast
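The ingredients named at the start of this section (obstacle map, Manhattan-radius perturbation, shortest-path feasibility witness) can be sketched as follows. This is a hedged illustration, not the code behind scripts/run_grid_experiment.py: the grid encoding, the choice of cells at exactly the given radius, and every function name are assumptions.

```python
from collections import deque

# Hypothetical map: "#" is an obstacle, "." is free. Cell (1, 3) is walled in.
GRID = [
    "...#",
    ".##.",
    "...#",
    "#...",
]


def shortest_path_length(grid, start, goal):
    """BFS over 4-connected free cells; returns the step count or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    if grid[start[0]][start[1]] == "#" or grid[goal[0]][goal[1]] == "#":
        return None
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            return dist[cell]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[cell] + 1
                queue.append((nr, nc))
    return None


def manhattan_perturbations(grid, state, radius):
    """Free cells at exactly Manhattan distance `radius` from `state`, sorted."""
    rows, cols = len(grid), len(grid[0])
    r0, c0 = state
    cells = set()
    for dr in range(-radius, radius + 1):
        rem = radius - abs(dr)
        for dc in {-rem, rem}:  # the set collapses the duplicate when rem == 0
            r, c = r0 + dr, c0 + dc
            if 0 <= r < rows and 0 <= c < cols and grid[r][c] != "#":
                cells.add((r, c))
    return sorted(cells)


def feasible_perturbations(grid, state, goal, radius):
    """Keep only perturbed cells from which the goal is still reachable (the witness)."""
    return [cell for cell in manhattan_perturbations(grid, state, radius)
            if shortest_path_length(grid, cell, goal) is not None]


print(shortest_path_length(GRID, (0, 0), (3, 3)))         # 6
print(manhattan_perturbations(GRID, (2, 2), 2))           # [(0, 2), (1, 3), (2, 0), (3, 1), (3, 3)]
print(feasible_perturbations(GRID, (2, 2), (3, 3), 2))    # [(0, 2), (2, 0), (3, 1), (3, 3)]
```

In this toy map the witness filters out (1, 3): it is a free cell at the right radius, but no path leads from it to the goal.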

The positive procedural benchmark is grid_margin_bgr, which evaluates state-conditioned margin expansion on grid-backed replay states. The completed 30-seed full-baseline config is the primary grid comparison; the stored learning-curve history remains 15-seed, while the rendered ablation and sensitivity tables use the packaged 30-seed confirmations:

~/remote_srun.sh --dry-run --partition compute --gres '' --cpus 4 --mem 12G --time 02:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_margin_experiment.py --config configs/grid_margin_bgr_full.yaml --out runs/grid_margin_full
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 02:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_margin_experiment.py --config configs/grid_margin_bgr_full.yaml --out runs/grid_margin_full
~/remote_srun.sh --dry-run --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_margin_experiment.py --config configs/grid_margin_full_15seed.yaml --out runs/grid_margin_full_15seed_v1
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_margin_experiment.py --config configs/grid_margin_full_15seed.yaml --out runs/grid_margin_full_15seed_v1
PYTHONPATH=src:. python scripts/run_grid_margin_trial.py --config configs/grid_margin_full_30seed.yaml --out results/grid_margin_full_30seed_v1 --method bgr --seed 0
PYTHONPATH=src:. python scripts/merge_grid_margin_trials.py --config configs/grid_margin_full_30seed.yaml --out results/grid_margin_full_30seed_v1

The target-margin sensitivity sweep checks whether the grid-margin BGR result is tied to the reported target_margin=0.38 setting:

~/remote_srun.sh --dry-run --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_margin_target_sensitivity.py --config configs/grid_margin_target_sensitivity_15seed.yaml --out runs/grid_margin_target_sensitivity_15seed_v1
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_margin_target_sensitivity.py --config configs/grid_margin_target_sensitivity_15seed.yaml --out runs/grid_margin_target_sensitivity_15seed_v1

A learning-rate sensitivity sweep is retained as a scope diagnostic. It tests the same paired BGR/uniform setup at learning_rate values 0.015, 0.03, and 0.06:

~/remote_srun.sh --dry-run --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_margin_learning_rate_sensitivity.py --config configs/grid_margin_learning_rate_sensitivity_15seed.yaml --out runs/grid_margin_learning_rate_sensitivity_15seed_v1
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_margin_learning_rate_sensitivity.py --config configs/grid_margin_learning_rate_sensitivity_15seed.yaml --out runs/grid_margin_learning_rate_sensitivity_15seed_v1

Grid-margin ablations isolate BGR priority terms and boundary-centered radius sampling:

~/remote_srun.sh --dry-run --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_margin_experiment.py --config configs/grid_margin_ablation_15seed.yaml --out runs/grid_margin_ablation_15seed_v1
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_margin_experiment.py --config configs/grid_margin_ablation_15seed.yaml --out runs/grid_margin_ablation_15seed_v1
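To illustrate what a radius-level ablation varies, the sketch below contrasts uniform radius sampling with sampling concentrated near an estimated critical radius. It is an illustration only: the function names, the clipped Gaussian form, and the width parameter are assumptions, not the sampler implemented in src/bgr.

```python
import random


def sample_uniform_radius(rng, r_max=1.0):
    """Baseline: every radius in [0, r_max] is equally likely."""
    return rng.uniform(0.0, r_max)


def sample_boundary_radius(rng, critical_radius, width=0.1, r_max=1.0):
    """Boundary-centered draw: Gaussian around the estimated critical radius,
    clipped to [0, r_max]. ASSUMPTION: the Gaussian shape is illustrative only."""
    return min(r_max, max(0.0, rng.gauss(critical_radius, width)))


def fraction_near(radii, critical_radius, band=0.1):
    """Share of sampled radii that land within `band` of the boundary."""
    return sum(abs(r - critical_radius) <= band for r in radii) / len(radii)


rng = random.Random(0)
critical = 0.6  # hypothetical estimated critical radius
uniform = [sample_uniform_radius(rng) for _ in range(2000)]
boundary = [sample_boundary_radius(rng, critical) for _ in range(2000)]
print(f"uniform near boundary:  {fraction_near(uniform, critical):.2f}")
print(f"boundary near boundary: {fraction_near(boundary, critical):.2f}")
```

Under these assumptions, the boundary-centered sampler spends most of its replay budget in the band where success and failure are both likely, while the uniform sampler spreads it across radii that are mostly either trivially recoverable or infeasible.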

A grid-regime sensitivity runner is retained as a diagnostic. The packaged obstacle_prob/max_offset sweep mostly reproduces the nominal margin dynamics, so it should not be treated as separate robustness evidence without stronger regime changes. The rendered table uses the 30-seed regime diagnostic source, with 30/0 paired wins for BGR on final RAUC, RAUC AULC, clean success, and median r80 in each tested regime:

~/remote_srun.sh --dry-run --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_margin_regime_sensitivity.py --config configs/grid_margin_regime_sensitivity_15seed.yaml --out runs/grid_margin_regime_sensitivity_15seed_v1
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_margin_regime_sensitivity.py --config configs/grid_margin_regime_sensitivity_15seed.yaml --out runs/grid_margin_regime_sensitivity_15seed_v1
PYTHONPATH=src:. python3 scripts/run_grid_margin_regime_sensitivity.py --config configs/grid_margin_regime_sensitivity_30seed.yaml --out results/grid_margin_regime_sensitivity_30seed_v1

The grid-margin stress sweep changes the latent recovery geometry rather than only the obstacle layout. It tests sharp low-margin states, diffuse recovery boundaries, and lower feasible-radius mass:

~/remote_srun.sh --dry-run --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_margin_stress_sensitivity.py --config configs/grid_margin_stress_sensitivity_15seed.yaml --out runs/grid_margin_stress_sensitivity_15seed_v1
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 04:00:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_margin_stress_sensitivity.py --config configs/grid_margin_stress_sensitivity_15seed.yaml --out runs/grid_margin_stress_sensitivity_15seed_v1

The packaged results/grid_margin_stress_sensitivity_15seed_v1 run is positive across all three stress cases: BGR final RAUC ranges from 0.362 to 0.457 versus 0.324 to 0.430 for uniform, and every RAUC and AULC paired sign test is 15/0 (p=0.0001).
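As a worked check of the 15/0 sign-test figure, the sketch below computes an exact two-sided binomial sign test. That the paper uses this exact two-sided form is an assumption; scripts/analyze_significance.py defines the actual test. The function name is hypothetical.

```python
from math import comb


def sign_test_two_sided(wins: int, losses: int) -> float:
    """Exact two-sided sign test p-value under H0: P(win) = 0.5, with ties dropped."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of a split at least as lopsided in one direction, then doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


p_15 = sign_test_two_sided(15, 0)
print(p_15)            # 6.103515625e-05
print(round(p_15, 4))  # 0.0001
```

For a 15/0 split the value is 2 / 2^15, about 6.1e-5, which is consistent with the reported p=0.0001 when rounded to four decimals.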

The tabular grid-policy configs are retained as negative diagnostics:

~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 00:45:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_experiment.py --config configs/grid_policy_mixed.yaml --out runs/grid_policy_mixed_v1
~/remote_srun.sh --github-test --git-pull --log --partition compute --gres '' --cpus 4 --mem 12G --time 00:45:00 /work/anonymous/bgr env PYTHONPATH=src:. python scripts/run_grid_experiment.py --config configs/grid_policy_coverage.yaml --out runs/grid_policy_coverage_v1

See results/README.md for the packaged run ledger. The original tabular grid policy benchmark is retained as a negative diagnostic because broad replay saturates it after clean pretraining.

Result Aggregation and Paper Figures

python3 scripts/aggregate_results.py --results-dir results --out-dir paper/figures
python3 scripts/analyze_significance.py --results-dir results --out-csv paper/figures/significance_tests.csv --out-tex paper/figures/significance_table.tex
PYTHONPATH=src:. python3 scripts/check_paper_claims.py --paper paper/main.tex --results-dir results --figures-dir paper/figures

These commands write CSV summaries, a LaTeX summary table, and bar-chart figures used by paper/main.tex. The claim check then verifies the headline numeric prose against the generated CSV/JSON artifacts.

OpenVLA/LIBERO Audit Summary

Detailed OpenVLA commands, Slurm IDs, copied artifacts, and summaries are in results/README.md. The top-level claim is scoped: these runs audit recovery curves, perturbation selection, and OpenVLA-OFT data plumbing rather than robotics fine-tuning performance.

The useful packaged audit scale is p1024/p2048 clean-mix adaptation with official training statistics, identity LoRA initialization, and a low learning rate. At p1024, BGR and matched random tie on clean evaluation at 14/15; pooling the original and offset-3 visual perturbation evals gives 0.8550 for BGR versus 0.8400 for matched random, both trailing the unadapted official checkpoint at 0.8700.

At p2048, the full-goal identity audit gives 99/100 clean successes for BGR, matched random, and the official checkpoint alike. The 10-task visual perturbation audit gives BGR 367/400 perturbed successes, tying official and trailing matched random (368/400) by one episode. The 300-step image-augmentation continuation gives BGR and matched random 368/400 perturbed successes each, one episode above official (367/400), while BGR trails both on identity. The 1,000-step low-learning-rate continuation is also negative: BGR reaches 366/400 non-identity perturbation successes, trailing official at 367/400 and matched random at 370/400. The follow-up weighted perturbation curriculum is negative as well: BGR and official tie at 367/400 non-identity successes while matched random reaches 370/400. The perturb-only anchored route preserves identity at 99/100 for all three methods and gives BGR 371/400 non-identity successes, but matched random reaches 372/400 and official 367/400, so it also fails the fixed promotion gate.

AAAI Sources

The official AAAI-27 page lists the 2026 author-submission timetable for the February 16-23, 2027 conference and links the AAAI-27 author kit: abstracts are due July 21, 2026, full papers July 28, 2026, and supplementary material/code July 31, 2026. The kit in paper/AuthorKit27 was downloaded from https://aaai.org/authorkit27/ on 2026-06-01.
