Fix off-by-one, log-zero, and start_loss-zero bugs in reweight() #91
Merged
Three latent bugs in `src/microcalibrate/reweight.py`:

1. The dense training loop guarded its gradient step with `if i != max_epochs - 1` where `max_epochs = epochs - 1`, which actually skipped the penultimate epoch (`i == epochs - 2`) while still stepping on the final epoch. The returned `final_weights` therefore drifted one step away from the final tracked row. Every epoch now steps, and the tracker always ends on the final epoch, so logged estimates correspond to the pre-step state of the returned weights.
2. `np.log(original_weights + random_noise)` produced `-inf` (and downstream NaN gradients) whenever an initial weight was zero and `noise_level` was zero; the L0 branch hit the same issue even with nonzero noise because it logs the raw weights. Both call sites now clamp inputs to `>= 1e-12`.
3. The sparse L0 loop computed `(l.item() - start_loss) / start_loss` unconditionally; when `start_loss` happened to be ~0 (trivially calibrated data with `l0_lambda = 0`) this raised `ZeroDivisionError` inside the tqdm postfix. A small-magnitude guard now short-circuits the display value to 0.0.

Adds `tests/test_reweight_regression.py` covering all three fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
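The two numerical guards (bugs 2 and 3 above) can be sketched in isolation. This is a minimal illustration, not the actual `reweight()` code; the variable names and the 1e-12 threshold are taken from the PR description, everything else is assumed for the example.

```python
import numpy as np

# Bug 2 fix: clamp before logging so zero initial weights no longer
# produce -inf log-weights (and NaN gradients downstream).
original_weights = np.array([0.0, 1.0, 2.5])
noise_level = 0.0  # the pathological case: zero weight + zero noise
random_noise = np.random.normal(0.0, noise_level, size=original_weights.shape)

log_weights = np.log(np.maximum(original_weights + random_noise, 1e-12))
assert np.all(np.isfinite(log_weights))

# Bug 3 fix: guard the relative-loss display value against a ~0 start_loss
# instead of dividing unconditionally.
def relative_loss(loss: float, start_loss: float) -> float:
    if abs(start_loss) < 1e-12:
        return 0.0  # short-circuit for trivially calibrated data
    return (loss - start_loss) / start_loss

assert relative_loss(0.5, 0.0) == 0.0   # no ZeroDivisionError
assert relative_loss(0.5, 1.0) == -0.5  # ordinary relative change
```

Using `np.maximum` rather than adding an epsilon keeps already-positive weights exactly unchanged, which matches the "clamp inputs to >= 1e-12" wording above.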
MaxGhenis (Contributor, Author) commented on Apr 17, 2026:
LGTM (cannot self-approve; posting as comment).
- Off-by-one fix: removed the `max_epochs = epochs - 1` guard so every epoch steps. The loop semantics are now clean and documented in the code: the pre-step state of epoch `i` is logged (when tracked); the post-last-step state is returned. The asymmetry between the final logged row and the returned weights is acknowledged in the inline comment, a reasonable choice vs. running an extra forward pass after the last step.
- The `is_final_epoch` guard ensures the tracker always contains epoch `epochs - 1`, correctly fixing the diagnostic/returned-state disagreement.
- `np.maximum(..., 1e-12)` is added on both the dense and sparse log-weight initialisations, consistently.
- The `start_loss` divide-by-zero guard uses `abs(start_loss) < 1e-12`, which is correct for both zero and near-zero starts.
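The off-by-one the review describes can be reduced to a counting sketch. This is an illustrative loop skeleton, not the actual `reweight()` implementation: under the old guard, the condition `i != max_epochs - 1` with `max_epochs = epochs - 1` skipped epoch `epochs - 2` rather than the last one.

```python
def run_epochs_old(epochs: int) -> int:
    """Old behavior: the guard skips i == epochs - 2, not the final epoch."""
    max_epochs = epochs - 1
    steps = 0
    for i in range(epochs):
        if i != max_epochs - 1:  # true for every epoch except epochs - 2
            steps += 1           # gradient step would happen here
    return steps

def run_epochs_fixed(epochs: int) -> int:
    """Fixed behavior: every epoch takes its gradient step."""
    steps = 0
    for i in range(epochs):
        steps += 1
    return steps

assert run_epochs_old(25) == 24    # one step silently dropped mid-run
assert run_epochs_fixed(25) == 25  # all 25 epochs step
```

The skipped step happening in the middle of the run (not at the end) is why the returned weights drifted one step away from the final tracked row.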
Minor (non-blocking) notes:
- `test_all_epochs_step` compares weights after N and N-1 epochs. Under the original bug the two runs still produced different weights (different epochs were skipped), so this test doesn't strictly regress the off-by-one. `test_final_epoch_matches_tracker` also doesn't regress it when `tracking_n | (epochs - 1)` (the default case `epochs=25, tracking_n=2`). For a tighter regression test, consider `epochs=30` so `(epochs - 1) % tracking_n != 0`. The fix itself is correct; this is just test tightness.
- Pre-existing `test_evaluate_holdout_robustness` flakiness is unrelated to this PR: this PR doesn't touch RNG seeding (that's #93), so the underlying unseeded-numpy cause is present on main too.
- Unrelated repo infra: `.python-version` still pins 3.11 while `pyproject.toml` requires >=3.13. Worth a follow-up cleanup, but out of scope here.
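The test-tightness point about `tracking_n` dividing `epochs - 1` is a one-line divisibility check. A hedged illustration (the function name is hypothetical; the parameter defaults are the ones quoted in the review):

```python
def final_epoch_tracked_by_default(epochs: int, tracking_n: int) -> bool:
    """True when the final epoch index (epochs - 1) is a regularly
    tracked epoch, i.e. the tracker would contain it even without the
    is_final_epoch fix."""
    return (epochs - 1) % tracking_n == 0

# Default case from the review: epoch 24 is tracked anyway, so the
# test passes with or without the fix.
assert final_epoch_tracked_by_default(25, 2)

# Suggested tighter case: epoch 29 is only in the tracker because of
# the is_final_epoch guard, so this setting actually exercises the fix.
assert not final_epoch_tracked_by_default(30, 2)
```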
Summary
Three latent bugs in `src/microcalibrate/reweight.py`:

1. The dense training loop guarded its gradient step with `if i != max_epochs - 1` where `max_epochs = epochs - 1`, which actually skipped the penultimate epoch (`i == epochs - 2`) while still stepping on the final epoch. The returned `final_weights` therefore drifted one step away from the final tracked row. Every epoch now steps, and the tracker always ends on the final epoch, so logged estimates correspond to the pre-step state of the returned weights.
2. The sparse L0 loop computed `(l.item() - start_loss) / start_loss` unconditionally; when `start_loss` was ~0 (trivial/pre-calibrated data with `l0_lambda=0`) this raised `ZeroDivisionError` inside the `tqdm` postfix. A magnitude guard now short-circuits to `0.0`.
3. `np.log(original_weights + random_noise)` produced `-inf` (and downstream NaN gradients) whenever an initial weight was zero and `noise_level` was zero; the L0 branch hit the same issue even with nonzero noise because it logs the raw weights. Both call sites now clamp inputs to `>= 1e-12`.

Test plan
- `tests/test_reweight_regression.py` covering all three fixes: tracker includes the final epoch, N vs N-1 epochs produce different weights (every epoch steps), sparse loop does not crash when `start_loss == 0`, zero initial weights in the L0 path do not produce non-finite weights.
- `uv run pytest tests -x -q` -> 19 passed.

🤖 Generated with Claude Code