Conversation

@RektPunk (Contributor) commented Nov 10, 2025

Related to #7

Changes

  • Create construct_scorecard method
  • Fix get_leafs method

Requesting Feedback

  • Logic for leaf index mapping: a relative leaf index is generated within each tree (tree_index). This index is computed using groupby('tree_index').cumcount() on the leaf nodes extracted from trees_to_dataframe(), which keeps it aligned with the output of predict(pred_leaf=True); see the sketch below.
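
A minimal sketch of the mapping (illustrative names; assumes a fitted LGBMClassifier). As the review comment below notes, cumcount() relies on the leaf rows of trees_to_dataframe() appearing in leaf-index order:

```python
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=42)
model = lgb.LGBMClassifier(n_estimators=3, random_state=42).fit(X, y)

tree_df = model.booster_.trees_to_dataframe()
# Leaf rows have no split_feature; keep tree/node identifiers and leaf values
leaf_nodes = tree_df[tree_df["split_feature"].isna()][
    ["tree_index", "node_index", "value"]
].copy()
# Relative leaf index within each tree, matching predict(pred_leaf=True)
leaf_nodes["relative_leaf_index"] = leaf_nodes.groupby("tree_index").cumcount()

leaf_preds = model.predict(X, pred_leaf=True)  # shape: (n_samples, n_trees)
assert leaf_preds.max() <= leaf_nodes["relative_leaf_index"].max()
```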

References

https://stackoverflow.com/questions/78416214/how-many-trees-do-i-actually-have-in-my-lightgbm-model

Summary by Sourcery

Implement scorecard construction in the LightGBM constructor, fix leaf extraction and indexing logic, and enhance the scorecard with observation percentage metrics

New Features:

  • Implement construct_scorecard to generate per-tree scorecards with event counts, rates, WOE, IV, and evidence contributions (the WOE/IV definitions are sketched below)
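
For reference, these are the conventional credit-scoring definitions; a minimal sketch, assuming a per-leaf binning table with Events and NonEvents columns (the sign convention used by xbooster's own calculate_weight_of_evidence helper may differ):

```python
import numpy as np
import pandas as pd

def add_woe_iv(binning: pd.DataFrame) -> pd.DataFrame:
    """Add conventional WOE/IV columns: WOE_i = ln(%NonEvents_i / %Events_i)."""
    out = binning.copy()
    pct_events = out["Events"] / out["Events"].sum()
    pct_non_events = out["NonEvents"] / out["NonEvents"].sum()
    out["WOE"] = np.log(pct_non_events / pct_events)
    out["IV"] = (pct_non_events - pct_events) * out["WOE"]
    return out
```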

Bug Fixes:

  • Simplify get_leafs by using predict(start_iteration, num_iteration) and correct the base_score subtraction (see the sketch after this list)
  • Fix extract_leaf_weights to assign a relative leaf index within each tree
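
For context, a minimal sketch of the single-predict pattern this fix relies on, plus the margin-summation property verified later in this thread (illustrative names; assumes a fitted LGBMClassifier):

```python
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=42)
model = lgb.LGBMClassifier(n_estimators=5, random_state=42).fit(X, y)

# One raw-margin vector per tree via start_iteration/num_iteration
per_tree = np.column_stack([
    model.predict(X, raw_score=True, start_iteration=i, num_iteration=1)
    for i in range(model.booster_.num_trees())
])
# Per-tree margins sum to the total raw prediction (no extra base_score term)
assert np.allclose(per_tree.sum(axis=1), model.predict(X, raw_score=True))
```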

Enhancements:

  • Add CountPct calculation to report the percentage of observations per split

@sourcery-ai bot commented Nov 10, 2025

Reviewer's Guide

This PR simplifies tree contribution extraction, enhances leaf indexing in extract_leaf_weights, and fully implements construct_scorecard to generate a detailed LightGBM scorecard with event binning and WOE/IV metrics.

Sequence diagram for construct_scorecard workflow

```mermaid
sequenceDiagram
    participant lgb_constructor
    participant LightGBMBooster
    participant pd_DataFrame
    participant Utils

    lgb_constructor->>LightGBMBooster: predict(X, pred_leaf=True)
    lgb_constructor->>pd_DataFrame: binning by leaf_idx and label
    lgb_constructor->>lgb_constructor: extract_leaf_weights()
    lgb_constructor->>pd_DataFrame: merge leaf weights and binning table
    lgb_constructor->>Utils: calculate_weight_of_evidence(scorecard)
    lgb_constructor->>Utils: calculate_information_value(scorecard)
    lgb_constructor->>pd_DataFrame: finalize scorecard structure
    lgb_constructor-->>caller: return lgb_scorecard
```

ER diagram for scorecard binning and leaf weights

```mermaid
erDiagram
    SCORECARD {
        int Tree
        int Node
        string Feature
        string Sign
        float Split
        int Count
        float CountPct
        int NonEvents
        int Events
        float EventRate
        float WOE
        float IV
        float XAddEvidence
    }
    LEAF_WEIGHTS {
        int Tree
        int Node
        float XAddEvidence
    }
    BINNING_TABLE {
        int tree
        int leaf_idx
        int Events
        int NonEvents
        int Count
        float EventRate
    }
    LEAF_WEIGHTS ||--o| SCORECARD : "merges into"
    BINNING_TABLE ||--o| SCORECARD : "merges into"
```

Class diagram for updated lgb_constructor methods

```mermaid
classDiagram
    class lgb_constructor {
        +get_leafs(X)
        +extract_leaf_weights() pd.DataFrame
        +construct_scorecard() pd.DataFrame
        +create_points(...)
        -base_score
        -model
        -booster_
        -X
        -y
        -lgb_scorecard
    }

    class pd_DataFrame
    class calculate_weight_of_evidence
    class calculate_information_value

    lgb_constructor --> pd_DataFrame : returns
    lgb_constructor --> calculate_weight_of_evidence : uses
    lgb_constructor --> calculate_information_value : uses
```

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Refactor get_leafs to streamline tree margin computation | Replaced incremental raw_score calls with a single predict using start_iteration and num_iteration; removed manual cumulative score differencing logic; conditionally adjusted base_score application for the first tree | xbooster/lgb_constructor.py |
| Enhance extract_leaf_weights with relative leaf indexing | Added relative_leaf_index via groupby.cumcount() per tree; updated merge_and_format to map Node from relative_leaf_index | xbooster/lgb_constructor.py |
| Implement construct_scorecard end-to-end | Predicted leaf indices and validated output shape; built per-tree binning tables with Events, NonEvents, Count, and EventRate; merged binning data with XAddEvidence from extract_leaf_weights; sorted results and calculated WOE, IV, and CountPct; returned the finalized lgb_scorecard with ordered columns | xbooster/lgb_constructor.py |

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.


@sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `xbooster/lgb_constructor.py:212-216` </location>
<code_context>
             ["tree_index", "node_index", "value"]
         ].copy()
+        # Make leaf index relative within each tree
+        leaf_nodes["relative_leaf_index"] = leaf_nodes.groupby("tree_index").cumcount()

         # Helper function to merge decision nodes with leaf values
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Using cumcount for relative_leaf_index assumes leaf_nodes are sorted as intended.

Ensure leaf_nodes is sorted correctly within each tree before using cumcount to prevent incorrect relative indices.

```suggestion
        leaf_nodes = tree_df[tree_df["split_feature"].isna()][
            ["tree_index", "node_index", "value"]
        ].copy()
        # Ensure leaf_nodes are sorted by tree_index and node_index before assigning relative_leaf_index
        leaf_nodes = leaf_nodes.sort_values(["tree_index", "node_index"]).reset_index(drop=True)
        leaf_nodes["relative_leaf_index"] = leaf_nodes.groupby("tree_index").cumcount()
```
</issue_to_address>

### Comment 2
<location> `xbooster/lgb_constructor.py:281-291` </location>
<code_context>
+                axis=1,
+            )
+            # Create a binning table
+            binning_table = (
+                index_and_label.groupby("leaf_idx").agg(["sum", "count"]).reset_index()
+            ).astype(float)
+            binning_table.columns = ["leaf_idx", "Events", "Count"]  # type: ignore
+            binning_table["tree"] = i
+            binning_table["NonEvents"] = binning_table["Count"] - binning_table["Events"]
</code_context>

<issue_to_address>
**suggestion:** Directly renaming columns after groupby/agg may be fragile if the aggregation changes.

To avoid issues if aggregation changes, assign column names based on the aggregation output or reference columns by name rather than relying on order.

```suggestion
            # Create a binning table
            binning_table = (
                index_and_label.groupby("leaf_idx").agg({"label": ["sum", "count"]}).reset_index()
            ).astype(float)
            # Flatten MultiIndex columns
            binning_table.columns = ["leaf_idx", "Events", "Count"]
            binning_table["tree"] = i
            binning_table["NonEvents"] = binning_table["Count"] - binning_table["Events"]
            binning_table["EventRate"] = binning_table["Events"] / binning_table["Count"]
            binning_table = binning_table[
                ["tree", "leaf_idx", "Events", "NonEvents", "Count", "EventRate"]
            ]
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@xRiskLab (Owner) commented:

Task: Verify Base Score and Margin Summation

We need to verify that LightGBM's get_leafs() correctly returns per-tree margins that sum to the total raw prediction, similar to the verification approach from Issue #2.

What to verify:

  1. Per-tree margins from get_leafs() sum correctly to the model's total raw prediction
  2. The base_score calculation (np.log(y.mean() / (1 - y.mean()))) is appropriate for LightGBM
  3. Add a test using a "two-tree trick" to isolate and verify the base margin behavior

Current status:

  • Fixed get_leafs() to return margins without incorrect base_score subtraction
  • Example validates that margins sum correctly: np.allclose(margin_sum, raw_pred) (see the sketch below)
  • All tests pass
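
A minimal sketch of the two-tree check, assuming default sklearn-API training (boost_from_average enabled); the log-odds base_score is computed for reference only, since the margins already sum without it:

```python
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=0)
model = lgb.LGBMClassifier(n_estimators=2, random_state=0).fit(X, y)

raw = model.predict(X, raw_score=True)
tree0 = model.predict(X, raw_score=True, start_iteration=0, num_iteration=1)
tree1 = model.predict(X, raw_score=True, start_iteration=1, num_iteration=1)
assert np.allclose(tree0 + tree1, raw)  # margins sum; no separate base term

# The constructor's base score (event-rate log odds), for reference:
base_score = np.log(y.mean() / (1 - y.mean()))
```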

Recommended before merge:

  • Add an explicit test that validates margin summation (similar to the two-tree test from Issue #2, "Base Score - Wrong scale")
  • Document how LightGBM handles base predictions in tree structure
  • Verify that construct_scorecard() correctly interprets the margin values

- Remove incorrect base_score subtraction from get_leafs()
- Add two-tree test to validate margin summation (similar to Issue #2)
- Update documentation explaining LightGBM margin behavior
- Simplify example to show correct margin verification

Fixes margin calculation where tree predictions now correctly sum to total raw score without additional base_score adjustment.
@xRiskLab (Owner) commented:

✅ Base Score Verification - COMPLETED

The verification tasks have been completed in commit 8c220f7:

What was fixed:

  1. ✅ Removed incorrect base_score subtraction from get_leafs()
  2. ✅ Added test_two_tree_base_score_validation() - the two-tree trick test
  3. ✅ Enhanced documentation in get_leafs() and __init__ explaining LightGBM margin behavior
  4. ✅ Simplified example to show correct margin verification

Test results:

  • All 11 LightGBM tests passing
  • New two-tree test validates:
    • Per-tree margins sum to total raw prediction ✓
    • Constructor's get_leafs() matches LightGBM API ✓
    • Base score calculation matches log-odds ✓

What to review next:
@RektPunk - Please review the new commit to ensure the fix aligns with your original implementation. The key change is that LightGBM's per-tree predictions already sum correctly without needing to add/subtract base_score separately (unlike XGBoost).

You can run the tests with: uv run pytest tests/test_lgb_constructor.py::test_two_tree_base_score_validation -v

@RektPunk (Contributor, Author) commented Nov 12, 2025

Thank you for creating excellent tests and examples, @xRiskLab. I really appreciate it!!

I am wondering if removing the base score from the first tree is compatible with the xBooster logic that was fixed in this commit: 8c220f7#diff-918f8f30807774e4c31ebfadd85846f60849cb5abe698e612e6be3a0f35755b9L174.

I had thought the base score was added to the first tree, as referenced here: microsoft/LightGBM#3058 (comment).

- Add note in get_leafs() distinguishing sklearn API from internal booster
- Reference LightGBM issue #3058 for internal booster behavior
- Add lightgbm-getting-started.ipynb demonstrating the difference empirically
@xRiskLab (Owner) commented:

After looking at the ticket, I think the topic concerns the internal LightGBM booster object, which works differently from the sklearn API.

From what I understand, there is no initial score like the one the sklearn API uses, which is why we see such huge values in the first tree (when using the internal booster with init_score).

If we use the internal LightGBM booster object, we can actually see the deviation from the log odds we pass (but the parameters must match to obtain the same results).

Proposal: add this note for the sklearn API and reference the GH ticket in the function call.

I was able to reproduce the behavior in the notebook lightgbm-getting-started.ipynb (just added in commit 54b27a8). Please have a look and let me know if this aligns with the expected behavior.
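
A minimal sketch of the contrast the notebook reproduces. Assumptions: a synthetic binary target, matching default parameters on both APIs, and LightGBM's documented behavior that init_score is used during training but not stored in the model (so it must be added back to raw predictions):

```python
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=0)
log_odds = np.log(y.mean() / (1 - y.mean()))

# Internal booster, passing the log odds explicitly via init_score
train_set = lgb.Dataset(X, label=y, init_score=np.full(len(y), log_odds))
booster = lgb.train(
    {"objective": "binary", "verbosity": -1}, train_set, num_boost_round=10
)
raw_internal = booster.predict(X, raw_score=True)  # init_score NOT included

# sklearn API: boost_from_average folds the average into the trees
model = lgb.LGBMClassifier(n_estimators=10, random_state=0).fit(X, y)
raw_sklearn = model.predict(X, raw_score=True)

# Adding the init score back should reconcile the two (parameters must match)
print(np.allclose(raw_internal + log_odds, raw_sklearn))
```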

@RektPunk (Contributor, Author) commented:

Yes, your analysis is spot on. You have clearly identified the difference in behavior between the internal LightGBM booster object and the initial score handling in the scikit-learn API. The behavior reproduced in the notebook from commit 54b27a8 aligns with the expected behavior. I totally agree with your proposal. Tysm @xRiskLab

@RektPunk (Contributor, Author) commented:

Hi @xRiskLab, could you please confirm whether any further changes are needed? If not, is it ready to merge now?

@xRiskLab (Owner) commented:

🎉 Release Candidate 0.2.7rc1 Ready

This PR has been extended with complete LightGBM scorecard implementation:

✅ Completed

  • Critical Bug Fix: Fixed leaf ID mapping in extract_leaf_weights() (55% Gini improvement: 0.40 → 0.90)
  • Full Scorecard Support: Implemented create_points(), predict_score(), and predict_scores() (a hypothetical usage sketch follows this list)
  • Base Score Normalization: Proper handling with use_base_score parameter
  • Documentation: Updated README and CHANGELOG with comprehensive examples
  • Testing: All 106 tests passing
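
For orientation, a hypothetical usage sketch of the flow these methods enable. The constructor class name and call signatures are assumptions patterned on xbooster's XGBoost constructor, not confirmed interfaces; only the method names and the use_base_score parameter come from this PR:

```python
# Hypothetical flow; the class name and signatures are illustrative assumptions.
import lightgbm as lgb
from xbooster import lgb_constructor  # module name from this PR's diff

model = lgb.LGBMClassifier(n_estimators=100).fit(X_train, y_train)
constructor = lgb_constructor.LGBMScorecardConstructor(model, X_train, y_train)

scorecard = constructor.construct_scorecard()            # per-leaf WOE/IV table
points = constructor.create_points(use_base_score=True)  # scorecard points
scores = constructor.predict_scores(X_test)              # total score per row
```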

📊 Performance

  • Scorecard Gini: 0.9020
  • Model Gini: 0.9021
  • Discrimination preserved ✅

📝 Next Steps

Ready for pre-release as v0.2.7rc1 and PyPI publication.

cc @RektPunk - Your initial implementation was the foundation for this complete solution!

xRiskLab added a commit that referenced this pull request Nov 23, 2025
- Fix critical leaf ID mapping bug in extract_leaf_weights()
- Implement create_points() with proper base_score normalization
- Implement predict_score() and predict_scores()
- Remove WOE support from create_points()
- Update documentation (README and CHANGELOG)
- All 106 tests passing

Related to PR #8
xRiskLab merged commit 9bd287b into xRiskLab:main on Nov 23, 2025 (9 of 10 checks passed).
@xRiskLab (Owner) commented:

⚠️ Update: Release v0.2.7rc2 (Complete Implementation)

Issue Found: The squash merge only included the original PR commits, missing the complete implementation.

Resolution: Released v0.2.7rc2 with the complete implementation:

  • create_points() with use_base_score parameter
  • predict_score() and predict_scores()
  • ✅ Critical leaf ID mapping fix
  • ✅ All 106 tests passing

📦 Installation

pip install xbooster==0.2.7rc2
