[Feature] Create construct_scorecard and fix get_leafs in lgb_constructor
#8
Conversation
Reviewer's Guide
This PR simplifies tree contribution extraction, enhances leaf indexing in extract_leaf_weights, and fully implements construct_scorecard to generate a detailed LightGBM scorecard with event binning and WOE/IV metrics.

Sequence diagram for construct_scorecard workflow
```mermaid
sequenceDiagram
    participant lgb_constructor
    participant LightGBMBooster
    participant pd_DataFrame
    participant Utils
    lgb_constructor->>LightGBMBooster: predict(X, pred_leaf=True)
    lgb_constructor->>pd_DataFrame: binning by leaf_idx and label
    lgb_constructor->>lgb_constructor: extract_leaf_weights()
    lgb_constructor->>pd_DataFrame: merge leaf weights and binning table
    lgb_constructor->>Utils: calculate_weight_of_evidence(scorecard)
    lgb_constructor->>Utils: calculate_information_value(scorecard)
    lgb_constructor->>pd_DataFrame: finalize scorecard structure
    lgb_constructor-->>caller: return lgb_scorecard
```
ER diagram for scorecard binning and leaf weights
```mermaid
erDiagram
    SCORECARD {
        int Tree
        int Node
        string Feature
        string Sign
        float Split
        int Count
        float CountPct
        int NonEvents
        int Events
        float EventRate
        float WOE
        float IV
        float XAddEvidence
    }
    LEAF_WEIGHTS {
        int Tree
        int Node
        float XAddEvidence
    }
    BINNING_TABLE {
        int tree
        int leaf_idx
        int Events
        int NonEvents
        int Count
        float EventRate
    }
    LEAF_WEIGHTS ||--o| SCORECARD : "merges into"
    BINNING_TABLE ||--o| SCORECARD : "merges into"
```
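Since the SCORECARD entity carries WOE and IV fields, the standard definitions are sketched below for reference; the repo's calculate_weight_of_evidence and calculate_information_value utilities may differ in sign convention or smoothing.

```python
# Standard per-bin WOE and total IV definitions (a reference sketch only;
# the repo's utils may use a different sign convention or smooth empty bins).
import numpy as np

def woe_iv(events: np.ndarray, non_events: np.ndarray):
    dist_events = events / events.sum()              # share of events per bin
    dist_non_events = non_events / non_events.sum()  # share of non-events per bin
    woe = np.log(dist_non_events / dist_events)      # per-bin weight of evidence
    iv = float(((dist_non_events - dist_events) * woe).sum())  # total information value
    return woe, iv

woe, iv = woe_iv(np.array([20.0, 5.0]), np.array([80.0, 95.0]))
```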
Class diagram for updated lgb_constructor methods
```mermaid
classDiagram
    class lgb_constructor {
        +get_leafs(X)
        +extract_leaf_weights() pd.DataFrame
        +construct_scorecard() pd.DataFrame
        +create_points(...)
        -base_score
        -model
        -booster_
        -X
        -y
        -lgb_scorecard
    }
    class pd_DataFrame
    class calculate_weight_of_evidence
    class calculate_information_value
    lgb_constructor --> pd_DataFrame : returns
    lgb_constructor --> calculate_weight_of_evidence : uses
    lgb_constructor --> calculate_information_value : uses
```
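For orientation, a hypothetical usage sketch based only on the diagram above; the class name, constructor signature, and import path are assumptions, not the verified xbooster API.

```python
# Hypothetical usage sketch based only on the class diagram above; the class
# name, constructor signature, and import path are assumptions, not the
# verified xbooster API.
import lightgbm as lgb
from sklearn.datasets import make_classification
from xbooster import lgb_constructor  # module name per the PR title

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = lgb.LGBMClassifier(n_estimators=10, max_depth=2).fit(X, y)

constructor = lgb_constructor.LGBScorecardConstructor(model, X, y)  # class name assumed
scorecard = constructor.construct_scorecard()  # pd.DataFrame per the diagram
print(scorecard.head())
```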
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `xbooster/lgb_constructor.py:212-216` </location>
<code_context>
["tree_index", "node_index", "value"]
].copy()
+ # Make leaf index relative within each tree
+ leaf_nodes["relative_leaf_index"] = leaf_nodes.groupby("tree_index").cumcount()
# Helper function to merge decision nodes with leaf values
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Using cumcount for relative_leaf_index assumes leaf_nodes are sorted as intended.
Ensure leaf_nodes is sorted correctly within each tree before using cumcount to prevent incorrect relative indices.
```suggestion
leaf_nodes = tree_df[tree_df["split_feature"].isna()][
["tree_index", "node_index", "value"]
].copy()
# Ensure leaf_nodes are sorted by tree_index and node_index before assigning relative_leaf_index
leaf_nodes = leaf_nodes.sort_values(["tree_index", "node_index"]).reset_index(drop=True)
leaf_nodes["relative_leaf_index"] = leaf_nodes.groupby("tree_index").cumcount()
```
</issue_to_address>
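To make the risk concrete, a tiny standalone demo (hypothetical data, not the repo's code) showing that cumcount-based relative indices depend on row order:

```python
# Tiny standalone demo (hypothetical data): cumcount() numbers rows in their
# current order, so the same leaf gets a different relative index unless the
# frame is sorted first.
import pandas as pd

leaves = pd.DataFrame({
    "tree_index": [0, 0, 1, 1],
    "node_index": ["0-L1", "0-L0", "1-L0", "1-L1"],
})
leaves["rel_unsorted"] = leaves.groupby("tree_index").cumcount()
leaves = leaves.sort_values(["tree_index", "node_index"]).reset_index(drop=True)
leaves["rel_sorted"] = leaves.groupby("tree_index").cumcount()
print(leaves)  # "0-L0" has rel_unsorted=1 but rel_sorted=0
```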
### Comment 2
<location> `xbooster/lgb_constructor.py:281-291` </location>
<code_context>
+ axis=1,
+ )
+ # Create a binning table
+ binning_table = (
+ index_and_label.groupby("leaf_idx").agg(["sum", "count"]).reset_index()
+ ).astype(float)
+ binning_table.columns = ["leaf_idx", "Events", "Count"] # type: ignore
+ binning_table["tree"] = i
+ binning_table["NonEvents"] = binning_table["Count"] - binning_table["Events"]
</code_context>
<issue_to_address>
**suggestion:** Directly renaming columns after groupby/agg may be fragile if the aggregation changes.
To avoid issues if aggregation changes, assign column names based on the aggregation output or reference columns by name rather than relying on order.
```suggestion
# Create a binning table
binning_table = (
index_and_label.groupby("leaf_idx").agg({"label": ["sum", "count"]}).reset_index()
).astype(float)
# Flatten MultiIndex columns
binning_table.columns = ["leaf_idx", "Events", "Count"]
binning_table["tree"] = i
binning_table["NonEvents"] = binning_table["Count"] - binning_table["Events"]
binning_table["EventRate"] = binning_table["Events"] / binning_table["Count"]
binning_table = binning_table[
["tree", "leaf_idx", "Events", "NonEvents", "Count", "EventRate"]
]
```
</issue_to_address>
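Going one step further than the suggestion above, pandas named aggregation binds each output column to its aggregation explicitly, so the labels cannot drift if aggregations are added or reordered. A sketch on hypothetical data, not the repo's code:

```python
# Alternative sketch (hypothetical data, not the repo's code): named
# aggregation ties each output column name directly to its aggregation.
import pandas as pd

index_and_label = pd.DataFrame({"leaf_idx": [0, 0, 1, 1, 1], "label": [1, 0, 1, 0, 0]})
binning_table = (
    index_and_label.groupby("leaf_idx")
    .agg(Events=("label", "sum"), Count=("label", "count"))
    .reset_index()
    .astype(float)
)
binning_table["NonEvents"] = binning_table["Count"] - binning_table["Events"]
binning_table["EventRate"] = binning_table["Events"] / binning_table["Count"]
```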
Task: Verify Base Score and Margin Summation
We need to verify that LightGBM's base score handling and margin summation behave as assumed.
What to verify:
Current status:
Recommended before merge:
- Remove incorrect base_score subtraction from get_leafs()
- Add two-tree test to validate margin summation (similar to Issue #2)
- Update documentation explaining LightGBM margin behavior
- Simplify example to show correct margin verification

Fixes margin calculation: tree predictions now correctly sum to the total raw score without an additional base_score adjustment.
✅ Base Score Verification - COMPLETED
The verification tasks have been completed in the commit above.
What was fixed:
Test results:
What to review next:
You can run the tests with:
Thank you for creating excellent tests and examples, @xRiskLab. I really appreciate it! I am wondering whether removing the base score from the first tree is compatible with the referenced behavior: I had thought the base score was added to the first tree, as discussed here: microsoft/LightGBM#3058 (comment).
- Add note in get_leafs() distinguishing sklearn API from internal booster
- Reference LightGBM issue #3058 for internal booster behavior
- Add lightgbm-getting-started.ipynb demonstrating the difference empirically
After looking at the ticket, I think the topic concerns the internal LightGBM booster object, which works differently from the sklearn API. From what I understand, there is no initial score of the kind the sklearn API uses, which is why we see such huge values in the first tree (when using the internal booster with init_score). If we use the internal LightGBM object, we can actually see the deviation from the log odds we pass (but the parameters must match to obtain the same results). Proposal: add this note for the sklearn API and reference the GH ticket in the function call. I was able to reproduce the behavior in the notebook.
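A minimal reproduction sketch of that behavior (synthetic data; not the repo's notebook or tests): with the sklearn API, summing per-tree leaf values recovered via pred_leaf=True should match the raw margin with no separate base_score term, which is exactly what the commit above asserts.

```python
# Minimal sketch (synthetic data; not the repo's notebook or tests): per-tree
# leaf values recovered via pred_leaf=True should sum to the raw margin with
# no separate base_score term under the sklearn API.
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
booster = lgb.LGBMClassifier(n_estimators=2, max_depth=2, random_state=42).fit(X, y).booster_

raw = booster.predict(X, raw_score=True)       # total margin per sample
leaf_idx = booster.predict(X, pred_leaf=True)  # shape (n_samples, n_trees)

tree_df = booster.trees_to_dataframe()
leaves = tree_df[tree_df["split_feature"].isna()]
# leaf node_index looks like "0-L3"; key leaf values by (tree, leaf number)
value_of = {
    (r.tree_index, int(r.node_index.split("L")[-1])): r.value
    for r in leaves.itertuples()
}
summed = np.array([
    sum(value_of[(t, leaf_idx[i, t])] for t in range(leaf_idx.shape[1]))
    for i in range(len(X))
])
assert np.allclose(raw, summed)  # margins match without adding base_score
```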
Yes, your analysis is spot on. You have clearly identified the difference in behavior between the internal LightGBM booster object and the initial score handling in the scikit-learn API. The behavior reproduced in the notebook from commit 54b27a8 matches what we expect. I totally agree with your proposal. Tysm @xRiskLab
Hi @xRiskLab, could you please confirm whether any further changes are needed? If not, is this ready to merge?
🎉 Release Candidate 0.2.7rc1 Ready
This PR has been extended with a complete LightGBM scorecard implementation:
✅ Completed
📊 Performance
📝 Next Steps
Ready for pre-release as v0.2.7rc1 and PyPI publication.
cc @RektPunk - your initial implementation was the foundation for this complete solution!
- Fix critical leaf ID mapping bug in extract_leaf_weights()
- Implement create_points() with proper base_score normalization
- Implement predict_score() and predict_scores()
- Remove WOE support from create_points()
- Update documentation (README and CHANGELOG)
- All 106 tests passing

Related to PR #8
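For context on the points conversion mentioned in that commit, a generic PDO scaling sketch; this is standard credit-scorecard math, not necessarily what create_points() implements, and base_points, base_odds, and pdo are illustrative parameters:

```python
# Generic PDO scaling sketch (standard scorecard math; not necessarily what
# create_points() implements; base_points, base_odds, and pdo are
# illustrative parameters, not the repo's defaults).
import numpy as np

def margin_to_points(margin, base_points=600.0, base_odds=50.0, pdo=20.0):
    factor = pdo / np.log(2)                          # points per doubling of odds
    offset = base_points - factor * np.log(base_odds)
    return offset - factor * margin                   # higher event log-odds, fewer points

print(round(margin_to_points(0.0), 1))
```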
Related to #7
Changes
- construct_scorecard method
- get_leafs method

Requesting Feedback
References
https://stackoverflow.com/questions/78416214/how-many-trees-do-i-actually-have-in-my-lightgbm-model
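Regarding the Stack Overflow question above, a quick standalone check (assumed setup): Booster.num_trees() counts every tree, which for multiclass models is n_estimators times n_classes.

```python
# Quick standalone check for the referenced question: num_trees() counts all
# trees, so a multiclass model holds n_estimators * n_classes of them.
import lightgbm as lgb
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = lgb.LGBMClassifier(n_estimators=10).fit(X, y)
print(clf.booster_.num_trees())  # 30: 10 iterations x 3 classes
```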
Summary by Sourcery
Implement scorecard construction in the LightGBM constructor, fix leaf extraction and indexing logic, and enhance the scorecard with observation percentage metrics
New Features:
Bug Fixes:
Enhancements: