[Feature] Create construct_scorecard and fix get_leafs in lgb_constructor
#8
Conversation
Reviewer's Guide
This PR simplifies tree contribution extraction, enhances leaf indexing in extract_leaf_weights, and fully implements construct_scorecard to generate a detailed LightGBM scorecard with event binning and WOE/IV metrics.

Sequence diagram for construct_scorecard workflow
```mermaid
sequenceDiagram
    participant lgb_constructor
    participant LightGBMBooster
    participant pd_DataFrame
    participant Utils
    lgb_constructor->>LightGBMBooster: predict(X, pred_leaf=True)
    lgb_constructor->>pd_DataFrame: binning by leaf_idx and label
    lgb_constructor->>lgb_constructor: extract_leaf_weights()
    lgb_constructor->>pd_DataFrame: merge leaf weights and binning table
    lgb_constructor->>Utils: calculate_weight_of_evidence(scorecard)
    lgb_constructor->>Utils: calculate_information_value(scorecard)
    lgb_constructor->>pd_DataFrame: finalize scorecard structure
    lgb_constructor-->>caller: return lgb_scorecard
```
ER diagram for scorecard binning and leaf weights
```mermaid
erDiagram
    SCORECARD {
        int Tree
        int Node
        string Feature
        string Sign
        float Split
        int Count
        float CountPct
        int NonEvents
        int Events
        float EventRate
        float WOE
        float IV
        float XAddEvidence
    }
    LEAF_WEIGHTS {
        int Tree
        int Node
        float XAddEvidence
    }
    BINNING_TABLE {
        int tree
        int leaf_idx
        int Events
        int NonEvents
        int Count
        float EventRate
    }
    LEAF_WEIGHTS ||--o| SCORECARD : "merges into"
    BINNING_TABLE ||--o| SCORECARD : "merges into"
```
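Since the SCORECARD entity carries WOE and IV fields, the standard definitions are sketched below for reference; the repo's calculate_weight_of_evidence and calculate_information_value utilities may differ in sign convention or smoothing.

```python
# Standard per-bin WOE and total IV definitions (a reference sketch only;
# the repo's utils may use a different sign convention or smooth empty bins).
import numpy as np

def woe_iv(events: np.ndarray, non_events: np.ndarray):
    dist_events = events / events.sum()              # share of events per bin
    dist_non_events = non_events / non_events.sum()  # share of non-events per bin
    woe = np.log(dist_non_events / dist_events)      # per-bin weight of evidence
    iv = float(((dist_non_events - dist_events) * woe).sum())  # total information value
    return woe, iv

woe, iv = woe_iv(np.array([20.0, 5.0]), np.array([80.0, 95.0]))
```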
Class diagram for updated lgb_constructor methods
```mermaid
classDiagram
    class lgb_constructor {
        +get_leafs(X)
        +extract_leaf_weights() pd.DataFrame
        +construct_scorecard() pd.DataFrame
        +create_points(...)
        -base_score
        -model
        -booster_
        -X
        -y
        -lgb_scorecard
    }
    class pd_DataFrame
    class calculate_weight_of_evidence
    class calculate_information_value
    lgb_constructor --> pd_DataFrame : returns
    lgb_constructor --> calculate_weight_of_evidence : uses
    lgb_constructor --> calculate_information_value : uses
```
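For orientation, a hypothetical usage sketch based only on the diagram above; the class name, constructor signature, and import path are assumptions, not the verified xbooster API.

```python
# Hypothetical usage sketch based only on the class diagram above; the class
# name, constructor signature, and import path are assumptions, not the
# verified xbooster API.
import lightgbm as lgb
from sklearn.datasets import make_classification
from xbooster import lgb_constructor  # module name per the PR title

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = lgb.LGBMClassifier(n_estimators=10, max_depth=2).fit(X, y)

constructor = lgb_constructor.LGBScorecardConstructor(model, X, y)  # class name assumed
scorecard = constructor.construct_scorecard()  # pd.DataFrame per the diagram
print(scorecard.head())
```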
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `xbooster/lgb_constructor.py:212-216` </location>
<code_context>
["tree_index", "node_index", "value"]
].copy()
+ # Make leaf index relative within each tree
+ leaf_nodes["relative_leaf_index"] = leaf_nodes.groupby("tree_index").cumcount()
# Helper function to merge decision nodes with leaf values
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Using cumcount for relative_leaf_index assumes leaf_nodes are sorted as intended.
Ensure leaf_nodes is sorted correctly within each tree before using cumcount to prevent incorrect relative indices.
```suggestion
leaf_nodes = tree_df[tree_df["split_feature"].isna()][
["tree_index", "node_index", "value"]
].copy()
# Ensure leaf_nodes are sorted by tree_index and node_index before assigning relative_leaf_index
leaf_nodes = leaf_nodes.sort_values(["tree_index", "node_index"]).reset_index(drop=True)
leaf_nodes["relative_leaf_index"] = leaf_nodes.groupby("tree_index").cumcount()
```
</issue_to_address>
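To make the risk concrete, a tiny standalone demo (hypothetical data, not the repo's code) showing that cumcount-based relative indices depend on row order:

```python
# Tiny standalone demo (hypothetical data): cumcount() numbers rows in their
# current order, so the same leaf gets a different relative index unless the
# frame is sorted first.
import pandas as pd

leaves = pd.DataFrame({
    "tree_index": [0, 0, 1, 1],
    "node_index": ["0-L1", "0-L0", "1-L0", "1-L1"],
})
leaves["rel_unsorted"] = leaves.groupby("tree_index").cumcount()
leaves = leaves.sort_values(["tree_index", "node_index"]).reset_index(drop=True)
leaves["rel_sorted"] = leaves.groupby("tree_index").cumcount()
print(leaves)  # "0-L0" has rel_unsorted=1 but rel_sorted=0
```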
### Comment 2
<location> `xbooster/lgb_constructor.py:281-291` </location>
<code_context>
+ axis=1,
+ )
+ # Create a binning table
+ binning_table = (
+ index_and_label.groupby("leaf_idx").agg(["sum", "count"]).reset_index()
+ ).astype(float)
+ binning_table.columns = ["leaf_idx", "Events", "Count"] # type: ignore
+ binning_table["tree"] = i
+ binning_table["NonEvents"] = binning_table["Count"] - binning_table["Events"]
</code_context>
<issue_to_address>
**suggestion:** Directly renaming columns after groupby/agg may be fragile if the aggregation changes.
To avoid issues if aggregation changes, assign column names based on the aggregation output or reference columns by name rather than relying on order.
```suggestion
# Create a binning table
binning_table = (
index_and_label.groupby("leaf_idx").agg({"label": ["sum", "count"]}).reset_index()
).astype(float)
# Flatten MultiIndex columns
binning_table.columns = ["leaf_idx", "Events", "Count"]
binning_table["tree"] = i
binning_table["NonEvents"] = binning_table["Count"] - binning_table["Events"]
binning_table["EventRate"] = binning_table["Events"] / binning_table["Count"]
binning_table = binning_table[
["tree", "leaf_idx", "Events", "NonEvents", "Count", "EventRate"]
]
```
</issue_to_address>
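Going one step further than the suggestion above, pandas named aggregation binds each output column to its aggregation explicitly, so the labels cannot drift if aggregations are added or reordered. A sketch on hypothetical data, not the repo's code:

```python
# Alternative sketch (hypothetical data, not the repo's code): named
# aggregation ties each output column name directly to its aggregation.
import pandas as pd

index_and_label = pd.DataFrame({"leaf_idx": [0, 0, 1, 1, 1], "label": [1, 0, 1, 0, 0]})
binning_table = (
    index_and_label.groupby("leaf_idx")
    .agg(Events=("label", "sum"), Count=("label", "count"))
    .reset_index()
    .astype(float)
)
binning_table["NonEvents"] = binning_table["Count"] - binning_table["Events"]
binning_table["EventRate"] = binning_table["Events"] / binning_table["Count"]
```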
Task: Verify Base Score and Margin Summation
We need to verify that LightGBM's base score handling and margin summation behave as assumed.
What to verify:
Current status:
Recommended before merge:
- Remove incorrect base_score subtraction from get_leafs()
- Add two-tree test to validate margin summation (similar to Issue #2)
- Update documentation explaining LightGBM margin behavior
- Simplify example to show correct margin verification

Fixes margin calculation: tree predictions now correctly sum to the total raw score without an additional base_score adjustment.
✅ Base Score Verification - COMPLETED
The verification tasks have been completed in the commit above.
What was fixed:
Test results:
What to review next:
You can run the tests with:
Thank you for creating excellent tests and examples, @xRiskLab. I really appreciate it! I am wondering whether removing the base score from the first tree is compatible with the referenced behavior: I had thought the base score was added to the first tree, as discussed here: microsoft/LightGBM#3058 (comment).
- Add note in get_leafs() distinguishing sklearn API from internal booster
- Reference LightGBM issue #3058 for internal booster behavior
- Add lightgbm-getting-started.ipynb demonstrating the difference empirically
After looking at the ticket, I think the topic concerns the internal LightGBM booster object, which works differently from the sklearn API. From what I understand, there is no initial score of the kind the sklearn API uses, which is why we see such huge values in the first tree (when using the internal booster with init_score). If we use the internal LightGBM object, we can actually see the deviation from the log odds we pass (but the parameters must match to obtain the same results). Proposal: add this note for the sklearn API and reference the GH ticket in the function call. I was able to reproduce the behavior in the notebook.
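A minimal reproduction sketch of that behavior (synthetic data; not the repo's notebook or tests): with the sklearn API, summing per-tree leaf values recovered via pred_leaf=True should match the raw margin with no separate base_score term, which is exactly what the commit above asserts.

```python
# Minimal sketch (synthetic data; not the repo's notebook or tests): per-tree
# leaf values recovered via pred_leaf=True should sum to the raw margin with
# no separate base_score term under the sklearn API.
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
booster = lgb.LGBMClassifier(n_estimators=2, max_depth=2, random_state=42).fit(X, y).booster_

raw = booster.predict(X, raw_score=True)       # total margin per sample
leaf_idx = booster.predict(X, pred_leaf=True)  # shape (n_samples, n_trees)

tree_df = booster.trees_to_dataframe()
leaves = tree_df[tree_df["split_feature"].isna()]
# leaf node_index looks like "0-L3"; key leaf values by (tree, leaf number)
value_of = {
    (r.tree_index, int(r.node_index.split("L")[-1])): r.value
    for r in leaves.itertuples()
}
summed = np.array([
    sum(value_of[(t, leaf_idx[i, t])] for t in range(leaf_idx.shape[1]))
    for i in range(len(X))
])
assert np.allclose(raw, summed)  # margins match without adding base_score
```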
Yes, your analysis is spot on. You have clearly identified the difference in behavior between the internal LightGBM booster object and the initial score handling in the scikit-learn API. The behavior reproduced in the notebook from commit 54b27a8 matches what we expect. I totally agree with your proposal. Tysm @xRiskLab
Hi @xRiskLab, could you please confirm whether any further changes are needed? If not, is this ready to merge?
🎉 Release Candidate 0.2.7rc1 Ready
This PR has been extended with a complete LightGBM scorecard implementation:
✅ Completed
📊 Performance
📝 Next Steps
Ready for pre-release as v0.2.7rc1 and PyPI publication.
cc @RektPunk - your initial implementation was the foundation for this complete solution!
- Fix critical leaf ID mapping bug in extract_leaf_weights()
- Implement create_points() with proper base_score normalization
- Implement predict_score() and predict_scores()
- Remove WOE support from create_points()
- Update documentation (README and CHANGELOG)
- All 106 tests passing

Related to PR #8
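For context on the points conversion mentioned in that commit, a generic PDO scaling sketch; this is standard credit-scorecard math, not necessarily what create_points() implements, and base_points, base_odds, and pdo are illustrative parameters:

```python
# Generic PDO scaling sketch (standard scorecard math; not necessarily what
# create_points() implements; base_points, base_odds, and pdo are
# illustrative parameters, not the repo's defaults).
import numpy as np

def margin_to_points(margin, base_points=600.0, base_odds=50.0, pdo=20.0):
    factor = pdo / np.log(2)                          # points per doubling of odds
    offset = base_points - factor * np.log(base_odds)
    return offset - factor * margin                   # higher event log-odds, fewer points

print(round(margin_to_points(0.0), 1))
```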
Related to #7
Changes
- construct_scorecard method
- get_leafs method

Requesting Feedback
References
https://stackoverflow.com/questions/78416214/how-many-trees-do-i-actually-have-in-my-lightgbm-model
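Regarding the Stack Overflow question above, a quick standalone check (assumed setup): Booster.num_trees() counts every tree, which for multiclass models is n_estimators times n_classes.

```python
# Quick standalone check for the referenced question: num_trees() counts all
# trees, so a multiclass model holds n_estimators * n_classes of them.
import lightgbm as lgb
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = lgb.LGBMClassifier(n_estimators=10).fit(X, y)
print(clf.booster_.num_trees())  # 30: 10 iterations x 3 classes
```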
Summary by Sourcery
Implement scorecard construction in the LightGBM constructor, fix leaf extraction and indexing logic, and enhance the scorecard with observation percentage metrics
New Features:
Bug Fixes:
Enhancements: