-
Notifications
You must be signed in to change notification settings - Fork 14
[Refactor] make faster construct_scorecard method for xbooster/lgb_constructor.py
#10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's guide (collapsed on small PRs)Reviewer's GuideOptimizes construct_scorecard in xbooster/lgb_constructor.py by replacing per-tree looping and concatenation with a single vectorized Pandas/NumPy aggregation and updating the merge keys accordingly. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey - I've found 1 issue
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `xbooster/lgb_constructor.py:297-298` </location>
<code_context>
- # Aggregate indices, leafs, and counts of events and non-events
- df_binning_table = pd.concat([df_binning_table, binning_table], axis=0)
+ # Aggregate indices, leafs, and counts of events and non-events
+ tree_leaf_idx_long = pd.DataFrame(tree_leaf_idx).melt(var_name="Tree", value_name="Node")
+ tree_leaf_idx_long["label"] = np.tile(labels.values, tree_leaf_idx.shape[1])
+ binning_table = (
+ tree_leaf_idx_long.groupby(["Tree", "Node"])["label"]
</code_context>
<issue_to_address>
**issue (bug_risk):** This refactor changes label alignment semantics from index-based to purely positional, which may subtly alter behavior if `labels` has a non-trivial index.
The previous `pd.concat(..., axis=1)` + groupby relied on index-aware alignment between `labels` and `tree_leaf_idx`. Swapping to `labels.values` + `np.tile` assumes pure positional alignment and ignores the label index, so any non-default or filtered index on `labels` could yield silently incorrect event/non-event counts. If index alignment matters, either reset the index on `labels` before using `.values`, or construct the long DataFrame via a concat that preserves the index instead of manual tiling.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
construct_scorecard method for xbooster/lgb_constructor.pyconstruct_scorecard method for xbooster/lgb_constructor.py
|
Great work on these performance optimizations! All your changes from PRs #10, #11, #13, and #14 have been merged into the What's included:
Impact:
The RC branch ( Thank you for these excellent contributions! |
Description
This PR optimizes the
construct_scorecardmethod by replacing the iterative tree-by-tree processing with a vectorized approach using Pandas and NumPy.Previously, the code used a Python loop to iterate through each tree, performing individual grouping and concatenation. This created significant overhead, especially as the number of trees increased. The new implementation leverages
melt()and a singlegroupby()operation to process all trees simultaneously.Key Changes
Performance Benchmark
Measured using
%%timeiton the example ipynb:Summary by Sourcery
Optimize LightGBM scorecard construction by replacing per-tree looping with a vectorized aggregation over all trees and merging results directly on tree/node identifiers.
Enhancements: