[Refactor] make faster _convert_tree_to_points use np.vectorize instead of merge in xbooster/lgb_constructor.py
#11
base: main
Conversation
Reviewer's Guide

Flow diagram for optimized `_convert_tree_to_points` vectorized scoring:

```mermaid
flowchart TD
    A[input_X_DataFrame] --> B[get_leafs_X_leaf_indices]
    B --> C[get_shape_n_samples_n_trees]
    C --> D[init_points_matrix_zeros_n_samples_by_n_trees]
    C --> E[get_leaf_idx_values_from_X_leaf_indices]
    D --> F[for_each_tree_index_t_0_to_n_trees_minus_1]
    E --> F
    subgraph Per_tree_vectorized_mapping
        F --> G[filter_lgb_scorecard_with_points_where_Tree_equals_t]
        G --> H[build_mapping_dict_Node_to_Points]
        H --> I[apply_np_vectorize_mapping_to_leaf_idx_values_column_t]
        I --> J[assign_result_to_points_matrix_all_rows_column_t]
    end
    J --> K[after_loop_build_result_DataFrame_from_points_matrix]
    K --> L[set_result_index_to_X_index]
    L --> M[set_result_columns_to_Score_0_to_Score_n_trees_minus_1]
    M --> N[compute_total_Score_as_rowwise_sum_of_points_matrix]
    N --> O[add_Score_column_to_result]
    O --> P[return_result]
```
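The diagram above can be read as the following Python sketch. The standalone function signature and toy column layout are assumptions for illustration; in the PR the logic lives in a method on the constructor class and uses its `self.lgb_scorecard_with_points` attribute.

```python
import numpy as np
import pandas as pd


def convert_tree_to_points(X_leaf_indices: pd.DataFrame,
                           scorecard: pd.DataFrame) -> pd.DataFrame:
    """Map each sample's leaf index to points, one column per tree.

    Assumes `scorecard` has columns Tree, Node, Points, with Tree values
    0..n_trees-1 matching the column order of X_leaf_indices (the same
    positional assumption the PR makes).
    """
    n_samples, n_trees = X_leaf_indices.shape
    points_matrix = np.zeros((n_samples, n_trees))
    leaf_idx_values = X_leaf_indices.values
    for t in range(n_trees):
        # Filter the scorecard rows belonging to tree t
        tree_points = scorecard[scorecard["Tree"] == t]
        # Dictionary lookup instead of a per-tree pd.merge
        mapping = dict(zip(tree_points["Node"], tree_points["Points"]))
        points_matrix[:, t] = np.vectorize(mapping.get)(leaf_idx_values[:, t])
    result = pd.DataFrame(points_matrix,
                          index=X_leaf_indices.index,
                          columns=[f"Score_{t}" for t in range(n_trees)])
    # Total score is the row-wise sum over all trees
    result["Score"] = points_matrix.sum(axis=1)
    return result
```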
Hey - I've found 2 issues, and left some high level feedback:
- The new lookup uses `Tree == t` where `t` is a zero-based column index, whereas the previous version derived `tree_number` from the column name; please double-check that `lgb_scorecard_with_points['Tree']` uses the same zero-based indexing and ordering as `X_leaf_indices.columns` to avoid misaligned scores.
- The old implementation applied `.round(4)` to leaf indices before merging, but the new code does exact dictionary lookups on `leaf_idx_values`; if leaf/node identifiers are floats, you may need to preserve the rounding (or another normalization) to avoid missing matches.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new lookup uses `Tree == t` where `t` is a zero-based column index, whereas the previous version derived `tree_number` from the column name; please double-check that `lgb_scorecard_with_points['Tree']` uses the same zero-based indexing and ordering as `X_leaf_indices.columns` to avoid misaligned scores.
- The old implementation applied `.round(4)` to leaf indices before merging, but the new code does exact dictionary lookups on `leaf_idx_values`; if leaf/node identifiers are floats, you may need to preserve the rounding (or another normalization) to avoid missing matches.
## Individual Comments
### Comment 1
<location> `xbooster/lgb_constructor.py:500-502` </location>
<code_context>

```diff
+ n_samples, n_trees = X_leaf_indices.shape
+ points_matrix = np.zeros((n_samples, n_trees))
+ leaf_idx_values = X_leaf_indices.values
+ for t in range(n_trees):
      # Get points for this tree
-     subset_points_df = self.lgb_scorecard_with_points[
-         self.lgb_scorecard_with_points["Tree"] == int(tree_number)
-     ].copy()
-
-     # Merge leaf indices with points
-     merged_df = pd.merge(
-         X_leaf_indices[[col]].round(4),
-         subset_points_df[["Node", "Points"]],
-         left_on=col,
-         right_on="Node",
-         how="left",
-     )
-     result[f"Score_{tree_number}"] = merged_df["Points"]
+     tree_points = self.lgb_scorecard_with_points[
+         self.lgb_scorecard_with_points["Tree"] == t
+     ]
```

</code_context>
<issue_to_address>
**issue (bug_risk):** Using the loop index `t` as the `Tree` filter may be brittle if column order / naming changes.
Previously, `tree_number` came from the column name and was used directly in the `Tree` filter, so the mapping depended only on the naming convention. Now the code assumes the column order in `X_leaf_indices` matches `Tree == t` for `t in range(n_trees)`. If columns are reordered, some trees are missing, or `Tree` values are not `0..n_trees-1`, scores could be assigned to the wrong tree with no error. Consider restoring the name-based mapping or otherwise explicitly aligning column names to `Tree` values instead of relying on positional order.
</issue_to_address>
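One way to make the alignment explicit, as the comment suggests, is to derive the tree id from each column name rather than from position. This is a hypothetical sketch, not the PR's code; the `tree_<n>` column-naming convention is an assumption and would need to match whatever `X_leaf_indices` actually uses.

```python
import numpy as np
import pandas as pd


def points_aligned_by_name(X_leaf_indices: pd.DataFrame,
                           scorecard: pd.DataFrame) -> np.ndarray:
    """Align each leaf-index column to its Tree id via the column name,
    so reordered or missing tree columns cannot silently misalign scores."""
    points = np.zeros(X_leaf_indices.shape)
    for j, col in enumerate(X_leaf_indices.columns):
        # Assumed convention: a column named "tree_3" maps to Tree == 3
        tree_id = int(str(col).split("_")[-1])
        tree_points = scorecard[scorecard["Tree"] == tree_id]
        mapping = dict(zip(tree_points["Node"], tree_points["Points"]))
        points[:, j] = np.vectorize(mapping.get)(X_leaf_indices[col].values)
    return points
```

With this, shuffling the columns of `X_leaf_indices` changes only the column order of the output, never which tree's points a sample receives.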
### Comment 2
<location> `xbooster/lgb_constructor.py:506-507` </location>
<code_context>

```diff
+         self.lgb_scorecard_with_points["Tree"] == t
+     ]
+     # Mapping dictionary instead of merge
+     mapping_dict = dict(zip(tree_points["Node"], tree_points["Points"]))
+     points_matrix[:, t] = np.vectorize(mapping_dict.get)(leaf_idx_values[:, t])
+ result = pd.DataFrame(
```

</code_context>
<issue_to_address>
**issue (bug_risk):** Dropping the previous rounding on leaf indices may introduce subtle key-mismatch issues.
Previously, `X_leaf_indices[[col]].round(4)` ensured leaf indices matched the `Node` keys despite floating point noise. Now, `mapping_dict.get(leaf_idx_values[:, t])` relies on raw float values, so slight precision differences could cause failed lookups and `NaN`s, altering results. If the rounding was added for this reason, consider restoring a consistent normalization step (e.g., rounding or casting to int) before using these values as dict keys.
</issue_to_address>
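If leaf/node identifiers can be floats, restoring a consistent normalization on both sides of the lookup avoids the silent `NaN` mismatches the comment describes. This is a sketch of one option; the `round(..., 4)` constant simply mirrors the old code's `.round(4)`.

```python
import numpy as np
import pandas as pd


def map_points_normalized(leaf_values: np.ndarray,
                          tree_points: pd.DataFrame) -> np.ndarray:
    """Round both the dict keys and the lookup values so floating-point
    noise cannot cause failed lookups (which would otherwise yield NaN)."""
    keys = np.round(tree_points["Node"].to_numpy(dtype=float), 4)
    mapping = dict(zip(keys, tree_points["Points"]))
    lookup = np.round(np.asarray(leaf_values, dtype=float), 4)
    # Using .get with an explicit NaN default makes missing nodes visible
    return np.vectorize(lambda v: mapping.get(v, np.nan))(lookup)
```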
Great work on these performance optimizations! All your changes from PRs #10, #11, #13, and #14 have been merged into the RC branch.

Thank you for these excellent contributions!
Description
This PR optimizes the scoring engine by replacing the inefficient iterative `pd.merge` operations with a vectorized dictionary mapping.
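The difference between the two approaches can be illustrated with a minimal, self-contained sketch on toy data. The timings here make no claims about the PR's actual numbers; the point is only that both paths produce identical scores while the dictionary mapping avoids the join machinery.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_samples, n_leaves = 100_000, 32
leaves = pd.Series(rng.integers(0, n_leaves, n_samples))
points = pd.DataFrame({"Node": np.arange(n_leaves),
                       "Points": rng.normal(size=n_leaves)})

# Old style: a left merge against the scorecard (one per tree in the loop)
t0 = time.perf_counter()
merged = pd.merge(leaves.to_frame("leaf"), points,
                  left_on="leaf", right_on="Node", how="left")["Points"]
t_merge = time.perf_counter() - t0

# New style: a plain dictionary mapping over the leaf values
t0 = time.perf_counter()
mapped = leaves.map(dict(zip(points["Node"], points["Points"])))
t_map = time.perf_counter() - t0

# Both paths must agree on every score
assert np.allclose(merged.to_numpy(), mapped.to_numpy())
```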
The previous implementation performed a `pd.merge` within a loop for every tree, which caused performance bottlenecks and memory overhead during the scoring of large datasets. The new approach pre-extracts leaf indices as a NumPy array and applies score mapping using `map`, resulting in a speedup.

Key Changes
Performance Benchmark
Measured using `%%timeit` on the example ipynb.

Summary by Sourcery
Optimize tree-to-points conversion in the LightGBM constructor for faster score computation by replacing per-tree DataFrame merges with a vectorized mapping approach.
Enhancements: