[Refactor] make xgb_construct fast #14
Conversation
Reviewer's Guide

Refactors xgb_constructor to compute leaf margins, binning statistics, and per-tree score conversion using vectorized NumPy/Pandas operations, simplifying some joins and updating tests to match the new data types.

Sequence diagram for vectorized _convert_tree_to_points scoring

```mermaid
sequenceDiagram
actor User
participant XGBConstructor
participant Booster as booster_
participant Scorecard as xgb_scorecard_with_points
participant Numpy as numpy
User->>XGBConstructor: _convert_tree_to_points(X)
XGBConstructor->>XGBConstructor: validate xgb_scorecard_with_points
XGBConstructor->>XGBConstructor: get_leafs(X, output_type)
XGBConstructor->>Booster: predict(xgb_features, pred_leaf=True)
Booster-->>XGBConstructor: leaf index matrix
XGBConstructor->>Numpy: initialize points_matrix
loop for each tree t
XGBConstructor->>Scorecard: filter rows where Tree == t
Scorecard-->>XGBConstructor: tree_points[Node, Points]
XGBConstructor->>XGBConstructor: build mapping_dict Node->Points
XGBConstructor->>Numpy: np.vectorize(mapping_dict.get) over leaf indices for tree t
Numpy-->>XGBConstructor: points for tree t
XGBConstructor->>Numpy: assign to points_matrix[:, t]
end
XGBConstructor->>XGBConstructor: build result DataFrame from points_matrix
XGBConstructor->>XGBConstructor: compute total Score as row sum
    XGBConstructor-->>User: result DataFrame with Score_i and Score columns
```
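In code, the flow above corresponds roughly to the following sketch, reconstructed from the diagram and the diff excerpts quoted in the review comments below; it is an illustrative method-body fragment, not the exact PR diff (the rounding and missing-node handling discussed later is deliberately omitted here):

```python
import numpy as np
import pandas as pd

# Sketch of the vectorized scoring loop described in the diagram.
X_leaf_weights = self.get_leafs(X, output_type="leaf_index")  # (n_samples, n_rounds)
n_samples, n_rounds = X_leaf_weights.shape
points_matrix = np.zeros((n_samples, n_rounds))
leaf_idx_values = X_leaf_weights.values

for t in range(n_rounds):
    tree_points = self.xgb_scorecard_with_points[
        self.xgb_scorecard_with_points["Tree"] == t
    ]
    # Node -> Points lookup replaces the old per-tree merge
    mapping_dict = dict(zip(tree_points["Node"], tree_points["Points"]))
    points_matrix[:, t] = np.vectorize(mapping_dict.get)(leaf_idx_values[:, t])

result = pd.DataFrame(points_matrix, columns=[f"Score_{t}" for t in range(n_rounds)])
result["Score"] = result.sum(axis=1)
```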
Class diagram for refactored xgb_constructor methods

```mermaid
classDiagram
class XGBConstructor {
- booster_
- xgb_scorecard
- xgb_scorecard_with_points
+ get_leafs(X, output_type)
+ extract_leaf_weights()
+ construct_scorecard()
+ _convert_tree_to_points(X)
+ predict_score(X)
}
XGBConstructor : get_leafs
XGBConstructor : - vectorized margin extraction into tree_results
XGBConstructor : - build df_leafs via np.column_stack
XGBConstructor : construct_scorecard
XGBConstructor : - predict tree_leaf_idx for all trees
XGBConstructor : - reshape to tree_leaf_idx_long via melt
XGBConstructor : - groupby Tree, Node to compute Events, Count
XGBConstructor : - derive NonEvents, EventRate
XGBConstructor : - merge with extract_leaf_weights on Tree, Node
XGBConstructor : _convert_tree_to_points
XGBConstructor : - validate xgb_scorecard_with_points not None
XGBConstructor : - get_leafs with output_type leaf_index
XGBConstructor : - build points_matrix with numpy
XGBConstructor : - per tree mapping Node->Points via dict
XGBConstructor : - construct result DataFrame with per tree scores and total Score
```
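A hedged sketch of what "vectorized margin extraction" plus `np.column_stack` could look like in `get_leafs`; the use of `trees_to_dataframe` and the `tree_{t}` column names here are assumptions for illustration, not necessarily the PR's implementation:

```python
import numpy as np
import pandas as pd
import xgboost as xgb

def get_leafs_sketch(booster: xgb.Booster, dmatrix: xgb.DMatrix) -> pd.DataFrame:
    """Illustrative: per-sample leaf margins for every tree, stacked column-wise."""
    leaf_idx = booster.predict(dmatrix, pred_leaf=True)  # (n_samples, n_rounds)
    trees = booster.trees_to_dataframe()
    leaves = trees[trees["Feature"] == "Leaf"]  # leaf rows carry the margin in "Gain"
    tree_results = []
    for t in range(leaf_idx.shape[1]):
        sub = leaves[leaves["Tree"] == t]
        lut = dict(zip(sub["Node"], sub["Gain"]))
        tree_results.append(pd.Series(leaf_idx[:, t]).map(lut).to_numpy())
    return pd.DataFrame(
        np.column_stack(tree_results),
        columns=[f"tree_{t}" for t in range(leaf_idx.shape[1])],
    )
```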
Flow diagram for vectorized construct_scorecard aggregation

```mermaid
flowchart LR
A[Start construct_scorecard] --> B[Get xgb_features_and_labels]
B --> C[Predict tree_leaf_idx for all trees]
C --> D[Validate tree_leaf_idx shape]
D --> E[Convert tree_leaf_idx to DataFrame]
E --> F[Melt to long format tree_leaf_idx_long with columns Tree and Node]
F --> G[Attach labels by tiling across trees]
G --> H[Group by Tree, Node and aggregate label to sum and count]
H --> I[Rename to Events and Count]
I --> J[Compute NonEvents = Count - Events]
J --> K[Compute EventRate = Events / Count]
K --> L[Select columns Tree, Node, Events, NonEvents, Count, EventRate as df_binning_table]
L --> M[Call extract_leaf_weights to get df_x_add_evidence]
M --> N[Merge df_x_add_evidence with df_binning_table on Tree and Node]
N --> O[Assign to xgb_scorecard]
O --> P[Sort xgb_scorecard by Tree and Node]
P --> Q[Reset index]
Q --> R[End construct_scorecard]
```
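Roughly in code, under the ordering assumption flagged in the second review below; `extract_leaf_weights_df` is a placeholder name standing in for the output of `extract_leaf_weights()`:

```python
import numpy as np
import pandas as pd

# Sketch: long-format leaf indices, then a single groupby for Events/Count.
tree_leaf_idx_long = pd.DataFrame(tree_leaf_idx).melt(var_name="Tree", value_name="Node")
# Labels repeat once per tree; assumes melt keeps column-major sample order.
tree_leaf_idx_long["label"] = np.tile(labels, tree_leaf_idx.shape[1])

df_binning_table = (
    tree_leaf_idx_long.groupby(["Tree", "Node"])["label"]
    .agg(Events="sum", Count="count")
    .reset_index()
)
df_binning_table["NonEvents"] = df_binning_table["Count"] - df_binning_table["Events"]
df_binning_table["EventRate"] = df_binning_table["Events"] / df_binning_table["Count"]

xgb_scorecard = (
    extract_leaf_weights_df.merge(df_binning_table, on=["Tree", "Node"], how="left")
    .sort_values(["Tree", "Node"])
    .reset_index(drop=True)
)
```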
Hey - I've found 2 issues, and left some high level feedback:

- In `_convert_tree_to_points`, the previous `if self.xgb_scorecard_with_points is not None` guard was removed; if this attribute can be unset when the method is called, this will now raise instead of behaving as before, so consider either restoring a guard or explicitly enforcing/validating that the scorecard is already built.
- The new `_convert_tree_to_points` loop assumes that tree indices in `self.xgb_scorecard_with_points['Tree']` are 0..n_rounds-1 and aligned with the column order in `X_leaf_weights`; if tree numbering ever changes (e.g., 1-based or non-contiguous), this will break, so consider deriving the tree index from the column labels as before or validating the alignment.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_convert_tree_to_points`, the previous `if self.xgb_scorecard_with_points is not None` guard was removed; if this attribute can be unset when the method is called, this will now raise instead of behaving as before, so consider either restoring a guard or explicitly enforcing/validating that the scorecard is already built.
- The new `_convert_tree_to_points` loop assumes that tree indices in `self.xgb_scorecard_with_points['Tree']` are 0..n_rounds-1 and aligned with the column order in `X_leaf_weights`; if tree numbering ever changes (e.g., 1-based or non-contiguous), this will break, so consider deriving the tree index from the column labels as before or validating the alignment.
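For the second comment, a minimal sketch of deriving the tree index from each column's label rather than its position; it keeps the original `np.vectorize` lookup style and assumes `X_leaf_weights` columns are labelled by tree number:

```python
import numpy as np

# Sketch: trust column labels, not positions, when aligning trees.
for col_pos, col_label in enumerate(X_leaf_weights.columns):
    t = int(col_label)  # assumption: columns are labelled by tree number
    tree_points = self.xgb_scorecard_with_points[
        self.xgb_scorecard_with_points["Tree"] == t
    ]
    if tree_points.empty:
        raise ValueError(f"No scorecard rows found for tree {t}")
    mapping_dict = dict(zip(tree_points["Node"], tree_points["Points"]))
    points_matrix[:, col_pos] = np.vectorize(mapping_dict.get)(leaf_idx_values[:, col_pos])
```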
## Individual Comments
### Comment 1
<location> `xbooster/xgb_constructor.py:543-546` </location>
<code_context>
- )
- result[f"Score_{tree_number}"] = merged_df["Points"]
- result = pd.concat([result, result.sum(axis=1).rename("Score")], axis=1)
+ n_samples, n_rounds = X_leaf_weights.shape
+ points_matrix = np.zeros((n_samples, n_rounds))
+ leaf_idx_values = X_leaf_weights.values
+ for t in range(n_rounds):
+ # Get points for this tree
+ tree_points = self.xgb_scorecard_with_points[
</code_context>
<issue_to_address>
**issue (bug_risk):** Handle case where `self.xgb_scorecard_with_points` is `None` to avoid runtime errors.
The previous version returned an empty result when `self.xgb_scorecard_with_points` was `None` because the loop was guarded by a check. The new version unconditionally indexes `self.xgb_scorecard_with_points`, causing a runtime error if it’s unset. If this method can be called before the scorecard is computed, add an explicit guard (early return or clear error) to keep the behavior intentional and predictable.
</issue_to_address>
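A minimal guard along the lines the comment suggests (sketch only; the error type and message wording are illustrative):

```python
# Sketch: fail fast with a clear error if the scorecard has not been built yet.
if self.xgb_scorecard_with_points is None:
    raise ValueError(
        "xgb_scorecard_with_points is not set; build the scorecard with points "
        "before calling _convert_tree_to_points()."
    )
```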
### Comment 2
<location> `xbooster/xgb_constructor.py:552-553` </location>
<code_context>
+ self.xgb_scorecard_with_points["Tree"] == t
+ ]
+ # Mapping dictionary instead of merge
+ mapping_dict = dict(zip(tree_points["Node"], tree_points["Points"]))
+ points_matrix[:, t] = np.vectorize(mapping_dict.get)(leaf_idx_values[:, t])
+
+ result = pd.DataFrame(
</code_context>
<issue_to_address>
**issue (bug_risk):** Handle missing node mappings and consider numeric precision when looking up points.
`mapping_dict.get` inside `np.vectorize` will return `None` for leaf indices missing from `tree_points['Node']`, which then gets assigned into a float `points_matrix` and can raise a `TypeError`. Also, the previous code rounded `X_leaf_weights[[col]]` before merging to avoid float-comparison issues for non-integer leaf indices; direct lookup on raw float values here may miss matches due to tiny numeric differences. Please (a) use an explicit default (e.g. `mapping_dict.get(x, 0.0)`) and (b) apply the same rounding/casting strategy to `leaf_idx_values` (or the dict keys) as in `construct_scorecard` so the mapping stays consistent.
</issue_to_address>
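A sketch combining both fixes from Comment 2, (a) an explicit default and (b) rounding/casting before lookup; it assumes the scorecard's `Node` values are integers, so the dict keys and the looked-up indices are cast the same way:

```python
import numpy as np
import pandas as pd

# (b) round/cast leaf indices so float noise cannot break exact dict lookups;
# build mapping_dict with int keys the same way for consistency.
mapping_dict = dict(zip(tree_points["Node"].astype(int), tree_points["Points"]))
nodes = np.rint(leaf_idx_values[:, t]).astype(int)
# (a) missing nodes fall back to an explicit 0.0 instead of None
points_matrix[:, t] = pd.Series(nodes).map(mapping_dict).fillna(0.0).to_numpy()
```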
Hey - I've found 3 issues, and left some high level feedback:

- In `_convert_tree_to_points`, you repeatedly filter `self.xgb_scorecard_with_points` by `Tree` and then apply `np.vectorize` inside the loop; consider pre-building a dict of `{tree: {node: points}}` mappings once and doing simple dictionary lookups to avoid repeated filtering and vectorization overhead.
- The new `construct_scorecard` logic relies on the implicit row ordering produced by `pd.DataFrame(tree_leaf_idx).melt(...)` and `np.tile(labels, tree_leaf_idx.shape[1])`; for robustness and readability, it would help to make the alignment explicit (e.g., by constructing the long-format `Tree`/`Node`/`label` arrays directly with NumPy reshaping or by adding a brief comment describing the assumed order).
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_convert_tree_to_points`, you repeatedly filter `self.xgb_scorecard_with_points` by `Tree` and then apply `np.vectorize` inside the loop; consider pre-building a dict of `{tree: {node: points}}` mappings once and doing simple dictionary lookups to avoid repeated filtering and vectorization overhead.
- The new `construct_scorecard` logic relies on the implicit row ordering produced by `pd.DataFrame(tree_leaf_idx).melt(...)` and `np.tile(labels, tree_leaf_idx.shape[1])`; for robustness and readability, it would help to make the alignment explicit (e.g., by constructing the long-format `Tree`/`Node`/`label` arrays directly with NumPy reshaping or by adding a brief comment describing the assumed order).
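For the first point, a sketch of pre-building all `{tree: {node: points}}` mappings in a single `groupby` pass (a method-body fragment under the same variable names as above, not the PR's actual code):

```python
import pandas as pd

# Sketch: one pass over the scorecard builds every tree's Node -> Points map.
tree_maps = {
    tree: dict(zip(grp["Node"], grp["Points"]))
    for tree, grp in self.xgb_scorecard_with_points.groupby("Tree")
}
for t in range(n_rounds):
    mapping = tree_maps.get(t, {})  # plain dict lookup inside the loop
    points_matrix[:, t] = pd.Series(leaf_idx_values[:, t]).map(mapping).to_numpy()
```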
## Individual Comments
### Comment 1
<location> `xbooster/xgb_constructor.py:555-557` </location>
<code_context>
+ self.xgb_scorecard_with_points["Tree"] == t
+ ]
+ # Mapping dictionary instead of merge
+ mapping_dict = dict(zip(tree_points["Node"], tree_points["Points"]))
+ points_matrix[:, t] = np.vectorize(mapping_dict.get)(leaf_idx_values[:, t])
+
+ result = pd.DataFrame(
</code_context>
<issue_to_address>
**suggestion (performance):** `np.vectorize` on a Python dict lookup may become a bottleneck
Each `np.vectorize(mapping_dict.get)` call does a Python dict lookup per element, which can be a major bottleneck for large `n_samples * n_rounds`. Since `leaf_idx_values[:, t]` is already a NumPy array, consider a more truly vectorized approach, e.g. prebuilding a dense NumPy array indexed by node id (if ids are small ints) or using `pd.Series(leaf_idx_values[:, t]).map(mapping_dict).to_numpy()` to leverage C-level loops instead of Python-level iteration.
```suggestion
# Mapping dictionary instead of merge
mapping_dict = dict(zip(tree_points["Node"], tree_points["Points"]))
# Use pandas vectorized mapping to avoid Python-level loops
points_matrix[:, t] = (
pd.Series(leaf_idx_values[:, t])
.map(mapping_dict)
.to_numpy()
)
```
</issue_to_address>
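The comment also mentions a dense lookup table; a sketch of that variant, assuming node ids are small non-negative integers:

```python
import numpy as np

# Dense LUT: index directly by node id, avoiding per-element Python calls.
node_ids = tree_points["Node"].to_numpy(dtype=int)
lut = np.zeros(node_ids.max() + 1, dtype=float)
lut[node_ids] = tree_points["Points"].to_numpy(dtype=float)
points_matrix[:, t] = lut[leaf_idx_values[:, t].astype(int)]
```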
### Comment 2
<location> `tests/test_xgb_regression.py:126-130` </location>
<code_context>
assert pd.api.types.is_integer_dtype(scorecard["Tree"])
assert pd.api.types.is_integer_dtype(scorecard["Node"])
- assert pd.api.types.is_float_dtype(scorecard["Count"])
+ assert pd.api.types.is_integer_dtype(scorecard["Count"])
assert pd.api.types.is_float_dtype(scorecard["EventRate"])
</code_context>
<issue_to_address>
**suggestion (testing):** Extend the structure test to assert dtypes for all newly-derived columns (`Events`, `NonEvents`, `EventRate`, `Count`).
Since `construct_scorecard` now derives `Events`, `NonEvents`, and `EventRate` together in one grouped aggregation, it would be good to assert their dtypes here as well:
- `Events` and `NonEvents` use integer dtypes
- `EventRate` is float (and optionally constrained to [0, 1])
That way, future changes to the aggregation or column typing will be caught by this test.
```suggestion
# Verify data types
assert pd.api.types.is_integer_dtype(scorecard["Tree"])
assert pd.api.types.is_integer_dtype(scorecard["Node"])
# Aggregated count-based columns should be integer
assert pd.api.types.is_integer_dtype(scorecard["Count"])
assert pd.api.types.is_integer_dtype(scorecard["Events"])
assert pd.api.types.is_integer_dtype(scorecard["NonEvents"])
# EventRate is a derived ratio -> float in [0, 1]
assert pd.api.types.is_float_dtype(scorecard["EventRate"])
assert (scorecard["EventRate"] >= 0.0).all()
assert (scorecard["EventRate"] <= 1.0).all()
```
</issue_to_address>
### Comment 3
<location> `tests/test_xgb_regression.py:130` </location>
<code_context>
def test_construct_scorecard_output_structure(self, sample_data, trained_model):
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test to verify that `Events + NonEvents == Count` across the entire scorecard.
Since the binning table is now built vectorially, this test should also assert the core accounting relationship:
```python
scorecard = trained_model.construct_scorecard()
assert ((scorecard["Events"] + scorecard["NonEvents"]) == scorecard["Count"]).all()
```
This will help detect regressions in the melt/groupby logic and column handling.
</issue_to_address>
Great work on these performance optimizations! All your changes from PRs #10, #11, #13, and #14 have been merged into the RC branch.

What's included:

Impact:

Thank you for these excellent contributions!
Description
This PR optimizes `xbooster/xgb_constructor.py`.

Key Changes & Benchmarks
Similar to #10, #11, #13
Summary by Sourcery
Optimize construction of XGBoost leaf outputs and scorecards for better performance and correctness.
Enhancements:
Tests: