
Conversation

@RektPunk (Contributor) commented Dec 21, 2025

Description

This PR optimizes `xbooster/xgb_constructor.py`.

Key Changes & Benchmarks

Similar to PRs #10, #11, and #13.

Summary by Sourcery

Optimize construction of XGBoost leaf outputs and scorecards for better performance and correctness.

Enhancements:

  • Vectorize leaf-prediction aggregation in `get_leafs` to avoid per-tree DataFrame concatenation.
  • Refactor scorecard construction to use a single grouped aggregation over all tree/leaf indices instead of looping per tree.
  • Replace per-column merge-based point calculation in `_convert_tree_to_points` with vectorized mapping using precomputed matrices.

Tests:

  • Update the regression test to assert that the scorecard `Count` column is stored as an integer type.

@sourcery-ai
Copy link

sourcery-ai bot commented Dec 21, 2025

Reviewer's Guide

Refactors xgb_constructor to compute leaf margins, binning statistics, and per-tree score conversion using vectorized NumPy/Pandas operations, simplifying some joins and updating tests to match the new data types.

Sequence diagram for vectorized _convert_tree_to_points scoring

```mermaid
sequenceDiagram
    actor User
    participant XGBConstructor
    participant Booster as booster_
    participant Scorecard as xgb_scorecard_with_points
    participant Numpy as numpy

    User->>XGBConstructor: _convert_tree_to_points(X)
    XGBConstructor->>XGBConstructor: validate xgb_scorecard_with_points
    XGBConstructor->>XGBConstructor: get_leafs(X, output_type)
    XGBConstructor->>Booster: predict(xgb_features, pred_leaf=True)
    Booster-->>XGBConstructor: leaf index matrix
    XGBConstructor->>Numpy: initialize points_matrix
    loop for each tree t
        XGBConstructor->>Scorecard: filter rows where Tree == t
        Scorecard-->>XGBConstructor: tree_points[Node, Points]
        XGBConstructor->>XGBConstructor: build mapping_dict Node->Points
        XGBConstructor->>Numpy: np.vectorize(mapping_dict.get) over leaf indices for tree t
        Numpy-->>XGBConstructor: points for tree t
        XGBConstructor->>Numpy: assign to points_matrix[:, t]
    end
    XGBConstructor->>XGBConstructor: build result DataFrame from points_matrix
    XGBConstructor->>XGBConstructor: compute total Score as row sum
    XGBConstructor-->>User: result DataFrame with Score_i and Score columns
```

Class diagram for refactored xgb_constructor methods

```mermaid
classDiagram
    class XGBConstructor {
        - booster_
        - xgb_scorecard
        - xgb_scorecard_with_points
        + get_leafs(X, output_type)
        + extract_leaf_weights()
        + construct_scorecard()
        + _convert_tree_to_points(X)
        + predict_score(X)
    }

    XGBConstructor : get_leafs
    XGBConstructor : - vectorized margin extraction into tree_results
    XGBConstructor : - build df_leafs via np.column_stack

    XGBConstructor : construct_scorecard
    XGBConstructor : - predict tree_leaf_idx for all trees
    XGBConstructor : - reshape to tree_leaf_idx_long via melt
    XGBConstructor : - groupby Tree, Node to compute Events, Count
    XGBConstructor : - derive NonEvents, EventRate
    XGBConstructor : - merge with extract_leaf_weights on Tree, Node

    XGBConstructor : _convert_tree_to_points
    XGBConstructor : - validate xgb_scorecard_with_points not None
    XGBConstructor : - get_leafs with output_type leaf_index
    XGBConstructor : - build points_matrix with numpy
    XGBConstructor : - per tree mapping Node->Points via dict
    XGBConstructor : - construct result DataFrame with per tree scores and total Score
```

Flow diagram for vectorized construct_scorecard aggregation

```mermaid
flowchart LR
    A[Start construct_scorecard] --> B[Get xgb_features_and_labels]
    B --> C[Predict tree_leaf_idx for all trees]
    C --> D[Validate tree_leaf_idx shape]
    D --> E[Convert tree_leaf_idx to DataFrame]
    E --> F[Melt to long format tree_leaf_idx_long with columns Tree and Node]
    F --> G[Attach labels by tiling across trees]
    G --> H[Group by Tree, Node and aggregate label to sum and count]
    H --> I[Rename to Events and Count]
    I --> J[Compute NonEvents = Count - Events]
    J --> K[Compute EventRate = Events / Count]
    K --> L[Select columns Tree, Node, Events, NonEvents, Count, EventRate as df_binning_table]
    L --> M[Call extract_leaf_weights to get df_x_add_evidence]
    M --> N[Merge df_x_add_evidence with df_binning_table on Tree and Node]
    N --> O[Assign to xgb_scorecard]
    O --> P[Sort xgb_scorecard by Tree and Node]
    P --> Q[Reset index]
    Q --> R[End construct_scorecard]
```

File-Level Changes

Vectorize leaf-margin extraction in `get_leafs` to avoid per-tree column assignment (a minimal sketch follows).
  • Accumulate per-tree margin arrays in a list instead of assigning into a DataFrame inside the loop.
  • Use `np.column_stack` to build the full leaf matrix in one operation.
  • Construct the final DataFrame with explicit index and column names instead of incremental concatenation.
Files: `xbooster/xgb_constructor.py`
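
A minimal, self-contained sketch of this pattern on synthetic data. The model setup, the variable names (`booster`, `dmatrix`, `tree_results`), and the margin-differencing step are illustrative assumptions, not the PR's exact code:

```python
import numpy as np
import pandas as pd
import xgboost as xgb

# Illustrative setup only: tiny synthetic dataset, not the project's data.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
y = rng.integers(0, 2, size=100)
n_rounds = 5
booster = xgb.train(
    {"objective": "binary:logistic"},
    xgb.DMatrix(X, label=y),
    num_boost_round=n_rounds,
)
dmatrix = xgb.DMatrix(X)

# Accumulate one margin array per tree in a list instead of assigning
# columns into a DataFrame inside the loop.
tree_results = []
prev = np.zeros(len(X))
for t in range(n_rounds):
    cum = booster.predict(dmatrix, iteration_range=(0, t + 1), output_margin=True)
    tree_results.append(cum - prev)  # margin contribution of tree t
    prev = cum

# One np.column_stack plus one DataFrame construction, with explicit
# index and column names, replaces incremental concatenation.
df_leafs = pd.DataFrame(
    np.column_stack(tree_results),
    index=X.index,
    columns=[f"tree_{t}" for t in range(n_rounds)],
)
print(df_leafs.shape)  # (100, 5)
```
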
Refactor `construct_scorecard` to aggregate binning statistics across all trees in a single grouped operation and simplify merging (sketched below).
  • Replace the per-tree loop that builds `df_binning_table` with a melt + groupby on the full leaf-index matrix.
  • Derive `Events`, `Count`, `NonEvents`, and `EventRate` via vectorized groupby aggregation on the long-format DataFrame.
  • Merge leaf weights with binning stats directly on `['Tree', 'Node']` and remove the column reordering/cleanup, relying on `extract_leaf_weights` to define the base schema.
  • Update the test to expect `Count` to be an integer dtype instead of float, matching the new aggregation output.
Files: `xbooster/xgb_constructor.py`, `tests/test_xgb_regression.py`
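
A hedged sketch of the melt + groupby aggregation; `tree_leaf_idx` and `labels` are tiny hand-made stand-ins for the `pred_leaf=True` matrix and the training labels:

```python
import numpy as np
import pandas as pd

# Stand-ins: in the real method, tree_leaf_idx comes from
# booster.predict(..., pred_leaf=True) and labels from the training data.
tree_leaf_idx = np.array([[1, 4], [1, 5], [2, 4], [2, 5]])
labels = np.array([1, 0, 1, 1])

# melt stacks tree columns one after another, so tiling the labels across
# trees keeps each (sample, tree) row aligned with its label.
long_df = pd.DataFrame(tree_leaf_idx).melt(var_name="Tree", value_name="Node")
long_df["label"] = np.tile(labels, tree_leaf_idx.shape[1])

# One grouped aggregation replaces the per-tree loop; Events and Count
# come out as integer dtypes, EventRate as float.
df_binning_table = (
    long_df.groupby(["Tree", "Node"])["label"]
    .agg(Events="sum", Count="count")
    .reset_index()
)
df_binning_table["NonEvents"] = df_binning_table["Count"] - df_binning_table["Events"]
df_binning_table["EventRate"] = df_binning_table["Events"] / df_binning_table["Count"]
print(df_binning_table)
```
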
Optimize `_convert_tree_to_points` by replacing per-tree DataFrame merges with NumPy-based mapping from leaf indices to points (sketched below).
  • Add a guard that raises a `ValueError` if `xgb_scorecard_with_points` has not been created prior to scoring.
  • Retrieve leaf indices once, then iterate over trees to build a dense `points_matrix` using per-tree dictionaries mapping `Node` to `Points` and `np.vectorize` lookups.
  • Return a DataFrame of per-tree scores plus a total `Score` column computed as the row-wise sum of `points_matrix` instead of concatenating in a loop.
Files: `xbooster/xgb_constructor.py`
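
And a sketch of the points lookup; `scorecard` and `leaf_idx_values` are hypothetical stand-ins for `xgb_scorecard_with_points` and the `pred_leaf` matrix:

```python
import numpy as np
import pandas as pd

# Stand-ins for xgb_scorecard_with_points and the pred_leaf matrix.
scorecard = pd.DataFrame({
    "Tree":   [0, 0, 1, 1],
    "Node":   [1, 2, 4, 5],
    "Points": [10.0, 20.0, 5.0, 8.0],
})
leaf_idx_values = np.array([[1, 4], [2, 5], [1, 5]])

n_samples, n_rounds = leaf_idx_values.shape
points_matrix = np.zeros((n_samples, n_rounds))
for t in range(n_rounds):
    tree_points = scorecard[scorecard["Tree"] == t]
    # Dictionary lookup per tree instead of a pd.merge inside the loop.
    mapping_dict = dict(zip(tree_points["Node"], tree_points["Points"]))
    points_matrix[:, t] = np.vectorize(mapping_dict.get)(leaf_idx_values[:, t])

result = pd.DataFrame(points_matrix, columns=[f"Score_{t}" for t in range(n_rounds)])
result["Score"] = points_matrix.sum(axis=1)  # row-wise total score
print(result)
```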


@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 2 issues and left some high-level feedback:

  • In `_convert_tree_to_points`, the previous `if self.xgb_scorecard_with_points is not None` guard was removed; if this attribute can be unset when the method is called, this will now raise instead of behaving as before, so consider either restoring a guard or explicitly enforcing/validating that the scorecard is already built.
  • The new `_convert_tree_to_points` loop assumes that tree indices in `self.xgb_scorecard_with_points['Tree']` are 0..n_rounds-1 and aligned with the column order in `X_leaf_weights`; if tree numbering ever changes (e.g., 1-based or non-contiguous), this will break, so consider deriving the tree index from the column labels as before or validating the alignment. A sketch of both checks follows this list.
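
One way both checks could look, as a standalone helper; the function name is hypothetical, and the attribute/column names simply follow the comments above:

```python
import numpy as np
import pandas as pd

def validate_scorecard_alignment(scorecard_with_points: pd.DataFrame | None, n_rounds: int) -> None:
    """Hypothetical helper sketching both suggested checks."""
    # Guard: fail loudly if the scorecard has not been built yet.
    if scorecard_with_points is None:
        raise ValueError(
            "xgb_scorecard_with_points has not been created; "
            "build the scorecard before calling _convert_tree_to_points."
        )
    # Alignment: verify tree indices are exactly 0..n_rounds-1 so that
    # column t of the leaf matrix really corresponds to Tree == t.
    trees = np.sort(scorecard_with_points["Tree"].unique())
    if not np.array_equal(trees, np.arange(n_rounds)):
        raise ValueError("Tree indices are not the contiguous range 0..n_rounds-1.")
```
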
Individual Comments

### Comment 1
Location: `xbooster/xgb_constructor.py:543-546`

Code context:

```diff
-                )
-                result[f"Score_{tree_number}"] = merged_df["Points"]
-        result = pd.concat([result, result.sum(axis=1).rename("Score")], axis=1)
+        n_samples, n_rounds = X_leaf_weights.shape
+        points_matrix = np.zeros((n_samples, n_rounds))
+        leaf_idx_values = X_leaf_weights.values
+        for t in range(n_rounds):
+            # Get points for this tree
+            tree_points = self.xgb_scorecard_with_points[
```

**issue (bug_risk):** Handle the case where `self.xgb_scorecard_with_points` is `None` to avoid runtime errors.

The previous version returned an empty result when `self.xgb_scorecard_with_points` was `None` because the loop was guarded by a check. The new version unconditionally indexes `self.xgb_scorecard_with_points`, causing a runtime error if it's unset. If this method can be called before the scorecard is computed, add an explicit guard (early return or clear error) to keep the behavior intentional and predictable.

### Comment 2
Location: `xbooster/xgb_constructor.py:552-553`

Code context:

```diff
+                self.xgb_scorecard_with_points["Tree"] == t
+            ]
+            # Mapping dictionary instead of merge
+            mapping_dict = dict(zip(tree_points["Node"], tree_points["Points"]))
+            points_matrix[:, t] = np.vectorize(mapping_dict.get)(leaf_idx_values[:, t])
+
+        result = pd.DataFrame(
```

**issue (bug_risk):** Handle missing node mappings and consider numeric precision when looking up points.

`mapping_dict.get` inside `np.vectorize` will return `None` for leaf indices missing from `tree_points['Node']`, which then gets assigned into a float `points_matrix` and can raise a `TypeError`. Also, the previous code rounded `X_leaf_weights[[col]]` before merging to avoid float-comparison issues for non-integer leaf indices; direct lookup on raw float values here may miss matches due to tiny numeric differences. Please (a) use an explicit default (e.g. `mapping_dict.get(x, 0.0)`) and (b) apply the same rounding/casting strategy to `leaf_idx_values` (or the dict keys) as in `construct_scorecard` so the mapping stays consistent.
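
A sketch of both fixes together on illustrative data; the integer rounding and the `0.0` default are assumptions taken from the comment above, not the PR's final code:

```python
import numpy as np
import pandas as pd

# Illustrative tree_points slice and one column of raw leaf indices.
tree_points = pd.DataFrame({"Node": [1.0, 2.0], "Points": [10.0, 20.0]})
leaf_col = np.array([1.0000001, 2.0, 3.0])  # node 3 has no mapping

# (b) Round/cast both the dict keys and the lookups the same way, so tiny
# float differences cannot break the match.
mapping_dict = dict(zip(np.round(tree_points["Node"]).astype(int), tree_points["Points"]))
leaf_ids = np.round(leaf_col).astype(int)

# (a) Explicit default instead of None for unseen nodes.
points = np.array([mapping_dict.get(v, 0.0) for v in leaf_ids])
print(points)  # [10. 20.  0.]
```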


@RektPunk RektPunk marked this pull request as draft December 21, 2025 08:29
@RektPunk RektPunk marked this pull request as ready for review December 21, 2025 13:31
@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 3 issues and left some high-level feedback:

  • In `_convert_tree_to_points`, you repeatedly filter `self.xgb_scorecard_with_points` by `Tree` and then apply `np.vectorize` inside the loop; consider pre-building a dict of `{tree: {node: points}}` mappings once and doing simple dictionary lookups to avoid repeated filtering and vectorization overhead (see the sketch after this list).
  • The new `construct_scorecard` logic relies on the implicit row ordering produced by `pd.DataFrame(tree_leaf_idx).melt(...)` and `np.tile(labels, tree_leaf_idx.shape[1])`; for robustness and readability, it would help to make the alignment explicit (e.g., by constructing the long-format `Tree`/`Node`/`label` arrays directly with NumPy reshaping or by adding a brief comment describing the assumed order).
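
A sketch of the suggested pre-built mapping, on an illustrative scorecard; column names follow the discussion above:

```python
import pandas as pd

# Illustrative stand-in for xgb_scorecard_with_points.
scorecard = pd.DataFrame({
    "Tree":   [0, 0, 1],
    "Node":   [1, 2, 4],
    "Points": [10.0, 20.0, 5.0],
})

# One pass over the scorecard instead of filtering it once per tree:
# {tree: {node: points}}, then plain dict lookups in the scoring loop.
points_by_tree = {
    tree: dict(zip(group["Node"], group["Points"]))
    for tree, group in scorecard.groupby("Tree")
}
print(points_by_tree[1][4])  # 5.0
```
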
Individual Comments

### Comment 1
Location: `xbooster/xgb_constructor.py:555-557`

Code context:

```diff
+                self.xgb_scorecard_with_points["Tree"] == t
+            ]
+            # Mapping dictionary instead of merge
+            mapping_dict = dict(zip(tree_points["Node"], tree_points["Points"]))
+            points_matrix[:, t] = np.vectorize(mapping_dict.get)(leaf_idx_values[:, t])
+
+        result = pd.DataFrame(
```

**suggestion (performance):** `np.vectorize` on a Python dict lookup may become a bottleneck.

Each `np.vectorize(mapping_dict.get)` call does a Python dict lookup per element, which can be a major bottleneck for large `n_samples * n_rounds`. Since `leaf_idx_values[:, t]` is already a NumPy array, consider a more truly vectorized approach, e.g. prebuilding a dense NumPy array indexed by node id (if ids are small ints) or using `pd.Series(leaf_idx_values[:, t]).map(mapping_dict).to_numpy()` to leverage C-level loops instead of Python-level iteration.

```suggestion
            # Mapping dictionary instead of merge
            mapping_dict = dict(zip(tree_points["Node"], tree_points["Points"]))
            # Use pandas vectorized mapping to avoid Python-level loops
            points_matrix[:, t] = (
                pd.Series(leaf_idx_values[:, t])
                .map(mapping_dict)
                .to_numpy()
            )
```

### Comment 2
Location: `tests/test_xgb_regression.py:126-130`

Code context:

```diff
         assert pd.api.types.is_integer_dtype(scorecard["Tree"])
         assert pd.api.types.is_integer_dtype(scorecard["Node"])
-        assert pd.api.types.is_float_dtype(scorecard["Count"])
+        assert pd.api.types.is_integer_dtype(scorecard["Count"])
         assert pd.api.types.is_float_dtype(scorecard["EventRate"])
```

**suggestion (testing):** Extend the structure test to assert dtypes for all newly-derived columns (`Events`, `NonEvents`, `EventRate`, `Count`).

Since `construct_scorecard` now derives `Events`, `NonEvents`, and `EventRate` together in one grouped aggregation, it would be good to assert their dtypes here as well:

- `Events` and `NonEvents` use integer dtypes
- `EventRate` is float (and optionally constrained to [0, 1])

That way, future changes to the aggregation or column typing will be caught by this test.

```suggestion
        # Verify data types
        assert pd.api.types.is_integer_dtype(scorecard["Tree"])
        assert pd.api.types.is_integer_dtype(scorecard["Node"])

        # Aggregated count-based columns should be integer
        assert pd.api.types.is_integer_dtype(scorecard["Count"])
        assert pd.api.types.is_integer_dtype(scorecard["Events"])
        assert pd.api.types.is_integer_dtype(scorecard["NonEvents"])

        # EventRate is a derived ratio -> float in [0, 1]
        assert pd.api.types.is_float_dtype(scorecard["EventRate"])
        assert (scorecard["EventRate"] >= 0.0).all()
        assert (scorecard["EventRate"] <= 1.0).all()
```

### Comment 3
Location: `tests/test_xgb_regression.py:130`

Code context:

```python
    def test_construct_scorecard_output_structure(self, sample_data, trained_model):
```

**suggestion (testing):** Add a test to verify that `Events + NonEvents == Count` across the entire scorecard.

Since the binning table is now built with vectorized operations, this test should also assert the core accounting relationship:

```python
scorecard = trained_model.construct_scorecard()
assert ((scorecard["Events"] + scorecard["NonEvents"]) == scorecard["Count"]).all()
```

This will help detect regressions in the melt/groupby logic and column handling.


xRiskLab added a commit that referenced this pull request Dec 21, 2025
- Update version to 0.2.8rc1
- Add CHANGELOG entry documenting:
  * SHAP integration features (alpha)
  * Performance improvements from @RektPunk (PRs #10, #11, #13, #14)
- Release candidate includes both SHAP features and performance optimizations
@xRiskLab (Owner) commented

Great work on these performance optimizations!

All your changes from PRs #10, #11, #13, and #14 have been merged into the release/v0.2.8rc1 release candidate branch. Your optimizations are now part of the upcoming v0.2.8 release.

What's included:

  • Performance optimizations from PRs #10, #11, #13, and #14
  • SHAP integration features (alpha)

Impact:

  • Replaced loop-based pd.concat() operations with vectorized melt() + groupby()
  • Replaced pd.merge() in loops with dictionary mapping using .map()
  • Significant performance improvements for models with many trees
  • All changes maintain backward compatibility and numerical equivalence

The RC branch (release/v0.2.8rc1) combines your performance improvements with the new SHAP integration features. Once testing is complete, this will be released as v0.2.8.

Thank you for these excellent contributions!
