Open
38 commits
35287df
Create nvforest::handle_t
hcho3 Apr 16, 2026
551bbd2
Update doc
hcho3 Apr 16, 2026
d8ad61b
Define nvForest handle in Python layer
hcho3 Apr 16, 2026
f042cc0
Remove pylibraft dependency
hcho3 Apr 16, 2026
97a8976
Add libraft-headers to Conda recipe
hcho3 Apr 16, 2026
ab44b92
Remove unused parameter
hcho3 Apr 16, 2026
b781fda
Revert "Add libraft-headers to Conda recipe"
hcho3 Apr 16, 2026
9000a73
Revert "Remove pylibraft dependency"
hcho3 Apr 17, 2026
cd9bece
Remove custom handle types entirely
hcho3 Apr 17, 2026
fb4355a
Add new interface to auto-instantiate raft resource
hcho3 Apr 17, 2026
f4c6fee
Add a note about re-using the RAFT handle
hcho3 Apr 18, 2026
59cfadc
Add a comment
hcho3 Apr 18, 2026
8efe44e
Improved formatting
hcho3 Apr 18, 2026
564e38c
Create device_resource wrapper in C++
hcho3 Apr 22, 2026
13ddf90
Define DeviceResources in Python pkg
hcho3 Apr 22, 2026
ed7627e
Update doc
hcho3 Apr 22, 2026
c27b29c
Merge remote-tracking branch 'origin/main' into revise_cpp_api
hcho3 May 2, 2026
65b83ce
Revert "Define DeviceResources in Python pkg"
hcho3 May 2, 2026
0c314cb
Revert "Create device_resource wrapper in C++"
hcho3 May 2, 2026
b3fd3be
Cache auto-instantiated RAFT resource
hcho3 May 2, 2026
a1f05d3
Remove the note about RAFT resource
hcho3 May 2, 2026
5fc3d6d
Add sync for auto-instantiated RAFT resource
hcho3 May 2, 2026
57d8845
Merge branch 'release/26.06' of https://github.com/rapidsai/nvforest …
csadorf May 19, 2026
3be33b9
Fix RAFT resource stream conversion
csadorf May 19, 2026
bb6616a
Move RAFT stream adapter into raft_proto
csadorf May 19, 2026
afd5cab
Update C++ README for device resources API
csadorf May 19, 2026
5af8cac
Remove stale Treelite importer handle include
csadorf May 19, 2026
a2111d2
Reject null prediction pointers
csadorf May 19, 2026
6e2cc42
Document device resources migration path
csadorf May 19, 2026
902c2ea
Clarify auto-resource test coverage
csadorf May 19, 2026
779f8c2
Clarify device resources usage docs
csadorf May 19, 2026
397835f
Document RAFT resource migration debt
csadorf May 19, 2026
5f344d0
Deprecate Python Handle alias
csadorf May 19, 2026
121c400
Build nvforest with C++20
csadorf May 19, 2026
10499f2
Reduce PR 102 to auto-instantiation
csadorf May 19, 2026
0b2ac1a
Document auto-instantiated inference path
csadorf May 19, 2026
b8b3ada
Merge remote-tracking branch 'origin/release/26.06' into revise_cpp_api
hcho3 May 21, 2026
b2f1d32
Add a full docstring for the new predict()
hcho3 May 21, 2026
33 changes: 25 additions & 8 deletions cpp/include/nvforest/README.md
@@ -106,10 +106,7 @@ cudaMalloc((void**)&output, num_rows * num_outputs * sizeof(float));

// Assuming that input is a float* pointing to data already located on-device

auto handle = raft_proto::handle_t{};

nvforest_model.predict(
handle,
output,
input,
num_rows,
@@ -119,11 +116,31 @@ nvforest_model.predict(
);
```

**handle**: To provide a unified interface on CPU and GPU, we introduce
`raft_proto::handle_t` as a wrapper for `raft::handle_t`. This is currently just a
placeholder in CPU-only builds, and using it does not require any CUDA
functionality. For GPU-enabled builds, you can construct a
`raft_proto_handle_t` directly from the `raft::handle_t` you wish to use.
This is the primary C++ inference path. nvForest creates the RAFT handle it
needs internally and synchronizes before returning.

Applications that already manage RAFT handles can pass one explicitly:

```cpp
auto raft_handle = raft::handle_t{};
auto handle = raft_proto::handle_t{raft_handle};

nvforest_model.predict(
handle,
output,
input,
num_rows,
raft_proto::device_type::gpu,
raft_proto::device_type::gpu,
4
);
```
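
To make the difference between the two call styles concrete, here is a library-free sketch of the pattern. `mock_handle`, `mock_model`, and `g_handles_constructed` are hypothetical stand-ins invented for this illustration; they are not the real `raft::handle_t`, `raft_proto::handle_t`, or `forest_model`, and the "inference" is a placeholder computation. The sketch assumes only what the diff states: the convenience overload creates a handle per call, delegates to the explicit-handle overload, and synchronizes before returning.

```cpp
#include <cassert>
#include <cstddef>

// Counts how many mock handles have been created (illustration only).
int g_handles_constructed = 0;

// Hypothetical stand-in for a RAFT-style handle; NOT the real raft::handle_t.
struct mock_handle {
  int sync_calls = 0;  // number of times synchronize() was called on this handle
  mock_handle() { ++g_handles_constructed; }
  void synchronize() { ++sync_calls; }
};

// Hypothetical stand-in for a model exposing both predict() overloads.
struct mock_model {
  // Explicit-handle overload: the caller owns the handle and decides when to
  // synchronize. The handle is unused here because the work is host-only.
  void predict(mock_handle& /*handle*/,
               float* output,
               float const* input,
               std::size_t num_rows)
  {
    assert(output != nullptr && input != nullptr);
    // Placeholder for real inference: double every input value.
    for (std::size_t i = 0; i < num_rows; ++i) { output[i] = 2.0f * input[i]; }
  }

  // Convenience overload: create a handle, delegate, then block until done.
  void predict(float* output, float const* input, std::size_t num_rows)
  {
    auto handle = mock_handle{};
    predict(handle, output, input, num_rows);
    handle.synchronize();
  }
};
```

In this sketch, two calls through the convenience overload construct two handles, while two calls that share one caller-owned handle construct only one and leave synchronization to the caller. That is the trade-off an application weighs when choosing between the two paths.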

**handle**: The explicit-handle overload accepts `raft_proto::handle_t`, a
wrapper for `raft::handle_t`. This is currently just a placeholder in CPU-only
builds, and using it does not require any CUDA functionality. For GPU-enabled
builds, construct a `raft_proto::handle_t` directly from the `raft::handle_t`
you wish to use.

**output**: Pointer to pre-allocated buffer where results should be
written. If the model has been loaded at single precision, this should be a
51 changes: 51 additions & 0 deletions cpp/include/nvforest/forest_model.hpp
@@ -308,6 +308,57 @@ struct forest_model {
predict(handle, out_buffer, in_buffer, predict_type, specified_chunk_size);
}

/**
* Perform inference on given input using an internally managed RAFT handle.
* This function is blocking and synchronizes the handle before returning.
*
* @param[out] output Pointer to the memory location where output should end
* up
* @param[in] input Pointer to the input data
* @param[in] num_rows Number of rows in input
* @param[in] out_mem_type The memory type (device/host) of the output
* buffer
* @param[in] in_mem_type The memory type (device/host) of the input buffer
* @param[in] predict_type Type of inference to perform. Defaults to summing
* the outputs of all trees and producing one output per row. If set to
* "per_tree", we will instead output all outputs of individual trees.
* If set to "leaf_id", we will output the integer ID of the leaf node
* for each tree.
* @param[in] specified_chunk_size Specifies the mini-batch size for
* processing. This has different meanings on CPU and GPU, but on GPU it
* corresponds to the number of rows evaluated per inference iteration
* on a single block. It can take on any power of 2 from 1 to 32, and
* runtime performance is quite sensitive to the value chosen. In general,
* larger batches benefit from higher values, but it is hard to predict the
* optimal value a priori. If omitted, a heuristic will be used to select a
* reasonable value. On CPU, this argument can generally just be omitted.
*/
template <typename io_t>
void predict(io_t* output,
io_t* input,
std::size_t num_rows,
raft_proto::device_type out_mem_type,
raft_proto::device_type in_mem_type,
infer_kind predict_type = infer_kind::default_kind,
std::optional<index_type> specified_chunk_size = std::nullopt)
{
#ifdef NVFOREST_ENABLE_GPU
auto raft_handle = raft::handle_t{};

Member

Question on raft::handle_t{} here: with no arguments, the default ctor uses rmm::cuda_stream_per_thread and a null stream pool, so handle.get_stream_pool_size() is 0 and get_usable_stream_count() is 1, right? That means the chunking/partitioning loop in the buffer overload above (around lines ~200‑248 of this same file) collapses to single-stream sequential copies whenever out_mem_type != in_mem_type or either differs from memory_type().

In other words, a C++ user who picks the simple path and passes host input + device output (or vice versa) silently gets less parallelism than the same call via the explicit-handle path with a populated pool. The Python API does cpu/cpu or gpu/gpu, but this is now a primary C++ API too, so could we give it a small default pool here so the convenience overload doesn't quietly perform worse on the heterogeneous memory case? And also point this out in the doc?

Contributor

Given that the stream pool size actually has an effect on the partition size and thus is a runtime optimization parameter, we probably need to run some benchmarks before providing a default choice.

auto handle = raft_proto::handle_t{raft_handle};
#else
auto handle = raft_proto::handle_t{};
#endif
predict(handle,
output,
input,
num_rows,
out_mem_type,
in_mem_type,
predict_type,
specified_chunk_size);
handle.synchronize();
}
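
The review thread on the default-constructed handle turns on how the number of usable streams determines the partition size. The partitioning loop itself is not part of this diff, so the following is only a sketch under an assumed scheme: contiguous row ranges whose size is the ceiling of `num_rows / stream_count`. `partition_rows` is a hypothetical helper written for this illustration, not an nvForest or RAFT function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of nvForest): split [0, num_rows) into at most
// `stream_count` contiguous half-open ranges [begin, end) of near-equal size.
std::vector<std::pair<std::size_t, std::size_t>> partition_rows(std::size_t num_rows,
                                                                std::size_t stream_count)
{
  assert(stream_count > 0);
  auto ranges = std::vector<std::pair<std::size_t, std::size_t>>{};
  if (num_rows == 0) { return ranges; }
  // Ceiling division: rows handled per stream under the assumed scheme.
  auto const partition_size = (num_rows + stream_count - 1) / stream_count;
  for (std::size_t begin = 0; begin < num_rows; begin += partition_size) {
    ranges.emplace_back(begin, std::min(begin + partition_size, num_rows));
  }
  return ranges;
}
```

With `stream_count == 1`, the case the reviewer describes for `raft::handle_t{}`, every row falls into a single range, so nothing in this sketch can overlap. With a larger count the same rows are split into several ranges, which is why the pool size acts as a tuning parameter that would need benchmarking before a default is chosen.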

private:
decision_forest_variant decision_forest_;
};
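
The docstring for `specified_chunk_size` says that on GPU the value can be any power of 2 from 1 to 32, and that a heuristic picks a value when the argument is omitted. A caller-side check for that constraint could look like the sketch below. Both function names are hypothetical, the fallback value stands in for the library's heuristic (which this diff does not show), and throwing on invalid input is an assumption of the sketch rather than documented nvForest behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>

// Hypothetical helper: true if `chunk_size` is a power of two in [1, 32].
constexpr bool is_valid_gpu_chunk_size(std::size_t chunk_size)
{
  // A power of two has exactly one bit set, so x & (x - 1) clears it to zero.
  return chunk_size >= 1 && chunk_size <= 32 && (chunk_size & (chunk_size - 1)) == 0;
}

// Hypothetical helper: return the requested chunk size if one was given and is
// valid; otherwise fall back to `heuristic_default`, a placeholder for whatever
// value a heuristic would select.
std::size_t resolve_chunk_size(std::optional<std::size_t> requested,
                               std::size_t heuristic_default)
{
  if (!requested.has_value()) { return heuristic_default; }
  if (!is_valid_gpu_chunk_size(*requested)) {
    throw std::invalid_argument("chunk size must be a power of 2 between 1 and 32");
  }
  return *requested;
}
```

Validating before the call keeps a mistyped value such as 3 or 64 from reaching `predict()`. Passing `std::nullopt` mirrors omitting the argument and defers to the fallback.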
9 changes: 1 addition & 8 deletions cpp/tests/treelite_importer.cpp
@@ -408,17 +408,10 @@ TEST(TreeliteImporter, DegenerateTreeWithVectorLeaf)
auto nvforest_model = import_from_treelite_model(*tl_model, tree_layout::breadth_first);
ASSERT_TRUE(nvforest_model.has_vector_leaves());

#ifdef NVFOREST_ENABLE_GPU
auto raft_handle = raft::handle_t{};
auto handle = raft_proto::handle_t{raft_handle};
#else
auto handle = raft_proto::handle_t{};
#endif
auto X = std::vector<double>{0.0};
auto preds = std::vector<double>(2, 0.0);
auto expected_preds = std::vector<double>{0.5, 0.5};
nvforest_model.predict(handle,
preds.data(),
nvforest_model.predict(preds.data(),
X.data(),
1,
raft_proto::device_type::cpu,
17 changes: 13 additions & 4 deletions docs/source/getting_started.rst
@@ -193,6 +193,19 @@ Once the tree model is available as a Treelite object, pass it to the

Now that the tree model is fully imported into nvForest, let's run inference:

.. code-block:: cpp

// Assumption:
// * Both output and input are in the GPU memory.
// * The input buffer should be of dimension (num_rows, num_features)
// * The output buffer should be of dimension (num_rows, fm.num_outputs())
fm.predict(output, input, num_rows,
raft_proto::device_type::gpu, raft_proto::device_type::gpu,
nvforest::infer_kind::default_kind);

Applications that want more control over handle ownership, stream reuse, or
synchronization can pass a RAFT handle explicitly like this:

.. code-block:: cpp

#include <raft/core/handle.hpp>
@@ -201,10 +214,6 @@ Now that the tree model is fully imported into nvForest, let's run inference:
raft::handle_t raft_handle{};
raft_proto::handle_t handle{raft_handle};

// Assumption:
// * Both output and input are in the GPU memory.
// * The input buffer should be of dimension (num_rows, num_features)
// * The output buffer should be of dimension (num_rows, fm.num_outputs())
fm.predict(handle, output, input, num_rows,
raft_proto::device_type::gpu, raft_proto::device_type::gpu,
nvforest::infer_kind::default_kind);