
Conversation

@ggerganov
Member

@ggerganov ggerganov commented Jan 2, 2026

target #18547
alt #18549

  • Add GGML_TENSOR_FLAG_COMPUTE flag indicating that a tensor in the graph must be computed
  • Add new ggml_build_forward_select() call:
    GGML_API struct ggml_tensor * ggml_build_forward_select(
            struct ggml_cgraph  * cgraph,
            struct ggml_tensor ** tensors,
            int                   n_tensors,
            int                   idx);

All provided tensors are built forward into the graph. Only tensors[idx] and its ancestors are marked for computation via the new flag.

This new logic allows us to construct graphs that compute different things while sharing the same topology. This is needed to avoid unwanted graph reallocations (#17617).
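For illustration, a minimal usage sketch (the branch-builder helpers and the use_tokens variable below are hypothetical, not part of this PR):

    // build both alternative outputs into the same graph so that its
    // topology is identical regardless of which input type is used
    struct ggml_tensor * out_tok  = build_branch_tokens(ctx); // hypothetical: token-id input path
    struct ggml_tensor * out_embd = build_branch_embd(ctx);   // hypothetical: embedding input path

    struct ggml_tensor * outs[2] = { out_tok, out_embd };

    // only outs[idx] and its ancestors receive GGML_TENSOR_FLAG_COMPUTE;
    // the other branch stays in the graph but is skipped at compute time
    const int idx = use_tokens ? 0 : 1;

    struct ggml_tensor * res = ggml_build_forward_select(gf, outs, 2, idx);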

TODOs:

@github-actions github-actions bot added the model, Nvidia GPU, Vulkan, ggml, SYCL, Apple Metal, Ascend NPU, OpenCL, and IBM zDNN labels on Jan 2, 2026
@jeffbolznv
Collaborator

Just want to make sure I understand how this is used - it would still be two separate graphs, they'd just be able to reuse allocations (i.e. ggml-alloc would decide they match)?

I think ggml_can_fuse and ggml_can_fuse_subgroup would need to be updated to make sure all nodes are computed. And any backend-specific fusion logic.

@ggerganov
Member Author

Just want to make sure I understand how this is used - it would still be two separate graphs, they'd just be able to reuse allocations (i.e. ggml-alloc would decide they match)?

Yes. For example, the graph when the input is token ids (batch.token != null) and the graph when the input is embedding vectors (batch.embd != null) are still different, but with this extra logic the scheduler will not need to reallocate them because the set of nodes stays the same. Only a different subset of the nodes is marked for computation.

I think ggml_can_fuse and ggml_can_fuse_subgroup would need to be updated to make sure all nodes are computed. And any backend-specific fusion logic.

I'm not yet sure that is really necessary - so far I can't think of a failure case. Note that the GGML_TENSOR_FLAG_COMPUTE flag is set only through ggml_build_forward_select().
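If such a guard did turn out to be necessary, one possible shape for it (a sketch, not code from this PR) would be an early-out over the candidate node sequence in the fusion check:

    // hypothetical guard in a fusion check: refuse to fuse a sequence of
    // nodes if any of them is not marked for computation
    for (int i = 0; i < num_ops; ++i) {
        const struct ggml_tensor * node = cgraph->nodes[node_idx + i];
        if ((node->flags & GGML_TENSOR_FLAG_COMPUTE) == 0) {
            return false;
        }
    }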

Collaborator

@max-krasnyansky max-krasnyansky left a comment

Looks good to me.

@am17an
Collaborator

am17an commented Jan 3, 2026

We would need to check how this behaves with CUDA graphs, since the computation is inherently changing.

@taronaeo
Collaborator

taronaeo commented Jan 3, 2026

cc: @AlekseiNikiforovIBM @Andreas-Krebbel

Give us a week or so to check on this :)

@ggerganov ggerganov force-pushed the gg/graph-avoid-branches-3 branch from e7b6c35 to da5d289 on January 3, 2026 17:49
@ggerganov ggerganov force-pushed the gg/graph-avoid-branches-3 branch from da5d289 to 9922d3a on January 4, 2026 14:46
@ggerganov ggerganov force-pushed the gg/graph-avoid-branches-3 branch from 9922d3a to 9f8a79c on January 4, 2026 14:56
@am17an
Collaborator

am17an commented Jan 5, 2026

For CUDA graphs, I think adding a check for the flags in ggml_graph_node_has_matching_properties should be enough. This would trigger an update to the graph.
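As a sketch of that idea (assuming a flags field is added to the cached per-node properties; this is not code from the PR):

    // compare the stored flags against the current node's flags; a change
    // in GGML_TENSOR_FLAG_COMPUTE would make the properties mismatch and
    // trigger a CUDA graph update
    if (node->flags != graph_node_properties->flags) { // assumed new field
        return false;
    }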

@AlekseiNikiforovIBM
Contributor

cc: @AlekseiNikiforovIBM @Andreas-Krebbel

Give us a week or so to check on this :)

LGTM

Collaborator

@taronaeo taronaeo left a comment

Ack for IBM zDNN backend :)

    }

    if ((cgraph->nodes[i]->flags & GGML_TENSOR_FLAG_COMPUTE) == 0) {
        continue;
    }
Collaborator

If the last node or nodes are not flagged, the loop would end without the final command submission. This would need some way to ensure a final submit if submitted_nodes > 0.
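In other words, something along these lines after the node loop (a sketch; the submit call is an illustrative stand-in, not the backend's exact API):

    // hypothetical trailing flush: if the loop ended on skipped nodes,
    // make sure any recorded-but-unsubmitted work still gets submitted
    if (submitted_nodes > 0) {
        submit_pending_commands(ctx); // illustrative stand-in for the Vulkan backend submit
    }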

Collaborator

@reeselevine reeselevine left a comment

The WebGPU update looks good to me. We always do a final submission if commands > 0, so there shouldn't be a problem like the one noted for the Vulkan backend above.
