[ET-VK] 4/n Split dispatches between multiple command buffers. Removing flush function call from submit_current_cmd_and_wait. #12525


Merged
merged 19 commits into main from gh/trivedivivek/124/head on Jul 17, 2025

Conversation

trivedivivek (Contributor) commented Jul 15, 2025

SS-JIA and others added 9 commits July 14, 2025 08:33
## Changes

* Update the `gen_vulkan_spv.py` script to account for changes to included files when deciding whether to perform a re-compilation


## Additional Changes

Also applied some misc fixes:

* Allow the Python preprocessor to handle unescaped `\` characters used to extend macros across multiple lines. Fixed this by replacing them with escaped `\\` before running the preprocessor.
* Fixed an issue where a build would unexpectedly pass when re-running a failing build multiple times; the caching mechanism would reuse an old cached build artifact since no changes were detected. Fixed this by removing any existing cached artifacts on an unsuccessful build.

## Context

The `gen_vulkan_spv.py` script, which handles GLSL shader codegen and GLSL -> SPIR-V compilation for the ExecuTorch Vulkan backend, has a caching mechanism that recompiles only modified files, for developer efficiency. However, this mechanism currently doesn't consider whether included files have changed. Therefore, if an include file is modified without also modifying the source file that uses it, the changes to the include file will not be picked up.

Differential Revision: [D78275585](https://our.internmc.facebook.com/intern/diff/D78275585/)

[ghstack-poisoned]
## Changes

* Introduce `run_prepack()` API which combines the functionality of `encode_prepack()` and `prepack()`, but submits prepacking shaders incrementally rather than all at once.
* Introduce graph config options to control command buffer submission behaviour during prepacking.

Note that the current default values for the prepack submission thresholds were determined through experimentation. I will leave determining optimal values for specific devices as a later exercise. The goal of this diff is simply to introduce this mechanism to fix the Llama model loading crash on the Samsung S24 (described below). A rough sketch of the mechanism is shown below.
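To illustrate the idea, here is a minimal sketch of what incremental prepack submission could look like. This is not the actual ExecuTorch implementation; `staging_nbytes()`, `submit_current_cmd()`, and the thresholding logic are assumed names used for illustration only.

```cpp
#include <cstddef>
#include <vector>

// NOTE: PrepackNode and Context are stand-ins for the corresponding
// ExecuTorch Vulkan types; the thresholding logic below is illustrative.
void run_prepack_sketch(
    std::vector<PrepackNode>& nodes,
    Context& context,
    const size_t submit_threshold_nbytes) {
  size_t staged_nbytes = 0;
  for (PrepackNode& node : nodes) {
    node.encode(&context);                  // record the packing dispatch
    staged_nbytes += node.staging_nbytes(); // assumed accessor
    if (staged_nbytes >= submit_threshold_nbytes) {
      // Submit what has been encoded so far, so the GPU can start
      // working while the CPU keeps encoding the remaining nodes.
      context.submit_current_cmd();
      staged_nbytes = 0;
    }
  }
  // Submit whatever remains and wait for all prepacking to finish.
  context.submit_current_cmd_and_wait();
}
```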

## Context

Currently, ET-VK will encode all prepacking shaders, and then perform prepacking by submitting only one command buffer.

However, this approach has some drawbacks:

* CPU/GPU parallelism is decreased, since the command buffer is submitted only after all commands have been encoded.
* There can be performance issues at the Vulkan API level when processing a single "large" command buffer.

Splitting prepacking across multiple command buffers improves performance by avoiding both of the aforementioned issues.

## Llama 3.2 1B crash on Samsung S24

I have also noticed that when running large models (e.g. Llama 3.2 1B) on the Samsung S24 with ET-VK, the device's display crashes (the screen goes black and becomes unresponsive), and sometimes the device shuts down entirely.

Fortunately, this change also fixes this behaviour, in addition to providing a significant performance boost to model load time for Llama models (from 9s to 3s).

## Performance Impact

* Improves model load time, especially on larger models.

## Future Work

* Deprecate the `encode_prepack()` + `prepack()` pattern in favor of the `run_prepack()` pattern

Differential Revision: [D78275586](https://our.internmc.facebook.com/intern/diff/D78275586/)

[ghstack-poisoned]
…maphore support in command buffer.

execute stage
## Context

The following diffs aim to improve the performance of the ExecuTorch Vulkan backend by adding a mechanism to issue multiple command buffers in the prepack and execute functions, so that GPU work is issued while the CPU is still encoding new work.

## This Diff

### Summary

This diff is the first in a series of diffs that aim to split dispatches between multiple command buffers and add semaphore support in the command buffer.

The changes in this diff include (a sketch of the resulting class shape follows the list):

*   Adding a `VkSemaphore` parameter to the `CommandBuffer` constructor in `vk_api/Command.h` and `vk_api/Command.cpp`, to support signaling when the command buffer has completed execution.
*   Updating the `CommandBuffer` construction in `api/Context.cpp` to pass the new `VkSemaphore` parameter.
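As a hedged sketch of the shape of this change (member names and the exact signature are assumptions based on the description above, not the actual code):

```cpp
#include <vulkan/vulkan.h>

// Sketch: CommandBuffer carries a semaphore that the submission path
// can signal once the buffer has finished executing on the GPU.
class CommandBuffer {
 public:
  CommandBuffer(
      VkCommandBuffer handle,
      VkSemaphore signal_semaphore) // new parameter added by this diff
      : handle_(handle), signal_semaphore_(signal_semaphore) {}

  // Exposes the semaphore so the submission path can signal it.
  VkSemaphore get_signal_semaphore() const {
    return signal_semaphore_;
  }

 private:
  VkCommandBuffer handle_;
  VkSemaphore signal_semaphore_; // VK_NULL_HANDLE when signaling is unused
};
```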

Differential Revision: [D78282194](https://our.internmc.facebook.com/intern/diff/D78282194/)

[ghstack-poisoned]
…maphore support to Adapter::submit_cmd.

This diff adds semaphore support to `Adapter::submit_cmd` by modifying its function signature to include `wait_semaphore` and `signal_semaphore` parameters. The updated function now takes into account the new semaphore parameters and correctly configures the pipeline stages using `VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT`.
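As a rough illustration of the updated submission logic (the surrounding function signature and the `VK_NULL_HANDLE` guards are assumptions; only the Vulkan structures and calls themselves are standard API):

```cpp
#include <vulkan/vulkan.h>

// Sketch of a submit path that optionally waits on one semaphore and
// signals another; names are illustrative, not the actual ET-VK code.
void submit_cmd_sketch(
    VkQueue queue,
    VkCommandBuffer cmd,
    VkFence fence,
    VkSemaphore wait_semaphore,
    VkSemaphore signal_semaphore) {
  // A compute-only workload waits at the compute shader stage.
  const VkPipelineStageFlags wait_stage =
      VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;

  VkSubmitInfo submit_info{};
  submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  if (wait_semaphore != VK_NULL_HANDLE) {
    submit_info.waitSemaphoreCount = 1;
    submit_info.pWaitSemaphores = &wait_semaphore;
    submit_info.pWaitDstStageMask = &wait_stage;
  }
  if (signal_semaphore != VK_NULL_HANDLE) {
    submit_info.signalSemaphoreCount = 1;
    submit_info.pSignalSemaphores = &signal_semaphore;
  }
  submit_info.commandBufferCount = 1;
  submit_info.pCommandBuffers = &cmd;

  vkQueueSubmit(queue, 1, &submit_info, fence);
}
```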

Differential Revision: [D78360041](https://our.internmc.facebook.com/intern/diff/D78360041/)

[ghstack-poisoned]
…previous semaphore in context.

This diff is the third part of a series of diffs aiming to split dispatches between multiple command buffers. In this diff, we are tracking the previous semaphore in the context.

A new member variable `prev_semaphore_` was added to the `Context` class. This variable is used to store the semaphore of the previously submitted command buffer.
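To illustrate how chaining submissions with a tracked semaphore might work (a sketch under assumed names; `create_semaphore` and the `submit_cmd` signature are hypothetical helpers, not the actual implementation):

```cpp
#include <vulkan/vulkan.h>

// Sketch: each submission waits on the semaphore signaled by the
// previous one, serializing command buffers on the GPU timeline.
class Context {
 public:
  void submit_current_cmd(VkFence fence = VK_NULL_HANDLE) {
    // Semaphore the current buffer will signal on completion.
    VkSemaphore signal_semaphore = create_semaphore(); // assumed helper

    // Wait on the previous submission (if any), signal the new one.
    submit_cmd(queue_, cmd_, fence, prev_semaphore_, signal_semaphore);

    // The next submission will wait on this one.
    prev_semaphore_ = signal_semaphore;
  }

 private:
  VkSemaphore create_semaphore(); // assumed helper
  void submit_cmd(
      VkQueue, VkCommandBuffer, VkFence, VkSemaphore, VkSemaphore);

  VkQueue queue_{};
  VkCommandBuffer cmd_{};
  VkSemaphore prev_semaphore_{VK_NULL_HANDLE};
};
```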

Differential Revision: [D78360037](https://our.internmc.facebook.com/intern/diff/D78360037/)

[ghstack-poisoned]
…ng flush function call from submit_current_cmd_and_wait.

This diff removes the `flush` function call from `submit_current_cmd_and_wait` and instead adds it to the `encode_prepack` method.

This change allows finer control over when `flush` is called, which is needed in follow-up diffs.
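In effect, the change looks roughly like the following before/after sketch (the function bodies here are assumptions inferred from the description, not the exact code):

```cpp
// Before this diff: the synchronous submit helper also flushed the context.
void Context::submit_current_cmd_and_wait() {
  submit_current_cmd_and_block(); // assumed internal submit-and-wait step
  flush();                        // removed by this diff
}

// After this diff: flushing is the caller's responsibility; encode_prepack
// flushes once its prepack submissions are done.
void ComputeGraph::encode_prepack() {
  // ... encode and submit prepacking dispatches ...
  context()->flush(); // moved here from submit_current_cmd_and_wait
}
```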

Differential Revision: [D78360036](https://our.internmc.facebook.com/intern/diff/D78360036/)

[ghstack-poisoned]
pytorch-bot bot commented Jul 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12525

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit fce21fe with merge base 0012ffa:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Jul 15, 2025
trivedivivek added a commit that referenced this pull request Jul 15, 2025
…ng flush function call from submit_current_cmd_and_wait.

This diff removes the `flush` function call from `submit_current_cmd_and_wait` and instead adds it to the `encode_prepack` method.

This change allows finer control over when `flush` is called, which is needed in follow-up diffs.

Differential Revision: [D78360036](https://our.internmc.facebook.com/intern/diff/D78360036/)

ghstack-source-id: 296431501
Pull Request resolved: #12525
@facebook-github-bot
This pull request was exported from Phabricator. Differential Revision: D78360036

trivedivivek added a commit that referenced this pull request Jul 17, 2025
…ng flush function call from submit_current_cmd_and_wait.

Pull Request resolved: #12525

This diff removes the `flush` function call from `submit_current_cmd_and_wait` and instead adds it to the `encode_prepack` method.

This change allows finer control over when `flush` is called, which is needed in follow-up diffs.
ghstack-source-id: 296667692
@exported-using-ghexport

Differential Revision: [D78360036](https://our.internmc.facebook.com/intern/diff/D78360036/)

trivedivivek added a commit that referenced this pull request Jul 17, 2025
…ng flush function call from submit_current_cmd_and_wait.

Pull Request resolved: #12525

This diff removes the `flush` function call from `submit_current_cmd_and_wait` and instead adds it to the `encode_prepack` method.

This change allows finer control over when `flush` is called, which is needed in follow-up diffs.
ghstack-source-id: 296712817
@exported-using-ghexport

Differential Revision: [D78360036](https://our.internmc.facebook.com/intern/diff/D78360036/)

SS-JIA added a commit that referenced this pull request Jul 17, 2025
…maphore support in command buffer. (#12519)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #12525
* #12523
* #12522
* __->__ #12519

Differential Revision: [D78282194](https://our.internmc.facebook.com/intern/diff/D78282194/)

Co-authored-by: Stephen Jia <[email protected]>
SS-JIA added a commit that referenced this pull request Jul 17, 2025
…maphore support to Adapter::submit_cmd. (#12522)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #12525
* #12523
* __->__ #12522
* #12519

Differential Revision: [D78360041](https://our.internmc.facebook.com/intern/diff/D78360041/)

Co-authored-by: Stephen Jia <[email protected]>
SS-JIA added a commit that referenced this pull request Jul 17, 2025
…previous semaphore in context. (#12523)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #12525
* __->__ #12523
* #12522
* #12519

Differential Revision: [D78360037](https://our.internmc.facebook.com/intern/diff/D78360037/)

Co-authored-by: Stephen Jia <[email protected]>
@SS-JIA changed the base branch from gh/trivedivivek/124/base to main July 17, 2025 05:36
@SS-JIA merged commit 0e433e8 into main Jul 17, 2025
99 of 103 checks passed
@SS-JIA deleted the gh/trivedivivek/124/head branch July 17, 2025 05:36
Labels
CLA Signed · fb-exported · release notes: vulkan (Changes to the Vulkan backend delegate)