Conversation

Gasoonjia (Contributor)

This PR introduces export support for the CUDA delegate using the AOTI library. It also creates a CI test for verification.

pytorch-bot commented Sep 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14438

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 6 Cancelled Jobs, 2 Unrelated Failures

As of commit 679b0e0 with merge base a548635:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label on Sep 19, 2025
Comment on lines +40 to +41
print("STDOUT:")
print(result.stdout)
Contributor

Please use `logging`; it provides different levels of logging.
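
A minimal sketch of the suggestion (the subprocess call here is a stand-in for the actual test command, not the PR's code):

```python
import logging
import subprocess

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Hypothetical command standing in for the export step under test.
result = subprocess.run(["echo", "hello"], capture_output=True, text=True)
logger.info("STDOUT: %s", result.stdout)   # routine output at INFO
logger.debug("STDERR: %s", result.stderr)  # verbose detail only at DEBUG
```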



@final
class CudaPartitioner(Partitioner):
Contributor

So this is basically just skeleton code? I think we could skip having an initial partitioner implementation entirely and only allow the other to_backend API for now? Is that how it works?

Contributor Author

We should keep it; we need this partitioner to skip ET operator decomposition for all ops.
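
For context, a rough sketch (not the PR's actual code) of how a Partitioner can report ops that must not be decomposed, assuming ExecuTorch's ops_to_not_decompose interface:

```python
from typing import Callable, List, Optional, Tuple

import torch
from executorch.exir.backend.partitioner import Partitioner
from torch.export.exported_program import ExportedProgram


class CudaPartitionerSketch(Partitioner):
    """Illustrative only: claim every ATen op so nothing is decomposed."""

    def partition(self, exported_program: ExportedProgram):
        # Tag the whole graph for the CUDA backend (details elided here).
        raise NotImplementedError

    def ops_to_not_decompose(
        self, ep: ExportedProgram
    ) -> Tuple[List[torch._ops.OpOverload], Optional[Callable]]:
        # Report every call_function op as "do not decompose" so AOTI
        # later sees the original ATen ops intact.
        ops = [
            node.target
            for node in ep.graph.nodes
            if node.op == "call_function"
            and isinstance(node.target, torch._ops.OpOverload)
        ]
        return ops, None
```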

fake_edge_program = copy.deepcopy(edge_program)
partitioner_result = partitioner_instance(fake_edge_program)
tagged_exported_program = partitioner_result.tagged_exported_program
tagged_exported_program.example_inputs = edge_program.example_inputs
JacobSzwejbka (Contributor) commented Sep 19, 2025

Are we serializing the example inputs?

Contributor Author

By serializing, do you mean exir.save, or to the .pte?

if os.path.isfile(file):
    os.remove(file)
    print(f"Removed file: {file}")
except Exception as e:
Contributor

These should be fatal exceptions, right? The test should fail?
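
A sketch of the suggested change: drop the try/except so any cleanup failure propagates and fails the test (the helper name is illustrative):

```python
import os


def remove_file(file: str) -> None:
    # No try/except: an OSError here propagates up and fails the test
    # run instead of being logged and silently ignored.
    if os.path.isfile(file):
        os.remove(file)
        print(f"Removed file: {file}")
```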

*kernel_metadata.json
*kernel.cpp
*wrapper_metadata.json
*wrapper.cpp
Contributor

Does wrapper.cpp stick around? AOTI compilation doesn't clean it up after generating the .so and .cubin?

Contributor Author

Right now it will generate lots of files, including the .cubin, wrapper.cpp, etc., and they will be cleaned up automatically.



# fallback operators that exist in the ET namespace
supported_fallback_kernels: Dict[str, Any] = {}
Contributor

@larryliu0820 I feel like I keep hearing conflicting information. Is AOTI falling back to ET, or is it a graph break? A graph break sounds more natural in the ET ecosystem to me.

Contributor Author

Here we will leverage the AOTI fallback and not break the graph when handling missing operators, which is more natural for AOTI and convenient for us.

Contributor

> Graph break sounds more natural in the ET ecosystem to me

Yes, I'm thinking we want to have some CUDA kernels and let it graph-break. We can also reuse those kernels in the AOTI fallback.


output_path = os.path.join(os.getcwd(), "aoti.so")

options: dict[str, typing.Any] = {
Contributor

Can you add some documentation on what these options are or where they are defined?

debug_compile and embed_kernel_binary are the two non-obvious ones to me.
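
For reference, a hedged annotation of these options based on their names and torch._inductor config conventions (exact semantics should be confirmed against the PyTorch source):

```python
import typing

options: dict[str, typing.Any] = {
    "aot_inductor.output_path": "aoti.so",     # where the compiled .so is written
    "aot_inductor.debug_compile": True,        # keep debug artifacts for inspection
    "aot_inductor.embed_kernel_binary": True,  # embed the .cubin kernels inside the .so
    "aot_inductor.force_mmap_weights": False,  # don't require mmap-ed weights
    "max_autotune": True,  # benchmark candidate kernels and keep the fastest
}
```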

"aot_inductor.output_path": output_path,
"aot_inductor.debug_compile": True,
"aot_inductor.force_mmap_weights": False,
"max_autotune": True,
Contributor

How are we autotuning? We don't know what GPU we are running on?

Contributor Author

Maybe @yushangdi can answer this question better. My understanding is that it will autotune for the GPU we have during the AOTI compile; it can get that info automatically.

Contributor

yeah that's correct.
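
For context, the compile-time GPU can be identified with standard torch.cuda APIs like the ones below; whether Inductor's autotuner uses exactly these calls is an assumption.

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # e.g. "NVIDIA A100-SXM4-80GB" with compute capability (8, 0)
    print(props.name, torch.cuda.get_device_capability(0))
```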

with open(so_path, "rb") as f:
    so_data = f.read()

named_data_store.add_named_data("so_blob", so_data, 1, "aoti_cuda_blob")
Contributor

where do you put the cubin?

Contributor Author

No cubin is explicitly needed; everything is already in the .so.

Contributor

yeah "aot_inductor.embed_kernel_binary": True, puts the kernel in .so

owning_program, submodule, call_module_node, tag, is_submodule
)

in_spec = pytree.tree_flatten((tuple(subgraph_signature.user_inputs), {}))[1]
Contributor Author

TODO: add a check examining the input signature of the first partition against the original edge program.

@@ -0,0 +1,116 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Contributor

Let's add some unit tests under backends/cuda/test

Comment on lines +725 to +727
submodule_example_inputs = (
    owning_program.example_inputs if is_first_partition else None
)
Contributor

This is not enough; we need to make sure the signature of the first partition is the same as the signature of the original exported program.
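
A hypothetical sketch of the requested check, comparing user-input signatures via ExportedProgram.graph_signature (the helper name is illustrative):

```python
from torch.export.exported_program import ExportedProgram


def assert_same_user_inputs(
    original: ExportedProgram, first_partition: ExportedProgram
) -> None:
    orig = original.graph_signature.user_inputs
    part = first_partition.graph_signature.user_inputs
    assert len(orig) == len(part), (
        f"first partition takes {len(part)} user inputs, "
        f"original program takes {len(orig)}"
    )
```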

Base automatically changed from install-cuda-pt to main September 20, 2025 06:22