Aoti cuda export support #14438

Conversation
🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14438. Note: links to docs will display an error until the docs builds have completed.
❌ As of commit 679b0e0 with merge base a548635: 6 new failures, 6 cancelled jobs (please retry), and 2 unrelated failures that were already broken on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid the unrelated failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
print("STDOUT:") | ||
print(result.stdout) |
Please use `logging`; it provides different levels of logging.
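For illustration, a minimal sketch of the suggested change using the standard-library `logging` module (the subprocess command here is a placeholder, not the PR's actual call):

```python
import logging
import subprocess

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Placeholder command standing in for the test's subprocess invocation.
result = subprocess.run(["echo", "hello"], capture_output=True, text=True)

logger.debug("STDOUT: %s", result.stdout)      # verbose detail
if result.returncode != 0:
    logger.error("STDERR: %s", result.stderr)  # failures stand out
```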
Branch force-pushed from 5d521f3 to b00bc14.
@final
class CudaPartitioner(Partitioner):
So this is basically just skeleton code? I think we could skip having an initial partitioner implementation entirely and only allow the other to_backend API for now. Is that how it works?
We should keep it; we need this partitioner to skip ET operator decomposition for all ops.
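As context for why the skeleton matters, an illustrative sketch of how a partitioner can collect the ops to exempt from decomposition; this mirrors the `ops_to_not_decompose` hook on ExecuTorch's `Partitioner`, but the body is my assumption, not this PR's code:

```python
from typing import Callable, List, Optional, Tuple

import torch
from torch.export import ExportedProgram


def ops_to_not_decompose(
    ep: ExportedProgram,
) -> Tuple[List[torch._ops.OpOverload], Optional[Callable]]:
    # Return every ATen op in the graph so nothing is decomposed
    # before AOTI compiles the partition.
    ops = [
        node.target
        for node in ep.graph.nodes
        if node.op == "call_function"
        and isinstance(node.target, torch._ops.OpOverload)
    ]
    return ops, None
```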
fake_edge_program = copy.deepcopy(edge_program)
partitioner_result = partitioner_instance(fake_edge_program)
tagged_exported_program = partitioner_result.tagged_exported_program
tagged_exported_program.example_inputs = edge_program.example_inputs
Are we serializing the example inputs?
By serializing, do you mean exir.save, or serializing to the .pte?
if os.path.isfile(file):
    os.remove(file)
    print(f"Removed file: {file}")
except Exception as e:
These should be fatal exceptions, right? The test should fail?
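A sketch of one way to make the failure fatal while still logging it (`files_to_clean` is a hypothetical name for the PR's list of generated artifacts):

```python
import logging
import os

logger = logging.getLogger(__name__)

files_to_clean = ["aoti.so"]  # hypothetical list of generated artifacts

for file in files_to_clean:
    try:
        if os.path.isfile(file):
            os.remove(file)
            logger.info("Removed file: %s", file)
    except OSError:
        # Log with traceback, then re-raise so the test fails loudly
        # instead of silently swallowing the cleanup error.
        logger.exception("Failed to remove %s", file)
        raise
```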
*kernel_metadata.json
*kernel.cpp
*wrapper_metadata.json
*wrapper.cpp
Does wrapper.cpp stick around? Doesn't AOTI compilation clean it up after generating the .so and .cubin?
Right now it generates lots of files, including the cubin, wrapper.cpp, etc., and they will be auto-cleaned up.
# fallback operators that exist in the et namespace
supported_fallback_kernels: Dict[str, Any] = {}
@larryliu0820 I feel like I keep hearing conflicting information. Is AOTI falling back to ET, or is it a graph break? A graph break sounds more natural in the ET ecosystem to me.
Here we leverage the AOTI fallback and don't break the graph while working on missing operators, which is more natural for AOTI and convenient for us.
> A graph break sounds more natural in the ET ecosystem to me

Yes, I'm thinking we want to have some CUDA kernels and let it graph-break. We can also reuse those kernels in the AOTI fallback.
output_path = os.path.join(os.getcwd(), "aoti.so")
options: dict[str, typing.Any] = { |
Can you add some documentation on what these options are, or where they are defined? `debug_compile` and `embed_kernel_binary` are the two that are non-obvious to me.
"aot_inductor.output_path": output_path, | ||
"aot_inductor.debug_compile": True, | ||
"aot_inductor.force_mmap_weights": False, | ||
"max_autotune": True, |
How are we autotuning? We don't know what GPU we'll be running on.
Maybe @yushangdi can answer this better. My understanding is that it will autotune for the GPU we have during AOTI compile; it can get the device info automatically.
Yeah, that's correct.
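For reference, a hedged annotation of these options based on the `torch._inductor` config names; treat the comments as my reading, and check the inductor config definitions for the authoritative semantics:

```python
import os
import typing

output_path = os.path.join(os.getcwd(), "aoti.so")

options: dict[str, typing.Any] = {
    # Where AOTI writes the compiled shared library.
    "aot_inductor.output_path": output_path,
    # Keep intermediate artifacts (wrapper.cpp, kernel sources) around
    # for debugging rather than deleting them after compilation.
    "aot_inductor.debug_compile": True,
    # Embed the compiled CUDA kernel binaries (cubins) inside the .so
    # instead of shipping them as separate files (see the reply later
    # in this thread).
    "aot_inductor.embed_kernel_binary": True,
    # Do not force weights to be mmap-able.
    "aot_inductor.force_mmap_weights": False,
    # Benchmark candidate kernels on the GPU available at compile time
    # and pick the fastest, as discussed above.
    "max_autotune": True,
}
```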
with open(so_path, "rb") as f:
    so_data = f.read()

named_data_store.add_named_data("so_blob", so_data, 1, "aoti_cuda_blob")
Where do you put the cubin?
No cubin is explicitly needed; everything is already in the .so.
yeah "aot_inductor.embed_kernel_binary": True,
puts the kernel in .so
Branch force-pushed from 7ead0cd to 679b0e0.
    owning_program, submodule, call_module_node, tag, is_submodule
)

in_spec = pytree.tree_flatten((tuple(subgraph_signature.user_inputs), {}))[1]
TODO: add a check comparing the input signature of the first partition against the original edge program.
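A hypothetical sketch of that check, reusing the names from the snippet above (`edge_program` stands for the original edge program; this is not the PR's code):

```python
import torch.utils._pytree as pytree

# Flatten the original program's user inputs the same way the partition's
# in_spec was built, then require the TreeSpecs to match.
expected_in_spec = pytree.tree_flatten(
    (tuple(edge_program.graph_signature.user_inputs), {})
)[1]
if in_spec != expected_in_spec:
    raise RuntimeError(
        "Input signature of the first partition does not match the "
        f"original edge program: {in_spec} vs {expected_in_spec}"
    )
```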
@@ -0,0 +1,116 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Let's add some unit tests under backends/cuda/test.
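A minimal sketch of what such a test could look like; the model is a placeholder, and the commented-out `to_backend` call marks where the CUDA partitioner from this PR would plug in:

```python
import unittest

import torch
from executorch.exir import to_edge


class TestCudaExport(unittest.TestCase):
    def test_simple_add(self) -> None:
        class Add(torch.nn.Module):
            def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
                return x + y

        model = Add().eval()
        inputs = (torch.randn(4), torch.randn(4))
        edge = to_edge(torch.export.export(model, inputs))
        # edge = edge.to_backend(CudaPartitioner(...))  # this PR's delegate
        out = edge.exported_program().module()(*inputs)
        self.assertTrue(torch.allclose(out, model(*inputs)))


if __name__ == "__main__":
    unittest.main()
```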
submodule_example_inputs = (
    owning_program.example_inputs if is_first_partition else None
)
This is not enough; we need to make sure the signature of the first partition matches the signature of the original exported program.
This PR introduces export support for the CUDA delegate using the AOTI library, and adds a CI test for verification.
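For readers skimming the thread, a hypothetical end-to-end sketch of the flow this PR enables; the partitioner name and blob key come from the diff above, and the exact public API may differ:

```python
import torch
from executorch.exir import to_edge


class Model(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x) * 2.0


ep = torch.export.export(Model().eval(), (torch.randn(8),))
edge = to_edge(ep)
# Delegation (per this PR): AOTI compiles the partition into a CUDA .so,
# which is stored in the program's named data as "aoti_cuda_blob".
# edge = edge.to_backend(CudaPartitioner(...))
with open("model.pte", "wb") as f:
    f.write(edge.to_executorch().buffer)
```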