
RFC-0037-Interoperability-Standard-of-3rd-Backend-Integration-Mechanism #64


Open · wants to merge 1 commit into base: master

Conversation

@FFFrog commented Apr 2, 2024

Proposal to add an interoperability standard for third-party backends based on the PrivateUse1 mechanism to PyTorch.

Rendered version: https://github.com/FFFrog/rfcs/blob/rfc-for-privateuse1/RFC-0037-Interoperability-Standard-of-3rd-Backend-Integration-Mechanism.md



## **Proposed Implementation**
![Architecture Outline](./RFC-0030-assets/3rd_backend_architecture.png)
Contributor:

I love this diagram!
I think we want to make a couple of clarifications here:

  • For the middle case of XPU, showcase what still needs to be in core (allocator, stream, event, etc.).
  • Similarly for the out-of-tree case, I would be curious to showcase what is in core and what provides the extension point being used.
  • I would separate the PrivateUse1 device in core (which will look similar to XPU) and make it point both to the out-of-core projects and to the demo project in core (which is independent from the core integration).

Author:

  1. This is my mistake; I didn't notice that XPU has already upstreamed some modules into PyTorch. Please refer to the latest diagram below.
  2. The out-of-tree case is similar to that of IPU: only some necessary interfaces, logical branch code, and so on are integrated into the core. However, PrivateUse1 also has its own special features:
     • Completeness: support as many torch functions as possible.
     • Universality: provide as much flexibility as possible to third-party devices, so as to shield differences between devices.
  3. As you commented below, it's probably best to keep the demo project as a PyTorch-organized project rather than keeping it in the tree.

Comment on lines +48 to +50
* **Usage**: Reserved for CI/CD and official standard reference implementation, not intended for user use.
* **Compilation**: Separate compilation entry point, separate from other backends.
* **Independence**: The related code is stored separately, without coupling with other functional modules of PyTorch.
Contributor:

I feel like a lot of this can be achieved by making this demo backend another repo in the pytorch/* org that we take as a submodule for testing only and build. This way, it is:

  • a fully independent codebase from core, just like the real out-of-tree backends
  • a real end-to-end example of how to make a new backend
  • fully testable in our CI and can be pinned as needed

Author:

Good idea.
Compared to in-tree, this is indeed a better way, thank you.

@FFFrog (Author) commented May 9, 2024

In addition, compared with in-tree, out-of-tree will be more troublesome when merging PRs that would break the third-party device, because developers need to coordinate between the two repos by pinning commit IDs or other methods.

Contributor:

Yes, there are pros and cons both ways. I'm looking into what would be the easiest to set up and will get back to you shortly.

| Category | Change | Count | Related PRs |
| --- | --- | --- | --- |
| Refactoring | Move new trace utils from source to header, which leads to some symbols not being found. | 1 | [#114367](https://github.com/pytorch/pytorch/pull/114367/files) |
| Refactoring | Migrate to getCvar* functions for env variable checking, which leads to a function name not being found. | 1 | [#113797](https://github.com/pytorch/pytorch/pull/113797) |
| New Features | Add support for new data types; a data type assert fails. | 2 | [#107586](https://github.com/pytorch/pytorch/pull/107586), [#116594](https://github.com/pytorch/pytorch/pull/116594) |
| New Features | Add a function to materialize COW storages, which adds a pure virtual function Allocator::copy_data; derived classes didn't implement this pure virtual function. | 2 | [#117053](https://github.com/pytorch/pytorch/pull/117053), [#113396](https://github.com/pytorch/pytorch/pull/113396) |
Contributor:

Given that these things are going to keep happening, I would be curious what we expect the workflow to be when such a change is needed.

The options that come to mind here would be:

  • A specific channel where such changes are tracked, so that extension writers can subscribe and update their extensions accordingly. I guess this would mean that the extension is pinned to some version of PyTorch and they move forward in lockstep.
  • We define the extension points implemented out of core in such a way that we can preserve BC there even when changing core. It might be tricky to define such an API, and it would restrict what extension writers can do.

In both cases, I think we need to keep ensuring that it is OK for extensions not to implement all the features that can be extended, either by having a generic feature flag to say which features each extension supports, or by having a good default implementation that just works.
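
To make that last point concrete, here is a purely hypothetical sketch of the "feature flag plus generic default" idea; none of the names below are real PyTorch or extension APIs, they just illustrate the shape of the mechanism being discussed.

```python
# Hypothetical sketch only: an extension declares which optional features it
# implements, and core falls back to a generic default for the rest.
GENERIC_DEFAULTS = {
    "pin_memory": lambda tensor: tensor,  # a default that "just works": no-op pinning
}

class ExtensionFeatures:
    def __init__(self, implemented):
        # implemented: feature name -> callable provided by the extension
        self.implemented = dict(implemented)

    def supports(self, feature):
        return feature in self.implemented

    def call(self, feature, *args, **kwargs):
        impl = self.implemented.get(feature) or GENERIC_DEFAULTS.get(feature)
        if impl is None:
            raise NotImplementedError(f"extension does not support {feature!r}")
        return impl(*args, **kwargs)
```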

Author:

In my opinion, the first approach seems more appropriate and aligns closely with my description in this RFC.

We can provide a way for third-party devices to easily obtain the modifications that have affected them (including the original PR and the adaptation method). This way, when the corresponding PyTorch version switches from version A to version B, it will be easy to see which parts need to be re-adapted for third-party devices. The advantages become more obvious when there are many third-party devices.

This approach has minimal impact and is relatively easy for third-party devices to accept. It also does not impose significant restrictions, obstacles, or additional workload on community developers.

Regarding the mentioned approach, there are two scenarios:

  • In tree:
    If the related test cases for the demo fail as part of CI, the developer needs to modify the corresponding implementation of the demo synchronously. During the final code merge, modifications to the demo files will be checked (to distinguish them from normal modifications to demo files), and if there are such modifications, a special tag will be added to the PR.
  • Out of tree:
    The reviewer can decide whether to add a special tag based on the actual situation and mark the corresponding PR in the PyTorch repository.

@FFFrog (Author) commented May 9, 2024

Thanks a lot to @albanD for the review.

Based on your comments, I have prepared a new diagram to illustrate the architecture outline, please take a look at it.

3rd_backend_architecture

@FFFrog FFFrog requested a review from albanD May 11, 2024 01:53
@jgong5 commented May 14, 2024

> Based on your comments, I have prepared a new diagram to illustrate the architecture outline, please take a look at it.

Thanks for the updated arch diagram. Some comments on the XPU: I feel that listing XPU as "partial out-tree" alongside the other out-of-tree devices might look a bit confusing:

  1. XPU is expected to be fully functional within PyTorch core in the short term, just like CUDA. Even though some of the ATen ops are supported via an out-of-tree repo, that repo is added as a third-party repo of PyTorch core and built together with it. This is different from other third-party devices, which are maintained out of the tree.
  2. Maintaining an out-of-tree ATen repo is an interim approach to facilitate the XPU upstreaming. We would target them in-tree in the longer term, so eventually it would be "all in-tree".

@FFFrog (Author) commented May 17, 2024

> Thanks for the updated arch diagram. Some comments on the XPU: I feel that listing XPU as "partial out-tree" alongside the other out-of-tree devices might look a bit confusing: […]

Sorry for the confusion I introduced.
I have updated the arch diagram again; please have a look at it, thank you.
(image: updated architecture diagram)

@jgong5 commented May 20, 2024

> Sorry for the confusion I introduced.
> I have updated the arch diagram again; please have a look at it, thank you.

The updated diagram looks good to me. Thanks for taking the time!

@artyom-beilis commented

Can you please set up a mailing list or an update policy for other out-of-tree backend developers?

I'm the author of the OpenCL backend https://github.com/artyom-beilis/pytorch_dlprim, and I'm now catching up with the changes in 2.3 and 2.4. Since I only work on it part time, it would be better to have some notices and updates in advance.

@FFFrog (Author) commented Jul 12, 2024

> Can you please set up a mailing list or an update policy for other out-of-tree backend developers?
>
> I'm the author of the OpenCL backend https://github.com/artyom-beilis/pytorch_dlprim, and I'm now catching up with the changes in 2.3 and 2.4. Since I only work on it part time, it would be better to have some notices and updates in advance.

Sorry for the late reply; I've been on vacation recently.

My colleagues and I have started development work, and the initial version will support Runtime.

We will soon try to communicate with the community, with the goal of creating a project under the PyTorch organization, and then we will push our initial version to that project as soon as possible.

If you are interested, you are more than welcome to participate in this work.

@artyom-beilis commented

> We will soon try to communicate with the community, with the goal of creating a project under the PyTorch organization, and then we will push our initial version to that project as soon as possible.
>
> If you are interested, you are more than welcome to participate in this work.

Which project? In general, anything that would simplify maintaining an out-of-tree backend is welcome :-) I work on it in my spare time and sometimes I just can't keep up with all the changes.

@albanD (Contributor) commented Jul 17, 2024

Thanks for the update! It sounds great.

> Can you please set up a mailing list or an update policy for other out-of-tree backend developers?

I think a mailing list might be a bit challenging, but looking at the change history of the demo module should give an idea of what was added or changed recently.

> In general, anything that would simplify maintaining an out-of-tree backend is welcome :-) I work on it in my spare time and sometimes I just can't keep up with all the changes.

There is quite a bit of churn, and I expect there will still be for a few more months as we fully stabilize the new improved API (you might want to wait a bit to upgrade if you don't have much time).
I do expect that it will quickly pay off, though, as having a shared interface and extension point will allow us to improve both ease of use (because we designed this API for that exact purpose) and stability (because we have multiple users who will catch accidental regressions).

@FFFrog (Author) commented Jul 19, 2024

@artyom-beilis, if there are no other special circumstances, we will open source our project providing a PyTorch third-party backend reference demo based on the PrivateUse1 mechanism in the next week or so.

This is what we want to do with the project; I just drew a simple diagram, and more detailed information can be found in the code.

Reference Demo

@FFFrog (Author) commented Jul 19, 2024

@albanD, I drew a simple diagram of the overall project structure and what we want to do.

I want to explain something about the diagram.

  • The xpu in the picture is different from Intel's XPU; it is just a name for a generic device.
  • The API designs of many manufacturers draw more or less on CUDA, so using CUDA as the standard can maximize compatibility with various third-party devices.
  • If the device has its own dedicated API, then the modules with a blue background in the picture may need to be changed; if the device API is similar to CUDA, in theory only a few changes are needed.

@FFFrog (Author) commented Aug 6, 2024

Hi, @albanD @artyom-beilis , sorry for the late feedback.

At present, we have implemented the first version of the demo according to the community's latest third-party device integration mechanism. The main framework is complete, including basic general runtime capabilities, operator registration, autocast, etc.
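
For context, below is a minimal sketch of the Python-side wiring that a PrivateUse1-based backend like this demo typically performs; "npu" is just an example backend name and the device module is a stub, not the demo project's actual code.

```python
import types
import torch

# Rename the PrivateUse1 dispatch key so tensors report device="npu" (example name).
torch.utils.rename_privateuse1_backend("npu")

# Register a (stubbed) device module so torch.npu.* exists; a real backend exposes
# its runtime here (device count, streams, events, memory stats, ...).
npu = types.ModuleType("torch.npu")
npu.is_available = lambda: True
npu.device_count = lambda: 1
torch._register_device_module("npu", npu)

# Generate Tensor.npu() / Tensor.is_npu style convenience methods for the new name.
torch.utils.generate_methods_for_privateuse1_backend()
```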

Of course, there are still many general details that have not been completed, such as:

  • More general: except for the npu directory in the root directory (which is a collection of backend-specific functions and can be replaced with other backends), remove all npu-related naming, e.g. rename torch_npu to torch_backend, csrc/npu to csrc/backend, etc.
  • Codegen: redesigned to facilitate out-of-the-box use by other new backends.
  • Backend custom API: provide the ability to integrate backend-specific custom APIs.
  • Documentation: end-to-end documentation.
  • Test case sets: a general test case collection, etc.

We will work hard to complete the features and details above. Once everything is ready, we will try to integrate CUDA into PyTorch through this demo and provide a full-process integration tutorial.

If you have any questions or suggestions, please let me know. Thank you.

@artyom-beilis commented

Hi @FFFrog

Looking at NPU's readme

> This project provides a foundational layer that abstracts and standardizes the interaction between PyTorch and different types of GPU hardware.

Does it mean I'm expected to implement an out-of-tree backend in terms of NPU, i.e. as a sort of NPU extension? Should I use it, or can I continue working on my existing out-of-tree implementation?

Finally, there are something like three GPU out-of-tree implementations that I'm aware of: Intel's XPU, Apple Metal, and my dlprimitives/OpenCL.

@FFFrog (Author) commented Aug 7, 2024

> This project provides a foundational layer that abstracts and standardizes the interaction between PyTorch and different types of GPU hardware.
>
> Does it mean I'm expected to implement an out-of-tree backend in terms of NPU, i.e. as a sort of NPU extension? Should I use it, or can I continue working on my existing out-of-tree implementation?

First of all, thank you very much for your comments.

Integrating a new backend into PyTorch through the third-party device integration mechanism based on PrivateUse1 has the following challenges:

  • High development threshold: the third-party device integration mechanism is mainly implemented through various scattered and irregular hooks and registration mechanisms, and lacks a unified view.
  • Poor reusability: there are many common features among backends integrated through the third-party device integration mechanism, such as codegen (automatic operator registration, custom operator routing, forward/backward binding, etc.), the PyTorch common API, a common memory pool strategy, a common test case set, etc.

However, because various third-party backends can differ, and in order to ensure universality as much as possible, our current strategy treats CUDA as the standard, and all other backend APIs need to align themselves with the CUDA API (CUDA currently dominates the field of artificial intelligence, and the CUDA API is well known in the industry).

For this DEMO project, we plan to divide it into two phases:

  • Phase 1: this is the phase we are in now, which requires copy & modify; the project is mainly used as a reference implementation.
  • Phase 2: this is what we will do next: completing the device abstraction layer so that the third-party backend serves as a plug-in for this demo (a rough sketch of this idea follows below). It is worth adding that for the PyTorch general API, ideally, the third-party backend only needs to implement the backend API corresponding to the CUDA API; but for custom APIs, the backend currently needs to complete the end-to-end integration with PyTorch by itself.
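
As a rough illustration of that Phase 2 idea (purely hypothetical names, not the demo's actual design): the abstraction layer would program against a CUDA-shaped table of runtime entry points that each backend fills in.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: a CUDA-shaped runtime interface the abstraction layer calls
# into; a backend plugs in by supplying its own implementations of each entry point.
@dataclass
class DeviceRuntime:
    set_device: Callable[[int], None]          # cf. cudaSetDevice
    malloc: Callable[[int], int]               # cf. cudaMalloc, returns a device handle
    free: Callable[[int], None]                # cf. cudaFree
    memcpy_h2d: Callable[[int, bytes], None]   # cf. cudaMemcpy (host -> device)
    synchronize: Callable[[], None]            # cf. cudaDeviceSynchronize

# A backend whose API already mirrors CUDA only needs a thin mapping, e.g.:
# runtime = DeviceRuntime(set_device=mydevSetDevice, malloc=mydevMalloc, ...)
```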

Back to your question: if your time permits, I recommend that you wait until our device abstraction layer is completed before integrating with PyTorch.

@FFFrog (Author) commented Aug 7, 2024

@artyom-beilis

> Finally, there are something like three GPU out-of-tree implementations that I'm aware of: Intel's XPU, Apple Metal, and my dlprimitives/OpenCL.

As far as I know:

  • Intel XPU: currently in a semi-built-in state and will be fully in-tree later; its dispatch key is xpu (a dedicated key).
  • Apple Metal: this is an in-tree backend; its dispatch key is mps (a dedicated key).
  • dlprimitives/OpenCL: out-of-tree.
  • Intel HPU: currently out-of-tree; its dispatch key is hpu (a dedicated key), but the out-of-tree repo is not open source.
  • Meta MTIA: out-of-tree; its dispatch key is mtia (a dedicated key).
  • Huawei NPU: out-of-tree; its dispatch key is PrivateUse1 (the public key).

@artyom-beilis commented

> High development threshold: the third-party device integration mechanism is mainly implemented through various scattered and irregular hooks and registration mechanisms, and lacks a unified view.

From my point of view it was mostly about implementing operators, while the biggest problem was understanding which ones are required, which are basic, and what conditions apply; there is sometimes a lack of documentation (for example, what is the difference between _copy_from and _copy_from_and_resize?).

> Poor reusability: there are many common features among backends integrated through the third-party device integration mechanism, such as codegen (automatic operator registration, custom operator routing, forward/backward binding, etc.),

There are two things. One is operators that can be implemented in terms of others: it would be nice to have some kind of operator tree that shows which operators are native and which are implemented in terms of others.

Regarding codegen: do you mean automatic kernel code generation, or building operators in terms of other operators?

> the PyTorch common API, a common memory pool strategy,

The pool was probably the trickiest part to implement, and it is still sub-optimal in terms of memory allocation. There are also many interesting points to consider that aren't similar to CUDA: in OpenCL, for example, you can't use pointer arithmetic on the host as in CUDA; you need to add an offset or use sub-buffers, and some integrated GPU devices share memory with the CPU (Intel, AMD APUs, ARM).

> a common test case set, etc.

This would be awesome

> Back to your question: if your time permits, I recommend that you wait until our device abstraction layer is completed before integrating with PyTorch.

I understand, but that is problematic: if I wait for an API to be finalized, I'll wait forever ;-).

What is expected to change? The most critical part, and most of the work, is the implemented operators.

@FFFrog (Author) commented Aug 8, 2024

@artyom-beilis Hi, I will get back to you tomorrow; sorry for the delay, as I've been a bit busy lately.

@FFFrog (Author) commented Aug 9, 2024

> From my point of view it was mostly about implementing operators, while the biggest problem was understanding which ones are required, which are basic, and what conditions apply; there is sometimes a lack of documentation (for example, what is the difference between _copy_from and _copy_from_and_resize?).

Yes, there are many operators in PyTorch, and some of them are very similar to each other. We can roughly divide all operators into two parts:

  • Factory operators: all operators related to tensor creation, conversion, etc.
  • Computational operators: all operators that operate on tensors.

We will provide a reference implementation and documentation for all factory operators, but not for the latter, because the latter are easy to understand.

> There are two things. One is operators that can be implemented in terms of others: it would be nice to have some kind of operator tree that shows which operators are native and which are implemented in terms of others.

Yes, PyTorch has many operators that are composed of other, more basic operators. We could provide a tree list as you described above, but there are some issues with timeliness, because the relationships between operators get updated.
One approach that may give you a hand:

  • Compile PyTorch with DEBUG enabled
  • export TORCH_SHOW_DISPATCH_TRACE=1
  • python -c "import torch; torch.rand(3,3)"

Then you will get a dispatch trace of the operators like the one below, and will know which operators need to be implemented:

 [call] op=[aten::rand], key=[BackendSelect]
  [redispatch] op=[aten::rand], key=[CPU]
   [call] op=[aten::empty.memory_format], key=[BackendSelect]
    [redispatch] op=[aten::empty.memory_format], key=[CPU]
   [call] op=[aten::uniform_], key=[CPU]
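
As a concrete (and heavily simplified) illustration of the registration side, the Python torch.library API can attach a kernel to the PrivateUse1 key for one of the ops appearing in the trace above. Real backends typically do this in C++ via TORCH_LIBRARY_IMPL, and the kernel body below is only a placeholder, not the demo project's code.

```python
import torch

lib = torch.library.Library("aten", "IMPL")

def backend_empty_memory_format(size, dtype=None, layout=None, device=None,
                                pin_memory=None, memory_format=None):
    # A real backend would allocate storage on its own device here; this stub
    # only marks the extension point that the dispatcher routes to.
    raise NotImplementedError("allocate device memory for the custom backend here")

# Route aten::empty.memory_format to the stub whenever the PrivateUse1 device
# (named "privateuseone" by default) is involved.
lib.impl("empty.memory_format", backend_empty_memory_format, "PrivateUse1")
```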

@FFFrog (Author) commented Aug 9, 2024

> Regarding codegen: do you mean automatic kernel code generation, or building operators in terms of other operators?

As for codegen, it will generate a lot of code according to our requirements, including but not limited to forward operator registration, backward operator registration, custom operator routing files, etc.

All you need to do is implement the operators related to the specific backend and provide a YAML file with the operator information; the codegen will automatically generate all the code you need.

What I would like to add is that the codegen is still under development and the design of the YAML has not yet been determined; you can take PyTorch's YAML as a reference for now.

@FFFrog (Author) commented Aug 9, 2024

> The pool was probably the trickiest part to implement, and it is still sub-optimal in terms of memory allocation. There are also many interesting points to consider that aren't similar to CUDA: in OpenCL, for example, you can't use pointer arithmetic on the host as in CUDA; you need to add an offset or use sub-buffers, and some integrated GPU devices share memory with the CPU (Intel, AMD APUs, ARM).

I absolutely agree with you.
We plan to provide several basic memory pool strategies, such as a CUDA-style memory pool, an OpenCL-style memory pool, and so on. A new backend can pick the most appropriate strategy to implement its allocator according to its own characteristics; of course, it can also implement its own memory pool from scratch.
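
To illustrate the general strategy (a generic sketch only, not any PyTorch or demo API): a CUDA-style caching pool keeps freed blocks in per-size free lists and reuses them before asking the device for new memory.

```python
# Generic caching-pool sketch; raw_alloc/raw_free stand for the backend's own
# device allocation calls (e.g. clCreateBuffer/clReleaseMemObject in OpenCL).
class CachingMemoryPool:
    def __init__(self, raw_alloc, raw_free):
        self.raw_alloc = raw_alloc
        self.raw_free = raw_free
        self.free_blocks = {}            # block size -> list of cached handles

    def allocate(self, size):
        cached = self.free_blocks.get(size)
        if cached:
            return cached.pop()          # reuse a cached block, no device call
        return self.raw_alloc(size)      # cache miss: ask the device for memory

    def release(self, handle, size):
        # Keep the block for reuse instead of freeing it immediately.
        self.free_blocks.setdefault(size, []).append(handle)

    def empty_cache(self):
        # Return everything to the device (cf. torch.cuda.empty_cache()).
        for size, handles in self.free_blocks.items():
            for handle in handles:
                self.raw_free(handle)
        self.free_blocks.clear()
```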

@FFFrog (Author) commented Aug 12, 2024

@albanD, sorry to bother you again.
It seems that you are working on accelerator diversity in PyTorch, and perhaps this project could help you accelerate that goal. Could you scan the project quickly and give us some advice?

@artyom-beilis commented

> We will provide a reference implementation and documentation for all factory operators, but not for the latter, because the latter are easy to understand.

That would be fantastic, because sometimes I just wonder what an operator is doing and under what conditions.

> All you need to do is implement the operators related to the specific backend and provide a YAML file,

This is the first time I've heard of the YAML...

> We plan to provide several basic memory pool strategies, such as a CUDA-style memory pool, an OpenCL-style memory pool,

Yes, this would be nice, because the PyTorch OpenCL/dlprimitives backend currently uses much more memory than it should.

> perhaps this project could help you accelerate that goal. Could you scan the project quickly and give us some advice?

Probably what I need most is some place where you can actually ask questions about things that aren't easy to understand. So far @albanD has done an amazing job helping with the OpenCL backend. Currently I mostly ask questions on dev-discuss, but sometimes I feel there is a lot of stuff (I'm currently stuck on torch.load... without moving the model to CPU and back to the device).

Thanks a lot!

@FFFrog (Author) commented Aug 13, 2024

Our ultimate goal is that a new backend can be integrated into PyTorch smoothly by implementing some backend-specific APIs and structures, without having to consider any detailed PyTorch-internal concerns such as the CPU fallback (@albanD has done it for dlprimitives), backend renaming, etc.

Of course, it is not always convenient for every backend to integrate into PyTorch using this project; if the new backend doesn't involve many PyTorch features, doing it from scratch may also be a good choice.

By the way, if you have any questions about new backend integration, you can also mention me or file an issue in the project; I am very glad to share with everyone.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Aug 27, 2024
malfet pushed a commit to aditew01/pytorch that referenced this pull request Sep 13, 2024
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024