RFC-0037-Interoperability-Standard-of-3rd-Backend-Integration-Mechanism #64
Conversation
## **Proposed Implementation**
I love this diagram!
I think we want to make a couple of clarifications here:
- For the middle case of XPU, showcase what still needs to be in core (allocator, stream, event, etc.).
- Similarly for the out-of-tree case, I would be curious to showcase what is in core and provides the extension point being used.
- I would separate out the PrivateUse1 device in core (which will look similar to XPU) and make it point both to the out-of-core projects and to the demo project in core (which is independent from the core integration).
- This is my mistake; I didn't notice that XPU has upstreamed some modules into PyTorch. Please refer to the latest diagram below.
- The out-of-tree case is similar to that of IPU: only some necessary interfaces, logic branches, and so on are integrated into core. However, PrivateUse1 also has its own special features:
  - Completeness: support as many torch functions as possible
  - Universality: provide as much flexibility as possible to third-party devices, shielding the differences between devices
- As you commented below, it's probably best to keep the demo project as a PyTorch-organized project rather than keeping it in the tree.
* **Usage**: Reserved for CI/CD and the official standard reference implementation, not intended for user use.
* **Compilation**: Separate compilation entry point, separate from other backends.
* **Independence**: The related code is stored separately, without coupling to other functional modules of PyTorch.
I feel like a lot of this can be achieved by making this demo backend another repo in the pytorch/* org that we take in as a submodule, for testing only, and build. This way, it is:
- A fully independent codebase from core, just like the real out-of-tree backends
- A real end-to-end example of how to make a new backend
- Fully testable in our CI and pinnable as needed
Good idea.
Compared to in-tree, this is indeed a better way, thank you.
In addition, compared with in-tree, out-of-tree is more troublesome when merging PRs that make the third-party device unavailable, because developers need to coordinate between the two repos by pinning a commit id or other means.
Yes, there are pros and cons both ways. I'm looking into what would be easiest to set up and will get back to you shortly.
RFC-0037-Interoperability-Standard-of-3rd-Backend-Integration-Mechanism.md
| Category | Description | Count | PRs |
| --- | --- | --- | --- |
| Refactoring | Moved new trace utils from source to header, which caused some symbols to not be found. | 1 | [#114367](https://github.com/pytorch/pytorch/pull/114367/files) |
| Refactoring | Migrated to getCvar* functions for env variable checking, which caused a function name to not be found. | 1 | [#113797](https://github.com/pytorch/pytorch/pull/113797) |
| New features | Added support for new data types; a data type assert fails. | 2 | [#107586](https://github.com/pytorch/pytorch/pull/107586), [#116594](https://github.com/pytorch/pytorch/pull/116594) |
| New features | Added a function to materialize COW storages, which introduced a pure virtual function `Allocator::copy_data`; derived classes didn't implement this pure virtual function. | 2 | [#117053](https://github.com/pytorch/pytorch/pull/117053), [#113396](https://github.com/pytorch/pytorch/pull/113396) |
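The last row is a good illustration of the failure mode: adding a pure virtual method to a core base class breaks every out-of-tree subclass that hasn't been updated. A minimal Python analogue (illustrative only; the real issue is in the C++ `Allocator` class, and all names here are made up):

```python
from abc import ABC, abstractmethod

# Core, "version A": the base class has one abstract method.
class AllocatorV2(ABC):
    @abstractmethod
    def allocate(self, nbytes): ...
    # Core, "version B", adds a new abstract method (analogous to the
    # pure virtual Allocator::copy_data introduced in #117053):
    @abstractmethod
    def copy_data(self, dest, src): ...

# An out-of-tree backend written before the change: it does not
# implement copy_data, so it can no longer be instantiated.
class StaleThirdPartyAllocator(AllocatorV2):
    def allocate(self, nbytes):
        return bytearray(nbytes)

# The fix: the backend implements the new hook.
class FixedThirdPartyAllocator(AllocatorV2):
    def allocate(self, nbytes):
        return bytearray(nbytes)

    def copy_data(self, dest, src):
        dest[: len(src)] = src

try:
    StaleThirdPartyAllocator()
except TypeError as e:
    print("stale backend broken:", e)

alloc = FixedThirdPartyAllocator()
buf = alloc.allocate(4)
alloc.copy_data(buf, b"ok")
```

In C++ the failure is even harder, since the stale backend fails at compile/link time rather than at instantiation.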
Given that these things are going to continue happening, I would be curious what we expect the workflow to be when such a change is needed.
The options that come to mind here would be:
- A specific channel where such changes are tracked, so that extension writers can subscribe and update their extensions accordingly. I guess this would mean that the extension is pinned to some version of PT and they move forward in lockstep.
- We define the extension points implemented out of core in such a way that we can preserve BC there even when changing core. It might be tricky to define such an API, and it would restrict what extension writers can do.
In both cases, I think we might need to continue ensuring that it is ok for extensions not to implement all the features that can be extended. Either by having some generic feature flag to say which features each extension supports, or by having a good default implementation that works.
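The two mitigations mentioned above (a per-extension capability flag and a good default implementation) can be sketched together; this is a hypothetical illustration, not a real PyTorch API, and all names are assumptions:

```python
class Backend:
    # Each extension advertises which optional features it supports.
    supports = frozenset()

    def pin_memory(self, data):
        # Good default: a no-op fallback that still works correctly.
        return data

class FancyBackend(Backend):
    supports = frozenset({"pin_memory"})

    def pin_memory(self, data):
        return ("pinned", data)

def maybe_pin(backend, data):
    # Core only calls the extension point if the feature is advertised;
    # otherwise it falls back to the safe default behaviour.
    if "pin_memory" in backend.supports:
        return backend.pin_memory(data)
    return data

print(maybe_pin(Backend(), [1, 2]))       # default path
print(maybe_pin(FancyBackend(), [1, 2]))  # extension path
```

The point is that core can evolve the set of extension points without forcing every backend to implement all of them at once.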
In my opinion, the first approach seems more appropriate and aligns closely with my description in this RFC.
We can provide a way for third-party devices to easily obtain the modifications that have affected them (including the original PR and the adaptation method). This way, when the corresponding PyTorch version switches from version A to version B, it is easy to see which parts need to be re-adapted for third-party devices. The advantages become more obvious as the number of third-party devices grows.
This approach has minimal impact and is relatively easy for third-party devices to accept. It also does not impose significant restrictions, obstacles, or additional workload on community developers.
Regarding the mentioned approach, there are two scenarios:
- In tree: if the demo's test cases fail as part of CI, the developer needs to update the demo's implementation in the same PR. During the final merge, modifications to the demo files are checked (distinguishing them from ordinary demo changes), and if there are any, a special tag is added to the PR.
- Out of tree: the reviewer can decide whether to add a special tag based on the actual situation and mark the corresponding PR in the PyTorch repository.
Thanks a lot for the review, @albanD. Based on your comments, I have prepared a new diagram to illustrate the architecture outline; please take a look.
Thanks for the updated arch diagram. Some comments on the XPU: I feel that listing XPU as "partial out-of-tree", similarly to the other out-of-tree devices, might look a bit confusing:
The updated diagram looks good to me. Thanks for taking the time!
Can you please set up a mailing list or update policy for other out-of-tree backend developers? I'm the author of the OpenCL backend https://github.com/artyom-beilis/pytorch_dlprim and I'm now catching up with changes in 2.3 and 2.4. Since I only do this part time, it would be better to have notices and updates in advance.
Sorry for the late reply; I've been on vacation recently. My colleagues and I have started development work, and the initial version will support the Runtime. We will soon try to communicate with the community, with the goal of creating a project under the PyTorch organization, and then push our initial version to that project as soon as possible. If you are interested, you are more than welcome to participate in this work.
Which project? In general, anything that would simplify maintaining an out-of-tree backend is welcome.
Thanks for the update! It sounds great.
I think a mailing list might be a bit challenging, but looking at the change history of the demo module should give an idea of what was added or changed recently.
There is quite a bit of churn, and I expect there will be for a few more months while we fully stabilize the new improved API (you might want to wait a bit before upgrading if you don't have much time).
@artyom-beilis, if there are no other special circumstances, we will open-source our project providing a PyTorch third-party reference demo using the PrivateUse1 mechanism within the next week or so. This is what we want to do with the project; I only drew a simple diagram, and the more detailed information is in the code.
@albanD, I drew a simple diagram of the overall project structure and what we want to do. Let me explain a few things about the diagram.
Hi @albanD, @artyom-beilis, sorry for the late feedback. At present, we have implemented the first version of the demo according to the community's latest third-party device integration mechanism. The main framework is complete, including basic general Runtime capabilities, operator registration, autocast, etc. Of course, many general details are still unfinished, such as:
We will work hard to improve the above features and other details. After everything is ready, we will try to integrate CUDA into PyTorch through this demo and provide a full-process integration tutorial. If you have any questions or suggestions, please let me know. Thank you.
Hi @FFFrog, looking at the NPU's readme:
Does it mean I'm expected to implement an out-of-tree backend in terms of NPU, i.e. as a sort of NPU extension? Should I use it, or can I continue working on my existing out-of-tree implementation? Finally, there are about three GPU OOT implementations I'm aware of: Intel's XPU, Apple Metal, and my dlprimitives/OpenCL.
First of all, thank you very much for your comments. There are several challenges in integrating a new backend into PyTorch through the PrivateUse1-based third-party device integration mechanism:
However, because various third-party backends may differ, in order to ensure universality as much as possible, our current strategy treats CUDA as the standard: all other backend APIs need to align themselves with the CUDA API (CUDA currently dominates the field of artificial intelligence, and the CUDA API is well known in the industry). For this demo project, we plan to divide the work into two phases:
Back to your question: if your time permits, we recommend that you wait until our device abstraction layer is complete before integrating with PyTorch.
As far as I know:
From my point of view it was mostly implementing operators, while the biggest problem was understanding which ones are required and basic, and what conditions are required; sometimes there is a lack of documentation (for example, what is the difference between
There are two things. For operators that can be implemented in terms of others, it would be nice to have some kind of operator tree showing the native operators and the ones implemented in terms of others. Regarding code-gen: do you mean automatic kernel code generation, or building operators in terms of other operators?
Pooling was probably the trickiest part to implement, and it is still sub-optimal in terms of memory allocation. There are also many interesting points to consider that aren't similar to CUDA: in OpenCL, for example, you can't use pointer arithmetic on the host as in CUDA; you need to add an offset or use sub-buffers, and some integrated GPU devices share memory with the CPU (Intel and AMD APUs, ARM).
This would be awesome
I understand, but it is problematic: if I wait for the API to be finalized, I'll wait forever. What is expected to change? The most critical part, and most of the work, is the operators implemented.
@artyom-beilis Hi, I will get back to you tomorrow; sorry for the delay, I've been a bit busy lately.
Yes, there are many operators in PyTorch and some of them are very similar to each other; we can simply divide all operators into two parts.
We will provide reference implementations and documentation for all factory operators, but not for the latter group, because those are easy to understand.
Yes, PyTorch has many operators that are composed of other basic operators. We can provide a tree list as you described above, but there are timeliness issues, because the relationships between operators get updated.
Then you will get a backtrace on the operator, like the one below, and will know which operator needs to be implemented:
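As a hypothetical sketch of what such an error looks like (the registry class and names here are illustrative; PyTorch's real message has the form "Could not run 'aten::abs.out' with arguments from the 'PrivateUse1' backend"):

```python
class OpRegistry:
    """Toy dispatcher: maps operator names to backend kernels."""

    def __init__(self):
        self.kernels = {}

    def register(self, op, fn):
        self.kernels[op] = fn

    def call(self, op, *args):
        if op not in self.kernels:
            # Mimics the error a user sees for a missing backend kernel.
            raise NotImplementedError(
                f"Could not run '{op}' with arguments from the "
                f"'PrivateUse1' backend: no kernel registered"
            )
        return self.kernels[op](*args)

reg = OpRegistry()
reg.register("aten::add", lambda a, b: a + b)

print(reg.call("aten::add", 1, 2))
try:
    reg.call("aten::abs", -1)
except NotImplementedError as e:
    print(e)  # names the operator that still needs a kernel
```

The error message itself is the "tree list" hint in miniature: it names exactly the next operator to implement.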
As for codegen, it will generate a lot of code according to our requirements, including but not limited to forward operator registration, backward operator registration, custom operator routing files, etc. All you need to do is implement the operators related to your specific backend and provide a YAML file with the operator info; the codegen will automatically generate all the code you need. What I would like to add is that codegen is under development and the design of the YAML has not yet been finalized; you can take PyTorch's YAML as a reference for now.
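Since the YAML schema is explicitly not finalized, the following is purely a hypothetical sketch of what one operator entry might look like, loosely modeled on the style of PyTorch's `native_functions.yaml`; every key and kernel name here is an assumption:

```yaml
# Hypothetical entry; the real schema is still under design.
- func: add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  dispatch:
    PrivateUse1: my_backend_add_tensor        # forward kernel to register
  autograd: my_backend_add_tensor_backward    # backward kernel to register
```

The idea is that a backend author writes only the kernels and this declarative file, and the registration/routing boilerplate is generated.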
I absolutely agree with you.
@albanD, sorry to bother you again.
That would be fantastic, because sometimes I just wonder what an operator is doing and under what conditions.
This is the first time I've heard of the YAML...
Yes, this would be nice, because currently the pytorch opencl/dlprimitives backend suffers from much more extensive memory use than it should.
Probably what I most need is some place where you can actually ask questions about the stuff that isn't easy to understand. So far @albanD has done an amazing job helping with the OpenCL backend. Currently I mostly ask questions on dev-discuss, but sometimes I feel there is a lot of stuff (I'm currently stuck on torch.load... without moving the model to CPU and back to the device). Thanks a lot!
Our ultimate goal is that a new backend can be integrated into PyTorch smoothly by implementing some backend-specific APIs and structures, without having to consider any PyTorch-internal details such as CPU fallback (@albanD has done this for dlprimitive), backend renaming, etc. Of course, it is not always convenient for every backend to integrate into PyTorch using this project; if the new backend doesn't involve many PyTorch features, doing it from scratch may also be a good choice. By the way, if you have any questions about new backend integration, you can also mention me or file an issue in the project; I am very glad to share with everyone.
Base implementation aiming towards pytorch/rfcs#64 Details of the implementation and next steps in https://github.com/pytorch/pytorch/blob/gh/albanD/3/head/test/cpp_extensions/open_registration_extension/README.md Pull Request resolved: #131814 Approved by: https://github.com/ezyang
Proposal to add an interoperability standard for third-party backends based on the PrivateUse1 mechanism into PyTorch.
Rendered version: https://github.com/FFFrog/rfcs/blob/rfc-for-privateuse1/RFC-0037-Interoperability-Standard-of-3rd-Backend-Integration-Mechanism.md