
RFC-0037-Interoperability-Standard-of-3rd-Backend-Integration-Mechanism #64


Open · wants to merge 1 commit into base: master

Conversation

@FFFrog commented Apr 2, 2024

Proposal to add an interoperability standard for third-party backends based on the PrivateUse1 mechanism to PyTorch.

Rendered version: https://github.com/FFFrog/rfcs/blob/rfc-for-privateuse1/RFC-0037-Interoperability-Standard-of-3rd-Backend-Integration-Mechanism.md



## **Proposed Implementation**
![Architecture Outline](./RFC-0030-assets/3rd_backend_architecture.png)
Contributor:

I love this diagram!
I think we want to make a couple of clarifications here:

  • For the middle case of XPU, showcase what still needs to be in core (allocator, stream, event, etc.).
  • Similarly for the out-of-tree case, I would be curious to showcase what is in core and what provides the extension point being used.
  • I would separate the PrivateUse1 device in core (which will look similar to XPU) and make it point both to the out-of-core projects and to the demo project in core (which is independent from the core integration).

Author:

  1. This is my mistake; I didn't notice that XPU has already upstreamed some modules into PyTorch. Please refer to the latest diagram below.
  2. The out-of-tree case is similar to that of IPU: only some necessary interfaces, logical branch code, and so on are integrated into the core. However, PrivateUse1 also has its own special features:
     • Completeness: support as many torch functions as possible.
     • Universality: provide as much flexibility as possible to third-party devices, so as to shield differences between devices.
  3. As you commented below, it's probably best to keep the demo project as a PyTorch-organized project rather than keeping it in the tree.

Comment on lines +48 to +50
* **Usage**: Reserved for CI/CD and official standard reference implementation, not intended for user use.
* **Compilation**: Separate compilation entry point, separate from other backends.
* **Independence**: The related code is stored separately, without coupling with other functional modules of PyTorch.
Contributor:

I feel like a lot of this can be achieved by making this demo backend another repo in the pytorch/* org that we take as a submodule for testing only and build. This way, it is:

  • a fully independent codebase from core, just like the real out-of-tree backends
  • a real end-to-end example of how to make a new backend
  • fully testable in our CI and can be pinned as needed

Author:

Good idea.
Compared to in-tree, this is indeed a better way, thank you.

@FFFrog (Author) commented May 9, 2024

In addition, compared with in-tree, out-of-tree will be more troublesome when merging PRs that would break the third-party device, because developers need to coordinate between the two repos by pinning commit IDs or other methods.

Contributor:

Yes, there are pros and cons both ways. I'm looking into what would be the easiest to set up and will get back to you shortly.

| Category | Change | Count | Related PRs |
| --- | --- | --- | --- |
| Refactoring | Move new trace utils from source to header, which leads to some symbols not being found. | 1 | [#114367](https://github.com/pytorch/pytorch/pull/114367/files) |
| Refactoring | Migrate to getCvar* functions for env variable checking, which leads to a function name not being found. | 1 | [#113797](https://github.com/pytorch/pytorch/pull/113797) |
| New Features | Add support for new data types; a data type assert fails. | 2 | [#107586](https://github.com/pytorch/pytorch/pull/107586), [#116594](https://github.com/pytorch/pytorch/pull/116594) |
| New Features | Add a function to materialize COW storages, which adds a pure virtual function Allocator::copy_data; derived classes didn't implement this pure virtual function. | 2 | [#117053](https://github.com/pytorch/pytorch/pull/117053), [#113396](https://github.com/pytorch/pytorch/pull/113396) |
Contributor:

Given that these things are going to keep happening, I would be curious what we expect the workflow to be when such a change is needed.

The options that come to mind here would be:

  • A specific channel where such changes are tracked, so that extension writers can subscribe and update their extensions accordingly. I guess this would mean that the extension is pinned to some version of PyTorch and they move forward in lockstep.
  • We define the extension points implemented out of core in such a way that we can preserve BC there even when changing core. It might be tricky to define such an API, and it would restrict what extension writers can do.

In both cases, I think we need to keep ensuring that it is OK for extensions not to implement all the features that can be extended, either by having a generic feature flag to say which features each extension supports, or by having a good default implementation that just works.
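
To make that last point concrete, here is a purely hypothetical sketch of the "feature flag plus generic default" idea; none of the names below are real PyTorch or extension APIs, they just illustrate the shape of the mechanism being discussed.

```python
# Hypothetical sketch only: an extension declares which optional features it
# implements, and core falls back to a generic default for the rest.
GENERIC_DEFAULTS = {
    "pin_memory": lambda tensor: tensor,  # a default that "just works": no-op pinning
}

class ExtensionFeatures:
    def __init__(self, implemented):
        # implemented: feature name -> callable provided by the extension
        self.implemented = dict(implemented)

    def supports(self, feature):
        return feature in self.implemented

    def call(self, feature, *args, **kwargs):
        impl = self.implemented.get(feature) or GENERIC_DEFAULTS.get(feature)
        if impl is None:
            raise NotImplementedError(f"extension does not support {feature!r}")
        return impl(*args, **kwargs)
```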

Author:

In my opinion, the first approach seems more appropriate and aligns closely with my description in this RFC.

We can provide a way for third-party devices to easily obtain the modifications that have affected them (including the original PR and the adaptation method). This way, when the corresponding PyTorch version switches from version A to version B, it will be easy to see which parts need to be re-adapted for third-party devices. The advantages become more obvious when there are many third-party devices.

This approach has minimal impact and is relatively easy for third-party devices to accept. It also does not impose significant restrictions, obstacles, or additional workload on community developers.

Regarding the mentioned approach, there are two scenarios:

  • In tree:
    If the related test cases for the demo fail as part of CI, the developer needs to modify the corresponding implementation of the demo synchronously. During the final code merge, modifications to the demo files will be checked (to distinguish them from normal modifications to demo files), and if there are such modifications, a special tag will be added to the PR.
  • Out of tree:
    The reviewer can decide whether to add a special tag based on the actual situation and mark the corresponding PR in the PyTorch repository.

@FFFrog (Author) commented May 9, 2024

Thanks a lot to @albanD for the review.

Based on your comments, I have prepared a new diagram to illustrate the architecture outline, please take a look at it.

3rd_backend_architecture

@FFFrog FFFrog requested a review from albanD May 11, 2024 01:53
@jgong5 commented May 14, 2024

> Based on your comments, I have prepared a new diagram to illustrate the architecture outline, please take a look at it.

Thanks for the updated arch diagram. Some comments on the XPU: I feel that listing XPU as "partial out-tree" alongside the other out-of-tree devices might look a bit confusing:

  1. XPU is expected to be fully functional within PyTorch core in the short term, just like CUDA. Even though some of the ATen ops are supported via an out-of-tree repo, that repo is added as a third-party repo of PyTorch core and built together with it. This is different from other third-party devices, which are maintained out of the tree.
  2. Maintaining an out-of-tree ATen repo is an interim approach to facilitate the XPU upstreaming. We would target them in-tree in the longer term, so eventually it would be "all in-tree".

@FFFrog (Author) commented May 17, 2024

> Thanks for the updated arch diagram. Some comments on the XPU: I feel that listing XPU as "partial out-tree" alongside the other out-of-tree devices might look a bit confusing: […]

Sorry for the confusion I introduced.
I have updated the arch diagram again; please have a look at it, thank you.
(image: updated architecture diagram)

@jgong5 commented May 20, 2024

> Sorry for the confusion I introduced.
> I have updated the arch diagram again; please have a look at it, thank you.

The updated diagram looks good to me. Thanks for taking the time!

@artyom-beilis commented

Can you please set up a mailing list or an update policy for other out-of-tree backend developers?

I'm the author of the OpenCL backend https://github.com/artyom-beilis/pytorch_dlprim, and I'm now catching up with the changes in 2.3 and 2.4. Since I only work on it part time, it would be better to have some notices and updates in advance.

@FFFrog (Author) commented Jul 12, 2024

> Can you please set up a mailing list or an update policy for other out-of-tree backend developers?
>
> I'm the author of the OpenCL backend https://github.com/artyom-beilis/pytorch_dlprim, and I'm now catching up with the changes in 2.3 and 2.4. Since I only work on it part time, it would be better to have some notices and updates in advance.

Sorry for the late reply; I've been on vacation recently.

My colleagues and I have started development work, and the initial version will support Runtime.

We will soon try to communicate with the community, with the goal of creating a project under the PyTorch organization, and then we will push our initial version to that project as soon as possible.

If you are interested, you are more than welcome to participate in this work.

@artyom-beilis commented

> We will soon try to communicate with the community, with the goal of creating a project under the PyTorch organization, and then we will push our initial version to that project as soon as possible.
>
> If you are interested, you are more than welcome to participate in this work.

Which project? In general, anything that would simplify maintaining an out-of-tree backend is welcome :-) I work on it in my spare time and sometimes I just can't keep up with all the changes.

@albanD (Contributor) commented Jul 17, 2024

Thanks for the update! It sounds great.

> Can you please set up a mailing list or an update policy for other out-of-tree backend developers?

I think a mailing list might be a bit challenging, but looking at the change history of the demo module should give an idea of what was added or changed recently.

> In general, anything that would simplify maintaining an out-of-tree backend is welcome :-) I work on it in my spare time and sometimes I just can't keep up with all the changes.

There is quite a bit of churn, and I expect there will still be for a few more months as we fully stabilize the new improved API (you might want to wait a bit to upgrade if you don't have much time).
I do expect that it will quickly pay off, though, as having a shared interface and extension point will allow us to improve both ease of use (because we designed this API for that exact purpose) and stability (because we have multiple users who will catch accidental regressions).

@FFFrog (Author) commented Jul 19, 2024

@artyom-beilis, if there are no other special circumstances, we will open source our project providing a PyTorch third-party backend reference demo based on the PrivateUse1 mechanism in the next week or so.

This is what we want to do with the project; I just drew a simple diagram, and more detailed information can be found in the code.

Reference Demo

@FFFrog (Author) commented Jul 19, 2024

@albanD, I drew a simple diagram of the overall project structure and what we want to do.

I want to explain something about the diagram.

  • The xpu in the picture is different from Intel's XPU; it is just a name for a generic device.
  • The API designs of many manufacturers draw more or less on CUDA, so using CUDA as the standard can maximize compatibility with various third-party devices.
  • If the device has its own dedicated API, then the modules with a blue background in the picture may need to be changed; if the device API is similar to CUDA, in theory only a few changes are needed.

@FFFrog (Author) commented Aug 6, 2024

Hi, @albanD @artyom-beilis , sorry for the late feedback.

At present, we have implemented the first version of the demo according to the community's latest third-party device integration mechanism. The main framework is complete, including basic general runtime capabilities, operator registration, autocast, etc.
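
For context, below is a minimal sketch of the Python-side wiring that a PrivateUse1-based backend like this demo typically performs; "npu" is just an example backend name and the device module is a stub, not the demo project's actual code.

```python
import types
import torch

# Rename the PrivateUse1 dispatch key so tensors report device="npu" (example name).
torch.utils.rename_privateuse1_backend("npu")

# Register a (stubbed) device module so torch.npu.* exists; a real backend exposes
# its runtime here (device count, streams, events, memory stats, ...).
npu = types.ModuleType("torch.npu")
npu.is_available = lambda: True
npu.device_count = lambda: 1
torch._register_device_module("npu", npu)

# Generate Tensor.npu() / Tensor.is_npu style convenience methods for the new name.
torch.utils.generate_methods_for_privateuse1_backend()
```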

Of course, there are still many general details that have not been completed, such as:

  • More general: except for the npu directory in the root directory (which is a collection of backend-specific functions and can be replaced with other backends), remove all npu-related naming, e.g. rename torch_npu to torch_backend, csrc/npu to csrc/backend, etc.
  • Codegen: redesigned to facilitate out-of-the-box use by other new backends.
  • Backend custom API: provide the ability to integrate backend-specific custom APIs.
  • Documentation: end-to-end documentation.
  • Test case sets: a general test case collection, etc.

We will work hard to complete the features and details above. Once everything is ready, we will try to integrate CUDA into PyTorch through this demo and provide a full-process integration tutorial.

If you have any questions or suggestions, please let me know. Thank you.

@artyom-beilis commented

Hi @FFFrog

Looking at NPU's readme

> This project provides a foundational layer that abstracts and standardizes the interaction between PyTorch and different types of GPU hardware.

Does it mean I'm expected to implement an out-of-tree backend in terms of NPU, i.e. as a sort of NPU extension? Should I use it, or can I continue working on my existing out-of-tree implementation?

Finally, there are something like three GPU out-of-tree implementations that I'm aware of: Intel's XPU, Apple Metal, and my dlprimitives/OpenCL.

@FFFrog (Author) commented Aug 7, 2024

> This project provides a foundational layer that abstracts and standardizes the interaction between PyTorch and different types of GPU hardware.
>
> Does it mean I'm expected to implement an out-of-tree backend in terms of NPU, i.e. as a sort of NPU extension? Should I use it, or can I continue working on my existing out-of-tree implementation?

First of all, thank you very much for your comments.

Integrating a new backend into PyTorch through the third-party device integration mechanism based on PrivateUse1 has the following challenges:

  • High development threshold: the third-party device integration mechanism is mainly implemented through various scattered and irregular hooks and registration mechanisms, and lacks a unified view.
  • Poor reusability: there are many common features among backends integrated through the third-party device integration mechanism, such as codegen (automatic operator registration, custom operator routing, forward/backward binding, etc.), the PyTorch common API, a common memory pool strategy, a common test case set, etc.

However, because various third-party backends can differ, and in order to ensure universality as much as possible, our current strategy treats CUDA as the standard, and all other backend APIs need to align themselves with the CUDA API (CUDA currently dominates the field of artificial intelligence, and the CUDA API is well known in the industry).

For this DEMO project, we plan to divide it into two phases:

  • Phase 1: this is the phase we are in now, which requires copy & modify; the project is mainly used as a reference implementation.
  • Phase 2: this is what we will do next: completing the device abstraction layer so that the third-party backend serves as a plug-in for this demo (a rough sketch of this idea follows below). It is worth adding that for the PyTorch general API, ideally, the third-party backend only needs to implement the backend API corresponding to the CUDA API; but for custom APIs, the backend currently needs to complete the end-to-end integration with PyTorch by itself.
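
As a rough illustration of that Phase 2 idea (purely hypothetical names, not the demo's actual design): the abstraction layer would program against a CUDA-shaped table of runtime entry points that each backend fills in.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: a CUDA-shaped runtime interface the abstraction layer calls
# into; a backend plugs in by supplying its own implementations of each entry point.
@dataclass
class DeviceRuntime:
    set_device: Callable[[int], None]          # cf. cudaSetDevice
    malloc: Callable[[int], int]               # cf. cudaMalloc, returns a device handle
    free: Callable[[int], None]                # cf. cudaFree
    memcpy_h2d: Callable[[int, bytes], None]   # cf. cudaMemcpy (host -> device)
    synchronize: Callable[[], None]            # cf. cudaDeviceSynchronize

# A backend whose API already mirrors CUDA only needs a thin mapping, e.g.:
# runtime = DeviceRuntime(set_device=mydevSetDevice, malloc=mydevMalloc, ...)
```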

Back to your question: if your time permits, I recommend that you wait until our device abstraction layer is completed before integrating with PyTorch.

@FFFrog (Author) commented Aug 7, 2024

@artyom-beilis

> Finally, there are something like three GPU out-of-tree implementations that I'm aware of: Intel's XPU, Apple Metal, and my dlprimitives/OpenCL.

As far as I know:

  • Intel XPU: currently in a semi-built-in state and will be fully in-tree later; its dispatch key is xpu (a dedicated key).
  • Apple Metal: this is an in-tree backend; its dispatch key is mps (a dedicated key).
  • dlprimitives/OpenCL: out-of-tree.
  • Intel HPU: currently out-of-tree; its dispatch key is hpu (a dedicated key), but the out-of-tree repo is not open source.
  • Meta MTIA: out-of-tree; its dispatch key is mtia (a dedicated key).
  • Huawei NPU: out-of-tree; its dispatch key is PrivateUse1 (the public key).

@artyom-beilis commented

> High development threshold: the third-party device integration mechanism is mainly implemented through various scattered and irregular hooks and registration mechanisms, and lacks a unified view.

From my point of view it was mostly about implementing operators, while the biggest problem was understanding which ones are required, which are basic, and what conditions apply; there is sometimes a lack of documentation (for example, what is the difference between _copy_from and _copy_from_and_resize?).

> Poor reusability: there are many common features among backends integrated through the third-party device integration mechanism, such as codegen (automatic operator registration, custom operator routing, forward/backward binding, etc.),

There are two things. One is operators that can be implemented in terms of others: it would be nice to have some kind of operator tree that shows which operators are native and which are implemented in terms of others.

Regarding codegen: do you mean automatic kernel code generation, or building operators in terms of other operators?

> the PyTorch common API, a common memory pool strategy,

The pool was probably the trickiest part to implement, and it is still sub-optimal in terms of memory allocation. There are also many interesting points to consider that aren't similar to CUDA: in OpenCL, for example, you can't use pointer arithmetic on the host as in CUDA; you need to add an offset or use sub-buffers, and some integrated GPU devices share memory with the CPU (Intel, AMD APUs, ARM).

> a common test case set, etc.

This would be awesome

> Back to your question: if your time permits, I recommend that you wait until our device abstraction layer is completed before integrating with PyTorch.

I understand, but that is problematic: if I wait for an API to be finalized, I'll wait forever ;-).

What is expected to change? The most critical part, and most of the work, is the implemented operators.

@FFFrog (Author) commented Aug 8, 2024

@artyom-beilis Hi, I will get back to you tomorrow; sorry for the delay, as I've been a bit busy lately.

@FFFrog (Author) commented Aug 9, 2024

> From my point of view it was mostly about implementing operators, while the biggest problem was understanding which ones are required, which are basic, and what conditions apply; there is sometimes a lack of documentation (for example, what is the difference between _copy_from and _copy_from_and_resize?).

Yes, there are many operators in PyTorch, and some of them are very similar to each other. We can roughly divide all operators into two parts:

  • Factory operators: all operators related to tensor creation, conversion, etc.
  • Computational operators: all operators that operate on tensors.

We will provide a reference implementation and documentation for all factory operators, but not for the latter, because the latter are easy to understand.

> There are two things. One is operators that can be implemented in terms of others: it would be nice to have some kind of operator tree that shows which operators are native and which are implemented in terms of others.

Yes, PyTorch has many operators that are composed of other, more basic operators. We could provide a tree list as you described above, but there are some issues with timeliness, because the relationships between operators get updated.
One approach that may give you a hand:

  • Compile PyTorch with DEBUG enabled
  • export TORCH_SHOW_DISPATCH_TRACE=1
  • python -c "import torch; torch.rand(3,3)"

Then you will get a dispatch trace of the operators like the one below, and will know which operators need to be implemented:

 [call] op=[aten::rand], key=[BackendSelect]
  [redispatch] op=[aten::rand], key=[CPU]
   [call] op=[aten::empty.memory_format], key=[BackendSelect]
    [redispatch] op=[aten::empty.memory_format], key=[CPU]
   [call] op=[aten::uniform_], key=[CPU]
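
As a concrete (and heavily simplified) illustration of the registration side, the Python torch.library API can attach a kernel to the PrivateUse1 key for one of the ops appearing in the trace above. Real backends typically do this in C++ via TORCH_LIBRARY_IMPL, and the kernel body below is only a placeholder, not the demo project's code.

```python
import torch

lib = torch.library.Library("aten", "IMPL")

def backend_empty_memory_format(size, dtype=None, layout=None, device=None,
                                pin_memory=None, memory_format=None):
    # A real backend would allocate storage on its own device here; this stub
    # only marks the extension point that the dispatcher routes to.
    raise NotImplementedError("allocate device memory for the custom backend here")

# Route aten::empty.memory_format to the stub whenever the PrivateUse1 device
# (named "privateuseone" by default) is involved.
lib.impl("empty.memory_format", backend_empty_memory_format, "PrivateUse1")
```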

@FFFrog (Author) commented Aug 9, 2024

> Regarding codegen: do you mean automatic kernel code generation, or building operators in terms of other operators?

As for codegen, it will generate a lot of code according to our requirements, including but not limited to forward operator registration, backward operator registration, custom operator routing files, etc.

All you need to do is implement the operators related to the specific backend and provide a YAML file with the operator information; the codegen will automatically generate all the code you need.

What I would like to add is that the codegen is still under development and the design of the YAML has not yet been determined; you can take PyTorch's YAML as a reference for now.

@FFFrog (Author) commented Aug 9, 2024

> The pool was probably the trickiest part to implement, and it is still sub-optimal in terms of memory allocation. There are also many interesting points to consider that aren't similar to CUDA: in OpenCL, for example, you can't use pointer arithmetic on the host as in CUDA; you need to add an offset or use sub-buffers, and some integrated GPU devices share memory with the CPU (Intel, AMD APUs, ARM).

I absolutely agree with you.
We plan to provide several basic memory pool strategies, such as a CUDA-style memory pool, an OpenCL-style memory pool, and so on. A new backend can pick the most appropriate strategy to implement its allocator according to its own characteristics; of course, it can also implement its own memory pool from scratch.
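
To illustrate the general strategy (a generic sketch only, not any PyTorch or demo API): a CUDA-style caching pool keeps freed blocks in per-size free lists and reuses them before asking the device for new memory.

```python
# Generic caching-pool sketch; raw_alloc/raw_free stand for the backend's own
# device allocation calls (e.g. clCreateBuffer/clReleaseMemObject in OpenCL).
class CachingMemoryPool:
    def __init__(self, raw_alloc, raw_free):
        self.raw_alloc = raw_alloc
        self.raw_free = raw_free
        self.free_blocks = {}            # block size -> list of cached handles

    def allocate(self, size):
        cached = self.free_blocks.get(size)
        if cached:
            return cached.pop()          # reuse a cached block, no device call
        return self.raw_alloc(size)      # cache miss: ask the device for memory

    def release(self, handle, size):
        # Keep the block for reuse instead of freeing it immediately.
        self.free_blocks.setdefault(size, []).append(handle)

    def empty_cache(self):
        # Return everything to the device (cf. torch.cuda.empty_cache()).
        for size, handles in self.free_blocks.items():
            for handle in handles:
                self.raw_free(handle)
        self.free_blocks.clear()
```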

@FFFrog (Author) commented Aug 12, 2024

@albanD, sorry to bother you again.
It seems that you are working on accelerator diversity in PyTorch, and perhaps this project could help you accelerate that goal. Could you scan the project quickly and give us some advice?

@artyom-beilis commented

> We will provide a reference implementation and documentation for all factory operators, but not for the latter, because the latter are easy to understand.

That would be fantastic, because sometimes I just wonder what an operator is doing and under what conditions.

> All you need to do is implement the operators related to the specific backend and provide a YAML file,

This is the first time I've heard of the YAML...

> We plan to provide several basic memory pool strategies, such as a CUDA-style memory pool, an OpenCL-style memory pool,

Yes, this would be nice, because the PyTorch OpenCL/dlprimitives backend currently uses much more memory than it should.

> perhaps this project could help you accelerate that goal. Could you scan the project quickly and give us some advice?

Probably what I need most is some place where you can actually ask questions about things that aren't easy to understand. So far @albanD has done an amazing job helping with the OpenCL backend. Currently I mostly ask questions on dev-discuss, but sometimes I feel there is a lot of stuff (I'm currently stuck on torch.load... without moving the model to CPU and back to the device).

Thanks a lot!

@FFFrog (Author) commented Aug 13, 2024

Our ultimate goal is that a new backend can be integrated into PyTorch smoothly by implementing some backend-specific APIs and structures, without having to consider any detailed PyTorch-internal concerns such as the CPU fallback (@albanD has done it for dlprimitives), backend renaming, etc.

Of course, it is not always convenient for every backend to integrate into PyTorch using this project; if the new backend doesn't involve many PyTorch features, doing it from scratch may also be a good choice.

By the way, if you have any questions about new backend integration, you can also mention me or file an issue in the project; I am very glad to share with everyone.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Aug 27, 2024
malfet pushed a commit to aditew01/pytorch that referenced this pull request Sep 13, 2024
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024