# More ROCm support #3401
## Conversation
@jakki-amd / @smedegaard / @agunapal - can you please make an initial review? Thanks.
```diff
@@ -63,6 +63,10 @@ std::shared_ptr<torch::Device> BaseHandler::GetTorchDevice(
     return std::make_shared<torch::Device>(torch::kCPU);
   }

+#if defined(__HIPCC__) || (defined(__clang__) && defined(__HIP__)) || defined(__HIPCC_RTC__)
+  return std::make_shared<torch::Device>(torch::kHIP,
```
Why are you using HIP for the device here instead of CUDA? ROCm PyTorch masquerades as the CUDA device. Though a HIP device does exist as a dispatch key, no kernels are registered to it.
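The point above can be sketched in Python. This is a minimal illustration, not TorchServe code (`pick_device` is a hypothetical helper): on a ROCm build of PyTorch the GPU is still addressed through the CUDA device API, and `torch.version.hip` is how the two builds are told apart.

```python
# Sketch (hypothetical helper, not TorchServe code): on ROCm builds of
# PyTorch, kernels are registered against the CUDA dispatch key, so the
# "cuda" device string is the right choice on both CUDA and ROCm systems.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():
        # torch.version.hip is a version string on ROCm builds, None on CUDA builds.
        backend = "ROCm" if torch.version.hip is not None else "CUDA"
        print(f"GPU backend: {backend}")
        return torch.device("cuda")
    return torch.device("cpu")

print(pick_device())
```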
I will revert this change. Since `kHIP` is explicitly defined in PyTorch (for clearer semantics, I guess), I expected that automatic re-mapping would happen internally for kernel registration, lookup, etc. However, I don't really see this in PyTorch.
```diff
@@ -67,10 +67,10 @@ If you plan to develop with TorchServe and change some source code, you must ins
 Use the optional `--rocm` or `--cuda` flag with `install_dependencies.py` for installing accelerator specific dependencies.

 Possible values are
-- rocm: `rocm61`, `rocm60`
+- rocm: `rocm6.3`, `rocm6.2`, `rocm6.1`, `rocm6.0`
```
nit: I think it would be more consistent to follow the same naming convention as with CUDA flag naming, meaning using `rocm61` instead of `rocm6.1`, as CUDA flags are also given like `cu111`, not `cu11.1`.
I specifically checked the naming convention of both CUDA and ROCm and confirmed internally with some AMDers, then decided to use something like `rocm6.3` instead of `rocm63`.
**docs/github_actions.md** (Outdated)

```yaml
strategy:
  fail-fast: false
  matrix:
    cuda: ["rocm6.1", "rocm6.2"]
```
I don't see how defining the cuda label to be `rocm6.1` without changing the ci-gpu CI script would work. Some work was done regarding CI scripts in this branch here, but the branch is out-of-date.
Left a few minor comments, otherwise looks good!

@smedegaard / @agunapal - can you please review?

cc @mreso
LGTM overall, left some minor comments
```diff
 WORKDIR "serve"
 RUN python -m pip install -U pip setuptools \
     && python -m pip install --no-cache-dir -r requirements/developer.txt \
-    && python ts_scripts/install_from_src.py \
+    && python ts_scripts/install_from_src.py --environment=dev \
```
What's the motivation for this change?
@jakki-amd - can you please explain the change here?

Besides, I just found the default `"production"` may be wrong - it should be `"prod"`, shouldn't it? See `serve/ts_scripts/install_from_src.py`, line 29 in 62c4d6a:

```python
default="production",
```
Regarding motivation (I did this work a long time ago, so apologies if I don't remember all the details): the target of this image in the last section is to build a development image, so I think the Docker image should have all the dependencies installed that development requires. Therefore I added the environment flag `dev`.

We do get part of the development dependencies from line 296, `pip install --no-cache-dir -r requirements/developer.txt`, but `install_from_src.py` contains code that installs additional dependencies for development installs. If we ever added more development dependencies to `install_from_src.py` that are not in `requirements/developer.txt`, we would have to remember to add them to this Dockerfile as well, and I find it safer to just install everything that is relevant for development work.

Regarding `default="production"`, it certainly does seem that `dev` and `prod` are the valid environment flags.
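The subtlety with the default can be shown with a minimal argparse sketch (an illustration only, not the actual `install_from_src.py` parser): argparse validates `choices` for values passed on the command line, but it does not validate the `default`, so `default="production"` slips through silently even though it is not an accepted choice.

```python
import argparse

# Illustration only (not the real ts_scripts/install_from_src.py parser).
def build_parser(default: str) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # choices constrain command-line values, but NOT the default.
    parser.add_argument("--environment", choices=["dev", "prod"], default=default)
    return parser

# No --environment given: the out-of-choices default is accepted silently,
# so downstream checks for "prod" would never match.
args = build_parser("production").parse_args([])
print(args.environment)

# Explicitly passing a valid choice works as expected.
args = build_parser("prod").parse_args(["--environment=dev"])
print(args.environment)
```

Explicitly passing `--environment=production` would be rejected at parse time, which is what makes the bad default easy to miss.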
**docker/README.md** (Outdated)

````diff
@@ -136,6 +137,30 @@ Creates a docker image with `torchserve` and `torch-model-archiver` installed fr
 ./build_image.sh -bt dev -g -cv cu92
 ```

+- For creating GPU based image with rocm version 6.0:
````

Suggested change:

```diff
-- For creating GPU based image with rocm version 6.0:
+- For creating GPU based image with ROCm version 6.0:
```
**docker/README.md** (Outdated)

````diff
+./build_image.sh -bt dev -g -rv rocm6.0
+```
+
+- For creating GPU based image with rocm version 6.1:
````

Suggested change:

```diff
-- For creating GPU based image with rocm version 6.1:
+- For creating GPU based image with ROCm version 6.1:
```
**docker/README.md** (Outdated)

````diff
+./build_image.sh -bt dev -g -rv rocm6.1
+```
+
+- For creating GPU based image with rocm version 6.2:
````

Suggested change:

```diff
-- For creating GPU based image with rocm version 6.2:
+- For creating GPU based image with ROCm version 6.2:
```
**docker/README.md** (Outdated)

````diff
+./build_image.sh -bt dev -g -rv rocm6.2
+```
+
+- For creating GPU based image with rocm version 6.3:
````

Suggested change:

```diff
-- For creating GPU based image with rocm version 6.3:
+- For creating GPU based image with ROCm version 6.3:
```
**Goals**

**Git PRs**

**Already done**
- `nvidia-smi` → `amd-smi`/`rocm-smi`

**TODOs**

Must do:
- `choices=["rocm6.0", "rocm6.1", "rocm6.2"]` (add `rocm6.3`)

Nice to do:
- `nvidia-smi` → `amd-smi`/`rocm-smi`

**Exploration in the TorchServe `master` branch**

```shell
find . -type f,l | xargs grep --color=always -nri cuda
find . -type f,l | xargs grep --color=always -nriE '\Wnv'
find . -type f,l | xargs grep --color=always -nri '_nv'
```

**Parts**
- Requirement files
- Docker files
- Config files
- Build scripts
- Frontend
- Backend
- Documentation
- Examples
- CI
  - Regression tests
  - GitHub workflows
- Benchmarks

**Notes**

Code name examples:
- CUDA: `cu92`, `cu101`, `cu102`, `cu111`, `cu113`, `cu116`, `cu117`, `cu118`, `cu121`
- ROCm: `rocm5.9`, `rocm6.0`, `rocm6.1`, `rocm6.2`, `rocm6.3`
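As a hedged sketch of how these code names are typically consumed, the helper below maps a code name to a PyTorch wheel index URL. The URL pattern is an assumption based on PyTorch's public wheel indexes (e.g. `https://download.pytorch.org/whl/cu118`), and `wheel_index` is illustrative only; verify against `install_dependencies.py` before relying on it.

```python
# Sketch (assumed URL pattern, not TorchServe code): map an accelerator
# code name from the lists above to the matching PyTorch wheel index.
BASE = "https://download.pytorch.org/whl"

KNOWN = {
    "cuda": ["cu92", "cu101", "cu102", "cu111", "cu113", "cu116", "cu117", "cu118", "cu121"],
    "rocm": ["rocm5.9", "rocm6.0", "rocm6.1", "rocm6.2", "rocm6.3"],
}

def wheel_index(code_name: str) -> str:
    # Reject anything outside the documented code names.
    if not any(code_name in names for names in KNOWN.values()):
        raise ValueError(f"unknown accelerator code name: {code_name}")
    return f"{BASE}/{code_name}"

print(wheel_index("rocm6.2"))
```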
**Description**

Please read our CONTRIBUTING.md prior to creating your first pull request.

Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes #(issue)

**Type of change**

Please delete options that are not relevant.

**Feature/Issue validation/testing**

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced. Please also list any relevant details for your test configuration.

- Test A
  - Logs for Test A
- Test B
  - Logs for Test B

**Checklist:**