# More ROCm support #3401
## Conversation
@jakki-amd / @smedegaard / @agunapal - can you please make an initial review? Thanks.
```diff
@@ -63,6 +63,10 @@ std::shared_ptr<torch::Device> BaseHandler::GetTorchDevice(
     return std::make_shared<torch::Device>(torch::kCPU);
   }

+#if defined(__HIPCC__) || (defined(__clang__) && defined(__HIP__)) || defined(__HIPCC_RTC__)
+  return std::make_shared<torch::Device>(torch::kHIP,
```
Why are you using HIP for the device here instead of CUDA? ROCm PyTorch masquerades as the CUDA device. Though a HIP device does exist as a dispatch key, no kernels are registered to it.
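The point above can be sketched in Python. This is a minimal illustration, not TorchServe code (`pick_device` is a hypothetical helper): on a ROCm build of PyTorch the GPU is still addressed through the CUDA device API, and `torch.version.hip` is how the two builds are told apart.

```python
# Sketch (hypothetical helper, not TorchServe code): on ROCm builds of
# PyTorch, kernels are registered against the CUDA dispatch key, so the
# "cuda" device string is the right choice on both CUDA and ROCm systems.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():
        # torch.version.hip is a version string on ROCm builds, None on CUDA builds.
        backend = "ROCm" if torch.version.hip is not None else "CUDA"
        print(f"GPU backend: {backend}")
        return torch.device("cuda")
    return torch.device("cpu")

print(pick_device())
```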
I will revert this change. Since `kHIP` is explicitly defined in PyTorch (for clearer semantics, I guess), I expected that automatic re-mapping would happen internally for kernel registration, lookup, etc. However, I don't really see this in PyTorch.
```diff
@@ -67,10 +67,10 @@ If you plan to develop with TorchServe and change some source code, you must ins
 Use the optional `--rocm` or `--cuda` flag with `install_dependencies.py` for installing accelerator specific dependencies.

 Possible values are
-- rocm: `rocm61`, `rocm60`
+- rocm: `rocm6.3`, `rocm6.2`, `rocm6.1`, `rocm6.0`
```
nit: I think it would be more consistent to follow the same naming convention as with CUDA flag naming, meaning using `rocm61` instead of `rocm6.1`, as CUDA flags are also given like `cu111`, not `cu11.1`.
I specifically checked the naming convention of both CUDA and ROCm and confirmed internally with some AMDers, then decided to use something like `rocm6.3` instead of `rocm63`.
**docs/github_actions.md** (Outdated)

```yaml
strategy:
  fail-fast: false
  matrix:
    cuda: ["rocm6.1", "rocm6.2"]
```
I don't see how defining the cuda label to be `rocm6.1` without changing the ci-gpu CI script would work. Some work was done regarding CI scripts in this branch here, but the branch is out-of-date.
Left a few minor comments, otherwise looks good!

@smedegaard / @agunapal - can you please review?

cc @mreso
LGTM overall, left some minor comments
```diff
 WORKDIR "serve"
 RUN python -m pip install -U pip setuptools \
     && python -m pip install --no-cache-dir -r requirements/developer.txt \
-    && python ts_scripts/install_from_src.py \
+    && python ts_scripts/install_from_src.py --environment=dev \
```
What's the motivation for this change?
@jakki-amd - can you please explain the change here?

Besides, I just found the default `"production"` may be wrong - it should be `"prod"`, shouldn't it? See `serve/ts_scripts/install_from_src.py`, line 29 in 62c4d6a:

```python
default="production",
```
Regarding motivation (I did this work a long time ago, so apologies if I don't remember all the details): the target of this image in the last section is to build a development image, so I think the Docker image should have all the dependencies installed that development requires. Therefore I added the environment flag `dev`.

We do get part of the development dependencies from line 296, `pip install --no-cache-dir -r requirements/developer.txt`, but `install_from_src.py` contains code that installs additional dependencies for development installs. If we ever added more development dependencies to `install_from_src.py` that are not in `requirements/developer.txt`, we would have to remember to add them to this Dockerfile as well, and I find it safer to just install everything that is relevant for development work.

Regarding `default="production"`, it certainly does seem that `dev` and `prod` are the valid environment flags.
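The subtlety with the default can be shown with a minimal argparse sketch (an illustration only, not the actual `install_from_src.py` parser): argparse validates `choices` for values passed on the command line, but it does not validate the `default`, so `default="production"` slips through silently even though it is not an accepted choice.

```python
import argparse

# Illustration only (not the real ts_scripts/install_from_src.py parser).
def build_parser(default: str) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # choices constrain command-line values, but NOT the default.
    parser.add_argument("--environment", choices=["dev", "prod"], default=default)
    return parser

# No --environment given: the out-of-choices default is accepted silently,
# so downstream checks for "prod" would never match.
args = build_parser("production").parse_args([])
print(args.environment)

# Explicitly passing a valid choice works as expected.
args = build_parser("prod").parse_args(["--environment=dev"])
print(args.environment)
```

Explicitly passing `--environment=production` would be rejected at parse time, which is what makes the bad default easy to miss.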
**docker/README.md** (Outdated)

````diff
@@ -136,6 +137,30 @@ Creates a docker image with `torchserve` and `torch-model-archiver` installed fr
 ./build_image.sh -bt dev -g -cv cu92
 ```

+- For creating GPU based image with rocm version 6.0:
````

Suggested change:

```diff
-- For creating GPU based image with rocm version 6.0:
+- For creating GPU based image with ROCm version 6.0:
```
**docker/README.md** (Outdated)

````diff
+./build_image.sh -bt dev -g -rv rocm6.0
+```
+
+- For creating GPU based image with rocm version 6.1:
````

Suggested change:

```diff
-- For creating GPU based image with rocm version 6.1:
+- For creating GPU based image with ROCm version 6.1:
```
**docker/README.md** (Outdated)

````diff
+./build_image.sh -bt dev -g -rv rocm6.1
+```
+
+- For creating GPU based image with rocm version 6.2:
````

Suggested change:

```diff
-- For creating GPU based image with rocm version 6.2:
+- For creating GPU based image with ROCm version 6.2:
```
**docker/README.md** (Outdated)

````diff
+./build_image.sh -bt dev -g -rv rocm6.2
+```
+
+- For creating GPU based image with rocm version 6.3:
````

Suggested change:

```diff
-- For creating GPU based image with rocm version 6.3:
+- For creating GPU based image with ROCm version 6.3:
```
**Goals**

**Git PRs**

**Already done**
- `nvidia-smi` → `amd-smi`/`rocm-smi`

**TODOs**

Must do:
- `choices=["rocm6.0", "rocm6.1", "rocm6.2"]` (add `rocm6.3`)

Nice to do:
- `nvidia-smi` → `amd-smi`/`rocm-smi`

**Exploration in the TorchServe `master` branch**

```shell
find . -type f,l | xargs grep --color=always -nri cuda
find . -type f,l | xargs grep --color=always -nriE '\Wnv'
find . -type f,l | xargs grep --color=always -nri '_nv'
```

**Parts**
- Requirement files
- Docker files
- Config files
- Build scripts
- Frontend
- Backend
- Documentation
- Examples
- CI
  - Regression tests
  - GitHub workflows
- Benchmarks

**Notes**

Code name examples:
- CUDA: `cu92`, `cu101`, `cu102`, `cu111`, `cu113`, `cu116`, `cu117`, `cu118`, `cu121`
- ROCm: `rocm5.9`, `rocm6.0`, `rocm6.1`, `rocm6.2`, `rocm6.3`
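As a hedged sketch of how these code names are typically consumed, the helper below maps a code name to a PyTorch wheel index URL. The URL pattern is an assumption based on PyTorch's public wheel indexes (e.g. `https://download.pytorch.org/whl/cu118`), and `wheel_index` is illustrative only; verify against `install_dependencies.py` before relying on it.

```python
# Sketch (assumed URL pattern, not TorchServe code): map an accelerator
# code name from the lists above to the matching PyTorch wheel index.
BASE = "https://download.pytorch.org/whl"

KNOWN = {
    "cuda": ["cu92", "cu101", "cu102", "cu111", "cu113", "cu116", "cu117", "cu118", "cu121"],
    "rocm": ["rocm5.9", "rocm6.0", "rocm6.1", "rocm6.2", "rocm6.3"],
}

def wheel_index(code_name: str) -> str:
    # Reject anything outside the documented code names.
    if not any(code_name in names for names in KNOWN.values()):
        raise ValueError(f"unknown accelerator code name: {code_name}")
    return f"{BASE}/{code_name}"

print(wheel_index("rocm6.2"))
```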
**Description**

Please read our CONTRIBUTING.md prior to creating your first pull request.

Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes #(issue)

**Type of change**

Please delete options that are not relevant.

**Feature/Issue validation/testing**

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced. Please also list any relevant details for your test configuration.

- Test A
  - Logs for Test A
- Test B
  - Logs for Test B

**Checklist:**