
Conversation

chrismoroney
Collaborator

Before submitting a pull request for a new Learning Path, please review Create a Learning Path

  • I have reviewed Create a Learning Path

Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information.

  • I have checked my contribution for confidential information

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.

@chrismoroney
Collaborator Author

Connected to a t4g.2xlarge instance with Arm architecture (8 vCPU, 32 GiB RAM), 50 GiB storage with encryption

Observations:
• Running with dtype="bfloat16" caused a runtime error:
[rank0]: RuntimeError: "rms_norm_impl" not implemented for 'BFloat16'
• Cause: vLLM's RMSNorm operation is missing a bf16 implementation. Additionally, a Graviton2 CPU lacks native bf16 support.
• Solution(s):

  1. Change dtype to float32 (quick, easy)
  2. Run the code on a GPU instance OR on a Graviton3 CPU (both support bf16)
  3. If running on CPU, force CPU-only execution as a safeguard
    • Suggestion: The quick and easy fix is changing dtype to float32. This isn't necessary on a GPU or a Graviton3 CPU, but float32 is more universal across both. If we want to keep bfloat16, we need to raise the hardware requirements for the LP. If we keep the specs as they currently exist (Arm-based server, 8 CPUs, 16 GB RAM, 50 GB disk), switching the datatype to float32 and making sure we run on CPU is the safest option; see the sketch after this list.
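
A minimal sketch of that workaround using vLLM's offline LLM API, assuming the model name from the LP; the prompt and sampling values here are only illustrative:

```python
# Sketch: load the model with dtype=float32 so RMSNorm runs on CPUs without bf16.
# Model name matches the Learning Path; prompt and sampling values are examples.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", dtype="float32")

sampling = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What is Arm Neoverse?"], sampling)

# Each result holds the prompt and its generated completion(s).
for out in outputs:
    print(out.outputs[0].text)
```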

@chrismoroney
Collaborator Author

python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-0.5B-Instruct --dtype float16

For this code segment, the dtype can also be changed from float16 to float32
• While float16 is not broken and still works, float32 is more universal: it is usually implemented across all ops and CPUs. Float16 (and bfloat16) are ideal with the much faster Tensor Cores on GPU instances, but FP32 is the safer choice on CPU instances, especially given the hardware requirements in this LP.

Revised: python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-0.5B-Instruct --dtype float32
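
Once the server is up, a quick sanity check against the OpenAI-compatible endpoint could look like the sketch below; port 8000 is vLLM's default, and the prompt is just an example:

```python
# Sketch: query the running vLLM OpenAI-compatible server.
# Assumes the default port 8000; adjust if the server was started with --port.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "Name one benefit of running LLM inference on Arm CPUs.",
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```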

@chrismoroney
Collaborator Author

After conversing with @jasonrandrews, the best approach is to keep the code as it is to show that Arm has bfloat16 on Graviton3 and Graviton4. Reverting the code changes and clarifying in the setup that the instructions were tested on Graviton3.
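
Since the LP keeps bfloat16, a reader can confirm their instance actually advertises bf16 before running the command. A hedged helper for Arm Linux (Graviton3 and Graviton4 list bf16 in the CPU feature flags; Graviton2 does not) might look like this; it is illustrative only and not part of the LP:

```python
# Sketch: verify the Arm CPU advertises bf16 before using dtype=bfloat16.
# On Arm Linux, /proc/cpuinfo lists per-core "Features"; Graviton3/4 include "bf16".
def cpu_has_bf16() -> bool:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("Features"):
                return "bf16" in line.split(":", 1)[1].split()
    return False

if __name__ == "__main__":
    if cpu_has_bf16():
        print("bf16 supported: dtype=bfloat16 should work on this CPU")
    else:
        print("no bf16: use dtype=float32 (or a Graviton3/4 or GPU instance)")
```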

@jasonrandrews jasonrandrews merged commit 80b9f8e into ArmDeveloperEcosystem:main Oct 21, 2025
1 check passed