
Conversation

chrismoroney
Collaborator

Before submitting a pull request for a new Learning Path, please review Create a Learning Path

  • I have reviewed Create a Learning Path

Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information.

  • I have checked my contribution for confidential information

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.

@chrismoroney
Collaborator Author

Connected to a t4g.2xlarge instance with Arm architecture (8 vCPU, 32 GiB RAM), 50 GiB storage with encryption

Observations:
• Running with dtype="bfloat16" caused a runtime error:
[rank0]: RuntimeError: "rms_norm_impl" not implemented for 'BFloat16'
• Cause: vLLM's RMSNorm operation is missing a bf16 implementation. Additionally, a Graviton2 CPU lacks native bf16 support.
• Solution(s):

  1. Change dtype to float32 (quick, easy)
  2. Run the code on a GPU instance OR on a Graviton3 CPU (both support bf16)
  3. If running on CPU, force CPU-only execution as a safeguard
    • Suggestion: The quick and easy fix is changing dtype to float32. This isn't necessary on a GPU or a Graviton3 CPU, but float32 is more universal across both. If we want to keep bfloat16, we need to raise the hardware requirements for the LP. If we keep the specs as they currently exist (Arm-based server, 8 CPUs, 16 GB RAM, 50 GB disk), switching the datatype to float32 and making sure we run on CPU is the safest option; see the sketch after this list.
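
A minimal sketch of that workaround using vLLM's offline LLM API, assuming the model name from the LP; the prompt and sampling values here are only illustrative:

```python
# Sketch: load the model with dtype=float32 so RMSNorm runs on CPUs without bf16.
# Model name matches the Learning Path; prompt and sampling values are examples.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", dtype="float32")

sampling = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What is Arm Neoverse?"], sampling)

# Each result holds the prompt and its generated completion(s).
for out in outputs:
    print(out.outputs[0].text)
```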

@chrismoroney
Collaborator Author

python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-0.5B-Instruct --dtype float16

For this code segment, the dtype can also be changed from float16 to float32
• While float16 is not broken and still works, float32 is more universal: it is usually implemented across all ops and CPUs. Float16 (and bfloat16) are ideal with the much faster Tensor Cores on GPU instances, but FP32 is the safer choice on CPU instances, especially given the hardware requirements in this LP.

Revised: python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-0.5B-Instruct --dtype float32
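
Once the server is up, a quick sanity check against the OpenAI-compatible endpoint could look like the sketch below; port 8000 is vLLM's default, and the prompt is just an example:

```python
# Sketch: query the running vLLM OpenAI-compatible server.
# Assumes the default port 8000; adjust if the server was started with --port.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "Name one benefit of running LLM inference on Arm CPUs.",
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```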

@chrismoroney
Collaborator Author

After conversing with @jasonrandrews, the best approach is to keep the code as it is to show that Arm has bfloat16 on Graviton3 and Graviton4. Reverting the code changes and clarifying in the setup that the instructions were tested on Graviton3.
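
Since the LP keeps bfloat16, a reader can confirm their instance actually advertises bf16 before running the command. A hedged helper for Arm Linux (Graviton3 and Graviton4 list bf16 in the CPU feature flags; Graviton2 does not) might look like this; it is illustrative only and not part of the LP:

```python
# Sketch: verify the Arm CPU advertises bf16 before using dtype=bfloat16.
# On Arm Linux, /proc/cpuinfo lists per-core "Features"; Graviton3/4 include "bf16".
def cpu_has_bf16() -> bool:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("Features"):
                return "bf16" in line.split(":", 1)[1].split()
    return False

if __name__ == "__main__":
    if cpu_has_bf16():
        print("bf16 supported: dtype=bfloat16 should work on this CPU")
    else:
        print("no bf16: use dtype=float32 (or a Graviton3/4 or GPU instance)")
```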

@jasonrandrews jasonrandrews merged commit 80b9f8e into ArmDeveloperEcosystem:main Oct 21, 2025
1 check passed