
Support and performance optimization for Hopper (H200) architecture? #1

@JimyMa

Hi team,

Thanks for the great work on boltz-cp. I am currently evaluating the framework for long-sequence model training and inference on large-scale GPU clusters, and I would like to understand the current state of support and optimization for the Hopper (H200) architecture.

Could you share some insights on the following aspects?

  1. Out-of-the-box Compatibility
    Are there any known issues, specific CUDA toolkit/driver version requirements, or recommended baseline configurations when deploying boltz-cp on H200 nodes? (The environment probe I run on each node is sketched after this list.)

  2. Hopper-Specific Hardware Utilization
    Given the high memory bandwidth and compute requirements of Context Parallelism, does the current implementation actively leverage Hopper-specific features? In particular:
    - Are the underlying kernels optimized with TMA (Tensor Memory Accelerator) for asynchronous bulk data movement between global and shared memory?
    - Is there any integration with Transformer Engine to accelerate core tensor computations using FP8 or mixed precision? (A sketch of the kind of integration I mean follows this list.)

  3. Communication and Cluster Scaling
    Efficient communication is critical for the frequent activation transfers in CP. How does the framework perform over 4th-generation NVLink/NVSwitch topologies within an H200 node? Furthermore, for cross-node scaling, are there recommended parallelism strategies or known bottlenecks when operating over GPUDirect RDMA networks? (The microbenchmark I use for a first-order check is included below.)
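
For context on item 1, this is the probe I currently run on each node before launching anything; it uses only standard PyTorch calls:

```python
# Environment probe run on each node before deployment.
# Hopper (H200) reports compute capability 9.0; TMA and FP8
# code paths both require it.
import torch

assert torch.cuda.is_available(), "no CUDA device visible"
major, minor = torch.cuda.get_device_capability(0)
print(f"device:             {torch.cuda.get_device_name(0)}")
print(f"compute capability: {major}.{minor}")   # expect 9.0 on H200
print(f"torch CUDA build:   {torch.version.cuda}")
if (major, minor) < (9, 0):
    print("warning: not a Hopper-class GPU; Hopper-only features unavailable")
```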
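
For item 2, here is a minimal sketch of the kind of Transformer Engine path I am asking about, using only the public transformer_engine.pytorch API. The single linear layer and its shapes are arbitrary placeholders on my part, not a claim about boltz-cp's internals:

```python
# Illustrative only: FP8 execution of one GEMM via Transformer Engine.
# Whether boltz-cp routes its core tensor computations through a path
# like this is exactly my question.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)  # E4M3 fwd, E5M2 bwd
layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(32, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # GEMM executes in FP8 on Hopper tensor cores
print(y.dtype)    # output stays in the working precision (bf16)
```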
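
And for item 3, this is the rough all_gather microbenchmark I use as a first-order check of intra-node NVLink bandwidth; the message size and iteration count are arbitrary choices of mine:

```python
# Rough microbenchmark for the activation-transfer pattern in CP:
# times an all_gather across local ranks and reports approximate
# per-rank receive bandwidth.
# Launch with e.g.: torchrun --nproc_per_node=8 bench.py
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
world = dist.get_world_size()

buf = torch.empty(256 * 1024 * 1024, dtype=torch.bfloat16, device="cuda")  # 512 MiB
out = [torch.empty_like(buf) for _ in range(world)]

for _ in range(5):                      # warmup
    dist.all_gather(out, buf)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_gather(out, buf)
torch.cuda.synchronize()
dt = (time.perf_counter() - t0) / iters

gib = buf.numel() * buf.element_size() * (world - 1) / 2**30  # bytes received per rank
if dist.get_rank() == 0:
    print(f"all_gather: {gib / dt:.1f} GiB/s per rank (approximate)")
dist.destroy_process_group()
```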

Any deployment best practices, performance benchmarks, or planned roadmap items regarding H200 support would be greatly appreciated.

Thank you!
