
Conversation


@Darth-Kronos commented Oct 17, 2025

What does this PR do?

Type of change: Bug Fix

Overview:
When converting weight tensors to INT8/FP8, the zero-point array's datatype was previously validated against ONNX datatype enums (onnx.TensorProto.FLOAT8E4M3FN or onnx.TensorProto.INT8). Because the zero point had already been converted to a NumPy array, its dtype could never equal those enum values, so the FP8 branch was never taken and weights were always scaled as INT8, even during FP8 quantization.

This PR fixes the issue by checking the data_type field of the onnx.TensorProto directly instead of inferring the type from the corresponding NumPy array.
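The shape of the bug and of the fix can be sketched as follows (a minimal illustration with assumed function names, not the actual qdq_utils.py code):

```python
import numpy as np
import onnx

# Buggy pattern: `zp` is a NumPy array, so zp.dtype is a numpy dtype
# such as np.int8. onnx.TensorProto.FLOAT8E4M3FN is just an integer
# enum value, so this comparison is always False; the FP8 branch can
# never be taken and the INT8 scaling path always runs.
def is_fp8_zero_point_buggy(zp: np.ndarray) -> bool:
    return zp.dtype == onnx.TensorProto.FLOAT8E4M3FN  # always False

# Fixed pattern: keep the zero point as an onnx.TensorProto and read
# its data_type field, which actually stores the ONNX datatype enum.
def is_fp8_zero_point_fixed(zp: onnx.TensorProto) -> bool:
    return zp.data_type == onnx.TensorProto.FLOAT8E4M3FN
```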

Usage

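A hedged sketch of exercising the FP8 path end to end; the quantize() entry point and its parameter names are assumptions based on the package layout, not something this PR defines:

```python
# Hypothetical usage sketch: the entry point and parameter names below
# are assumptions, not confirmed by this PR.
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="model.onnx",         # input model (assumed parameter name)
    quantize_mode="fp8",            # the path whose scaling this PR fixes
    output_path="model.quant.onnx", # where to write the quantized model
)
```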

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: No
  • Did you update Changelog?: No

Additional Information

Summary by CodeRabbit

  • Refactor
    • Enhanced quantization parameter handling in ONNX utilities to improve type safety and consistency across quantization workflows.
    • Strengthened FP8 tensor creation with explicit data type specification.
    • Improved robustness of quantized model processing through refined internal data structure handling.

@Darth-Kronos requested a review from a team as a code owner October 17, 2025 14:16
@Darth-Kronos requested a review from i-riyad October 17, 2025 14:16

copy-pr-bot (bot) commented Oct 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai (bot) commented Oct 17, 2025

Walkthrough

The PR refactors type handling for scale and zero-point values in quantization utilities, shifting from numpy arrays to ONNX TensorProto objects at the retrieval stage, with type conversions deferred to points of use.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| ONNX Tensor Type Refactoring<br>modelopt/onnx/quantization/qdq_utils.py | _get_scale_and_zp now returns ONNX TensorProto objects instead of numpy arrays; _convert_weight updated to accept ONNX TensorProto for scale and zero-point with internal conversion to arrays; _create_fp8_tensor explicitly sets data_type to Float8; call sites in qdq_to_dq and related functions updated to handle TensorProto objects; error checks adjusted to use tensor metadata (e.g., zp.data_type) instead of numpy dtype operations. |
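The pattern described above can be sketched roughly as follows (illustrative only; the helper names mirror the walkthrough, but the bodies are simplified assumptions, not the file's actual code):

```python
import numpy as np
import onnx
from onnx import numpy_helper

# Helpers now receive TensorProto objects; conversion to NumPy happens
# only at the point of use (simplified: real code also rounds/clips).
def _convert_weight_sketch(
    weight: np.ndarray, scale: onnx.TensorProto, zp: onnx.TensorProto
) -> np.ndarray:
    scale_arr = numpy_helper.to_array(scale)
    zp_arr = numpy_helper.to_array(zp)
    return weight / scale_arr + zp_arr

# FP8 tensor creation with an explicit data_type: NumPy has no native
# float8 dtype to infer from, so the ONNX enum is passed directly.
def _create_fp8_tensor_sketch(name: str, values: np.ndarray) -> onnx.TensorProto:
    return onnx.helper.make_tensor(
        name=name,
        data_type=onnx.TensorProto.FLOAT8E4M3FN,
        dims=values.shape,
        vals=values.astype(np.float32).flatten().tolist(),
    )
```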

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Rationale: Single file with consistent, homogeneous type refactoring pattern (numpy arrays → ONNX TensorProto). Changes follow a straightforward substitution logic: parameter types updated, conversions relocated to call sites, and tensor metadata accessed appropriately. Control flow unchanged. Review focuses on verifying type correctness and conversion safety.

Poem

🐰 Types transformed from arrays to tensors so bright,
Scale and zero-point dressed in ONNX delight,
Conversions deferred to their rightful place,
FP8 floats now wear their data_type face!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The PR title "Fix ONNX FP8 scaling" directly addresses the main objective of the changeset, which is to correct incorrect scaling of weight tensors when converting to INT8/FP8 by fixing datatype checks in ONNX tensor handling. The title is concise, avoids vague or generic terminology, and clearly conveys that this is a bug fix focused on FP8 scaling in ONNX code. While the title could be slightly more specific about the root cause (improper datatype checking), it nonetheless captures the essential nature of the fix. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which meets the required threshold of 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment


