
Trainer: model analysis on the AOT-compiled JAX program. #1036

Open
wants to merge 1 commit into base: main

Conversation

ds-hwang
Contributor

@ds-hwang ds-hwang commented Mar 5, 2025

This will help researchers estimate HBM usage and computation costs before launching a job, allowing them to determine whether a model is compute-bound or memory-bound.

Introduced aot_model_analysis(), which returns analysis results as a string, making it reusable (e.g., in Jupyter notebooks).
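
To make the mechanism concrete, here is a rough, hypothetical sketch of how such a report can be assembled from public JAX AOT APIs (`lower`/`compile`, `memory_analysis`, `cost_analysis`). This is not the code added by this PR, and the attribute and key names are backend-dependent guesses, so everything is guarded:

```python
import jax
import jax.numpy as jnp


def analysis_summary(fn, *example_args) -> str:
    """Best-effort memory/cost summary for an AOT-compiled `fn` (illustrative only)."""
    compiled = jax.jit(fn).lower(*example_args).compile()
    lines = []

    # memory_analysis() may be unavailable or return None (e.g. on CPU); the
    # attribute names below follow the TPU backend and may not exist elsewhere.
    mem = compiled.memory_analysis() if hasattr(compiled, "memory_analysis") else None
    if mem is not None:
        for name in ("argument_size_in_bytes", "output_size_in_bytes",
                     "temp_size_in_bytes", "generated_code_size_in_bytes"):
            value = getattr(mem, name, None)
            if value is not None:
                lines.append(f"{name}: {value / 1024**3:.2f} GB")

    # cost_analysis() returns a dict of backend-specific metrics (a list of
    # dicts in older JAX versions); only report keys that are present.
    cost = compiled.cost_analysis() if hasattr(compiled, "cost_analysis") else None
    if isinstance(cost, (list, tuple)) and cost:
        cost = cost[0]
    if isinstance(cost, dict):
        for key in ("flops", "bytes accessed", "transcendentals"):
            if key in cost:
                lines.append(f"{key}: {cost[key]:.3e}")
    return "\n".join(lines)


if __name__ == "__main__":
    x = jnp.ones((1024, 1024), jnp.bfloat16)
    print(analysis_summary(lambda a: a @ a, x))
```

The `aot_model_analysis()` introduced here presumably builds a more detailed report (including the hardware-utilization scores below) on top of the same compiled-program hooks.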

In addition, `run_aot_compilation` is changed to support this on CPU. The `run_aot_compilation` tool prints the fuji-1B-v3 model analysis as follows:

```
XLA_FLAGS=--xla_dump_to=/tmp/aot_xla_dump \
python -m axlearn.experiments.run_aot_compilation \
    --module=axlearn.experiments.text.gpt.c4_trainer \
    --config=fuji-1B-v3 \
    --topology=v4-1024 --cpu 1> /tmp/aot_stdout
```
```
======= Memory Analysis ==================================
Input memory: 4465.0 MB / 4.36 GB
Output memory: 4464.8 MB / 4.36 GB
Temp memory: 174977.1 MB / 170.88 GB
Code memory: 0.0 MB / 0.00 GB
Total HBM memory: 183906.9 MB / 179.60 GB
======= Cost Analysis ====================================
FLOPS: 71733280.0 M / 70052.03 G
The number of exp/log/sin/cos ops: 21364.8 M / 20.86 G
The total memory traffic: 1792723.2 MB / 1750.71 GB
  HBM access: 751479.1 MB / 733.87 GB
  L2 cache access: 328740.8 MB / 321.04 GB
  Register usage: 61266.7 MB / 59.83 GB
  Output data transferred: 677251.9 MB / 661.38 GB
Hardware utilization scores
  Tensor Cores / MatMul units: 647.0
  ALU (Arithmetic Logic Unit): 430.0
  Memory Load/Store Units: 144.0
  L1 Cache Operations: 92.0
  L2 Cache Operations: 60.0
  Special Function Units (exp/log/sin/cos): 41.0
  Integer Units (for indexing, loop counters): 16.0
  Branch Divergence (Control Flow Processing): 12.0
  Load Balancing / Dispatch: 10.0
  Texture Units (or Rarely Used Compute Units): 8.0
```
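
One rough way to act on the compute-bound vs. memory-bound question from these numbers is a roofline-style comparison (a back-of-the-envelope heuristic, not part of the tool's output; the peak FLOP/s and bandwidth below are placeholders to replace with your accelerator's datasheet values):

```python
# Roofline-style check using the figures printed above.
flops = 71733280.0e6            # "FLOPS: 71733280.0 M" from the report
hbm_bytes = 751479.1 * 1024**2  # "HBM access: 751479.1 MB" from the report
arithmetic_intensity = flops / hbm_bytes  # ~91 FLOPs per HBM byte here

peak_flops_per_sec = 275e12  # placeholder peak FLOP/s; use your chip's value
hbm_bandwidth = 1.2e12       # placeholder HBM bytes/s; use your chip's value
ridge_point = peak_flops_per_sec / hbm_bandwidth  # ~229 FLOPs/byte with these placeholders

print("compute-bound" if arithmetic_intensity > ridge_point else "memory-bound")
```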

@ds-hwang ds-hwang requested review from ruomingp, markblee and a team as code owners March 5, 2025 19:53
@ds-hwang
Contributor Author

ds-hwang commented Mar 5, 2025

@markblee Could you review it? From 1112

```python
if not hasattr(compiled, "memory_analysis"):
    return ""

to_mb_gb = lambda x: f"{x / (1024**2):.1f} MB / {x / (1024**3):.2f} GB"
```
Contributor


I think this could be confusing, since users might think the left number is the usage and the right number is the maximum available (e.g., they might read it as "x MB out of y GB used"). Can we eliminate the MB?

Contributor Author


But for some downstream models (e.g., speech models), GB is too coarse a unit. Let me make it dynamically decide between MB and GB.
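
For concreteness, a dynamic formatter could look roughly like this (the helper name and threshold are made up, not the final implementation):

```python
def _format_bytes(num_bytes: float) -> str:
    """Formats a byte count as GB when large, MB otherwise (threshold is arbitrary)."""
    gb = num_bytes / 1024**3
    if gb >= 1.0:
        return f"{gb:.2f} GB"
    return f"{num_bytes / 1024**2:.1f} MB"


assert _format_bytes(512 * 1024**2) == "512.0 MB"
assert _format_bytes(4 * 1024**3) == "4.00 GB"
```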

```
@@ -91,6 +101,14 @@ def _compile_and_dump_programs(
logging.info("Wrote serialized %s to %s", program_name, serialized_compiled_output_path)
```
Contributor

@apghml apghml Mar 5, 2025


This PR still doesn't deduplicate the memory-printing code in run_aot_compilation.py with the new function you have in trainer.py?
