[PROTON-DEV] Add realtime metric #7334

ShawnZhong · 2025-06-26T23:51:36Z

In addition to the existing CYCLE metric, add a new metric, REALTIME, which is a global clock synchronized across SMs. It is guaranteed to count in nanoseconds, regardless of the GPU clock frequency.

This is useful for constructing a global timeline for:

- understanding the bubble/imbalance across SMs/CUs
- better visualization for Chrome Trace, e.g.,

triton/third_party/proton/common/include/TraceDataIO/TraceWriter.h

Lines 15 to 17 in 9faa7cd

    
    // TODO(fywkevin): this time gap to offset multiple kernels is not needed after
    // we have the global time.
    const uint64_t kKernelTimeGap = 10000000;

ShawnZhong · 2025-06-27T04:56:21Z

cc @Jokeren and @fywkevin for review. I am new to LLVM. Please let me know if anything can be improved :)

Jokeren · 2025-06-27T05:39:43Z

I don't understand how the i64 return value fits here as we should truncate it to a smaller bitwidth.

ShawnZhong · 2025-06-27T05:49:59Z

> I don't understand how the i64 return value fits here as we should truncate it to a smaller bitwidth.

Since 9367791, the default clock size has been changed to 64 bits with some of the higher bits used for other purposes.

triton/third_party/proton/dialect/lib/ProtonGPUToLLVM/Utility.cpp

Lines 96 to 122 in 9faa7cd

    
    // Constructing the tag and clock (8 byte)
    // =======================================
    // tag and upper clock (4 bytes):
    // 31: start or end (1 bit)
    // 30:23 scope id (8 bits)
    // 22:11 reserved (12 bits)
    // 10:0  64-bit clock bit 32:42 (11 bits)
    // =======================================
    // lower clock (4 bytes):
    // 31:0 64-bit clock bit 0:31
    // =======================================
    Value clock = op.getCounter();
    auto clkTy = mlir::cast<IntegerType>(clock.getType());
    uint32_t maskedScopeId = op.getScopeId() & 0xff;
    Value tag = op.getIsStart() ? b.i32_val(maskedScopeId << 23)
                                : b.i32_val(1 << 31 | maskedScopeId << 23);
    Value valsVec;
    if (clkTy.getWidth() == 64) {
      auto clkVecTy = vec_ty(i32_ty, 2);
      auto clkVec = b.bitcast(clock, clkVecTy);
      Value clkLower = b.extract_element(i32_ty, clkVec, b.i32_val(0));
      Value clkUpper = b.extract_element(i32_ty, clkVec, b.i32_val(1));
      Value tagClkUpper = b.or_(tag, b.and_(clkUpper, b.i32_val(0x7ff)));
      valsVec = packLLVector(loc, {tagClkUpper, clkLower}, rewriter);
    } else {
      valsVec = packLLVector(loc, {tag, clock}, rewriter);
    }

I suppose TargetInfo::realtime (as well as the existing TargetInfo::clock) can safely return a 64-bit integer, with the truncation done by later code.
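To make the layout in the excerpt concrete, here is a minimal host-side sketch of the same 8-byte record in plain C++. It is not code from the PR: `PackedRecord`, `pack`, `unpackClock`, `unpackScopeId`, and `unpackIsStart` are hypothetical names used only for illustration, and the bit positions are taken from the comment block quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side mirror of the 8-byte record described in the excerpt.
struct PackedRecord {
  uint32_t tagClkUpper; // 31: start(0)/end(1), 30:23: scope id, 10:0: clock bits 32..42
  uint32_t clkLower;    // clock bits 0..31
};

// Only the low 43 bits of the 64-bit counter survive packing.
constexpr uint64_t kClockMask = (uint64_t{1} << 43) - 1;

inline PackedRecord pack(bool isStart, uint32_t scopeId, uint64_t clock) {
  // Same tag construction as the excerpt: the end record sets bit 31.
  uint32_t tag = (isStart ? 0u : (1u << 31)) | ((scopeId & 0xffu) << 23);
  assert((tag & 0x7ffu) == 0); // tag must not overlap the upper clock bits
  uint32_t clkUpper = static_cast<uint32_t>(clock >> 32) & 0x7ffu;
  uint32_t clkLower = static_cast<uint32_t>(clock); // low 32 bits
  return PackedRecord{tag | clkUpper, clkLower};
}

inline uint64_t unpackClock(const PackedRecord &r) {
  return (static_cast<uint64_t>(r.tagClkUpper & 0x7ffu) << 32) | r.clkLower;
}

inline uint32_t unpackScopeId(const PackedRecord &r) {
  return (r.tagClkUpper >> 23) & 0xffu;
}

inline bool unpackIsStart(const PackedRecord &r) {
  return (r.tagClkUpper >> 31) == 0;
}
```

Under this layout, `unpackClock(pack(s, id, t))` equals `t & kClockMask`. If the counter ticks in nanoseconds, 43 bits wrap every 2^43 ns, about 8796 seconds (roughly 2.4 hours), so two stored timestamps are only comparable within that window.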

Jokeren · 2025-06-27T08:04:56Z

I still don't get how "cycle" is different from "time" in this case.
I’m not questioning the truncation of the i64 GPU timestamp itself. Our goal is to align CPU and GPU timestamps once we establish a consistent reference between them. If we truncate the GPU-side timestamps, it’s unclear how we can still construct a unified trace.

fywkevin · 2025-06-27T19:50:48Z

Since we use a global timer to align CTAs across SMs, we only need one global timestamp per CTA to capture its starting time. The right way to do this is to add this 64-bit global timestamp to the metadata section of each CTA (we already store smid, ctaid, buffer-size, ..., in the CTA metadata section). Briefly: capture the global timestamp at the beginning of kernel execution and store it at proton_gpu.finalize.
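One way to read this proposal is sketched below in plain C++. Everything here is an assumption made for illustration and is not Proton's actual data structure or trace writer: the `CtaMetadata` struct and its `globalStartNs` field, the `toGlobalNs` helper, the idea that a cycle counter value (`firstCycle`) is sampled at the same moment as the global timestamp, and the availability of a per-kernel `nsPerCycle` conversion factor.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical per-CTA metadata; globalStartNs stands in for the proposed field.
struct CtaMetadata {
  uint32_t smId;
  uint32_t ctaId;
  uint64_t globalStartNs; // assumed: global timestamp captured once per CTA
  uint64_t firstCycle;    // assumed: cycle counter sampled at the same moment
};

// Map CTA-local cycle stamps onto a shared nanosecond timeline.
// nsPerCycle is an assumed conversion factor, e.g. 1e9 / SM clock rate in Hz.
inline std::vector<uint64_t> toGlobalNs(const CtaMetadata &meta,
                                        const std::vector<uint64_t> &cycles,
                                        double nsPerCycle) {
  std::vector<uint64_t> out;
  out.reserve(cycles.size());
  for (uint64_t c : cycles) {
    assert(c >= meta.firstCycle); // events are recorded after the CTA starts
    double deltaNs = static_cast<double>(c - meta.firstCycle) * nsPerCycle;
    out.push_back(meta.globalStartNs +
                  static_cast<uint64_t>(std::llround(deltaNs)));
  }
  return out;
}
```

With this arrangement, per-event records can keep using the cheaper cycle counter, and only one 64-bit global timestamp per CTA has to be stored untruncated. The trade-off is that the result depends on how accurate `nsPerCycle` is over the lifetime of the CTA.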

ShawnZhong added 3 commits June 26, 2025 16:43

Add realtime metric

1fa8dcd

fix test

03e4c64

add test

74ddd39

ShawnZhong marked this pull request as ready for review June 27, 2025 04:56

ShawnZhong requested a review from ptillet as a code owner June 27, 2025 04:56

ShawnZhong changed the title ~~[WIP][PROTON-DEV] Add realtime metric~~ [PROTON-DEV] Add realtime metric Jun 27, 2025
