[PROTON-DEV] Add realtime metric #7334

Open
wants to merge 3 commits into
base: proton-dev
from

ShawnZhong
Contributor

@ShawnZhong ShawnZhong commented Jun 26, 2025

In addition to the existing CYCLE metric, add a new metric, REALTIME, which is a global clock synchronized across SMs. It is guaranteed to count in nanoseconds, regardless of the GPU clock frequency.

This is useful for constructing a global timeline for

@ShawnZhong ShawnZhong marked this pull request as ready for review June 27, 2025 04:56
@ShawnZhong ShawnZhong requested a review from ptillet as a code owner June 27, 2025 04:56
@ShawnZhong
Contributor Author

cc @Jokeren and @fywkevin for review. I am new to LLVM. Please let me know if anything can be improved :)

@ShawnZhong ShawnZhong changed the title [WIP][PROTON-DEV] Add realtime metric [PROTON-DEV] Add realtime metric Jun 27, 2025
@Jokeren
Contributor

Jokeren commented Jun 27, 2025

I don't understand how the i64 return value fits here as we should truncate it to a smaller bitwidth.

@ShawnZhong
Contributor Author

> I don't understand how the i64 return value fits here as we should truncate it to a smaller bitwidth.

Since commit 9367791, the default clock size has been 64 bits, with some of the upper bits used for other purposes.

// Constructing the tag and clock (8 byte)
// =======================================
// tag and upper clock (4 bytes):
// 31: start or end (1 bit)
// 30:23 scope id (8 bits)
// 22:11 reserved (12 bits)
// 10:0 64-bit clock bit 32:42 (11 bits)
// =======================================
// lower clock (4 bytes):
// 31:0 64-bit clock bit 0:31
// =======================================
Value clock = op.getCounter();
auto clkTy = mlir::cast<IntegerType>(clock.getType());
uint32_t maskedScopeId = op.getScopeId() & 0xff;
Value tag = op.getIsStart() ? b.i32_val(maskedScopeId << 23)
                            : b.i32_val(1 << 31 | maskedScopeId << 23);
Value valsVec;
if (clkTy.getWidth() == 64) {
  auto clkVecTy = vec_ty(i32_ty, 2);
  auto clkVec = b.bitcast(clock, clkVecTy);
  Value clkLower = b.extract_element(i32_ty, clkVec, b.i32_val(0));
  Value clkUpper = b.extract_element(i32_ty, clkVec, b.i32_val(1));
  Value tagClkUpper = b.or_(tag, b.and_(clkUpper, b.i32_val(0x7ff)));
  valsVec = packLLVector(loc, {tagClkUpper, clkLower}, rewriter);
} else {
  valsVec = packLLVector(loc, {tag, clock}, rewriter);
}

I suppose TargetInfo::realtime (as well as the existing TargetInfo::clock) can safely return a 64-bit integer, with the truncation done by the later code shown above.
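
To make the layout concrete, here is a minimal host-side sketch of the same packing in plain C++, with a matching decoder. It is only an illustration of the bit layout described in the comment block: `Record`, `Decoded`, `packRecord`, `unpackRecord`, and `exampleRoundTrip` are hypothetical names, not part of Proton, and the real lowering emits LLVM IR rather than running this code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the 8-byte record: two 32-bit words.
struct Record {
  uint32_t tagClkUpper; // 31: end flag, 30:23: scope id, 10:0: clock bits 32..42
  uint32_t clkLower;    // clock bits 0..31
};

// Hypothetical decoded view of a record.
struct Decoded {
  bool isStart;
  uint32_t scopeId;
  uint64_t clock43; // only the low 43 bits of the clock survive
};

constexpr uint32_t kUpperClockMask = 0x7ff; // 11 bits kept from the upper word

Record packRecord(bool isStart, uint32_t scopeId, uint64_t clock) {
  uint32_t tag = (scopeId & 0xffu) << 23;
  if (!isStart)
    tag |= 1u << 31; // bit 31 is set for "end" records
  uint32_t upper = static_cast<uint32_t>(clock >> 32) & kUpperClockMask;
  uint32_t lower = static_cast<uint32_t>(clock);
  return {tag | upper, lower};
}

Decoded unpackRecord(const Record &r) {
  Decoded d;
  d.isStart = (r.tagClkUpper >> 31) == 0;
  d.scopeId = (r.tagClkUpper >> 23) & 0xffu;
  d.clock43 =
      (static_cast<uint64_t>(r.tagClkUpper & kUpperClockMask) << 32) |
      r.clkLower;
  return d;
}

// Usage: clock bits above bit 42 are dropped by the packing.
void exampleRoundTrip() {
  uint64_t clock = (uint64_t{1} << 50) | 0x12345678u;
  Decoded d = unpackRecord(packRecord(true, 7, clock));
  assert(d.isStart && d.scopeId == 7);
  assert(d.clock43 == 0x12345678u); // bit 50 is gone
}
```

The decoder recovers 43 bits of the counter (11 from the upper word, 32 from the lower word), which is the truncation referred to above.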

@Jokeren
Contributor

Jokeren commented Jun 27, 2025

I still don't get how "cycle" is different from "time" in this case.
I’m not questioning the truncation of the i64 GPU timestamp itself. Our goal is to align CPU and GPU timestamps once we establish a consistent reference between them. If we truncate the GPU-side timestamps, it’s unclear how we can still construct a unified trace.
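
One possible way to reason about this concern, sketched under assumptions that this PR does not confirm: if the stored timestamps keep 43 bits of a nanosecond counter, they wrap every 2^43 ns (roughly 2.4 hours), so a truncated value can be widened back to 64 bits given a full 64-bit reference taken no later than the event and less than 2^43 ns before it. `unwrapTimestamp` and `wrapAroundExample` are hypothetical helpers, not existing Proton code.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kClockBits = 43; // bits kept by the record layout
constexpr uint64_t kClockMask = (uint64_t{1} << kClockBits) - 1;

// Rebuild a full 64-bit timestamp from a truncated one.
// Assumption: `reference` is a full 64-bit reading of the same clock, taken
// at or before the event and less than 2^43 ticks before it.
uint64_t unwrapTimestamp(uint64_t reference, uint64_t truncated) {
  // Ticks elapsed since the reference, computed modulo 2^43.
  uint64_t delta = (truncated - reference) & kClockMask;
  return reference + delta;
}

// Usage: the reference sits just below a wrap boundary, the event just above.
void wrapAroundExample() {
  uint64_t reference = (uint64_t{5} << kClockBits) + kClockMask - 10;
  uint64_t event = reference + 100;
  assert(unwrapTimestamp(reference, event & kClockMask) == event);
}
```

Under that assumption, the truncated GPU-side values stay usable for a unified trace as long as one untruncated reference is recorded somewhere within each wrap period.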

@fywkevin
Contributor

Since we use a global timer to align CTAs across SMs, we only need one global timestamp per CTA to capture its starting time. The right way to do this is to add this 64-bit global timestamp to the metadata section of each CTA (we store smid, ctaid, buffer-size, ..., in the CTA metadata section). In short, capture the global timestamp at the beginning of kernel execution and store it at proton_gpu.finalize.
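
A rough sketch of how a consumer could use such a per-CTA timestamp, assuming a metadata record along these lines. The struct and its field names are hypothetical (the actual metadata layout is not shown in this thread), and `ctaOffsetsNs` is an illustrative helper only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-CTA metadata; names and field order are assumptions.
struct CtaMetadata {
  uint32_t smId;
  uint32_t ctaId;
  uint32_t bufferSize;
  uint64_t globalStartNs; // assumed: 64-bit global timestamp at kernel start
};

// Offset of each CTA on a shared timeline, relative to the earliest CTA.
std::vector<uint64_t> ctaOffsetsNs(const std::vector<CtaMetadata> &ctas) {
  std::vector<uint64_t> offsets;
  if (ctas.empty())
    return offsets;
  uint64_t earliest =
      std::min_element(ctas.begin(), ctas.end(),
                       [](const CtaMetadata &a, const CtaMetadata &b) {
                         return a.globalStartNs < b.globalStartNs;
                       })
          ->globalStartNs;
  offsets.reserve(ctas.size());
  for (const CtaMetadata &cta : ctas)
    offsets.push_back(cta.globalStartNs - earliest);
  return offsets;
}

// Usage: two CTAs whose start times differ by 250 ns.
void alignExample() {
  std::vector<CtaMetadata> ctas = {{0, 0, 1024, 5000}, {1, 1, 1024, 5250}};
  std::vector<uint64_t> offsets = ctaOffsetsNs(ctas);
  assert(offsets[0] == 0 && offsets[1] == 250);
}
```

With one 64-bit start time per CTA, the per-record counters can stay relative to their own CTA, and only the offsets computed here are needed to place CTAs from different SMs on one timeline.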


3 participants