Description
Environment
- Hardware: Aurora
- Intel Compute Runtime Version: 24.35.30872.22
Context
I'm developing a profiler for SYCL offload programs. My approach involves serializing kernel launches using zeEventHostSynchronize to ensure only one kernel is offloaded to the Intel GPU device at a time. For each kernel, I use a profiling thread to read stall sampling data using zetMetricStreamerReadData.
Current Implementation
Currently, after each kernel execution, I collect and process the data. To ensure non-overlapping stall samples between kernels, I've implemented a manual buffer flushing function, zeroFlushStreamerBuffer(streamer, desc). This function closes the current streamer and opens a new one.
void zeroFlushStreamerBuffer(zet_metric_streamer_handle_t& streamer, ZeDeviceDescriptor* desc)
{
  ze_result_t status = ZE_RESULT_SUCCESS;

  // Close the old streamer
  status = zetMetricStreamerClose(streamer);
  level0_check_result(status, __LINE__);

  // Open a new streamer
  uint32_t interval = 500000; // ns
  zet_metric_streamer_desc_t streamer_desc = {ZET_STRUCTURE_TYPE_METRIC_STREAMER_DESC, nullptr, max_metric_samples, interval};
  status = zetMetricStreamerOpen(desc->context_, desc->device_, desc->metric_group_, &streamer_desc, nullptr, &streamer);
  if (status != ZE_RESULT_SUCCESS) {
    std::cerr << "[ERROR] Failed to open metric streamer (" << status << "). The sampling interval might be too small." << std::endl;
    streamer = nullptr;
    return;
  }

  // Keep max_metric_samples in sync if the value in the descriptor was raised on open
  if (streamer_desc.notifyEveryNReports > max_metric_samples) {
    max_metric_samples = streamer_desc.notifyEveryNReports;
  }
}
Current Implementation Details
To provide more context, here's the main profiling loop where zeroFlushStreamerBuffer is used:
void
ZeMetricProfiler::RunProfilingLoop
(
  ZeDeviceDescriptor* desc,
  zet_metric_streamer_handle_t& streamer
)
{
  std::vector<uint8_t> raw_metrics(MAX_METRIC_BUFFER + 512);
  desc->profiling_state_.store(PROFILER_ENABLED, std::memory_order_release);
  ze_result_t status;

  while (desc->profiling_state_.load(std::memory_order_acquire) != PROFILER_DISABLED) {
    // Wait for the kernel to start running
    while (true) {
      status = zeEventHostSynchronize(desc->serial_kernel_start_, 50000000);
      if (status == ZE_RESULT_SUCCESS) {
        break;
      }
      // Handle case where kernel execution is extremely short:
      // In such cases, the kernel might finish before zeEventHostSynchronize can detect the start event.
      // Without this check, a deadlock could occur:
      // - The Profiling thread would keep waiting for the start event (which has already been reset).
      // - The App thread would be waiting for the Profiling thread to complete data processing.
      // kernel_started_ allows Profiling thread to proceed, avoiding deadlock.
      if (desc->kernel_started_.load(std::memory_order_acquire)) {
        break;
      }
      if (desc->profiling_state_.load(std::memory_order_acquire) == PROFILER_DISABLED) {
        return;
      }
    }

    // Kernel is running, enter sampling loop
    while (true) {
      // Update correlation ID
      gpu_correlation_channel_receive(1, UpdateCorrelationID, desc);
      // Wait for the next interval
      status = zeEventHostSynchronize(desc->serial_kernel_end_, 5000);
      if (status == ZE_RESULT_SUCCESS) {
        break;
      }
      CollectAndProcessMetrics(desc, streamer, raw_metrics);
    }

    // Kernel has finished, perform final sampling and cleanup
    CollectAndProcessMetrics(desc, streamer, raw_metrics);
    // FIXME(Yuning): may need a better way to flush the streamer buffer without repeatedly closing and reopening the streamer
    zeroFlushStreamerBuffer(streamer, desc);
    desc->running_kernel_ = nullptr;
    desc->kernel_started_.store(false, std::memory_order_release);

    // Notify the app thread that data processing is complete
    status = zeEventHostSignal(desc->serial_data_ready_);
    level0_check_result(status, __LINE__);
  }
}
This code demonstrates how we currently handle metric collection for each kernel execution, including the use of zeroFlushStreamerBuffer to attempt non-overlapping data collection between kernels.
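To make the start-of-kernel handshake easier to discuss in isolation, here is a host-only sketch of the same pattern: a timed wait on the start event, with a fallback to a sticky flag and a shutdown check. It does not call Level Zero at all. StartGate, TimedWait, WaitForKernelStart, and StartWait are hypothetical names introduced for illustration, and the atomics only stand in for serial_kernel_start_, kernel_started_, and profiling_state_.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

enum class StartWait { kStarted, kStartMissed, kDisabled };

// Host-only stand-ins for the state used in RunProfilingLoop (all hypothetical).
struct StartGate {
  std::atomic<bool> start_event{false};     // models serial_kernel_start_
  std::atomic<bool> kernel_started{false};  // models kernel_started_
  std::atomic<bool> disabled{false};        // models profiling_state_ == PROFILER_DISABLED
};

// Polls `flag` until it is set or `timeout` elapses; returns whether it was set.
inline bool TimedWait(const std::atomic<bool>& flag,
                      std::chrono::microseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!flag.load(std::memory_order_acquire)) {
    if (std::chrono::steady_clock::now() >= deadline) return false;
    std::this_thread::sleep_for(std::chrono::microseconds(50));
  }
  return true;
}

// Waits for a kernel start. After every timeout it checks the sticky flag
// (the start event may already have been signaled and reset) and then the
// shutdown flag, so the caller can never block forever on a missed event.
inline StartWait WaitForKernelStart(const StartGate& gate,
                                    std::chrono::microseconds timeout) {
  assert(timeout.count() > 0);
  for (;;) {
    if (TimedWait(gate.start_event, timeout)) return StartWait::kStarted;
    if (gate.kernel_started.load(std::memory_order_acquire)) {
      return StartWait::kStartMissed;
    }
    if (gate.disabled.load(std::memory_order_acquire)) {
      return StartWait::kDisabled;
    }
  }
}
```

In the real loop, both kStarted and kStartMissed would lead into the sampling loop, while kDisabled corresponds to the early return.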
Questions
- Data Overlap: When collecting data for a kernel after its execution, is there a possibility that the data from zetMetricStreamerReadData includes stall samples from the previous kernel? My goal is to obtain non-overlapping stall samples for each kernel to enable fine-grained performance analysis.
- API Enhancement: If my understanding is correct, would it be possible to provide a Level Zero API for flushing the metric streamer, such as zetMetricStreamerFlushData? This could potentially be more efficient than the current zeroFlushStreamerBuffer implementation.
- Clarification: If my understanding is incorrect, could you please confirm that each call to zetMetricStreamerReadData always returns non-overlapping data? This would allow me to remove the zeroFlushStreamerBuffer function, potentially improving performance.
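If successive reads do return disjoint data, the close/reopen in zeroFlushStreamerBuffer could in principle become a drain loop that reads until the buffer is empty. Below is a minimal sketch of that idea, written against a hypothetical ReadFn callback so it compiles without the Level Zero headers. In the profiler, the callback would presumably wrap zetMetricStreamerReadData and map its result codes to true/false. DrainStreamer, ReadFn, and max_reads are names I made up for illustration, and whether draining is actually sufficient depends on the driver behavior asked about above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical reader: on entry *size is the capacity of `data` in bytes; on
// return it holds the number of bytes written. Returns false on a read error.
using ReadFn = std::function<bool(std::size_t* size, std::uint8_t* data)>;

// Calls `read` until it yields zero bytes (buffer empty) or `max_reads` calls
// have been made. Appends everything read to `sink` when `sink` is non-null.
// Returns the total number of bytes drained, or -1 if a read failed.
inline long long DrainStreamer(const ReadFn& read,
                               std::vector<std::uint8_t>& scratch,
                               std::vector<std::uint8_t>* sink = nullptr,
                               int max_reads = 1024) {
  assert(!scratch.empty());
  long long total = 0;
  for (int i = 0; i < max_reads; ++i) {
    std::size_t size = scratch.size();
    if (!read(&size, scratch.data())) return -1;
    if (size == 0) break;  // nothing left to read
    if (sink != nullptr) {
      sink->insert(sink->end(), scratch.begin(),
                   scratch.begin() + static_cast<std::ptrdiff_t>(size));
    }
    total += static_cast<long long>(size);
  }
  return total;
}
```

The max_reads cap is there so that a streamer which keeps producing data (for example, because another kernel has already started) cannot keep the profiling thread in the loop indefinitely.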
Request
I would greatly appreciate clarification on the behavior of zetMetricStreamerReadData in this context and any guidance on the best practices for ensuring non-overlapping metric collection between kernel executions.