Clarification on `zetMetricStreamerReadData` Behavior for Non-Overlapping Kernel Profiling

## Environment
- Hardware: Aurora
- Intel Compute Runtime Version: 24.35.30872.22

## Context
I'm developing a profiler for SYCL offload programs. My approach involves serializing kernel launches using `zeEventHostSynchronize` to ensure only one kernel is offloaded to the Intel GPU device at a time. For each kernel, I use a profiling thread to read stall sampling data using `zetMetricStreamerReadData`.

## Current Implementation
Currently, after each kernel execution, I collect and process the data. To ensure non-overlapping stall samples between kernels, I've implemented a manual buffer flushing function `zeroFlushStreamerBuffer(streamer, desc)`. This function closes the current streamer and opens a new one.

```cpp
void zeroFlushStreamerBuffer(zet_metric_streamer_handle_t& streamer, ZeDeviceDescriptor* desc)
{
    ze_result_t status = ZE_RESULT_SUCCESS;
    // Close the old streamer
    status = zetMetricStreamerClose(streamer);
    level0_check_result(status, __LINE__);
    // Open a new streamer
    uint32_t interval = 500000; // ns
    zet_metric_streamer_desc_t streamer_desc = {ZET_STRUCTURE_TYPE_METRIC_STREAMER_DESC, nullptr, max_metric_samples, interval};
    status = zetMetricStreamerOpen(desc->context_, desc->device_, desc->metric_group_, &streamer_desc, nullptr, &streamer);
    if (status != ZE_RESULT_SUCCESS) {
        std::cerr << "[ERROR] Failed to open metric streamer (" << status << "). The sampling interval might be too small." << std::endl;
        streamer = nullptr;
        return;
    }
    if (streamer_desc.notifyEveryNReports > max_metric_samples) {
        max_metric_samples = streamer_desc.notifyEveryNReports;
    }
}
```

## Current Implementation Details

To provide more context, here's the main profiling loop where `zeroFlushStreamerBuffer` is used:

```cpp
void 
ZeMetricProfiler::RunProfilingLoop
(
  ZeDeviceDescriptor* desc,
  zet_metric_streamer_handle_t& streamer
)
{
  std::vector<uint8_t> raw_metrics(MAX_METRIC_BUFFER + 512);
  desc->profiling_state_.store(PROFILER_ENABLED, std::memory_order_release);
  ze_result_t status;
  
  while (desc->profiling_state_.load(std::memory_order_acquire) != PROFILER_DISABLED) {
    // Wait for the kernel to start running
    while (true) {
      status = zeEventHostSynchronize(desc->serial_kernel_start_, 50000000);
      if (status == ZE_RESULT_SUCCESS) {
        break;
      }
      // Handle case where kernel execution is extremely short:
      // In such cases, the kernel might finish before zeEventHostSynchronize can detect the start event.
      // Without this check, a deadlock could occur:
      // - The Profiling thread would keep waiting for the start event (which has already been reset).
      // - The App thread would be waiting for the Profiling thread to complete data processing.
      // kernel_started_ allows Profiling thread to proceed, avoiding deadlock.
      if (desc->kernel_started_.load(std::memory_order_acquire)) {
        break;
      }
      if (desc->profiling_state_.load(std::memory_order_acquire) == PROFILER_DISABLED) {
        return;
      }
    }
    // Kernel is running, enter sampling loop
    while (true) {
      // Update correlation ID
      gpu_correlation_channel_receive(1, UpdateCorrelationID, desc);
      // Wait for the next interval
      status = zeEventHostSynchronize(desc->serial_kernel_end_, 5000);
      if (status == ZE_RESULT_SUCCESS) {
        break;
      }
      CollectAndProcessMetrics(desc, streamer, raw_metrics);
    }
    // Kernel has finished, perform final sampling and cleanup
    CollectAndProcessMetrics(desc, streamer, raw_metrics);
    // FIXME(Yuning): may need a better way to flush the streamer buffer without repeatedly closing and reopening the streamer
    zeroFlushStreamerBuffer(streamer, desc);
    desc->running_kernel_ = nullptr;
    desc->kernel_started_.store(false, std::memory_order_release);
    
    // Notify the app thread that data processing is complete
    status = zeEventHostSignal(desc->serial_data_ready_);
    level0_check_result(status, __LINE__);
  }
}
```

This loop shows how metric collection is currently handled for each kernel execution, including the call to `zeroFlushStreamerBuffer` that is meant to keep the collected data from overlapping between kernels.
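
The app-thread side of this handshake is not shown here. The sketch below models one possible pairing with standard C++ primitives in place of Level Zero events, to illustrate why the `kernel_started_` flag avoids the deadlock described in the comments. `Event`, `Handshake`, `ProfilerLoop`, and `RunKernels` are hypothetical names introduced for illustration only, and the app-side ordering is an assumption rather than the actual profiler code.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

// Resettable event built on a condition variable; a stand-in for a
// host-visible Level Zero event in this sketch (assumption).
class Event {
 public:
  void Signal() {
    { std::lock_guard<std::mutex> lock(m_); set_ = true; }
    cv_.notify_all();
  }
  void Reset() { std::lock_guard<std::mutex> lock(m_); set_ = false; }
  // Returns true if the event was signaled before the timeout expired.
  bool Wait(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(m_);
    return cv_.wait_for(lock, timeout, [this] { return set_; });
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  bool set_ = false;
};

// Hypothetical shared state mirroring the fields used in the loop above.
struct Handshake {
  Event start, end, data_ready;
  std::atomic<bool> kernel_started{false};
  std::atomic<bool> disabled{false};
  std::atomic<int> kernels_processed{0};
};

// Profiling-thread side: same three phases as RunProfilingLoop.
void ProfilerLoop(Handshake& h) {
  while (!h.disabled.load()) {
    // Wait for a kernel; the flag covers a start event that was already reset.
    while (!h.start.Wait(1ms) && !h.kernel_started.load()) {
      if (h.disabled.load()) return;
    }
    // Periodic sampling would happen on each timeout of this wait.
    while (!h.end.Wait(1ms)) {}
    // Final collection and flush would happen here.
    h.kernels_processed.fetch_add(1);
    h.kernel_started.store(false);
    h.data_ready.Signal();
  }
}

// App-thread side (assumed ordering): simulate n very short kernels.
int RunKernels(int n) {
  Handshake h;
  std::thread profiler(ProfilerLoop, std::ref(h));
  for (int i = 0; i < n; ++i) {
    h.kernel_started.store(true);
    h.start.Signal();
    h.start.Reset();  // short kernel: start may be reset before it is observed
    h.end.Signal();
    while (!h.data_ready.Wait(10ms)) {}  // wait for the profiler to finish
    h.data_ready.Reset();
    h.end.Reset();
  }
  h.disabled.store(true);
  profiler.join();
  return h.kernels_processed.load();
}
```

In this model the profiling thread processes every kernel exactly once, even when the start event is reset before the thread ever observes it, because the flag is set before the start signal and cleared only after processing completes.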

## Questions
1. **Data Overlap**: When collecting data for a kernel after its execution, is there a possibility that the data from `zetMetricStreamerReadData` includes stall samples from the previous kernel? My goal is to obtain non-overlapping stall samples for each kernel to enable fine-grained performance analysis.

2. **API Enhancement**: If such overlap is possible, would it be feasible to provide a Level Zero API for flushing the metric streamer, such as `zetMetricStreamerFlushData`? This could be more efficient than the current `zeroFlushStreamerBuffer` implementation, which closes and reopens the streamer after every kernel.

3. **Clarification**: If such overlap cannot occur, could you please confirm that successive calls to `zetMetricStreamerReadData` always return non-overlapping data? I could then remove the `zeroFlushStreamerBuffer` function, potentially improving performance.
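
To make question 2 concrete, a flush could in principle be approximated by draining: reading until the streamer reports no more data. The sketch below shows only the loop shape against a hypothetical `FakeStreamer`, so that it compiles without Level Zero. In the real profiler the `Read` call would correspond to `zetMetricStreamerReadData`. Whether draining actually discards samples that are still in flight on the device is exactly what is being asked here, so this is not a confirmed replacement for closing and reopening the streamer.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a metric streamer (assumption, not a Level Zero
// type). Read copies up to `capacity` bytes into `out` and returns the number
// of bytes copied, which is 0 once nothing is pending.
struct FakeStreamer {
  std::vector<uint8_t> pending;

  size_t Read(uint8_t* out, size_t capacity) {
    size_t n = std::min(capacity, pending.size());
    std::copy(pending.begin(), pending.begin() + n, out);
    pending.erase(pending.begin(), pending.begin() + n);
    return n;
  }
};

// Reads repeatedly into `scratch`, discarding the data, until the streamer
// reports that nothing is left. Returns the total number of bytes discarded.
template <typename Streamer>
size_t DrainStreamer(Streamer& streamer, std::vector<uint8_t>& scratch) {
  size_t total = 0;
  for (;;) {
    size_t n = streamer.Read(scratch.data(), scratch.size());
    if (n == 0) break;  // nothing pending (or no scratch space): stop
    total += n;
  }
  return total;
}
```

A dedicated flush API would avoid both this read loop and the cost of reopening the streamer after every kernel.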

## Request
I would greatly appreciate clarification on the behavior of `zetMetricStreamerReadData` in this context and any guidance on the best practices for ensuring non-overlapping metric collection between kernel executions.
