
RFC: petri: Add Windows host WPR trace collection for tests #1656


Closed
mattkur wants to merge 2 commits

Conversation

mattkur commented Jul 7, 2025

  • Automatically collect Windows Performance Recorder traces during test execution
  • Embed WPR profile for OpenVMM/OpenHCL/Hyper-V components
  • Per-test trace sessions with automatic cleanup
  • Traces saved to test output directory

Co-authored with GitHub Copilot

mattkur requested a review from a team as a code owner July 7, 2025 17:27

mattkur commented Jul 7, 2025

This still suffers from a challenge: we run out of event collector resources. Under a modest degree of concurrency, trace starts begin to fail.

Since the collectors are global anyway, what I think I need to do is use the ETW APIs directly. Then I can ref-count the running collector and dump logs when I need to.

(Or package xperf.exe and use the command line ...)
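
To make the ref-counting idea concrete, here is a minimal sketch of a testpass-wide collector guard. It is not the PR's actual implementation: `start_wpr_collector` / `stop_wpr_collector` are hypothetical placeholders for whichever mechanism ends up starting and stopping the session (ETW APIs or a packaged xperf.exe), and `anyhow` is assumed for error plumbing.

```rust
use std::sync::Mutex;

/// Number of tests currently relying on the single global collector.
/// The collector starts when the count goes 0 -> 1 and stops when it
/// returns to 0.
static COLLECTOR_REFS: Mutex<usize> = Mutex::new(0);

/// RAII guard: holding one keeps the global collector alive.
pub struct CollectorGuard;

impl CollectorGuard {
    pub fn acquire() -> anyhow::Result<Self> {
        let mut refs = COLLECTOR_REFS.lock().unwrap();
        if *refs == 0 {
            // Hypothetical helper that starts the session
            // (direct ETW APIs, or shelling out to xperf.exe).
            start_wpr_collector()?;
        }
        *refs += 1;
        Ok(CollectorGuard)
    }
}

impl Drop for CollectorGuard {
    fn drop(&mut self) {
        let mut refs = COLLECTOR_REFS.lock().unwrap();
        *refs -= 1;
        if *refs == 0 {
            // Hypothetical helper that stops the session and flushes the .etl.
            let _ = stop_wpr_collector();
        }
    }
}

// Placeholders for whatever start/stop mechanism is chosen.
fn start_wpr_collector() -> anyhow::Result<()> { Ok(()) }
fn stop_wpr_collector() -> anyhow::Result<()> { Ok(()) }
```

Each test would hold a `CollectorGuard` for its duration, so only one kernel session is ever active and the resource exhaustion goes away; the trade-off is a single shared trace that needs post-hoc filtering per test.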


```rust
use std::sync::Arc;

use std::sync::atomic::{AtomicBool, Ordering};
```
formatting oops


smalis-msft commented Jul 7, 2025

I think this isn't quite the right approach here, since as you discovered we run out of resources very quickly. I think what we need is to run this trace collection at a testpass level, instead of an individual test level. Then we can just have one trace file and no concurrency problems, and hopefully all the events have easily filterable ids in them. Of course, that's assuming only one testpass is on a runner at a time, which I'm not sure is a guarantee...


mattkur commented Jul 7, 2025

> I think this isn't quite the right approach here, since as you discovered we run out of resources very quickly. I think what we need is to run this trace collection at a testpass level, instead of an individual test level. Then we can just have one trace file and no concurrency problems, and hopefully all the events have easily filterable ids in them. Of course, that's assuming only one testpass is on a runner at a time, which I'm not sure is a guarantee...

Can I conceptually think of a "testpass" as one invocation of `cargo nextest run`?

@smalis-msft

I think so, yeah? Though it's certainly a little more fuzzy in our world.


mattkur commented Jul 7, 2025

> I think so, yeah? Though it's certainly a little more fuzzy in our world.

Yeah, the problem is that we don't have a single binary that launches our tests. At least, not one that we control. cargo-nextest creates a process for each test case (a trial, in libtest-mimic speak).

I can envision a way to solve this for a single run. But if we're running multiple CI actions on the same machine, this quickly gets more difficult to reason through.

My motivation is to figure out a way to debug issues that repro only in CI and have an interesting intersection with the Hyper-V VMM.

Any ideas?

  • Cargo nextest wrapper script? (once that stabilizes, I guess)
  • Environment variable set as part of the pipeline? (sketched below)
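
For the environment-variable option, a minimal sketch of how petri might gate trace collection on a pipeline-set variable. The variable name `PETRI_WPR_TRACE_DIR` is purely illustrative, not something that exists today.

```rust
use std::path::PathBuf;

/// Hypothetical pipeline knob: if the CI pipeline (or a developer) sets this
/// variable, petri collects WPR traces and writes them under the given
/// directory; otherwise tracing stays off.
const TRACE_DIR_VAR: &str = "PETRI_WPR_TRACE_DIR";

fn wpr_trace_dir() -> Option<PathBuf> {
    std::env::var_os(TRACE_DIR_VAR).map(PathBuf::from)
}

fn main() {
    match wpr_trace_dir() {
        Some(dir) => println!("WPR tracing enabled, traces go to {}", dir.display()),
        None => println!("WPR tracing disabled"),
    }
}
```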

@smalis-msft

We could maybe modify flowey to collect traces while running the nextest node? That's probably the fastest approach to get something onboarded, but I'm still worried that if we have multiple PRs/runs going on a single runner that could cause issues. @tjones60 Is that a concern, or are we guaranteed to only have 1 job running at a time?


mattkur commented Jul 8, 2025

> We could maybe modify flowey to collect traces while running the nextest node? That's probably the fastest approach to get something onboarded, but I'm still worried that if we have multiple PRs/runs going on a single runner that could cause issues. @tjones60 Is that a concern, or are we guaranteed to only have 1 job running at a time?

Oh, good idea. That does mean that we won't get these traces when running manually. But maybe that's okay.

I'll try to make a change where flowey defines a unique name. It's possible that we run out of system resources if there is too much parallelism (multiple CI runs on the same machine), but the code to coordinate among multiple tests is really messy. My latest iteration is a stab at that.
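
For reference, a rough sketch of the testpass-level shape this could take: start one WPR session before the whole nextest invocation and stop it afterwards, keyed by a unique instance name. This is just the start/run/stop sequence, not a real flowey node; the `wpr.exe` flags (`-start`, `-stop`, `-instancename`) are from the standard CLI but should be treated as assumptions, and `openvmm.wprp` / `testpass.etl` are placeholder names.

```rust
use std::process::Command;

/// Start a testpass-scoped WPR session, run the whole nextest invocation,
/// then stop the session into a single .etl file for the testpass.
fn run_testpass_with_wpr(unique_name: &str) -> anyhow::Result<()> {
    // Start recording under a unique instance name so an unrelated session
    // on the same machine does not collide with ours.
    let status = Command::new("wpr.exe")
        .args(["-start", "openvmm.wprp", "-instancename", unique_name])
        .status()?;
    anyhow::ensure!(status.success(), "wpr -start failed: {status}");

    // Run the whole testpass; the trace is still stopped below on failure.
    let tests = Command::new("cargo")
        .args(["nextest", "run"])
        .status();

    // Stop recording into one trace file for the entire testpass.
    let stop = Command::new("wpr.exe")
        .args(["-stop", "testpass.etl", "-instancename", unique_name])
        .status()?;
    anyhow::ensure!(stop.success(), "wpr -stop failed: {stop}");

    anyhow::ensure!(tests?.success(), "test run failed");
    Ok(())
}
```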


tjones60 commented Jul 8, 2025

> We could maybe modify flowey to collect traces while running the nextest node? That's probably the fastest approach to get something onboarded, but I'm still worried that if we have multiple PRs/runs going on a single runner that could cause issues. @tjones60 Is that a concern, or are we guaranteed to only have 1 job running at a time?
>
> Oh, good idea. That does mean that we won't get these traces when running manually. But maybe that's okay.
>
> I'll try to make a change where flowey defines a unique name. It's possible that we run out of system resources if there is too much parallelism (multiple CI runs on the same machine), but the code to coordinate among multiple tests is really messy. My latest iteration is a stab at that.

Multiple CI runs do not run concurrently on the same machine, but multiple petri tests do run at the same time. I have thought about this before, and I don't think there is a good way to collect traces for our Hyper-V tests since many of them are not tagged with the relevant VM. We could collect all the system-wide traces, but I think that would have limited usefulness.

@smalis-msft

Do we think there's any chance we could ask the Hyper-V folks (and/or do this ourselves) to add a VM ID to every event we care about? That way a system-level trace could be useful.

Alternatively, we could maybe create a serial-testing mode where tests are run one at a time, and have some way to do a run with that manually, but keep it off by default?

(Also, anything we add to flowey can maybe be reused in the future when we have an xflowey vmm-test command)


mattkur commented Jul 14, 2025

> Do we think there's any chance we could ask the Hyper-V folks (and/or do this ourselves) to add a VM ID to every event we care about? That way a system-level trace could be useful.
>
> Alternatively, we could maybe create a serial-testing mode where tests are run one at a time, and have some way to do a run with that manually, but keep it off by default?
>
> (Also, anything we add to flowey can maybe be reused in the future when we have an xflowey vmm-test command)

Yes, we should make (or drive) changes in the Hyper-V virt stack if we find that it emits events that can't be tied to the appropriate VM. In general, I think we will find value even in the events as they are emitted today.

However, this is not the right approach. Let me summarize the findings into an issue and close out this PR.


mattkur commented Jul 14, 2025

Filed #1689; closing this PR for now.
