
RFC: petri: Add Windows host WPR trace collection for tests #1656


Closed
mattkur wants to merge 2 commits

Conversation

mattkur commented Jul 7, 2025

  • Automatically collect Windows Performance Recorder traces during test execution
  • Embed WPR profile for OpenVMM/OpenHCL/Hyper-V components
  • Per-test trace sessions with automatic cleanup
  • Traces saved to test output directory

Co-authored with GitHub Copilot

mattkur requested a review from a team as a code owner July 7, 2025 17:27

mattkur commented Jul 7, 2025

This still suffers from a challenge: we run out of event collector resources. Under a modest degree of concurrency, trace starts begin to fail.

Since the collectors are global anyway, what I think I need to do is use the ETW APIs directly. Then I can ref-count the running collector and dump logs when I need to.

(Or package xperf.exe and use the command line ...)
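
To make the ref-counting idea concrete, here is a minimal sketch of a testpass-wide collector guard. It is not the PR's actual implementation: `start_wpr_collector` / `stop_wpr_collector` are hypothetical placeholders for whichever mechanism ends up starting and stopping the session (ETW APIs or a packaged xperf.exe), and `anyhow` is assumed for error plumbing.

```rust
use std::sync::Mutex;

/// Number of tests currently relying on the single global collector.
/// The collector starts when the count goes 0 -> 1 and stops when it
/// returns to 0.
static COLLECTOR_REFS: Mutex<usize> = Mutex::new(0);

/// RAII guard: holding one keeps the global collector alive.
pub struct CollectorGuard;

impl CollectorGuard {
    pub fn acquire() -> anyhow::Result<Self> {
        let mut refs = COLLECTOR_REFS.lock().unwrap();
        if *refs == 0 {
            // Hypothetical helper that starts the session
            // (direct ETW APIs, or shelling out to xperf.exe).
            start_wpr_collector()?;
        }
        *refs += 1;
        Ok(CollectorGuard)
    }
}

impl Drop for CollectorGuard {
    fn drop(&mut self) {
        let mut refs = COLLECTOR_REFS.lock().unwrap();
        *refs -= 1;
        if *refs == 0 {
            // Hypothetical helper that stops the session and flushes the .etl.
            let _ = stop_wpr_collector();
        }
    }
}

// Placeholders for whatever start/stop mechanism is chosen.
fn start_wpr_collector() -> anyhow::Result<()> { Ok(()) }
fn stop_wpr_collector() -> anyhow::Result<()> { Ok(()) }
```

Each test would hold a `CollectorGuard` for its duration, so only one kernel session is ever active and the resource exhaustion goes away; the trade-off is a single shared trace that needs post-hoc filtering per test.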


```rust
use std::sync::Arc;

use std::sync::atomic::{AtomicBool, Ordering};
```
formatting oops


smalis-msft commented Jul 7, 2025

I think this isn't quite the right approach here, since as you discovered we run out of resources very quickly. I think what we need is to run this trace collection at a testpass level, instead of an individual test level. Then we can just have one trace file and no concurrency problems, and hopefully all the events have easily filterable ids in them. Of course, that's assuming only one testpass is on a runner at a time, which I'm not sure is a guarantee...


mattkur commented Jul 7, 2025

> I think this isn't quite the right approach here, since as you discovered we run out of resources very quickly. I think what we need is to run this trace collection at a testpass level, instead of an individual test level. Then we can just have one trace file and no concurrency problems, and hopefully all the events have easily filterable ids in them. Of course, that's assuming only one testpass is on a runner at a time, which I'm not sure is a guarantee...

Can I conceptually think of a "testpass" as one invocation of `cargo nextest run`?

@smalis-msft

I think so, yeah? Though it's certainly a little more fuzzy in our world.


mattkur commented Jul 7, 2025

> I think so, yeah? Though it's certainly a little more fuzzy in our world.

Yeah, the problem is that we don't have a single binary that launches our tests. At least, not one that we control. cargo-nextest creates a process for each test case (a trial, in libtest-mimic speak).

I can envision a way to solve this for a single run. But if we're running multiple CI actions on the same machine, this quickly gets more difficult to reason through.

My motivation is to figure out a way to debug issues that repro only in CI and have an interesting intersection with the Hyper-V VMM.

Any ideas?

  • Cargo nextest wrapper script? (once that stabilizes, I guess)
  • Environment variable set as part of the pipeline? (sketched below)
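
For the environment-variable option, a minimal sketch of how petri might gate trace collection on a pipeline-set variable. The variable name `PETRI_WPR_TRACE_DIR` is purely illustrative, not something that exists today.

```rust
use std::path::PathBuf;

/// Hypothetical pipeline knob: if the CI pipeline (or a developer) sets this
/// variable, petri collects WPR traces and writes them under the given
/// directory; otherwise tracing stays off.
const TRACE_DIR_VAR: &str = "PETRI_WPR_TRACE_DIR";

fn wpr_trace_dir() -> Option<PathBuf> {
    std::env::var_os(TRACE_DIR_VAR).map(PathBuf::from)
}

fn main() {
    match wpr_trace_dir() {
        Some(dir) => println!("WPR tracing enabled, traces go to {}", dir.display()),
        None => println!("WPR tracing disabled"),
    }
}
```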

@smalis-msft

We could maybe modify flowey to collect traces while running the nextest node? That's probably the fastest approach to get something onboarded, but I'm still worried that if we have multiple PRs/runs going on a single runner that could cause issues. @tjones60 Is that a concern, or are we guaranteed to only have 1 job running at a time?


mattkur commented Jul 8, 2025

> We could maybe modify flowey to collect traces while running the nextest node? That's probably the fastest approach to get something onboarded, but I'm still worried that if we have multiple PRs/runs going on a single runner that could cause issues. @tjones60 Is that a concern, or are we guaranteed to only have 1 job running at a time?

Oh, good idea. That does mean that we won't get these traces when running manually. But maybe that's okay.

I'll try to make a change where flowey defines a unique name. It's possible that we run out of system resources if there is too much parallelism (multiple CI runs on the same machine), but the code to coordinate among multiple tests is really messy. My latest iteration is a stab at that.
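
For reference, a rough sketch of the testpass-level shape this could take: start one WPR session before the whole nextest invocation and stop it afterwards, keyed by a unique instance name. This is just the start/run/stop sequence, not a real flowey node; the `wpr.exe` flags (`-start`, `-stop`, `-instancename`) are from the standard CLI but should be treated as assumptions, and `openvmm.wprp` / `testpass.etl` are placeholder names.

```rust
use std::process::Command;

/// Start a testpass-scoped WPR session, run the whole nextest invocation,
/// then stop the session into a single .etl file for the testpass.
fn run_testpass_with_wpr(unique_name: &str) -> anyhow::Result<()> {
    // Start recording under a unique instance name so an unrelated session
    // on the same machine does not collide with ours.
    let status = Command::new("wpr.exe")
        .args(["-start", "openvmm.wprp", "-instancename", unique_name])
        .status()?;
    anyhow::ensure!(status.success(), "wpr -start failed: {status}");

    // Run the whole testpass; the trace is still stopped below on failure.
    let tests = Command::new("cargo")
        .args(["nextest", "run"])
        .status();

    // Stop recording into one trace file for the entire testpass.
    let stop = Command::new("wpr.exe")
        .args(["-stop", "testpass.etl", "-instancename", unique_name])
        .status()?;
    anyhow::ensure!(stop.success(), "wpr -stop failed: {stop}");

    anyhow::ensure!(tests?.success(), "test run failed");
    Ok(())
}
```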


tjones60 commented Jul 8, 2025

> We could maybe modify flowey to collect traces while running the nextest node? That's probably the fastest approach to get something onboarded, but I'm still worried that if we have multiple PRs/runs going on a single runner that could cause issues. @tjones60 Is that a concern, or are we guaranteed to only have 1 job running at a time?
>
> Oh, good idea. That does mean that we won't get these traces when running manually. But maybe that's okay.
>
> I'll try to make a change where flowey defines a unique name. It's possible that we run out of system resources if there is too much parallelism (multiple CI runs on the same machine), but the code to coordinate among multiple tests is really messy. My latest iteration is a stab at that.

Multiple CI runs do not run concurrently on the same machine, but multiple petri tests do run at the same time. I have thought about this before, and I don't think there is a good way to collect traces for our Hyper-V tests since many of them are not tagged with the relevant VM. We could collect all the system-wide traces, but I think that would have limited usefulness.

@smalis-msft

Do we think there's any chance we could ask the Hyper-V folks (and/or do this ourselves) to add a VM ID to every event we care about? That way a system-level trace could be useful.

Alternatively, we could maybe create a serial-testing mode where tests are run one at a time, and have some way to do a run with that manually, but keep it off by default?

(Also, anything we add to flowey can maybe be reused in the future when we have an xflowey vmm-test command)


mattkur commented Jul 14, 2025

> Do we think there's any chance we could ask the Hyper-V folks (and/or do this ourselves) to add a VM ID to every event we care about? That way a system-level trace could be useful.
>
> Alternatively, we could maybe create a serial-testing mode where tests are run one at a time, and have some way to do a run with that manually, but keep it off by default?
>
> (Also, anything we add to flowey can maybe be reused in the future when we have an xflowey vmm-test command)

Yes, we should make (or drive) changes in the Hyper-V virt stack if we find that it emits events that can't be tied to the appropriate VM. In general, I think we will find value even in the events as they are emitted today.

However, this is not the right approach. Let me summarize the findings into an issue and close out this PR.


mattkur commented Jul 14, 2025

Filed #1689; closing this PR for now.
