RFC: petri: Add Windows host WPR trace collection for tests #1656
Conversation
- Automatically collect Windows Performance Recorder (WPR) traces during test execution
- Embed a WPR profile for OpenVMM/OpenHCL/Hyper-V components
- Per-test trace sessions with automatic cleanup (see the sketch below)
- Traces saved to the test output directory
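A minimal sketch of what the per-test session lifecycle above could look like, driving `wpr.exe` through `std::process::Command`. The `WprSession` type, the instance-naming scheme, and the profile/output handling are illustrative assumptions rather than the actual petri implementation, and the example assumes `wpr.exe` is on the PATH.

```rust
// Hypothetical sketch of a per-test WPR session wrapper; names are
// illustrative, not the real petri API.
use std::path::{Path, PathBuf};
use std::process::Command;

struct WprSession {
    instance: String,
    output: PathBuf,
}

impl WprSession {
    /// Start recording with the given profile, using a per-test instance name
    /// so concurrent sessions do not collide.
    fn start(profile: &Path, test_name: &str, output_dir: &Path) -> std::io::Result<Self> {
        let instance = format!("petri-{test_name}");
        let status = Command::new("wpr.exe")
            .arg("-start")
            .arg(profile)
            .arg("-instancename")
            .arg(&instance)
            .status()?;
        if !status.success() {
            return Err(std::io::Error::other("wpr -start failed"));
        }
        Ok(Self {
            instance,
            output: output_dir.join(format!("{test_name}.etl")),
        })
    }

    /// Stop recording and save the trace into the test output directory.
    fn stop(self) -> std::io::Result<()> {
        let status = Command::new("wpr.exe")
            .arg("-stop")
            .arg(&self.output)
            .arg("-instancename")
            .arg(&self.instance)
            .status()?;
        if !status.success() {
            return Err(std::io::Error::other("wpr -stop failed"));
        }
        Ok(())
    }
}
```

Using a distinct `-instancename` per test is what allows multiple sessions at once, but it also runs into the system limit on concurrent trace sessions, which is exactly the problem discussed below.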
This still suffers from a challenge: we run out of event collector resources. Under a modest degree of concurrency, we'll see trace starts begin to fail. Since the collectors are global anyway, what I think I need to do is use the ETW APIs directly. Then I can ref-count the running collector and dump logs when I need to. (Or package xperf.exe and use the command line...)
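A rough sketch of the ref-counting idea, assuming a process-global counter: the first test that needs the collector starts it, and the last one stops it. `start_collector`/`stop_collector` are placeholders for whatever ETW (or wpr.exe) calls would actually be used; none of this is an existing petri API.

```rust
// Illustrative only: a process-global ref count so one collector is shared
// by all tests in the process.
use std::sync::{Mutex, OnceLock};

static COLLECTOR_REFS: OnceLock<Mutex<u32>> = OnceLock::new();

fn start_collector() { /* placeholder: start the shared trace session here */ }
fn stop_collector() { /* placeholder: stop the session and flush the trace here */ }

/// RAII guard: the first guard starts the collector, the last one stops it.
struct CollectorGuard;

impl CollectorGuard {
    fn acquire() -> Self {
        let mut refs = COLLECTOR_REFS.get_or_init(|| Mutex::new(0)).lock().unwrap();
        if *refs == 0 {
            start_collector();
        }
        *refs += 1;
        CollectorGuard
    }
}

impl Drop for CollectorGuard {
    fn drop(&mut self) {
        let mut refs = COLLECTOR_REFS.get_or_init(|| Mutex::new(0)).lock().unwrap();
        *refs -= 1;
        if *refs == 0 {
            stop_collector();
        }
    }
}
```

Note that this only coordinates tests within a single process; since nextest runs each test in its own process, a real implementation would still need some system-wide coordination (for example a well-known session name), which is part of why the discussion below moves toward a testpass-level session instead.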
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
formatting oops
I think this isn't quite the right approach here, since as you discovered we run out of resources very quickly. I think what we need is to run this trace collection at a testpass level, instead of an individual test level. Then we can just have one trace file and no concurrency problems, and hopefully all the events have easily filterable ids in them. Of course, that's assuming only one testpass is on a runner at a time, which I'm not sure is a guarantee...
Can I conceptually think of a "testpass" as one invocation of ...?
I think so, yeah? Though it's certainly a little more fuzzy in our world.
Yeah, the problem is that we don't have a single binary that launches our tests. At least, not one that we control. I can envision a way to solve this for a single run. But if we're running multiple CI actions on the same machine, this quickly gets more difficult to reason through. My motivation is to figure out a way to debug issues that repro only in CI and have an interesting intersection with the Hyper-V VMM. Any ideas?
We could maybe modify flowey to collect traces while running the nextest node? That's probably the fastest approach to get something onboarded, but I'm still worried that if we have multiple PRs/runs going on a single runner that could cause issues. @tjones60 Is that a concern, or are we guaranteed to only have 1 job running at a time?
Oh, good idea. That does mean that we won't get these traces when run manually, but maybe that's okay. I'll try to make a change where flowey defines a unique name. It's possible that we run out of system resources if there is too much parallelism (multiple CI runs on the same machine), but the code to coordinate among multiple tests is really messy. My latest iteration is a stab at that.
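One hedged sketch of the unique-name idea: fold a run-scoped identifier into the WPR instance name, falling back to PID plus timestamp for manual runs. `GITHUB_RUN_ID` is just one example of such an identifier; flowey could equally pass its own value through the environment.

```rust
// Sketch: derive a WPR instance name that is unique per CI run, so runs on
// the same machine do not collide on session names.
use std::process;
use std::time::{SystemTime, UNIX_EPOCH};

fn wpr_instance_name() -> String {
    let run_id = std::env::var("GITHUB_RUN_ID").unwrap_or_else(|_| {
        // Local/manual runs: fall back to PID + timestamp so they still get
        // a distinct session name.
        let secs = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0);
        format!("{}-{}", process::id(), secs)
    });
    format!("openvmm-petri-{run_id}")
}
```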
Multiple CI runs do not run concurrently on the same machine, but multiple petri tests do run at the same time. I have thought about this before, and I don't think there is a good way to collect traces for our Hyper-V tests since many of them are not tagged with the relevant VM. We could collect all the system-wide traces, but I think that would have limited usefulness.
Do we think there's any chance we could ask the Hyper-V folks (and/or do this ourselves) to add a VM id to every event we care about? That way a system-level trace could be useful. Alternatively, we could maybe create a serial-testing mode where tests are run one at a time, and have some way to do a run with that manually, but keep it off by default? (Also, anything we add to flowey can maybe be reused in the future when we have an xflowey vmm-test command.)
Yes, we should be making or driving changes in the Hyper-V virt stack if we find that there are events the virt stack emits without a sufficient way to tie them to the appropriate VM. In general, I think that we will find value even in the events as they are emitted now. However, this is not the right approach. Let me summarize the findings into an issue and close out this PR.
Filed #1689; closing this PR for now.
Co-authored with GitHub Copilot