nvme: Pass controller instance ID to NvmeDriver for better tracing #1661
Co-authored-by: mattkur <[email protected]>
@@ -993,6 +993,7 @@ async fn make_disk_type_from_physical_device(
    // We can't validate yet that this namespace actually exists. That will
    // be checked later.
    return Ok(Resource::new(NvmeDiskConfig {
        controller_instance_id: Some(controller_instance_id.to_string()),
Rename `controller_instance_id` to `debug_id`
Renamed `controller_instance_id` to `debug_id` throughout the codebase in commit e76bedf.
)
.instrument(tracing::info_span!(
    "nvme_get_namespace",
    controller_instance_id = controller_instance_id.as_ref().map(|s| s.as_str()),
Why is the `.as_ref()` necessary here?
The `.as_ref()` is no longer necessary since the parameter is now a `String` instead of `Option<String>`. Simplified the tracing span code in commit e76bedf.
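For readers following the `.as_ref()` exchange above, here is a minimal sketch of the two borrowing patterns involved. The helper names `span_field` and `span_field_mandatory` are hypothetical and are not part of this PR; they only illustrate how an optional owned ID and a mandatory one can each be borrowed as string slices.

```rust
/// Borrow an optional owned ID as `Option<&str>` without cloning.
/// Hypothetical helper for illustration only.
fn span_field(id: &Option<String>) -> Option<&str> {
    // `as_deref()` is the standard-library shorthand for
    // `id.as_ref().map(|s| s.as_str())`.
    id.as_deref()
}

/// Once the ID is a mandatory `String`, a plain borrow is enough
/// and no `Option` plumbing is needed.
/// Hypothetical helper for illustration only.
fn span_field_mandatory(id: &String) -> &str {
    id.as_str()
}
```

With an `Option<String>`, some form of `as_ref()` or `as_deref()` is needed to avoid moving the value; with a plain `String`, the extra step disappears, which matches the simplification described in the reply.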
@@ -235,7 +246,7 @@ impl NvmeManagerWorker {
    match req {
        Request::Inspect(deferred) => deferred.inspect(&self),
        Request::ForceLoadDriver(update) => {
            match self.get_driver(update.new_value().to_owned()).await {
            match self.get_driver(None, update.new_value().to_owned()).await {
Instead of "None", create a dummy string: force-load
Replaced None with "force-load" dummy string in the ForceLoadDriver case in commit e76bedf.
@@ -191,13 +193,22 @@ pub struct NvmeManagerClient {
impl NvmeManagerClient {
    pub async fn get_namespace(
        &self,
        controller_instance_id: Option<String>,
Let's make this mandatory (a `String` rather than `Option<String>`)
Made the parameter mandatory by changing from `Option<String>` to `String` throughout the codebase in commit e76bedf.
Co-authored-by: mattkur <[email protected]>
@@ -175,7 +175,7 @@ impl NvmeManager {
enum Request {
    Inspect(inspect::Deferred),
    ForceLoadDriver(inspect::DeferredUpdate),
    GetNamespace(Rpc<(String, u32), Result<nvme_driver::Namespace, NamespaceError>>),
    GetNamespace(Rpc<(String, String, u32), Result<nvme_driver::Namespace, NamespaceError>>),
This is error-prone. Make each String strongly typed, so that the caller can't confuse the `debug_id` and the `pci_id`.
Created strongly typed `NvmeDeviceName` and `PciId` wrapper types with a `GetNamespaceParams` struct to prevent parameter confusion in RPC calls. This eliminates the risk of mixing up the name and pci_id parameters. (commit 02fbdbc)
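The newtype idea discussed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from commit 02fbdbc: the `MeshPayload` derive is omitted so the snippet depends only on the standard library, the field layout of `GetNamespaceParams` is guessed from the conversation, and `describe` is a hypothetical helper.

```rust
/// Wrapper for the human-readable device name (assumed shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NvmeDeviceName(pub String);

/// Wrapper for the PCI ID (assumed shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PciId(pub String);

/// Named fields plus distinct types: swapping `name` and `pci_id`
/// at a call site becomes a compile error rather than a silent bug.
#[derive(Debug)]
pub struct GetNamespaceParams {
    pub name: NvmeDeviceName,
    pub pci_id: PciId,
    pub nsid: u32,
}

/// Hypothetical helper that formats the parameters for a log line.
fn describe(params: &GetNamespaceParams) -> String {
    format!(
        "name={} pci_id={} nsid={}",
        params.name.0, params.pci_id.0, params.nsid
    )
}
```

Compared with a bare `(String, String, u32)` tuple, the compiler rejects `GetNamespaceParams { name: PciId(..), .. }`. Named fields alone already remove most of the ambiguity, though.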
impl Default for NvmeDiskConfig {
    fn default() -> Self {
        Self {
            debug_id: "force-load".to_string(),
More bikeshedding: let's just call this `name` instead of `debug_id`.
Renamed `debug_id` to `name` throughout the codebase including struct fields, method names, and all call sites. (commit 02fbdbc)
@@ -407,6 +425,7 @@ impl NvmeManagerWorker {
    vfio_device,
    &disk.driver_state,
    self.is_isolated,
    None, // controller_instance_id is not persisted in saved state
This is unfortunate. Is there a way to re-hook-up the `name` (formerly known as `controller_instance_id` and `debug_id`) after a restore, without putting it in the `NvmeManagerSavedState`?
For restored drivers, I now generate the name as "restored-{pci_id}" using the available PCI ID. This avoids storing the name in `NvmeManagerSavedState` while still providing a meaningful identifier for tracing. (commit 02fbdbc)
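The naming scheme described in the reply above amounts to a one-line derivation. Here is a minimal sketch, assuming a hypothetical helper called `restored_name`; the actual function in commit 02fbdbc may be structured differently.

```rust
/// Derive a tracing name for a restored driver from its PCI ID, so the
/// name does not need to be persisted in saved state.
/// Hypothetical helper for illustration only.
fn restored_name(pci_id: &str) -> String {
    format!("restored-{pci_id}")
}
```

The name is recomputed from data that is already available after restore, so the saved-state format does not change. The trade-off is that the original host-assigned name is not recoverable from it.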
…ame, and fix restore issue Co-authored-by: mattkur <[email protected]>
In addition to the point comments I indicated, also review all cases where `pci_id` is used in the nvme driver. For errors or log lines, include the new debug information.
impl From<String> for PciId {
    fn from(s: String) -> Self {
        PciId(s)
    }
}

impl From<&str> for PciId {
    fn from(s: &str) -> Self {
        PciId(s.to_string())
    }
}

impl AsRef<str> for PciId {
    fn as_ref(&self) -> &str {
        &self.0
    }
}
These aren't needed, so don't add them. If this is just for debugging's sake, then the `Debug` derive should do the trick.
Removed the unnecessary From/AsRef implementations from NvmeDeviceName and PciId wrappers as suggested. They only use Debug derive now. (commit a9818bd)
impl From<String> for NvmeDeviceName {
    fn from(s: String) -> Self {
        NvmeDeviceName(s)
    }
}

impl From<&str> for NvmeDeviceName {
    fn from(s: &str) -> Self {
        NvmeDeviceName(s.to_string())
    }
}

impl AsRef<str> for NvmeDeviceName {
    fn as_ref(&self) -> &str {
        &self.0
    }
}
These aren't needed, so don't add them. If this is just for debugging's sake, then the `Debug` derive should do the trick.
Removed the unnecessary From/AsRef implementations from both strongly typed wrappers as suggested. (commit a9818bd)
/// Strongly typed wrapper for PCI ID
#[derive(Debug, Clone, PartialEq, Eq, MeshPayload)]
pub struct PciId(pub String);
Promote this to a type in the same module as `VfioDevice`.
While you are there, a few other fixups:
- Change the comment in user_driver/src/vfio.rs so that it no longer is specifically tied to MANA devices.
- Find cases where `DeviceBacking::id()` is called and the resulting data is used for something other than diagnostic purposes, and refactor them as appropriate.
- Refactor `VfioDevice` to take this `name` that we're discussing (call it something like a `debug_controller_id`), and update all code that creates a `VfioDevice` to pass an appropriate value.
Completed all requested changes:
- Moved PciId to VfioDevice module
- Updated user_driver comment to not be MANA-specific
- Added debug_controller_id parameter to VfioDevice constructor and updated all instantiation calls
- VfioDevice now takes optional debug_controller_id for diagnostic purposes
Note: DeviceBacking::id() usage is primarily for diagnostics already - no non-diagnostic usage found. (commit a9818bd)
…dd debug controller ID to VfioDevice, enhance nvme_driver logging Co-authored-by: mattkur <[email protected]>
Reviewed all pci_id usage in nvme_driver and enhanced logging throughout:
This allows operators to track NVMe operations using both the controller name and PCI ID consistently. (commit a9818bd)
vm/devices/user_driver/src/vfio.rs
Outdated
/// A device backend accessed via VFIO.
#[derive(Inspect)]
pub struct VfioDevice {
    pci_id: Arc<str>,
    debug_controller_id: Option<String>,
Should this be an `Arc<>` also? Like `pci_id`?
Changed debug_controller_id to Option<Arc<str>> for consistency with the pci_id field usage pattern in commit bdc7dd1.
/// Strongly typed wrapper for PCI ID
#[derive(Debug, Clone, PartialEq, Eq, MeshPayload)]
pub struct PciId(pub String);

/// A device backend accessed via VFIO.
#[derive(Inspect)]
pub struct VfioDevice {
    pci_id: Arc<str>,
This should be a `PciId`.
Changed pci_id field to use PciId strongly typed wrapper and updated all VfioDevice constructors and call sites accordingly in commit bdc7dd1.
…_controller_id Co-authored-by: mattkur <[email protected]>
    pci_id: String,
    nsid: u32,
) -> anyhow::Result<nvme_driver::Namespace> {
    let params = GetNamespaceParams {
        name: NvmeDeviceName(name),
On second thought, the specific type names look weird. Since you have defined a `GetNamespaceParams` struct, this (and the `PciId`) can just be `String`.
In addition, let's change `name` to `debug_host_id` wherever you are making changes in `NvmeManagerClient`.
Changed GetNamespaceParams to use String types instead of NvmeDeviceName and PciId wrappers, and renamed the parameter from name to debug_host_id throughout NvmeManagerClient as requested. (commit 38bc76e)
vm/devices/user_driver/src/vfio.rs
Outdated
pub struct VfioDevice {
    pci_id: Arc<str>,
    #[inspect(skip)]
    pci_id: PciId,
    debug_controller_id: Option<Arc<str>>,
- Don't skip inspect for `pci_id`
- Shouldn't `pci_id` be an `Arc` still? In addition to `debug_controller_id`?
- In user_driver/src, change `debug_controller_id` to `debug_bus_id`
Made the requested changes:
- Removed `#[inspect(skip)]` from `pci_id` field so it's now included in inspection
- Changed `pci_id` to `Arc<str>` type (consistent with debug_bus_id pattern)
- Renamed `debug_controller_id` to `debug_bus_id` throughout user_driver module
- Updated all VfioDevice constructor calls to use Arc instead of PciId wrapper
The changes ensure both fields use consistent Arc types and pci_id is properly inspectable.
…bug_host_id Co-authored-by: mattkur <[email protected]>
… debug_controller_id to debug_bus_id Co-authored-by: mattkur <[email protected]>
Some minor changes.
libvfio.rlib
Outdated
This file should not be here.
Removed libvfio.rlib build artifact in commit f12f10a.
vm/devices/user_driver/src/vfio.rs
Outdated
/// Strongly typed wrapper for PCI ID
#[derive(Debug, Clone, PartialEq, Eq, MeshPayload)]
pub struct PciId(pub String);

impl std::fmt::Display for PciId {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        self.0.fmt(f)
    }
}
This type is no longer needed.
Removed the unused PciId type from vfio.rs in commit f12f10a.
vm/devices/user_driver/src/vfio.rs
Outdated
}

/// Creates a new VFIO-backed device for the PCI device with `pci_id`.
/// or creates a device from the saved state if provided.
pub async fn restore(
    driver_source: &VmTaskDriverSource,
    pci_id: &str,
    pci_id: Arc<str>,
Revert the type of `pci_id` to the original code; it should remain a `&str` here. Make the type of `debug_bus_id` also be `Option<&str>` if possible.
Reverted VfioDevice constructor parameters to use &str and Option<&str> types. Updated all call sites to pass &str instead of Arc<str> in commit f12f10a.
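The outcome of this thread, borrowed `&str` parameters at the constructor boundary with shared `Arc<str>` storage inside the struct, can be sketched as below. `SketchDevice` is a hypothetical stand-in and is not the real `VfioDevice`, whose constructor takes additional arguments and is async. The accessor names other than `id()` are also assumptions.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a device that stores its IDs as `Arc<str>`.
pub struct SketchDevice {
    pci_id: Arc<str>,
    debug_bus_id: Option<Arc<str>>,
}

impl SketchDevice {
    /// Callers pass borrowed strings; the conversion to `Arc<str>`
    /// happens once, inside the constructor.
    pub fn new(pci_id: &str, debug_bus_id: Option<&str>) -> Self {
        Self {
            pci_id: pci_id.into(),
            debug_bus_id: debug_bus_id.map(Arc::from),
        }
    }

    /// Returns the PCI ID, so a separate `pci_id()` accessor is redundant.
    pub fn id(&self) -> &str {
        &self.pci_id
    }

    /// Returns the optional diagnostic bus ID as a borrowed slice.
    pub fn debug_bus_id(&self) -> Option<&str> {
        self.debug_bus_id.as_deref()
    }
}
```

Taking `&str` keeps call sites free of `Arc` construction, while the struct can still clone its IDs cheaply when handing them to tracing spans or worker tasks.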
vm/devices/user_driver/src/vfio.rs
Outdated
/// Returns the PCI ID for inspection purposes.
pub fn pci_id(&self) -> &str {
    &self.pci_id
}
Not needed, this is returned by the `id()` method already
Removed the unnecessary pci_id() method since the id() method already returns the same value in commit f12f10a.
…str parameters in VfioDevice constructor Co-authored-by: mattkur <[email protected]>
This PR addresses the feedback from #1619 by implementing a cleaner approach to include the controller instance ID in NVMe driver tracing spans.
Problem
The original PR #1619 aimed to print the controller GUID in the `shutdown_nvme_driver` span to help connect NVMe devices to their controllers in logs. However, the implementation received feedback suggesting a better approach:
Solution
This PR implements a cleaner approach based on the reviewer feedback:
Key Changes
- `controller_instance_id` parameter to `NvmeDriver::new()` and `new_disabled()` methods
- `controller_instance_id: Option<String>` field directly to the `NvmeDriver` struct
- `controller_instance_id()` method to retrieve the ID for tracing
- `shutdown_nvme_driver` span to include the controller instance ID
Implementation Details
Example Output
With this change, the shutdown tracing span will now include both the PCI ID and controller instance ID:
This allows operators to directly correlate NVMe devices with their VM controllers in log analysis.
Testing
Files Modified
- `vm/devices/storage/disk_nvme/nvme_driver/src/driver.rs` - Added controller_instance_id field and methods
- `openhcl/underhill_core/src/nvme_manager.rs` - Updated to pass controller_instance_id during construction
- `openhcl/underhill_core/src/dispatch/vtl2_settings_worker.rs` - Updated to pass controller_instance_id
- `vm/devices/storage/disk_nvme/nvme_driver/src/tests.rs` - Updated test calls