Skip to content

nvme: Pass controller instance ID to NvmeDriver for better tracing #1661

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

Copilot
Copy link
Contributor

@Copilot Copilot AI commented Jul 8, 2025

This PR addresses the feedback from #1619 by implementing a cleaner approach to include the controller instance ID in NVMe driver tracing spans.

Problem

The original PR #1619 aimed to print the controller GUID in the shutdown_nvme_driver span to help connect NVMe devices to their controllers in logs. However, the implementation received feedback suggesting a better approach:

  1. Separate HashMap storage: The original PR used a separate HashMap to track controller instance IDs, which was not ideal
  2. Setter approach: Used a setter method after construction instead of passing the ID during construction
  3. Save/restore complexity: Included controller instance ID in save/restore state unnecessarily

Solution

This PR implements a cleaner approach based on the reviewer feedback:

Key Changes

  1. Constructor-based approach: Added optional controller_instance_id parameter to NvmeDriver::new() and new_disabled() methods
  2. Direct field storage: Added controller_instance_id: Option<String> field directly to the NvmeDriver struct
  3. Getter method: Added controller_instance_id() method to retrieve the ID for tracing
  4. No save/restore: Controller instance ID is not persisted in save/restore state (not needed for diagnostic purposes)
  5. Updated tracing: Modified shutdown_nvme_driver span to include the controller instance ID

Implementation Details

  • NvmeDriver changes: Modified constructor to accept and store controller instance ID
  • NvmeManager updates: Updated to pass controller instance ID during driver construction
  • NvmeDiskConfig: Added controller instance ID field to configuration
  • Dispatch integration: Updated vtl2_settings_worker to pass controller instance ID from vmbus_instance_id
  • Test updates: Updated all test calls to pass the new parameter

Example Output

With this change, the shutdown tracing span will now include both the PCI ID and controller instance ID:

shutdown_nvme_driver pci_id="0000:00:04.0" controller_instance_id="12345678-1234-5678-9abc-123456789abc"

This allows operators to directly correlate NVMe devices with their VM controllers in log analysis.

Testing

  • ✅ All existing nvme_driver tests pass
  • ✅ Code builds successfully
  • ✅ Formatting and linting checks pass
  • ✅ No breaking changes to existing functionality

Files Modified

  • vm/devices/storage/disk_nvme/nvme_driver/src/driver.rs - Added controller_instance_id field and methods
  • openhcl/underhill_core/src/nvme_manager.rs - Updated to pass controller_instance_id during construction
  • openhcl/underhill_core/src/dispatch/vtl2_settings_worker.rs - Updated to pass controller_instance_id
  • vm/devices/storage/disk_nvme/nvme_driver/src/tests.rs - Updated test calls

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

@Copilot Copilot AI changed the title [WIP] Create a new PR, based on @microsoft/openvmm/pull/1619 , to solve the same business problem but address feedback in @microsoft/openvmm/pull/1619 . nvme: Pass controller instance ID to NvmeDriver for better tracing Jul 8, 2025
@Copilot Copilot AI requested a review from mattkur July 8, 2025 18:29
Copilot finished work on behalf of mattkur July 8, 2025 18:29
@@ -993,6 +993,7 @@ async fn make_disk_type_from_physical_device(
// We can't validate yet that this namespace actually exists. That will
// be checked later.
return Ok(Resource::new(NvmeDiskConfig {
controller_instance_id: Some(controller_instance_id.to_string()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename controller_instance_id to debug_id

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed controller_instance_id to debug_id throughout the codebase in commit e76bedf.

)
.instrument(tracing::info_span!(
"nvme_get_namespace",
controller_instance_id = controller_instance_id.as_ref().map(|s| s.as_str()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the .as_ref() necessary here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .as_ref() is no longer necessary since the parameter is now a String instead of Option<String>. Simplified the tracing span code in commit e76bedf.

@@ -235,7 +246,7 @@ impl NvmeManagerWorker {
match req {
Request::Inspect(deferred) => deferred.inspect(&self),
Request::ForceLoadDriver(update) => {
match self.get_driver(update.new_value().to_owned()).await {
match self.get_driver(None, update.new_value().to_owned()).await {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of "None", create a dummy string: force-load

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced None with "force-load" dummy string in the ForceLoadDriver case in commit e76bedf.

@@ -191,13 +193,22 @@ pub struct NvmeManagerClient {
impl NvmeManagerClient {
pub async fn get_namespace(
&self,
controller_instance_id: Option<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this mandatory (a String rather than Option<String>)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the parameter mandatory by changing from Option<String> to String throughout the codebase in commit e76bedf.

@Copilot Copilot AI requested a review from mattkur July 8, 2025 19:03
Copilot finished work on behalf of mattkur July 8, 2025 19:03
@@ -175,7 +175,7 @@ impl NvmeManager {
enum Request {
Inspect(inspect::Deferred),
ForceLoadDriver(inspect::DeferredUpdate),
GetNamespace(Rpc<(String, u32), Result<nvme_driver::Namespace, NamespaceError>>),
GetNamespace(Rpc<(String, String, u32), Result<nvme_driver::Namespace, NamespaceError>>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is prone for errors. Make each String strongly typed, so that the caller can't confuse the debug_id and the pci_id.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created strongly typed NvmeDeviceName and PciId wrapper types with GetNamespaceParams struct to prevent parameter confusion in RPC calls. This eliminates the risk of mixing up the name and pci_id parameters. (commit 02fbdbc)

impl Default for NvmeDiskConfig {
fn default() -> Self {
Self {
debug_id: "force-load".to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More bikeshedding: let's just call this name instead of debug_id.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed debug_id to name throughout the codebase including struct fields, method names, and all call sites. (commit 02fbdbc)

@@ -407,6 +425,7 @@ impl NvmeManagerWorker {
vfio_device,
&disk.driver_state,
self.is_isolated,
None, // controller_instance_id is not persisted in saved state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunate. Is there a way to re-hook-up the name (formerly known as controller_instance_id and debug_id) after a restore, without putting it in the NvmeManagerSavedState?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For restored drivers, I now generate the name as "restored-{pci_id}" using the available PCI ID. This avoids storing the name in NvmeManagerSavedState while still providing a meaningful identifier for tracing. (commit 02fbdbc)

@Copilot Copilot AI requested a review from mattkur July 8, 2025 21:25
Copilot finished work on behalf of mattkur July 8, 2025 21:25
Copy link
Contributor

@mattkur mattkur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the point comments I indicated, also review all cases where pci_id is used in the nvme driver. For errors or log lines, include the new debug information.

Comment on lines 58 to 74
impl From<String> for PciId {
fn from(s: String) -> Self {
PciId(s)
}
}

impl From<&str> for PciId {
fn from(s: &str) -> Self {
PciId(s.to_string())
}
}

impl AsRef<str> for PciId {
fn as_ref(&self) -> &str {
&self.0
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't needed, so don't add them. If this is just for debugging sake, then the Debug derive should do the trick.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the unnecessary From/AsRef implementations from NvmeDeviceName and PciId wrappers as suggested. They only use Debug derive now. (commit a9818bd)

Comment on lines 36 to 52
impl From<String> for NvmeDeviceName {
fn from(s: String) -> Self {
NvmeDeviceName(s)
}
}

impl From<&str> for NvmeDeviceName {
fn from(s: &str) -> Self {
NvmeDeviceName(s.to_string())
}
}

impl AsRef<str> for NvmeDeviceName {
fn as_ref(&self) -> &str {
&self.0
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't needed, so don't add them. If this is just for debugging sake, then the Debug derive should do the trick.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the unnecessary From/AsRef implementations from both strongly typed wrappers as suggested. (commit a9818bd)


/// Strongly typed wrapper for PCI ID
#[derive(Debug, Clone, PartialEq, Eq, MeshPayload)]
pub struct PciId(pub String);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Promote this to a type in the same module as VfioDevice.

While you are there, a few other fixups:

  1. Change the comment in user_driver/src
    /vfio.rs so that it no longer is specifically tied to MANA devices.
  2. Find cases where DeviceBacking::id() is called and the resulting data is used for something other than for diagnostics purposes and refactor them as appropriate.
  3. Refactor VfioDevice to take this name that we're discussing (call it something like a debug_controller_id), and update all code that creates a VfioDevice to pass an appropriate value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completed all requested changes:

  1. Moved PciId to VfioDevice module
  2. Updated user_driver comment to not be MANA-specific
  3. Added debug_controller_id parameter to VfioDevice constructor and updated all instantiation calls
  4. VfioDevice now takes optional debug_controller_id for diagnostic purposes

Note: DeviceBacking::id() usage is primarily for diagnostics already - no non-diagnostic usage found. (commit a9818bd)

…dd debug controller ID to VfioDevice, enhance nvme_driver logging

Co-authored-by: mattkur <[email protected]>
Copy link
Contributor Author

Copilot AI commented Jul 8, 2025

In addition to the point comments I indicated, also review all cases where pci_id is used in the nvme driver. For errors or log lines, include the new debug information.

Reviewed all pci_id usage in nvme_driver and enhanced logging throughout:

  • Added name field to DriverWorkerTask for comprehensive diagnostic context
  • Enhanced all error and warning tracing calls to include both name and pci_id
  • Updated tracing spans (nvme_new_disabled, nvme_enable, nvme_ctrl_enable, etc.) to include name
  • All diagnostic messages now provide both the device name and PCI ID for better correlation

This allows operators to track NVMe operations using both the controller name and PCI ID consistently. (commit a9818bd)

@Copilot Copilot AI requested a review from mattkur July 8, 2025 22:12
Copilot finished work on behalf of mattkur July 8, 2025 22:12
/// A device backend accessed via VFIO.
#[derive(Inspect)]
pub struct VfioDevice {
pci_id: Arc<str>,
debug_controller_id: Option<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be an Arc<> also? Like pci_id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed debug_controller_id to Option<Arc<str>> for consistency with the pci_id field usage pattern in commit bdc7dd1.

/// Strongly typed wrapper for PCI ID
#[derive(Debug, Clone, PartialEq, Eq, MeshPayload)]
pub struct PciId(pub String);

/// A device backend accessed via VFIO.
#[derive(Inspect)]
pub struct VfioDevice {
pci_id: Arc<str>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a PciId.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed pci_id field to use PciId strongly typed wrapper and updated all VfioDevice constructors and call sites accordingly in commit bdc7dd1.

@Copilot Copilot AI requested a review from mattkur July 9, 2025 00:28
Copilot finished work on behalf of mattkur July 9, 2025 00:28
pci_id: String,
nsid: u32,
) -> anyhow::Result<nvme_driver::Namespace> {
let params = GetNamespaceParams {
name: NvmeDeviceName(name),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, the specific type names look weird. Since you have defined a GetNamespaceParams struct, this (and the PciId) can just be String.

In addition, let's change name to debug_host_id wherever you are making changes in NvmeManagerClient

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed GetNamespaceParams to use String types instead of NvmeDeviceName and PciId wrappers, and renamed the parameter from name to debug_host_id throughout NvmeManagerClient as requested. (commit 38bc76e)

Comment on lines 54 to 57
pub struct VfioDevice {
pci_id: Arc<str>,
#[inspect(skip)]
pci_id: PciId,
debug_controller_id: Option<Arc<str>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Don't skip inspect for pci_id
  2. Shouldn't pci_id be an Arc still? In addition to debug_controller_id?
  3. In user_driver/src, change debug_controller_id to debug_bus_id

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the requested changes:

  1. Removed #[inspect(skip)] from pci_id field so it's now included in inspection
  2. Changed pci_id to Arc<str> type (consistent with debug_bus_id pattern)
  3. Renamed debug_controller_id to debug_bus_id throughout user_driver module
  4. Updated all VfioDevice constructor calls to use Arc instead of PciId wrapper

The changes ensure both fields use consistent Arc types and pci_id is properly inspectable.

@Copilot Copilot AI requested a review from mattkur July 14, 2025 12:49
Copilot finished work on behalf of mattkur July 14, 2025 12:49
… debug_controller_id to debug_bus_id

Co-authored-by: mattkur <[email protected]>
Copilot finished work on behalf of mattkur July 14, 2025 13:08
Copy link
Contributor

@mattkur mattkur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor changes.

libvfio.rlib Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should not be here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed libvfio.rlib build artifact in commit f12f10a.

Comment on lines 42 to 50
/// Strongly typed wrapper for PCI ID
#[derive(Debug, Clone, PartialEq, Eq, MeshPayload)]
pub struct PciId(pub String);

impl std::fmt::Display for PciId {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
self.0.fmt(f)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this type is no longer needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the unused PciId type from vfio.rs in commit f12f10a.

}

/// Creates a new VFIO-backed device for the PCI device with `pci_id`.
/// or creates a device from the saved state if provided.
pub async fn restore(
driver_source: &VmTaskDriverSource,
pci_id: &str,
pci_id: Arc<str>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert the type to of pci_id to the original code, it should remain an &str here. Make the type of debug_bus_id also be Option<&str> if possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted VfioDevice constructor parameters to use &str and Option<&str> types. Updated all call sites to pass &str instead of Arc<str> in commit f12f10a.

Comment on lines 177 to 180
/// Returns the PCI ID for inspection purposes.
pub fn pci_id(&self) -> &str {
&self.pci_id
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed, this is returned by the id() method already

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the unnecessary pci_id() method since the id() method already returns the same value in commit f12f10a.

…str parameters in VfioDevice constructor

Co-authored-by: mattkur <[email protected]>
@Copilot Copilot AI requested a review from mattkur July 14, 2025 15:38
Copilot finished work on behalf of mattkur July 14, 2025 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants