feat(hpc): add InfiniBand diagnostics module (#635) #656

Merged
adolago merged 1 commit into main from hpc/635-ib-diagnostics
Feb 16, 2026

@adolago adolago commented Feb 16, 2026

Summary

  • Add IbDiagnosticsModule for running ibdiagnet/iblinkinfo/ibstat fabric diagnostics
  • Supports link_health, topology, counters, and full check modes
  • Parses results into structured JSON output
  • Gated under ofed feature flag
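
The four check modes listed above could be modelled as a small enum that parses the mode string. This is a minimal sketch, not the module's actual code: the `CheckMode` type and its `tools` method are hypothetical names, and the mapping from mode to tool is an assumption made purely for illustration, since the summary does not say which tools each mode runs.

```rust
use std::str::FromStr;

/// Check modes named in the PR summary (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckMode {
    LinkHealth,
    Topology,
    Counters,
    Full,
}

impl FromStr for CheckMode {
    type Err = String;

    /// Parses the mode strings used in the summary; anything else is rejected.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "link_health" => Ok(CheckMode::LinkHealth),
            "topology" => Ok(CheckMode::Topology),
            "counters" => Ok(CheckMode::Counters),
            "full" => Ok(CheckMode::Full),
            other => Err(format!("unknown check mode: {other}")),
        }
    }
}

impl CheckMode {
    /// ASSUMPTION: this mode-to-tool mapping is illustrative only.
    /// The PR does not state which tools each mode invokes.
    pub fn tools(self) -> &'static [&'static str] {
        match self {
            CheckMode::LinkHealth => &["ibstat", "iblinkinfo"],
            CheckMode::Topology => &["iblinkinfo"],
            CheckMode::Counters => &["ibdiagnet"],
            CheckMode::Full => &["ibstat", "iblinkinfo", "ibdiagnet"],
        }
    }
}
```

Rejecting unknown mode strings at parse time keeps a typo in a playbook from silently falling back to some default mode.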

Closes #635

Test plan

  • cargo clippy --features full-hpc passes
  • cargo test --features full-hpc --lib -- ib_diagnostics — 5 tests pass

🤖 Generated with Claude Code

Add IbDiagnosticsModule for running ibdiagnet/iblinkinfo/ibstat checks
and parsing results into structured data.

Closes #635

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@adolago adolago merged commit 391ed53 into main Feb 16, 2026
12 of 14 checks passed

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 258f87040d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

format!("IB diagnostics completed with {} error(s)", errors.len())
};

Ok(ModuleOutput::ok(summary)

P1: Mark module failed when diagnostic commands fail

The loop records command failures in errors, but the function always returns ModuleOutput::ok(...), so runs where tools are missing or commands fail (e.g., ibdiagnet not installed or non-zero exit) are reported as successful to callers. In automation workflows this can let playbooks continue despite incomplete diagnostics, which undermines result reliability; the status should switch to failed (or return an error) when errors is non-empty.
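
One way to act on this suggestion is to derive the final status from `errors`. The sketch below is illustrative only: the `ModuleOutput` struct here is a hypothetical stand-in for the project's real type (only `ModuleOutput::ok` appears in the diff; the `failed` constructor, the `is_failed` field, the `finish` helper, and the success message are assumptions).

```rust
/// Hypothetical stand-in for the project's `ModuleOutput`;
/// it carries only what this sketch needs.
#[derive(Debug, PartialEq)]
pub struct ModuleOutput {
    pub is_failed: bool,
    pub msg: String,
}

impl ModuleOutput {
    pub fn ok(msg: impl Into<String>) -> Self {
        ModuleOutput { is_failed: false, msg: msg.into() }
    }

    /// ASSUMPTION: the real type may expose failure differently
    /// (for example by returning an error instead).
    pub fn failed(msg: impl Into<String>) -> Self {
        ModuleOutput { is_failed: true, msg: msg.into() }
    }
}

/// Chooses the module status from the collected command errors,
/// so a run with any failed command is never reported as successful.
pub fn finish(errors: &[String]) -> ModuleOutput {
    if errors.is_empty() {
        // Success message is a placeholder, not taken from the PR.
        ModuleOutput::ok("IB diagnostics completed successfully")
    } else {
        ModuleOutput::failed(format!(
            "IB diagnostics completed with {} error(s): {}",
            errors.len(),
            errors.join("; ")
        ))
    }
}
```

Including the individual error strings in the failure message lets a caller see which tool failed without re-running the module.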

// Save to file
let output_file = format!("{}/{}.log", output_dir, cmd);
let escaped = stdout.replace('\'', "'\\''");
let _ = run_cmd(

P2: Propagate failures when writing diagnostic log files

The result of the log-write command is discarded with let _ = run_cmd(...), so permission issues, missing directories, or disk errors silently drop report files even though output_dir is returned as if the artifacts were saved. This makes diagnostics output misleading in environments where report persistence is required; log-write failures should be added to errors or fail the module.
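
A rough sketch of the first option, recording the write failure in `errors` rather than discarding it. This assumes a local write with `std::fs::write` as a stand-in; the module in the diff writes through `run_cmd` with a shell-escaped payload, which this sketch does not reproduce. The `save_log` helper is a hypothetical name.

```rust
use std::fs;
use std::path::Path;

/// Writes `stdout` to `<output_dir>/<cmd>.log`.
/// On failure the error is pushed onto `errors` instead of being dropped,
/// and the function returns `false` so the caller knows no artifact exists.
///
/// ASSUMPTION: a local `std::fs::write` replaces the `run_cmd`-based write
/// used in the PR, purely to keep the sketch self-contained.
pub fn save_log(
    output_dir: &Path,
    cmd: &str,
    stdout: &str,
    errors: &mut Vec<String>,
) -> bool {
    let output_file = output_dir.join(format!("{cmd}.log"));
    match fs::write(&output_file, stdout) {
        Ok(()) => true,
        Err(e) => {
            errors.push(format!(
                "failed to write {}: {}",
                output_file.display(),
                e
            ));
            false
        }
    }
}
```

With this shape, a missing directory or a permission problem shows up in the same `errors` list as command failures, so the returned `output_dir` is not presented as if every report had been saved.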

@adolago adolago deleted the hpc/635-ib-diagnostics branch February 17, 2026 17:28

Linked issue: [HPC PR 20] Implement fabric diagnostics module (IB-05)
