
Conversation

@hhzhang16 hhzhang16 commented Oct 24, 2025

Overview:

Reworks the profiling Kubernetes Job manifests into DGDR (DynamoGraphDeploymentRequest) manifests, along with bug fixes to get everything working end-to-end (E2E) again.

Details:

  • Removed the --config validation check in the profiler when the user provides --model
  • Turned backend into a top-level DGDR field (see the manifest sketch after this list)
  • Validated modelName and backend in the controller as required inputs, injecting them into the profiling config with logging/warnings
  • Fixed ClusterRoles/RoleBindings in the profiling/operator RBAC for the GPU search-space work
  • Updated tests and API docs
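
For orientation, a minimal DGDR manifest might look like the sketch below. Only the top-level modelName, backend, and profilingConfig fields come from this PR; the metadata name and placeholder profiling settings are illustrative, not taken from the actual sample files.

```yaml
# Minimal sketch of a DGDR with the new top-level fields; profilingConfig
# contents are placeholders, not the real sample values.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-profiling-example        # hypothetical name
spec:
  modelName: Qwen/Qwen3-32B          # injected into profilingConfig.config.deployment.model
  backend: vllm                      # one of vllm, sglang, trtllm; injected into engine.backend
  profilingConfig:
    config: {}                       # sweep/engine settings go here; model and backend need not be repeated
```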

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

Release Notes

  • New Features

    • Added backend field to DynamoGraphDeploymentRequest to specify inference backend (vllm, sglang, trtllm) at the top level.
  • Refactor

    • Simplified profiling configuration: modelName and backend are now top-level spec fields that are automatically mapped into the profiling configuration, reducing configuration complexity.
    • Replaced job-based profiling manifests with declarative DynamoGraphDeploymentRequest manifests for improved consistency.
  • Chores

    • Updated Go module dependencies to latest versions.

@hhzhang16 hhzhang16 requested review from a team as code owners October 24, 2025 00:50
@github-actions github-actions bot added the feat label Oct 24, 2025

coderabbitai bot commented Oct 24, 2025

Walkthrough

This pull request refactors the profiling infrastructure by introducing a top-level backend field to the DynamoGraphDeploymentRequest CRD. It replaces Kubernetes Job manifests with DynamoGraphDeploymentRequest manifests for AI Configurator, standard, and MoE profiling workflows. Controller logic is updated to handle the new backend field, with simplified configuration handling and corresponding RBAC adjustments.

Changes

• New DynamoGraphDeploymentRequest Profiling Manifests
  Files: benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml, benchmarks/profiler/deploy/profile_sla_dgdr.yaml, benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml
  Summary: Added three new YAML manifests defining DynamoGraphDeploymentRequest resources for AI Configurator-based profiling, standard online profiling, and MoE model profiling respectively, replacing the previous Kubernetes Job-based approach.

• Removed Kubernetes Job Profiling Manifests
  Files: benchmarks/profiler/deploy/profile_sla_aic_job.yaml, benchmarks/profiler/deploy/profile_sla_job.yaml, benchmarks/profiler/deploy/profile_sla_moe_job.yaml
  Summary: Deleted three Job manifest files that previously defined batch profiling tasks for the AI Configurator, standard, and MoE workflows.

• Profiler Utilities
  Files: benchmarks/profiler/utils/profiler_argparse.py
  Summary: Updated the validation error message, with an inline comment, to clarify the requirement for --model or --config.

• CRD Schema and API Types
  Files: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentrequests.yaml, deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentrequests.yaml, deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go
  Summary: Added the new backend field (string, enum: vllm/sglang/trtllm) to DynamoGraphDeploymentRequestSpec with required validation. Updated descriptions to indicate that modelName and backend are automatically mapped into profilingConfig fields.

• Sample CRD Configuration
  Files: deploy/cloud/operator/config/samples/nvidia.com_v1alpha1_dynamographdeploymentrequest.yaml
  Summary: Updated the sample to include the top-level spec.backend field and added comments indicating the automatic wiring of modelName/backend into profilingConfig.

• Operator Deployment Configuration
  Files: deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
  Summary: Swapped the conditional logic for namespaceRestriction to select different cluster-role names based on the enabled/disabled state.

• Operator RBAC Templates
  Files: deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml, deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml
  Summary: Added new ClusterRole and ClusterRoleBinding resources for cluster-resource-reader; updated profiling-job RBAC with namespace-restricted and cluster-wide mode handling; extended queue-reader-binding with an additional subject.

• Operator Build Configuration
  Files: deploy/cloud/operator/Makefile
  Summary: Updated the default CRD reference docs version from v0.0.12 to latest.

• Operator Controller Logic
  Files: deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go
  Summary: Updated to derive the backend from spec.Backend instead of a helper function; modified validation to log warnings for overwritable profilingConfig fields; simplified RBAC setup; enhanced config construction to set deployment.model and engine.backend from spec fields.

• Operator Controller Tests
  Files: deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller_test.go
  Summary: Added ModelName and Backend fields to test specs; updated test cases to use the new field structure; adjusted validation expectations for minimal configurations; modified profiling job creation assertions.

• Operator Graph Tests
  Files: deploy/cloud/operator/internal/dynamo/graph_test.go
  Summary: Replaced DynamoComponentDeploymentOverridesSpec with DynamoComponentDeploymentSpec; updated the GenerateBasePodSpec invocation to use a DynamoComponentDeploymentSharedSpec pointer.

• Go Dependencies
  Files: deploy/cloud/operator/go.mod
  Summary: Updated indirect dependencies to newer patch versions (golang.org/x/net, golang.org/x/sync, golang.org/x/sys, golang.org/x/term, golang.org/x/text, golang.org/x/tools).

• API Documentation
  Files: docs/kubernetes/api_reference.md
  Summary: Updated the documentation to reflect the new backend field and the modelName/backend auto-mapping into profilingConfig, and clarified field descriptions for automatic value assignment.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

The changes span multiple file categories (YAML manifests, Go types, controller logic, RBAC templates, and documentation) with moderate logic density. While the refactoring introduces a new API field and modifies controller behavior, the changes are cohesive and serve a unified purpose. The heterogeneity of file types requires separate reasoning for each category, but repetitive patterns (similar manifest structures, consistent CRD updates) reduce complexity. Test updates align well with implementation changes.

Poem

🐰 A backend field hops into view,
No more Jobs to chase—just a manifest or two!
The profiler dances, configured and bright,
With DGDR manifests—a cleaner sight! ✨
Schema and logic, now dancing as one,
The refactoring's done, our work is such fun! 🎉

Pre-merge checks

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title Check (✅ Passed): The pull request title "feat: turn profiling k8s jobs into sample DGDR requests" directly and accurately summarizes the primary architectural change: the conversion of Kubernetes Job manifests (profile_sla_job.yaml, profile_sla_aic_job.yaml, profile_sla_moe_job.yaml) into DynamoGraphDeploymentRequest (DGDR) sample manifests (profile_sla_dgdr.yaml, profile_sla_aic_dgdr.yaml, profile_sla_moe_dgdr.yaml). The title is concise, clear, and uses the conventional "feat:" prefix appropriately for a feature enhancement. While the PR also includes supporting changes such as API modifications and RBAC updates, the title effectively captures the main semantic transformation.

  • Description Check (✅ Passed): The pull request description follows the required template structure with all four sections present: Overview, Details, Where should the reviewer start, and Related Issues. The Overview and Details sections are substantive and informative, clearly explaining the main changes, including the removal of validation checks, conversion of backend to a high-level DGDR field, controller validation logic, RBAC fixes, and documentation updates. However, the "Where should the reviewer start" section contains only a comment placeholder with no actual content, and the "Related Issues" section shows the template placeholder text without a filled-in issue number. These are non-critical sections, as the core information is well-documented.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
deploy/cloud/operator/Makefile (1)

270-270: Consider pinning the CRD reference docs version for reproducibility.

Using latest for build tool versions can lead to non-reproducible builds and unexpected behavior when the tool is updated. Consider pinning to a specific version (e.g., v0.0.12 or a newer stable release) to ensure consistent documentation generation across different environments and time periods.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5f98e91 and 4989175.

⛔ Files ignored due to path filters (1)
  • deploy/cloud/operator/go.sum is excluded by !**/*.sum
📒 Files selected for processing (20)
  • benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml (1 hunks)
  • benchmarks/profiler/deploy/profile_sla_aic_job.yaml (0 hunks)
  • benchmarks/profiler/deploy/profile_sla_dgdr.yaml (1 hunks)
  • benchmarks/profiler/deploy/profile_sla_job.yaml (0 hunks)
  • benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml (1 hunks)
  • benchmarks/profiler/deploy/profile_sla_moe_job.yaml (0 hunks)
  • benchmarks/profiler/utils/profiler_argparse.py (1 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentrequests.yaml (3 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml (2 hunks)
  • deploy/cloud/operator/Makefile (1 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (1 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentrequests.yaml (3 hunks)
  • deploy/cloud/operator/config/samples/nvidia.com_v1alpha1_dynamographdeploymentrequest.yaml (1 hunks)
  • deploy/cloud/operator/go.mod (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (4 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller_test.go (16 hunks)
  • deploy/cloud/operator/internal/dynamo/graph_test.go (4 hunks)
  • docs/kubernetes/api_reference.md (9 hunks)
💤 Files with no reviewable changes (3)
  • benchmarks/profiler/deploy/profile_sla_job.yaml
  • benchmarks/profiler/deploy/profile_sla_moe_job.yaml
  • benchmarks/profiler/deploy/profile_sla_aic_job.yaml
🧰 Additional context used
🧬 Code graph analysis (3)
deploy/cloud/operator/internal/dynamo/graph_test.go (1)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (2)
  • DynamoComponentDeploymentSpec (38-46)
  • DynamoComponentDeploymentSharedSpec (48-106)
deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller_test.go (2)
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (3)
  • ProfilingConfigSpec (50-63)
  • DynamoGraphDeploymentRequest (207-216)
  • DynamoGraphDeploymentRequestSpec (91-122)
deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (1)
  • StatePending (52-52)
deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (1)
deploy/cloud/operator/internal/controller_common/predicate.go (1)
  • Config (55-71)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: vllm (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: sglang
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (15)
deploy/cloud/operator/go.mod (1)

77-84: LGTM!

The indirect dependency updates to newer patch versions are routine maintenance and pose no concerns.

benchmarks/profiler/utils/profiler_argparse.py (1)

298-300: LGTM!

The clarified error message and inline comment improve user experience by explicitly stating that at least one of --model or --config is required. This aligns well with the new backend field design.

benchmarks/profiler/deploy/profile_sla_dgdr.yaml (1)

1-32: LGTM!

This DGDR manifest correctly uses the new top-level modelName and backend fields while omitting deployment.model and engine.backend from profilingConfig, as these are now automatically populated by the controller. The structure aligns well with the updated CRD design.

deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1)

130-135: Verify the RBAC configuration change for namespace-restricted mode.

The conditional logic has been inverted, and when namespaceRestriction.enabled is true, the deployment now uses the dgdr-profiling-nodes cluster role instead of dgdr-profiling and omits the planner role argument. This suggests different RBAC requirements for namespace-scoped versus cluster-wide deployments.

Please verify that:

  1. The dgdr-profiling-nodes cluster role has the appropriate permissions for namespace-restricted profiling operations
  2. The planner functionality is intentionally disabled or not needed in namespace-restricted mode
  3. Corresponding ClusterRole/RoleBinding resources have been updated to match this change
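
For reference, the inverted conditional described above plausibly takes a shape like the sketch below. The values key and argument layout are assumptions; the flag name comes from the --dgdr-profiling-cluster-role-name flag mentioned later in this review, and the planner argument is omitted rather than guessed.

```yaml
# Hedged sketch of the deployment.yaml conditional (keys and role-name
# suffixes are assumptions based on this review's descriptions).
args:
  {{- if .Values.namespaceRestriction.enabled }}
  - --dgdr-profiling-cluster-role-name={{ .Release.Name }}-dgdr-profiling-nodes
  {{- else }}
  - --dgdr-profiling-cluster-role-name={{ .Release.Name }}-dgdr-profiling
  # (an additional planner role argument is reportedly passed only in this branch)
  {{- end }}
```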
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (1)

94-108: LGTM!

The new Backend field is properly defined with appropriate validation (required, enum constraint for vllm/sglang/trtllm), and the updated comments clearly document the auto-population behavior for deployment.model and engine.backend in the profiling config. This implementation aligns well with the PR objectives.

benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml (1)

1-42: LGTM!

This MoE profiling manifest correctly uses the new top-level backend field and properly configures MoE-specific settings (is_moe_model: true) while omitting engine.backend from profilingConfig. The configMapRef pattern for referencing the base disaggregation config is appropriate for MoE models.

benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml (1)

1-33: LGTM!

This AI Configurator profiling manifest correctly uses the new top-level backend field and properly configures AIC-specific settings (use_ai_configurator: true, aic_system, aic_model_name, aic_backend_version) while omitting engine.backend from profilingConfig. The structure aligns well with the updated CRD design for simulation-based profiling.
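
To make the AIC-specific knobs named above concrete, a hedged sketch follows. Only the key names come from this review; the nesting under profilingConfig.config and all values are assumptions.

```yaml
# Sketch of AIC-specific profiling settings (nesting and values are illustrative).
profilingConfig:
  config:
    use_ai_configurator: true
    aic_system: h200                 # placeholder target system
    aic_model_name: QWEN3_32B        # placeholder AIC model identifier
    aic_backend_version: "0.20.0"    # placeholder version string
```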

docs/kubernetes/api_reference.md (4)

278-280: New top-level fields are well-documented with clear auto-population semantics.

The documentation clearly explains that modelName and backend are automatically propagated into profilingConfig.config.deployment.model and profilingConfig.config.engine.backend respectively, which aligns with the PR goal of simplifying the API surface. The validation notes properly indicate these fields are required and should not be duplicated in the profiling config.
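
As a concrete illustration of that propagation (values are illustrative; the target paths are the ones named in the documentation):

```yaml
# What the user writes:
spec:
  modelName: Qwen/Qwen3-32B
  backend: sglang
  profilingConfig:
    config: {}

# What the controller effectively fills in before profiling runs:
#   profilingConfig.config.deployment.model  -> Qwen/Qwen3-32B
#   profilingConfig.config.engine.backend    -> sglang
```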


282-282: DeploymentOverrides field documentation is clear and optional.

The field is properly marked as optional and the description correctly notes it only applies when autoApply is true. This prevents confusion about when these overrides take effect.


300-300: Status fields correctly marked optional with proper descriptions.

The backend, profilingResults, generatedDeployment, and deployment status fields are appropriately marked as optional since they're populated by the controller during the DGDR lifecycle. Descriptions are clear about their role and population logic.

Also applies to: 303-305


279-279: Backend enum values are consistent across the codebase.

Verification confirms that the backend enum [vllm sglang trtllm] in the documentation matches the CRD definitions, Go type constants, and controller validation logic. All three backends are consistently referenced across:

  • CRD files: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentrequests.yaml and deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentrequests.yaml
  • Go types: deploy/cloud/operator/internal/dynamo/graph.go (lines 592-594)
  • Controller validation: deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (lines 153-155)

No inconsistencies detected.
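
For readers cross-checking the CRDs, the backend property plausibly renders in the generated OpenAPI schema along these lines. This is a sketch: the enum values come from the PR, while the description text and surrounding structure are assumptions.

```yaml
# Excerpt-style sketch of the spec.backend property in the CRD schema.
backend:
  description: Inference backend to profile; automatically mapped into
    profilingConfig.config.engine.backend.
  type: string
  enum:
    - vllm
    - sglang
    - trtllm
```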

deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (2)

527-529: Addition of controller-manager to queue-reader-binding is appropriate.

The controller-manager service account now has access to queue resources, which aligns with the profiling workflow and cluster-wide scheduling needs.


530-576: New cluster-resource-reader ClusterRole provides appropriate read-only access.

The new ClusterRole grants read-only access to cluster-scoped resources (nodes and clusterroles) needed for GPU discovery and RBAC verification. The rule set is minimal and follows the principle of least privilege. Naming and labeling are consistent with existing RBAC conventions.
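
A minimal sketch of such a read-only role, assuming standard Kubernetes RBAC conventions; the metadata name is a placeholder rather than the chart's actual templated value.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-resource-reader       # placeholder; the chart templates the real name
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["clusterroles"]
    verbs: ["get", "list", "watch"]
```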

deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml (2)

89-104: Namespace-restricted mode properly adds nodes access via ClusterRoleBinding.

The addition correctly binds the cluster-scoped nodes ClusterRole to the profiling-job ServiceAccount in namespace-restricted mode. This is the appropriate pattern for cluster-scoped resources that need to be accessed from a namespace-restricted context. The conditional template is correctly structured.
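
A hedged sketch of that binding pattern, with placeholder names for the binding, ServiceAccount, and restricted namespace:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dgdr-profiling-nodes-binding   # placeholder name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dgdr-profiling-nodes           # the nodes-reading role discussed above
subjects:
  - kind: ServiceAccount
    name: profiling-job                # placeholder ServiceAccount name
    namespace: dynamo                  # placeholder restricted namespace
```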


111-111: No dangling references detected—renaming is correctly applied and consistent.

Verification confirms the ClusterRole name change from dgdr-profiling-nodes to dgdr-profiling in cluster-wide mode is applied consistently:

  • ClusterRole metadata.name (line 111): dgdr-profiling
  • ClusterRoleBinding metadata.name (line 143): dgdr-profiling
  • roleRef.name (line 150): dgdr-profiling

The operator receives the role name dynamically via the --dgdr-profiling-cluster-role-name flag, avoiding hardcoded references. Helm templates correctly pass the appropriate name based on deployment mode (namespace-restricted retains -dgdr-profiling-nodes, cluster-wide uses -dgdr-profiling), and both match their corresponding RBAC definitions.

@copy-pr-bot

copy-pr-bot bot commented Oct 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tedzhouhk

We need to update the docs: pre_deployment_profiling.md and sla_planner_quickstart.md.

metadata:
  name: sla-aic
spec:
  modelName: Qwen/Qwen3-32B

Does this modelName override profile_sla.py's deployment.model?
