
Conversation

@hhzhang16 hhzhang16 commented Oct 24, 2025

Overview:

Reworks the profiling Kubernetes Job manifests into DGDR (DynamoGraphDeploymentRequest) manifests, along with bug fixes to get everything working end-to-end (E2E) again.

Details:

  • Removed the --config validation check in the profiler when the user provides --model
  • Turned backend into a top-level DGDR field (see the manifest sketch after this list)
  • Validated modelName and backend in the controller as required inputs, injecting them into the profiling config with logging/warnings
  • Fixed ClusterRoles/RoleBindings in the profiling/operator RBAC for the GPU search-space work
  • Updated tests and API docs
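
For orientation, a minimal DGDR manifest might look like the sketch below. Only the top-level modelName, backend, and profilingConfig fields come from this PR; the metadata name and placeholder profiling settings are illustrative, not taken from the actual sample files.

```yaml
# Minimal sketch of a DGDR with the new top-level fields; profilingConfig
# contents are placeholders, not the real sample values.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-profiling-example        # hypothetical name
spec:
  modelName: Qwen/Qwen3-32B          # injected into profilingConfig.config.deployment.model
  backend: vllm                      # one of vllm, sglang, trtllm; injected into engine.backend
  profilingConfig:
    config: {}                       # sweep/engine settings go here; model and backend need not be repeated
```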

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

Release Notes

  • New Features

    • Added backend field to DynamoGraphDeploymentRequest to specify inference backend (vllm, sglang, trtllm) at the top level.
  • Refactor

    • Simplified profiling configuration: modelName and backend are now top-level spec fields that are automatically mapped into the profiling configuration, reducing configuration complexity.
    • Replaced job-based profiling manifests with declarative DynamoGraphDeploymentRequest manifests for improved consistency.
  • Chores

    • Updated Go module dependencies to latest versions.

@hhzhang16 hhzhang16 requested review from a team as code owners October 24, 2025 00:50
@github-actions github-actions bot added the feat label Oct 24, 2025

coderabbitai bot commented Oct 24, 2025

Walkthrough

This pull request refactors the profiling infrastructure by introducing a top-level backend field to the DynamoGraphDeploymentRequest CRD. It replaces Kubernetes Job manifests with DynamoGraphDeploymentRequest manifests for AI Configurator, standard, and MoE profiling workflows. Controller logic is updated to handle the new backend field, with simplified configuration handling and corresponding RBAC adjustments.

Changes

• New DynamoGraphDeploymentRequest Profiling Manifests
  Files: benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml, benchmarks/profiler/deploy/profile_sla_dgdr.yaml, benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml
  Summary: Added three new YAML manifests defining DynamoGraphDeploymentRequest resources for AI Configurator-based profiling, standard online profiling, and MoE model profiling respectively, replacing the previous Kubernetes Job-based approach.

• Removed Kubernetes Job Profiling Manifests
  Files: benchmarks/profiler/deploy/profile_sla_aic_job.yaml, benchmarks/profiler/deploy/profile_sla_job.yaml, benchmarks/profiler/deploy/profile_sla_moe_job.yaml
  Summary: Deleted three Job manifest files that previously defined batch profiling tasks for the AI Configurator, standard, and MoE workflows.

• Profiler Utilities
  Files: benchmarks/profiler/utils/profiler_argparse.py
  Summary: Updated the validation error message, with an inline comment, to clarify the requirement for --model or --config.

• CRD Schema and API Types
  Files: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentrequests.yaml, deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentrequests.yaml, deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go
  Summary: Added the new backend field (string, enum: vllm/sglang/trtllm) to DynamoGraphDeploymentRequestSpec with required validation. Updated descriptions to indicate that modelName and backend are automatically mapped into profilingConfig fields.

• Sample CRD Configuration
  Files: deploy/cloud/operator/config/samples/nvidia.com_v1alpha1_dynamographdeploymentrequest.yaml
  Summary: Updated the sample to include the top-level spec.backend field and added comments indicating the automatic wiring of modelName/backend into profilingConfig.

• Operator Deployment Configuration
  Files: deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
  Summary: Swapped the conditional logic for namespaceRestriction to select different cluster-role names based on the enabled/disabled state.

• Operator RBAC Templates
  Files: deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml, deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml
  Summary: Added new ClusterRole and ClusterRoleBinding resources for cluster-resource-reader; updated profiling-job RBAC with namespace-restricted and cluster-wide mode handling; extended queue-reader-binding with an additional subject.

• Operator Build Configuration
  Files: deploy/cloud/operator/Makefile
  Summary: Updated the default CRD reference docs version from v0.0.12 to latest.

• Operator Controller Logic
  Files: deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go
  Summary: Updated to derive the backend from spec.Backend instead of a helper function; modified validation to log warnings for overwritable profilingConfig fields; simplified RBAC setup; enhanced config construction to set deployment.model and engine.backend from spec fields.

• Operator Controller Tests
  Files: deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller_test.go
  Summary: Added ModelName and Backend fields to test specs; updated test cases to use the new field structure; adjusted validation expectations for minimal configurations; modified profiling job creation assertions.

• Operator Graph Tests
  Files: deploy/cloud/operator/internal/dynamo/graph_test.go
  Summary: Replaced DynamoComponentDeploymentOverridesSpec with DynamoComponentDeploymentSpec; updated the GenerateBasePodSpec invocation to use a DynamoComponentDeploymentSharedSpec pointer.

• Go Dependencies
  Files: deploy/cloud/operator/go.mod
  Summary: Updated indirect dependencies to newer patch versions (golang.org/x/net, golang.org/x/sync, golang.org/x/sys, golang.org/x/term, golang.org/x/text, golang.org/x/tools).

• API Documentation
  Files: docs/kubernetes/api_reference.md
  Summary: Updated the documentation to reflect the new backend field and the modelName/backend auto-mapping into profilingConfig, and clarified field descriptions for automatic value assignment.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

The changes span multiple file categories (YAML manifests, Go types, controller logic, RBAC templates, and documentation) with moderate logic density. While the refactoring introduces a new API field and modifies controller behavior, the changes are cohesive and serve a unified purpose. The heterogeneity of file types requires separate reasoning for each category, but repetitive patterns (similar manifest structures, consistent CRD updates) reduce complexity. Test updates align well with implementation changes.

Poem

🐰 A backend field hops into view,
No more Jobs to chase—just a manifest or two!
The profiler dances, configured and bright,
With DGDR manifests—a cleaner sight! ✨
Schema and logic, now dancing as one,
The refactoring's done, our work is such fun! 🎉

Pre-merge checks

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title Check (✅ Passed): The pull request title "feat: turn profiling k8s jobs into sample DGDR requests" directly and accurately summarizes the primary architectural change: the conversion of Kubernetes Job manifests (profile_sla_job.yaml, profile_sla_aic_job.yaml, profile_sla_moe_job.yaml) into DynamoGraphDeploymentRequest (DGDR) sample manifests (profile_sla_dgdr.yaml, profile_sla_aic_dgdr.yaml, profile_sla_moe_dgdr.yaml). The title is concise, clear, and uses the conventional "feat:" prefix appropriately for a feature enhancement. While the PR also includes supporting changes such as API modifications and RBAC updates, the title effectively captures the main semantic transformation.

  • Description Check (✅ Passed): The pull request description follows the required template structure with all four sections present: Overview, Details, Where should the reviewer start, and Related Issues. The Overview and Details sections are substantive and informative, clearly explaining the main changes, including the removal of validation checks, conversion of backend to a high-level DGDR field, controller validation logic, RBAC fixes, and documentation updates. However, the "Where should the reviewer start" section contains only a comment placeholder with no actual content, and the "Related Issues" section shows the template placeholder text without a filled-in issue number. These are non-critical sections, as the core information is well-documented.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
deploy/cloud/operator/Makefile (1)

270-270: Consider pinning the CRD reference docs version for reproducibility.

Using latest for build tool versions can lead to non-reproducible builds and unexpected behavior when the tool is updated. Consider pinning to a specific version (e.g., v0.0.12 or a newer stable release) to ensure consistent documentation generation across different environments and time periods.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5f98e91 and 4989175.

⛔ Files ignored due to path filters (1)
  • deploy/cloud/operator/go.sum is excluded by !**/*.sum
📒 Files selected for processing (20)
  • benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml (1 hunks)
  • benchmarks/profiler/deploy/profile_sla_aic_job.yaml (0 hunks)
  • benchmarks/profiler/deploy/profile_sla_dgdr.yaml (1 hunks)
  • benchmarks/profiler/deploy/profile_sla_job.yaml (0 hunks)
  • benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml (1 hunks)
  • benchmarks/profiler/deploy/profile_sla_moe_job.yaml (0 hunks)
  • benchmarks/profiler/utils/profiler_argparse.py (1 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentrequests.yaml (3 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml (2 hunks)
  • deploy/cloud/operator/Makefile (1 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (1 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentrequests.yaml (3 hunks)
  • deploy/cloud/operator/config/samples/nvidia.com_v1alpha1_dynamographdeploymentrequest.yaml (1 hunks)
  • deploy/cloud/operator/go.mod (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (4 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller_test.go (16 hunks)
  • deploy/cloud/operator/internal/dynamo/graph_test.go (4 hunks)
  • docs/kubernetes/api_reference.md (9 hunks)
💤 Files with no reviewable changes (3)
  • benchmarks/profiler/deploy/profile_sla_job.yaml
  • benchmarks/profiler/deploy/profile_sla_moe_job.yaml
  • benchmarks/profiler/deploy/profile_sla_aic_job.yaml
🧰 Additional context used
🧬 Code graph analysis (3)
deploy/cloud/operator/internal/dynamo/graph_test.go (1)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (2)
  • DynamoComponentDeploymentSpec (38-46)
  • DynamoComponentDeploymentSharedSpec (48-106)
deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller_test.go (2)
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (3)
  • ProfilingConfigSpec (50-63)
  • DynamoGraphDeploymentRequest (207-216)
  • DynamoGraphDeploymentRequestSpec (91-122)
deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (1)
  • StatePending (52-52)
deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (1)
deploy/cloud/operator/internal/controller_common/predicate.go (1)
  • Config (55-71)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: vllm (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: sglang
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (15)
deploy/cloud/operator/go.mod (1)

77-84: LGTM!

The indirect dependency updates to newer patch versions are routine maintenance and pose no concerns.

benchmarks/profiler/utils/profiler_argparse.py (1)

298-300: LGTM!

The clarified error message and inline comment improve user experience by explicitly stating that at least one of --model or --config is required. This aligns well with the new backend field design.

benchmarks/profiler/deploy/profile_sla_dgdr.yaml (1)

1-32: LGTM!

This DGDR manifest correctly uses the new top-level modelName and backend fields while omitting deployment.model and engine.backend from profilingConfig, as these are now automatically populated by the controller. The structure aligns well with the updated CRD design.

deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1)

130-135: Verify the RBAC configuration change for namespace-restricted mode.

The conditional logic has been inverted, and when namespaceRestriction.enabled is true, the deployment now uses the dgdr-profiling-nodes cluster role instead of dgdr-profiling and omits the planner role argument. This suggests different RBAC requirements for namespace-scoped versus cluster-wide deployments.

Please verify that:

  1. The dgdr-profiling-nodes cluster role has the appropriate permissions for namespace-restricted profiling operations
  2. The planner functionality is intentionally disabled or not needed in namespace-restricted mode
  3. Corresponding ClusterRole/RoleBinding resources have been updated to match this change
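
For reference, the inverted conditional described above plausibly takes a shape like the sketch below. The values key and argument layout are assumptions; the flag name comes from the --dgdr-profiling-cluster-role-name flag mentioned later in this review, and the planner argument is omitted rather than guessed.

```yaml
# Hedged sketch of the deployment.yaml conditional (keys and role-name
# suffixes are assumptions based on this review's descriptions).
args:
  {{- if .Values.namespaceRestriction.enabled }}
  - --dgdr-profiling-cluster-role-name={{ .Release.Name }}-dgdr-profiling-nodes
  {{- else }}
  - --dgdr-profiling-cluster-role-name={{ .Release.Name }}-dgdr-profiling
  # (an additional planner role argument is reportedly passed only in this branch)
  {{- end }}
```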
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (1)

94-108: LGTM!

The new Backend field is properly defined with appropriate validation (required, enum constraint for vllm/sglang/trtllm), and the updated comments clearly document the auto-population behavior for deployment.model and engine.backend in the profiling config. This implementation aligns well with the PR objectives.

benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml (1)

1-42: LGTM!

This MoE profiling manifest correctly uses the new top-level backend field and properly configures MoE-specific settings (is_moe_model: true) while omitting engine.backend from profilingConfig. The configMapRef pattern for referencing the base disaggregation config is appropriate for MoE models.

benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml (1)

1-33: LGTM!

This AI Configurator profiling manifest correctly uses the new top-level backend field and properly configures AIC-specific settings (use_ai_configurator: true, aic_system, aic_model_name, aic_backend_version) while omitting engine.backend from profilingConfig. The structure aligns well with the updated CRD design for simulation-based profiling.
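
To make the AIC-specific knobs named above concrete, a hedged sketch follows. Only the key names come from this review; the nesting under profilingConfig.config and all values are assumptions.

```yaml
# Sketch of AIC-specific profiling settings (nesting and values are illustrative).
profilingConfig:
  config:
    use_ai_configurator: true
    aic_system: h200                 # placeholder target system
    aic_model_name: QWEN3_32B        # placeholder AIC model identifier
    aic_backend_version: "0.20.0"    # placeholder version string
```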

docs/kubernetes/api_reference.md (4)

278-280: New top-level fields are well-documented with clear auto-population semantics.

The documentation clearly explains that modelName and backend are automatically propagated into profilingConfig.config.deployment.model and profilingConfig.config.engine.backend respectively, which aligns with the PR goal of simplifying the API surface. The validation notes properly indicate these fields are required and should not be duplicated in the profiling config.
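
As a concrete illustration of that propagation (values are illustrative; the target paths are the ones named in the documentation):

```yaml
# What the user writes:
spec:
  modelName: Qwen/Qwen3-32B
  backend: sglang
  profilingConfig:
    config: {}

# What the controller effectively fills in before profiling runs:
#   profilingConfig.config.deployment.model  -> Qwen/Qwen3-32B
#   profilingConfig.config.engine.backend    -> sglang
```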


282-282: DeploymentOverrides field documentation is clear and optional.

The field is properly marked as optional and the description correctly notes it only applies when autoApply is true. This prevents confusion about when these overrides take effect.


300-300: Status fields correctly marked optional with proper descriptions.

The backend, profilingResults, generatedDeployment, and deployment status fields are appropriately marked as optional since they're populated by the controller during the DGDR lifecycle. Descriptions are clear about their role and population logic.

Also applies to: 303-305


279-279: Backend enum values are consistent across the codebase.

Verification confirms that the backend enum [vllm sglang trtllm] in the documentation matches the CRD definitions, Go type constants, and controller validation logic. All three backends are consistently referenced across:

  • CRD files: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentrequests.yaml and deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentrequests.yaml
  • Go types: deploy/cloud/operator/internal/dynamo/graph.go (lines 592-594)
  • Controller validation: deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (lines 153-155)

No inconsistencies detected.
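
For readers cross-checking the CRDs, the backend property plausibly renders in the generated OpenAPI schema along these lines. This is a sketch: the enum values come from the PR, while the description text and surrounding structure are assumptions.

```yaml
# Excerpt-style sketch of the spec.backend property in the CRD schema.
backend:
  description: Inference backend to profile; automatically mapped into
    profilingConfig.config.engine.backend.
  type: string
  enum:
    - vllm
    - sglang
    - trtllm
```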

deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (2)

527-529: Addition of controller-manager to queue-reader-binding is appropriate.

The controller-manager service account now has access to queue resources, which aligns with the profiling workflow and cluster-wide scheduling needs.


530-576: New cluster-resource-reader ClusterRole provides appropriate read-only access.

The new ClusterRole grants read-only access to cluster-scoped resources (nodes and clusterroles) needed for GPU discovery and RBAC verification. The rule set is minimal and follows the principle of least privilege. Naming and labeling are consistent with existing RBAC conventions.
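
A minimal sketch of such a read-only role, assuming standard Kubernetes RBAC conventions; the metadata name is a placeholder rather than the chart's actual templated value.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-resource-reader       # placeholder; the chart templates the real name
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["clusterroles"]
    verbs: ["get", "list", "watch"]
```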

deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml (2)

89-104: Namespace-restricted mode properly adds nodes access via ClusterRoleBinding.

The addition correctly binds the cluster-scoped nodes ClusterRole to the profiling-job ServiceAccount in namespace-restricted mode. This is the appropriate pattern for cluster-scoped resources that need to be accessed from a namespace-restricted context. The conditional template is correctly structured.
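
A hedged sketch of that binding pattern, with placeholder names for the binding, ServiceAccount, and restricted namespace:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dgdr-profiling-nodes-binding   # placeholder name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dgdr-profiling-nodes           # the nodes-reading role discussed above
subjects:
  - kind: ServiceAccount
    name: profiling-job                # placeholder ServiceAccount name
    namespace: dynamo                  # placeholder restricted namespace
```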


111-111: No dangling references detected—renaming is correctly applied and consistent.

Verification confirms the ClusterRole name change from dgdr-profiling-nodes to dgdr-profiling in cluster-wide mode is applied consistently:

  • ClusterRole metadata.name (line 111): dgdr-profiling
  • ClusterRoleBinding metadata.name (line 143): dgdr-profiling
  • roleRef.name (line 150): dgdr-profiling

The operator receives the role name dynamically via the --dgdr-profiling-cluster-role-name flag, avoiding hardcoded references. Helm templates correctly pass the appropriate name based on deployment mode (namespace-restricted retains -dgdr-profiling-nodes, cluster-wide uses -dgdr-profiling), and both match their corresponding RBAC definitions.

@copy-pr-bot

copy-pr-bot bot commented Oct 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tedzhouhk

We need to update the docs: pre_deployment_profiling.md and sla_planner_quickstart.md.

metadata:
  name: sla-aic
spec:
  modelName: Qwen/Qwen3-32B

Does this modelName override profile_sla.py's deployment.model?
