feat: turn profiling k8s jobs into sample DGDR requests #3864
Conversation
Walkthrough

This pull request refactors the profiling infrastructure by introducing a top-level `backend` field on the DynamoGraphDeploymentRequest spec and converting the profiling Kubernetes Job manifests into sample DGDR manifests.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~35 minutes

The changes span multiple file categories (YAML manifests, Go types, controller logic, RBAC templates, and documentation) with moderate logic density. While the refactoring introduces a new API field and modifies controller behavior, the changes are cohesive and serve a unified purpose. The heterogeneity of file types requires separate reasoning for each category, but repetitive patterns (similar manifest structures, consistent CRD updates) reduce complexity. Test updates align well with implementation changes.
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
deploy/cloud/operator/Makefile (1)
270-270: Consider pinning the CRD reference docs version for reproducibility. Using `latest` for build tool versions can lead to non-reproducible builds and unexpected behavior when the tool is updated. Consider pinning to a specific version (e.g., `v0.0.12` or a newer stable release) to ensure consistent documentation generation across environments and over time.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`deploy/cloud/operator/go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (20)
- benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml (1 hunks)
- benchmarks/profiler/deploy/profile_sla_aic_job.yaml (0 hunks)
- benchmarks/profiler/deploy/profile_sla_dgdr.yaml (1 hunks)
- benchmarks/profiler/deploy/profile_sla_job.yaml (0 hunks)
- benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml (1 hunks)
- benchmarks/profiler/deploy/profile_sla_moe_job.yaml (0 hunks)
- benchmarks/profiler/utils/profiler_argparse.py (1 hunks)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentrequests.yaml (3 hunks)
- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml (2 hunks)
- deploy/cloud/operator/Makefile (1 hunks)
- deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (1 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentrequests.yaml (3 hunks)
- deploy/cloud/operator/config/samples/nvidia.com_v1alpha1_dynamographdeploymentrequest.yaml (1 hunks)
- deploy/cloud/operator/go.mod (1 hunks)
- deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (4 hunks)
- deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller_test.go (16 hunks)
- deploy/cloud/operator/internal/dynamo/graph_test.go (4 hunks)
- docs/kubernetes/api_reference.md (9 hunks)
💤 Files with no reviewable changes (3)
- benchmarks/profiler/deploy/profile_sla_job.yaml
- benchmarks/profiler/deploy/profile_sla_moe_job.yaml
- benchmarks/profiler/deploy/profile_sla_aic_job.yaml
🧰 Additional context used
🧬 Code graph analysis (3)
deploy/cloud/operator/internal/dynamo/graph_test.go (1)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (2)
- `DynamoComponentDeploymentSpec` (38-46)
- `DynamoComponentDeploymentSharedSpec` (48-106)
deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller_test.go (2)
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (3)
- `ProfilingConfigSpec` (50-63)
- `DynamoGraphDeploymentRequest` (207-216)
- `DynamoGraphDeploymentRequestSpec` (91-122)

deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (1)
- `StatePending` (52-52)
deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (1)
deploy/cloud/operator/internal/controller_common/predicate.go (1)
- `Config` (55-71)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: vllm (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: operator (arm64)
- GitHub Check: sglang
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (15)
deploy/cloud/operator/go.mod (1)
77-84: LGTM! The indirect dependency updates to newer patch versions are routine maintenance and pose no concerns.
benchmarks/profiler/utils/profiler_argparse.py (1)
298-300: LGTM! The clarified error message and inline comment improve user experience by explicitly stating that at least one of `--model` or `--config` is required. This aligns well with the new backend field design.

benchmarks/profiler/deploy/profile_sla_dgdr.yaml (1)
1-32: LGTM! This DGDR manifest correctly uses the new top-level `modelName` and `backend` fields while omitting `deployment.model` and `engine.backend` from `profilingConfig`, as these are now automatically populated by the controller. The structure aligns well with the updated CRD design.
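As a quick orientation for reviewers, a rough sketch of the new manifest shape is below. The field names (`modelName`, `backend`, `profilingConfig`, `autoApply`) are taken from this PR; the metadata name, the example model, and the empty `profilingConfig.config` are illustrative assumptions rather than the literal contents of `profile_sla_dgdr.yaml`.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: profile-sla-example        # hypothetical name
spec:
  modelName: Qwen/Qwen3-32B        # example model name
  backend: vllm                    # one of: vllm, sglang, trtllm
  autoApply: true                  # optional: apply the generated deployment automatically
  profilingConfig:
    config: {}                     # deployment.model and engine.backend intentionally omitted;
                                   # the controller fills them from spec.modelName / spec.backend
```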
deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1)

130-135: Verify the RBAC configuration change for namespace-restricted mode. The conditional logic has been inverted, and when `namespaceRestriction.enabled` is true, the deployment now uses the `dgdr-profiling-nodes` cluster role instead of `dgdr-profiling` and omits the `planner` role argument. This suggests different RBAC requirements for namespace-scoped versus cluster-wide deployments (see the sketch after this list). Please verify that:
- The `dgdr-profiling-nodes` cluster role has the appropriate permissions for namespace-restricted profiling operations
- The planner functionality is intentionally disabled or not needed in namespace-restricted mode
- Corresponding ClusterRole/RoleBinding resources have been updated to match this change
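For concreteness, the kind of conditional being discussed looks roughly like the sketch below. The `--dgdr-profiling-cluster-role-name` flag and the `namespaceRestriction.enabled` value are referenced elsewhere in this review; the `{{ .Release.Name }}` prefix and the planner flag name are assumptions, not the actual template.

```yaml
# Operator container args in the Deployment template (sketch, not the real file)
{{- if .Values.namespaceRestriction.enabled }}
- --dgdr-profiling-cluster-role-name={{ .Release.Name }}-dgdr-profiling-nodes
{{- else }}
- --dgdr-profiling-cluster-role-name={{ .Release.Name }}-dgdr-profiling
- --planner-cluster-role-name={{ .Release.Name }}-planner   # hypothetical flag name
{{- end }}
```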
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (1)
94-108: LGTM! The new `Backend` field is properly defined with appropriate validation (required, enum constraint for vllm/sglang/trtllm), and the updated comments clearly document the auto-population behavior for `deployment.model` and `engine.backend` in the profiling config. This implementation aligns well with the PR objectives.

benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml (1)
1-42: LGTM! This MoE profiling manifest correctly uses the new top-level `backend` field and properly configures MoE-specific settings (`is_moe_model: true`) while omitting `engine.backend` from `profilingConfig`. The `configMapRef` pattern for referencing the base disaggregation config is appropriate for MoE models.

benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml (1)
1-33: LGTM! This AI Configurator profiling manifest correctly uses the new top-level `backend` field and properly configures AIC-specific settings (`use_ai_configurator: true`, `aic_system`, `aic_model_name`, `aic_backend_version`) while omitting `engine.backend` from `profilingConfig`. The structure aligns well with the updated CRD design for simulation-based profiling.

docs/kubernetes/api_reference.md (4)
278-280: New top-level fields are well-documented with clear auto-population semantics. The documentation clearly explains that `modelName` and `backend` are automatically propagated into `profilingConfig.config.deployment.model` and `profilingConfig.config.engine.backend` respectively, which aligns with the PR goal of simplifying the API surface. The validation notes properly indicate these fields are required and should not be duplicated in the profiling config.
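Concretely, the propagation described above amounts to the following (values are illustrative; the nesting follows the `profilingConfig.config.deployment.model` and `profilingConfig.config.engine.backend` paths named in the docs):

```yaml
spec:
  modelName: Qwen/Qwen3-32B        # user-provided
  backend: vllm                    # user-provided
  profilingConfig:
    config:
      deployment:
        model: Qwen/Qwen3-32B      # populated by the controller from spec.modelName
      engine:
        backend: vllm              # populated by the controller from spec.backend
```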
282-282: DeploymentOverrides field documentation is clear and optional. The field is properly marked as optional and the description correctly notes it only applies when `autoApply` is true. This prevents confusion about when these overrides take effect.
300-300: Status fields correctly marked optional with proper descriptions. The `backend`, `profilingResults`, `generatedDeployment`, and `deployment` status fields are appropriately marked as optional since they're populated by the controller during the DGDR lifecycle. Descriptions are clear about their role and population logic.
Also applies to: 303-305
279-279: Backend enum values are consistent across the codebase. Verification confirms that the backend enum `[vllm sglang trtllm]` in the documentation matches the CRD definitions, Go type constants, and controller validation logic. All three backends are consistently referenced across:
- CRD files: `deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeploymentrequests.yaml` and `deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeploymentrequests.yaml`
- Go types: `deploy/cloud/operator/internal/dynamo/graph.go` (lines 592-594)
- Controller validation: `deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go` (lines 153-155)

No inconsistencies detected.
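For reference, a generated CRD schema typically expresses this constraint roughly as follows; only the enum values come from this PR, and the surrounding OpenAPI structure is abbreviated:

```yaml
backend:
  type: string
  enum:
    - vllm
    - sglang
    - trtllm
```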
deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (2)
527-529: Addition of controller-manager to queue-reader-binding is appropriate. The controller-manager service account now has access to queue resources, which aligns with the profiling workflow and cluster-wide scheduling needs.
530-576: New cluster-resource-reader ClusterRole provides appropriate read-only access. The new ClusterRole grants read-only access to cluster-scoped resources (nodes and clusterroles) needed for GPU discovery and RBAC verification. The rule set is minimal and follows the principle of least privilege. Naming and labeling are consistent with existing RBAC conventions.
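A minimal sketch of a read-only ClusterRole matching that description, assuming the usual get/list/watch verbs (the actual name, labels, and Helm templating are omitted):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-resource-reader    # placeholder name
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["clusterroles"]
    verbs: ["get", "list", "watch"]
```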
deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml (2)
89-104: Namespace-restricted mode properly adds nodes access via ClusterRoleBinding. The addition correctly binds the cluster-scoped nodes ClusterRole to the profiling-job ServiceAccount in namespace-restricted mode. This is the appropriate pattern for cluster-scoped resources that need to be accessed from a namespace-restricted context. The conditional template is correctly structured.
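The pattern being approved here, sketched with placeholder names (in the chart this block is wrapped in the namespace-restriction conditional, and the real names are templated): a ClusterRoleBinding is needed because nodes are cluster-scoped, even when everything else stays namespace-restricted.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dgdr-profiling-nodes       # illustrative; the chart derives the real name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dgdr-profiling-nodes
subjects:
  - kind: ServiceAccount
    name: profiling-job            # hypothetical ServiceAccount name
    namespace: my-restricted-namespace   # the namespace the operator is restricted to
```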
111-111: No dangling references detected; the renaming is correctly applied and consistent. Verification confirms the ClusterRole name change from `dgdr-profiling-nodes` to `dgdr-profiling` in cluster-wide mode is applied consistently:
- ClusterRole metadata.name (line 111): `dgdr-profiling`
- ClusterRoleBinding metadata.name (line 143): `dgdr-profiling`
- roleRef.name (line 150): `dgdr-profiling`

The operator receives the role name dynamically via the `--dgdr-profiling-cluster-role-name` flag, avoiding hardcoded references. Helm templates correctly pass the appropriate name based on deployment mode (namespace-restricted retains `-dgdr-profiling-nodes`, cluster-wide uses `-dgdr-profiling`), and both match their corresponding RBAC definitions.
we need to update the docs:
metadata:
  name: sla-aic
spec:
  modelName: Qwen/Qwen3-32B
Does this `modelName` override `profile_sla.py`'s `deployment.model`?
Overview:
Rework the profiling Kubernetes Job manifests into DGDR manifests. Also includes some bug fixes to get everything working E2E again.
Details:
- Relax the `--config` validation check in the profiler when the user provides `--model` (at least one of the two is now required)
- Promote `backend` into a high-level DGDR field

Where should the reviewer start?
Related Issues:
Summary by CodeRabbit
Release Notes
New Features
- Added `backend` field to DynamoGraphDeploymentRequest to specify inference backend (vllm, sglang, trtllm) at the top level.

Refactor
- `modelName` and `backend` are now top-level spec fields automatically mapped to profiling configuration, reducing configuration complexity.

Chores