@biswapanda biswapanda commented Oct 23, 2025

Overview:

Details:

closes nvbug: https://nvbugspro.nvidia.com/bug/5609103
closes: DYN-1338

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Documentation

    • Significantly expanded recipes documentation with comprehensive quick-start guides, prerequisites checklist, manual deployment steps, configuration guidance, monitoring commands, and troubleshooting resources.
  • New Features

    • Added automated model download and caching capability for supported foundation models.
  • Bug Fixes

    • Improved deployment workflow to correctly handle job names from configuration files.

@biswapanda biswapanda requested review from a team as code owners October 23, 2025 22:05
@github-actions github-actions bot added the fix label Oct 23, 2025
@biswapanda biswapanda self-assigned this Oct 23, 2025
coderabbitai bot commented Oct 23, 2025

Walkthrough

Comprehensive documentation rewrite for recipes with expanded guides, prerequisites, and deployment instructions. New Kubernetes Job manifest added for automated model caching. Script updated to dynamically reference job names instead of using hardcoded values.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation Enhancement: recipes/README.md | Restructured from compact table format to comprehensive narrative documentation. Added descriptive model overview, Quick Start section with deployment options, expanded prerequisites checklist, detailed Usage section for run.sh, manual deployment steps with model-specific examples (Llama-3-70B, GPT-OSS-120B, DeepSeek-R1), configuration subsection, enhanced monitoring and troubleshooting guidance, and resources links. |
| Infrastructure Configuration: recipes/deepseek-r1/model-cache/model-download.yaml | New Kubernetes Job manifest for orchestrated model downloads from HuggingFace. Configures Python 3.10-slim container with huggingface_hub dependencies, secret-based authentication, environment variables for model identification and revision, PersistentVolumeClaim mount for /model-store, and Job completion policies (backoffLimit: 3, completions: 1, parallelism: 1). |
| Script Logic Update: recipes/run.sh | Replaced hardcoded job naming (job/model-download-${MODEL}) with dynamic extraction of actual job name from model-download.yaml. Script now reads MODEL_DOWNLOAD_JOB_NAME from manifest and uses it for job status monitoring and logging. |
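
The job-name extraction described for recipes/run.sh could look roughly like the sketch below. It is not the script's actual code: the helper name `get_job_name` is hypothetical, and it assumes a single-Job manifest whose `metadata.name` is the first `name:` key indented by exactly two spaces, with an unquoted value. A YAML-aware tool would be more robust.

```shell
#!/bin/sh
# Hypothetical helper: print the first two-space-indented "name:" value,
# which in a typical single-Job manifest is metadata.name.
get_job_name() {
    manifest="$1"
    # Deeper-indented names (container or volume names) are ignored
    # because the pattern requires exactly two leading spaces.
    sed -n 's/^  name:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$manifest" | head -n 1
}

# Example usage (needs cluster access; the timeout value is illustrative):
#   MODEL_DOWNLOAD_JOB_NAME=$(get_job_name "model-cache/model-download.yaml")
#   kubectl wait --for=condition=complete "job/${MODEL_DOWNLOAD_JOB_NAME}" \
#       -n "$NAMESPACE" --timeout=600s
```

Reading the name from the manifest keeps the monitoring step correct even when a recipe names its Job differently from `model-download-${MODEL}`.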

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

The changes are heterogeneous in nature (documentation, infrastructure configuration, and script logic), but each component involves straightforward modifications without complex reasoning requirements. Documentation rewrites lack logical density, YAML configuration is standard Kubernetes manifest structure, and the script change is a direct refactor from hardcoded to dynamic naming.

Poem

🐰 Hop, hop, hooray! The docs now glow so bright,
With cached models ready and jobs running right,
No hardcoded names—just dynamic delight,
These recipes take flight! ✨

Pre-merge checks

❌ Failed checks (2 warnings, 1 inconclusive)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description Check | ⚠️ Warning | The pull request description contains all required sections from the template (Overview, Details, Where should the reviewer start, and Related Issues), but every section is populated only with placeholder or template text. The Overview, Details, and Where should the reviewer start sections contain only HTML comments with no actual content, and the Related Issues section lists only a placeholder issue number "#xxx". This means the description provides no substantive information about what changes were made, why they were made, or where reviewers should focus their attention. | Fill in all required sections with actual content: provide a clear overview of the purpose and scope of the changes, describe the specific modifications made across the three affected files (README.md, model-download.yaml, and run.sh), identify which files need close reviewer attention, and reference any related GitHub issues using the action keywords (Closes/Fixes/Resolves). This will help reviewers understand the context and rationale for the changes. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Title Check | ❓ Inconclusive | The title "fix: model recipes" is vague and uses non-descriptive language that lacks specificity. While it is related to the changeset (which does address issues in the recipes area), the title does not clearly convey what is actually being fixed. The PR contains significant changes including a README rewrite, a new Kubernetes manifest, and modifications to the run.sh script, but the title provides no indication of which of these represents the primary focus or what specific problem is being addressed. | Consider revising the title to be more specific and descriptive, such as "fix: dynamic job name extraction in model recipes" or "docs: expand recipes documentation with deployment examples" depending on which change is most significant. A more concrete title will help reviewers immediately understand the primary intent of the changes. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (3)
recipes/deepseek-r1/model-cache/model-download.yaml (3)

19-19: Use pinned container image digest for reproducibility.

The image python:3.10-slim uses a mutable tag, so different runs could pull different versions. Use a specific version tag or SHA256 digest to ensure consistent builds and easier debugging.

Example:

-          image: python:3.10-slim
+          image: python:3.10-slim@sha256:YOUR_DIGEST_HERE
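
To spot mutable tags like the one flagged here, a small check along the following lines could list image references that are not pinned by digest. This is only a sketch: both function names are hypothetical, and the parsing assumes one `image:` key per line with an unquoted value and no trailing comment.

```shell
#!/bin/sh
# Return 0 if the image reference ends in "@sha256:<64 lowercase hex chars>".
is_digest_pinned() {
    printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Print every "image:" value in a manifest that is not pinned by digest.
list_unpinned_images() {
    sed -n 's/^[[:space:]-]*image:[[:space:]]*//p' "$1" | while IFS= read -r ref; do
        is_digest_pinned "$ref" || printf '%s\n' "$ref"
    done
}

# Example usage:
#   list_unpinned_images recipes/deepseek-r1/model-cache/model-download.yaml
```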

15-40: Add security context to restrict privilege escalation and enforce least-privilege.

The container currently runs as root without privilege escalation restrictions. Add a securityContext to align with Kubernetes security best practices:

     spec:
       restartPolicy: Never
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 1000
+        fsGroup: 1000
       containers:
         - name: model-download
           image: python:3.10-slim
           command: ["sh", "-c"]
+          securityContext:
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop:
+                - ALL

Note: If the model download requires root access, document why and accept the security trade-off explicitly.


17-37: Add resource requests and limits to prevent unbounded cluster resource consumption.

The container lacks CPU/memory resource constraints, which can destabilize the cluster if the download is large or the pip install is resource-intensive.

       containers:
         - name: model-download
           image: python:3.10-slim
+          resources:
+            requests:
+              cpu: "2"
+              memory: "4Gi"
+            limits:
+              cpu: "4"
+              memory: "8Gi"
           command: ["sh", "-c"]

Adjust values based on your model size and cluster capacity.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 640c2d3 and 1dd9d7e.

📒 Files selected for processing (3)
  • recipes/README.md (1 hunks)
  • recipes/deepseek-r1/model-cache/model-download.yaml (1 hunks)
  • recipes/run.sh (1 hunks)
🧰 Additional context used
🪛 Checkov (3.2.334)
recipes/deepseek-r1/model-cache/model-download.yaml

[medium] 3-44: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 3-44: Minimize the admission of root containers

(CKV_K8S_23)

🪛 markdownlint-cli2 (0.18.1)
recipes/README.md

320-320: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo

@grahamking
Contributor

Could you add some detail to the title?

@alec-flowers
Contributor

Should we also put the model-cache and model-download in the same yaml so they get executed at once instead of two commands?
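
One way to get the single-command behaviour asked about here is to feed both manifests to kubectl as one multi-document YAML stream. The sketch below assumes the PVC and Job manifests sit side by side; the helper name `combine_manifests` and the file name `model-cache.yaml` are assumptions, since only `model-download.yaml` is named in this PR.

```shell
#!/bin/sh
# Concatenate manifests into one multi-document YAML stream on stdout.
# Assumes each input file ends with a newline.
combine_manifests() {
    first=1
    for f in "$@"; do
        # Documents in a YAML stream are separated by a "---" line.
        [ "$first" -eq 1 ] || printf '%s\n' '---'
        cat "$f"
        first=0
    done
}

# Example usage (needs cluster access; file names are assumptions):
#   combine_manifests model-cache/model-cache.yaml model-cache/model-download.yaml \
#       | kubectl apply -n "$NAMESPACE" -f -
# kubectl can also apply every manifest in a directory in one command:
#   kubectl apply -n "$NAMESPACE" -f model-cache/
```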

kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}

# Deploy model with automatic download and benchmarking
./run.sh --model llama-3-70b --framework vllm agg

Contributor

I feel agg should also be a flag. It should be clear that you have to provide 3 args: --model, --framework, and --deployment, or something like that.

Contributor Author

added --deployment
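
For illustration, parsing and validating the three required flags could be sketched as follows. This is not the actual run.sh implementation: the function name `parse_args` and the variable names are hypothetical, and only the three flags discussed in this thread are handled.

```shell
#!/bin/sh
# Hypothetical parser: sets MODEL, FRAMEWORK and DEPLOYMENT from flags and
# returns 1 (with a message on stderr) if any of the three is missing.
parse_args() {
    MODEL="" FRAMEWORK="" DEPLOYMENT=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --model|--framework|--deployment)
                # Every flag needs a value after it.
                if [ "$#" -lt 2 ]; then
                    echo "missing value for $1" >&2
                    return 1
                fi
                case "$1" in
                    --model)      MODEL="$2" ;;
                    --framework)  FRAMEWORK="$2" ;;
                    --deployment) DEPLOYMENT="$2" ;;
                esac
                shift 2
                ;;
            *)
                # Bare positional arguments such as "agg" are rejected.
                echo "unknown argument: $1" >&2
                return 1
                ;;
        esac
    done
    if [ -z "$MODEL" ] || [ -z "$FRAMEWORK" ] || [ -z "$DEPLOYMENT" ]; then
        echo "usage: run.sh --model <model> --framework <framework> --deployment <mode>" >&2
        return 1
    fi
}

# Example usage:
#   parse_args --model llama-3-70b --framework vllm --deployment agg
```

Rejecting bare positional arguments makes a forgotten `--deployment` fail loudly instead of being silently misread.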

./run.sh --model llama-3-70b --framework vllm agg

# Or skip model download if model has been already downloaded to model cache PVC
./run.sh --skip-model-cache --model llama-3-70b --framework vllm agg

Contributor

Shouldn't the model download technically be a no-op? Is there harm in having it run if it will just exit successfully when it finds the model?

Contributor Author

agreed, removed --skip-model-cache
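
The no-op behaviour discussed in this thread can be sketched as a guard around the download step. This is an assumption-laden illustration, not the Job's actual logic: the marker file name `.download-complete` and the function name `download_if_missing` are hypothetical, and the real download tool may already skip files that are present.

```shell
#!/bin/sh
# Run the given download command only if the target directory has no
# completion marker; write the marker after a successful download.
download_if_missing() {
    target_dir="$1"
    shift
    marker="$target_dir/.download-complete"   # hypothetical marker name
    if [ -f "$marker" ]; then
        echo "model already present in $target_dir, skipping download"
        return 0
    fi
    mkdir -p "$target_dir" || return 1
    # "$@" is the download command supplied by the caller.
    "$@" || return 1
    touch "$marker"
}

# Example usage (the download command is a placeholder):
#   download_if_missing /model-store/my-model my_download_command --some-arg
```

With a guard like this the download step can always run, which is why a separate skip flag becomes unnecessary.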

kubectl delete namespace $NAMESPACE
```

## Contributing

Contributor

Yes, I'm glad we added a Contributing section, but it's 500 lines down at the bottom. Maybe it should be its own CONTRIBUTING.md.

kubectl exec -n $NAMESPACE -it $(kubectl get pods -n $NAMESPACE -l job-name=$PERF_JOB_NAME -o jsonpath='{.items[0].metadata.name}') -- ls -la /model-cache/perf/
```

## Model-Specific Examples

Contributor

Can we just add these model-specific examples as a README.md under the model folder itself? This markdown is too long now.

# Replace "your-storage-class-name" with your actual storage class

## Model Download

Contributor

Shouldn't this be step 1 in manual model deployment?

kubectl apply -n $NAMESPACE -f <model>/<framework>/<mode>/perf.yaml
```

## Prerequisites

Contributor

I feel like this could just be a link to somewhere with a good explanation. Do we need this extra text here? Someone can just go to the link and make sure it's set up. Maybe you could have a command or two to check that things are set up properly, but I don't think we should have an entire explanation here. It takes up too much valuable README space.

./run.sh --skip-model-cache --model llama-3-70b --framework vllm agg
```

### Option 2: Manual Deployment

Contributor

I like this manual deployment. Short and sweet. Do we need the other ones with all the extra information? Maybe that can be a different markdown? These are basically just duplicating info so we should just keep one or the other.

See

Manual Model Deployment

@alec-flowers
Contributor

Main feedback is that the README needs to be much more concise. There are a lot of extras in there that I think you can just remove.
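
One way to act on the earlier suggestion in this review, replacing the long Prerequisites text with a command or two that checks the setup, is sketched below. The helper name `check_commands` is hypothetical, and the list of tools to check is an assumption about what the recipes need.

```shell
#!/bin/sh
# Report each command missing from PATH on stderr; return 1 if any is absent.
check_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing prerequisite: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (the cluster checks need kubectl access):
#   check_commands kubectl || exit 1
#   kubectl get storageclass            # confirm a storage class exists
#   kubectl get secret -n "$NAMESPACE"  # confirm the HuggingFace secret was applied
```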

@biswapanda biswapanda changed the title fix: model recipes fix: consistent model recipes and update simplified doc Oct 24, 2025