fix: consistent model recipes and update simplified doc #3858

biswapanda · 2025-10-23T22:05:16Z

Overview:

Details:

closes nvbug: https://nvbugspro.nvidia.com/bug/5609103
closes: DYN-1338

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

Documentation
- Significantly expanded recipes documentation with comprehensive quick-start guides, prerequisites checklist, manual deployment steps, configuration guidance, monitoring commands, and troubleshooting resources.
New Features
- Added automated model download and caching capability for supported foundation models.
Bug Fixes
- Improved deployment workflow to correctly handle job names from configuration files.

coderabbitai · 2025-10-23T22:11:59Z

Walkthrough

Comprehensive documentation rewrite for recipes with expanded guides, prerequisites, and deployment instructions. New Kubernetes Job manifest added for automated model caching. Script updated to dynamically reference job names instead of using hardcoded values.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation Enhancement `recipes/README.md` | Restructured from compact table format to comprehensive narrative documentation. Added descriptive model overview, Quick Start section with deployment options, expanded prerequisites checklist, detailed Usage section for run.sh, manual deployment steps with model-specific examples (Llama-3-70B, GPT-OSS-120B, DeepSeek-R1), configuration subsection, enhanced monitoring and troubleshooting guidance, and resources links. |
| Infrastructure Configuration `recipes/deepseek-r1/model-cache/model-download.yaml` | New Kubernetes Job manifest for orchestrated model downloads from HuggingFace. Configures Python 3.10-slim container with huggingface_hub dependencies, secret-based authentication, environment variables for model identification and revision, PersistentVolumeClaim mount for /model-store, and Job completion policies (backoffLimit: 3, completions: 1, parallelism: 1). |
| Script Logic Update `recipes/run.sh` | Replaced hardcoded job naming (job/model-download-${MODEL}) with dynamic extraction of actual job name from model-download.yaml. Script now reads MODEL_DOWNLOAD_JOB_NAME from manifest and uses it for job status monitoring and logging. |
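
The dynamic job-name lookup described for `recipes/run.sh` could look roughly like the sketch below. How the script actually parses the manifest is not shown in this thread, so the `get_job_name` helper, the awk approach, and the sample job name are all assumptions made for illustration. It also assumes `metadata.name` is the first `name:` key under a top-level `metadata:` block.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print metadata.name from the first document of a manifest.
# Assumes "metadata:" is a top-level key with "name:" indented beneath it.
get_job_name() {
  awk '
    /^metadata:/ { in_meta = 1; next }
    in_meta && /^[^ \t]/ { exit }          # reached the next top-level key
    in_meta && /^[ \t]+name:/ {
      sub(/^[ \t]+name:[ \t]*/, "")        # keep only the value
      gsub(/"/, "")                        # drop optional double quotes
      print
      exit
    }
  ' "$1"
}

# Illustrative manifest; the job name here is made up for the example.
manifest="$(mktemp)"
cat > "${manifest}" <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download-example
spec:
  backoffLimit: 3
EOF

MODEL_DOWNLOAD_JOB_NAME="$(get_job_name "${manifest}")"
echo "job/${MODEL_DOWNLOAD_JOB_NAME}"
rm -f "${manifest}"
```

A script could then pass `job/${MODEL_DOWNLOAD_JOB_NAME}` to `kubectl wait --for=condition=complete` and `kubectl logs` instead of a hardcoded name.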

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

The changes are heterogeneous in nature (documentation, infrastructure configuration, and script logic), but each component involves straightforward modifications without complex reasoning requirements. Documentation rewrites lack logical density, YAML configuration is standard Kubernetes manifest structure, and the script change is a direct refactor from hardcoded to dynamic naming.

Poem

🐰 Hop, hop, hooray! The docs now glow so bright,
With cached models ready and jobs running right,
No hardcoded names—just dynamic delight,
These recipes take flight! ✨

Pre-merge checks

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ⚠️ Warning | The pull request description contains all required sections from the template (Overview, Details, Where should the reviewer start, and Related Issues), but every section is populated only with placeholder or template text. The Overview, Details, and Where should the reviewer start sections contain only HTML comments with no actual content, and the Related Issues section lists only a placeholder issue number "#xxx". This means the description provides no substantive information about what changes were made, why they were made, or where reviewers should focus their attention. | Fill in all required sections with actual content: provide a clear overview of the purpose and scope of the changes, describe the specific modifications made across the three affected files (README.md, model-download.yaml, and run.sh), identify which files need close reviewer attention, and reference any related GitHub issues using the action keywords (Closes/Fixes/Resolves). This will help reviewers understand the context and rationale for the changes. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |
| Title Check | ❓ Inconclusive | The title "fix: model recipes" is vague and uses non-descriptive language that lacks specificity. While it is related to the changeset (which does address issues in the recipes area), the title does not clearly convey what is actually being fixed. The PR contains significant changes including a README rewrite, a new Kubernetes manifest, and modifications to the run.sh script, but the title provides no indication of which of these represents the primary focus or what specific problem is being addressed. | Consider revising the title to be more specific and descriptive, such as "fix: dynamic job name extraction in model recipes" or "docs: expand recipes documentation with deployment examples" depending on which change is most significant. A more concrete title will help reviewers immediately understand the primary intent of the changes. |


coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (3)

recipes/deepseek-r1/model-cache/model-download.yaml (3)
19-19: Use pinned container image digest for reproducibility.

The image python:3.10-slim uses a mutable tag, so different runs could pull different versions. Use a specific version tag or SHA256 digest to ensure consistent builds and easier debugging.

Example:
-          image: python:3.10-slim
+          image: python:3.10-slim@sha256:YOUR_DIGEST_HERE
15-40: Add security context to restrict privilege escalation and enforce least-privilege.

The container currently runs as root without privilege escalation restrictions. Add a securityContext to align with Kubernetes security best practices:
     spec:
       restartPolicy: Never
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 1000
+        fsGroup: 1000
       containers:
         - name: model-download
           image: python:3.10-slim
           command: ["sh", "-c"]
+          securityContext:
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop:
+                - ALL
Note: If the model download requires root access, document why and accept the security trade-off explicitly.

17-37: Add resource requests and limits to prevent unbounded cluster resource consumption.

The container lacks CPU/memory resource constraints, which can destabilize the cluster if the download is large or the pip install is resource-intensive.
       containers:
         - name: model-download
           image: python:3.10-slim
+          resources:
+            requests:
+              cpu: "2"
+              memory: "4Gi"
+            limits:
+              cpu: "4"
+              memory: "8Gi"
           command: ["sh", "-c"]
Adjust values based on your model size and cluster capacity.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 640c2d3 and 1dd9d7e.

📒 Files selected for processing (3)

recipes/README.md (1 hunks)
recipes/deepseek-r1/model-cache/model-download.yaml (1 hunks)
recipes/run.sh (1 hunks)

🧰 Additional context used

🪛 Checkov (3.2.334)

recipes/deepseek-r1/model-cache/model-download.yaml

[medium] 3-44: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)

[medium] 3-44: Minimize the admission of root containers

(CKV_K8S_23)

🪛 markdownlint-cli2 (0.18.1)

recipes/README.md

320-320: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: trtllm (arm64)
GitHub Check: vllm (arm64)
GitHub Check: operator (arm64)
GitHub Check: vllm (amd64)
GitHub Check: operator (amd64)
GitHub Check: Build and Test - dynamo


grahamking · 2025-10-24T13:00:41Z

Could you add some detail to the title?

alec-flowers · 2025-10-24T17:14:55Z

Should we also put the model-cache and model-download in the same yaml so they get executed at once instead of two commands?
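
One way to get a single apply step without merging the files in the repository is to join them into a multi-document YAML stream at deploy time. The sketch below is only an illustration: the `join_manifests` helper and the `model-cache.yaml` file name are assumptions, and the stub manifests stand in for the real ones.

```shell
#!/bin/sh
set -eu

# Join manifest files into one multi-document YAML stream.
# Each input is assumed to end with a newline so "---" lands on its own line.
join_manifests() {
  first=1
  for f in "$@"; do
    if [ "${first}" -eq 0 ]; then
      echo "---"
    fi
    cat "${f}"
    first=0
  done
}

# Stub manifests for illustration only; real ones carry full specs.
workdir="$(mktemp -d)"
printf 'kind: PersistentVolumeClaim\n' > "${workdir}/model-cache.yaml"
printf 'kind: Job\n' > "${workdir}/model-download.yaml"

join_manifests "${workdir}/model-cache.yaml" "${workdir}/model-download.yaml" \
  > "${workdir}/combined.yaml"
cat "${workdir}/combined.yaml"
```

The combined file could then be applied with one `kubectl apply -n ${NAMESPACE} -f combined.yaml`.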

alec-flowers · 2025-10-24T17:16:51Z

recipes/README.md

+kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
+
+# Deploy model with automatic download and benchmarking
+./run.sh --model llama-3-70b --framework vllm agg


I feel agg should also be a flag. It should be clear you have to provide 3 args - --model, --framework, --deployment or something like that.

added --deployment
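
A minimal sketch of what requiring all three flags might look like. This is not the actual `run.sh` argument handling, which is not shown in this thread; the `parse_args` function and its error messages are hypothetical, and only the flag names and example values come from the discussion above.

```shell
#!/bin/sh
set -eu

# Hypothetical parser: every option takes a value, and all three are required.
parse_args() {
  MODEL="" FRAMEWORK="" DEPLOYMENT=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --model|--framework|--deployment)
        if [ "$#" -lt 2 ]; then
          echo "missing value for $1" >&2
          return 2
        fi
        case "$1" in
          --model)      MODEL="$2" ;;
          --framework)  FRAMEWORK="$2" ;;
          --deployment) DEPLOYMENT="$2" ;;
        esac
        shift 2
        ;;
      *)
        echo "unknown argument: $1" >&2
        return 2
        ;;
    esac
  done
  # Reject the call unless all three values were supplied.
  if [ -z "${MODEL}" ] || [ -z "${FRAMEWORK}" ] || [ -z "${DEPLOYMENT}" ]; then
    echo "usage: run.sh --model <model> --framework <framework> --deployment <mode>" >&2
    return 2
  fi
}

parse_args --model llama-3-70b --framework vllm --deployment agg
echo "${MODEL} ${FRAMEWORK} ${DEPLOYMENT}"
```

Treating a bare positional argument such as `agg` as an error makes the three required inputs explicit in the usage message.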

alec-flowers · 2025-10-24T17:17:26Z

recipes/README.md

+./run.sh --model llama-3-70b --framework vllm agg
+
+# Or skip model download if model has been already downloaded to model cache PVC
+./run.sh --skip-model-cache --model llama-3-70b --framework vllm agg


Shouldn't the model download technically be a no-op? Is there any harm in letting it run if it just exits successfully when it finds the model?

Agreed, removed --skip-model-cache.
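
The no-op behavior discussed here can be sketched as a guard around the download step. Everything below is an assumption for illustration: the marker-file convention, the `download_if_missing` helper, and the placeholder model name are not taken from the PR, and the real job may rely on the download library's own cache checks instead.

```shell
#!/bin/sh
set -eu

# Hypothetical layout: one directory per model plus a marker file
# written only after a download finishes.
MODEL_STORE="${MODEL_STORE:-$(mktemp -d)}"
MODEL_NAME="${MODEL_NAME:-example-org/example-model}"

download_if_missing() {
  target="${MODEL_STORE}/${MODEL_NAME}"
  marker="${target}/.download-complete"
  if [ -f "${marker}" ]; then
    echo "cached"
    return 0
  fi
  mkdir -p "${target}"
  # Placeholder for the real download step.
  : > "${target}/weights.placeholder"
  : > "${marker}"
  echo "downloaded"
}

first="$(download_if_missing)"
second="$(download_if_missing)"
echo "first run: ${first}, second run: ${second}"
```

With a guard like this, rerunning the job against a populated cache exits successfully without downloading again, which is why a separate skip flag becomes unnecessary.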

alec-flowers · 2025-10-24T17:19:12Z

recipes/README.md

+kubectl delete namespace $NAMESPACE
+```
+
+## Contributing


Yes, I'm glad we added a Contributing section. But it's 500 lines down at the bottom. Maybe it should be its own CONTRIBUTING.md?

alec-flowers · 2025-10-24T17:20:03Z

recipes/README.md

+kubectl exec -n $NAMESPACE -it $(kubectl get pods -n $NAMESPACE -l job-name=$PERF_JOB_NAME -o jsonpath='{.items[0].metadata.name}') -- ls -la /model-cache/perf/
+```
+
+## Model-Specific Examples


Can we just add these model-specific examples as a README.md under the model folder itself? This markdown is too long now.

alec-flowers · 2025-10-24T17:21:23Z

recipes/README.md

+
+# Replace "your-storage-class-name" with your actual storage class
+
+## Model Download


Shouldn't this be step 1 in manual model deployment?

alec-flowers · 2025-10-24T17:22:57Z

recipes/README.md

+kubectl apply -n $NAMESPACE -f <model>/<framework>/<mode>/perf.yaml
+```

 ## Prerequisites


I feel like this could just be a link to somewhere with a good explanation. Do we need this extra text here? Someone can just go to the link and make sure it's set up. Maybe you could have a command or two to check that things are set up properly, but I don't think we should have an entire explanation here. It takes up too much valuable README space.
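
The "command or two" suggested above might be a short preflight check along these lines. The function names and messages are hypothetical; the sketch only verifies that `kubectl` is on PATH and that `NAMESPACE` is set, and does not contact a cluster.

```shell
#!/bin/sh
set -eu

# Report whether a required CLI tool is on PATH.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1 found"
  else
    echo "missing: $1 is not on PATH"
    return 1
  fi
}

# Report whether NAMESPACE is set and non-empty.
check_namespace_var() {
  if [ -n "${NAMESPACE:-}" ]; then
    echo "ok: NAMESPACE=${NAMESPACE}"
  else
    echo "missing: NAMESPACE is not set"
    return 1
  fi
}

# Run every check so all problems are reported in one pass.
preflight() {
  failed=0
  check_cmd kubectl || failed=1
  check_namespace_var || failed=1
  return "${failed}"
}

if preflight; then
  echo "preflight passed"
else
  echo "preflight failed"
fi
```

Cluster-side checks, for example listing storage classes with `kubectl get storageclass`, could be added to `preflight` in the same pattern.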

alec-flowers · 2025-10-24T17:24:26Z

recipes/README.md

+./run.sh --skip-model-cache --model llama-3-70b --framework vllm agg
+```
+
+### Option 2: Manual Deployment


I like this manual deployment. Short and sweet. Do we need the other ones with all the extra information? Maybe that can be a different markdown? These are basically just duplicating info so we should just keep one or the other.

See Manual Model Deployment.

alec-flowers · 2025-10-24T17:25:13Z

Main feedback is that the README needs to be much more concise. There are a lot of extras in there that I think you can just remove.

biswapanda added 3 commits October 23, 2025 14:24

fix

a96738c

update docs

f778f0d

fix dsr1

1dd9d7e

biswapanda requested review from a team as code owners October 23, 2025 22:05

pull-request-size bot added the size/L label Oct 23, 2025

github-actions bot added the fix label Oct 23, 2025

biswapanda self-assigned this Oct 23, 2025

coderabbitai bot reviewed Oct 23, 2025

recipes/deepseek-r1/model-cache/model-download.yaml (resolved)

recipes/README.md (outdated, resolved)

recipes/README.md (outdated, resolved)

recipes/README.md (resolved)

recipes/run.sh (outdated, resolved)

fix dsr1 and oss-gpt120b models

e0f9123

copy-pr-bot bot temporarily deployed to GITLAB October 23, 2025 22:20 Inactive

fix

02b8b03

copy-pr-bot bot temporarily deployed to GITLAB October 23, 2025 22:44 Inactive

update run.sh

b90c5b6

copy-pr-bot bot temporarily deployed to GITLAB October 23, 2025 22:46 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 23, 2025 22:51 Inactive

update readme

bc0b77a

pull-request-size bot added size/XL and removed size/L labels Oct 23, 2025

biswapanda added 2 commits October 23, 2025 15:56

update

406e7bc

fix

1570665

alec-flowers reviewed Oct 24, 2025


biswapanda changed the title ~~fix: model recipes~~ fix: consistent model recipes and update simplified doc Oct 24, 2025

update

2a82e67

copy-pr-bot bot temporarily deployed to GITLAB October 25, 2025 00:07 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 25, 2025 00:08 Inactive


