@keivenchang keivenchang commented Oct 24, 2025

Overview:

Add HuggingFace model cache checking to sanity_check.py for better pre-deployment validation.

Details:

  • Add HuggingFaceInfo class to check ~/.cache/huggingface/hub
  • Display model count in default mode, detailed info with --thorough-check
  • Check HF_TOKEN environment variable status
  • Update documentation and example output
  • Remove deprecated deploy/dynamo_check.py

Where should the reviewer start?

HuggingFaceInfo class in deploy/sanity_check.py (lines 1242-1410)

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

/coderabbit profile chill

Summary by CodeRabbit

  • New Features
    • Deployment verification now includes HuggingFace model cache analysis.
    • Thorough-check mode displays detailed HuggingFace model cache information and token status.
    • Help documentation updated to reflect HuggingFace cache monitoring.

coderabbitai bot commented Oct 24, 2025

Walkthrough

The changes add a new HuggingFaceInfo class to analyze and report HuggingFace model cache state as part of the diagnostic system. This class integrates into the SystemInfo diagnostic tree, with optional detailed model enumeration in thorough-check mode. Help text is updated to reflect this addition.

Changes

| Cohort / File(s) | Summary |
|---|---|
| HuggingFace Cache Diagnostics: deploy/sanity_check.py | Adds HuggingFaceInfo class to detect and report cached Hugging Face models under ~/.cache/huggingface/hub, with optional thorough-check mode for detailed model enumeration and HF_TOKEN indicator. Integrates into SystemInfo construction via add_child(). Updates help text and intro narrative to reflect HuggingFace cache inspection. Note: Duplicate class definition present in file. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Areas requiring extra attention:

  • Duplicate HuggingFaceInfo class definition — Two identical class definitions appear in the same file; verify intent and consolidate or remove one copy
  • Integration into SystemInfo subtree — Confirm add_child() call is correctly positioned and doesn't introduce circular dependencies
  • Cache path assumptions — Validate that ~/.cache/huggingface/hub path handling is robust across platforms and edge cases
  • HF_TOKEN handling — Ensure sensitive token information is appropriately handled and not leaked in output
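
One way to approach the cache-path concern is sketched below. It is an illustration only, not the code in this PR, which deliberately hard-codes ~/.cache/huggingface/hub. The helper name `resolve_hf_hub_cache` is hypothetical; the `HF_HUB_CACHE` and `HF_HOME` environment variables are the ones the Hugging Face Hub client reads, but this sketch does not cover every override the client supports (for example `XDG_CACHE_HOME`).

```python
import os
from pathlib import Path


def resolve_hf_hub_cache(env=None):
    """Return the Hugging Face hub cache directory (hypothetical helper, not from the PR)."""
    env = os.environ if env is None else env

    # HF_HUB_CACHE, when set, points directly at the hub cache directory.
    hub_cache = env.get("HF_HUB_CACHE")
    if hub_cache:
        return Path(hub_cache).expanduser()

    # HF_HOME, when set, is the parent directory; the hub cache is "hub" beneath it.
    hf_home = env.get("HF_HOME")
    if hf_home:
        return Path(hf_home).expanduser() / "hub"

    # Otherwise fall back to the default location named in the PR description.
    return Path.home() / ".cache" / "huggingface" / "hub"
```

Using `Path.home()` rather than string-expanding "~" keeps the fallback working on platforms where the home directory is not exposed through `$HOME`.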

Poem

🐰 A cache so deep, with models galore,
HuggingFace treasures we now explore!
Though duplicates dance where one should be,
Our diagnostics bloom, for all to see! 🌱

Pre-merge checks

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The PR title "feat: add HuggingFace cache checking to sanity_check.py" directly and clearly describes the primary change in the changeset. According to the raw summary and PR objectives, the main contribution is the introduction of a new HuggingFaceInfo class that adds HuggingFace model cache checking capability to the sanity_check.py file. The title is concise, specific, and uses conventional commit formatting (feat:), making it easy for teammates scanning history to understand the main addition. While the PR also includes the deletion of deploy/dynamo_check.py and documentation updates, these are secondary to the primary feature being added, which the title appropriately captures. |
| Description Check | ✅ Passed | The PR description follows the required template structure with all key sections properly filled out. The Overview section provides a clear summary of the purpose, the Details section outlines specific changes including the new HuggingFaceInfo class, display modes, HF_TOKEN checking, and removal of the deprecated file, and the "Where should the reviewer start?" section helpfully points reviewers to the primary code location (lines 1242-1410). The Related Issues section is present as required, though empty, which is acceptable if no GitHub issues are associated. The description provides sufficient context and specificity for reviewers to understand the changes without being overly verbose. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 81.82% which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
deploy/sanity_check.py (4)

16-16: Standardize “Hugging Face” naming and sync help text.

Use “Hugging Face” consistently and align the options text with argparse help.

- - HuggingFace model cache (detailed with --thorough-check)
+ - Hugging Face model cache (detailed with --thorough-check)
@@
-    --thorough-check  Enable thorough checking (file permissions, directory sizes, HuggingFace model details)
+    --thorough-check  Enable thorough checking (file permissions, directory sizes, disk space, Hugging Face model details)

Also applies to: 92-93


1361-1367: Tighten exceptions, add debug logs, and fix unused loop var (ruff).

Narrow broad excepts, add debug logging instead of silent pass, and rename dirnames to _dirnames.

-                        try:
-                            stat_info = os.stat(item_path)
+                        try:
+                            stat_info = os.stat(item_path)
                             # Use the earlier of creation time or modification time
                             download_time = min(stat_info.st_ctime, stat_info.st_mtime)
                             download_date = self._format_timestamp_pdt(download_time)
-                        except Exception:
+                        except OSError as e:
+                            logging.debug("HF cache: stat failed for %s: %s", item_path, e)
                             download_date = "unknown"
@@
-        except Exception:
-            pass
+        except OSError as e:
+            logging.debug("HF cache: listing failed for %s: %s", cache_path, e)
@@
-            for dirpath, dirnames, filenames in os.walk(directory):
+            for dirpath, _dirnames, filenames in os.walk(directory):
                 for filename in filenames:
@@
-        except Exception:
-            pass
+        except OSError as e:
+            logging.debug("HF cache: size scan failed for %s: %s", directory, e)

Based on static analysis hints.

Also applies to: 1370-1377, 1386-1389, 1394-1395
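
For reference, the pattern suggested above (narrow `OSError` handling, debug logging instead of a silent `pass`, and an underscore-prefixed unused loop variable) could look roughly like the standalone sketch below. The function name `directory_size_bytes` and the module-level `logger` are hypothetical and not taken from deploy/sanity_check.py. Counting only regular files, so that symlinked snapshot entries are not added on top of the blobs they point to, is an assumption of this sketch rather than confirmed behavior of the PR.

```python
import logging
import os
import stat

logger = logging.getLogger(__name__)


def directory_size_bytes(directory):
    """Sum the sizes of regular files under `directory` (hypothetical helper)."""
    total = 0

    def _log_walk_error(err):
        # os.walk ignores listing errors unless an onerror callback is given.
        logger.debug("HF cache: listing failed for %s: %s", err.filename, err)

    for dirpath, _dirnames, filenames in os.walk(directory, onerror=_log_walk_error):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                # lstat does not follow symlinks, so a link is never mistaken
                # for the file it points to.
                st = os.lstat(path)
            except OSError as e:
                logger.debug("HF cache: stat failed for %s: %s", path, e)
                continue
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total
```

A missing or unreadable directory yields a total of 0 plus a debug log line, rather than raising.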


1308-1319: Use model name as the node label for readability.

Makes the tree easier to scan than “Model N”.

-            model_node = NodeInfo(
-                label=f"Model {i+1}",
-                desc=f"{model_name}, downloaded={download_date}, size={size_str}",
-                status=NodeStatus.INFO,
-            )
+            model_node = NodeInfo(
+                label=model_name,
+                desc=f"downloaded={download_date}, size={size_str}",
+                status=NodeStatus.INFO,
+            )

1320-1329: Optional: also show when HF_TOKEN is not set.

If helpful, add a small INFO/WARNING when HF_TOKEN is unset to hint at auth-protected models.

     def _add_hf_token_info(self):
         """Add HF_TOKEN information if the environment variable is set."""
-        if os.environ.get("HF_TOKEN"):
+        if os.environ.get("HF_TOKEN"):
             token_node = NodeInfo(
                 label="HF_TOKEN",
                 desc="<set>",
                 status=NodeStatus.INFO,
             )
             self.add_child(token_node)
+        else:
+            self.add_child(NodeInfo(label="HF_TOKEN", desc="not set", status=NodeStatus.INFO))
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48b622c and 1e91bb4.

📒 Files selected for processing (2)
  • deploy/dynamo_check.py (0 hunks)
  • deploy/sanity_check.py (5 hunks)
💤 Files with no reviewable changes (1)
  • deploy/dynamo_check.py
🧰 Additional context used
🪛 Ruff (0.14.1)
deploy/sanity_check.py

1365-1365: Do not catch blind exception: Exception

(BLE001)


1372-1372: Do not catch blind exception: Exception

(BLE001)


1376-1377: try-except-pass detected, consider logging the exception

(S110)


1376-1376: Do not catch blind exception: Exception

(BLE001)


1386-1386: Loop control variable dirnames not used within loop body

Rename unused dirnames to _dirnames

(B007)


1394-1395: try-except-pass detected, consider logging the exception

(S110)


1394-1394: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
deploy/sanity_check.py (4)

19-26: Docs update looks good.

Clear rationale for standalone behavior and hard-coded paths.


47-50: Example output changes look fine.


57-57: Example HF cache line looks good.

Will reflect accurate counts after model filtering fix below.


337-339: Integration into SystemInfo is correct.

Runs in non-terse mode; respects --thorough-check.

- Only count models--* directories, excluding datasets--, spaces--, blobs
- Gate size calculation on thorough_check flag to keep default mode fast
- Add compute_sizes parameter with documentation

Signed-off-by: Keiven Chang <[email protected]>
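
The filtering and gating described in this commit message can be illustrated with the standalone sketch below. It is not the code from the commit: `list_cached_models` and `_dir_size` are hypothetical names, and only the `compute_sizes` parameter name comes from the message. Mapping "models--org--name" back to "org/name" follows the hub cache's directory naming and is ambiguous for names that themselves contain "--".

```python
import os


def _dir_size(directory):
    """Total size in bytes of non-symlink files under `directory` (hypothetical helper)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(directory):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return total


def list_cached_models(cache_path, compute_sizes=False):
    """Return (model_name, size_or_None) for each models--* directory."""
    try:
        entries = sorted(os.listdir(cache_path))
    except OSError:
        return []  # cache directory missing or unreadable

    models = []
    for entry in entries:
        full_path = os.path.join(cache_path, entry)
        # Skip datasets--*, spaces--*, blobs, and stray files.
        if not entry.startswith("models--") or not os.path.isdir(full_path):
            continue
        name = entry[len("models--"):].replace("--", "/")
        # Walking the tree is the slow part, so only do it when asked.
        size = _dir_size(full_path) if compute_sizes else None
        models.append((name, size))
    return models
```

In default mode a caller would use `len(list_cached_models(path))` for the model count, and pass `compute_sizes=True` only under --thorough-check.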