feat(install): slim install for remote/NIM-only inference on Intel Mac#1830
feat(install): slim install for remote/NIM-only inference on Intel Mac#1830charlesbluca wants to merge 26 commits intoNVIDIA:mainfrom
Conversation
…d macOS Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
macOS CI runners (Apple Silicon) can be slow to start the Ray raylet, causing the node to time out during startup. Setting the wait to 120s prevents premature failures on resource-constrained runners. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Guard cv2, sklearn, langdetect, minio, pymilvus, and unstructured_client behind try/except or deferred imports so nemo_retriever and nv_ingest_api can be imported without GPU/local-inference packages installed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- pdfium.py: numpy channel-swap fallback when cv2 is None - extract.py: PIL encode fallback for page image rendering without cv2 - transforms.py: PIL fallbacks for numpy_to_base64, base64_to_numpy - table_and_chart.py (api + nemo_retriever): guard bare sklearn DBSCAN imports with ImportError fallback using naive y-grid row grouping - ocr/shared.py: same guard for _blocks_to_pseudo_markdown DBSCAN - lancedb_store.py: skip index creation when no embedding rows written - text_embed/runtime.py: raise on embedding failure instead of silent no-op - graph/executor.py: per-stage debug logging for inprocess pipeline - text_embed/cpu_operator.py: debug logging for embed input/output stats Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Windows also supports CUDA torch wheels; extend the platform marker to sys_platform == 'linux' or sys_platform == 'win32' so Windows users get the GPU-enabled wheel from the PyTorch CUDA index rather than falling through to the CPU-only PyPI wheel. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Greptile SummaryThis PR restructures the
|
| Filename | Overview |
|---|---|
| nemo_retriever/pyproject.toml | Dependency restructuring looks correct: lancedb promoted to core, [remote] collapsed into base, [local] merged, conflicts block removed; torch/torchvision platform markers properly applied. |
| nemo_retriever/src/nemo_retriever/text_embed/runtime.py | Error handling correctly narrowed from BaseException to Exception; print() replaced with structured logger.error(); error payload uses empty list embedding which is safely skipped by _build_lancedb_rows_from_df. |
| nemo_retriever/src/nemo_retriever/text_embed/cpu_operator.py | Added diagnostic debug logging for embedding operator; import logging as _logging placed inside process() method body rather than at module level, which is re-executed on every call. |
| nemo_retriever/src/nemo_retriever/graph/executor.py | Added per-stage debug logging to InprocessExecutor; logger imported and instantiated inside the run method body rather than at module level. |
| api/src/nv_ingest_api/util/image_processing/transforms.py | cv2 import now guarded with try/except; _resize_image_opencv correctly falls back to PIL when cv2 is None, using the hardcoded constant 4 (INTER_LANCZOS4) and Image.LANCZOS. |
| client/src/nv_ingest_client/util/vdb/milvus.py | All optional imports (Minio, MilvusClient, AnnSearchRequest, csr_array, etc.) now have None sentinels in the except ImportError block; previously only CONSISTENCY_BOUNDED was handled. |
| nemo_retriever/src/nemo_retriever/txt/split.py | Docstrings moved to correct position; early return added in txt_file_to_chunks_df for empty files; txt_bytes_to_chunks_df lacks the symmetric early return for empty byte content, causing unnecessary tokenizer load. |
| .github/workflows/integration-test-library-mode.yml | New workflow for cross-platform integration tests; bare nemo_retriever install (no extras) confirmed correct since lancedb is now a core dep; actions unpinned to mutable tags (pre-existing pattern). |
| api/src/nv_ingest_api/internal/extract/pdf/engines/unstructured_io.py | unstructured_client imports guarded with try/except; no guard at call site means TypeError instead of a clear ImportError when the package is absent and the extractor is invoked. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["pip install nemo_retriever\n(bare / slim install)"] --> B["Core deps always present:\nlancedb, pypdfium2, pillow,\nnltk, markitdown,\nlangchain-nvidia-ai-endpoints,\nnv-ingest-api, nv-ingest-client"]
A --> C["[local] extra\ntorch, transformers,\nnemotron models, vllm"]
A --> D["[multimedia] extra\nsoundfile, scipy, cairosvg"]
A --> E["[stores] extra\nduckdb, neo4j"]
A --> F["[benchmarks] extra\ndatasets, open-clip-torch"]
C & D & E & F --> G["[all] convenience\nnemo_retriever[local,multimedia,stores,benchmarks]"]
B --> H["Import guards\n(cv2=None, sklearn=None, langdetect=None)"]
H --> I["PIL fallback in\n_resize_image_opencv\n& _render_page_to_base64"]
H --> J["Grid fallback in\nDBSCAN clustering"]
H --> K["UNKNOWN fallback in\nlanguage detection"]
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/graph/executor.py
Line: 132-134
Comment:
**Logger instantiated inside method body**
`import logging as _logging` and `_logging.getLogger(__name__)` are called on every invocation of `execute()`. While Python caches both the module and the logger, this is unconventional — the logger should be declared at module level (or class level) once. The same pattern appears inside `_BatchEmbedCPUActor.process()` in `cpu_operator.py`.
```suggestion
import logging
logger = logging.getLogger(__name__)
```
Then replace `_exec_log` references with `logger`.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/txt/split.py
Line: 269-272
Comment:
**`txt_bytes_to_chunks_df` missing early return for empty content**
`txt_file_to_chunks_df` gained an early return before loading the tokenizer when the file is empty. `txt_bytes_to_chunks_df` decodes the bytes but then immediately loads the tokenizer even for empty/whitespace-only content, causing an unnecessary HuggingFace model lookup. Adding the same guard keeps both functions consistent.
```suggestion
path = str(Path(path).resolve())
raw = content_bytes.decode(encoding, errors="replace")
if not raw or not raw.strip():
return pd.DataFrame(
columns=["text", "path", "page_number", "metadata"],
).astype({"page_number": "int64"})
model_id = tokenizer_model_id or DEFAULT_TOKENIZER_MODEL_ID
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: api/src/nv_ingest_api/internal/extract/pdf/engines/unstructured_io.py
Line: 191-195
Comment:
**Missing call-site guard for optional `UnstructuredClient`**
When `unstructured_client` is not installed, all five sentinels (`UnstructuredClient`, `operations`, `shared`, `BackoffStrategy`, `RetryConfig`) are `None`. Calling `UnstructuredClient(...)` directly will raise `TypeError: 'NoneType' object is not callable`, which is harder to diagnose than an explicit error. A short guard at the top of the function body would give the caller an actionable message:
```python
if UnstructuredClient is None:
raise ImportError(
"unstructured_client is required for the unstructured_io extractor. "
"Install it with: pip install unstructured-client"
)
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (4): Last reviewed commit: "Fix README install instructions to refle..." | Re-trigger Greptile
|
|
||
| - name: Set up Python 3.12 | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.12' | ||
|
|
||
| - name: Install uv | ||
| run: pip install uv |
There was a problem hiding this comment.
GitHub Actions not pinned to commit SHA
Both actions/checkout@v4 and actions/setup-python@v5 use mutable version tags. Per the repository's github-actions-security rule, third-party actions must be pinned to a full commit SHA to prevent supply-chain attacks.
| - name: Set up Python 3.12 | |
| uses: actions/setup-python@v5 | |
| with: | |
| python-version: '3.12' | |
| - name: Install uv | |
| run: pip install uv | |
| - name: Check out repository code | |
| uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 | |
| with: | |
| ref: ${{ inputs.source-ref != '' && inputs.source-ref || github.ref }} | |
| - name: Set up Python 3.12 | |
| uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0 | |
| with: | |
| python-version: '3.12' |
Prompt To Fix With AI
This is a comment left during a code review.
Path: .github/workflows/integration-test-library-mode.yml
Line: 47-54
Comment:
**GitHub Actions not pinned to commit SHA**
Both `actions/checkout@v4` and `actions/setup-python@v5` use mutable version tags. Per the repository's `github-actions-security` rule, third-party actions must be pinned to a full commit SHA to prevent supply-chain attacks.
```suggestion
- name: Check out repository code
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
ref: ${{ inputs.source-ref != '' && inputs.source-ref || github.ref }}
- name: Set up Python 3.12
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.12'
```
How can I resolve this? If you propose a fix, please make it concise.- Wrap bare `import lancedb` in try/except in lancedb_store.py; add None guard in handle_lancedb for [remote]-only installs - Make graph_pipeline.py import of handle_lancedb lazy to avoid startup crash when lancedb is not installed - Add None sentinels for all optional pymilvus/minio/scipy imports in milvus.py (previously only CONSISTENCY_BOUNDED was set) - Guard cv2.INTER_LANCZOS4 and cv2.resize in _resize_image_opencv with PIL fallback when cv2 is unavailable - Restore graceful degradation in embed_text_main_text_embed: return error-payload DataFrame instead of re-raising on embedding failure Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…izer - Restore handle_lancedb as module-level import in graph_pipeline.py so monkeypatch.setattr works in batch_pipeline tests (safe now that lancedb_store.py guards its own import lancedb) - Early-return in txt_file_to_chunks_df before calling tokenizer when file content is empty, matching the pattern in html/convert.py - Mock _get_tokenizer/_get_txt_tokenizer in txt and html unit tests to avoid hitting HuggingFace (proxy-blocked in this environment) - Skip audio benchmark test gracefully when Ray cannot write to the filesystem (EROFS) rather than failing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move docstrings in txt_file_to_chunks_df and txt_bytes_to_chunks_df to immediately follow the def line so they are recognized as actual docstrings (previously appeared after variable assignments) - Remove conflicts block in pyproject.toml that referenced undefined cpu/cu130 extras; torch wheel selection is already handled by sys_platform markers in [tool.uv.sources] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Merge remote extra into base dependencies (document parsing and NIM client libs are now core; no extra needed for remote-only installs) - Move lancedb into base dependencies (it is the default VDB solution) - Merge local-cpu and local-gpu into a single local extra (GPU assumed) - Drop lancedb from [stores]; update [all] to reference [local] - Remove now-unnecessary lancedb import guards in lancedb_store.py - Update integration-test workflow to install bare nemo_retriever - Update DEPENDENCY_LAYERS.md to reflect new structure Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace core-only install with [local] extra for local GPU path; add separate remote-only install command for NIM inference users - Reframe Step 2 torch override: clarifies it is a PyPI/CUDA workaround, not a completion of the GPU stack; mark it skip-able for remote users - Add note before pipeline examples that they require [local] + CUDA torch - Fix [svg] extra reference to [multimedia] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…workflow Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Description
Simplifies the
nemo_retrieverinstall story by collapsing the dependency tier hierarchy and fixing several optional-import bugs surfaced during code review.Dependency restructuring
[remote]extra removed — its packages (pypdfium2,pillow,nltk,markitdown,langchain-nvidia-ai-endpoints) are now in the base install. A barepip install nemo_retrieveris now the correct remote/NIM-inference install.lancedbmoved to core — it is the default VDB solution and is now always present;[stores]retains only the supplementary backends (duckdb,neo4j).local-cpu+local-gpumerged into[local]— GPU is assumed for local model inference; the CPU-only tier is removed.conflictsblock removed frompyproject.toml— it referenced undefinedcpu/cu130extras; torch wheel selection is already handled bysys_platformmarkers.Import guard cleanup
try/except ImportErroraroundimport lancedbinlancedb_store.py— no longer needed sincelancedbis a core dep.if lancedb is Noneguard inhandle_lancedband the redundant lazy local import in_write_rows_to_lancedb.handle_lancedbas a module-level import ingraph_pipeline.py(was made lazy as a workaround; safe now that lancedb is always installed).Other code review fixes
Nonesentinels for all optional pymilvus/minio/scipy names inmilvus.pyexcept ImportErrorblock (previously onlyCONSISTENCY_BOUNDEDwas set).cv2.INTER_LANCZOS4/cv2.resizein_resize_image_opencvwith a PIL fallback whencv2is unavailable.embed_text_main_text_embed: returns an error-payload DataFrame on failure instead of re-raising (restores original pipeline contract).txt_file_to_chunks_dfandtxt_bytes_to_chunks_df(appeared after variable assignments, not recognized by Python).txt_file_to_chunks_dffor empty files before the tokenizer is loaded.Test fixes
_get_tokenizer/_get_txt_tokenizerin txt and html unit tests to avoid HuggingFace network calls.EROFS) rather than failing.Docs / CI
DEPENDENCY_LAYERS.mdto reflect new structure.integration-test-library-mode.ymlto install barenemo_retriever(no extra needed).Checklist