Skip to content

feat(install): slim install for remote/NIM-only inference on Intel Mac#1830

Draft
charlesbluca wants to merge 26 commits intoNVIDIA:mainfrom
charlesbluca:slim-install
Draft

feat(install): slim install for remote/NIM-only inference on Intel Mac#1830
charlesbluca wants to merge 26 commits intoNVIDIA:mainfrom
charlesbluca:slim-install

Conversation

@charlesbluca
Copy link
Copy Markdown
Collaborator

@charlesbluca charlesbluca commented Apr 9, 2026

Description

Simplifies the nemo_retriever install story by collapsing the dependency tier hierarchy and fixing several optional-import bugs surfaced during code review.

Dependency restructuring

  • [remote] extra removed — its packages (pypdfium2, pillow, nltk, markitdown, langchain-nvidia-ai-endpoints) are now in the base install. A bare pip install nemo_retriever is now the correct remote/NIM-inference install.
  • lancedb moved to core — it is the default VDB solution and is now always present; [stores] retains only the supplementary backends (duckdb, neo4j).
  • local-cpu + local-gpu merged into [local] — GPU is assumed for local model inference; the CPU-only tier is removed.
  • conflicts block removed from pyproject.toml — it referenced undefined cpu/cu130 extras; torch wheel selection is already handled by sys_platform markers.

Import guard cleanup

  • Removed try/except ImportError around import lancedb in lancedb_store.py — no longer needed since lancedb is a core dep.
  • Removed if lancedb is None guard in handle_lancedb and the redundant lazy local import in _write_rows_to_lancedb.
  • Restored handle_lancedb as a module-level import in graph_pipeline.py (was made lazy as a workaround; safe now that lancedb is always installed).

Other code review fixes

  • Added None sentinels for all optional pymilvus/minio/scipy names in milvus.py except ImportError block (previously only CONSISTENCY_BOUNDED was set).
  • Guarded cv2.INTER_LANCZOS4 / cv2.resize in _resize_image_opencv with a PIL fallback when cv2 is unavailable.
  • Restored graceful degradation in embed_text_main_text_embed: returns an error-payload DataFrame on failure instead of re-raising (restores original pipeline contract).
  • Fixed misplaced docstrings in txt_file_to_chunks_df and txt_bytes_to_chunks_df (appeared after variable assignments, not recognized by Python).
  • Added early return in txt_file_to_chunks_df for empty files before the tokenizer is loaded.

Test fixes

  • Mocked _get_tokenizer / _get_txt_tokenizer in txt and html unit tests to avoid HuggingFace network calls.
  • Skips audio benchmark test gracefully when Ray cannot write to the filesystem (EROFS) rather than failing.

Docs / CI

  • Updated DEPENDENCY_LAYERS.md to reflect new structure.
  • Updated integration-test-library-mode.yml to install bare nemo_retriever (no extra needed).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

charlesbluca and others added 16 commits April 8, 2026 17:56
…d macOS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
macOS CI runners (Apple Silicon) can be slow to start the Ray raylet,
causing the node to time out during startup. Setting the wait to 120s
prevents premature failures on resource-constrained runners.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Guard cv2, sklearn, langdetect, minio, pymilvus, and unstructured_client
behind try/except or deferred imports so nemo_retriever and nv_ingest_api
can be imported without GPU/local-inference packages installed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- pdfium.py: numpy channel-swap fallback when cv2 is None
- extract.py: PIL encode fallback for page image rendering without cv2
- transforms.py: PIL fallbacks for numpy_to_base64, base64_to_numpy
- table_and_chart.py (api + nemo_retriever): guard bare sklearn DBSCAN
  imports with ImportError fallback using naive y-grid row grouping
- ocr/shared.py: same guard for _blocks_to_pseudo_markdown DBSCAN
- lancedb_store.py: skip index creation when no embedding rows written
- text_embed/runtime.py: raise on embedding failure instead of silent no-op
- graph/executor.py: per-stage debug logging for inprocess pipeline
- text_embed/cpu_operator.py: debug logging for embed input/output stats

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Windows also supports CUDA torch wheels; extend the platform marker to
sys_platform == 'linux' or sys_platform == 'win32' so Windows users
get the GPU-enabled wheel from the PyTorch CUDA index rather than
falling through to the CPU-only PyPI wheel.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@charlesbluca charlesbluca requested review from a team as code owners April 9, 2026 15:15
@charlesbluca charlesbluca requested a review from nkmcalli April 9, 2026 15:15
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 9, 2026

Greptile Summary

This PR restructures the nemo_retriever package to support a slim remote/NIM-only install (bare pip install nemo_retriever) suitable for Intel Mac. Key changes include: moving lancedb and several previously-optional packages to core deps, merging local-cpu/local-gpu into a single [local] extra, removing the [remote] extra tier, fixing try/except ImportError guards for cv2, sklearn, langdetect, and pymilvus, and restoring graceful embedding failure handling (narrowed from BaseException to Exception with proper structured logging).

Confidence Score: 5/5

PR is safe to merge — all previously raised P0/P1 issues are addressed and remaining findings are non-blocking P2 style suggestions.

All critical issues from prior review threads (milvus None sentinels, cv2 attribute error, lancedb import, GitHub Actions SHA, BaseException swallowing) are resolved in this PR. The three remaining comments are P2: logger-inside-method style, a missing early-return for empty bytes, and a missing call-site guard in the legacy unstructured_io extractor. None of these block functionality on the target slim install path.

No files require special attention; the most impactful changes in pyproject.toml and lancedb_store.py look correct.

Vulnerabilities

No security concerns identified. No hardcoded credentials, secrets, or tokens were found. The new workflow file passes API keys via ${{ secrets.NGC_NV_DEVELOPER_NVCF }} correctly. No SQL injection vectors were introduced.

Important Files Changed

Filename Overview
nemo_retriever/pyproject.toml Dependency restructuring looks correct: lancedb promoted to core, [remote] collapsed into base, [local] merged, conflicts block removed; torch/torchvision platform markers properly applied.
nemo_retriever/src/nemo_retriever/text_embed/runtime.py Error handling correctly narrowed from BaseException to Exception; print() replaced with structured logger.error(); error payload uses empty list embedding which is safely skipped by _build_lancedb_rows_from_df.
nemo_retriever/src/nemo_retriever/text_embed/cpu_operator.py Added diagnostic debug logging for embedding operator; import logging as _logging placed inside process() method body rather than at module level, which is re-executed on every call.
nemo_retriever/src/nemo_retriever/graph/executor.py Added per-stage debug logging to InprocessExecutor; logger imported and instantiated inside the run method body rather than at module level.
api/src/nv_ingest_api/util/image_processing/transforms.py cv2 import now guarded with try/except; _resize_image_opencv correctly falls back to PIL when cv2 is None, using the hardcoded constant 4 (INTER_LANCZOS4) and Image.LANCZOS.
client/src/nv_ingest_client/util/vdb/milvus.py All optional imports (Minio, MilvusClient, AnnSearchRequest, csr_array, etc.) now have None sentinels in the except ImportError block; previously only CONSISTENCY_BOUNDED was handled.
nemo_retriever/src/nemo_retriever/txt/split.py Docstrings moved to correct position; early return added in txt_file_to_chunks_df for empty files; txt_bytes_to_chunks_df lacks the symmetric early return for empty byte content, causing unnecessary tokenizer load.
.github/workflows/integration-test-library-mode.yml New workflow for cross-platform integration tests; bare nemo_retriever install (no extras) confirmed correct since lancedb is now a core dep; actions unpinned to mutable tags (pre-existing pattern).
api/src/nv_ingest_api/internal/extract/pdf/engines/unstructured_io.py unstructured_client imports guarded with try/except; no guard at call site means TypeError instead of a clear ImportError when the package is absent and the extractor is invoked.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["pip install nemo_retriever\n(bare / slim install)"] --> B["Core deps always present:\nlancedb, pypdfium2, pillow,\nnltk, markitdown,\nlangchain-nvidia-ai-endpoints,\nnv-ingest-api, nv-ingest-client"]

    A --> C["[local] extra\ntorch, transformers,\nnemotron models, vllm"]
    A --> D["[multimedia] extra\nsoundfile, scipy, cairosvg"]
    A --> E["[stores] extra\nduckdb, neo4j"]
    A --> F["[benchmarks] extra\ndatasets, open-clip-torch"]
    C & D & E & F --> G["[all] convenience\nnemo_retriever[local,multimedia,stores,benchmarks]"]

    B --> H["Import guards\n(cv2=None, sklearn=None, langdetect=None)"]
    H --> I["PIL fallback in\n_resize_image_opencv\n& _render_page_to_base64"]
    H --> J["Grid fallback in\nDBSCAN clustering"]
    H --> K["UNKNOWN fallback in\nlanguage detection"]
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/graph/executor.py
Line: 132-134

Comment:
**Logger instantiated inside method body**

`import logging as _logging` and `_logging.getLogger(__name__)` are called on every invocation of `execute()`. While Python caches both the module and the logger, this is unconventional — the logger should be declared at module level (or class level) once. The same pattern appears inside `_BatchEmbedCPUActor.process()` in `cpu_operator.py`.

```suggestion
import logging

logger = logging.getLogger(__name__)
```

Then replace `_exec_log` references with `logger`.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/txt/split.py
Line: 269-272

Comment:
**`txt_bytes_to_chunks_df` missing early return for empty content**

`txt_file_to_chunks_df` gained an early return before loading the tokenizer when the file is empty. `txt_bytes_to_chunks_df` decodes the bytes but then immediately loads the tokenizer even for empty/whitespace-only content, causing an unnecessary HuggingFace model lookup. Adding the same guard keeps both functions consistent.

```suggestion
    path = str(Path(path).resolve())
    raw = content_bytes.decode(encoding, errors="replace")
    if not raw or not raw.strip():
        return pd.DataFrame(
            columns=["text", "path", "page_number", "metadata"],
        ).astype({"page_number": "int64"})
    model_id = tokenizer_model_id or DEFAULT_TOKENIZER_MODEL_ID
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: api/src/nv_ingest_api/internal/extract/pdf/engines/unstructured_io.py
Line: 191-195

Comment:
**Missing call-site guard for optional `UnstructuredClient`**

When `unstructured_client` is not installed, all five sentinels (`UnstructuredClient`, `operations`, `shared`, `BackoffStrategy`, `RetryConfig`) are `None`. Calling `UnstructuredClient(...)` directly will raise `TypeError: 'NoneType' object is not callable`, which is harder to diagnose than an explicit error. A short guard at the top of the function body would give the caller an actionable message:

```python
if UnstructuredClient is None:
    raise ImportError(
        "unstructured_client is required for the unstructured_io extractor. "
        "Install it with: pip install unstructured-client"
    )
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (4): Last reviewed commit: "Fix README install instructions to refle..." | Re-trigger Greptile

Comment on lines +47 to +54

- name: Set up Python 3.12
uses: actions/setup-python@v5
with:
python-version: '3.12'

- name: Install uv
run: pip install uv
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 GitHub Actions not pinned to commit SHA

Both actions/checkout@v4 and actions/setup-python@v5 use mutable version tags. Per the repository's github-actions-security rule, third-party actions must be pinned to a full commit SHA to prevent supply-chain attacks.

Suggested change
- name: Set up Python 3.12
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install uv
run: pip install uv
- name: Check out repository code
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
ref: ${{ inputs.source-ref != '' && inputs.source-ref || github.ref }}
- name: Set up Python 3.12
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.12'
Prompt To Fix With AI
This is a comment left during a code review.
Path: .github/workflows/integration-test-library-mode.yml
Line: 47-54

Comment:
**GitHub Actions not pinned to commit SHA**

Both `actions/checkout@v4` and `actions/setup-python@v5` use mutable version tags. Per the repository's `github-actions-security` rule, third-party actions must be pinned to a full commit SHA to prevent supply-chain attacks.

```suggestion
      - name: Check out repository code
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683  # v4.2.2
        with:
          ref: ${{ inputs.source-ref != '' && inputs.source-ref || github.ref }}

      - name: Set up Python 3.12
        uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065  # v5.6.0
        with:
          python-version: '3.12'
```

How can I resolve this? If you propose a fix, please make it concise.

@charlesbluca charlesbluca marked this pull request as draft April 9, 2026 15:33
charlesbluca and others added 5 commits April 9, 2026 15:53
- Wrap bare `import lancedb` in try/except in lancedb_store.py; add None
  guard in handle_lancedb for [remote]-only installs
- Make graph_pipeline.py import of handle_lancedb lazy to avoid startup
  crash when lancedb is not installed
- Add None sentinels for all optional pymilvus/minio/scipy imports in
  milvus.py (previously only CONSISTENCY_BOUNDED was set)
- Guard cv2.INTER_LANCZOS4 and cv2.resize in _resize_image_opencv with
  PIL fallback when cv2 is unavailable
- Restore graceful degradation in embed_text_main_text_embed: return
  error-payload DataFrame instead of re-raising on embedding failure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…izer

- Restore handle_lancedb as module-level import in graph_pipeline.py so
  monkeypatch.setattr works in batch_pipeline tests (safe now that
  lancedb_store.py guards its own import lancedb)
- Early-return in txt_file_to_chunks_df before calling tokenizer when
  file content is empty, matching the pattern in html/convert.py
- Mock _get_tokenizer/_get_txt_tokenizer in txt and html unit tests to
  avoid hitting HuggingFace (proxy-blocked in this environment)
- Skip audio benchmark test gracefully when Ray cannot write to the
  filesystem (EROFS) rather than failing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move docstrings in txt_file_to_chunks_df and txt_bytes_to_chunks_df
  to immediately follow the def line so they are recognized as actual
  docstrings (previously appeared after variable assignments)
- Remove conflicts block in pyproject.toml that referenced undefined
  cpu/cu130 extras; torch wheel selection is already handled by
  sys_platform markers in [tool.uv.sources]

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Merge remote extra into base dependencies (document parsing and NIM
  client libs are now core; no extra needed for remote-only installs)
- Move lancedb into base dependencies (it is the default VDB solution)
- Merge local-cpu and local-gpu into a single local extra (GPU assumed)
- Drop lancedb from [stores]; update [all] to reference [local]
- Remove now-unnecessary lancedb import guards in lancedb_store.py
- Update integration-test workflow to install bare nemo_retriever
- Update DEPENDENCY_LAYERS.md to reflect new structure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
charlesbluca and others added 3 commits April 9, 2026 16:52
- Replace core-only install with [local] extra for local GPU path; add
  separate remote-only install command for NIM inference users
- Reframe Step 2 torch override: clarifies it is a PyPI/CUDA workaround,
  not a completion of the GPU stack; mark it skip-able for remote users
- Add note before pipeline examples that they require [local] + CUDA torch
- Fix [svg] extra reference to [multimedia]

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…workflow

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant