Merged (35 commits)
0ef1eae
Add workflow_dispatch integration test for library mode on Windows an…
charlesbluca Apr 8, 2026
6a21df3
Split torch cu130 deps into explicit group
charlesbluca Apr 8, 2026
bbe51d3
ci: increase RAY_raylet_start_wait_time_s for macOS integration tests
charlesbluca Apr 8, 2026
027c154
Use inprocess run mode for now
charlesbluca Apr 8, 2026
79e8f0b
Pass API key via --api-key flag using NGC_NV_DEVELOPER_NVCF secret
charlesbluca Apr 8, 2026
1f31946
Initial plan & refactors for slimmer install
charlesbluca Apr 8, 2026
8c45923
Drop nv-ingest as dep
charlesbluca Apr 8, 2026
531015e
Make heavy optional deps lazy for slim Intel Mac install
charlesbluca Apr 9, 2026
6190f04
Fix Intel Mac slim-install blockers in PDF/image/embed pipeline
charlesbluca Apr 9, 2026
264ab24
Merge remote-tracking branch 'upstream/main' into slim-install
charlesbluca Apr 9, 2026
fd53df6
Use CUDA torch index for Windows as well as Linux
charlesbluca Apr 9, 2026
5bef6f7
Merge branch 'slim-install'
charlesbluca Apr 9, 2026
26475c8
Add macOS x64 to workflow
charlesbluca Apr 9, 2026
4b120f0
torch cuda index rename
charlesbluca Apr 9, 2026
9f2035b
Merge branch 'slim-install'
charlesbluca Apr 9, 2026
45f2732
Try switching to macos-26-intel
charlesbluca Apr 9, 2026
9e01344
Modify unit test install
charlesbluca Apr 9, 2026
326de9a
Linting
charlesbluca Apr 9, 2026
0722e6f
Guard optional imports and restore graceful embedding failure handling
charlesbluca Apr 9, 2026
4591f62
Fix test failures from lazy import change and network-dependent token…
charlesbluca Apr 9, 2026
4273b5a
Fix misplaced docstrings and remove invalid uv conflicts block
charlesbluca Apr 9, 2026
25622df
Simplify dependency groups; move remote and lancedb to core
charlesbluca Apr 9, 2026
48a8953
Drop agent doc
charlesbluca Apr 9, 2026
03dc39f
Fix README install instructions to reflect simplified dependency groups
charlesbluca Apr 9, 2026
ec99f30
ci: add nightly schedule trigger and fix secret name in library mode …
charlesbluca Apr 9, 2026
92f777a
Compat code for ray[data] 2.49
charlesbluca Apr 9, 2026
eabdf74
Merge upstream/main into slim-install
charlesbluca Apr 13, 2026
1085a2c
Merge remote-tracking branch 'upstream/main' into slim-install
charlesbluca Apr 13, 2026
9e8dbeb
fix(embed): avoid doubling /embeddings on HTTP embedding URLs
charlesbluca Apr 13, 2026
dec707d
test(embed): align BatchEmbedCPUActor test with local HF default
charlesbluca Apr 13, 2026
fec7dd8
Merge branch 'main' into slim-install
charlesbluca Apr 14, 2026
0988bd7
fix(embed): remote-only CPU embed actor; drop inprocess debug logs
charlesbluca Apr 15, 2026
e51cfd9
Merge branch 'main' into slim-install
charlesbluca Apr 15, 2026
e35a7de
Merge branch 'main' into slim-install
jdye64 Apr 15, 2026
cf71664
Merge branch 'main' into slim-install
charlesbluca Apr 15, 2026
77 changes: 77 additions & 0 deletions .github/workflows/integration-test-library-mode.yml
@@ -0,0 +1,77 @@
# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0

name: Library Mode Integration Tests (Windows & macOS)

on:
  workflow_dispatch:
    inputs:
      source-ref:
        description: 'Git ref to test (branch, tag, or SHA). Defaults to the dispatched branch.'
        required: false
        type: string
        default: ''

jobs:
  integration-test:
    name: Integration Tests (${{ matrix.os-label }})
    runs-on: ${{ matrix.runner }}
    timeout-minutes: 90

    strategy:
      fail-fast: false
      matrix:
        include:
          - runner: windows-latest
            os-label: windows-x64
          - runner: macos-26
            os-label: macos-arm64
          - runner: macos-26-intel
            os-label: macos-x64

    env:
      # NIM endpoint URLs — edit these directly to point at different deployments
      PAGE_ELEMENTS_INVOKE_URL: "https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-page-elements-v3"
      OCR_INVOKE_URL: "https://ai.api.nvidia.com/v1/cv/nvidia/nemoretriever-ocr-v1"
      GRAPHIC_ELEMENTS_INVOKE_URL: "https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-graphic-elements-v1"
      TABLE_STRUCTURE_INVOKE_URL: "https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-table-structure-v1"
      EMBED_INVOKE_URL: "https://integrate.api.nvidia.com/v1"
      EMBED_MODEL_NAME: "nvidia/llama-nemotron-embed-1b-v2"

    steps:
      - name: Check out repository code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.source-ref != '' && inputs.source-ref || github.ref }}

      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
Comment on lines +47 to +53
Contributor:
**P1 (security): Missing `permissions:` block on new workflow**

Both `integration-test-library-mode.yml` (new) and `retriever-unit-tests.yml` (modified) are missing an explicit `permissions:` block. Without one, the `GITHUB_TOKEN` inherits repository-default permissions, which can be write-scoped depending on org settings. Per the `github-actions-security` rule, every workflow must declare least-privilege scope. These workflows only need `contents: read`.

Add at the workflow (or job) level:

```yaml
permissions:
  contents: read
```

The same fix applies to `retriever-unit-tests.yml`.

**Rule Used:** GitHub Actions workflows must: pin third-party act... (source)

          python-version: '3.12'

      - name: Install uv
        run: pip install uv
Comment on lines +50 to +57
Contributor:
**P2: GitHub Actions not pinned to commit SHA**

Both `actions/checkout@v4` and `actions/setup-python@v5` use mutable version tags. Per the repository's `github-actions-security` rule, third-party actions must be pinned to a full commit SHA to prevent supply-chain attacks.

Suggested change:

```yaml
      - name: Check out repository code
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683  # v4.2.2
        with:
          ref: ${{ inputs.source-ref != '' && inputs.source-ref || github.ref }}

      - name: Set up Python 3.12
        uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065  # v5.6.0
        with:
          python-version: '3.12'
```


      - name: Install nemo-retriever and dependencies
charlesbluca marked this conversation as resolved.
        shell: bash
        run: |
          uv pip install --system -e api/ -e client/ -e "nemo_retriever[remote]"

      - name: Run graph pipeline on PDFs
        shell: bash
        env:
          PYTHONPATH: nemo_retriever/src
        run: |
          python -m nemo_retriever.examples.graph_pipeline ./data \
            --run-mode inprocess \
            --input-type pdf \
            --api-key "${{ secrets.NGC_NV_DEVELOPER_NVCF }}" \
            --page-elements-invoke-url "$PAGE_ELEMENTS_INVOKE_URL" \
            --ocr-invoke-url "$OCR_INVOKE_URL" \
            --use-graphic-elements \
            --graphic-elements-invoke-url "$GRAPHIC_ELEMENTS_INVOKE_URL" \
            --use-table-structure \
            --table-structure-invoke-url "$TABLE_STRUCTURE_INVOKE_URL" \
            --embed-invoke-url "$EMBED_INVOKE_URL" \
            --embed-model-name "$EMBED_MODEL_NAME"
3 changes: 1 addition & 2 deletions .github/workflows/retriever-unit-tests.yml
@@ -26,8 +26,7 @@ jobs:

      - name: Install unit test dependencies
        run: |
          uv pip install --system -e src/ -e api/ -e client/
          uv pip install --system -e nemo_retriever
          uv pip install --system -e nemo_retriever[all,dev]

      - name: Run retriever unit tests
        env:
121 changes: 121 additions & 0 deletions DEPENDENCY_LAYERS.md
@@ -0,0 +1,121 @@
# Dependency Layering Plan

This document describes the restructured optional-extras model for `nemo_retriever/pyproject.toml`.

## Problem

The previous `pyproject.toml` listed ~50 packages as required dependencies, meaning every install
pulled in torch, vLLM, CUDA wheels, nemotron models, GPU monitoring tooling, etc. — regardless of
whether the user intended to run local models or simply call remote NIM endpoints. This made the
package impossible to install on Intel Macs and unnecessarily heavy everywhere.

## Solution: Layered Optional Extras

Dependencies are now split into a slim base plus composable optional extras. Each tier builds on
the previous via self-referencing extras.

### Tier hierarchy

```
nemo_retriever ← slim base: ray, fastapi, pydantic, HTTP clients, nv-ingest*
└── [remote] ← adds: pypdfium2, pillow, nltk, markitdown, langchain-nvidia-ai-endpoints
└── [local-cpu] ← adds: torch CPU, transformers, nemotron models (ARM Mac compatible)
└── [local-gpu] ← adds: nvidia-ml-py, vLLM (Linux/CUDA only)
└── [multimedia] ← adds: soundfile + scipy (ASR), cairosvg (SVG)
(can also be combined with any tier independently)

[stores] ← lancedb, duckdb, duckdb-engine, neo4j (independent, add to any tier)
[benchmarks] ← datasets, open-clip-torch (BEIR evaluation only)
[dev] ← build, pytest
[all] ← local-gpu + multimedia + stores + benchmarks
```
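The layering above relies on self-referencing extras, where each tier depends on the package itself with the tier below as an extra. A rough `pyproject.toml` sketch of the idea (tier names match the diagram above; the specific dependency lists and pins here are illustrative, not the PR's exact contents):

```toml
[project.optional-dependencies]
remote = ["pypdfium2", "pillow", "nltk", "markitdown", "langchain-nvidia-ai-endpoints"]
local-cpu = [
  "nemo_retriever[remote]",  # self-reference pulls in the tier below
  "torch~=2.9.1",
  "transformers",
]
local-gpu = [
  "nemo_retriever[local-cpu]",
  "nvidia-ml-py",
  "vllm==0.16.0; sys_platform == 'linux'",  # environment marker keeps Macs clean
]
```

With this shape, `pip install "nemo_retriever[local-gpu]"` transitively resolves `[local-cpu]` and `[remote]` as well, so each tier is a strict superset of the one below it.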

### Install commands by use case

| Use case | Platform | Command |
|---|---|---|
| All remote (NIM) inference | Intel Mac, any | `uv pip install "nemo_retriever[remote,stores]"` |
| Local PDF ingestion, CPU | ARM Mac | `uv pip install "nemo_retriever[local-cpu,stores]"` |
| Local PDF ingestion, GPU | Linux + CUDA | `uv pip install "nemo_retriever[local-gpu,stores]"` |
| Full multimedia (GPU + audio + SVG) | Linux + CUDA | `uv pip install "nemo_retriever[local-gpu,multimedia,stores]"` |
| Everything | Linux + CUDA | `uv pip install "nemo_retriever[all]"` |

## What Each Extra Contains

### Base (always installed)
Pure framework infrastructure — no ML, no storage.

- `ray[data,serve]` — pipeline orchestration
- `pandas`, `numpy`, `tqdm` — data handling
- `fastapi`, `uvicorn`, `python-multipart` — service API
- `httpx`, `requests`, `urllib3` — HTTP clients
- `pydantic`, `typer`, `pyyaml`, `rich` — config, CLI, output
- `universal-pathlib`, `debugpy` — utilities
- `nv-ingest`, `nv-ingest-api`, `nv-ingest-client` — core ingest packages

### `[remote]`
Everything needed to run the full pipeline via remote NIM endpoints. No GPU, no local models.
Installs cleanly on Intel Macs.

- `pypdfium2` — PDF page splitting and rendering
- `pillow` — image I/O
- `nltk` — text splitting utilities
- `markitdown` — HTML/document-to-markdown conversion
- `langchain-nvidia-ai-endpoints` — LLM/SQL via NVIDIA NIM

### `[local-cpu]`
Adds local HuggingFace model inference. On Linux, torch resolves to a CUDA wheel from the
PyTorch index; on Mac it falls through to the PyPI CPU wheel.

- `transformers`, `tokenizers`, `accelerate==1.12.0` — HuggingFace model loading
- `torch~=2.9.1`, `torchvision` — PyTorch (CPU on Mac, CUDA on Linux)
- `einops`, `easydict`, `addict`, `timm`, `albumentations`, `scikit-learn` — model utilities
- `nemotron-page-elements-v3`, `nemotron-graphic-elements-v1`, `nemotron-table-structure-v1` — layout/table/chart detection
- `nemotron-ocr` — end-to-end OCR (Linux only)

### `[local-gpu]`
Adds GPU monitoring and fast LLM inference on top of `[local-cpu]`.

- `nvidia-ml-py` — GPU memory and utilization monitoring
- `vllm==0.16.0` — fast GPU-accelerated LLM inference (Linux only)

### `[multimedia]`
Specialized media format support. Can be combined with any inference tier.

- `soundfile`, `scipy` — audio file I/O and resampling for local Parakeet ASR
- `cairosvg` — SVG-to-image rendering (requires `libcairo` system library)

### `[stores]`
Vector, SQL, and graph storage backends. Independent of inference tier.

- `lancedb` — vector database for embedding storage and hybrid search
- `duckdb`, `duckdb-engine` — SQL execution on structured/tabular data
- `neo4j` — graph database for knowledge graph ingestion

### `[benchmarks]`
BEIR evaluation tools. Not needed for production use.

- `datasets` — HuggingFace datasets (used in `recall/beir.py`)
- `open-clip-torch` — OpenAI CLIP implementation

## Torch Index Configuration

`[tool.uv.sources]` uses a platform marker so the right torch wheel is resolved automatically:

```toml
torch = [
    { index = "pytorch-cu130", marker = "sys_platform == 'linux'" },
    # Mac: falls through to PyPI CPU wheel
]
```

No manual intervention needed — `uv` picks the right wheel per platform.

## Cleanups Applied

The following bugs in the original flat deps list were fixed:

- `accelerate` was listed twice (`>=1.1.0` and `==1.12.0`) — kept `==1.12.0` only
- `tqdm` was listed twice — deduplicated
- `typer` was listed twice — deduplicated
- `[svg]` extra merged into `[multimedia]` (cairosvg is a media format conversion tool)
@@ -25,11 +25,19 @@

import pandas as pd
import pypdfium2 as pdfium
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations
from unstructured_client.models import shared
from unstructured_client.utils import BackoffStrategy
from unstructured_client.utils import RetryConfig

try:
    from unstructured_client import UnstructuredClient
    from unstructured_client.models import operations
    from unstructured_client.models import shared
    from unstructured_client.utils import BackoffStrategy
    from unstructured_client.utils import RetryConfig
except ImportError:
    UnstructuredClient = None
    operations = None
    shared = None
    BackoffStrategy = None
    RetryConfig = None
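A module guarded this way still needs a runtime check before the optional dependency is used, so that a missing extra fails with an actionable error at the call site rather than an `AttributeError` on `None`. A minimal sketch of the pattern (the `fancylib` name and `process` call are illustrative, not from the PR):

```python
try:
    import fancylib  # hypothetical optional heavy dependency
except ImportError:
    fancylib = None


def use_fancylib(data):
    # Fail only when the optional feature is actually exercised,
    # and tell the user which extra to install.
    if fancylib is None:
        raise ImportError(
            "fancylib is required for this feature; "
            "install the matching optional extra"
        )
    return fancylib.process(data)
```

The import cost is paid once at module load, while callers that never touch the optional path are unaffected.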

from nv_ingest_api.internal.enums.common import AccessLevelEnum, DocumentTypeEnum
from nv_ingest_api.internal.enums.common import ContentTypeEnum
@@ -6,15 +6,19 @@
from typing import Optional

import backoff
import cv2
import numpy as np
import requests

from nv_ingest_api.internal.primitives.nim.model_interface.decorators import multiprocessing_cache
from nv_ingest_api.util.image_processing.transforms import pad_image, normalize_image
from nv_ingest_api.util.string_processing import generate_url, remove_url_endpoints

cv2.setNumThreads(1)
try:
    import cv2

    cv2.setNumThreads(1)
except ImportError:
    cv2 = None
logger = logging.getLogger(__name__)


6 changes: 4 additions & 2 deletions api/src/nv_ingest_api/util/detectors/language.py
@@ -3,8 +3,6 @@
# SPDX-License-Identifier: Apache-2.0


import langdetect

from nv_ingest_api.internal.enums.common import LanguageEnum
from nv_ingest_api.util.exception_handlers.detectors import langdetect_exception_handler

@@ -24,6 +22,10 @@ def detect_language(text):
LanguageEnum
A value from `LanguageEnum` detected language code.
"""
    try:
        import langdetect
    except ImportError:
        return LanguageEnum.UNKNOWN

    try:
        language = langdetect.detect(text)
14 changes: 9 additions & 5 deletions api/src/nv_ingest_api/util/exception_handlers/detectors.py
@@ -8,7 +8,10 @@
from typing import Callable
from typing import Dict

from langdetect.lang_detect_exception import LangDetectException
try:
    from langdetect.lang_detect_exception import LangDetectException as _LangDetectException
except ImportError:
    _LangDetectException = None

from nv_ingest_api.internal.enums.common import LanguageEnum

@@ -66,9 +69,10 @@ def langdetect_exception_handler(func: Callable, **kwargs: Dict[str, Any]) -> Ca
    def inner_function(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except LangDetectException as e:
            log_error_message = f"LangDetectException: {e}"
            logger.warning(log_error_message)
            return LanguageEnum.UNKNOWN
        except Exception as e:
            if _LangDetectException is not None and isinstance(e, _LangDetectException):
                logger.warning(f"LangDetectException: {e}")
                return LanguageEnum.UNKNOWN
            raise

    return inner_function
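The guard in the new handler exists because `isinstance(e, None)` would raise a `TypeError` when the optional library is absent, so the exception class must be checked for `None` first. The pattern generalizes to any optional dependency whose exception types may not be importable; a self-contained sketch (the `safe_detect` wrapper and `"UNKNOWN"` sentinel are illustrative, not the PR's exact code):

```python
import logging

logger = logging.getLogger(__name__)

try:
    from langdetect.lang_detect_exception import LangDetectException as _LangDetectException
except ImportError:
    _LangDetectException = None


def safe_detect(func, *args, **kwargs):
    """Run func, mapping the optional library's exception to a sentinel."""
    try:
        return func(*args, **kwargs)
    except Exception as e:
        # Guard the class before isinstance: it is None when langdetect
        # is not installed, and isinstance(e, None) would raise TypeError.
        if _LangDetectException is not None and isinstance(e, _LangDetectException):
            logger.warning("LangDetectException: %s", e)
            return "UNKNOWN"
        raise
```

Unrelated exceptions still propagate unchanged, whether or not the optional library is installed.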
15 changes: 10 additions & 5 deletions api/src/nv_ingest_api/util/image_processing/table_and_chart.py
@@ -8,7 +8,6 @@

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN


logger = logging.getLogger(__name__)
@@ -173,10 +172,14 @@ def convert_ocr_response_to_psuedo_markdown(bboxes, texts):
    )
    preds_df = preds_df.sort_values("y0")

    dbscan = DBSCAN(eps=10, min_samples=1)
    dbscan.fit(preds_df["y0"].values[:, None])

    preds_df["cluster"] = dbscan.labels_
    try:
        from sklearn.cluster import DBSCAN

        dbscan = DBSCAN(eps=10, min_samples=1)
        dbscan.fit(preds_df["y0"].values[:, None])
        preds_df["cluster"] = dbscan.labels_
    except ImportError:
        preds_df["cluster"] = (preds_df["y0"] / 10).round().astype(int)
    preds_df = preds_df.sort_values(["cluster", "x0"])

    results = ""
@@ -483,12 +486,14 @@ def reorder_boxes(boxes, texts, confs, mode="top_left", dbscan_eps=10):
    if dbscan_eps:
        do_naive_sorting = False
        try:
            from sklearn.cluster import DBSCAN

            dbscan = DBSCAN(eps=dbscan_eps, min_samples=1)
            dbscan.fit(df["y"].values[:, None])
            df["cluster"] = dbscan.labels_
            df["cluster_centers"] = df.groupby("cluster")["y"].transform("mean").astype(int)
            df = df.sort_values(["cluster_centers", "x"], ascending=[True, True], ignore_index=True)
        except ValueError:
        except (ImportError, ValueError):
            do_naive_sorting = True
    else:
        do_naive_sorting = True
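The `ImportError` fallback in these hunks replaces DBSCAN's density-based row grouping with simple rounding: dividing each `y0` by the epsilon (10) and rounding collapses vertically nearby text boxes into the same integer cluster id. A small self-contained sketch of the idea (the sample coordinates are illustrative, not from the PR):

```python
import pandas as pd

# Text boxes with top-edge y0 coordinates; boxes on the same visual
# row have close y0 values.
preds_df = pd.DataFrame({"y0": [11, 9, 12, 31, 29], "x0": [5, 1, 9, 2, 7]})

# Fallback clustering: round y0 to the nearest multiple of 10, so
# nearby rows collapse into the same cluster id without scikit-learn.
preds_df["cluster"] = (preds_df["y0"] / 10).round().astype(int)

# Reading order: top-to-bottom by cluster, left-to-right within a row.
preds_df = preds_df.sort_values(["cluster", "x0"])
print(preds_df["cluster"].tolist())  # → [1, 1, 1, 3, 3]
```

Unlike DBSCAN, this can split a true row whose boxes straddle a rounding boundary (e.g. `y0` of 14 and 16), so it is a best-effort approximation for slim installs rather than an equivalent replacement.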