
Commit c69f047

Merge upstream/develop

Signed-off-by: Kurt Heiss <kheiss@nvidia.com>
Made-with: Cursor

2 parents: 9826944 + e09ddb7

125 files changed
Lines changed: 5681 additions & 2789 deletions


.github/workflows/ci-pipeline.yml
Lines changed: 179 additions & 16 deletions (large diff not rendered by default)

README.md
Lines changed: 4 additions & 4 deletions

@@ -99,7 +99,7 @@ This modular design ensures efficient query processing, accurate retrieval of in

  - Response Generation (Inference)

- - [NVIDIA NIM llama-3.3-nemotron-super-49b-v1.5](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1_5)
+ - [NVIDIA NIM nemotron-3-super-120b-a12b](https://build.nvidia.com/nvidia/nemotron-3-super-120b-a12b)

  - Retriever and Extraction Models

@@ -108,7 +108,7 @@ This modular design ensures efficient query processing, accurate retrieval of in
  - [NeMo Retriever Page Elements NIM](https://build.nvidia.com/nvidia/nemotron-page-elements-v3)
  - [NeMo Retriever Table Structure NIM](https://build.nvidia.com/nvidia/nemotron-table-structure-v1)
  - [NeMo Retriever Graphic Elements NIM](https://build.nvidia.com/nvidia/nemotron-graphic-elements-v1)
- - [NeMo Retriever OCR NIM](https://build.nvidia.com/nvidia/nemoretriever-ocr)
+ - [Nemotron OCR NIM](https://build.nvidia.com/nvidia/nemotron-ocr)

  - Optional NIMs

@@ -124,7 +124,7 @@ This modular design ensures efficient query processing, accurate retrieval of in

  - **RAG Orchestrator Server** – Coordinates interactions between the user, retrievers, vector database, and inference models, ensuring multi-turn and context-aware query handling. This is [LangChain](https://www.langchain.com/)-based.

- - **Vector Database (accelerated with NVIDIA cuVS)** – Stores and searches embeddings at scale with GPU-accelerated indexing and retrieval for low-latency performance. You can use [Milvus Vector Database](https://milvus.io/) or [Elasticsearch](https://www.elastic.co/elasticsearch/vector-database).
+ - **Vector Database (accelerated with NVIDIA cuVS)** – Stores and searches embeddings at scale with GPU-accelerated indexing and retrieval for low-latency performance. The default is [Elasticsearch](https://www.elastic.co/elasticsearch/vector-database). Another alternative is [Milvus](https://milvus.io/) (GPU-accelerated).

  - **NeMo Retriever Extraction** – A high-performance ingestion microservice for parsing multimodal content. For more information about the ingestion pipeline, see [NeMo Retriever Extraction Overview](https://docs.nvidia.com/nemo/retriever/latest/extraction/overview/)

@@ -229,5 +229,5 @@ The following models that are built with Llama are governed by the Llama 3.2 Com

  ## Additional Information

- The [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/) for the llama-3.1-nemotron-nano-vl-8b-v1, llama-3.1-nemoguard-8b-content-safety and llama-3.1-nemoguard-8b-topic-control models. The [Llama 3.2 Community License Agreement](https://www.llama.com/llama3_2/license/) for the nvidia/llama-nemotron-embed-1b-v2, nvidia/llama-nemotron-rerank-1b-v2 and llama-3.2-nemoretriever-1b-vlm-embed-v1 models. The [Llama 3.3 Community License Agreement](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE) for the llama-3.3-nemotron-super-49b-v1.5 models. Built with Llama. Apache 2.0 for NVIDIA Ingest and for the nemoretriever-page-elements-v2, nemotron-table-structure-v1, nemotron-graphic-elements-v1, paddleocr and nemoretriever-ocr-v1 models.
+ The [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/) for the llama-3.1-nemotron-nano-vl-8b-v1, llama-3.1-nemoguard-8b-content-safety and llama-3.1-nemoguard-8b-topic-control models. The [Llama 3.2 Community License Agreement](https://www.llama.com/llama3_2/license/) for the nvidia/llama-nemotron-embed-1b-v2, nvidia/llama-nemotron-rerank-1b-v2 and llama-3.2-nemoretriever-1b-vlm-embed-v1 models. The [Llama 3.3 Community License Agreement](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE) for the llama-3.3-nemotron-super-49b-v1.5 models. Built with Llama. Apache 2.0 for NVIDIA Ingest and for the nemoretriever-page-elements-v2, nemotron-table-structure-v1, nemotron-graphic-elements-v1, paddleocr and nemotron-ocr-v1 models.

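The default response-generation model referenced in the README changes from llama-3.3-nemotron-super-49b-v1.5 to nemotron-3-super-120b-a12b. As an illustrative sketch only (assuming the NVIDIA-hosted API catalog is used and an API key is exported as NVIDIA_API_KEY; the chat/completions route and key variable are assumptions, not part of this commit), the new model can be exercised through the OpenAI-compatible endpoint:

# Hypothetical smoke test against the hosted API catalog; adjust key handling to your deployment.
curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-super-120b-a12b",
    "messages": [{"role": "user", "content": "Summarize what a RAG pipeline does in one sentence."}],
    "max_tokens": 256
  }'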
deploy/compose/.env
Lines changed: 9 additions & 9 deletions

@@ -24,10 +24,10 @@ export APP_FILTEREXPRESSIONGENERATOR_SERVERURL=nim-llm:8000
  export SUMMARY_LLM_SERVERURL=nim-llm:8000
  export APP_EMBEDDINGS_SERVERURL=nemotron-embedding-ms:8000/v1
  export APP_RANKING_SERVERURL=nemotron-ranking-ms:8000
- export OCR_GRPC_ENDPOINT=nemoretriever-ocr:8001
- export OCR_HTTP_ENDPOINT=http://nemoretriever-ocr:8000/v1/infer
+ export OCR_GRPC_ENDPOINT=nemotron-ocr:8001
+ export OCR_HTTP_ENDPOINT=http://nemotron-ocr:8000/v1/infer
  export OCR_INFER_PROTOCOL=grpc
- export OCR_MODEL_NAME=scene_text_ensemble
+ export OCR_MODEL_NAME=pipeline
  export YOLOX_GRPC_ENDPOINT=page-elements:8001
  export YOLOX_INFER_PROTOCOL=grpc
  export YOLOX_GRAPHIC_ELEMENTS_GRPC_ENDPOINT=graphic-elements:8001

@@ -41,23 +41,23 @@ export YOLOX_TABLE_STRUCTURE_INFER_PROTOCOL=grpc

  # export APP_EMBEDDINGS_SERVERURL=https://integrate.api.nvidia.com/v1
  # export APP_LLM_SERVERURL=""
- # export APP_LLM_MODELNAME=nvidia/llama-3.3-nemotron-super-49b-v1.5
- # export APP_FILTEREXPRESSIONGENERATOR_MODELNAME=nvidia/llama-3.3-nemotron-super-49b-v1.5
+ # export APP_LLM_MODELNAME=nvidia/nemotron-3-super-120b-a12b
+ # export APP_FILTEREXPRESSIONGENERATOR_MODELNAME=nvidia/nemotron-3-super-120b-a12b
  # export APP_FILTEREXPRESSIONGENERATOR_SERVERURL=""
- # export SUMMARY_LLM="nvidia/llama-3.3-nemotron-super-49b-v1.5"
+ # export SUMMARY_LLM="nvidia/nemotron-3-super-120b-a12b"
  # export APP_RANKING_SERVERURL=""
  # export SUMMARY_LLM_SERVERURL=""
- # export OCR_HTTP_ENDPOINT=https://ai.api.nvidia.com/v1/cv/nvidia/nemoretriever-ocr
+ # export OCR_HTTP_ENDPOINT=https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-ocr-v1
  # export OCR_INFER_PROTOCOL=http
- # export OCR_MODEL_NAME=scene_text_ensemble
+ # export OCR_MODEL_NAME=pipeline
  # export YOLOX_HTTP_ENDPOINT=https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-page-elements-v3
  # export YOLOX_INFER_PROTOCOL=http
  # export YOLOX_GRAPHIC_ELEMENTS_HTTP_ENDPOINT=https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-graphic-elements-v1
  # export YOLOX_GRAPHIC_ELEMENTS_INFER_PROTOCOL=http
  # export YOLOX_TABLE_STRUCTURE_HTTP_ENDPOINT=https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-table-structure-v1
  # export YOLOX_TABLE_STRUCTURE_INFER_PROTOCOL=http
  # export APP_QUERYREWRITER_SERVERURL=""
- # export APP_QUERYREWRITER_MODELNAME="nvidia/llama-3.3-nemotron-super-49b-v1.5"
+ # export APP_QUERYREWRITER_MODELNAME="nvidia/nemotron-3-super-120b-a12b"


  # ==========================

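The commented-out block in this .env file lists the NVIDIA-hosted alternatives to the self-hosted NIM endpoints. A minimal sketch (assuming the deployment already has the required NVIDIA API credentials configured, which this file does not show): switching OCR extraction from the local nemotron-ocr container to the build.nvidia.com hosted endpoint amounts to exporting the hosted values before bringing up the stack. The same pattern applies to the YOLOX page-elements, graphic-elements, and table-structure endpoints.

# Use the hosted Nemotron OCR NIM instead of the local container (values taken from the commented defaults above)
export OCR_HTTP_ENDPOINT=https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-ocr-v1
export OCR_INFER_PROTOCOL=http
export OCR_MODEL_NAME=pipeline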
deploy/compose/docker-compose-ingestor-server.yaml
Lines changed: 14 additions & 14 deletions

@@ -30,10 +30,10 @@ services:
  ##===Vector DB specific configurations===
  # URL on which vectorstore is hosted
  # For custom operators, point to your service (e.g., http://your-custom-vdb:1234)
- APP_VECTORSTORE_URL: ${APP_VECTORSTORE_URL:-http://milvus:19530}
+ APP_VECTORSTORE_URL: ${APP_VECTORSTORE_URL:-http://elasticsearch:9200}
  # Type of vectordb used to store embedding. Supported built-ins: "milvus", "elasticsearch".
  # You can also provide your custom value (e.g., "your_custom_vdb") when you register it in `_get_vdb_op`.
- APP_VECTORSTORE_NAME: ${APP_VECTORSTORE_NAME:-"milvus"}
+ APP_VECTORSTORE_NAME: ${APP_VECTORSTORE_NAME:-"elasticsearch"}

  # Type of vectordb search to be used
  APP_VECTORSTORE_SEARCHTYPE: ${APP_VECTORSTORE_SEARCHTYPE:-"dense"} # Can be dense or hybrid

@@ -44,10 +44,10 @@ services:
  # Weight for sparse vector search in case of "weighted" Hybrid Search
  APP_VECTORSTORE_SPARSE_WEIGHT: ${APP_VECTORSTORE_SPARSE_WEIGHT:-0.5}

- # Boolean to enable GPU index for milvus vectorstore specific to nvingest
- APP_VECTORSTORE_ENABLEGPUINDEX: ${APP_VECTORSTORE_ENABLEGPUINDEX:-True}
- # Boolean to control GPU search for milvus vectorstore specific to nvingest
- APP_VECTORSTORE_ENABLEGPUSEARCH: ${APP_VECTORSTORE_ENABLEGPUSEARCH:-True}
+ # Milvus only (ignored for Elasticsearch). Set True when using Milvus + GPU.
+ APP_VECTORSTORE_ENABLEGPUINDEX: ${APP_VECTORSTORE_ENABLEGPUINDEX:-False}
+ # Milvus only (ignored for Elasticsearch). Set True when using Milvus + GPU.
+ APP_VECTORSTORE_ENABLEGPUSEARCH: ${APP_VECTORSTORE_ENABLEGPUSEARCH:-False}
  # Username for vector store
  APP_VECTORSTORE_USERNAME: ${APP_VECTORSTORE_USERNAME:-""}
  APP_VECTORSTORE_PASSWORD: ${APP_VECTORSTORE_PASSWORD:-""}

@@ -124,7 +124,7 @@ services:
  ENABLE_CITATIONS: ${ENABLE_CITATIONS:-True}

  # Choose the summary model to use for document summary
- SUMMARY_LLM: ${SUMMARY_LLM:-nvidia/llama-3.3-nemotron-super-49b-v1.5}
+ SUMMARY_LLM: ${SUMMARY_LLM:-nvidia/nemotron-3-super-120b-a12b}
  SUMMARY_LLM_SERVERURL: ${SUMMARY_LLM_SERVERURL-${APP_LLM_SERVERURL-"nim-llm:8000"}}
  SUMMARY_LLM_MAX_CHUNK_LENGTH: ${SUMMARY_LLM_MAX_CHUNK_LENGTH:-9000}
  SUMMARY_CHUNK_OVERLAP: ${SUMMARY_CHUNK_OVERLAP:-400}

@@ -140,15 +140,15 @@ services:
  REDIS_DB: ${REDIS_DB:-0}
  ENABLE_REDIS_BACKEND: ${ENABLE_REDIS_BACKEND:-False}

- # Bulk upload to MinIO
- ENABLE_MINIO_BULK_UPLOAD: ${ENABLE_MINIO_BULK_UPLOAD:-True}
  TEMP_DIR: ${TEMP_DIR:-/tmp-data}
  INGESTOR_SERVER_DATA_DIR: ${INGESTOR_SERVER_DATA_DIR:-/data/}

  # NV-Ingest Batch Mode Configurations
  NV_INGEST_FILES_PER_BATCH: ${NV_INGEST_FILES_PER_BATCH:-16}
  NV_INGEST_CONCURRENT_BATCHES: ${NV_INGEST_CONCURRENT_BATCHES:-4}
  ENABLE_NV_INGEST_DYNAMIC_BATCHING: ${ENABLE_NV_INGEST_DYNAMIC_BATCHING:-True}
+ # Max memory budget (MB) for a single ingestion job; used for dynamic batch sizing
+ INGESTION_MAX_MEMORY_BUDGET_MB: ${INGESTION_MAX_MEMORY_BUDGET_MB:-1024}

  # Tracing
  APP_TRACING_ENABLED: ${APP_TRACING_ENABLED:-"False"}

@@ -169,7 +169,7 @@ services:
  - "6379:6379"

  nv-ingest-ms-runtime:
- image: nvcr.io/nvidia/nemo-microservices/nv-ingest:26.1.2
+ image: nvcr.io/nvidia/nemo-microservices/nv-ingest:26.3.0
  # cpuset: "0-15" # Uncomment to restrict this container to CPU cores 0–15
  shm_size: 40gb # Should be at minimum 30% of assigned memory per Ray documentation
  volumes:

@@ -220,12 +220,12 @@ services:
  - NV_INGEST_MAX_UTIL=${NV_INGEST_MAX_UTIL:-48}
  - OTEL_EXPORTER_OTLP_ENDPOINT=otel-collector:4317
  # Self-hosted ocr endpoints.
- - OCR_GRPC_ENDPOINT=${OCR_GRPC_ENDPOINT:-nemoretriever-ocr:8001}
- - OCR_HTTP_ENDPOINT=${OCR_HTTP_ENDPOINT:-http://nemoretriever-ocr:8000/v1/infer}
+ - OCR_GRPC_ENDPOINT=${OCR_GRPC_ENDPOINT:-nemotron-ocr:8001}
+ - OCR_HTTP_ENDPOINT=${OCR_HTTP_ENDPOINT:-http://nemotron-ocr:8000/v1/infer}
  - OCR_INFER_PROTOCOL=${OCR_INFER_PROTOCOL:-grpc}
- - OCR_MODEL_NAME=${OCR_MODEL_NAME:-scene_text_ensemble}
+ - OCR_MODEL_NAME=${OCR_MODEL_NAME:-pipeline}
  # build.nvidia.com hosted ocr endpoints.
- #- OCR_HTTP_ENDPOINT=https://ai.api.nvidia.com/v1/cv/nvidia/nemoretriever-ocr
+ #- OCR_HTTP_ENDPOINT=https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-ocr-v1
  #- OCR_INFER_PROTOCOL=http
  - PDF_SPLIT_PAGE_COUNT=${PDF_SPLIT_PAGE_COUNT:-32}
  - REDIS_INGEST_TASK_QUEUE=ingest_task_queue

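Because the compose file reads these settings with ${VAR:-default} substitution, the new Elasticsearch defaults can be overridden from the shell without editing the YAML. A minimal sketch for running the ingestor against Milvus with GPU indexing instead (assuming a Milvus service is still reachable at its previous default address, http://milvus:19530; the exact compose invocation may differ per deployment):

# Revert to the Milvus vector store and re-enable the GPU index/search flags for this session
export APP_VECTORSTORE_URL=http://milvus:19530
export APP_VECTORSTORE_NAME=milvus
export APP_VECTORSTORE_ENABLEGPUINDEX=True
export APP_VECTORSTORE_ENABLEGPUSEARCH=True
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d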
deploy/compose/docker-compose-rag-server.yaml
Lines changed: 19 additions & 15 deletions

@@ -31,10 +31,10 @@ services:
  ##===Vector DB specific configurations===
  # URL on which vectorstore is hosted
  # For custom operators, point to your service (e.g., http://your-custom-vdb:1234)
- APP_VECTORSTORE_URL: ${APP_VECTORSTORE_URL:-http://milvus:19530}
+ APP_VECTORSTORE_URL: ${APP_VECTORSTORE_URL:-http://elasticsearch:9200}
  # Type of vectordb used to store embedding. Supported built-ins: "milvus", "elasticsearch".
  # You can also provide your custom value (e.g., "your_custom_vdb") when you register it in `_get_vdb_op`.
- APP_VECTORSTORE_NAME: ${APP_VECTORSTORE_NAME:-"milvus"}
+ APP_VECTORSTORE_NAME: ${APP_VECTORSTORE_NAME:-"elasticsearch"}
  # Type of index to be used for vectorstore
  APP_VECTORSTORE_INDEXTYPE: ${APP_VECTORSTORE_INDEXTYPE:-"GPU_CAGRA"}

@@ -47,8 +47,8 @@ services:
  # Weight for sparse vector search in case of "weighted" Hybrid Search
  APP_VECTORSTORE_SPARSE_WEIGHT: ${APP_VECTORSTORE_SPARSE_WEIGHT:-0.5}

- # Boolean to control GPU search for milvus vectorstore specific to rag-server
- APP_VECTORSTORE_ENABLEGPUSEARCH: ${APP_VECTORSTORE_ENABLEGPUSEARCH:-True}
+ # Milvus only (ignored for Elasticsearch). Set True when using Milvus + GPU.
+ APP_VECTORSTORE_ENABLEGPUSEARCH: ${APP_VECTORSTORE_ENABLEGPUSEARCH:-False}
  # ef: Parameter controlling query time/accuracy trade-off. Higher ef leads to more accurate but slower search.
  APP_VECTORSTORE_EF: ${APP_VECTORSTORE_EF:-100} # Must be greater or equal to VECTOR_DB_TOPK
  # Username for vector store

@@ -66,27 +66,33 @@ services:
  # Top K from vector DB, which goes as input to reranker model if enabled, else goes to LLM prompt
  VECTOR_DB_TOPK: ${VECTOR_DB_TOPK:-100}

+ # Fetch full page context: when True, fetches ALL chunks for retrieved pages and organizes by page
+ # Useful for PDFs where we have page numbers in file
+ APP_FETCH_FULL_PAGE_CONTEXT: ${APP_FETCH_FULL_PAGE_CONTEXT:-false}
+ # N pages before/after each retrieved page (0=disabled, 1=+/-1 page). Requires APP_FETCH_FULL_PAGE_CONTEXT=true
+ APP_FETCH_NEIGHBORING_PAGES: ${APP_FETCH_NEIGHBORING_PAGES:-0}
+
  ##===LLM Model specific configurations===
- APP_LLM_MODELNAME: ${APP_LLM_MODELNAME:-"nvidia/llama-3.3-nemotron-super-49b-v1.5"}
+ APP_LLM_MODELNAME: ${APP_LLM_MODELNAME:-"nvidia/nemotron-3-super-120b-a12b"}
  # url on which llm model is hosted. If "", Nvidia hosted API is used
  APP_LLM_SERVERURL: ${APP_LLM_SERVERURL-"nim-llm:8000"}
  # LLM model parameters
- LLM_MAX_TOKENS: ${LLM_MAX_TOKENS:-32768}
+ LLM_MAX_TOKENS: ${LLM_MAX_TOKENS:-131072}
  LLM_TEMPERATURE: ${LLM_TEMPERATURE:-0}
  LLM_TOP_P: ${LLM_TOP_P:-1.0}

- # Reasoning configuration (supported by Nemotron 3 and other reasoning models)
- LLM_ENABLE_THINKING: ${LLM_ENABLE_THINKING:-false}
- LLM_REASONING_BUDGET: ${LLM_REASONING_BUDGET:-0}
- LLM_LOW_EFFORT: ${LLM_LOW_EFFORT:-false}
+ # Reasoning configuration (enabled by default for Nemotron 3 Super)
+ LLM_ENABLE_THINKING: ${LLM_ENABLE_THINKING:-true}
+ LLM_REASONING_BUDGET: ${LLM_REASONING_BUDGET:-256}
+ LLM_LOW_EFFORT: ${LLM_LOW_EFFORT:-true}

  ##===Query Rewriter Model specific configurations===
- APP_QUERYREWRITER_MODELNAME: ${APP_QUERYREWRITER_MODELNAME:-"nvidia/llama-3.3-nemotron-super-49b-v1.5"}
+ APP_QUERYREWRITER_MODELNAME: ${APP_QUERYREWRITER_MODELNAME:-"nvidia/nemotron-3-super-120b-a12b"}
  # url on which query rewriter model is hosted. If "", Nvidia hosted API is used
  APP_QUERYREWRITER_SERVERURL: ${APP_QUERYREWRITER_SERVERURL-"nim-llm:8000"}

  ##===Filter Expression Generator Model specific configurations===
- APP_FILTEREXPRESSIONGENERATOR_MODELNAME: ${APP_FILTEREXPRESSIONGENERATOR_MODELNAME:-"nvidia/llama-3.3-nemotron-super-49b-v1.5"}
+ APP_FILTEREXPRESSIONGENERATOR_MODELNAME: ${APP_FILTEREXPRESSIONGENERATOR_MODELNAME:-"nvidia/nemotron-3-super-120b-a12b"}
  # url on which filter expression generator model is hosted. If "", Nvidia hosted API is used
  APP_FILTEREXPRESSIONGENERATOR_SERVERURL: ${APP_FILTEREXPRESSIONGENERATOR_SERVERURL-"nim-llm:8000"}
  # enable filter expression generator for natural language to filter expression conversion

@@ -189,7 +195,7 @@ services:
  # Minimum groundedness score threshold (0-2)
  RESPONSE_GROUNDEDNESS_THRESHOLD: ${RESPONSE_GROUNDEDNESS_THRESHOLD:-1}
  # reflection llm
- REFLECTION_LLM: ${REFLECTION_LLM:-"nvidia/llama-3.3-nemotron-super-49b-v1.5"}
+ REFLECTION_LLM: ${REFLECTION_LLM:-"nvidia/nemotron-3-super-120b-a12b"}
  # reflection llm server url. If "", Nvidia hosted API is used
  REFLECTION_LLM_SERVERURL: ${REFLECTION_LLM_SERVERURL-"nim-llm:8000"}
  # enable iterative query decomposition

@@ -220,7 +226,6 @@ services:
  # Environment variables for Vite build
  VITE_API_CHAT_URL: ${VITE_API_CHAT_URL:-http://rag-server:8081/v1}
  VITE_API_VDB_URL: ${VITE_API_VDB_URL:-http://ingestor-server:8082/v1}
- VITE_MILVUS_URL: http://milvus:19530
  DOWNLOAD_LEGAL_COMPLIANCE: ${DOWNLOAD_LEGAL_COMPLIANCE:-false}
  ports:
  - "8090:3000"

@@ -230,7 +235,6 @@ services:
  # Runtime environment variables for Vite
  VITE_API_CHAT_URL: ${VITE_API_CHAT_URL:-http://rag-server:8081/v1}
  VITE_API_VDB_URL: ${VITE_API_VDB_URL:-http://ingestor-server:8082/v1}
- VITE_MILVUS_URL: http://milvus:19530
  depends_on:
  - rag-server


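The reasoning settings above are now enabled by default for the Nemotron 3 Super model, and the new page-context options default to off. Both can still be tuned per deployment through the same ${VAR:-default} substitution. A minimal sketch (the override values here are only illustrative, not recommendations from this commit) that turns thinking off for latency-sensitive serving and enables full-page context with one neighboring page:

# Override the new reasoning and page-context defaults for this session
export LLM_ENABLE_THINKING=false
export LLM_REASONING_BUDGET=0
export APP_FETCH_FULL_PAGE_CONTEXT=true
export APP_FETCH_NEIGHBORING_PAGES=1
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d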