@robodev-r2d2 robodev-r2d2 commented Oct 18, 2025

Adds an optional semantic chunker to the admin pipeline and centralizes embedding implementations in rag-core-lib (with re-exports in rag-core-api). Updates Helm to support chunker selection and preloads NLTK data; bumps deps.

Key points:

  • New SemanticTextChunker with min/max enforcement & sentence-aware rebalancing.

  • Select via CHUNKER_CLASS_TYPE_CHUNKER_TYPE (recursive default, semantic optional).

  • Embeddings moved to rag-core-lib (STACKIT/Ollama), re-exported in rag-core-api.

  • Helm: new config/secret wiring; NLTK data fetched at startup; values.yaml gains CHUNKER_* knobs.

  • Deps: add langchain-experimental, nltk; bump fastapi, uvicorn, langfuse. Tests added for the semantic chunker.

  • Minor: de-dup related IDs in page summaries.

@a-klos a-klos added the codex Vibe coded label Oct 18, 2025
@a-klos a-klos marked this pull request as draft October 18, 2025 14:54
- Added optional max/min chunk size enforcement to SemanticTextChunker using RecursiveCharacterTextSplitter.
- Introduced new parameters: `breakpoint_threshold_amount`, `overlap`, and `recursive_text_splitter`.
- Implemented logic to rebalance chunks to meet minimum size requirements.
- Updated chunking logic to handle oversized chunks and ensure they are split appropriately.
- Enhanced documentation for clarity on new features and parameters.
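
A minimal sketch of the max/min enforcement and rebalancing idea described above, for reviewers who want a mental model. It is not the PR's implementation: the PR uses `RecursiveCharacterTextSplitter` for oversized chunks, while this sketch swaps in a plain whitespace splitter, and the names `split_oversized` and `enforce_sizes` are hypothetical.

```python
from typing import List


def split_oversized(chunk: str, max_size: int) -> List[str]:
    """Split a chunk into pieces of at most max_size characters, on whitespace where possible."""
    pieces: List[str] = []
    current = ""
    for word in chunk.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_size:
            current = candidate
            continue
        if current:
            pieces.append(current)
        # A single word longer than max_size is hard-cut.
        while len(word) > max_size:
            pieces.append(word[:max_size])
            word = word[max_size:]
        current = word
    if current:
        pieces.append(current)
    return pieces


def enforce_sizes(chunks: List[str], min_size: int, max_size: int) -> List[str]:
    """Cap chunks at max_size, then merge undersized neighbours when the result still fits."""
    # Pass 1: split anything that exceeds max_size.
    capped = [piece for chunk in chunks for piece in split_oversized(chunk, max_size)]

    # Pass 2: merge a chunk into its predecessor if either is below min_size
    # and the merged text (joined by one space) does not exceed max_size.
    balanced: List[str] = []
    for chunk in capped:
        too_small = len(chunk) < min_size or (balanced and len(balanced[-1]) < min_size)
        if balanced and too_small and len(balanced[-1]) + 1 + len(chunk) <= max_size:
            balanced[-1] = f"{balanced[-1]} {chunk}"
        else:
            balanced.append(chunk)
    return balanced


result = enforce_sizes(
    ["aaaaa", "hello world this is a long sentence", "ok"],
    min_size=4,
    max_size=12,
)
```

Note that a chunk can still end up below `min_size` when merging it would break the `max_size` cap; in this sketch the maximum takes priority.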

fix: Ensure related metadata is unique in PageSummaryEnhancer

- Modified PageSummaryEnhancer to ensure the "related" metadata list contains unique IDs.
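
For illustration, one common way to de-duplicate a list of IDs while keeping first-seen order. This is a sketch only; the helper name `unique_in_order` and the sample metadata are hypothetical and not taken from `PageSummaryEnhancer`.

```python
from typing import Iterable, List


def unique_in_order(ids: Iterable[str]) -> List[str]:
    """Drop duplicate IDs, keeping the first occurrence of each."""
    # dict preserves insertion order, so this keeps the original ordering.
    return list(dict.fromkeys(ids))


# Hypothetical metadata with repeated related IDs.
metadata = {"related": ["a", "b", "a", "c", "b"]}
metadata["related"] = unique_in_order(metadata["related"])
```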

refactor: Update ChunkerSettings to reflect new chunking parameters

- Removed deprecated parameters and added `breakpoint_threshold_amount`, `buffer_size`, and `min_size`.
- Adjusted validation logic to accommodate changes in chunking strategy.

chore: Update dependencies and improve project structure

- Updated FastAPI, langchain, and other dependencies to their latest versions.
- Introduced ChunkerType enumeration for better chunker type management.
- Created ChunkerClassTypeSettings for environment-based configuration of chunker implementations.
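
A rough sketch of how environment-based chunker selection could look, assuming the enum members map to the `recursive` and `semantic` values named in the PR description and that `CHUNKER_CLASS_TYPE_CHUNKER_TYPE` is read directly from the environment. The member names and the helper `chunker_type_from_env` are hypothetical; the actual `ChunkerClassTypeSettings` may be implemented differently.

```python
import os
from enum import Enum
from typing import Mapping, Optional


class ChunkerType(str, Enum):
    # Member names and values are assumptions based on the PR description.
    RECURSIVE = "recursive"
    SEMANTIC = "semantic"


def chunker_type_from_env(env: Optional[Mapping[str, str]] = None) -> ChunkerType:
    """Resolve the chunker type from the environment, defaulting to recursive."""
    env = os.environ if env is None else env
    raw = env.get("CHUNKER_CLASS_TYPE_CHUNKER_TYPE", ChunkerType.RECURSIVE.value)
    # Raises ValueError for unknown values, so misconfiguration fails early.
    return ChunkerType(raw.strip().lower())


default_type = chunker_type_from_env({})
semantic_type = chunker_type_from_env({"CHUNKER_CLASS_TYPE_CHUNKER_TYPE": "semantic"})
```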

test: Add comprehensive tests for chunking behavior

- Implemented tests to validate max/min chunk size enforcement and rebalance logic.
- Updated existing tests to reflect the changed parameter names and functionality.
@a-klos a-klos marked this pull request as ready for review October 27, 2025 08:36
@a-klos a-klos self-requested a review October 27, 2025 08:49
@a-klos a-klos changed the title from "feat: centralize embeddings in rag-core-lib and add semantic chunker" to "feat: add semantic chunker & centralize embeddings in rag-core-lib; helm + deps updates" Oct 27, 2025
@a-klos a-klos merged commit 66570f9 into stackitcloud:main Oct 27, 2025
13 checks passed