@robodev-r2d2
Owner

Summary

  • add an explicit Confluence parameters schema to the extractor OpenAPI spec so CQL support is reflected in generated clients
  • document the Confluence loader options, including the optional CQL filter, in the libraries README
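
To illustrate what an optional CQL filter could look like from a caller's side, here is a minimal sketch that serializes Confluence loader options into key-value pairs and omits unset optional values. The class `ConfluenceParams`, the helper `build_confluence_params`, and the field names `url`, `token`, `space_key`, and `cql` are assumptions made for illustration; they are not taken from the OpenAPI spec in this PR.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ConfluenceParams:
    """Hypothetical parameter set; field names are illustrative assumptions."""

    url: str
    token: str
    space_key: Optional[str] = None
    cql: Optional[str] = None  # optional CQL filter, e.g. "type = page"


def build_confluence_params(params: ConfluenceParams) -> List[Dict[str, str]]:
    """Serialize the parameters as key-value pairs, skipping unset optional fields."""
    candidates = {
        "url": params.url,
        "token": params.token,
        "space_key": params.space_key,
        "cql": params.cql,
    }
    return [
        {"key": key, "value": value}
        for key, value in candidates.items()
        if value is not None
    ]
```

Leaving out unset fields keeps requests without a CQL filter identical to what a client would have sent before the filter existed.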

Testing

  • pytest libs/extractor-api-lib/tests -k confluence (fails: ModuleNotFoundError: No module named 'langchain_core')

https://chatgpt.com/codex/tasks/task_e_68f3a27830648326835fe507c2685ad7

robodev-r2d2 and others added 5 commits October 20, 2025 15:40
… paths in Tiltfile (stackitcloud#143)

Summary:
This PR resolves Linux dev environment issues by standardizing the
Poetry virtualenv location and ensuring dev dependencies are reliably
installed when building with dev=1. It also adjusts the Tiltfile so that
Tilt performs clean reloads when changes are triggered.

Changes:
- Standardize Poetry virtualenv to /opt/.venv across services.
- Set POETRY_VIRTUALENVS_CREATE=false and
POETRY_VIRTUALENVS_IN_PROJECT=false to reuse the prebuilt venv.
- Export VIRTUAL_ENV and prepend /opt/.venv/bin to PATH in both build
and runtime stages, including for nonroot.
- Add cache-busting tied to the dev build arg to force correct
installation of dev dependencies.
- Clean up redundant PATH exports and ensure /etc/environment reflects
the unified venv path.
- Adjust the Tiltfile sync and ignore settings used during image builds.
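
The environment described in the changes above can be sanity-checked from inside a container. The following is a minimal sketch, assuming the variables are set exactly as listed; the function name `venv_problems` is hypothetical and is not part of this repository.

```python
from typing import List, Mapping

EXPECTED_VENV = "/opt/.venv"


def venv_problems(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the env matches."""
    problems = []
    if env.get("VIRTUAL_ENV") != EXPECTED_VENV:
        problems.append("VIRTUAL_ENV is not " + EXPECTED_VENV)
    # The venv's bin directory is expected to be the first PATH entry.
    path_entries = env.get("PATH", "").split(":")
    if path_entries[0] != EXPECTED_VENV + "/bin":
        problems.append("PATH does not start with " + EXPECTED_VENV + "/bin")
    for name in ("POETRY_VIRTUALENVS_CREATE", "POETRY_VIRTUALENVS_IN_PROJECT"):
        if env.get(name, "").lower() != "false":
            problems.append(name + " is not set to false")
    return problems
```

Calling `venv_problems(os.environ)` in both the build and runtime stages (and as the nonroot user) would show whether the unified path is visible everywhere.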

Scope:
- services/admin-backend/Dockerfile
- services/document-extractor/Dockerfile
- services/mcp-server/Dockerfile
- services/rag-backend/Dockerfile

Fixes: stackitcloud#142

---------

Co-authored-by: Andreas Klos <[email protected]>
Co-authored-by: Andreas Klos <[email protected]>
Add copilot instructions file for customized and better copilot
generation.

---------

Co-authored-by: Andreas Klos <[email protected]>
Add codex instructions file for customized and better codex generation.
Adjust the rephrasing chain prompt, increase fault tolerance, and adjust the
chat graph connections so that the nodes are executed sequentially. Adjust the determine-language node in the answer graph: it is now LLM-based, falls back to langdetect if the LLM fails, and falls back to 'en' if langdetect fails.
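
The fallback order described for the determine-language node (LLM first, then langdetect, then 'en') can be sketched as below. This is an illustration only: the function `determine_language` and its callable parameters are hypothetical, and the real node in the answer graph may be structured differently.

```python
from typing import Callable

DEFAULT_LANGUAGE = "en"


def determine_language(
    text: str,
    llm_detect: Callable[[str], str],
    lib_detect: Callable[[str], str],
    default: str = DEFAULT_LANGUAGE,
) -> str:
    """Try each detector in order; fall back to the default if all of them fail."""
    for detector in (llm_detect, lib_detect):
        try:
            result = detector(text)
        except Exception:
            # A failing detector must not break the graph; try the next one.
            continue
        if result and result.strip():
            return result.strip().lower()
    return default
```

In practice the second detector could be `langdetect.detect`, which raises an exception when it cannot identify a language; the broad `except` above is what turns that into a fallback rather than an error.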
robodev-r2d2 and others added 9 commits October 27, 2025 12:20
…elm + deps updates (stackitcloud#148)

Summary
- Adds an optional Semantic Chunker to the admin-api-lib and centralizes embedding implementations in rag-core-lib (rag-core-api now re-exports).
- Helm chart gains chunker selection + tuning; admin container now preloads NLTK data at startup.
- Dependency updates across admin libs/services; new tests for chunking logic.

Motivation
- Provide more accurate chunk boundaries (semantic-aware) while retaining the existing recursive splitter as the default.
- Deduplicate embedder logic across projects to reduce drift and config duplication.

Key changes
- Admin chunking
  - New `SemanticTextChunker` backed by LangChain’s `SemanticChunker`, with optional min/max enforcement via `RecursiveCharacterTextSplitter`.
  - Trailing undersized chunks are sentence-aware rebalanced (NLTK Punkt with regex fallback) to avoid tiny tails.
  - Configurable via:
    - `CHUNKER_CLASS_TYPE_CHUNKER_TYPE`: `recursive` (default) or `semantic`
    - `CHUNKER_MAX_SIZE` (default `1000`), `CHUNKER_OVERLAP` (default `100`)
    - Semantic-only: `CHUNKER_BREAKPOINT_THRESHOLD_TYPE` (default `percentile`), `CHUNKER_BREAKPOINT_THRESHOLD_AMOUNT` (default `95`), `CHUNKER_BUFFER_SIZE` (default `1`), `CHUNKER_MIN_SIZE` (default `200`)
- DI wiring
  - `DependencyContainer` selects chunker (`recursive` or `semantic`) and, for semantic mode, resolves embeddings via `EmbedderClassTypeSettings`:
    - `stackit` → `StackitEmbedder` (with shared retry settings)
    - `ollama` → `LangchainCommunityEmbedder(OllamaEmbeddings)`
  - Container bootstrapping simplified in `main.py` (internalizes class-type wiring).
- Embeddings centralization
  - New in `rag-core-lib`: `impl/embeddings/*` and embedder settings (`stackit`, `ollama`, `fake`), plus `EmbedderType` and base `Embedder`.
  - `rag-core-api` re-exports these for backward compatibility (no breaking imports).
- Helm / deployment
  - Values (`infrastructure/rag/values.yaml`): new `adminBackend.envs.chunker.*` keys for selection & tuning (chart default `recursive`; overlap default now `100`).
  - Deployment: mounts NLTK data dir and fetches `punkt` + `averaged_perceptron_tagger_eng` at startup; adds `configmap.chunkerName` and `secret.stackitEmbedderName` to env sources.
- Behavior fixes & docs
  - De-duplicate `meta["related"]` in page summaries.
  - Docs: libs README adds “Chunker configuration (multiple chunkers)” and updates DI tables to rag-core-lib classes; admin-backend README adds “Chunking modes”.
- Tests
  - New `semantic_text_chunker_test.py` exercising: supported-kwargs passthrough to LC chunker, empty-input behavior, min/max enforcement + balancing, sentence-aware split.
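
As an illustration of the sentence-aware rebalancing of undersized trailing chunks mentioned under "Admin chunking", here is a minimal sketch. It uses only a regex sentence splitter (the description names NLTK Punkt with a regex fallback), it ignores the max-size constraint, and both the function name `rebalance_tail` and the "most even split" criterion are assumptions, not the actual `SemanticTextChunker` logic.

```python
import re
from typing import List

# Naive sentence boundary: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def rebalance_tail(chunks: List[str], min_size: int) -> List[str]:
    """If the last chunk is shorter than min_size, merge it with the previous
    chunk and re-split the merged text at the most even sentence boundary."""
    if len(chunks) < 2 or len(chunks[-1]) >= min_size:
        return list(chunks)
    merged = chunks[-2] + " " + chunks[-1]
    sentences = [s for s in _SENTENCE_END.split(merged) if s]
    if len(sentences) < 2:
        # No usable boundary: keep the merged text as a single chunk.
        return chunks[:-2] + [merged]

    def imbalance(i: int) -> int:
        head_len = len(" ".join(sentences[:i]))
        tail_len = len(" ".join(sentences[i:]))
        return abs(head_len - tail_len)

    split_at = min(range(1, len(sentences)), key=imbalance)
    head = " ".join(sentences[:split_at])
    tail = " ".join(sentences[split_at:])
    return chunks[:-2] + [head, tail]
```

Splitting only at sentence boundaries is what keeps the rebalanced tail readable; a character-based split would even out sizes more precisely but could cut a sentence in half.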

Configuration / migration
- Default remains `recursive` splitter; to enable semantic chunking:
  1) Set `CHUNKER_CLASS_TYPE_CHUNKER_TYPE=semantic`.
  2) Choose embeddings via `EMBEDDER_CLASS_TYPE_EMBEDDER_TYPE` (`stackit` or `ollama`) and configure:
     - STACKIT: `STACKIT_EMBEDDER_MODEL`, `STACKIT_EMBEDDER_BASE_URL`, `STACKIT_EMBEDDER_API_KEY` (+ optional retry overrides).
     - Ollama: `OLLAMA_EMBEDDER_MODEL`, `OLLAMA_EMBEDDER_BASE_URL`.
  3) Ensure Helm chart has corresponding ConfigMaps/Secrets (`stackitEmbedder`, etc.).
- NLTK data is preloaded on container start; no runtime downloads required.
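
The chunker environment variables and defaults listed in this description can be summarized in a small loader. This is a sketch using only the standard library; `ChunkerConfig` and `load_chunker_config` are hypothetical names, and the project's own settings classes may parse and validate these values differently.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ChunkerConfig:
    """Defaults mirror the values stated in the PR description."""

    chunker_type: str = "recursive"
    max_size: int = 1000
    overlap: int = 100
    # The fields below only matter in semantic mode.
    breakpoint_threshold_type: str = "percentile"
    breakpoint_threshold_amount: float = 95.0
    buffer_size: int = 1
    min_size: int = 200


def load_chunker_config(env: Mapping[str, str]) -> ChunkerConfig:
    """Build a config from environment-style key/value pairs."""
    chunker_type = env.get("CHUNKER_CLASS_TYPE_CHUNKER_TYPE", "recursive").lower()
    if chunker_type not in ("recursive", "semantic"):
        raise ValueError("unsupported chunker type: " + chunker_type)
    return ChunkerConfig(
        chunker_type=chunker_type,
        max_size=int(env.get("CHUNKER_MAX_SIZE", 1000)),
        overlap=int(env.get("CHUNKER_OVERLAP", 100)),
        breakpoint_threshold_type=env.get(
            "CHUNKER_BREAKPOINT_THRESHOLD_TYPE", "percentile"
        ),
        breakpoint_threshold_amount=float(
            env.get("CHUNKER_BREAKPOINT_THRESHOLD_AMOUNT", 95)
        ),
        buffer_size=int(env.get("CHUNKER_BUFFER_SIZE", 1)),
        min_size=int(env.get("CHUNKER_MIN_SIZE", 200)),
    )
```

With an empty environment the loader returns the recursive defaults, which matches the statement that switching to semantic chunking is opt-in.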

Dependencies
- Add: `langchain-experimental`, `nltk` (and transitive `joblib`).
- Bump: `fastapi` (0.118.x), `uvicorn` (0.37.x), `langfuse` (3.6.x), `langchain`/`community`/`core` minor versions, `requests` (2.32.5).
- Test note: ensure LC packages (`langchain_core`, etc.) are present to run unit tests locally.

Risks & mitigations
- Startup time increases slightly due to NLTK data fetch → mitigated via one-time download into an emptyDir.
- Semantic mode depends on external embeddings; ensure credentials/secrets are present before switching default.
- Chunk size tuning may affect vector DB costs; start with defaults and adjust based on retrieval quality.

Docs
- libs/README.md: “2.4 Chunker configuration (multiple chunkers)” and corrected DI references.
- services/admin-backend/README.md: “Chunking modes” and Helm guidance.
…icated documentation for each lib (stackitcloud#151)

This pull request introduces major improvements to documentation,
metadata, and configuration for the three main Python libraries in the
STACKIT RAG template: `admin-api-lib`, `extractor-api-lib`, and
`rag-core-api`. The changes focus on adding comprehensive README files
for each library, updating package metadata in `pyproject.toml` for
clarity and compliance, and refining dependency and configuration
management. These updates make the libraries easier to understand,
install, and extend, and improve maintainability for both operators and
developers.

**Documentation enhancements:**

* Added detailed `README.md` files for `libs/admin-api-lib`,
`libs/extractor-api-lib`, and `libs/rag-core-api`, describing module
responsibilities, features, endpoints, configuration, usage, extension,
and contribution guidelines.
[[1]](diffhunk://#diff-0064014deac3d21031c406697c008f92f0bb2783aa7eaaaf264a2345eea2cc9eR1-R96)
[[2]](diffhunk://#diff-9879d55539dbabcfd9190ec32b1828dfe5874d5e40d32816db8208de3aeeed1aR1-R94)
[[3]](diffhunk://#diff-eb80132f5f4660c40ce8a60f375daec36d19a5e070d120a478f60d74384183d9R1-R96)

**Package metadata and configuration improvements:**

* Updated `pyproject.toml` for all three libraries to include new
version numbers (`v3.2.1`), expanded author and maintainer information,
license, repository, homepage, and readme fields for better package
distribution and compliance.
[[1]](diffhunk://#diff-9c5aeb0db77c2eec077d07ddc3b3810ae1a4a1e50ee7061fba37a46706c513fbL7-R19)
[[2]](diffhunk://#diff-dede389bcfb615c4b45cd1da7ac14cbe9535305f41f19cce09e321c91a8bb323L7-R19)
[[3]](diffhunk://#diff-9c4162cc1c16dd4c7ec5e95e79df285e8c0882a1db7ff2892c746a0537d26c36L7-R19)
* Improved dependency specification in
`libs/extractor-api-lib/pyproject.toml` by switching `fasttext` to a
stable PyPI version and adjusting other package versions.
* Refined pytest and flake8 configuration for consistency and clarity,
such as changing `log_cli` to boolean and updating exclusions.
[[1]](diffhunk://#diff-dede389bcfb615c4b45cd1da7ac14cbe9535305f41f19cce09e321c91a8bb323L139-R148)
[[2]](diffhunk://#diff-9c5aeb0db77c2eec077d07ddc3b3810ae1a4a1e50ee7061fba37a46706c513fbL7-R19)

These changes collectively strengthen the documentation, usability, and
maintainability of the STACKIT RAG template libraries, making them more
accessible for new users and contributors.

---------

Co-authored-by: Copilot <[email protected]>
stackitcloud#152)

This pull request primarily updates version numbers and metadata across
multiple components to align with the latest release (3.2.1) and
standardize package naming and licensing. Additionally, it enhances the
documentation by adding useful badges to the `README.md` for better
project visibility.

**Version and metadata updates:**

* Updated the version to `3.2.1` in `services/frontend/package.json`,
`services/admin-backend/pyproject.toml`,
`services/document-extractor/pyproject.toml`,
`services/mcp-server/pyproject.toml`, and
`services/rag-backend/pyproject.toml` to ensure consistency across all
services.
[[1]](diffhunk://#diff-0d005dbd9d9f66983f95fa01fa375184cf69dac9ae841050c11f07ebcc6789fdL3-R5)
[[2]](diffhunk://#diff-7be99b3586ebefbb9757532b67d9bd826779bfe12db834326790c00f868238e7L55-R55)
[[3]](diffhunk://#diff-bda9860363f25ca7829f0bc0121455b5cfea15f6ecc4e98d168aba411d9653c9L47-R47)
[[4]](diffhunk://#diff-a32cd883126f65652f92c8ecc411d949b7bcf95edb2156c36dc2c1b7063ee690L3-R3)
[[5]](diffhunk://#diff-575f4ba32d7ff340b37eb2f875cb9574553092b79335faadd5f3b6be662b6925L3-R3)
* Changed the license from `MIT` to `Apache-2.0` and added a description
in `services/frontend/package.json` for clearer project identification
and compliance.
* Standardized the package name from `extractor_api_lib` to
`extractor-api-lib` in `libs/extractor-api-lib/pyproject.toml` for
consistency with other packages.

**Documentation improvements:**

* Added a set of badges to the top of `README.md` to display license,
commit activity, issue closure, discussions, PyPI downloads, Kubernetes
readiness, and STACKIT readiness, improving project transparency and
accessibility.