forked from stackitcloud/rag-template
chore: align Confluence parameters spec with CQL support #3
Draft: robodev-r2d2 wants to merge 14 commits into feat/cql-confluence from codex/adjust-confluence-extractor-for-cql-support
… paths in Tiltfile (stackitcloud#143)

Summary: This PR resolves Linux dev environment issues by standardizing the Poetry virtualenv location and ensuring dev dependencies are reliably installed when building with dev=1. It also adjusts the Tiltfile so that Tilt performs clean reloads when changes are triggered.

Changes:
- Standardize Poetry virtualenv to /opt/.venv across services.
- Set POETRY_VIRTUALENVS_CREATE=false and POETRY_VIRTUALENVS_IN_PROJECT=false to reuse the prebuilt venv.
- Export VIRTUAL_ENV and prepend /opt/.venv/bin to PATH in both build and runtime stages, including for nonroot.
- Add cache-busting tied to the dev build arg to force correct installation of dev dependencies.
- Clean up redundant PATH exports and ensure /etc/environment reflects the unified venv path.
- Adjust Tiltfile sync and ignore rules during image build.

Scope:
- services/admin-backend/Dockerfile
- services/document-extractor/Dockerfile
- services/mcp-server/Dockerfile
- services/rag-backend/Dockerfile

Fixes: stackitcloud#142

Co-authored-by: Andreas Klos
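The environment contract described in this commit can be sanity-checked from inside a container. The following is a minimal sketch, not part of the PR: the function name `check_venv_env` is hypothetical, and it only inspects the variables named in the commit message (`VIRTUAL_ENV`, `PATH`, and the two Poetry flags).

```python
EXPECTED_VENV = "/opt/.venv"  # unified venv location named in the commit message


def check_venv_env(env):
    """Return a list of problems; an empty list means the env matches the commit's description.

    `env` is any mapping of environment variables, e.g. os.environ.
    This helper is a hypothetical illustration, not code from the repository.
    """
    problems = []

    if env.get("VIRTUAL_ENV") != EXPECTED_VENV:
        problems.append(f"VIRTUAL_ENV should be {EXPECTED_VENV}")

    # The commit prepends the venv's bin directory, so it must be the first PATH entry.
    # Containers are Linux, so ':' is used as the separator.
    first_entry = env.get("PATH", "").split(":")[0]
    if first_entry != f"{EXPECTED_VENV}/bin":
        problems.append(f"PATH should start with {EXPECTED_VENV}/bin")

    # Both flags are set to false so Poetry reuses the prebuilt venv.
    for flag in ("POETRY_VIRTUALENVS_CREATE", "POETRY_VIRTUALENVS_IN_PROJECT"):
        if env.get(flag, "").lower() != "false":
            problems.append(f"{flag} should be false")

    return problems
```

Calling `check_venv_env(os.environ)` in a running service container would list any variable that deviates from the layout above.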
Add Copilot instructions file for customized and better Copilot generation.

Co-authored-by: Andreas Klos
Add codex instructions file for customized and better codex generation.
Adjust the rephrasing chain prompt, increase fault tolerance, and adjust the chat graph connections so that the nodes are executed sequentially. Adjust the determine-language node in the answer graph: it is now LLM-based, falls back to langdetect if the LLM fails, and defaults to 'en' if langdetect fails as well.
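The fallback order described for the language node (LLM first, then langdetect, then 'en') can be sketched as a simple chain of detectors. This is an illustrative sketch only: `determine_language`, `llm_detect`, and `library_detect` are hypothetical names, and the detectors are passed in as plain callables rather than wired to the project's actual graph node or to langdetect itself.

```python
from typing import Callable, Optional

Detector = Callable[[str], Optional[str]]


def determine_language(
    text: str,
    llm_detect: Detector,
    library_detect: Detector,
    default: str = "en",
) -> str:
    """Try each detector in order and return the first usable language code.

    A detector counts as failed if it raises or returns an empty result;
    in that case the next one is tried, and `default` is the last resort.
    """
    for detector in (llm_detect, library_detect):
        try:
            code = detector(text)
        except Exception:
            # Any detector failure moves on to the next fallback.
            continue
        if code:
            return code.strip().lower()
    return default
```

In a real setup, `library_detect` could wrap `langdetect.detect`, which raises an exception when it cannot identify the language, so the `except` branch would route that case to the default.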
…elm + deps updates (stackitcloud#148)

Summary
- Adds an optional Semantic Chunker to the admin-api-lib and centralizes embedding implementations in rag-core-lib (rag-core-api now re-exports).
- Helm chart gains chunker selection + tuning; admin container now preloads NLTK data at startup.
- Dependency updates across admin libs/services; new tests for chunking logic.

Motivation
- Provide more accurate chunk boundaries (semantic-aware) while retaining the existing recursive splitter as the default.
- Deduplicate/embedder logic across projects to reduce drift and config duplication.

Key changes
- Admin chunking
  - New `SemanticTextChunker` backed by LangChain’s `SemanticChunker`, with optional min/max enforcement via `RecursiveCharacterTextSplitter`.
  - Trailing undersized chunks are sentence-aware rebalanced (NLTK Punkt with regex fallback) to avoid tiny tails.
  - Configurable via:
    - `CHUNKER_CLASS_TYPE_CHUNKER_TYPE`: `recursive` (default) or `semantic`
    - `CHUNKER_MAX_SIZE` (default `1000`), `CHUNKER_OVERLAP` (default `100`)
    - Semantic-only: `CHUNKER_BREAKPOINT_THRESHOLD_TYPE` (default `percentile`), `CHUNKER_BREAKPOINT_THRESHOLD_AMOUNT` (default `95`), `CHUNKER_BUFFER_SIZE` (default `1`), `CHUNKER_MIN_SIZE` (default `200`)
- DI wiring
  - `DependencyContainer` selects chunker (`recursive` or `semantic`) and, for semantic mode, resolves embeddings via `EmbedderClassTypeSettings`:
    - `stackit` → `StackitEmbedder` (with shared retry settings)
    - `ollama` → `LangchainCommunityEmbedder(OllamaEmbeddings)`
  - Container bootstrapping simplified in `main.py` (internalizes class-type wiring).
- Embeddings centralization
  - New in `rag-core-lib`: `impl/embeddings/*` and embedder settings (`stackit`, `ollama`, `fake`), plus `EmbedderType` and base `Embedder`.
  - `rag-core-api` re-exports these for backward compatibility (no breaking imports).
- Helm / deployment
  - Values (`infrastructure/rag/values.yaml`): new `adminBackend.envs.chunker.*` keys for selection & tuning (chart default `recursive`; overlap default now `100`).
  - Deployment: mounts NLTK data dir and fetches `punkt` + `averaged_perceptron_tagger_eng` at startup; adds `configmap.chunkerName` and `secret.stackitEmbedderName` to env sources.
- Behavior fixes & docs
  - De-duplicate `meta["related"]` in page summaries.
  - Docs: libs README adds “Chunker configuration (multiple chunkers)” and updates DI tables to rag-core-lib classes; admin-backend README adds “Chunking modes”.
- Tests
  - New `semantic_text_chunker_test.py` exercising: supported-kwargs passthrough to LC chunker, empty-input behavior, min/max enforcement + balancing, sentence-aware split.

Configuration / migration
- Default remains `recursive` splitter; to enable semantic chunking:
  1) Set `CHUNKER_CLASS_TYPE_CHUNKER_TYPE=semantic`.
  2) Choose embeddings via `EMBEDDER_CLASS_TYPE_EMBEDDER_TYPE` (`stackit` or `ollama`) and configure:
     - STACKIT: `STACKIT_EMBEDDER_MODEL`, `STACKIT_EMBEDDER_BASE_URL`, `STACKIT_EMBEDDER_API_KEY` (+ optional retry overrides).
     - Ollama: `OLLAMA_EMBEDDER_MODEL`, `OLLAMA_EMBEDDER_BASE_URL`.
  3) Ensure Helm chart has corresponding ConfigMaps/Secrets (`stackitEmbedder`, etc.).
- NLTK data is preloaded on container start; no runtime downloads required.

Dependencies
- Add: `langchain-experimental`, `nltk` (and transitive `joblib`).
- Bump: `fastapi` (0.118.x), `uvicorn` (0.37.x), `langfuse` (3.6.x), `langchain`/`community`/`core` minor versions, `requests` (2.32.5).
- Test note: ensure LC packages (`langchain_core`, etc.) are present to run unit tests locally.

Risks & mitigations
- Startup time increases slightly due to NLTK data fetch → mitigated via one-time download into an emptyDir.
- Semantic mode depends on external embeddings; ensure credentials/secrets are present before switching default.
- Chunk size tuning may affect vector DB costs; start with defaults and adjust based on retrieval quality.

Docs
- libs/README.md: “2.4 Chunker configuration (multiple chunkers)” and corrected DI references.
- services/admin-backend/README.md: “Chunking modes” and Helm guidance.
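The chunker environment variables and defaults listed in this commit can be illustrated with a small loader. This is a standard-library sketch under stated assumptions, not the project's settings code: `ChunkerSettings` and `load_chunker_settings` are hypothetical names, and the check that the overlap is smaller than the maximum size is an added assumption rather than something the commit message states.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkerSettings:
    """Defaults mirror the values quoted in the commit message."""

    chunker_type: str = "recursive"
    max_size: int = 1000
    overlap: int = 100
    # The fields below are only relevant in semantic mode.
    breakpoint_threshold_type: str = "percentile"
    breakpoint_threshold_amount: float = 95.0
    buffer_size: int = 1
    min_size: int = 200


def load_chunker_settings(env=None):
    """Build settings from a mapping of environment variables (defaults to os.environ)."""
    env = os.environ if env is None else env

    chunker_type = env.get("CHUNKER_CLASS_TYPE_CHUNKER_TYPE", "recursive").strip().lower()
    if chunker_type not in ("recursive", "semantic"):
        raise ValueError(f"unsupported chunker type: {chunker_type!r}")

    settings = ChunkerSettings(
        chunker_type=chunker_type,
        max_size=int(env.get("CHUNKER_MAX_SIZE", 1000)),
        overlap=int(env.get("CHUNKER_OVERLAP", 100)),
        breakpoint_threshold_type=env.get("CHUNKER_BREAKPOINT_THRESHOLD_TYPE", "percentile"),
        breakpoint_threshold_amount=float(env.get("CHUNKER_BREAKPOINT_THRESHOLD_AMOUNT", 95)),
        buffer_size=int(env.get("CHUNKER_BUFFER_SIZE", 1)),
        min_size=int(env.get("CHUNKER_MIN_SIZE", 200)),
    )

    # Assumption for this sketch: an overlap at or above the chunk size is a misconfiguration.
    if settings.overlap >= settings.max_size:
        raise ValueError("CHUNKER_OVERLAP must be smaller than CHUNKER_MAX_SIZE")
    return settings
```

With no variables set, `load_chunker_settings({})` yields the recursive splitter with the defaults above; setting `CHUNKER_CLASS_TYPE_CHUNKER_TYPE=semantic` switches the mode while the remaining values keep their defaults.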
…icated documentation for each lib (stackitcloud#151)

This pull request introduces major improvements to documentation, metadata, and configuration for the three main Python libraries in the STACKIT RAG template: `admin-api-lib`, `extractor-api-lib`, and `rag-core-api`. The changes focus on adding comprehensive README files for each library, updating package metadata in `pyproject.toml` for clarity and compliance, and refining dependency and configuration management. These updates make the libraries easier to understand, install, and extend, and improve maintainability for both operators and developers.

**Documentation enhancements:**

* Added detailed `README.md` files for `libs/admin-api-lib`, `libs/extractor-api-lib`, and `libs/rag-core-api`, describing module responsibilities, features, endpoints, configuration, usage, extension, and contribution guidelines.

**Package metadata and configuration improvements:**

* Updated `pyproject.toml` for all three libraries to include new version numbers (`v3.2.1`), expanded author and maintainer information, license, repository, homepage, and readme fields for better package distribution and compliance.
* Improved dependency specification in `libs/extractor-api-lib/pyproject.toml` by switching `fasttext` to a stable PyPI version and adjusting other package versions.
* Refined pytest and flake8 configuration for consistency and clarity, such as changing `log_cli` to boolean and updating exclusions.

These changes collectively strengthen the documentation, usability, and maintainability of the STACKIT RAG template libraries, making them more accessible for new users and contributors.

Co-authored-by: Copilot
stackitcloud#152)

This pull request primarily updates version numbers and metadata across multiple components to align with the latest release (3.2.1) and standardize package naming and licensing. Additionally, it enhances the documentation by adding useful badges to the `README.md` for better project visibility.

**Version and metadata updates:**

* Updated the version to `3.2.1` in `services/frontend/package.json`, `services/admin-backend/pyproject.toml`, `services/document-extractor/pyproject.toml`, `services/mcp-server/pyproject.toml`, and `services/rag-backend/pyproject.toml` to ensure consistency across all services.
* Changed the license from `MIT` to `Apache-2.0` and added a description in `services/frontend/package.json` for clearer project identification and compliance.
* Standardized the package name from `extractor_api_lib` to `extractor-api-lib` in `libs/extractor-api-lib/pyproject.toml` for consistency with other packages.

**Documentation improvements:**

* Added a set of badges to the top of `README.md` to display license, commit activity, issue closure, discussions, PyPI downloads, Kubernetes readiness, and STACKIT readiness, improving project transparency and accessibility.
…eholder in Confluence extraction
…enhance test assertion message
Summary
Testing
https://chatgpt.com/codex/tasks/task_e_68f3a27830648326835fe507c2685ad7