This project provides a toolkit to extract data from your Confluence wiki, process it, store it in a ChromaDB vector database, and use it to answer questions with Large Language Models (LLMs). It enables you to build a Retrieval-Augmented Generation (RAG) system based on your Confluence knowledge base.
## Features

- Confluence Data Extraction: Fetches pages from specified Confluence spaces using the API.
- Content Processing: Cleans HTML, converts it to Markdown, and preserves page metadata.
- Vector Database Integration: Chunks documents, generates embeddings (using sentence-transformers), and stores them in a persistent ChromaDB database (`chroma_db/`).
- LLM Interaction Example: Includes an example script (`vector_search_example.py`) demonstrating how to query the ChromaDB index and use the retrieved context with an LLM to answer questions.
- Flexible Output: Saves extracted data as individual Markdown files (`confluence_data/pages/`) and a consolidated JSON file (`confluence_data/confluence_pages.json`).
## Prerequisites

- Python 3.7+
- Access to a Confluence instance (username/email and API token)
- Dependencies listed in `requirements.txt`:
  - `requests`
  - `beautifulsoup4`
  - `html2text`
  - `python-dotenv`
  - `langchain`
  - `chromadb`
  - `pyyaml`
  - `sentence-transformers`
  - `numpy`
## Installation

- Clone the repository:

  ```bash
  git clone <your-repository-url>
  cd <repository-directory>
  ```

- Create and activate a virtual environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file by copying `.env.example`:

  ```bash
  cp .env.example .env
  ```

- Edit the `.env` file with your Confluence details (the scripts read these settings at startup; see the python-dotenv sketch after this list):

  ```
  CONFLUENCE_BASE_URL=https://your-instance.atlassian.net/wiki
  CONFLUENCE_USERNAME=your-email@example.com
  CONFLUENCE_API_TOKEN=your-api-token

  # Optional: Configure output directory for raw data
  OUTPUT_DIR=confluence_data

  # Optional: Configure ChromaDB path (defaults to ./chroma_db)
  # CHROMA_DB_PATH=./chroma_db

  # Optional: Configure embedding model
  # EMBEDDING_MODEL=all-MiniLM-L6-v2

  # Optional: OpenAI API Key if using OpenAI models in examples
  # OPENAI_API_KEY=your-openai-api-key
  ```

- Get a Confluence API token:
  - Log in to https://id.atlassian.com/manage-profile/security/api-tokens
  - Click "Create API token"
  - Give it a label (e.g., "LLM Extractor") and copy the token.
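Both scripts can read these settings at startup via python-dotenv, which is already in `requirements.txt`. A minimal sketch of that loading, assuming the variable names above (the project's actual code may structure this differently):

```python
# Illustrative sketch of reading the .env settings with python-dotenv;
# not necessarily how the project's scripts are written.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from ./.env into the process environment

base_url = os.environ["CONFLUENCE_BASE_URL"]    # required
username = os.environ["CONFLUENCE_USERNAME"]    # required
api_token = os.environ["CONFLUENCE_API_TOKEN"]  # required
output_dir = os.getenv("OUTPUT_DIR", "confluence_data")             # optional
chroma_path = os.getenv("CHROMA_DB_PATH", "./chroma_db")            # optional
embedding_model = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")  # optional
```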
## Usage

- Extract Confluence data: Run the extractor script. It fetches pages, processes them, and saves them to the `OUTPUT_DIR` (default: `confluence_data`).

  ```bash
  python confluence_extractor.py [options]
  ```

  Options:
  - `--spaces KEY1 KEY2`: Specify space keys to extract (default: all accessible spaces).
  - `--output-dir`: Override the output directory set in `.env`.
  - See `python confluence_extractor.py --help` for all options.
- Build the vector index and query with an LLM: Run the vector search example script. This script will:
  - Load the processed Markdown files from `confluence_data/pages/`.
  - Chunk the documents.
  - Generate embeddings using the specified sentence-transformer model.
  - Create or load a persistent ChromaDB index in `chroma_db/`.
  - Take your query and search the index for the most relevant document chunks.
  - (If an LLM is configured) Send the query and retrieved context to an LLM to generate an answer. A sketch of this retrieve-and-generate flow follows this list.

  ```bash
  python vector_search_example.py --query "Your question about Confluence content?"
  ```

  Options:
  - `--query`: The question you want to ask.
  - `--embedding-model`: Override the embedding model (default: `all-MiniLM-L6-v2`).
  - `--chroma-db-path`: Specify the path to the ChromaDB database directory.
  - `--top-k`: Number of relevant chunks to retrieve (default: 5).
  - See `python vector_search_example.py --help` for additional options, including LLM choice.
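To make the retrieval and answer-generation steps concrete, here is a minimal sketch of the flow, assuming ChromaDB's sentence-transformer embedding function and the `openai` package (which is not in `requirements.txt` and requires `OPENAI_API_KEY`); the collection name, model name, and prompt are illustrative, not taken from `vector_search_example.py`:

```python
# Illustrative sketch of query -> retrieve -> generate; the example script's
# actual implementation may differ.
import chromadb
from chromadb.utils import embedding_functions
from openai import OpenAI  # assumption: `pip install openai`, OPENAI_API_KEY set

embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    "confluence_pages",  # hypothetical collection name
    embedding_function=embed_fn,
)

query = "Your question about Confluence content?"

# Retrieval: embed the query and fetch the top-k most similar chunks
results = collection.query(query_texts=[query], n_results=5)
context = "\n\n".join(results["documents"][0])

# Augmentation & generation: hand the query plus retrieved context to the LLM
llm = OpenAI()
completion = llm.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
)
print(completion.choices[0].message.content)
```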
## How It Works

- Extraction: `confluence_extractor.py` fetches and preprocesses data.
- Loading: `vector_search_example.py` loads the Markdown files.
- Chunking: Documents are split into smaller, manageable chunks.
- Embedding: Each chunk is converted into a numerical vector representation (an embedding) using a sentence-transformer model.
- Indexing: Embeddings and their corresponding text chunks are stored in a ChromaDB vector database for efficient similarity search.
- Retrieval: When you provide a query, it is embedded, and ChromaDB finds the most similar chunks in the index.
- Augmentation & Generation: The original query and the retrieved chunks (the context) are passed to an LLM, which generates an answer based on the provided information. The sketch after this list shows the chunk-embed-index side of the pipeline.
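The chunking, embedding, and indexing steps can be sketched as follows; the chunk sizes, collection name, and ID scheme are illustrative assumptions, not the project's actual code:

```python
# Illustrative sketch of building the ChromaDB index from extracted Markdown.
from pathlib import Path

import chromadb
from chromadb.utils import embedding_functions
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunking: split each Markdown page into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Embedding: ChromaDB runs the sentence-transformer over each chunk on add()
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    "confluence_pages",  # hypothetical collection name
    embedding_function=embed_fn,
)

# Indexing: store each chunk with a stable ID and its source file as metadata
for path in Path("confluence_data/pages").glob("*.md"):
    chunks = splitter.split_text(path.read_text(encoding="utf-8"))
    if not chunks:
        continue
    collection.add(
        documents=chunks,
        ids=[f"{path.stem}-{i}" for i in range(len(chunks))],
        metadatas=[{"source": path.name}] * len(chunks),
    )
```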
## Troubleshooting

- Rate Limiting: The Confluence API may enforce rate limits. The extractor does not currently handle this explicitly; consider adding delays or retries if needed (a sketch of one approach follows this list).
- Authentication: Double-check your `.env` file for the correct Confluence URL, username, and a valid API token. Ensure the token has read permissions.
- ChromaDB Errors: Ensure `chroma_db/` is writable. Check the ChromaDB documentation for specific issues.
- Dependencies: Make sure all dependencies in `requirements.txt` are installed correctly within your virtual environment.
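For the rate-limiting note above, a minimal retry-with-backoff wrapper one could add around the extractor's API calls might look like this; `get_with_backoff` is a hypothetical helper, not part of the project:

```python
# Hypothetical helper: retry GETs that hit Confluence's rate limit (HTTP 429).
import time
import requests

def get_with_backoff(
    session: requests.Session, url: str, max_retries: int = 5, **kwargs
) -> requests.Response:
    for attempt in range(max_retries):
        response = session.get(url, **kwargs)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Honor Retry-After when the server sends it; otherwise back off exponentially
        delay = float(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    response.raise_for_status()  # give up after max_retries
    return response
```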
## License

MIT