This project provides a toolkit to extract data from your Confluence wiki, process it, store it in a ChromaDB vector database, and use it to answer questions with Large Language Models (LLMs). It enables you to build a Retrieval-Augmented Generation (RAG) system based on your Confluence knowledge base.
## Features

- Confluence Data Extraction: Fetches pages from specified Confluence spaces using the API.
- Content Processing: Cleans HTML, converts it to Markdown, and preserves page metadata.
- Vector Database Integration: Chunks documents, generates embeddings (using sentence-transformers), and stores them in a persistent ChromaDB database (`chroma_db/`).
- LLM Interaction Example: Includes an example script (`vector_search_example.py`) demonstrating how to query the ChromaDB index and use the retrieved context with an LLM to answer questions.
- Flexible Output: Saves extracted data as individual Markdown files (`confluence_data/pages/`) and a consolidated JSON file (`confluence_data/confluence_pages.json`).
## Prerequisites

- Python 3.7+
- Access to a Confluence instance (username/email and API token)
- Dependencies listed in `requirements.txt`:
  - `requests`
  - `beautifulsoup4`
  - `html2text`
  - `python-dotenv`
  - `langchain`
  - `chromadb`
  - `pyyaml`
  - `sentence-transformers`
  - `numpy`
## Installation

- Clone the repository:

  ```bash
  git clone <your-repository-url>
  cd <repository-directory>
  ```

- Create and activate a virtual environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file by copying `.env.example`:

  ```bash
  cp .env.example .env
  ```

- Edit the `.env` file with your Confluence details (the scripts read these settings at startup; see the python-dotenv sketch after this list):

  ```
  CONFLUENCE_BASE_URL=https://your-instance.atlassian.net/wiki
  CONFLUENCE_USERNAME=your-email@example.com
  CONFLUENCE_API_TOKEN=your-api-token

  # Optional: Configure output directory for raw data
  OUTPUT_DIR=confluence_data

  # Optional: Configure ChromaDB path (defaults to ./chroma_db)
  # CHROMA_DB_PATH=./chroma_db

  # Optional: Configure embedding model
  # EMBEDDING_MODEL=all-MiniLM-L6-v2

  # Optional: OpenAI API Key if using OpenAI models in examples
  # OPENAI_API_KEY=your-openai-api-key
  ```

- Get a Confluence API token:
  - Log in to https://id.atlassian.com/manage-profile/security/api-tokens
  - Click "Create API token"
  - Give it a label (e.g., "LLM Extractor") and copy the token.
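Both scripts can read these settings at startup via python-dotenv, which is already in `requirements.txt`. A minimal sketch of that loading, assuming the variable names above (the project's actual code may structure this differently):

```python
# Illustrative sketch of reading the .env settings with python-dotenv;
# not necessarily how the project's scripts are written.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from ./.env into the process environment

base_url = os.environ["CONFLUENCE_BASE_URL"]    # required
username = os.environ["CONFLUENCE_USERNAME"]    # required
api_token = os.environ["CONFLUENCE_API_TOKEN"]  # required
output_dir = os.getenv("OUTPUT_DIR", "confluence_data")             # optional
chroma_path = os.getenv("CHROMA_DB_PATH", "./chroma_db")            # optional
embedding_model = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")  # optional
```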
## Usage

- Extract Confluence data: Run the extractor script. It fetches pages, processes them, and saves them to the `OUTPUT_DIR` (default: `confluence_data`).

  ```bash
  python confluence_extractor.py [options]
  ```

  Options:
  - `--spaces KEY1 KEY2`: Specify space keys to extract (default: all accessible spaces).
  - `--output-dir`: Override the output directory set in `.env`.
  - See `python confluence_extractor.py --help` for all options.
- Build the vector index and query with an LLM: Run the vector search example script. This script will:
  - Load the processed Markdown files from `confluence_data/pages/`.
  - Chunk the documents.
  - Generate embeddings using the specified sentence-transformer model.
  - Create or load a persistent ChromaDB index in `chroma_db/`.
  - Take your query and search the index for the most relevant document chunks.
  - (If an LLM is configured) Send the query and retrieved context to an LLM to generate an answer. A sketch of this retrieve-and-generate flow follows this list.

  ```bash
  python vector_search_example.py --query "Your question about Confluence content?"
  ```

  Options:
  - `--query`: The question you want to ask.
  - `--embedding-model`: Override the embedding model (default: `all-MiniLM-L6-v2`).
  - `--chroma-db-path`: Specify the path to the ChromaDB database directory.
  - `--top-k`: Number of relevant chunks to retrieve (default: 5).
  - See `python vector_search_example.py --help` for additional options, including LLM choice.
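To make the retrieval and answer-generation steps concrete, here is a minimal sketch of the flow, assuming ChromaDB's sentence-transformer embedding function and the `openai` package (which is not in `requirements.txt` and requires `OPENAI_API_KEY`); the collection name, model name, and prompt are illustrative, not taken from `vector_search_example.py`:

```python
# Illustrative sketch of query -> retrieve -> generate; the example script's
# actual implementation may differ.
import chromadb
from chromadb.utils import embedding_functions
from openai import OpenAI  # assumption: `pip install openai`, OPENAI_API_KEY set

embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    "confluence_pages",  # hypothetical collection name
    embedding_function=embed_fn,
)

query = "Your question about Confluence content?"

# Retrieval: embed the query and fetch the top-k most similar chunks
results = collection.query(query_texts=[query], n_results=5)
context = "\n\n".join(results["documents"][0])

# Augmentation & generation: hand the query plus retrieved context to the LLM
llm = OpenAI()
completion = llm.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
)
print(completion.choices[0].message.content)
```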
## How It Works

- Extraction: `confluence_extractor.py` fetches and preprocesses data.
- Loading: `vector_search_example.py` loads the Markdown files.
- Chunking: Documents are split into smaller, manageable chunks.
- Embedding: Each chunk is converted into a numerical vector representation (an embedding) using a sentence-transformer model.
- Indexing: Embeddings and their corresponding text chunks are stored in a ChromaDB vector database for efficient similarity search.
- Retrieval: When you provide a query, it is embedded, and ChromaDB finds the most similar chunks in the index.
- Augmentation & Generation: The original query and the retrieved chunks (the context) are passed to an LLM, which generates an answer based on the provided information. The sketch after this list shows the chunk-embed-index side of the pipeline.
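The chunking, embedding, and indexing steps can be sketched as follows; the chunk sizes, collection name, and ID scheme are illustrative assumptions, not the project's actual code:

```python
# Illustrative sketch of building the ChromaDB index from extracted Markdown.
from pathlib import Path

import chromadb
from chromadb.utils import embedding_functions
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunking: split each Markdown page into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Embedding: ChromaDB runs the sentence-transformer over each chunk on add()
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    "confluence_pages",  # hypothetical collection name
    embedding_function=embed_fn,
)

# Indexing: store each chunk with a stable ID and its source file as metadata
for path in Path("confluence_data/pages").glob("*.md"):
    chunks = splitter.split_text(path.read_text(encoding="utf-8"))
    if not chunks:
        continue
    collection.add(
        documents=chunks,
        ids=[f"{path.stem}-{i}" for i in range(len(chunks))],
        metadatas=[{"source": path.name}] * len(chunks),
    )
```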
## Troubleshooting

- Rate Limiting: The Confluence API may enforce rate limits. The extractor does not currently handle this explicitly; consider adding delays or retries if needed (a sketch of one approach follows this list).
- Authentication: Double-check your `.env` file for the correct Confluence URL, username, and a valid API token. Ensure the token has read permissions.
- ChromaDB Errors: Ensure `chroma_db/` is writable. Check the ChromaDB documentation for specific issues.
- Dependencies: Make sure all dependencies in `requirements.txt` are installed correctly within your virtual environment.
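For the rate-limiting note above, a minimal retry-with-backoff wrapper one could add around the extractor's API calls might look like this; `get_with_backoff` is a hypothetical helper, not part of the project:

```python
# Hypothetical helper: retry GETs that hit Confluence's rate limit (HTTP 429).
import time
import requests

def get_with_backoff(
    session: requests.Session, url: str, max_retries: int = 5, **kwargs
) -> requests.Response:
    for attempt in range(max_retries):
        response = session.get(url, **kwargs)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Honor Retry-After when the server sends it; otherwise back off exponentially
        delay = float(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    response.raise_for_status()  # give up after max_retries
    return response
```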
## License

MIT