
doc-insight

Doc Insight is a high-performance Retrieval-Augmented Generation (RAG) system that allows users to upload documents (including scanned PDFs), process them using OCR, and engage in context-aware conversations based on the extracted content.

Features

  • FastAPI Backend: Fully asynchronous API for high-concurrency document processing.
  • Advanced OCR: Support for both text-based and scanned PDFs using PyMuPDF and Tesseract.
  • Vector Search: Semantic search powered by PostgreSQL and pgvector with 1536-dimensional OpenAI embeddings.
  • Smart Chunking: Sentence-based text splitting using NLTK to preserve context for retrieval.
  • Background Processing: Non-blocking document chunking and embedding generation via FastAPI background tasks.
  • RAG Pipeline: Context-aware responses using GPT-4o augmented with relevant document chunks.
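
The sentence-based chunking step can be sketched as follows. This is a minimal illustration, not the project's actual implementation: doc-insight uses NLTK's sentence tokenizer, for which a simple regex split stands in here to keep the sketch dependency-free, and the `chunk_text` name and `max_chars` parameter are assumptions.

```python
import re

def chunk_text(text, max_chars=500):
    # Split into sentences (the real pipeline uses NLTK's sent_tokenize;
    # a regex on sentence-ending punctuation approximates it here).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding the sentence would exceed the budget,
        # so no sentence is ever cut in half and retrieval context is preserved.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("First sentence. Second sentence. Third one here.", max_chars=20))
# → ['First sentence.', 'Second sentence.', 'Third one here.']
```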

Architecture

(System architecture diagram)

Tech Stack

  • Language: Python 3.12+
  • Environment Manager: uv
  • API Framework: FastAPI
  • Database: PostgreSQL + pgvector
  • LLM & Embeddings: OpenAI API (GPT-4o & text-embedding-3-small)
  • Document Parsing: PyMuPDF (fitz) & Tesseract OCR
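
To make the vector-storage piece concrete, the pgvector side could look like the following sketch. The table and column names are hypothetical (the project's actual schema may differ); `vector(1536)` matches the 1536-dimensional text-embedding-3-small embeddings, and `<=>` is pgvector's cosine-distance operator.

```python
# Hypothetical sketch of the pgvector-backed chunk store; the actual table
# and column names in doc-insight may differ.
CHUNKS_TABLE_DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS document_chunks (
    id        SERIAL PRIMARY KEY,
    file_id   INTEGER NOT NULL,
    content   TEXT NOT NULL,
    embedding vector(1536)  -- matches OpenAI's text-embedding-3-small
);
"""

# Nearest-neighbour search via pgvector's cosine-distance operator (<=>),
# as it would run in a parameterized query (e.g. with psycopg).
SEARCH_SQL = """
SELECT content
FROM document_chunks
ORDER BY embedding <=> %(query_embedding)s
LIMIT %(k)s;
"""
```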

Setup & Installation

Prerequisites

  • Docker & Docker Compose (for the Quick Start)
  • Python 3.12+ and uv (for local development)
  • An OpenAI API key

Quick Start (Docker)

The fastest way to get up and running:

# 1. Clone the repository
git clone https://github.com/sheygs/doc-insight.git
cd doc-insight

# 2. Create environment file
cp .env.dev .env

# 3. Edit .env with your values
# APP_ENV=development
# APP_PORT=8000
# OPENAI_API_KEY=your_openai_api_key
# POSTGRES_USER=docinsight
# POSTGRES_PASSWORD=docinsight
# POSTGRES_DB=doc_insight_db
# ...

# 4. Start everything
docker compose up -d

# 5. Check logs
docker compose logs -f app

The API is now running at http://localhost:8000 (or your configured APP_PORT).


Local Development Setup

For development with hot-reload:

Step 1: Clone and Install Dependencies

git clone https://github.com/sheygs/doc-insight.git
cd doc-insight

# Install dependencies
uv sync

# Install dev dependencies
uv sync --dev

Step 2: Configure Environment

# Copy example env file
cp .env.dev .env

Edit .env with your values:

APP_ENV=development
APP_PORT=8000
OPENAI_API_KEY=your_openai_api_key
POSTGRES_USER=docinsight
POSTGRES_PASSWORD=docinsight
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=doc_insight_db
...

Step 3: Start the Database

# Start PostgreSQL with pgvector
docker compose up postgres -d

# Verify it's running
docker compose ps

Step 4: Run Database Migrations (Optional)

# Generate initial migration
uv run alembic revision --autogenerate -m "Initial schema"

# Apply migrations
uv run alembic upgrade head

Step 5: Start the Application

# Using the server entry point
uv run python -m app.server

# Or directly with uvicorn
uv run uvicorn app.app:app --reload

Step 6: Verify Installation

# Root endpoint
curl http://localhost:8000/

# Health check with database status
curl http://localhost:8000/health

View the interactive Swagger documentation at http://localhost:8000/docs


Running Tests

# Make sure database is running
docker compose up postgres -d

# Run all tests
uv run pytest tests/ -v

# Run with coverage
uv run pytest tests/ --cov=app

System Architecture

  1. Ingestion: Files are uploaded and validated.
  2. Extraction: Text is extracted via standard parsing, with OCR as a fallback for scanned or image-based pages.
  3. Vectorization: Text chunks are converted into embeddings using OpenAI's text-embedding-3-small model.
  4. Storage: Embeddings and metadata are stored in a pgvector-enabled PostgreSQL table.
  5. Retrieval: User queries are embedded and compared against the stored chunks using cosine similarity.
  6. Augmentation: The top-k relevant chunks are fed into GPT-4o to generate a grounded response.
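
Steps 5 and 6 can be sketched in plain Python. This is a dependency-free illustration of cosine-similarity ranking, not doc-insight's actual code: function and variable names are assumptions, and in the real system the comparison runs inside PostgreSQL via pgvector.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k_chunks(query_vec, stored, k=3):
    # stored: (chunk_text, embedding) pairs, as they would come from the
    # pgvector table. Rank by similarity to the embedded query, keep top-k.
    ranked = sorted(stored, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy 2-dimensional vectors stand in for the real 1536-dimensional embeddings.
store = [
    ("about cats", [1.0, 0.0]),
    ("about dogs", [0.0, 1.0]),
    ("mixed topics", [0.7, 0.7]),
]
context = "\n".join(top_k_chunks([1.0, 0.1], store, k=2))

# Step 6: the retrieved chunks ground the LLM prompt.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
```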

Database Schema

(Database schema diagram)

API Overview

Method   Endpoint                     Description
GET      /                            Root endpoint
GET      /health                      Health check with database status
GET      /metrics                     Prometheus metrics
POST     /api/v1/files                Upload a PDF or text document
POST     /api/v1/ask                  Query the assistant with a question
GET      /api/v1/files                List all processed documents with file IDs
GET      /api/v1/files/{id}/chunks    Retrieve similar chunks for a document
DELETE   /api/v1/files/{id}           Delete a document by ID

API Testing

You can test the API endpoints using the provided script:

# Ensure the server is running, then run the tests
bash ./scripts/api_tests.sh

# Or make the script executable first
chmod +x scripts/api_tests.sh
./scripts/api_tests.sh

Contributing

Please feel free to submit a Pull Request.
