
doc-insight

Doc Insight is a high-performance Retrieval-Augmented Generation (RAG) system that allows users to upload documents (including scanned PDFs), process them using OCR, and engage in context-aware conversations based on the extracted content.

Features

  • FastAPI Backend: Fully asynchronous API for high-concurrency document processing.
  • Advanced OCR: Support for both text-based and scanned PDFs using PyMuPDF and Tesseract.
  • Vector Search: Semantic search powered by PostgreSQL and pgvector with 1536-dimensional OpenAI embeddings.
  • Smart Chunking: Sentence-based text splitting using NLTK to preserve context for retrieval.
  • Background Processing: Non-blocking document chunking and embedding generation via FastAPI background tasks.
  • RAG Pipeline: Context-aware responses using GPT-4o augmented with relevant document chunks.
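
The sentence-based chunking step can be sketched as follows. This is a minimal illustration, not the project's actual implementation: doc-insight uses NLTK's sentence tokenizer, for which a simple regex split stands in here to keep the sketch dependency-free, and the `chunk_text` name and `max_chars` parameter are assumptions.

```python
import re

def chunk_text(text, max_chars=500):
    # Split into sentences (the real pipeline uses NLTK's sent_tokenize;
    # a regex on sentence-ending punctuation approximates it here).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding the sentence would exceed the budget,
        # so no sentence is ever cut in half and retrieval context is preserved.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("First sentence. Second sentence. Third one here.", max_chars=20))
# → ['First sentence.', 'Second sentence.', 'Third one here.']
```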

Architecture

(System architecture diagram)

Tech Stack

  • Language: Python 3.12+
  • Environment Manager: uv
  • API Framework: FastAPI
  • Database: PostgreSQL + pgvector
  • LLM & Embeddings: OpenAI API (GPT-4o & text-embedding-3-small)
  • Document Parsing: PyMuPDF (fitz) & Tesseract OCR
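
To make the vector-storage piece concrete, the pgvector side could look like the following sketch. The table and column names are hypothetical (the project's actual schema may differ); `vector(1536)` matches the 1536-dimensional text-embedding-3-small embeddings, and `<=>` is pgvector's cosine-distance operator.

```python
# Hypothetical sketch of the pgvector-backed chunk store; the actual table
# and column names in doc-insight may differ.
CHUNKS_TABLE_DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS document_chunks (
    id        SERIAL PRIMARY KEY,
    file_id   INTEGER NOT NULL,
    content   TEXT NOT NULL,
    embedding vector(1536)  -- matches OpenAI's text-embedding-3-small
);
"""

# Nearest-neighbour search via pgvector's cosine-distance operator (<=>),
# as it would run in a parameterized query (e.g. with psycopg).
SEARCH_SQL = """
SELECT content
FROM document_chunks
ORDER BY embedding <=> %(query_embedding)s
LIMIT %(k)s;
"""
```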

Setup & Installation

Prerequisites

  • Docker & Docker Compose (for the Quick Start)
  • Python 3.12+ and uv (for local development)
  • An OpenAI API key

Quick Start (Docker)

The fastest way to get up and running:

# 1. Clone the repository
git clone https://github.com/sheygs/doc-insight.git
cd doc-insight

# 2. Create environment file
cp .env.dev .env

# 3. Edit .env with your values
# APP_ENV=development
# APP_PORT=8000
# OPENAI_API_KEY=your_openai_api_key
# POSTGRES_USER=docinsight
# POSTGRES_PASSWORD=docinsight
# POSTGRES_DB=doc_insight_db
# ...

# 4. Start everything
docker compose up -d

# 5. Check logs
docker compose logs -f app

The API is now running at http://localhost:8000 (or your configured APP_PORT).


Local Development Setup

For development with hot-reload:

Step 1: Clone and Install Dependencies

git clone https://github.com/sheygs/doc-insight.git
cd doc-insight

# Install dependencies
uv sync

# Install dev dependencies
uv sync --dev

Step 2: Configure Environment

# Copy example env file
cp .env.dev .env

Edit .env with your values:

APP_ENV=development
APP_PORT=8000
OPENAI_API_KEY=your_openai_api_key
POSTGRES_USER=docinsight
POSTGRES_PASSWORD=docinsight
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=doc_insight_db
...

Step 3: Start the Database

# Start PostgreSQL with pgvector
docker compose up postgres -d

# Verify it's running
docker compose ps

Step 4: Run Database Migrations (Optional)

# Generate initial migration
uv run alembic revision --autogenerate -m "Initial schema"

# Apply migrations
uv run alembic upgrade head

Step 5: Start the Application

# Using the server entry point
uv run python -m app.server

# Or directly with uvicorn
uv run uvicorn app.app:app --reload

Step 6: Verify Installation

# Root endpoint
curl http://localhost:8000/

# Health check with database status
curl http://localhost:8000/health

View the interactive Swagger documentation at http://localhost:8000/docs


Running Tests

# Make sure database is running
docker compose up postgres -d

# Run all tests
uv run pytest tests/ -v

# Run with coverage
uv run pytest tests/ --cov=app

System Architecture

  1. Ingestion: Files are uploaded and validated.
  2. Extraction: Text is extracted via standard parsing, with OCR as a fallback for scanned or image-based pages.
  3. Vectorization: Text chunks are converted into embeddings using OpenAI's text-embedding-3-small model.
  4. Storage: Embeddings and metadata are stored in a pgvector-enabled PostgreSQL table.
  5. Retrieval: User queries are embedded and compared against the stored chunks using cosine similarity.
  6. Augmentation: The top-k relevant chunks are fed into GPT-4o to generate a grounded response.
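
Steps 5 and 6 can be sketched in plain Python. This is a dependency-free illustration of cosine-similarity ranking, not doc-insight's actual code: function and variable names are assumptions, and in the real system the comparison runs inside PostgreSQL via pgvector.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k_chunks(query_vec, stored, k=3):
    # stored: (chunk_text, embedding) pairs, as they would come from the
    # pgvector table. Rank by similarity to the embedded query, keep top-k.
    ranked = sorted(stored, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy 2-dimensional vectors stand in for the real 1536-dimensional embeddings.
store = [
    ("about cats", [1.0, 0.0]),
    ("about dogs", [0.0, 1.0]),
    ("mixed topics", [0.7, 0.7]),
]
context = "\n".join(top_k_chunks([1.0, 0.1], store, k=2))

# Step 6: the retrieved chunks ground the LLM prompt.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
```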

Database Schema

(Database schema diagram)

API Overview

Method   Endpoint                     Description
GET      /                            Root endpoint
GET      /health                      Health check with database status
GET      /metrics                     Prometheus metrics
POST     /api/v1/files                Upload a PDF or text document
POST     /api/v1/ask                  Query the assistant with a question
GET      /api/v1/files                List all processed documents with file IDs
GET      /api/v1/files/{id}/chunks    Retrieve similar chunks for a document
DELETE   /api/v1/files/{id}           Delete a document by ID

API Testing

You can test the API endpoints using the provided script:

# Ensure the server is running, then run the tests
bash ./scripts/api_tests.sh

# Or make the script executable first
chmod +x scripts/api_tests.sh
./scripts/api_tests.sh

Contributing

Please feel free to submit a Pull Request.
