Doc Insight is a high-performance Retrieval-Augmented Generation (RAG) system that allows users to upload documents (including scanned PDFs), process them using OCR, and engage in context-aware conversations based on the extracted content.
- FastAPI Backend: Fully asynchronous API for high-concurrency document processing.
- Advanced OCR: Support for both text-based and scanned PDFs using PyMuPDF and Tesseract.
- Vector Search: Semantic search powered by PostgreSQL and `pgvector` with 1536-dimensional OpenAI embeddings.
- Smart Chunking: Sentence-based text splitting using NLTK to preserve context for retrieval.
- Background Processing: Non-blocking document chunking and embedding generation via FastAPI background tasks.
- RAG Pipeline: Context-aware responses using GPT-4o augmented with relevant document chunks.
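The sentence-based chunking idea can be sketched in a few lines. This is a simplified stand-in: it uses a naive regex splitter in place of NLTK's `sent_tokenize`, and the `max_chars` parameter is illustrative, not the project's actual default.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter; the real pipeline uses NLTK's sent_tokenize.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks so no sentence is cut mid-way."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

doc = "RAG grounds answers in your data. It retrieves relevant chunks. Then it prompts the model."
print(chunk_text(doc, max_chars=60))
```

Packing whole sentences (rather than fixed character windows) is what keeps each chunk coherent enough to be useful at retrieval time.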
- Language: Python 3.12+
- Environment Manager: uv
- API Framework: FastAPI
- Database: PostgreSQL + `pgvector`
- LLM & Embeddings: OpenAI API (GPT-4o & `text-embedding-3-small`)
- Document Parsing: PyMuPDF (fitz) & Tesseract OCR
- Docker and Docker Compose
- uv (for local development)
- OpenAI API Key
The fastest way to get up and running:
# 1. Clone the repository
git clone https://github.com/sheygs/doc-insight.git
cd doc-insight
# 2. Create environment file
cp .env.dev .env
# 3. Edit .env with your values
# APP_ENV=development
# APP_PORT=8000
# OPENAI_API_KEY=your_openai_api_key
# POSTGRES_USER=docinsight
# POSTGRES_PASSWORD=docinsight
# POSTGRES_DB=doc_insight_db
# ...
# 4. Start everything
docker compose up -d
# 5. Check logs
docker compose logs -f app

The API is now running at http://localhost:8000 (or your configured APP_PORT).
For development with hot-reload:
git clone https://github.com/sheygs/doc-insight.git
cd doc-insight
# Install dependencies
uv sync
# Install dev dependencies
uv sync --dev

# Copy example env file
cp .env.dev .env

Edit .env with your values:
APP_ENV=development
APP_PORT=8000
OPENAI_API_KEY=your_openai_api_key
POSTGRES_USER=docinsight
POSTGRES_PASSWORD=docinsight
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=doc_insight_db
...

# Start PostgreSQL with pgvector
docker compose up postgres -d
# Verify it's running
docker compose ps

# Generate initial migration
uv run alembic revision --autogenerate -m "Initial schema"
# Apply migrations
uv run alembic upgrade head

# Using the server entry point
uv run python -m app.server
# Or directly with uvicorn
uv run uvicorn app.app:app --reload

# Root endpoint
curl http://localhost:8000/
# Health check with database status
curl http://localhost:8000/health
View the interactive Swagger documentation at http://localhost:8000/docs
# Make sure database is running
docker compose up postgres -d
# Run all tests
uv run pytest tests/ -v
# Run with coverage
uv run pytest tests/ --cov=app

- Ingestion: Files are uploaded and validated.
- Extraction: Text is extracted via standard PDF parsing, or via OCR when pages are scanned images.
- Vectorization: Text chunks are converted into embeddings using OpenAI's `text-embedding-3-small` model.
- Storage: Embeddings and metadata are stored in a `pgvector`-enabled PostgreSQL table.
- Retrieval: User queries are embedded and compared against the database using cosine similarity.
- Augmentation: The top-k relevant chunks are fed into GPT-4o to generate a grounded response.
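The retrieval and augmentation steps above can be sketched in miniature. In the real system the cosine comparison runs inside PostgreSQL via `pgvector` and the 1536-dimensional embeddings come from the OpenAI API; here both are replaced with pure-Python stand-ins over toy 3-dimensional vectors, so all data below is illustrative.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy chunk store: (chunk_text, embedding). Real embeddings are 1536-dim.
chunk_store = [
    ("Invoices are due within 30 days.", [0.9, 0.1, 0.0]),
    ("The office cat is named Fitz.",    [0.0, 0.2, 0.9]),
    ("Late payments incur a 2% fee.",    [0.8, 0.3, 0.1]),
]

def retrieve_top_k(query_embedding: list[float], k: int = 2) -> list[str]:
    # Rank stored chunks by cosine similarity to the query embedding.
    ranked = sorted(chunk_store,
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question: str, query_embedding: list[float]) -> str:
    # Augmentation: prepend the retrieved chunks so the model answers from them.
    context = "\n".join(retrieve_top_k(query_embedding))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("When are invoices due?", [0.9, 0.2, 0.0]))
```

With `pgvector`, the same ranking is typically expressed in SQL with the cosine-distance operator and an `ORDER BY ... LIMIT k`, keeping the heavy lifting in the database.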
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Root endpoint |
| GET | `/health` | Health check with database status |
| GET | `/metrics` | Prometheus metrics |
| POST | `/api/v1/files` | Upload a PDF or text document |
| POST | `/api/v1/ask` | Query the assistant with a question |
| GET | `/api/v1/files` | List all processed documents with file IDs |
| GET | `/api/v1/files/{id}/chunks` | Retrieve similar chunks for a document |
| DELETE | `/api/v1/files/{id}` | Delete a document by ID |
You can test the API endpoints using the provided script below:
# Ensure the server is running, then run the tests
bash ./scripts/api_tests.sh
# Or make the script executable first
chmod +x scripts/api_tests.sh
./scripts/api_tests.sh

Please feel free to submit a Pull Request.

