PDFBot

PDFBot is a chatbot that answers questions using the content of a specific PDF document. It generates responses with GPT-4 through the OpenAI API and checks them against a FAISS vector database built from the document, which helps mitigate hallucinations and keep responses accurate.

System Architecture

The system is composed of the following components:

Chatbot Interface: A console-based interface using prompt_toolkit to interact with the user.
PDF Parser: A component that uses PyMuPDF to extract text from a PDF document.
Embedder: Uses SentenceTransformer to generate an embedding for each sentence extracted from the PDF.
Vector Database: Uses the FAISS (Facebook AI Similarity Search) library to store and retrieve embeddings.
Verifier: Uses the cosine_similarity function from sklearn to validate the chatbot's responses by cross-referencing them with the vector database, which helps catch hallucinations.

System Workflow

Parse the PDF: Extracting text from the PDF document using PyMuPDF.
Generate Embeddings: Encoding the extracted text into embeddings using SentenceTransformer.
Store in Vector Database: Adding the embeddings and corresponding sentences to the FAISS vector database.
User Query: Accepting user queries through the console interface.
Generate Response: Using the GPT-4 model from the OpenAI API to generate a response to the user's query.
Verify Response: Cross-referencing the generated response with the vector database to mitigate hallucinations.
Return Verified Response: Presenting the verified response to the user.
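
The store-and-retrieve steps above can be sketched in a few lines. This is only an illustration, not the project's code: TinyVectorStore is a hypothetical in-memory stand-in for the FAISS index so that the example runs with the standard library alone, and the three-dimensional vectors are toy values rather than real SentenceTransformer embeddings.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class TinyVectorStore:
    """Hypothetical stand-in for the FAISS index (exact, brute-force search)."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.vectors: List[List[float]] = []
        self.sentences: List[str] = []

    def add(self, sentence: str, embedding: Sequence[float]) -> None:
        # Reject embeddings whose size differs from the index dimension.
        if len(embedding) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(embedding)}")
        self.vectors.append(normalize(embedding))
        self.sentences.append(sentence)

    def search(self, query: Sequence[float], k: int = 3) -> List[Tuple[str, float]]:
        """Return the k stored sentences most similar to the query vector."""
        q = normalize(query)
        scored = [
            (sentence, sum(a * b for a, b in zip(vec, q)))
            for sentence, vec in zip(self.sentences, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Toy data: each sentence is stored next to its (made-up) embedding.
store = TinyVectorStore(dim=3)
store.add("The warranty lasts two years.", [0.9, 0.1, 0.0])
store.add("Returns are accepted within 30 days.", [0.1, 0.9, 0.0])
store.add("Support is available by email.", [0.0, 0.2, 0.9])

top = store.search([1.0, 0.0, 0.1], k=2)
for sentence, score in top:
    print(f"{score:.3f}  {sentence}")
```

Keeping the sentences in a list parallel to the vectors is what lets a search result be mapped back to readable text. The dimension check in add is an assumption about where a mismatch would be easiest to catch.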

Running the Chatbot

Prerequisites

Python 3.9 or higher
Necessary Python packages (see requirements.txt)

Installation

Clone the repository:

git clone https://github.com/yourusername/pdfbot.git

cd pdfbot

Set up a virtual environment:

python -m venv venv

source venv/bin/activate

On Windows, use this instead:

venv\Scripts\activate

Install the dependencies:

pip install -r requirements.txt

Add your OpenAI API key to your .env file:

OPENAI_API_KEY=your_openai_api_key

Running the Bot

Ensure your PDF document is named document.pdf and is located in the same directory as main.py, then run:

python main.py

Using the Bot

Start the bot by running the command above.
Type your queries into the console.
Type 'exit' or 'quit' to end the session.
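
The session behaviour described above could look roughly like the loop below. It is a sketch under stated assumptions: run_session and its answer callback are hypothetical names, the input comes from a plain iterable instead of prompt_toolkit so the example is self-contained, and treating 'exit' and 'quit' case-insensitively is an assumption.

```python
from typing import Callable, Iterable, List


def run_session(lines: Iterable[str], answer: Callable[[str], str]) -> List[str]:
    """Answer each query until 'exit' or 'quit' is typed; return the replies."""
    replies: List[str] = []
    for raw in lines:
        query = raw.strip()
        if query.lower() in {"exit", "quit"}:
            break  # end the session, ignoring anything typed afterwards
        if not query:
            continue  # skip empty input instead of sending it to the model
        replies.append(answer(query))
    return replies


# A stub stands in for the GPT-4 call so the loop can be run offline.
replies = run_session(
    ["What is covered?", "", "quit", "ignored"],
    lambda q: f"echo: {q}",
)
print(replies)
```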

Hallucination Mitigation Strategy

To keep responses accurate and minimize hallucinations (ideally eliminating them), the following strategy is employed:

Embedding-Based Search: When a response is generated, its embedding is computed using the SentenceTransformer class from the sentence_transformers library.
Vector Database Search: This embedding is searched against the vector database to find the most relevant sentences from the PDF.
Cosine Similarity Calculation: The similarity between the response embedding and each retrieved embedding is measured as the cosine of the angle between the two vectors.
Dynamic Threshold: A dynamic threshold based on the mean and standard deviation of the similarities is used to verify if the response is valid and accurate.
Response Verification: If the similarity exceeds the dynamic threshold, the response is considered verified. Otherwise, the response is flagged as potentially inaccurate.
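
A minimal sketch of this check follows. The README does not give the exact threshold formula, so the version below is an assumption: the threshold is the mean of the similarities plus k population standard deviations, and the response counts as verified when its best match exceeds that threshold. The names dynamic_threshold, verify, and the parameter k are hypothetical, and a pure-Python cosine function stands in for sklearn's cosine_similarity so the example has no dependencies.

```python
import math
import statistics
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dynamic_threshold(similarities: Sequence[float], k: float = 1.0) -> float:
    """Assumed formula: mean + k * population standard deviation."""
    if not similarities:
        raise ValueError("need at least one similarity score")
    return statistics.mean(similarities) + k * statistics.pstdev(similarities)


def verify(
    response_vec: Sequence[float],
    sentence_vecs: Sequence[Sequence[float]],
    k: float = 1.0,
) -> bool:
    """True if the best-matching sentence stands out from the other scores."""
    sims = [cosine_similarity(response_vec, v) for v in sentence_vecs]
    return max(sims) > dynamic_threshold(sims, k)


# Toy vectors: one sentence points the same way as the response, three do not.
sentences = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
print(verify([1.0, 0.0], sentences))  # the response matches one sentence closely
print(verify([1.0, 1.0], sentences))  # the response is equally close to all of them
```

Under this assumed formula, a response that is equally similar to every sentence never passes, because its best score cannot rise above the mean. Raising k makes the check stricter and lowering it makes it more lenient, which is the trade-off the threshold has to balance.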

Challenges Encountered

Embedding Dimensionality Issues: The embeddings generated from the sentences did not match the dimension expected by the FAISS vector database. This was fixed shortly afterward.
Handling Large PDFs: Parsing and processing large PDF documents efficiently, without running into memory issues or performance bottlenecks, was difficult. This was fixed shortly afterward.
Verification Accuracy: Tuning the verification mechanism took considerable time, since it has to detect hallucinations without being too lenient or too strict.

Other Key Learnings

Effective Use of Vector Databases: FAISS provides fast, efficient similarity search, which significantly improves the performance of response verification.
Dynamic Thresholding: I implemented a dynamic threshold based on statistical measures of the similarities (mean and standard deviation) to balance verification strictness against leniency.
