Skip to content

adhyayan-ai/PDFbot

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PDFBot

PDFBot is an intelligent chatbot designed to generate responses enriched by information from a specific PDF document. Using advanced LLM techniques through OpenAI's GPT-4 API and a FAISS vector database, PDFBot effectively mitigates hallucinations and ensures accurate responses.

System Architecture

The system is composed of the following components:

  1. Chatbot Interface: A console-based interface using prompt_toolkit to interact with the user.

  2. PDF Parser: A component that uses PyMuPDF to extract text from a PDF document.

  3. Embedder: Utilizes SentenceTransformer to make embeddings for all of the sentences extracted from the PDF.

  4. Vector Database: Uses the FAISS (Facebook AI Similarity Search) library to store and retrieve embeddings.

  5. Verifier: Uses the cosine_similarity method from sklearn to validate the chatbot's responses by cross-referencing it with the vector database to ensure accuracy and prevent hallucinations.

System Workflow

  1. Parse the PDF: Extracting text from the PDF document using PyMuPDF.

  2. Generate Embeddings: Encoding the extracted text into embeddings using SentenceTransformer.

  3. Store in Vector Database: Adding the embeddings and corresponding sentences to the FAISS vector database.

  4. User Query: Accepting user queries through the console interface.

  5. Generate Response: Using the GPT-4 model from the OpenAI API to generate a response to the user's query.

  6. Verify Response: Cross-referencing the generated response with the vector database to mitigate hallucinations.

  7. Return Verified Response: Presenting the verified response to the user.

Running the Chatbot

Prerequisites

  • Python 3.9 or higher

  • Necessary Python packages (see requirements.txt)

Installation

  1. Clone the repository:

git clone https://github.com/yourusername/pdfbot.git

cd pdfbot

  1. Set up a virtual environment:

python -m venv venv

source venv/bin/activate (On Windows use venv\Scripts\activate)

  1. Install the dependencies:

pip install -r requirements.txt

  1. Add your OpenAI API key to your .env file:

OPENAI_API_KEY=your_openai_api_key

Running the Bot

Ensure your PDF document is named document.pdf and is located in the same directory as the main.py file. Then run this:

python main.py

Using the Bot

  • Start the bot by running the command above.

  • Type your queries into the console.

  • Type 'exit' or 'quit' to end the session.

Hallucination Mitigation Strategy

To make sure that the responses are accurate and that hallucinations are minimized (preferably removed), the following strategy is employed:

  1. Embedding-Based Search: When a response is generated, its embedding is computed using the sentence_transformers method from the SentenceTransformer Library.

  2. Vector Database Search: This embedding is searched against the vector database to find the most relevant sentences from the PDF.

  3. Cosine Similarity Calculation: The similarity between the response embedding and the retrieved embeddings is calculated by calculating the cosines of the angles between the vectors.

  4. Dynamic Threshold: A dynamic threshold based on the mean and standard deviation of the similarities is used to verify if the response is valid and accurate.

  5. Response Verification: If the similarity exceeds the dynamic threshold, the response is considered verified. Otherwise, the response is flagged as potentially inaccurate.

Challenges Encountered

  1. Embedding Dimensionality Issues: The embeddings generated from the sentences didn't match the expected dimensions in the FAISS vector database. This was fixed shortly after.

  2. Handling Large PDFs: Efficiently parsing and processing large PDF documents was difficult without running into memory issues or performance bottlenecks. This was fixed shortly after.

  3. Verification Accuracy: Fine-tuning the verification mechanism took quite a bit of time to ensure it accurately detects hallucinations without being too lenient or too strict.

Other Key Learnings

  1. Effective Use of Vector Databases: Leveraging FAISS for fast and efficient similarity searches, which significantly improves the performance of response verification.

  2. Dynamic Thresholding: I implemented a dynamic threshold based on statistical measures (mean and standard deviation) of similarities to effectively balance between verification strictness and leniency.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages