PDFBot is an intelligent chatbot designed to generate responses enriched by information from a specific PDF document. Using advanced LLM techniques through OpenAI's GPT-4 API and a FAISS vector database, PDFBot effectively mitigates hallucinations and ensures accurate responses.
The system is composed of the following components:
-
Chatbot Interface: A console-based interface using prompt_toolkit to interact with the user.
-
PDF Parser: A component that uses PyMuPDF to extract text from a PDF document.
-
Embedder: Utilizes SentenceTransformer to make embeddings for all of the sentences extracted from the PDF.
-
Vector Database: Uses the FAISS (Facebook AI Similarity Search) library to store and retrieve embeddings.
-
Verifier: Uses the
cosine_similaritymethod from sklearn to validate the chatbot's responses by cross-referencing it with the vector database to ensure accuracy and prevent hallucinations.
-
Parse the PDF: Extracting text from the PDF document using PyMuPDF.
-
Generate Embeddings: Encoding the extracted text into embeddings using SentenceTransformer.
-
Store in Vector Database: Adding the embeddings and corresponding sentences to the FAISS vector database.
-
User Query: Accepting user queries through the console interface.
-
Generate Response: Using the GPT-4 model from the OpenAI API to generate a response to the user's query.
-
Verify Response: Cross-referencing the generated response with the vector database to mitigate hallucinations.
-
Return Verified Response: Presenting the verified response to the user.
-
Python 3.9 or higher
-
Necessary Python packages (see requirements.txt)
- Clone the repository:
git clone https://github.com/yourusername/pdfbot.git
cd pdfbot
- Set up a virtual environment:
python -m venv venv
source venv/bin/activate (On Windows use venv\Scripts\activate)
- Install the dependencies:
pip install -r requirements.txt
- Add your OpenAI API key to your .env file:
OPENAI_API_KEY=your_openai_api_key
Ensure your PDF document is named document.pdf and is located in the same directory as the main.py file. Then run this:
python main.py
-
Start the bot by running the command above.
-
Type your queries into the console.
-
Type 'exit' or 'quit' to end the session.
To make sure that the responses are accurate and that hallucinations are minimized (preferably removed), the following strategy is employed:
-
Embedding-Based Search: When a response is generated, its embedding is computed using the
sentence_transformersmethod from the SentenceTransformer Library. -
Vector Database Search: This embedding is searched against the vector database to find the most relevant sentences from the PDF.
-
Cosine Similarity Calculation: The similarity between the response embedding and the retrieved embeddings is calculated by calculating the cosines of the angles between the vectors.
-
Dynamic Threshold: A dynamic threshold based on the mean and standard deviation of the similarities is used to verify if the response is valid and accurate.
-
Response Verification: If the similarity exceeds the dynamic threshold, the response is considered verified. Otherwise, the response is flagged as potentially inaccurate.
-
Embedding Dimensionality Issues: The embeddings generated from the sentences didn't match the expected dimensions in the FAISS vector database. This was fixed shortly after.
-
Handling Large PDFs: Efficiently parsing and processing large PDF documents was difficult without running into memory issues or performance bottlenecks. This was fixed shortly after.
-
Verification Accuracy: Fine-tuning the verification mechanism took quite a bit of time to ensure it accurately detects hallucinations without being too lenient or too strict.
-
Effective Use of Vector Databases: Leveraging FAISS for fast and efficient similarity searches, which significantly improves the performance of response verification.
-
Dynamic Thresholding: I implemented a dynamic threshold based on statistical measures (mean and standard deviation) of similarities to effectively balance between verification strictness and leniency.