This project implements a private, local semantic search engine designed to manage and search a large number of sensitive internal files.
This tool was initially built to solve a critical problem: the difficulty of finding specific information across department servers containing 10,000+ files of various types (PDFs, DOCX, PPTX, etc.) and languages. Traditional keyword search often fails to capture the context and meaning of documents. This semantic search engine provides a context-aware solution, drastically improving the speed and accuracy of internal document retrieval.
The architecture is built on the powerful combination of:
- IBM Docling: For robust document parsing and text extraction from various file types.
- Google's EmbeddingGemma: A high-quality model, hosted on Hugging Face, for generating context-rich semantic embeddings. This model is multilingual, supporting over 100 languages.
- LanceDB: The high-performance, serverless vector database used for storing and querying the generated document embeddings.
Crucially, after the initial model download, this engine runs entirely offline (locally), ensuring complete data privacy and security.
- Offline-First: All embedding and searching processes run locally after the initial model download.
- Private & Local: Designed to ensure data sovereignty by keeping all indexing and searching within your private network.
- Multi-Format Support: Easily index and search documents including PDF, DOCX, PPTX, and plain text files.
- Multilingual Support: Uses EmbeddingGemma, which supports over 100 languages, enabling cross-lingual semantic search.
- Efficient Vector Storage: Leverages LanceDB for rapid, serverless vector search operations.
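As an illustration of the multi-format support listed above, the sketch below collects candidate files by extension before they are handed to a parser. It is a minimal example, not the project's actual code: the function name `find_indexable_files` and the `SUPPORTED_EXTENSIONS` set are hypothetical, and the extension list is assumed from the formats named in this README.

```python
from pathlib import Path

# Assumed extension set, based on the formats named in this README.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".txt"}


def find_indexable_files(root):
    """Return supported files under root (recursively), in sorted order."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        # Compare suffixes case-insensitively so "REPORT.PDF" is included.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `find_indexable_files("/path/to/share")` returns a list of `Path` objects that could then be passed to the parsing stage one by one.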
The models are sourced from Hugging Face. You will need a Hugging Face account token to download the models initially.
- Clone the repository:
    git clone https://github.com/juliobellano/semantic_search.git
    cd semantic_search
- Install dependencies: All required Python libraries are listed in requirements.txt.
    pip install -r requirements.txt
- Set up Hugging Face Token: Log in with your Hugging Face User Access Token (alternatively, set it as the HF_TOKEN environment variable). This is only needed for the initial model download.
    hf auth login
Once the models are downloaded, you can run the program entirely offline.
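If you want to be sure that no network request is attempted after the download, one optional safeguard is the `HF_HUB_OFFLINE` environment variable, which the `huggingface_hub` library reads. The sketch below is an assumption about how you might wire this in; this README does not state that the project sets it for you, and `enable_offline_mode` is a hypothetical helper name.

```python
import os


def enable_offline_mode():
    """Ask Hugging Face libraries to use only the local model cache."""
    # Set this before importing any library that loads models,
    # because the value is read when huggingface_hub is imported.
    os.environ["HF_HUB_OFFLINE"] = "1"


enable_offline_mode()
```

With this set, a model that is missing from the local cache should fail to load with an error instead of triggering a download.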
The system workflow consists of two primary stages: Indexing Documents and Performing Search Queries.
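To make the two stages concrete, here is a deliberately simplified, dependency-free sketch. It is not the project's implementation: `embed` is a toy bag-of-words stand-in for EmbeddingGemma (so it matches shared words, not meaning), a plain Python list stands in for LanceDB, and every function and file name is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a unit-length word-count vector."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


def build_index(documents):
    """Stage 1 (indexing): store one vector per document name."""
    return [(name, embed(text)) for name, text in documents.items()]


def search(index, query, top_k=3):
    """Stage 2 (search): rank documents by cosine similarity to the query."""
    q = embed(query)
    scored = [
        # Vectors are unit length, so the dot product is the cosine similarity.
        (sum(weight * vec.get(word, 0.0) for word, weight in q.items()), name)
        for name, vec in index
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]


docs = {
    "leave_policy.txt": "Annual leave policy and vacation requests",
    "server_guide.txt": "How to restart the department file server",
}
index = build_index(docs)
results = search(index, "restart the server", top_k=1)
```

Here `results` holds the best-matching document name with its similarity score. In the real system, the embedding model replaces `embed` and the vector database replaces the in-memory list, but the index-then-query shape is the same.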