juliobellano/local_semantic_search

🔍 Private Semantic Search Engine (Offline-First)


📝 Overview

This project implements a private, local semantic search engine for managing and searching large collections of sensitive internal files.

💡 Motivation

This tool was initially built to solve a specific problem: finding information across department servers that hold 10,000+ files of various types (PDF, DOCX, PPTX, etc.) and languages. Traditional keyword search matches exact terms and often misses the context and meaning of a document. This semantic search engine retrieves documents by meaning instead, making internal document retrieval faster and more accurate.

Core Technology Stack

The architecture is built on the powerful combination of:

  • IBM Docling: For robust document parsing and text extraction from various file types.
  • Google's EmbeddingGemma: A high-quality model, hosted on Hugging Face, for generating context-rich semantic embeddings. This model is multilingual, supporting over 100 languages.
  • LanceDB: The high-performance, serverless vector database used for storing and querying the generated document embeddings.

Crucially, after the initial model download, this engine runs entirely offline (locally), so document contents never leave your machine.
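
As a rough illustration of how the three components could fit together, the sketch below parses one file with Docling, embeds its chunks, and writes them to a LanceDB table. It is not taken from this repository: the `sentence-transformers` loader, the `google/embeddinggemma-300m` model ID, the `chunk_text` helper, the chunk sizes, and the table and column names are all assumptions made for the example.

```python
from pathlib import Path


def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    chunks = [text[i:i + size] for i in range(0, len(text), step)]
    # Drop windows that contain only whitespace.
    return [c for c in chunks if c.strip()]


def index_file(path: str, db_dir: str = "./lancedb", table_name: str = "documents") -> int:
    """Parse, embed, and store one file. Returns the number of stored chunks."""
    # Third-party imports are local so chunk_text works without them installed.
    import lancedb
    from docling.document_converter import DocumentConverter
    from sentence_transformers import SentenceTransformer

    # 1. Docling: parse the file and export its text.
    text = DocumentConverter().convert(path).document.export_to_markdown()
    chunks = chunk_text(text)
    if not chunks:
        return 0

    # 2. EmbeddingGemma (assumed model ID): one vector per chunk.
    model = SentenceTransformer("google/embeddinggemma-300m")
    vectors = model.encode(chunks)

    # 3. LanceDB: store each vector alongside its source text.
    rows = [
        {"vector": vec.tolist(), "text": chunk, "source": Path(path).name}
        for vec, chunk in zip(vectors, chunks)
    ]
    db = lancedb.connect(db_dir)
    db.create_table(table_name, data=rows, mode="overwrite")
    return len(rows)
```

Note that `mode="overwrite"` replaces the table on every call, which keeps the sketch short; indexing many files would instead create the table once and append to it.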


✨ Key Features

  • Offline-First: All embedding and searching processes run locally after the initial model download.
  • Private & Local: Designed to ensure data sovereignty by keeping all indexing and searching within your private network.
  • Multi-Format Support: Easily index and search documents including PDF, DOCX, PPTX, and plain text files.
  • Multilingual Support: Uses EmbeddingGemma, which supports over 100 languages, enabling cross-lingual semantic search.
  • Efficient Vector Storage: Leverages LanceDB for rapid, serverless vector search operations.

⚙️ Installation

Prerequisites

The models are sourced from Hugging Face. You will need a Hugging Face account and a User Access Token to download them the first time.

  1. Clone the repository:

    git clone https://github.com/juliobellano/semantic_search.git
    cd semantic_search
  2. Install dependencies: All required Python libraries are listed in requirements.txt.

    pip install -r requirements.txt
  3. Set up Hugging Face Token: Log in with your Hugging Face User Access Token. The command below prompts for the token and caches it locally. This is only needed for the initial model download.

    hf auth login

    Once the models are downloaded, you can run the program entirely offline.
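
One way to make the offline behavior explicit is to set Hugging Face's offline switches before any model-loading code runs. `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` are environment variables read by the `huggingface_hub` and `transformers` libraries. Whether this repository sets them itself is not stated, so treat the `enable_offline_mode` helper as a hypothetical sketch.

```python
import os


def enable_offline_mode() -> None:
    """Tell Hugging Face libraries to use only the local cache (hypothetical helper)."""
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"


# Call this before importing libraries that load models, since these
# variables are typically read when the library is first imported.
enable_offline_mode()
```

With these set, a model that is missing from the local cache fails to load instead of triggering a network request, which makes an incomplete download easy to spot.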


🚀 Usage

The system workflow consists of two primary stages: Indexing Documents and Performing Search Queries.
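
The two stages can be illustrated with a dependency-free toy. This is only a conceptual sketch: `toy_embed` is a stand-in that counts words from a tiny fixed vocabulary, whereas the real engine uses EmbeddingGemma vectors stored in LanceDB, and every name here is hypothetical.

```python
import math

VOCAB = ["invoice", "budget", "contract", "holiday", "policy"]  # toy vocabulary


def toy_embed(text: str) -> list[float]:
    """Stand-in embedder: counts vocabulary words. Not a semantic model."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(docs: dict[str, str]) -> list[tuple[str, list[float]]]:
    """Stage 1 (indexing): embed every document once and keep the vectors."""
    return [(name, toy_embed(text)) for name, text in docs.items()]


def search(index: list[tuple[str, list[float]]], query: str, k: int = 3) -> list[tuple[str, float]]:
    """Stage 2 (querying): embed the query and rank documents by similarity."""
    q = toy_embed(query)
    scored = [(name, cosine(q, vec)) for name, vec in index]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


docs = {
    "finance.txt": "budget review and invoice approval",
    "hr.txt": "holiday policy for new staff",
}
index = build_index(docs)
results = search(index, "holiday policy", k=1)
print(results[0][0])  # hr.txt
```

The split matters because indexing is the expensive step and runs once per document, while each query only needs one embedding and a nearest-neighbor lookup.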
