🔍 Private Semantic Search Engine (Offline-First)

📝 Overview

This project implements a private, local semantic search engine for managing and searching large collections of sensitive internal files.

💡 Motivation

This tool was initially built to solve a concrete problem: finding specific information across department servers holding 10,000+ files of various types (PDF, DOCX, PPTX, etc.) and languages. Traditional keyword search matches exact terms and often misses the context and meaning of a document. This semantic search engine instead retrieves documents by meaning, with the goal of making internal document retrieval faster and more accurate.

Core Technology Stack

The architecture combines three components:

IBM Docling: For robust document parsing and text extraction from various file types.
- Documentation Link
Google's EmbeddingGemma: A high-quality model, hosted on Hugging Face, for generating context-rich semantic embeddings. This model is multilingual, supporting over 100 languages.
- Documentation Link
LanceDB: The high-performance, serverless vector database used for storing and querying the generated document embeddings.

After the initial model download, the engine runs entirely offline on the local machine, so document contents never need to leave it.
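To make the role of the embeddings and the vector store concrete, here is a minimal sketch of what a vector search does conceptually: score every stored vector against the query vector and return the closest rows. The hand-written 3-dimensional vectors, the row layout, and the `top_k` helper are assumptions for illustration only. Real embeddings have far more dimensions, and LanceDB's index structures and default distance metric may differ from the brute-force cosine ranking shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vector, rows, k=3):
    """Return the k rows whose 'vector' is most similar to query_vector."""
    scored = [(cosine_similarity(query_vector, row["vector"]), row) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:k]]


# Hypothetical rows: hand-written vectors stand in for real embeddings.
rows = [
    {"text": "annual leave policy", "vector": [0.9, 0.1, 0.0]},
    {"text": "server maintenance log", "vector": [0.0, 0.2, 0.9]},
    {"text": "vacation request form", "vector": [0.8, 0.3, 0.1]},
]
best = top_k([1.0, 0.2, 0.0], rows, k=2)
```

The two leave-related rows rank ahead of the maintenance log because their vectors point in nearly the same direction as the query, even though the texts share few words.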

✨ Key Features

Offline-First: All embedding and searching processes run locally after the initial model download.
Private & Local: Designed to ensure data sovereignty by keeping all indexing and searching within your private network.
Multi-Format Support: Easily index and search documents including PDF, DOCX, PPTX, and plain text files.
Multilingual Support: Uses EmbeddingGemma, which supports over 100 languages, enabling cross-lingual semantic search.
Efficient Vector Storage: Leverages LanceDB for rapid, serverless vector search operations.
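As an illustration of the multi-format feature, the sketch below collects candidate files from a directory tree before indexing. The function name `find_indexable_files` and the exact extension set are assumptions based on the formats listed above; the project's own scripts may select files differently.

```python
from pathlib import Path

# Assumed extension set, taken from the formats named in the feature list.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".txt"}


def find_indexable_files(root, extensions=SUPPORTED_EXTENSIONS):
    """Recursively list files under root whose suffix is supported."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )
```

Comparing the lowercased suffix means files such as `REPORT.PDF` are picked up as well, which matters on shared drives where naming is inconsistent.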

⚙️ Installation

Prerequisites

The models are sourced from Hugging Face. You will need a Hugging Face User Access Token to download them the first time.

Clone the repository:

git clone https://github.com/juliobellano/semantic_search.git
cd semantic_search

Install dependencies: All required Python libraries are listed in requirements.txt.
```
pip install -r requirements.txt
```
Set up Hugging Face Token: Log in with your Hugging Face User Access Token through the Hugging Face CLI, which stores the token locally. This is only needed for the initial model download.
```
hf auth login
```
Once the models are downloaded, you can run the program entirely offline.
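If you want to be sure that no network request is attempted after the download, the Hugging Face libraries honor the `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` environment variables. The helper below is a minimal sketch and not part of this repository; the name `enable_offline_mode` is hypothetical, and whether the project's scripts already set these variables is not documented here.

```python
import os


def enable_offline_mode(env=os.environ):
    """Ask Hugging Face libraries to rely on the local cache only."""
    env["HF_HUB_OFFLINE"] = "1"        # read by huggingface_hub
    env["TRANSFORMERS_OFFLINE"] = "1"  # read by transformers
    return env


# Call this before importing any Hugging Face library,
# since the variables are read when those libraries load.
enable_offline_mode()
```

With these variables set, a model that is missing from the local cache produces an error instead of a download attempt, which makes an incomplete setup easy to spot.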

🚀 Usage

The system workflow consists of two primary stages: Indexing Documents and Performing Search Queries.
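The two stages can be sketched end to end with the standard library alone. This is a toy under stated assumptions, not the project's implementation: `chunk_text`, `toy_embed`, `build_index`, and `search` are hypothetical names, the word-count vector stands in for EmbeddingGemma (so it matches words, not meaning), and a plain list stands in for a LanceDB table. In the real pipeline the text would come from Docling rather than from inline strings.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=8, overlap=2):
    """Split text into overlapping windows of `size` words (overlap < size)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def toy_embed(text):
    """Stand-in for a real embedding model: normalized word counts."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def build_index(documents, embed=toy_embed):
    """Stage 1: chunk every document and store one row per chunk."""
    index = []
    for name, text in documents.items():
        for position, chunk in enumerate(chunk_text(text)):
            index.append(
                {"file": name, "chunk": position, "text": chunk, "vector": embed(chunk)}
            )
    return index


def search(index, query, k=3, embed=toy_embed):
    """Stage 2: embed the query and rank rows by dot product."""
    query_vector = embed(query)

    def score(row):
        return sum(
            weight * row["vector"].get(word, 0.0)
            for word, weight in query_vector.items()
        )

    return sorted(index, key=score, reverse=True)[:k]


# Hypothetical documents; real input would be text extracted from files.
documents = {
    "hr_policy.txt": "Employees may request annual leave through the HR portal at any time",
    "it_guide.txt": "Restart the backup server before applying the monthly security patches",
}
index = build_index(documents)
hits = search(index, "how to request annual leave", k=1)
```

Indexing is the expensive stage and only needs to be repeated when files change, while each search embeds a single short query. Passing `embed` as a parameter keeps the sketch open to swapping the toy function for a real model.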

