machinelearningZH/ogd_ai-search

🦄 OGD AI Search

Semantic, lexical, and multilingual search for your OGD metadata catalog.

Usage

# Clone the repository
git clone https://github.com/statistikZH/ogd_ai-search.git
cd ogd_ai-search

# Install dependencies
pip3 install uv
uv venv
source .venv/bin/activate
uv sync

# Create search index
# Run 01_mdv_search.ipynb to create the Weaviate search index

# Start the app
cd _streamlit
streamlit run ai-search.py

Overview

Search the Canton of Zurich's open government data (OGD) catalog using hybrid search, which combines lexical keyword matching with semantic similarity. The application accepts queries in multiple languages, including German and other European languages.

The search uses intfloat/multilingual-e5-small for embeddings via sentence-transformers. This is a multilingual model that supports German and has a 512-token context length. Search results are powered by Weaviate, an open-source vector database.
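As a rough illustration of how embeddings might be produced with this model, here is a minimal sketch. The E5 model card recommends prefixing inputs with `query: ` or `passage: `. Whether the indexing notebook in this repository does so is not stated here, so treat that as an assumption. The helper names `e5_inputs` and `embed` are hypothetical and not part of this repository.

```python
def e5_inputs(texts, kind):
    """Prefix texts the way the E5 model card recommends.

    kind is "query" for search queries and "passage" for indexed documents.
    """
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text.strip()}" for text in texts]


def embed(texts, kind, model_name="intfloat/multilingual-e5-small"):
    """Embed texts with sentence-transformers (hypothetical helper).

    The import is done lazily so the prefix helper above can be used
    without the library installed. The model is downloaded on first use.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # Normalized vectors make cosine similarity equal to the dot product.
    return model.encode(e5_inputs(texts, kind), normalize_embeddings=True)


prepared = e5_inputs(["  Luftqualität in Zürich "], "query")
```

Inputs longer than the model's context length are truncated, so long metadata descriptions may need to be shortened or split before embedding.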

What is semantic search?

Semantic search finds text based on meaning rather than exact keywords. For example, searching for disease can return documents containing illness, virus, infection, treatment, or healthcare without the exact word disease appearing.

Using statistical methods and machine learning, language models learn word and sentence similarities from large text corpora. While semantic search has many advantages, it is approximate rather than exact and may include false positives or miss relevant entries.
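To make the idea concrete, the sketch below ranks a few terms by cosine similarity to a query vector. The three-dimensional vectors are made up for illustration. Real embeddings come from the language model and have several hundred dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vector, embeddings):
    """Return the keys of `embeddings`, most similar to the query first."""
    return sorted(
        embeddings,
        key=lambda term: cosine_similarity(query_vector, embeddings[term]),
        reverse=True,
    )


# Made-up toy vectors, NOT real model output.
toy_embeddings = {
    "illness": [0.9, 0.1, 0.0],
    "infection": [0.8, 0.3, 0.1],
    "railway": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # stands in for the query "disease"

ranked = rank(query_vector, toy_embeddings)
```

With these toy values, the health-related terms rank above the unrelated one even though none of them shares a word with the query.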

Hybrid search combines lexical and semantic approaches, delivering both exact keyword matches and semantically similar results.
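One common way to combine the two signals is a weighted sum of normalized scores. The sketch below illustrates that idea only. It is not Weaviate's implementation, and the function names and example scores are hypothetical. It follows Weaviate's convention that `alpha = 1` means pure vector search and `alpha = 0` means pure keyword search.

```python
def min_max_normalize(scores):
    """Scale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(keyword_scores, vector_scores, alpha=0.5):
    """Blend keyword and vector scores; alpha weights the vector side."""
    kw = min_max_normalize(keyword_scores)
    vec = min_max_normalize(vector_scores)
    docs = set(kw) | set(vec)
    # A document missing from one result list contributes 0 for that side.
    return {
        doc: alpha * vec.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
        for doc in docs
    }


# Made-up scores for three documents.
keyword_scores = {"doc_a": 2.0, "doc_b": 1.5, "doc_c": 0.5}
vector_scores = {"doc_a": 0.3, "doc_b": 0.9, "doc_c": 0.8}

combined = hybrid_scores(keyword_scores, vector_scores, alpha=0.5)
best = max(combined, key=combined.get)
```

Here `doc_b` wins because it scores well on both sides, while `doc_a` would win on keyword matching alone.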

Project Team

Laure Stadler, Chantal Amrhein, Patrick Arnecke
Statistisches Amt Zürich: Team Data

Many thanks to Corinna Grobe and our former colleague Adrian Rupp.

Feedback and Contributing

We'd love to hear from you. Share your feedback or ideas by emailing us, opening an issue, or submitting a pull request.

We use Ruff for linting and code formatting with default settings.

Disclaimer

This software (the Software) incorporates models (Models) from Hugging Face and others and has been developed according to and with the intent to be used under Swiss law. Please be aware that the EU Artificial Intelligence Act (EU AI Act) may, under certain circumstances, be applicable to your use of the Software. You are solely responsible for ensuring that your use of the Software as well as of the underlying Models complies with all applicable local, national and international laws and regulations. By using this Software, you acknowledge and agree (a) that it is your responsibility to assess which laws and regulations, in particular regarding the use of AI technologies, are applicable to your intended use and to comply therewith, and (b) that you will hold us harmless from any action, claims, liability or loss in respect of your use of the Software.
