Welcome to the Tübingen Search Engine project repository! This project aims to build a search engine focused on Tübingen, integrating web crawling, information retrieval, and a modern web interface.
Visualization source: Jina AI
The project is organized into the following directories:
- /src: Contains the core Python code for web crawling (/crawler) and information retrieval (/retriever).
- /frontend: Houses the React/Next.js application for the web interface.
- /backend: Includes the FastAPI-based backend server code.
The main crawler can be found in the /src/crawler directory. The crawler is responsible for fetching web pages, extracting text content, and storing the data.
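The three crawler responsibilities above (fetch, extract, store) can be sketched with the Python standard library alone. This is a minimal illustration, not the code in /src/crawler: the names `fetch`, `extract_text`, and `store_page`, the user agent string, and the hashed file naming are all assumptions made for the example.

```python
import hashlib
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it with the charset the server declares."""
    # Hypothetical user agent; a real crawler should identify itself properly.
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def store_page(url: str, text: str, out_dir: Path) -> Path:
    """Write the extracted text to a file named after a hash of the URL."""
    out_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".txt"
    path = out_dir / name
    path.write_text(text, encoding="utf-8")
    return path
```

A single crawl step would then be `store_page(url, extract_text(fetch(url)), Path("out"))`. A real crawler also needs a frontier queue, politeness delays, and robots.txt handling, which this sketch leaves out.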
The information retrieval system is located in the /src/retriever_v2 directory. It includes the indexing and search components. The ensemble retrieval model combines the BM25, embedding-based, and NLI-based retrieval models. The code can be found in /src/retriever_v2/main.py.
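One common way to combine several retrieval models is to normalize each model's scores and take a weighted sum. The sketch below illustrates that idea only. How /src/retriever_v2/main.py actually fuses the BM25, embedding, and NLI scores is not described here, so the function names, the min-max normalization, and the weights are assumptions.

```python
def min_max_normalize(scores):
    """Rescale one model's scores to [0, 1] so models become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All documents tie, so this model contributes no ranking signal.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def ensemble_rank(model_scores, weights):
    """Fuse per-model scores into one ranking via a weighted sum.

    model_scores maps a model name to {doc_id: raw_score}.
    weights maps a model name to its (hypothetical) weight.
    A document missing from a model's results contributes 0 for that model.
    """
    combined = {}
    for model, scores in model_scores.items():
        weight = weights.get(model, 0.0)
        for doc, score in min_max_normalize(scores).items():
            combined[doc] = combined.get(doc, 0.0) + weight * score
    # Highest combined score first; ties are broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Made-up scores for three documents, purely for illustration.
example_scores = {
    "bm25": {"a": 2.0, "b": 1.0, "c": 0.0},
    "embedding": {"a": 0.1, "b": 0.9},
    "nli": {"b": 0.5, "c": 0.5},
}
ranking = ensemble_rank(example_scores, {"bm25": 1.0, "embedding": 1.0, "nli": 1.0})
```

Normalizing first matters because BM25 scores are unbounded while similarity and entailment scores usually fall in a fixed range. Without it, one model would dominate the sum.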
Follow these instructions to set up and run the Tübingen Search Engine locally.
- Python 3.x (recommended: Python 3.10+)
- Node.js (recommended: Node.js 20.14+)
- npm (Node Package Manager)
Additionally, you need the document text files in the /src/retriever_v2/index/docs directory and the frontier dataset at /src/retriever_v2/index/index.csv.
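Because the retriever depends on these two data locations, it can help to verify them before starting the server. The following is a small sketch of such a check. The helper `missing_index_data` is hypothetical and not part of the repository; only the two paths are taken from the text above.

```python
from pathlib import Path

# Paths from the prerequisites, relative to the repository root.
DOCS_DIR = Path("src/retriever_v2/index/docs")
INDEX_CSV = Path("src/retriever_v2/index/index.csv")


def missing_index_data(repo_root: Path) -> list[str]:
    """Return a list of problems with the required data; empty means OK."""
    problems = []
    docs_dir = repo_root / DOCS_DIR
    if not docs_dir.is_dir():
        problems.append(f"missing directory: {DOCS_DIR.as_posix()}")
    elif not any(docs_dir.iterdir()):
        problems.append(f"empty directory: {DOCS_DIR.as_posix()}")
    if not (repo_root / INDEX_CSV).is_file():
        problems.append(f"missing file: {INDEX_CSV.as_posix()}")
    return problems
```

Calling `missing_index_data(Path("."))` from the repository root returns an empty list when both the document files and the frontier dataset are in place.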
Clone the repository to your local machine:
git clone https://github.com/baz2z/mse-group-project.git
cd mse-group-project
From the root directory, install the Python dependencies:
pip install -r requirements.txt
Navigate to the frontend/ directory and install Node.js dependencies:
cd frontend/
npm install
This installs the required npm packages for the React/Next.js frontend.
In the backend/ directory, start the FastAPI server:
uvicorn app:app --reload
This command launches the backend server with auto-reloading enabled for development. The server will be accessible at http://localhost:8000.
Start the Frontend Development Server
In the /frontend directory, start the React/Next.js development server:
npm run dev
Open your browser and navigate to http://localhost:3000 to view the Tübingen Search Engine.
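If the page does not load, a quick way to see which of the two servers is down is to test whether their ports accept connections. The sketch below uses only the standard library. The helper `port_is_open` is a hypothetical name, and the ports 8000 and 3000 are the ones mentioned above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

For example, `port_is_open("localhost", 8000)` checks the backend and `port_is_open("localhost", 3000)` checks the frontend. This only confirms that something is listening on the port, not that the application behind it is healthy.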