Search Engine API

A high-performance web search engine built in Go, featuring intelligent web crawling, TF-IDF based relevance ranking, and a RESTful API. The engine respects robots.txt policies and supports distributed caching with Redis.

Features

  • Intelligent Web Crawler: Automatically crawls websites while respecting robots.txt policies and crawl delays
  • TF-IDF Ranking: Uses Term Frequency-Inverse Document Frequency algorithm for search result relevance
  • Stemming Support: Implements Snowball stemming algorithm for better search matching
  • Distributed Caching: Redis integration for high-performance search operations
  • Multiple Database Support: Compatible with MySQL, SQLite, and SQL Server via GORM
  • RESTful API: Clean HTTP endpoints for searching and crawling operations
  • CORS Support: Configurable cross-origin resource sharing for development
  • Concurrent Processing: Efficient parallel crawling and indexing

Architecture

The search engine consists of several key components:

  • Crawler: Fetches and processes web pages concurrently
  • Indexer: Builds and maintains the search index with word frequencies
  • TF-IDF Calculator: Computes relevance scores for search results
  • Database Layer: Persistent storage for indexed documents
  • Redis Cache: Fast access to frequently searched terms
  • API Server: Gin-based HTTP server handling search and crawl requests

Prerequisites

  • Go 1.23.0 or higher
  • Redis server
  • Database server (MySQL, SQLite, or SQL Server)
  • Docker (optional, for containerized deployment)

Installation

  1. Clone the repository:
git clone <repository-url>
cd search
  2. Install dependencies:
go mod download
  3. Set up environment variables (see the Configuration section)

Configuration

Create a .env file in the project root by copying the example file:

cp .example.env .env

Then set the following values:

MYSQL_USERNAME=<mysql-username>
MYSQL_PASSWORD=<mysql-password>
MYSQL_HOST=<mysql-host:port>
MYSQL_DATABASE=<database-name>
REDIS_HOST=<redis-host:port>
REDIS_PASSWORD=<redis-password>
ENV=development

Environment Variables

| Variable | Description | Required | Default |
| --- | --- | --- | --- |
| MYSQL_USERNAME | MySQL username | Yes | - |
| MYSQL_PASSWORD | MySQL password | Yes | - |
| MYSQL_HOST | MySQL host and port | No | localhost:3306 |
| MYSQL_DATABASE | MySQL database name | Yes | - |
| REDIS_HOST | Redis server host and port | Yes | - |
| REDIS_PASSWORD | Redis authentication password | Yes | - |
| ENV | Environment mode (development/production) | No | - |

Note: The application automatically constructs the MySQL DSN from the provided credentials in the format: username:password@tcp(host:port)/database?charset=utf8mb4&parseTime=True&loc=Local
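The DSN assembly can be sketched as follows (an illustrative snippet, not the project's actual code; the function name and the fallback behavior for an unset MYSQL_HOST are assumptions based on the table above):

```go
package main

import (
	"fmt"
	"os"
)

// buildDSN assembles a GORM-style MySQL DSN from credentials.
// An empty host falls back to the documented default, localhost:3306.
func buildDSN(user, pass, host, db string) string {
	if host == "" {
		host = "localhost:3306"
	}
	return fmt.Sprintf("%s:%s@tcp(%s)/%s?charset=utf8mb4&parseTime=True&loc=Local",
		user, pass, host, db)
}

func main() {
	dsn := buildDSN(
		os.Getenv("MYSQL_USERNAME"),
		os.Getenv("MYSQL_PASSWORD"),
		os.Getenv("MYSQL_HOST"),
		os.Getenv("MYSQL_DATABASE"),
	)
	fmt.Println(dsn)
}
```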

Running the Application

Local Development

ENV=development go run .

The server will start on http://localhost:8080

Docker Deployment

Option 1: Using Docker Compose (Recommended)

docker compose up

Option 2: Manual Docker Setup

  1. Build the Docker image:
docker build -t search-engine .
  2. Run the container:
docker run -p 8080:8080 \
  -e MYSQL_USERNAME="your-mysql-username" \
  -e MYSQL_PASSWORD="your-mysql-password" \
  -e MYSQL_HOST="mysql:3306" \
  -e MYSQL_DATABASE="your-database-name" \
  -e REDIS_HOST="redis:6379" \
  -e REDIS_PASSWORD="your-redis-password" \
  search-engine

API Endpoints

Search Documents

Search the indexed documents for a given term.

Endpoint: POST /search

Request Body:

{
  "SearchTerm": "golang programming"
}

Response (200 OK):

{
  "HITS": [
    {
      "URL": "https://example.com/golang-guide",
      "TITLE": "Complete Guide to Golang",
      "DESCRIPTION": "Learn Go programming from basics to advanced",
      "TFIDF": 0.89
    }
  ],
  "TERM": "golang programming"
}

Error Response (404):

{
  "error": "No results found"
}

Start Crawling

Initiate a web crawl for a specified host.

Endpoint: POST /crawl

Request Body:

{
  "Host": "https://example.com"
}

Response (200 OK):

{
  "success": "true",
  "host": "https://example.com"
}

Note: Crawling runs asynchronously in the background. Results will be indexed and available for search as pages are processed.

Serve Documents

Serves locally stored documents from the corpus (primarily for testing).

Endpoint: GET /documents/top10/*any

Example: GET /documents/top10/index.html

How It Works

Crawling Process

  1. Initialization: The crawler fetches and parses robots.txt to determine crawl policies
  2. Queue Management: URLs are queued and deduplicated using a hash set
  3. Concurrent Downloads: Multiple pages are downloaded in parallel while respecting crawl delays
  4. Content Extraction: HTML is parsed to extract text content, links, titles, and descriptions
  5. Stemming: Words are stemmed using the Snowball algorithm for better matching
  6. Indexing: Word frequencies and metadata are stored in the database and Redis cache
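The queue-and-dedup core of steps 2 and 3 can be sketched as below (a simplified, sequential illustration; the real crawler in crawl.go downloads pages concurrently, extracts links from live HTML, and honors robots.txt delays):

```go
package main

import "fmt"

// crawl walks the link graph breadth-first, deduplicating URLs with a
// hash set so each page is enqueued and fetched at most once.
func crawl(start string, links map[string][]string) []string {
	seen := map[string]bool{start: true} // hash-set deduplication
	queue := []string{start}
	var visited []string
	for len(queue) > 0 {
		url := queue[0]
		queue = queue[1:]
		visited = append(visited, url)
		// In the real crawler this is a concurrent download plus
		// HTML extraction step; here the links are precomputed.
		for _, next := range links[url] {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, next)
			}
		}
	}
	return visited
}

func main() {
	links := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/b", "/"}, // cycles are skipped by the seen set
	}
	fmt.Println(crawl("/", links)) // each page appears exactly once
}
```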

Search Algorithm

  1. Query Processing: Search terms are stemmed to match the indexed format
  2. Frequency Lookup: Term frequencies are retrieved from the index
  3. TF-IDF Calculation:
    • TF (Term Frequency): termCount / totalWords
    • IDF (Inverse Document Frequency): log10(numDocs / (docsContainingWord + 1))
    • TF-IDF: TF * IDF
  4. Ranking: Results are sorted by TF-IDF score in descending order
  5. Response: Top results are returned with URLs, titles, descriptions, and relevance scores

Development

Running Tests

go test -v ./...

Code Structure

.
├── main.go           # Application entry point and routing
├── server.go         # HTTP handlers and template execution
├── crawl.go          # Web crawling logic
├── index.go          # Search index interface
├── db_index.go       # Database-backed index implementation
├── memory_index.go   # In-memory index implementation
├── tfidf.go          # TF-IDF calculation and search result ranking
├── extract.go        # HTML content extraction
├── download.go       # HTTP downloading with retry logic
├── clean.go          # URL normalization and validation
├── delay.go          # Crawl delay management
├── redis.go          # Redis client and caching
├── db.go             # Database models and operations
├── structs.go        # Data structures and types
└── stop_words.go     # Stop words filtering

CORS Configuration

In development mode, CORS is enabled for:

  • http://localhost:3000
  • http://127.0.0.1:3000

For production, configure CORS according to your frontend domain requirements.

Performance Considerations

  • Concurrent Crawling: Respects crawl delays while maximizing throughput
  • Redis Caching: Frequently accessed search terms are cached for fast retrieval
  • Database Indexing: Proper indexes on URL and word columns for efficient lookups
  • Stemming: Reduces index size and improves search recall
  • Batch Processing: URLs are processed in batches to optimize database operations

Limitations

  • Crawling is limited to HTML content (does not process PDFs, images, etc.)
  • Only crawls pages within the same host domain
  • Respects robots.txt but does not handle JavaScript-heavy sites
  • Search is currently single-term (no phrase matching or boolean operators)
