A high-performance web search engine built in Go, featuring intelligent web crawling, TF-IDF based relevance ranking, and a RESTful API. The engine respects robots.txt policies and supports distributed caching with Redis.
- Intelligent Web Crawler: Automatically crawls websites while respecting robots.txt policies and crawl delays
- TF-IDF Ranking: Ranks search results by relevance using the Term Frequency-Inverse Document Frequency algorithm
- Stemming Support: Implements Snowball stemming algorithm for better search matching
- Distributed Caching: Redis integration for high-performance search operations
- Multiple Database Support: Compatible with MySQL, SQLite, and SQL Server via GORM
- RESTful API: Clean HTTP endpoints for searching and crawling operations
- CORS Support: Configurable cross-origin resource sharing for development
- Concurrent Processing: Efficient parallel crawling and indexing
The search engine consists of several key components:
- Crawler: Fetches and processes web pages concurrently
- Indexer: Builds and maintains the search index with word frequencies
- TF-IDF Calculator: Computes relevance scores for search results
- Database Layer: Persistent storage for indexed documents
- Redis Cache: Fast access to frequently searched terms
- API Server: Gin-based HTTP server handling search and crawl requests
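The component list maps onto the repository layout: `index.go` declares the search index interface, with database-backed and in-memory implementations in `db_index.go` and `memory_index.go`. The interface and method names below are illustrative assumptions, not the repository's actual definitions; a minimal sketch of the pattern:

```go
package main

import "strings"

// Index abstracts over the in-memory and database-backed implementations.
// Method names here are illustrative, not the repo's actual API.
type Index interface {
	Add(url string, words []string)    // record word occurrences for a page
	Search(term string) map[string]int // url -> term frequency
}

// memoryIndex is a toy in-memory implementation: word -> url -> count.
type memoryIndex struct {
	entries map[string]map[string]int
}

// Compile-time check that memoryIndex satisfies Index.
var _ Index = (*memoryIndex)(nil)

func newMemoryIndex() *memoryIndex {
	return &memoryIndex{entries: make(map[string]map[string]int)}
}

func (m *memoryIndex) Add(url string, words []string) {
	for _, w := range words {
		w = strings.ToLower(w)
		if m.entries[w] == nil {
			m.entries[w] = make(map[string]int)
		}
		m.entries[w][url]++
	}
}

func (m *memoryIndex) Search(term string) map[string]int {
	return m.entries[strings.ToLower(term)]
}
```

Keeping both implementations behind one interface lets tests run against the fast in-memory index while production uses the database-backed one.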
- Go 1.23.0 or higher
- Redis server
- Database server (MySQL, SQLite, or SQL Server)
- Docker (optional, for containerized deployment)
- Clone the repository:
```
git clone <repository-url>
cd search
```
- Install dependencies:
```
go mod download
```
- Set up environment variables (see Configuration section)
Create a .env file in the project root:
```
cp .example.env .env
```

Then set the following values in `.env`:

```
MYSQL_USERNAME=<mysql-username>
MYSQL_PASSWORD=<mysql-password>
MYSQL_HOST=<mysql-host:port>
MYSQL_DATABASE=<database-name>
REDIS_HOST=<redis-host:port>
REDIS_PASSWORD=<redis-password>
ENV=development
```

| Variable | Description | Required | Default |
|---|---|---|---|
| MYSQL_USERNAME | MySQL username | Yes | - |
| MYSQL_PASSWORD | MySQL password | Yes | - |
| MYSQL_HOST | MySQL host and port | No | localhost:3306 |
| MYSQL_DATABASE | MySQL database name | Yes | - |
| REDIS_HOST | Redis server host and port | Yes | - |
| REDIS_PASSWORD | Redis authentication password | Yes | - |
| ENV | Environment mode (development/production) | No | - |
Note: The application automatically constructs the MySQL DSN from the provided credentials in the format: `username:password@tcp(host:port)/database?charset=utf8mb4&parseTime=True&loc=Local`
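The DSN assembly described in the note can be sketched as follows; the function name and its use of plain string formatting are illustrative assumptions, not the project's actual code:

```go
package main

import "fmt"

// buildDSN assembles a GORM-compatible MySQL DSN from the individual
// environment values, matching the format described above.
// This is an illustrative sketch, not the repository's actual function.
func buildDSN(user, password, hostPort, database string) string {
	return fmt.Sprintf("%s:%s@tcp(%s)/%s?charset=utf8mb4&parseTime=True&loc=Local",
		user, password, hostPort, database)
}
```

The `parseTime=True` parameter makes the MySQL driver scan DATE/DATETIME columns into `time.Time`, which GORM relies on for timestamp fields.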
```
ENV=development go run .
```
The server will start on http://localhost:8080
```
docker compose up
```

- Build the Docker image:

```
docker build -t search-engine .
```

- Run the container:

```
docker run -p 8080:8080 \
  -e MYSQL_USERNAME="your-mysql-username" \
  -e MYSQL_PASSWORD="your-mysql-password" \
  -e MYSQL_HOST="mysql:3306" \
  -e MYSQL_DATABASE="your-database-name" \
  -e REDIS_HOST="redis:6379" \
  -e REDIS_PASSWORD="your-redis-password" \
  search-engine
```

Search the indexed documents for a given term.
Endpoint: POST /search
Request Body:
```json
{
  "SearchTerm": "golang programming"
}
```

Response (200 OK):
```json
{
  "HITS": [
    {
      "URL": "https://example.com/golang-guide",
      "TITLE": "Complete Guide to Golang",
      "DESCRIPTION": "Learn Go programming from basics to advanced",
      "TFIDF": 0.89
    }
  ],
  "TERM": "golang programming"
}
```

Error Response (404):
```json
{
  "error": "No results found"
}
```

Initiate a web crawl for a specified host.
Endpoint: POST /crawl
Request Body:
```json
{
  "Host": "https://example.com"
}
```

Response (200 OK):
```json
{
  "success": "true",
  "host": "https://example.com"
}
```

Note: Crawling runs asynchronously in the background. Results will be indexed and available for search as pages are processed.
Serves locally stored documents from the corpus (primarily for testing).
Endpoint: GET /documents/top10/*any
Example: GET /documents/top10/index.html
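The project serves this route through Gin; with only the standard library, an equivalent wildcard file route could look like the sketch below (handler name and corpus directory are assumptions taken from the route, not the actual implementation):

```go
package main

import "net/http"

// documentsHandler serves files from a local corpus directory under
// /documents/top10/, mirroring the wildcard route described above.
// The directory name "top10" is inferred from the route path.
func documentsHandler() http.Handler {
	fs := http.FileServer(http.Dir("top10"))
	return http.StripPrefix("/documents/top10/", fs)
}
```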
- Initialization: The crawler fetches and parses `robots.txt` to determine crawl policies
- Queue Management: URLs are queued and deduplicated using a hash set
- Concurrent Downloads: Multiple pages are downloaded in parallel with respect to crawl delays
- Content Extraction: HTML is parsed to extract text content, links, titles, and descriptions
- Stemming: Words are stemmed using the Snowball algorithm for better matching
- Indexing: Word frequencies and metadata are stored in the database and Redis cache
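The queue-management step above can be sketched with a hash set for deduplication; the type and method names are illustrative, not the repository's actual code:

```go
package main

// urlQueue queues URLs while deduplicating with a hash set,
// as in the crawl pipeline's queue-management step.
// Names are illustrative; the real crawler's structures may differ.
type urlQueue struct {
	seen    map[string]struct{}
	pending []string
}

func newURLQueue() *urlQueue {
	return &urlQueue{seen: make(map[string]struct{})}
}

// Enqueue adds a URL only if it has not been seen before;
// it reports whether the URL was actually queued.
func (q *urlQueue) Enqueue(url string) bool {
	if _, ok := q.seen[url]; ok {
		return false
	}
	q.seen[url] = struct{}{}
	q.pending = append(q.pending, url)
	return true
}

// Dequeue pops the next URL, returning false when the queue is empty.
func (q *urlQueue) Dequeue() (string, bool) {
	if len(q.pending) == 0 {
		return "", false
	}
	url := q.pending[0]
	q.pending = q.pending[1:]
	return url, true
}
```

Because `seen` retains every URL ever enqueued, a page linked from many others is still downloaded only once.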
- Query Processing: Search terms are stemmed to match the indexed format
- Frequency Lookup: Term frequencies are retrieved from the index
- TF-IDF Calculation:
  - TF (Term Frequency): `termCount / totalWords`
  - IDF (Inverse Document Frequency): `log10(numDocs / (docsContainingWord + 1))`
  - TF-IDF: `TF * IDF`
- Ranking: Results are sorted by TF-IDF score in descending order
- Response: Top results are returned with URLs, titles, descriptions, and relevance scores
```
go test -v ./...
```
```
├── main.go          # Application entry point and routing
├── server.go        # HTTP handlers and template execution
├── crawl.go         # Web crawling logic
├── index.go         # Search index interface
├── db_index.go      # Database-backed index implementation
├── memory_index.go  # In-memory index implementation
├── tfidf.go         # TF-IDF calculation and search result ranking
├── extract.go       # HTML content extraction
├── download.go      # HTTP downloading with retry logic
├── clean.go         # URL normalization and validation
├── delay.go         # Crawl delay management
├── redis.go         # Redis client and caching
├── db.go            # Database models and operations
├── structs.go       # Data structures and types
└── stop_words.go    # Stop words filtering
```
In development mode, CORS is enabled for:
- http://localhost:3000
- http://127.0.0.1:3000
For production, configure CORS according to your frontend domain requirements.
- Concurrent Crawling: Respects crawl delays while maximizing throughput
- Redis Caching: Frequently accessed search terms are cached for fast retrieval
- Database Indexing: Proper indexes on URL and word columns for efficient lookups
- Stemming: Reduces index size and improves search recall
- Batch Processing: URLs are processed in batches to optimize database operations
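The batch-processing step above amounts to chunking the URL list before writing; a sketch (the helper name and batch shape are illustrative):

```go
package main

// chunk splits urls into batches of at most n items, as in the
// batch-processing step that groups database writes.
// This is an illustrative helper, not the repository's actual code.
func chunk(urls []string, n int) [][]string {
	var batches [][]string
	for len(urls) > 0 {
		end := n
		if len(urls) < n {
			end = len(urls)
		}
		batches = append(batches, urls[:end])
		urls = urls[end:]
	}
	return batches
}
```

Each batch can then be written in a single multi-row insert, trading per-row round trips for one larger statement.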
- Crawling is limited to HTML content (does not process PDFs, images, etc.)
- Only crawls pages within the same host domain
- Respects robots.txt but does not handle JavaScript-heavy sites
- Search is currently single-term (no phrase matching or boolean operators)