A comprehensive web crawling platform built with Python Flask, featuring real-time monitoring, intelligent search, and a responsive web interface.
- Multi-threaded Web Crawler with configurable depth and rate limiting
- Real-time Status Monitoring with live updates and progress tracking
- Advanced Search Engine with relevance ranking and pagination
- Responsive Web Interface for crawler management and data exploration
- File-based Storage with organized data structure
- Pause/Resume/Stop functionality for active crawlers
- Resume from Files capability for interrupted crawlers
- Comprehensive Unit Tests (41 tests with verbose logging)
- SSL Certificate Handling for secure HTTPS crawling
- Rate Limiting & Back-pressure management
- Download Capabilities for logs and queue data
- Python 3.7 or higher
- Web browser (Chrome, Firefox, Safari, Edge)
- Clone the repository:

```bash
git clone https://github.com/ereneld/crawler.git
cd crawler
```

- Install dependencies:

```bash
pip3 install flask flask-cors
```

- Start the API server:

```bash
python3 app.py
```

The server will start on http://localhost:3600.
- Open the web interface by opening any of these files in your web browser:
  - Crawler Dashboard: `demo/crawler.html`
  - Status Monitoring: `demo/status.html`
  - Search Interface: `demo/search.html`
- Crawler Dashboard:
  - Open `demo/crawler.html` in your browser
  - Fill in the crawler parameters:
    - Origin URL: Starting point for crawling (e.g., https://www.wikipedia.org/)
    - Max Depth: How deep to crawl (1-1000)
    - Hit Rate: Requests per second (0.1-1000.0)
    - Queue Capacity: Maximum URLs in queue (100-100000)
    - Max URLs to Visit: Total URLs to crawl (0-10000)
- Click "π·οΈ Start Crawler"
- Monitor progress in real-time
- Real-time Updates: Status page auto-refreshes every 2 seconds
- Download Data: Get logs and queue files as text downloads
- Control Options: Pause, resume, or stop active crawlers
- Statistics: View total URLs visited, words indexed, and active crawlers
- Wait for the crawler to index some content
- Open `demo/search.html`
- Enter search terms and browse paginated results
- Use "I'm Feeling Lucky" for random word discovery
```
# Create a new crawler
POST /crawler/create
{
  "origin": "https://example.com",
  "max_depth": 3,
  "hit_rate": 1.0,
  "max_queue_capacity": 10000,
  "max_urls_to_visit": 100
}
```
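The same request can be sent from Python's standard library. A minimal sketch assuming the server runs on the default port 3600 and returns JSON:

```python
import json
import urllib.request

payload = json.dumps({
    "origin": "https://example.com",
    "max_depth": 3,
    "hit_rate": 1.0,
    "max_queue_capacity": 10000,
    "max_urls_to_visit": 100,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:3600/crawler/create",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read().decode("utf-8")))
```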
```
# Get crawler status
GET /crawler/status/{crawler_id}

# Pause crawler
POST /crawler/pause/{crawler_id}

# Resume crawler
POST /crawler/resume/{crawler_id}

# Stop crawler
POST /crawler/stop/{crawler_id}

# Resume from saved files
POST /crawler/resume-from-files/{crawler_id}

# Get all crawlers
GET /crawler/list

# Get crawler statistics
GET /crawler/stats

# Clear all data
POST /crawler/clear
```

```
# Search indexed content
GET /search?query=python&pageLimit=10&pageOffset=0

# Get random word
GET /search/random
```

```bash
# Test HTML parser (21 tests)
python3 utils/__test__/test_html_parser.py

# Test crawler job (20 tests)
python3 utils/__test__/test_crawler_job.py

# Run all tests
python3 utils/__test__/test_html_parser.py && python3 utils/__test__/test_crawler_job.py
```

```bash
# Create a test crawler
curl -X POST http://localhost:3600/crawler/create \
  -H "Content-Type: application/json" \
  -d '{
    "origin": "https://www.wikipedia.org/",
    "max_depth": 2,
    "max_urls_to_visit": 50
  }'

# Check status (replace {id} with actual crawler ID)
curl http://localhost:3600/crawler/status/{id}

# Search for content
curl "http://localhost:3600/search?query=wikipedia&pageLimit=5"
```
```
crawler/
├── app.py                  # Main Flask API server
├── services/               # Business logic layer
│   ├── crawler_service.py  # Crawler management
│   └── search_service.py   # Search functionality
├── utils/                  # Core utilities
│   ├── crawler_job.py      # Multi-threaded crawler
│   ├── html_parser.py      # HTML parsing
│   └── __test__/           # Unit tests (41 tests)
├── demo/                   # Web interface
│   ├── crawler.html        # Main dashboard
│   ├── status.html         # Status monitoring
│   ├── search.html         # Search interface
│   ├── css/style.css       # Styling
│   └── js/                 # Frontend JavaScript
├── data/                   # Storage (auto-created)
│   ├── visited_urls.data   # Global visited URLs
│   ├── crawlers/           # Crawler status files
│   └── storage/            # Word index files
└── README.md               # This file
```
- Port: 3600
- Hit Rate: 1.0 requests/second
- Max Depth: 1-1000
- Queue Capacity: 100-100000
- Max URLs to Visit: 0-10000
```bash
# Optional: Set custom port
export FLASK_PORT=3600

# Optional: Enable debug mode
export FLASK_ENV=development
```

- Multi-threaded Design: Each crawler runs in its own thread
- Queue Management: URLs are queued with depth tracking
- Rate Limiting: Configurable requests per second
- Back-pressure: Queue capacity limits prevent memory issues
- File Storage: Status, logs, and queue stored separately
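The queue management, rate limiting, and back-pressure described above can be sketched roughly as follows. This illustrates the general pattern only and is not the actual `crawler_job.py` implementation:

```python
import queue
import time

class CrawlFrontier:
    """Illustrative URL frontier with depth tracking, rate limiting, and back-pressure."""

    def __init__(self, hit_rate=1.0, max_queue_capacity=10000, max_depth=3):
        self.delay = 1.0 / hit_rate                           # seconds between requests
        self.urls = queue.Queue(maxsize=max_queue_capacity)   # bounded queue = back-pressure
        self.max_depth = max_depth
        self.visited = set()

    def enqueue(self, url, depth):
        # Skip URLs that are too deep or already seen.
        if depth > self.max_depth or url in self.visited:
            return
        try:
            self.urls.put_nowait((url, depth))
        except queue.Full:
            pass  # back-pressure: drop new URLs once capacity is reached

    def next(self):
        url, depth = self.urls.get()
        self.visited.add(url)
        time.sleep(self.delay)  # rate limiting: at most hit_rate requests per second
        return url, depth
```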
- Word Indexing: Content is tokenized and stored by first letter
- Relevance Scoring: Combines frequency, depth, and match quality
- Optimized Lookup: Key-based search with progressive suffix matching
- Pagination: Results are paginated for better performance
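A simplified sketch of the indexing and scoring ideas above: tokenize content, bucket words by first letter, and combine frequency, depth, and match quality. The real `search_service.py` logic may differ, and simple prefix matching stands in here for the progressive suffix matching mentioned above:

```python
import re
from collections import defaultdict

# index[first_letter][word] -> list of (url, depth, frequency) entries
index = defaultdict(dict)

def index_page(url, depth, text):
    # Tokenize the page and record per-word frequencies under the word's first letter.
    counts = defaultdict(int)
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        counts[word] += 1
    for word, freq in counts.items():
        index[word[0]].setdefault(word, []).append((url, depth, freq))

def score(entry, query, word):
    url, depth, freq = entry
    match_quality = 1.0 if word == query else 0.5   # exact match beats partial match
    return freq * match_quality / (1 + depth)       # shallower pages rank higher

def search(query, page_limit=10, page_offset=0):
    # Key-based lookup: only the bucket for the query's first letter is scanned.
    query = query.lower()
    results = []
    for word, entries in index.get(query[0], {}).items():
        if word.startswith(query):
            results.extend((score(e, query, word), e[0]) for e in entries)
    results.sort(reverse=True)
    return results[page_offset:page_offset + page_limit]
```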
- Visited URLs: `{url} {crawler_id} {timestamp}`
- Word Index: `{word} {relevant_url} {origin_url} {depth} {frequency}`
- Crawler Status: JSON with metadata and timestamps
- Logs: Timestamped log entries
- Queue: `{url} {depth}`, space-separated
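As an example, a word-index line in the format above can be parsed like this (the helper itself is illustrative, not part of the codebase):

```python
def parse_word_index_line(line):
    # Format: "{word} {relevant_url} {origin_url} {depth} {frequency}", space-separated.
    word, relevant_url, origin_url, depth, frequency = line.split()
    return {
        "word": word,
        "relevant_url": relevant_url,
        "origin_url": origin_url,
        "depth": int(depth),
        "frequency": int(frequency),
    }

print(parse_word_index_line("python https://example.com/page https://example.com 2 17"))
```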
- Backend: Extend services in the `services/` directory
- API: Add endpoints in `app.py`
- Frontend: Modify HTML/CSS/JS in the `demo/` directory
- Tests: Add tests in `utils/__test__/`
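Adding an endpoint follows the standard Flask pattern. A minimal hypothetical example; the `/crawler/ping` route is a placeholder, not part of the existing API:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/crawler/ping", methods=["GET"])  # hypothetical endpoint for illustration
def ping():
    # Return a small JSON payload; real endpoints would delegate to the services/ layer.
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(port=3600)
```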
- Python: Follow PEP 8 guidelines
- JavaScript: Use modern ES6+ features
- HTML/CSS: Responsive, accessible design
- Documentation: Comprehensive docstrings and comments
Port already in use:

```bash
lsof -i :3600
kill -9 {PID}
```

Permission errors:

```bash
chmod -R 755 data/
```

SSL certificate issues:

- The crawler handles SSL issues automatically with a fallback (see the sketch below)
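Such a fallback typically looks like the following standard-library sketch; this is illustrative only and may not match the project's actual handling:

```python
import ssl
import urllib.request

def fetch(url, timeout=10):
    try:
        # First attempt: normal certificate verification.
        return urllib.request.urlopen(url, timeout=timeout).read()
    except ssl.SSLError:
        # Fallback: retry without verification so crawling can continue.
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        return urllib.request.urlopen(url, timeout=timeout, context=ctx).read()
```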
Memory usage:

- Adjust `max_queue_capacity` and `max_urls_to_visit` for large crawls
```bash
# Enable verbose logging
FLASK_ENV=development python3 app.py

# Monitor data files
watch -n 2 'ls -la data/crawlers/ && echo "=== Storage ===" && ls -la data/storage/'
```

- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Make your changes and add tests
- Commit: `git commit -m 'Add amazing feature'`
- Push: `git push origin feature/amazing-feature`
- Open a Pull Request
- Write unit tests for new functionality
- Ensure all existing tests pass
- Test both success and error cases
- Use mocking for external dependencies
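For example, a new test that covers both the success and the error path with a mocked external dependency might look like this; the helper function and test names are hypothetical and not part of the existing 41 tests:

```python
import unittest
from unittest import mock


def fetch_title(opener, url):
    # Hypothetical helper used only for this example: fetch a page and upper-case it.
    return opener(url).upper()


class FetchTitleTest(unittest.TestCase):
    def test_success_case(self):
        # Mock the external opener so the test never touches the network.
        fake_opener = mock.Mock(return_value="hello")
        self.assertEqual(fetch_title(fake_opener, "https://example.com"), "HELLO")

    def test_error_case(self):
        # The mocked opener simulates a network failure.
        fake_opener = mock.Mock(side_effect=IOError("network down"))
        with self.assertRaises(IOError):
            fetch_title(fake_opener, "https://example.com")


if __name__ == "__main__":
    unittest.main(verbosity=2)
```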
This project is licensed under the MIT License - see the LICENSE file for details.
- Built for technical assessment
- Uses native Python libraries for core functionality
- Responsive design inspired by modern web standards
- Test coverage ensures reliability and maintainability
For questions or issues:
- Check the Issues page
- Create a new issue with detailed description
- Include system info, error messages, and steps to reproduce
Happy Crawling! 🕷️