MetaCrawler is an enterprise-grade, polyglot microservice architecture designed for high-performance data extraction and local intelligence. The platform dynamically selects execution environments based on scraping requirements, ranging from high-concurrency static extraction to complex, browser-based automation.
The platform utilizes a decoupled microservice strategy coordinated through a central API gateway:
- Frontend (Next.js): A management dashboard for job orchestration, system telemetry, and data visualization.
- API Gateway (Node.js/Apollo): A unified GraphQL interface providing a secure entry point for all platform operations.
- Intelligence Layer (Python/FastAPI): Manages Natural Language Processing (NLP) enrichment, including sentiment analysis and named entity recognition, alongside site-aware RAG models.
- Extraction Layer (Go/Colly): High-efficiency engine optimized for rapid static page scraping and concurrent crawling.
- Automation Layer (Node.js/Playwright): Dedicated service for dynamic content rendering and complex JavaScript-heavy interaction.
- Infrastructure: Persistent storage is managed via MongoDB, with Redis serving as the high-throughput message broker for asynchronous task distribution.
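From the gateway side, dispatching a task onto the Redis broker might look like the sketch below; the queue name, payload shape, and client wiring are assumptions for illustration, not the actual implementation.

```typescript
import { createClient } from "redis";

// Illustrative job payload; the real message shape is not documented here.
interface ScrapeJob {
  jobId: string;                    // UUID assigned by the gateway
  url: string;
  jobType: "static" | "dynamic";    // selects the Go or Playwright engine
  params: Record<string, unknown>;
}

const redis = createClient({ url: "redis://localhost:6379" });
await redis.connect();

// Push the job onto a shared list; workers would consume it with BRPOP.
// The queue name "metacrawler:jobs" is an assumption for this sketch.
async function enqueue(job: ScrapeJob): Promise<void> {
  await redis.lPush("metacrawler:jobs", JSON.stringify(job));
}
```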
The system automatically routes extraction tasks to the optimal engine:
- Static Extraction: Leverages Go for maximum throughput and low resource overhead.
- Dynamic Automation: Utilizes Node.js and Playwright for authenticated sessions and client-side rendering.
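A minimal sketch of this routing decision; the selection criteria shown here (a rendering flag and a session flag) are assumptions, since the actual heuristics are not spelled out above.

```typescript
type Engine = "go-colly" | "node-playwright";

// Hypothetical request attributes driving engine selection.
interface ExtractionRequest {
  requiresJsRendering: boolean; // client-side rendered content
  requiresSession: boolean;     // authenticated or stateful interaction
}

function selectEngine(req: ExtractionRequest): Engine {
  // Use browser automation only when a real browser is actually needed;
  // everything else goes to the high-throughput static Go engine.
  if (req.requiresJsRendering || req.requiresSession) {
    return "node-playwright";
  }
  return "go-colly";
}
```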
MetaCrawler enables the construction of local retrieval models from specific web domains. This allows for targeted, secure querying against a crawled corpus without reliance on external third-party LLM providers.
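Through the GraphQL gateway, querying such a local corpus might look like the following; the operation name and fields are hypothetical, since the schema is not shown here.

```typescript
// Hypothetical RAG query; `ragQuery`, its arguments, and its fields are
// assumptions, as the gateway schema is not documented in this README.
const QUERY = `
  query AskCorpus($domain: String!, $question: String!) {
    ragQuery(domain: $domain, question: $question) {
      answer
      sources
    }
  }
`;

const res = await fetch("http://localhost:4000/graphql", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    query: QUERY,
    variables: { domain: "docs.example.com", question: "What are the rate limits?" },
  }),
});
const { data } = await res.json();
console.log(data.ragQuery.answer);
```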
Extracted data is passed through an enrichment pipeline that transforms raw HTML into structured intelligence by identifying key entities and evaluating content sentiment.
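The enriched output could be modeled roughly as below; the field names and score scale are illustrative, not the service's actual schema.

```typescript
// Illustrative shape of an enriched document (not the actual schema).
interface EnrichedDocument {
  jobId: string;
  url: string;
  text: string;                  // cleaned text extracted from raw HTML
  entities: {
    text: string;                // surface form, e.g. "Berlin"
    label: string;               // NER label, e.g. "GPE" or "ORG"
  }[];
  sentiment: {
    polarity: "positive" | "neutral" | "negative";
    score: number;               // confidence in [0, 1] (assumed scale)
  };
}
```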
Recent architectural improvements have solidified the platform for production:
- Canonical Job Lifecycle: Jobs strictly transition through the states `queued → running → done | failed` (the sketch after this list enforces these transitions).
- Idempotent Webhook Orchestration: The API Gateway creates jobs instantly and assigns a UUID. For long-running processes (such as multi-page crawls or ML model runs), background Python Celery workers dispatch status updates directly back to the Node.js API Gateway via a `POST /webhook/celerycallback` route, as sketched below.
- Deterministic Caching: A Redis/Mongo-backed caching layer intercepts duplicate requests automatically using a composite key (URL + Job Type + Execution Params); see the key-derivation sketch after this list.
- Observability: All microservices (Gateway, Node, Python, Go) emit uniform, flat JSON lines to standard output. When debugging locally via Docker Compose, run `docker compose logs -f` and grep for a specific `jobId` to trace the full lifecycle across language boundaries.
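A sketch of the callback route with the lifecycle guard applied, assuming an Express-based gateway; the payload shape, collection names, and responses are illustrative rather than the actual implementation.

```typescript
import express from "express";
import { MongoClient } from "mongodb";

const mongo = new MongoClient("mongodb://localhost:27017");
await mongo.connect();
const jobs = mongo.db("metacrawler").collection("jobs"); // names assumed

// Legal transitions in the canonical lifecycle: queued -> running -> done | failed.
const NEXT: Record<string, string[]> = {
  queued: ["running"],
  running: ["done", "failed"],
};

const app = express();
app.use(express.json());

// Celery workers POST status updates here (route spelled as in this README).
app.post("/webhook/celerycallback", async (req, res) => {
  const { jobId, status } = req.body; // payload shape is an assumption
  const job = await jobs.findOne({ jobId });
  if (!job) return res.status(404).end();
  // Idempotency: re-delivering the same status is acknowledged, not re-applied.
  if (job.status === status) return res.status(200).end();
  // Reject transitions that fall outside the canonical lifecycle.
  if (!NEXT[job.status]?.includes(status)) return res.status(409).end();
  await jobs.updateOne({ jobId }, { $set: { status } });
  return res.status(200).end();
});

app.listen(4000); // gateway port, per the endpoints listed below
```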
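The deterministic composite key might be derived as follows; a sketch only, assuming flat execution params and a SHA-256 digest, since the actual key scheme is not documented.

```typescript
import { createHash } from "node:crypto";

// Composite, deterministic cache key: URL + job type + execution params.
// Keys are sorted so that property order cannot change the digest.
function cacheKey(url: string, jobType: string, params: Record<string, unknown>): string {
  const normalized = Object.keys(params)
    .sort()
    .map((k) => `${k}=${JSON.stringify(params[k])}`)
    .join("&");
  const digest = createHash("sha256")
    .update(`${url}|${jobType}|${normalized}`)
    .digest("hex");
  return `metacrawler:cache:${digest}`;
}

// Example: identical requests map to the same key and hit the cache.
cacheKey("https://example.com", "static", { depth: 2, render: false });
```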
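A single line of the flat JSON-lines output could look like this; apart from `jobId`, the field names are illustrative.

```typescript
// Every service emits one flat JSON object per line to stdout.
console.log(JSON.stringify({
  ts: new Date().toISOString(),
  service: "api-gateway",                        // emitting service
  level: "info",
  jobId: "2f9c1e3a-7b44-4c1d-9a0e-5d6f8a1b2c3d", // grep target across services
  msg: "job transitioned queued -> running",
}));
```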
Prerequisites:
- Docker and Docker Compose
- Node.js (for local development)
- Initialize the environment configuration:

  ```bash
  cp .env.example .env
  ```

- Modify the `.env` file with production-specific variables and service URLs.
The entire stack can be initialized via Docker Compose:
```bash
docker compose up --build
```

- Dashboard: http://localhost:3001
- GraphQL Interface: http://localhost:4000/graphql
Test suites are maintained independently for each microservice:
- Python: `cd backend-python && pytest tests/`
- Go: `cd backend-go && go test ./... -v`
- Node & Gateway: `cd api-gateway && npm test`
Current development is prioritized across the following phases:
- Foundation: Python data layer and multi-database integration.
- Performance: Go-based high-concurrency crawling engine.
- Automation: Playwright integration for dynamic content handling.
- Integration: Unified GraphQL Gateway orchestration.
- Interface: Next.js dashboard and analytical visualization.
- Optimization: Production hardening, volume persistence, and API documentation.
Maintained by @lobo017