This roadmap guides you through building MetaCrawler from scratch. It is broken down into Phases and Sessions; each session is a manageable chunk of work (approx. 2-4 hours) with clear goals and deliverables.
Goal: Establish the data layer and build the "Brain" of the operation.
- Goals:
  - Get Docker Compose running with Mongo, Postgres, and Redis.
  - Connect to the databases using a GUI (Compass/PgAdmin) to verify.
  - Create the basic Python virtual environment.
- Deliverable: `docker-compose up` runs without errors; the Python service can connect to Redis (sketched below).
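A minimal connectivity check for this deliverable might look like the following, assuming Redis is exposed on `localhost:6379` (the compose default) and `redis-py` is installed in the virtual environment:

```python
# check_connections.py — a quick smoke test, not part of the app itself.
import redis

def check_redis(host: str = "localhost", port: int = 6379) -> bool:
    """Return True if the Redis server answers a PING."""
    client = redis.Redis(host=host, port=port, socket_connect_timeout=2)
    return client.ping()

if __name__ == "__main__":
    print("Redis reachable:", check_redis())
```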
- Goals:
  - Implement `scrape_url` in `backend-python/app/scrapers/basic_scraper.py`.
  - Use `requests` to fetch HTML and `BeautifulSoup` to parse title/text.
  - Create a simple API endpoint in `main.py` to trigger this function.
- Deliverable: `POST /scrape/quick` returns the title and text of a given URL.
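A rough sketch of what this session could produce, condensed into a single file for illustration (the roadmap splits it across `basic_scraper.py` and `main.py`):

```python
import requests
from bs4 import BeautifulSoup
from fastapi import FastAPI

app = FastAPI()

def scrape_url(url: str) -> dict:
    """Fetch a page and return its title and visible text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    return {
        "url": url,
        "title": title,
        "text": soup.get_text(separator=" ", strip=True),
    }

@app.post("/scrape/quick")
def scrape_quick(url: str):
    # FastAPI treats the bare `url` parameter as a query parameter here.
    return scrape_url(url)
```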
- Goals:
  - Install `spacy` or `nltk`.
  - Implement `analyze_text` in `backend-python/app/nlp/processor.py`.
  - Return a sentiment score and named entities.
- Deliverable: `POST /analyze` accepts text and returns JSON with sentiment/entities.
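One way `analyze_text` could combine the two suggested libraries: spaCy for named entities and NLTK's VADER for the sentiment score. This is just one possible split, and both models must be downloaded first:

```python
# processor.py — a sketch. Requires:
#   python -m spacy download en_core_web_sm
#   python -c "import nltk; nltk.download('vader_lexicon')"
import spacy
from nltk.sentiment import SentimentIntensityAnalyzer

nlp = spacy.load("en_core_web_sm")
sia = SentimentIntensityAnalyzer()

def analyze_text(text: str) -> dict:
    doc = nlp(text)
    return {
        "sentiment": sia.polarity_scores(text)["compound"],  # -1.0 .. 1.0
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
    }
```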
- Goals:
  - Configure Celery in `celery_worker.py`.
  - Move the scraping/NLP logic into a Celery task.
  - Trigger tasks from the FastAPI endpoints.
- Deliverable: Hitting the API returns a Task ID immediately; the result appears in the logs/database later.
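A minimal `celery_worker.py` sketch, assuming Redis doubles as broker and result backend and that the compose service is named `redis`:

```python
from celery import Celery

celery_app = Celery(
    "metacrawler",
    broker="redis://redis:6379/0",
    backend="redis://redis:6379/1",
)

@celery_app.task
def scrape_task(url: str) -> dict:
    # Reuse the Session 2 scraper inside the worker process.
    from app.scrapers.basic_scraper import scrape_url
    return scrape_url(url)
```

The FastAPI endpoint then just enqueues and returns immediately: `task = scrape_task.delay(url)` followed by `return {"task_id": task.id}`.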
Goal: Build the service responsible for speed and scale.
- Goals:
  - Initialize the Go module.
  - Set up a basic HTTP server (using `chi` or `gin`) in `cmd/server/main.go`.
  - Create a health check endpoint.
- Deliverable: `GET /health` returns 200 OK from the Go container.
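A sketch of `cmd/server/main.go` using `chi`, one of the two routers suggested above (the port is an assumption):

```go
package main

import (
	"log"
	"net/http"

	"github.com/go-chi/chi/v5"
)

func main() {
	r := chi.NewRouter()
	// Health check: the deliverable for this session.
	r.Get("/health", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("OK"))
	})
	log.Fatal(http.ListenAndServe(":8080", r))
}
```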
- Goals:
  - Implement the scraping logic in `internal/scraper/engine.go` using `colly`.
  - Handle basic HTML parsing.
  - Add a "worker pool" concept (limit concurrency).
- Deliverable: A function that takes a URL and returns raw HTML, running efficiently.
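A possible shape for the engine: `colly`'s `LimitRule` stands in for the worker pool by capping how many requests run in parallel (the cap applies when the collector runs asynchronously):

```go
package scraper

import "github.com/gocolly/colly/v2"

// FetchHTML visits one URL and returns the raw response body.
func FetchHTML(url string) ([]byte, error) {
	c := colly.NewCollector(colly.Async(true))
	// The roadmap's "worker pool": at most 4 requests in flight per domain.
	if err := c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 4}); err != nil {
		return nil, err
	}

	var body []byte
	c.OnResponse(func(r *colly.Response) { body = r.Body })

	if err := c.Visit(url); err != nil {
		return nil, err
	}
	c.Wait() // block until the async fetch completes
	return body, nil
}
```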
- Goals:
  - Implement `internal/queue/consumer.go` to listen to Redis/RabbitMQ.
  - When a message arrives, trigger the Colly scraper.
- Deliverable: Publishing a message to Redis manually triggers the Go scraper.
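A sketch of the consumer using go-redis pub/sub; the channel name `scrape_jobs` and the `redis` hostname are assumptions:

```go
package queue

import (
	"context"
	"log"

	"github.com/redis/go-redis/v9"
)

// Consume subscribes to the job channel and invokes handle for each message.
func Consume(ctx context.Context, handle func(url string)) {
	rdb := redis.NewClient(&redis.Options{Addr: "redis:6379"})
	sub := rdb.Subscribe(ctx, "scrape_jobs")
	defer sub.Close()

	for msg := range sub.Channel() {
		log.Printf("job received: %s", msg.Payload)
		handle(msg.Payload) // hand off to the Session 2 colly engine
	}
}
```

For the deliverable test, `redis-cli PUBLISH scrape_jobs https://example.com` should make the handler fire.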
Goal: Handle complex, JavaScript-heavy sites.
- Goals:
  - Set up `backend-node/src/index.ts` with Express.
  - Install Playwright/Puppeteer.
  - Create a function to launch a browser, go to a page, and take a screenshot.
- Deliverable: `POST /scrape` saves a screenshot of the target website.
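A condensed `index.ts` sketch using Playwright (the port and screenshot path are assumptions):

```typescript
import express from "express";
import { chromium } from "playwright";

const app = express();
app.use(express.json());

app.post("/scrape", async (req, res) => {
  const { url } = req.body;
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle" });
    await page.screenshot({ path: "screenshot.png", fullPage: true });
    res.json({ ok: true, screenshot: "screenshot.png" });
  } finally {
    await browser.close();
  }
});

app.listen(3000, () => console.log("node scraper on :3000"));
```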
- Goals:
  - Handle infinite scrolling or button clicks.
  - Extract dynamic content (e.g., React-rendered text).
  - Return the data as JSON.
- Deliverable: Successfully scrape a site like Twitter or LinkedIn (public pages) that requires JS.
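Infinite scroll is usually just "scroll, wait, repeat" before extracting. A sketch, where the selector and iteration count are placeholders to tune per target site:

```typescript
import { Page } from "playwright";

export async function scrapeDynamic(page: Page): Promise<string[]> {
  // Scroll a few times to trigger lazy-loaded / React-rendered content.
  for (let i = 0; i < 5; i++) {
    await page.mouse.wheel(0, 2000);
    await page.waitForTimeout(1000); // crude; waiting on a selector is more robust
  }
  // Pull text out of the rendered nodes.
  return page.$$eval("[data-testid='post']", (nodes) =>
    nodes.map((n) => n.textContent ?? "")
  );
}
```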
Goal: Create a single entry point for the frontend.
- Goals:
  - Define the GraphQL schema (Job, Result, Stats).
  - Implement resolvers in `api-gateway/resolvers.js` that call the Python/Go/Node APIs.
- Deliverable: A GraphQL query `query { jobs { id status } }` fetches data from the microservices.
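A sketch of the gateway using Apollo Server; the backend URL and its `/jobs` endpoint are assumptions about how the Python service exposes job state:

```typescript
import { ApolloServer, gql } from "apollo-server";

const typeDefs = gql`
  type Job {
    id: ID!
    status: String!
    url: String
  }
  type Query {
    jobs: [Job!]!
  }
`;

const resolvers = {
  Query: {
    jobs: async () => {
      const res = await fetch("http://backend-python:8000/jobs");
      return res.json(); // expected: [{ id, status, url }, ...]
    },
  },
};

new ApolloServer({ typeDefs, resolvers })
  .listen({ port: 4000 })
  .then(({ url }) => console.log(`gateway ready at ${url}`));
```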
- Goals:
  - Create a `createJob(url, type)` mutation that decides which service to call.
  - Standardize the response format across all services.
- Deliverable: You can submit a job via GraphQL Playground and see it processed by the correct service.
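The routing decision can be a simple lookup from job type to service. Everything here (the type names, service URLs, and the standardized `{ id, status, url }` shape) is an assumption to standardize on:

```typescript
const SERVICE_FOR_TYPE: Record<string, string> = {
  quick: "http://backend-python:8000/scrape/quick", // requests/BeautifulSoup
  bulk: "http://backend-go:8080/scrape",            // colly engine
  dynamic: "http://backend-node:3000/scrape",       // Playwright
};

export async function createJob(
  _parent: unknown,
  args: { url: string; type: string }
) {
  const endpoint = SERVICE_FOR_TYPE[args.type] ?? SERVICE_FOR_TYPE.quick;
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url: args.url }),
  });
  return res.json(); // every service replies with { id, status, url }
}
```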
Goal: Visualize the data and control the system.
- Goals:
  - Build the `JobController` component to call your GraphQL mutation.
  - Display a list of recent jobs.
- Deliverable: A UI where you can type a URL, click "Scrape", and see it appear in a list.
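A bare-bones `JobController` sketch with Apollo Client's `useMutation`; styling and the job-list rendering are left out:

```tsx
import { useState } from "react";
import { gql, useMutation } from "@apollo/client";

// Mirrors the createJob(url, type) mutation defined at the gateway.
const CREATE_JOB = gql`
  mutation CreateJob($url: String!, $type: String!) {
    createJob(url: $url, type: $type) {
      id
      status
    }
  }
`;

export function JobController() {
  const [url, setUrl] = useState("");
  const [createJob] = useMutation(CREATE_JOB);

  return (
    <div>
      <input value={url} onChange={(e) => setUrl(e.target.value)} />
      <button onClick={() => createJob({ variables: { url, type: "quick" } })}>
        Scrape
      </button>
    </div>
  );
}
```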
- Goals:
  - Implement polling or WebSockets (optional) to update job status.
  - Build the `AnalyticsChart` to show dummy or real data.
- Deliverable: The dashboard updates automatically when a job finishes.
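The simpler of the two options is polling: Apollo's `pollInterval` re-runs the query on a timer, so finished jobs appear without any push infrastructure. A sketch:

```typescript
import { gql, useQuery } from "@apollo/client";

const JOBS_QUERY = gql`
  query {
    jobs {
      id
      status
    }
  }
`;

// Re-fetches every 3 seconds, so the dashboard picks up status changes.
export function useJobs() {
  return useQuery(JOBS_QUERY, { pollInterval: 3000 });
}
```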
Goal: Make it production-ready.
- Goals:
  - Ensure all containers talk to each other via Docker networks.
  - Persist database data with volumes.
- Deliverable: `docker-compose down` followed by `up` retains all your data.
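A compose excerpt showing named volumes (the image tags and volume names are assumptions); note that Compose already puts all services on a shared default network, which covers the first goal:

```yaml
services:
  postgres:
    image: postgres:16
    volumes:
      - pgdata:/var/lib/postgresql/data   # Postgres data directory
  mongo:
    image: mongo:7
    volumes:
      - mongodata:/data/db                # Mongo data directory

volumes:
  pgdata:
  mongodata:
```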
- Goals:
  - Write a `README.md` explaining how to run the project.
  - Add comments to complex code blocks.
- Deliverable: A portfolio-ready GitHub repository.