
DHMI Monthly Data Pipeline

A data pipeline to automate the scraping and transformation of monthly airport traffic data from the DHMI (State Airports Authority) website. This project uses Apache Airflow to schedule the pipeline, fetches data with requests and BeautifulSoup, and stores the transformed data in a PostgreSQL database.

Table of Contents

  • Overview
  • Project Structure
  • Setup
  • Data Pipeline Flow
  • Database Integration
  • License

Overview

This project automates the following tasks:

  • Scraping: Collects the latest Excel files of airport traffic data from DHMI's website.
  • Data Transformation: Transforms the Excel file into a structured format with columns for airport name, domestic/international flights, category, and date (a sketch of the target shape follows this list).
  • Database Storage: Stores processed data into a PostgreSQL database for analysis and reporting.
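
For context, the transformed data might look like the following. This is a minimal sketch only: the column names are assumptions consistent with the schema described later in this README, and the row values are placeholders, not real DHMI figures.

import pandas as pd

# Illustrative target shape after transformation (names and values are placeholders).
df = pd.DataFrame(
    {
        "airport_name": ["Esenboga"],          # placeholder airport
        "flight_type": ["domestic"],           # domestic / international
        "num_flights": [1000],                 # placeholder figure
        "category": ["aircraft traffic"],      # placeholder category label
        "date": [pd.Timestamp("2024-01-01")],  # month the figures cover
    }
)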

Project Structure

The main components of this project are organized into directories and files as follows:

├── dags
│   └── data_pipeline_dag.py    # Defines the DAG for Apache Airflow to schedule and run the data pipeline
├── src
│   ├── main_scraper.py         # Contains main scraping and data transformation logic
│   ├── models.py               # Defines the database schema and connects to PostgreSQL
│   ├── migrations/             # Contains migration files for schema management (optional)
│   ├── alembic.ini             # Configuration for Alembic migrations
│   └── config.py               # Database connection settings
├── .gitignore
├── Dockerfile                  # Docker configuration for running the app
├── README.md                   # Project documentation
├── docker-compose.yaml         # Docker Compose setup for Airflow, PostgreSQL, and the app
└── requirements.txt            # Python dependencies

Setup

This project uses Docker to simplify the setup process, ensuring all dependencies and services are easily configured and reproducible. Make sure Docker and Docker Compose are installed on your machine.

  1. Clone the Repository

git clone https://github.com/melisacar/monthly-scrape.git
cd monthly-scrape

  2. Build the Docker Containers

The docker-compose.yaml file configures the containers for:
  • Apache Airflow (DAG scheduler)
  • PostgreSQL (database)
  • A custom container to run the scraping script

To build and start the containers:

docker compose up airflow-init

Once initialization completes, or if you have stopped the containers, rebuild and start everything in detached mode:

docker compose up -d --build

  3. Access the Airflow UI

Once Airflow is running, access the Airflow web interface at http://localhost:8080 or http://127.0.0.1:8080. Use the default login (airflow / airflow) to view and manage DAGs.

Data Pipeline Flow

DAG: data_pipeline_dag.py

  • data_pipeline_dag.py defines the Airflow DAG that schedules and runs the main data-processing steps.
  • Schedule Interval: Runs daily at 10:00 AM.
  • Task: Runs run_main_check, which kicks off the scraping and processing of the monthly data (a minimal sketch follows).
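
A minimal sketch of what data_pipeline_dag.py might contain. The task id run_main_check and the daily 10:00 schedule come from this README; the import path and the main_check callable are assumptions.

# dags/data_pipeline_dag.py -- illustrative sketch, not the exact file.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from main_scraper import main_check  # assumed entry point in src/main_scraper.py

with DAG(
    dag_id="data_pipeline_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 10 * * *",  # daily at 10:00 AM, per this README
    catchup=False,
) as dag:
    run_main_check = PythonOperator(
        task_id="run_main_check",
        python_callable=main_check,
    )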

Script: main_scraper.py

main_scraper.py contains the main functions for scraping, processing, and saving data to the PostgreSQL database; a sketch of how they chain together follows the list:

  • fetch_page_content(url): Retrieves the HTML content from DHMI's website.
  • parse_excel_links(html_content): Finds and returns the most recent Excel file link.
  • download_excel_file(href): Downloads the Excel file and returns its content.
  • transform_excel_file(excel_content): Processes the Excel data and reformats it into the desired structure.
  • save_to_database(df, session): Saves the processed data to the database.
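
Chained together, the five functions form a single run along these lines. This is a sketch under assumptions: the exact DHMI page URL is a placeholder, and the main_check wrapper and Session factory names are illustrative.

# Illustrative composition of the functions listed above.
DHMI_URL = "https://www.dhmi.gov.tr/..."  # placeholder; use the actual statistics page URL

def main_check():
    html_content = fetch_page_content(DHMI_URL)   # 1. fetch the page HTML
    href = parse_excel_links(html_content)        # 2. pick the most recent Excel link
    excel_content = download_excel_file(href)     # 3. download the file
    df = transform_excel_file(excel_content)      # 4. reshape into the target columns
    session = Session()                           # Session factory assumed from models.py
    try:
        save_to_database(df, session)             # 5. persist to PostgreSQL
    finally:
        session.close()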

Database Integration

Database Schema:

The database schema, defined in models.py, specifies the structure of the flight_check table, which includes columns for airport name, flight type, number of flights, category, and date.
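
A hedged sketch of what that model could look like. The flight_check table name and the set of fields come from this README; the Python attribute names and column types are assumptions.

# Illustrative SQLAlchemy model; actual names and types may differ.
from sqlalchemy import Column, Date, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class FlightCheck(Base):
    __tablename__ = "flight_check"

    id = Column(Integer, primary_key=True)
    airport_name = Column(String, nullable=False)  # airport name
    flight_type = Column(String)                   # domestic / international
    num_flights = Column(Integer)                  # number of flights
    category = Column(String)                      # data category
    date = Column(Date)                            # month the figures cover
    retrieval_date = Column(Date)                  # data retrieval date (see below)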

Connecting to PostgreSQL

The database connection string in config.py connects to the dhmi-scrape database. The Docker setup in docker-compose.yaml exposes this database to the Airflow DAGs and the scraping script.
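
A sketch of what config.py might define. The dhmi-scrape database name comes from this README; the credentials, host, and port are placeholders that must match the services in docker-compose.yaml.

# config.py -- illustrative; values must match the docker-compose.yaml services.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

DATABASE_URL = "postgresql://user:password@postgres:5432/dhmi-scrape"  # placeholder credentials/host

engine = create_engine(DATABASE_URL)
Session = sessionmaker(bind=engine)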

Storing Processed Data

The save_to_database function commits new rows to the database. Each entry includes the airport name, flight type (domestic/international), the number of flights, category, date, and data retrieval date.
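
Under the model sketched above, the save step could look like this. It is illustrative only; the real implementation may map the DataFrame differently.

from datetime import date

def save_to_database(df, session):
    # Map each DataFrame row to a FlightCheck record (column names assumed).
    for _, row in df.iterrows():
        session.add(FlightCheck(
            airport_name=row["airport_name"],
            flight_type=row["flight_type"],   # domestic / international
            num_flights=row["num_flights"],
            category=row["category"],
            date=row["date"],
            retrieval_date=date.today(),      # data retrieval date
        ))
    session.commit()  # commits new rows, per this README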

License

This project is licensed under the MIT License - see the LICENSE file for details.