Let’s imagine you’re a client managing tons of university course information scattered across numerous text files. You want an efficient way to extract and organize all the key details, like course schedules, professor names, and class codes, and then store them in a data warehouse for easy access. This project does just that—building a pipeline to extract data from text files, process it, and load it into Snowflake, a modern cloud data warehouse.
⚡ Quick Note: While I worked with a small set of sample files (just 4-5), this setup can scale up to handle hundreds or even thousands of text files. It’s built to grow!
The input is a bunch of text files with detailed information about top university courses in a specific region. We extract key information:
- Course code
- Professor name
- Class schedule (day and time)
- Building and room information
PySpark, the Python API for Spark, handles the heavy lifting here. It reads and processes these text files in parallel, extracting relevant details from each.
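To make that concrete, here is a minimal sketch of the extraction step. The input path, the assumed line format, and the parsing logic are illustrative only; the actual python_job.py in the repo may parse the files differently.

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("course-etl").getOrCreate()

# Read every text file in the input directory in parallel, one record per line.
# Assumed line format (illustrative only):
#   "CS101 | Dr. Jane Smith | Mon 10:00-11:30 | Science Hall 204"
lines = spark.sparkContext.textFile("data/courses/*.txt")

def parse_line(line):
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 4:
        return None  # skip malformed lines
    building_room = parts[3].rsplit(" ", 1)
    return Row(
        course_code=parts[0],
        professor=parts[1],
        schedule=parts[2],
        building=building_room[0],
        room=building_room[1] if len(building_room) == 2 else None,
    )

courses_df = lines.map(parse_line).filter(lambda r: r is not None).toDF()
courses_df.show(truncate=False)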
Once we’ve processed the data, it’s time to load it into Snowflake—a super-fast, cloud-based data warehouse. The client can then run queries on this organized data whenever they need it.
The entire setup is containerized using Docker. This means the pipeline is portable, scalable, and easy to run or share across different environments.
- Python: Version 3.10 (for compatibility with PySpark and other tools)
- Apache Spark: Version 3.3.1 (with Hadoop)
- Snowflake: A cloud data warehouse
- Docker: For containerizing the pipeline
- Text files: Your raw course data
- Operating System: Linux-based Docker images for consistency
Text files containing course information are placed in a specific directory, ready to be processed.
PySpark reads the text files in parallel, extracts the relevant information, and structures the data.
The extracted course details (like professor names and schedules) are transformed into a structured format for loading into Snowflake.
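For example, a free-text schedule such as "Mon 10:00-11:30" (an assumed format) can be split into separate day and time columns so it lands in Snowflake as clean, queryable fields. A sketch, continuing from the courses_df built above:

from pyspark.sql import functions as F

# Split the schedule string into day and time columns; drop the raw field.
structured_df = (
    courses_df
    .withColumn("class_day", F.split(F.col("schedule"), " ").getItem(0))
    .withColumn("class_time", F.split(F.col("schedule"), " ").getItem(1))
    .drop("schedule")
)

structured_df.printSchema()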
The transformed data is loaded into Snowflake, where the client can easily query it and integrate it into their scheduling software.
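Here is a sketch of that load step using the Snowflake Spark connector. It continues from structured_df above and relies on the sf_options connection dictionary described in the setup section below; the target table name COURSES is an assumption for illustration.

# Snowflake Spark connector source name
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

(
    structured_df.write
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**sf_options)         # connection settings (see setup below)
    .option("dbtable", "COURSES")  # hypothetical target table
    .mode("overwrite")             # or "append", depending on reload strategy
    .save()
)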
Spark is the heart of this project. It processes and transforms the data in a distributed fashion, ensuring we can scale easily.
PySpark, the Python interface for Spark, is used to write and execute the data transformation logic.
This cloud-based data warehouse stores all the cleaned and structured data, making it easily accessible for querying.
Docker containers ensure the project can run consistently across different environments. The entire pipeline is containerized for scalability.
- pandas: For any additional data manipulation.
- spacy: For text processing and parsing, particularly when extracting course names (see the sketch after this list).
- snowflake-connector-python: To establish communication between Spark and Snowflake.
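Since course descriptions are free text, a small spaCy sketch helps show the idea. The exact parsing logic in the repo may differ; this simply illustrates pulling PERSON entities (for professors) and noun chunks (as candidate course names) with the en_core_web_sm model.

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_candidates(text):
    """Return PERSON entities and noun chunks found in a course description."""
    doc = nlp(text)
    professors = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    noun_phrases = [chunk.text for chunk in doc.noun_chunks]
    return professors, noun_phrases

profs, phrases = extract_candidates(
    "Introduction to Databases is taught by Alice Johnson in Science Hall."
)
print(profs)    # e.g. ['Alice Johnson'] (model-dependent)
print(phrases)  # candidate noun phrases to filter down to course names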
First, clone the repository to your local machine:
git clone https://github.com/evanmathew/ETL-University-Course-Extraction-Using-Spark-Snowflake.git
cd Project
2. Build the Docker Containers using the make command
Once you've navigated to the project directory, use Docker to build the container image:
make build
Run Docker Compose, which will bring up all the services (master, worker, and history server):
make run
This will start the Spark master, Spark worker, and Spark history server. 🚀
In your Snowflake account, create a database and schema for the course information.
- Log in to your Snowflake account.
- Run the following SQL commands:
CREATE DATABASE COURSE_INFORMATION;
CREATE SCHEMA COURSE_INFORMATION.PUBLIC;
CREATE WAREHOUSE COURSE_INFO_WAREHOUSE WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 120 AUTO_RESUME = TRUE;
- You need to add the sf_options dictionary to your python_job.py file so that Spark can connect to your Snowflake database during the ETL process.
sf_options = {
    "sfURL": "https://your_account.snowflakecomputing.com",
    "sfDatabase": "COURSE_INFORMATION",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COURSE_INFO_WAREHOUSE",
    "sfRole": "MY_ROLE",         # role that you have created (generally 'ACCOUNTADMIN')
    "sfUser": "my_username",     # username that you created during account setup
    "sfPassword": "my_password"  # password that you created during account setup
}
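Tip: rather than hardcoding credentials in python_job.py, you may prefer to read them from environment variables and pass them into the container via docker-compose or an .env file. A sketch (the variable names are illustrative, not something the repo defines):

import os

# Build the same dictionary, but pull secrets from the environment.
sf_options = {
    "sfURL": os.environ["SNOWFLAKE_URL"],
    "sfDatabase": "COURSE_INFORMATION",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COURSE_INFO_WAREHOUSE",
    "sfRole": os.environ.get("SNOWFLAKE_ROLE", "ACCOUNTADMIN"),
    "sfUser": os.environ["SNOWFLAKE_USER"],
    "sfPassword": os.environ["SNOWFLAKE_PASSWORD"],
}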
Submit the Spark application we created in ./spark_apps/python_job.py to the Spark master running at http://localhost:9090:
make submit app=python_job.py
- Spark Master UI: http://localhost:9090
- Spark History Server: http://localhost:18080
- Snowflake Database: check the COURSE_INFORMATION database in your Snowflake console to confirm the loaded data.
- Apache Spark Documentation: Official documentation for Apache Spark, covering APIs, architecture, and more. Link: Apache Spark Docs
- Snowflake Connector for Spark: Guide on using the Snowflake Spark Connector for efficient data transfer between Spark and Snowflake. Link: Snowflake Connector
- Docker Documentation: Comprehensive reference for Docker commands and Docker Compose. Link: Docker Docs
- PySpark API Documentation: Reference for PySpark APIs, DataFrame operations, and transformations. Link: PySpark API
- Python 3.10 Documentation: Python language reference and guides. Link: Python Docs
- spaCy: A Python library used for natural language processing. This project utilizes the en_core_web_sm model. Link: spaCy Models
- Spark Deploy in Docker: Link: Medium
- Project Reference: Most of the inspiration for this project comes from this video, and you might find various other ETL projects there that could help you :) Link: YouTube