This repository is a portfolio of the projects I completed for Udacity's Data Engineer Nanodegree, for which I earned certification in 2021. The projects cover a range of skills, including designing data models, building data warehouses and data lakes, automating data pipelines, and working with large datasets.
This project explores fundamental data modeling concepts using PostgreSQL. We design and create a database schema, then populate the database with optimized queries for Sparkify, a fictional music streaming app.
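As a rough sketch of what the schema-design step looks like, the DDL can be kept as Python strings and executed one by one with a database driver such as psycopg2. The table and column names below are illustrative assumptions based on a typical music-streaming star schema, not necessarily the project's exact layout:

```python
# Illustrative DDL for a Sparkify-style schema (table/column names are
# assumptions). With psycopg2 and a running PostgreSQL instance, each
# statement would be passed to cursor.execute(); here we only define them.

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    level       VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     CHAR(1),
    level      VARCHAR
);
"""

# Collected in one list so an ETL script can loop over them at setup time.
create_table_queries = [songplay_table_create, user_table_create]
```

Keeping the queries in a single module separates schema definition from pipeline logic, which is a common convention in this kind of project.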
In this project, we build an ETL pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for Sparkify's analytics team. The project provides hands-on experience with implementing a cloud data warehouse.
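The staging step can be sketched as a Redshift `COPY` command that bulk-loads JSON files from S3 into a staging table. The bucket paths, IAM role ARN, and region below are placeholder assumptions, not the project's actual configuration:

```python
# Sketch of the S3 -> Redshift staging step. Redshift's COPY command loads
# files from S3 in parallel; the values filled in below are placeholders.

staging_events_copy = """
COPY staging_events
FROM '{s3_path}'
IAM_ROLE '{iam_role}'
REGION '{region}'
FORMAT AS JSON '{json_path}';
"""

query = staging_events_copy.format(
    s3_path="s3://example-bucket/log_data",                      # placeholder
    iam_role="arn:aws:iam::123456789012:role/redshift-s3-read",  # placeholder
    region="us-west-2",                                          # placeholder
    json_path="s3://example-bucket/log_json_path.json",          # placeholder
)
# In the pipeline, `query` would be executed against the Redshift cluster
# (e.g. via psycopg2) before the INSERT ... SELECT transforms run.
```

Loading into staging tables first lets the transform step run entirely inside the warehouse as set-based SQL.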
This project focuses on building data lakes with Apache Spark. We build an ETL pipeline that extracts data from S3, processes it with Spark, and loads the processed data back into S3. The project highlights working with big data from different sources and in different formats.
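One transformation this kind of pipeline performs is deriving a time dimension from the logs' millisecond epoch timestamps. Below is a minimal pure-Python sketch of that logic; in the pipeline itself it would run as a Spark DataFrame operation, and the column names are assumptions:

```python
# Deriving time-dimension fields from a millisecond epoch timestamp,
# as the Spark job would do per log record (pure-Python sketch).
from datetime import datetime, timezone

def expand_timestamp(ts_ms: int) -> dict:
    """Break a millisecond epoch timestamp into time-dimension columns."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": dt.isoformat(),
        "hour": dt.hour,
        "day": dt.day,
        "week": dt.isocalendar()[1],  # ISO week number
        "month": dt.month,
        "year": dt.year,
        "weekday": dt.weekday(),
    }

# Example: a timestamp from November 2018
row = expand_timestamp(1541121934796)
```

Writing the resulting table back to S3 partitioned by `year` and `month` (Spark's `partitionBy`) keeps later reads cheap.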
We dive into automated data pipelines using Apache Airflow. By scheduling and monitoring the pipelines, we ensure high data quality for analytics and consistent data availability. The pipeline also extracts source data from S3 into Redshift.
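The data-quality step can be sketched as the check a custom Airflow operator might run after each load: query the table's row count and fail the task if nothing arrived. The function name and the records format are illustrative assumptions, not the project's exact code:

```python
# Sketch of a row-count data-quality check, as an Airflow operator's
# execute() method might run it after loading a table (names are assumptions).

def check_has_rows(table: str, records: list) -> None:
    """Fail the task if the count query returned no result or zero rows."""
    if not records or not records[0]:
        raise ValueError(f"Data quality check failed: {table} returned no results")
    if records[0][0] < 1:
        raise ValueError(f"Data quality check failed: {table} contains 0 rows")

# In Airflow, `records` would come from something like
# PostgresHook(postgres_conn_id=...).get_records(f"SELECT COUNT(*) FROM {table}")
check_has_rows("songplays", [(42,)])  # a non-empty table passes the check
```

Raising an exception is what marks the Airflow task as failed, so retries and alerting follow from the DAG's normal configuration.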
The Capstone project integrates the skills learned throughout the nanodegree. We construct an ETL pipeline to analyze US immigration data. We use Apache Spark to handle large datasets, enabling comprehensive analysis of migration patterns.
Feel free to explore the repository, clone projects, and get hands-on experience with real-world Data Engineering scenarios. Your feedback is always welcome.