In this project I, as a data engineer, developed a data warehouse for an imaginary startup called Sparkify. According to the project premise, the startup wants to move its data and processes to AWS. My task was to develop an ETL pipeline that collects data from S3 and moves it to a Redshift cluster. Inside the cluster, I organised the raw data into data models following a star schema design.
I worked with two datasets that reside in S3:

- Song dataset. A subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
- Log dataset. Log files in JSON format generated by an event simulator based on the songs in the dataset above. Sample records from both datasets appear after this list.
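For orientation, here is roughly what a single record from each dataset looks like. The field names follow the two JSON formats described above; the values are illustrative placeholders, not taken from the actual files.

A song record:

```json
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null,
 "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud",
 "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff",
 "duration": 152.92036, "year": 0}
```

A log record (the page field is what the fact table below filters on):

```json
{"artist": "Des'ree", "auth": "Logged In", "firstName": "Kaylee", "gender": "F",
 "itemInSession": 1, "lastName": "Summers", "length": 246.30812, "level": "free",
 "location": "Phoenix-Mesa-Scottsdale, AZ", "method": "PUT", "page": "NextSong",
 "registration": 1540344794796.0, "sessionId": 139, "song": "You Gotta Be",
 "status": 200, "ts": 1541106106796, "userAgent": "Mozilla/5.0 ...", "userId": "8"}
```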
The repository contains the following files:

- dwh.cfg. The config file that holds all the Redshift database information the scripts need; a sketch of its layout appears after this list.
- create_delete_cluster.py. The script has two main functions: creating and deleting a Redshift cluster. When run, it first asks the user whether they want to create or delete a cluster. If the user chooses 'create', the script also saves the newly created IAM role and the endpoint address to dwh.cfg.
- sql_queries.py. The Python file that contains all the SQL queries needed for the drop, create, and insert operations.
- create_tables.py. The script connects to the Redshift cluster and creates both the staging tables used for copying data from S3 and the tables of the star schema.
- etl.py. The script connects to the Redshift cluster, copies data from S3 into the staging tables, and then inserts values into the tables created by the previous script. Sketches of the key pieces follow this list.
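A rough sketch of how dwh.cfg might be laid out (the section and key names here are assumptions inferred from the scripts above; all values are placeholders):

```ini
[AWS]
KEY    = <your-aws-access-key-id>
SECRET = <your-aws-secret-access-key>

[CLUSTER]
HOST        =                ; endpoint, filled in by create_delete_cluster.py
DB_NAME     = sparkifydb
DB_USER     = <db-user>
DB_PASSWORD = <db-password>
DB_PORT     = 5439

[IAM_ROLE]
ARN =                        ; filled in by create_delete_cluster.py

[S3]
LOG_DATA     = 's3://<bucket>/log_data'
LOG_JSONPATH = 's3://<bucket>/log_json_path.json'
SONG_DATA    = 's3://<bucket>/song_data'
```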
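A minimal sketch of how create_delete_cluster.py can create and delete the cluster with boto3 (the cluster identifier, region, node type, and node count are illustrative placeholders, not the project's actual settings):

```python
import configparser

import boto3

config = configparser.ConfigParser()
config.read('dwh.cfg')

redshift = boto3.client(
    'redshift',
    region_name='us-west-2',
    aws_access_key_id=config['AWS']['KEY'],
    aws_secret_access_key=config['AWS']['SECRET'],
)

choice = input("Type 'create' or 'delete': ").strip().lower()

if choice == 'create':
    # Spin up a small multi-node cluster; all sizing values are placeholders.
    redshift.create_cluster(
        ClusterType='multi-node',
        NodeType='dc2.large',
        NumberOfNodes=4,
        DBName=config['CLUSTER']['DB_NAME'],
        ClusterIdentifier='sparkify-cluster',
        MasterUsername=config['CLUSTER']['DB_USER'],
        MasterUserPassword=config['CLUSTER']['DB_PASSWORD'],
        IamRoles=[config['IAM_ROLE']['ARN']],
    )
elif choice == 'delete':
    # Tear the cluster down without keeping a final snapshot.
    redshift.delete_cluster(
        ClusterIdentifier='sparkify-cluster',
        SkipFinalClusterSnapshot=True,
    )
```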
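The heart of etl.py is Redshift's COPY command for bulk-loading JSON from S3 into the staging tables, followed by INSERT ... SELECT statements that fill the star schema. Here is a sketch of one of each, in the query-string style a module like sql_queries.py would use (the bucket paths, role ARN, and region are placeholders, and the staging columns are assumed to mirror the log fields):

```python
# Bulk-load raw log events from S3 into a staging table (sketch).
staging_events_copy = """
    COPY staging_events
    FROM 's3://<bucket>/log_data'                    -- LOG_DATA from dwh.cfg
    IAM_ROLE '<iam-role-arn>'                        -- ARN from dwh.cfg
    REGION 'us-west-2'
    FORMAT AS JSON 's3://<bucket>/log_json_path.json';
"""

# Fill a dimension table from the staged events (sketch).
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE page = 'NextSong' AND userId IS NOT NULL;
"""
```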
To run the project:

- Add your personal KEY and SECRET and fill in all the information needed for the future cluster in dwh.cfg
- Run create_delete_cluster.py and choose the option 'create'
- Run create_tables.py to create the tables in the cluster
- Run etl.py to copy data from S3 and insert values into the dimensional models
- Run create_delete_cluster.py and choose the option 'delete' to delete the cluster
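Assuming the scripts are run from the repository root, the whole sequence looks like this:

```
$ python create_delete_cluster.py   # answer 'create' at the prompt
$ python create_tables.py
$ python etl.py
$ python create_delete_cluster.py   # answer 'delete' when finished
```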
The star schema comprises one fact table and four dimension tables.

Fact Table
songplays - records in the event data associated with song plays, i.e. records with page NextSong (songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
Dimension Tables
users - users in the app (user_id, first_name, last_name, gender, level)
songs - songs in the music database (song_id, title, artist_id, year, duration)
artists - artists in the music database (artist_id, name, location, latitude, longitude)
time - timestamps of records in songplays broken down into specific units (start_time, hour, day, week, month, year, weekday)
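For illustration, the fact table could be defined along these lines (the column types, the IDENTITY surrogate key, and the DISTKEY/SORTKEY choices are assumptions, not necessarily the project's exact DDL):

```python
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id  INT IDENTITY(0, 1) PRIMARY KEY,
        start_time   TIMESTAMP NOT NULL SORTKEY,  -- joins to time.start_time
        user_id      INT NOT NULL,
        level        VARCHAR,
        song_id      VARCHAR DISTKEY,
        artist_id    VARCHAR,
        session_id   INT,
        location     VARCHAR,
        user_agent   VARCHAR
    );
"""
```

Distributing on song_id is one reasonable choice because it co-locates songplays rows with the songs dimension for joins, while sorting on start_time speeds up time-range queries.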