In this project I, as a data engineer, developed a data warehouse for an imaginary startup called Sparkify. According to the project premise, the startup wants to move its data and processes to AWS. My task was to develop an ETL pipeline that collects data from S3 and moves it to a Redshift cluster. Inside the cluster, I organised the raw data into data models following a star schema design.
I worked with two datasets that reside in S3:

- Song dataset. A subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
- Log dataset. Log files in JSON format generated by an event simulator based on the songs in the dataset above. Sample records from both datasets appear after this list.
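For orientation, here is roughly what a single record from each dataset looks like. The field names follow the two JSON formats described above; the values are illustrative placeholders, not taken from the actual files.

A song record:

```json
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null,
 "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud",
 "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff",
 "duration": 152.92036, "year": 0}
```

A log record (the page field is what the fact table below filters on):

```json
{"artist": "Des'ree", "auth": "Logged In", "firstName": "Kaylee", "gender": "F",
 "itemInSession": 1, "lastName": "Summers", "length": 246.30812, "level": "free",
 "location": "Phoenix-Mesa-Scottsdale, AZ", "method": "PUT", "page": "NextSong",
 "registration": 1540344794796.0, "sessionId": 139, "song": "You Gotta Be",
 "status": 200, "ts": 1541106106796, "userAgent": "Mozilla/5.0 ...", "userId": "8"}
```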
The repository contains the following files:

- dwh.cfg. The config file that holds all the Redshift database information the scripts need; a sketch of its layout appears after this list.
- create_delete_cluster.py. The script has two main functions: creating and deleting a Redshift cluster. When run, it first asks the user whether they want to create or delete a cluster. If the user chooses 'create', the script also saves the newly created IAM role and the endpoint address to dwh.cfg.
- sql_queries.py. The Python file that contains all the SQL queries needed for the drop, create, and insert operations.
- create_tables.py. The script connects to the Redshift cluster and creates both the staging tables used for copying data from S3 and the tables of the star schema.
- etl.py. The script connects to the Redshift cluster, copies data from S3 into the staging tables, and then inserts values into the tables created by the previous script. Sketches of the key pieces follow this list.
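A rough sketch of how dwh.cfg might be laid out (the section and key names here are assumptions inferred from the scripts above; all values are placeholders):

```ini
[AWS]
KEY    = <your-aws-access-key-id>
SECRET = <your-aws-secret-access-key>

[CLUSTER]
HOST        =                ; endpoint, filled in by create_delete_cluster.py
DB_NAME     = sparkifydb
DB_USER     = <db-user>
DB_PASSWORD = <db-password>
DB_PORT     = 5439

[IAM_ROLE]
ARN =                        ; filled in by create_delete_cluster.py

[S3]
LOG_DATA     = 's3://<bucket>/log_data'
LOG_JSONPATH = 's3://<bucket>/log_json_path.json'
SONG_DATA    = 's3://<bucket>/song_data'
```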
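A minimal sketch of how create_delete_cluster.py can create and delete the cluster with boto3 (the cluster identifier, region, node type, and node count are illustrative placeholders, not the project's actual settings):

```python
import configparser

import boto3

config = configparser.ConfigParser()
config.read('dwh.cfg')

redshift = boto3.client(
    'redshift',
    region_name='us-west-2',
    aws_access_key_id=config['AWS']['KEY'],
    aws_secret_access_key=config['AWS']['SECRET'],
)

choice = input("Type 'create' or 'delete': ").strip().lower()

if choice == 'create':
    # Spin up a small multi-node cluster; all sizing values are placeholders.
    redshift.create_cluster(
        ClusterType='multi-node',
        NodeType='dc2.large',
        NumberOfNodes=4,
        DBName=config['CLUSTER']['DB_NAME'],
        ClusterIdentifier='sparkify-cluster',
        MasterUsername=config['CLUSTER']['DB_USER'],
        MasterUserPassword=config['CLUSTER']['DB_PASSWORD'],
        IamRoles=[config['IAM_ROLE']['ARN']],
    )
elif choice == 'delete':
    # Tear the cluster down without keeping a final snapshot.
    redshift.delete_cluster(
        ClusterIdentifier='sparkify-cluster',
        SkipFinalClusterSnapshot=True,
    )
```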
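The heart of etl.py is Redshift's COPY command for bulk-loading JSON from S3 into the staging tables, followed by INSERT ... SELECT statements that fill the star schema. Here is a sketch of one of each, in the query-string style a module like sql_queries.py would use (the bucket paths, role ARN, and region are placeholders, and the staging columns are assumed to mirror the log fields):

```python
# Bulk-load raw log events from S3 into a staging table (sketch).
staging_events_copy = """
    COPY staging_events
    FROM 's3://<bucket>/log_data'                    -- LOG_DATA from dwh.cfg
    IAM_ROLE '<iam-role-arn>'                        -- ARN from dwh.cfg
    REGION 'us-west-2'
    FORMAT AS JSON 's3://<bucket>/log_json_path.json';
"""

# Fill a dimension table from the staged events (sketch).
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE page = 'NextSong' AND userId IS NOT NULL;
"""
```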
To run the project:

- Add your personal KEY and SECRET and fill in all the information needed for the future cluster in dwh.cfg
- Run create_delete_cluster.py and choose the option 'create'
- Run create_tables.py to create the tables in the cluster
- Run etl.py to copy data from S3 and insert values into the dimensional models
- Run create_delete_cluster.py and choose the option 'delete' to delete the cluster
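Assuming the scripts are run from the repository root, the whole sequence looks like this:

```
$ python create_delete_cluster.py   # answer 'create' at the prompt
$ python create_tables.py
$ python etl.py
$ python create_delete_cluster.py   # answer 'delete' when finished
```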
The star schema comprises one fact table and four dimension tables.

Fact Table
songplays - records in the event data associated with song plays, i.e. records with page NextSong (songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
Dimension Tables
users - users in the app (user_id, first_name, last_name, gender, level)
songs - songs in the music database (song_id, title, artist_id, year, duration)
artists - artists in the music database (artist_id, name, location, latitude, longitude)
time - timestamps of records in songplays broken down into specific units (start_time, hour, day, week, month, year, weekday)
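For illustration, the fact table could be defined along these lines (the column types, the IDENTITY surrogate key, and the DISTKEY/SORTKEY choices are assumptions, not necessarily the project's exact DDL):

```python
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id  INT IDENTITY(0, 1) PRIMARY KEY,
        start_time   TIMESTAMP NOT NULL SORTKEY,  -- joins to time.start_time
        user_id      INT NOT NULL,
        level        VARCHAR,
        song_id      VARCHAR DISTKEY,
        artist_id    VARCHAR,
        session_id   INT,
        location     VARCHAR,
        user_agent   VARCHAR
    );
"""
```

Distributing on song_id is one reasonable choice because it co-locates songplays rows with the songs dimension for joins, while sorting on start_time speeds up time-range queries.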