The goal of this project was to create an interactive dashboard for exploring a database (SQLite) populated with data obtained via the API provided by the website setlist.fm. These data relate to concerts (also known as "shows"), particularly which songs were played in what order, and are at least partially supplied by volunteer attendees of the concerts. The question that we initially focused on was which artist has performed in the highest number of countries during a given period.
In 2017 for example, information on about 306 thousand individual shows is available from setlist.fm. In our experience, half to three quarters of these are either duplicates or contain no information on songs played (an empty setlist). For our initial run we sampled 200 randomly selected pages from 2017 (a page has 20 shows) to make sure we pulled a representative sample of shows from throughout the year (see "setlist_data.py"). These 4,000 shows after cleaning were reduced in number to 1,603. Some artists (also known as "musicians") had multiple shows, so there were 1,325 artists. The shows' setlists included 15,111 distinct songs, some of which were played multiple times (in different shows), bringing the total number of instances of a song's being played in a show to 17,789.
The schema we used included a table for artists, each of which could have many songs (some of which might even be covers of other artists' songs, but in the context of a concert, these songs "belong" to the artist performing). Each song could be played at mutiple shows (as artists often play their top hits at every show they do). Conversely, each show has at least one song (around 12-15 songs on average as it happens.) Therefore the relationship between songs and shows is many-to-many (with a table of foreign keys to capture all the pairings). Each show in turn "belongs to" the country in which it occurs. The show table also includes other useful information: the date, the venue name and the city (with its latitude and longitude).
To save processing time, our extract and load procedure had to be refined to only loop through all the raw data on concerts once to populate the database. A much slower version ("etl_old.py") is included here for illustration purposes. We used Flask and SQLAlchemy to get the database up and running. In "models.py," we define class methods to ease the process of extracting data to be displayed on the dashboard (done in "view_methods.py"). The actual interface (using Dash) is detailed in "layout.py." We provided a graph of the top-20 artists by number of countries performed in, a map of performance locations for an artist selected by the user (hovering over each location provides more details) and a list of top venues (by number of performances) in each country (again, as selected by the user). The dropdown options are determined dynamically by the state of the database.
User testing gave us some ideas for future refinements: providing a ranking of countries with the most performances, explaining the axes of our graph more clearly and alphabetizing the dropdown menus for ease of use. The database schema could be re-arranged for speed, especially if the songs are not of particular interest.