Spotify clustering and recommendation system

Analysing Spotify data, clustering and creating a predictive recommendation system.

Project Board - Jupyter Notebooks - Streamlit Dashboard - Conclusions

Table of contents (Click to show)

Dataset Content
Business Requirements
Hypothesis and how to validate
Project Plan
The rationale to map the business requirements to the Data Visualisations
Analysis techniques used
Ethical considerations
Dashboard Design
Unfixed Bugs
Development Roadmap
Conclusions
Deployment
Main Data Analysis Libraries
Credits
- Content
- Media

How to use this repo (Click to show)

Make sure you have:

Python installed (this project used version 3.12)
The latest version of VS Code

Inside VS Code:

Open Extensions (Ctrl+Shift+X, or ⇧⌘X on macOS) and install these extensions if you don't have them:

Python extension (by Microsoft in the Extensions Marketplace)
Jupyter extension (also by Microsoft)

From the terminal:

Open the folder in a terminal where you want the project to be saved

Run git clone:

git clone https://github.com/petedanielsmith/spotify-recommendation-system.git

Navigate into the new folder:

cd spotify-recommendation-system

Set up a virtual environment:

Create a virtual environment for the project.

Linux / Mac:

python3 -m venv .venv
source .venv/bin/activate

Windows CMD:

python -m venv .venv
.venv\Scripts\activate

Windows PowerShell:

python -m venv .venv
.\.venv\Scripts\Activate.ps1

Install the dependencies:

This installs all the dependencies needed for the project into the virtual environment (if one is active) rather than globally.

pip install -r requirements.txt

Select the Kernel

There is a drop-down at the top of each notebook for selecting the kernel that will run the Python code. If you set up a virtual environment, make sure you pick the .venv one.

Team Members

Cosmin Manolescu - https://www.linkedin.com/in/cosmin-manolescu95/
Duminda Gamage - https://www.linkedin.com/in/dumindap-gamage/
Kumudu Saranath Liyanage - https://www.linkedin.com/in/kumudu-s-liyanage/
Pete Smith - https://www.linkedin.com/in/petedanielsmith/

Dataset Content

The dataset used in this project can be downloaded from Kaggle: Spotify Tracks Dataset. It is a dataset of Spotify tracks over a range of 125 different genres. Each track has some audio features associated with it.

Columns include:

number: index number.
track_id: The Spotify ID for the track.
artists: The artists' names who performed the track. If there is more than one artist, they are separated by a ;.
album_name: The album name in which the track appears.
track_name: Name of the track.
popularity: The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity.
duration_ms: The track length in milliseconds.
explicit: Whether or not the track has explicit lyrics (true = yes it does; false = no it does not OR unknown).
danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.
key: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness: The overall loudness of a track in decibels (dB).
mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
time_signature: An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7 indicating time signatures of 3/4, to 7/4.
track_genre: The genre in which the track belongs.
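
The integer encoding of the key and mode columns described above can be turned back into readable labels with a small lookup. The following is a minimal sketch; the names PITCH_CLASSES, key_to_name and mode_to_name are hypothetical helpers introduced here for illustration and are not taken from the project code.

```python
# Pitch-class names indexed by the integer stored in the `key` column.
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def key_to_name(key: int) -> str:
    """Return the pitch-class name for a `key` value, or 'Unknown' for -1."""
    if key == -1:
        return "Unknown"  # -1 means no key was detected
    if not 0 <= key < len(PITCH_CLASSES):
        raise ValueError(f"key must be -1 or 0-11, got {key}")
    return PITCH_CLASSES[key]


def mode_to_name(mode: int) -> str:
    """Return 'major' for 1 and 'minor' for 0, as described for `mode`."""
    return {1: "major", 0: "minor"}[mode]


print(key_to_name(0), mode_to_name(1))  # C major
print(key_to_name(-1))                  # Unknown
```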

Business Requirements

The objective of this project is to design and evaluate a data-driven music recommendation system using Spotify track data. The system is developed for educational purposes, demonstrating the application of data engineering, exploratory analysis, unsupervised learning, and predictive modelling techniques.

Data Ingestion and ETL

Ingest a publicly available Spotify dataset containing track metadata and audio features.
Clean, transform, and structure the data to ensure consistency, completeness, and suitability for analysis.
Handle missing values, duplicates, and feature scaling as part of the ETL pipeline.
Produce a processed dataset that can be reused across analysis and modelling stages.
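
The ETL requirements above could be sketched roughly as follows. This is an illustrative outline on a tiny made-up DataFrame, not the project's actual pipeline; the function names clean_tracks and scale_features are assumptions, and the real notebook may handle missing values or duplicates differently (for example by de-duplicating on track_id only).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the Kaggle CSV; the real data would be loaded with pd.read_csv.
raw = pd.DataFrame({
    "track_id": ["a1", "a1", "b2", "c3", "d4"],
    "track_name": ["Song A", "Song A", "Song B", None, "Song D"],
    "energy": [0.80, 0.80, 0.35, 0.60, 0.10],
    "tempo": [128.0, 128.0, 90.0, 110.0, 70.0],
})


def clean_tracks(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values, then drop exact duplicate rows."""
    return df.dropna().drop_duplicates().reset_index(drop=True)


def scale_features(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy with the given numeric columns standardised (mean 0, std 1)."""
    scaled = df.copy()
    scaled[columns] = StandardScaler().fit_transform(df[columns])
    return scaled


clean = clean_tracks(raw)
processed = scale_features(clean, ["energy", "tempo"])
print(processed)
```

Keeping the cleaning and scaling steps as separate functions makes it easier to reuse the cleaned (unscaled) data for EDA and the scaled data for modelling.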

Exploratory Data Analysis (EDA)

Perform exploratory analysis to understand the distribution and relationships of key audio features (e.g. energy, tempo, valence, popularity).
Identify trends, correlations, and potential biases within the dataset.
Visualise feature distributions and clustering tendencies to inform modelling decisions.

Clustering and Segmentation

Apply unsupervised learning techniques (e.g. K-Means) to group tracks based on audio feature similarity.
Evaluate cluster quality and interpretability.
Use clustering results to identify distinct musical styles or listening profiles.

Predictive Recommendation System

Develop a recommendation approach that suggests tracks based on similarity to a given song or cluster.
Use audio features and clustering outputs to generate predictive recommendations.

Evaluation and Documentation

Assess model outputs qualitatively and quantitatively where appropriate.
Document assumptions, limitations, and design decisions.

Hypothesis and how to validate?

Hypothesis 1: All genres have the same average popularity; music preference is evenly distributed.
- Selected Test: Mann-Whitney U Test
- Visual: Box Plot comparison.
Hypothesis 2: Duration does not impact popularity.
- Selected Test: Kruskal-Wallis H Test
- Visual: Hexbin Density Plot and Binned Bar Chart.
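
As a rough illustration of how the two selected tests can be run, the sketch below uses SciPy on synthetic popularity scores. The group names, sample sizes, and the 0.05 significance level are assumptions made for the example; none of the numbers come from the project data.

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(seed=0)

# Synthetic popularity scores (0-100); illustrative only, not project data.
mainstream = rng.integers(40, 90, size=200)
niche = rng.integers(0, 50, size=200)

# Hypothesis 1: compare popularity between two independent groups of genres.
u_stat, p_genre = mannwhitneyu(mainstream, niche, alternative="two-sided")

# Hypothesis 2: compare popularity across three hypothetical duration bins.
short = rng.integers(10, 50, size=200)
radio = rng.integers(30, 80, size=200)
long_ = rng.integers(10, 50, size=200)
h_stat, p_duration = kruskal(short, radio, long_)

alpha = 0.05  # assumed significance level
print(f"Mann-Whitney U: U={u_stat:.1f}, p={p_genre:.3g}, reject H0: {p_genre < alpha}")
print(f"Kruskal-Wallis: H={h_stat:.1f}, p={p_duration:.3g}, reject H0: {p_duration < alpha}")
```

Mann-Whitney U compares exactly two groups, while Kruskal-Wallis generalises the same rank-based idea to three or more groups, which is why it suits the binned duration comparison.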

Project Plan

The project follows these steps:

Extract - Extract the data from Kaggle.
Load - Load the CSV via Pandas.
Transform - Clean and process the data using Pandas, adding new columns and checking for missing or duplicated values.
Visualise - Create charts with Matplotlib and Seaborn to visualise trends and distributions.
Analyse - Interpret what the visualisations show.
Unsupervised Learning - Use K-Means to cluster the data into similar groups.
Cluster insight extraction - Understand clusters and create user-understandable profiles.
Supervised Learning - Use both Linear Regression and Random Forest machine learning to create predictive models.
Interactive Dashboard - Use Streamlit to create an interactive dashboard to display the data and run predictive recommendations.
Document - Record findings and conclusions.

The rationale to map the business requirements to the Data Visualisations

Business Requirement	Data Visualisation(s)	Rationale & Hypothesis Outcome
1. Identify trends in music preferences (Genre Popularity)	Box Plot (X=Genre, Y=Popularity)	Rationale: Box plots show not just the average popularity, but the variance within a genre. This reveals if a genre is consistently popular (tight box, high median) or hit-or-miss (large spread). Hypothesis Outcome: The analysis reveals that "Pop-Film" and "K-Pop" have high medians, confirming global preference for these styles.
2. Visualise popular songs by time (Duration vs. Popularity)	Hexbin Plot or Scatter Plot with Trendline (X=Duration, Y=Popularity)	Rationale: With 100k+ rows, a standard scatter plot will suffer from overplotting. A Hexbin plot groups dense points, clearly showing the "sweet spot" duration where most popular songs exist. Hypothesis Outcome: The analysis reveals a significant concentration of high-popularity tracks within the 3-to-4-minute range, effectively visualizing the prevailing industry standard.
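
To illustrate why a hexbin plot copes better with overplotting than a scatter plot, here is a minimal Matplotlib sketch on synthetic data. The distribution parameters, grid size, and colour map are arbitrary choices for the example and are not the settings used in the notebooks.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=2)
# Synthetic durations (minutes) and popularity scores; not project data.
duration_min = rng.normal(loc=3.5, scale=1.0, size=5_000).clip(0.5, 10)
popularity = rng.integers(0, 101, size=5_000)

fig, ax = plt.subplots(figsize=(7, 4))
# Each hexagon is coloured by how many tracks fall inside it; empty cells are hidden.
hb = ax.hexbin(duration_min, popularity, gridsize=30, mincnt=1, cmap="viridis")
fig.colorbar(hb, ax=ax, label="Tracks per hexagon")
ax.set_xlabel("Duration (minutes)")
ax.set_ylabel("Popularity")
ax.set_title("Duration vs. popularity (synthetic data)")

# Render to an in-memory buffer; fig.savefig("plot.png") would write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
plt.close(fig)
```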

Analysis techniques used

Statistical Validation

Non-Parametric Testing: We prioritized non-parametric tests (Mann-Whitney U and Kruskal-Wallis) over standard parametric tests (T-Test/ANOVA).
- Reasoning: Our "Normality Checks" (Shapiro-Wilk) indicated that audio features and popularity scores are not normally distributed, with many of them highly skewed. Parametric tests assume approximate normality, so applying them to this data could produce unreliable p-values and false conclusions.
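
A normality check of this kind might look like the sketch below, which runs SciPy's Shapiro-Wilk test on a synthetic right-skewed sample. The sample, its size, and the 0.05 threshold are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(seed=1)
# Synthetic right-skewed sample standing in for a feature such as speechiness.
skewed = rng.exponential(scale=0.1, size=500)

stat, p_value = shapiro(skewed)
# A small p-value is evidence against normality; a large one is only
# "no evidence against", not proof that the data is normal.
is_normal = p_value > 0.05
print(f"W={stat:.3f}, p={p_value:.3g}, treat as normal: {is_normal}")
```

SciPy's documentation notes that the Shapiro-Wilk p-value may be inaccurate for samples larger than 5,000, so with a dataset of this size it is reasonable to run the test on a random sample of each column.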

Data Visualization Strategies

Distribution Analysis (Box Plots): Used to visualize the spread and central tendency of popularity across genres, highlighting outliers and consistency.
Density Estimation (Hexbins): Employed for the Duration vs. Popularity analysis to handle the large dataset size (100,000+ rows). This technique reveals high-density clusters that would be invisible in a standard scatter plot.
Binning: Continuous variables (like Duration) were converted into categorical bins (e.g., "Radio Edit: 3-4 min") to allow for group-based statistical comparison.
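
The binning step can be sketched with pandas.cut. Only the "Radio Edit: 3-4 min" label appears in this README; the other bin edges and labels below are assumptions chosen to make the example complete, and the data is made up.

```python
import pandas as pd

# Made-up tracks; duration_ms and popularity mirror the dataset's column names.
df = pd.DataFrame({
    "duration_ms": [95_000, 150_000, 185_000, 210_000, 239_000, 270_000, 620_000],
    "popularity": [20, 45, 55, 70, 64, 40, 15],
})

df["duration_min"] = df["duration_ms"] / 60_000

# Left-closed bins: [0, 2), [2, 3), [3, 4), [4, 5), [5, inf).
bins = [0, 2, 3, 4, 5, float("inf")]
labels = ["<2 min", "2-3 min", "Radio Edit: 3-4 min", "4-5 min", "5+ min"]
df["duration_bin"] = pd.cut(df["duration_min"], bins=bins, labels=labels, right=False)

# Median popularity per bin, ready for a bar chart or a Kruskal-Wallis test.
median_by_bin = df.groupby("duration_bin", observed=True)["popularity"].median()
print(median_by_bin)
```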

Machine Learning Techniques

Clustering

The clustering phase focuses on creating stable labels that the downstream classifier can be trained on, and that are meaningful enough to humans to be explained.

Why K-Means:

Works well with continuous audio features (our dataset is almost entirely continuous)
Finds centroids, which can then be easily mapped to user preferences
Produces stable, interpretable and easily learnable cluster boundaries

Workflow:
1. Data Preparation: Unusable columns were dropped and categorical columns were numerically encoded. All columns were then standardised using StandardScaler, and the dimensionality of the dataset was reduced using PCA(n_components=0.85), which keeps the fewest components that explain at least 85% of the variance.
2. Training the model: Used k-means++ as the centroid initialisation strategy, with n_init=10 so that each fit is repeated from 10 different initialisations and the best run is kept. We tried K=range(2,31), i.e. K from 2 to 30, to cover a wide range.
3. Selecting optimal K: Used Inertia (i.e. Elbow Method) and the Silhouette score to choose the optimal K.
4. Obtain labels: Labels were then obtained from the model with the best K, and were attached to the dataset.
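
The workflow above can be sketched with scikit-learn as follows. This is a simplified outline on synthetic blobs rather than the Spotify features: the K range is shortened for speed, and K is chosen by silhouette score alone, whereas the project also inspected the inertia (elbow) curve. Variable names are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded audio-feature matrix (not project data).
X, _ = make_blobs(n_samples=600, n_features=8, centers=4, random_state=42)

# 1. Standardise, then keep the fewest components explaining >= 85% of variance.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=0.85, svd_solver="full").fit_transform(X_scaled)

# 2-3. Fit K-Means for each K and record inertia and silhouette score.
results = {}
for k in range(2, 9):  # the project scanned range(2, 31); shortened here
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = model.fit_predict(X_reduced)
    results[k] = (model.inertia_, silhouette_score(X_reduced, labels))

# Simplification: pick K by the highest silhouette score only.
best_k = max(results, key=lambda k: results[k][1])

# 4. Refit with the chosen K and obtain the labels to attach to the dataset.
final_model = KMeans(n_clusters=best_k, init="k-means++", n_init=10, random_state=42)
cluster_labels = final_model.fit_predict(X_reduced)
print(f"best K by silhouette: {best_k}")
```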

Classification

The classification phase focuses on building a robust predictive model to assign new song data or user preferences to one of the 10 identified musical clusters.

Objective: Train and evaluate multiple machine learning algorithms to classify audio features into target clusters.
Workflow:
1. Data Preparation: The dataset is split into training (80%) and testing (20%) sets, stratified by cluster so that both sets keep the same cluster proportions.
2. Model Selection: Three distinct classifiers are trained and compared:
  - Random Forest Classifier (n_estimators=200)
  - Gradient Boosting Classifier
  - XGBoost Classifier (optimized with max_depth=6, learning_rate=0.1)
3. Evaluation: Models are evaluated based on Accuracy. The notebook automatically identifies the best-performing model among the three.
4. Deployment: The champion model is serialized and saved as best_spotify_model.pkl for integration into the application.
Features Used: 11 numerical audio attributes, including danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, and tempo.
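
A hedged sketch of the train, compare, and save loop is shown below. It uses synthetic data with 3 classes instead of the 10 clusters, and it leaves out the XGBoost candidate so that the example depends only on scikit-learn and Joblib; an xgboost.XGBClassifier(max_depth=6, learning_rate=0.1) could be added to the candidates dictionary in the same way. The model is written to a temporary directory here purely to keep the example side-effect free.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 11 features, 3 classes (the project has 10 clusters).
X, y = make_classification(n_samples=600, n_features=11, n_informative=6,
                           n_classes=3, random_state=42)

# 80/20 split, stratified so both sets keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each candidate and score it by accuracy on the held-out test set.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Keep the best model and serialise it for the dashboard to load later.
best_name = max(scores, key=scores.get)
model_path = Path(tempfile.mkdtemp()) / "best_spotify_model.pkl"
joblib.dump(candidates[best_name], model_path)
print(best_name, round(scores[best_name], 3))
```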

Ethical considerations

This project uses a publicly available Spotify dataset containing track-level metadata and audio features such as popularity, genre, tempo, energy, and danceability. The dataset does not include any personally identifiable information (PII) or individual user listening histories. As a result, the project does not present risks related to user privacy, data protection, or GDPR compliance.

Despite the absence of personal data, ethical considerations remain relevant. The dataset reflects platform-driven popularity and listening trends, which may introduce bias into the recommendation system. Popular artists and genres may be over-represented, while niche or emerging artists may be under-recommended. Recommendations based solely on audio features may also fail to capture cultural, contextual, or subjective aspects of musical preference.

This system is developed for educational and coursework purposes only, with the goal of exploring recommendation techniques rather than influencing user behaviour or commercial outcomes. The limitations of the dataset and model are acknowledged, and the recommendations should be interpreted as illustrative rather than authoritative. Future extensions of the project could include bias evaluation, diversity-aware recommendation strategies, and improved transparency in recommendation logic.

Dashboard Design

The dashboard was designed with accessibility for non-technical users as a priority. Visualizations were carefully selected for clarity to ensure insights are immediately understandable, while Plotly was utilized to deliver a visually appealing and interactive user experience. Furthermore, textual explanations and strategic recommendations were embedded directly into the interface to ensure self-service interpretation.

Dashboard Pages:

1. EDA - Exploratory Data Analysis

This page serves as the analytical foundation of the project, presenting our Hypothesis Testing and Data Quality checks. To accommodate different stakeholders, the analysis is split into two distinct views:

Tab 1: Executive Insights (Business)
- Focus: High-level trends and actionable strategy.
- Key Visuals:
  - The "Mainstream" Gap: A box plot comparing the Top 10 vs. Bottom 10 genres, highlighting the massive popularity advantage of mainstream music.
  - The "Radio Edit" Effect: A dual-view chart (Bar & Density) visualizing the "sweet spot" for song duration (3–4 minutes).
- Outcome: Provides clear "Business Recommendations" for the recommendation engine, such as implementing a popularity bias for new users and filtering out non-musical content.
Tab 2: Data Science Lab (Technical)
- Focus: Statistical validity and feature engineering.
- Key Visuals:
  - Normality Checks: A dynamic table showing Shapiro-Wilk test results to justify non-parametric testing.
  - Correlation Matrix: A heatmap identifying multicollinearity (e.g., Energy vs. Loudness).
  - Formal Hypothesis Testing: Raw outputs of Mann-Whitney U and Kruskal-Wallis tests (Statistic & P-Value) to show that the observed trends are statistically significant rather than random noise.

2. Clustering

On this page, users can explore all of the clusters and learn about them. If they wish, they can also learn more about how the clusters were obtained and some of the caveats associated with the process and its results.

Tab 1: Cluster Profiles (Business)
- Focus: Helping users understand what each cluster profile represents.
- Key Visuals:
  - Cluster Radar Graph: A radar / spider plot showing the strengths and weaknesses of each cluster.
- Outcome: Cluster profiles are explained to users, improving the usability of the system.
Tab 2: Building the Clusters (Technical)
- Focus: Technical explanation of the creation of clusters, plus caveats.

3. Prediction & Recommendation

The project features an interactive web application built with Streamlit, serving as the front-end for the recommendation engine.

User Interface:
- Feature Tuning: Users can define their ideal "sound" using sliders for 11 audio features.
- Musical Key Mapping: A user-friendly dropdown allows selection of musical keys (e.g., C, F#, B) which are mapped internally to their integer representations.
Prediction Engine:
- Upon submission, the app loads the pre-trained best_spotify_model.pkl.
- It predicts the specific Cluster the user's input belongs to and maps it to a descriptive genre label (e.g., Extreme/Metal, Electronic/House, Acoustic/Piano) using a predefined CLUSTER_NAMES dictionary.
Recommendation System:
1. Filtering: The dataset is filtered to include only songs from the predicted cluster.
2. Similarity Search: The app calculates the Euclidean Distance between the user's input vector and every song in that cluster.
3. Ranking: The top 50 songs with the lowest distance (highest similarity) are retrieved.
Interactive Results:
- Artist Filter: Users can search within the recommendations for specific artists.
- Grouped Display: Duplicate track entries are aggregated, displaying unique songs with their associated album and genres.
- Visual Validation: A bar chart acts as a feedback loop, visualizing the difference between the User's Input (Red) and the Average Profile of Recommended Songs (Blue).
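
The filter, distance, and ranking steps of the recommendation system can be sketched as below. This is an illustration under assumptions, not the dashboard's code: the recommend function, the FEATURES subset, and the toy tracks are all made up, and the cluster is passed in directly rather than predicted by the saved model.

```python
import numpy as np
import pandas as pd

FEATURES = ["danceability", "energy", "valence"]  # subset of the 11 for brevity

# Toy catalogue with pre-assigned cluster labels (not project data).
tracks = pd.DataFrame({
    "track_name": ["A", "B", "C", "D", "E"],
    "cluster": [0, 0, 1, 0, 1],
    "danceability": [0.80, 0.50, 0.90, 0.75, 0.20],
    "energy": [0.70, 0.40, 0.95, 0.65, 0.10],
    "valence": [0.60, 0.30, 0.80, 0.70, 0.20],
})


def recommend(df, user_vector, predicted_cluster, top_n=50):
    """Return the top_n tracks in the predicted cluster closest to user_vector."""
    # 1. Filtering: keep only songs from the predicted cluster.
    candidates = df[df["cluster"] == predicted_cluster].copy()
    # 2. Similarity search: Euclidean distance to the user's input vector.
    diffs = candidates[FEATURES].to_numpy() - np.asarray(user_vector)
    candidates["distance"] = np.linalg.norm(diffs, axis=1)
    # 3. Ranking: lowest distance means highest similarity.
    return candidates.nsmallest(top_n, "distance")


user_input = [0.78, 0.68, 0.62]
top = recommend(tracks, user_input, predicted_cluster=0, top_n=2)
print(top[["track_name", "distance"]])
```

With this toy input the two closest tracks in cluster 0 are A and D. Euclidean distance is sensitive to feature scale, so features with wide ranges such as tempo or loudness would dominate unless the inputs are scaled first; this README does not state whether the app applies such scaling.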

Unfixed Bugs

All known issues, to the team's knowledge, were fixed.

Development Roadmap

The project is structured into four distinct phases, ensuring a logical flow from raw data to a user-facing application.

Phase 1: Cleaning, EDA & Hypothesis Testing (Notebook 01)

Data Cleaning
Hypothesis Testing
EDA

Phase 2: Clustering (Notebook 02)

Feature Engineering
Model Selection
Evaluation

Phase 3: Predictions for Song Recommendation (Notebook 03)

Classification
Tuning
Recommendation Logic

Phase 4: Dashboard & Documentation

Streamlit App
README

Future directions

Utilise the Spotify APIs to enable us to pull track/artist imagery and web links to the songs.
Add a 'like songs' page to the dashboard where users can search for tracks by artist/genre/track/album and then use a button next to a track to find similar songs, returning other songs in the same cluster.

Conclusions

The Radio Edit effect: Data validates the "Radio Edit" effect, showing that songs between 3–4 minutes generally maximize commercial success compared to very short or long tracks.
Genre is an important feature: Unsurprisingly, statistical testing confirms a considerable, non-random popularity gap where "Mainstream" genres (Pop, K-Pop) consistently outperform "Niche" genres.
Popularity is not useful for clustering: After clustering, a summary table showed an almost identical popularity score of ~35 (on the 0-100 scale) for every cluster, suggesting that popularity does not help separate the clusters.
Some clusters overlap: Four clusters were found to have only subtle differences, which reflects the continuous nature of musical tastes. It may also reflect a technical quirk of K-Means: it always creates the specified number of clusters, so it draws increasingly fine distinctions as the actual differences between groups shrink.
Classification: The project delivers a cluster classifier that achieves 94.54% accuracy on the test data.

Deployment

The dashboard app is deployed on Streamlit Community Cloud and can be accessed by visiting this URL in a web browser:

https://spotify-recommendation-system-code-institute.streamlit.app/

It can be run locally with:

streamlit run dashboard_app/main.py

Main Data Analysis Libraries

The libraries used for data analysis were:

Pandas - For data loading, transforming and cleaning.
NumPy - For data transforming.
Matplotlib - For overall multi chart layouts.
Seaborn - For a lot of the individual charts.
Scikit-learn - For machine learning algorithms.
Joblib - For saving and loading models.
Streamlit - For creating an interactive web dashboard.
XGBoost - For training the XGBoost model

Credits

Content

Code Institute - The initial project structure and the LMS (Learning Management System) from the course.
Kaggle - Providing the data set used.

Media

Google AI - Gemini 3 - AI generated banner logo for this README file.
Google Material Icons - Icons in the dashboard.
Code Institute - Code Institute logo.
Python - Python logo image.
Pandas - Pandas logo image.
Matplotlib - Matplotlib logo image.
Seaborn - Seaborn logo image.
Kaggle - Kaggle logo image.
Scikit-learn - Scikit-learn logo image.
Streamlit - Streamlit logo image.
