Team members: Seth Billiau, William Drew, Sarah Lucioni
Project TF: Nathan Hollenberg
Last edited: 12/11/2019
Group ID: 56
Github: spotify_playlists
When it comes to music, streaming is the way of the future. Convenient, cheap, and widely accessible, streaming is easier than ever with more people turning to popular services for their musical needs. In April of 2019, industry leader Spotify announced that they had reached a staggering 217 million monthly active users with over 100 million paid subscribers (Porter, “Spotify Is First to 100 Million Paid Subscribers.”).
In addition to providing music for its users, Spotify also provides a robust music recommendation system. Leveraging data science and machine learning techniques, Spotify aggregates a number of different models to generate new playlists based on a user’s musical preferences.
Developing a user’s musical taste is no small feat, but Spotify is able to overcome this hurdle by collecting user feedback on generated playlists with an up/down vote system and by aggregating information from a user’s listening history. But what if they didn’t have all of that information? How should Spotify generate a new playlist for a brand new user given minimal user information? Imagine a new user stumbles upon Spotify for the first time and plays a song. From this single song, is it possible to generate a likeable playlist?
Our project attempts to answer this question by developing such a "cold start" model, generating a new playlist for a user given just a single song.
After a brief literature review, this report will begin with an exploration of the dataset and its features. We will continue by explaining our methods, models, and forms of evaluation. The final sections of our project will present our results and conclusions with a brief discussion of possible extensions to our project.
Porter, “Spotify Is First to 100 Million Paid Subscribers.”
“Discover Weekly: How Spotify Is Changing the Way We Consume Music – Technology and Operations Management.”
Before diving into our project, we reviewed various sources of relevant literature, researching different tools that could help us accomplish our project goal. First, we learned as much as we could about Spotify’s internal recommendation system.
Spotify’s music recommendation system uses a combination of three different models when creating new playlists for existing users: collaborative filtering, natural language processing (NLP), and audio modeling. These three methods are described below.
Collaborative filtering is an algorithm that predicts a user’s interests based on the interests of similar peers. Spotify leverages this technique by using its wealth of customer listening data, identifying similar individuals to a target customer and creating a new playlist with songs enjoyed by those similar individuals. For example, say Spotify knows that target user X has listened to songs 3, 4, 9, and 10 while user Y has listened to songs 1, 2, 9, and 10. Using collaborative filtering, Spotify might add songs 1 and 2 to a new playlist for user X (Johnson, “Algorithmic Music Recommendations at Spotify.”).
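To make the idea concrete, here is a minimal sketch of neighborhood-based collaborative filtering on the toy example above. The listening histories and the use of Jaccard similarity are ours for illustration; this is not Spotify’s actual implementation.

```python
# Toy neighborhood-based collaborative filtering (illustration only).
# Each user is represented by the set of song IDs they have listened to.
histories = {
    "X": {3, 4, 9, 10},   # target user
    "Y": {1, 2, 9, 10},   # similar peer
    "Z": {5, 6, 7, 8},    # dissimilar peer
}

def jaccard(a, b):
    """Overlap between two listening histories, in [0, 1]."""
    return len(a & b) / len(a | b)

def recommend(target, histories):
    """Recommend songs the most-similar peer has heard but the target hasn't."""
    peers = {u: h for u, h in histories.items() if u != target}
    best = max(peers, key=lambda u: jaccard(histories[target], peers[u]))
    return sorted(peers[best] - histories[target])

print(recommend("X", histories))  # -> [1, 2], the songs user Y contributes for user X
```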
Spotify analyzes text from popular blog posts, internet data, and music publications to see which songs are being commonly discussed and whether songs are frequently discussed together. Using this text analysis, Spotify attempts to cluster descriptions and adjectives to get a sense of what people think of a given song or artist. Based on this information and past user information, Spotify attempts to predict new songs for a target user. For example, say target user X has listened to song 1. Based on NLP information, Spotify learns that song 1 is often discussed with song 2 and is often described as “dancey”. Therefore, Spotify might suggest song 2 in a new playlist for user X along with other songs that have been identified as “dancey” (Giacaglia, “Behind Spotify Recommendation Engine.”).
Using a Convolutional Neural Network, Spotify analyzes the raw audio of every song on the service, producing comparable characteristics like time signature, key, tempo, loudness, etc., by which to compare songs. Therefore, if Spotify knows a target user is listening to song 1 with certain audio features, it will suggest songs with similar audio features in a new playlist. These audio features can be accessed via the Spotify developer tools (Dieleman, “Recommending Music on Spotify with Deep Learning.”).
Based on this research, we decided to explore an approach similar to Spotify’s Audio Modeling techniques in our project. With only a single input song, we do not have enough information on our base user to use collaborative filtering in an effective way. We also don’t have the bandwidth or capabilities to perform NLP with the tools we have learned in CS109. However, Spotify’s song audio features provide easily-comparable metrics by which to identify similar songs given one input track.
Johnson, Chris. “Algorithmic Music Recommendations at Spotify.” SlideShare.
https://www.slideshare.net/MrChrisJohnson/algorithmic-music-recommendations-at-spotify/3-What_is_SpotifyOn_demand_music.
Giacaglia, Giuliano. “Behind Spotify Recommendation Engine.” Medium, May 21, 2019.
https://medium.com/datadriveninvestor/behind-spotify-recommendation-engine-a9b5a27a935.
Dieleman, Sander. “Recommending Music on Spotify with Deep Learning.” Accessed December 9, 2019.
https://benanne.github.io/2014/08/05/spotify-cnns.html.
Given our literature research, the goal of our data cleaning was to obtain Spotify audio features to be used for modeling in our project. This section describes that process.
This project uses the Spotify Songs dataset, a collection of one million playlists organized into 1000 .csv files. Each row of these .csv files is a song object, with the variable pid denoting the playlist ID and linking the corresponding song objects to their playlist; each song object also carries identifiers, including track_uri, artist_uri, and album_uri, which we discuss below.
We began with some preliminary analysis to explore how we could engineer the given features to our advantage. Upon initial inspection, we saw that we needed to clean three variables: track_uri, artist_uri, and album_uri. Originally, each of these fields carried a prefix of the form “spotify:track:” (or artist/album, respectively). The useful part of each field is the identifier that follows this prefix, since it is what the Spotify API expects in queries. We therefore wrote a function to strip off the prefix, reformatting these features to include only the useful part of the URI, as sketched below.
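A minimal sketch of this cleaning step, assuming the playlist .csv has been loaded into a pandas dataframe (the function name strip_uri and the file name are ours):

```python
import pandas as pd

def strip_uri(uri):
    """Reduce e.g. 'spotify:track:6rqhFgbbKwnb9MLmUQDhG6' to the bare ID,
    which is the part the Spotify API expects in queries."""
    return uri.split(":")[-1]  # the ID is the final colon-separated field

df = pd.read_csv("playlists_0.csv")  # hypothetical file name
for col in ["track_uri", "artist_uri", "album_uri"]:
    df[col] = df[col].apply(strip_uri)
```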
Using these URIs, we queried the Spotify API through the Spotipy library, extracting the audio features for each song in our dataset. The audio features we plan to examine most closely are acousticness, danceability, duration_ms, energy, instrumentalness, liveness, loudness, speechiness, and tempo.
When parsing the playlists, we occasionally hit a problem where the Spotify API did not return a complete audio feature list for specific song URIs. This prevented the entire playlist from being parsed correctly, so we handled it by skipping any playlist where the issue arose (see the sketch below). We do not believe this has any significant impact on the collection of songs we compiled, as we could not identify any systematic pattern or commonality among the problematic playlists.
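A sketch of the extraction and skipping logic using Spotipy (the client setup, batching, and the helper name playlist_audio_features are our assumptions; Spotipy’s audio_features call accepts up to 100 track IDs at a time and returns None for tracks it cannot resolve):

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Spotipy reads SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET from the environment.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

FEATURES = ["acousticness", "danceability", "duration_ms", "energy",
            "instrumentalness", "liveness", "loudness", "speechiness", "tempo"]

def playlist_audio_features(track_ids):
    """One feature dict per track, or None when the playlist should be skipped
    because the API returned an incomplete audio feature list."""
    rows = []
    for i in range(0, len(track_ids), 100):          # 100-ID batches
        batch = sp.audio_features(track_ids[i:i + 100])
        if batch is None or any(f is None for f in batch):
            return None                              # skip the whole playlist
        rows.extend({k: f[k] for k in FEATURES} for f in batch)
    return rows
```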
From our dataset of one million playlists, we parsed 30,000 and collected audio features for nearly 200,000 songs. We stopped at 30,000 playlists due to technical limitations and time constraints: by our estimation, parsing all one million playlists would have taken multiple days and more memory than our laptops could provide. Of course, by limiting the number of playlists we parsed, we limited the size of the music library we can recommend from. We decided that the first 30,000 playlists, mapping to 200,000 songs, provided a good enough distribution of music genres, types, and styles to offer interesting and novel recommendations to our users.
After successfully extracting audio features for each song in our dataset, we began by visualizing the distribution of these audio features. We then wrote a function to extract a list of the tracks in a playlist and another function to return summary statistics about the playlist’s audio features. The latter function uses the Spotify API to extract the audio features of each song in the playlist. Based on the audio features, we further explored the data by creating visualizations which are displayed in the following section. We also created visualizations based on the number of songs in a playlist and the duration (in minutes) of a playlist.
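A sketch of these two helpers, assuming the audio features have already been collected into a dataframe indexed by track ID so that the summary does not need to re-query the API (all names here are ours; FEATURES is as defined above):

```python
def playlist_tracks(df, pid):
    """List of track IDs belonging to one playlist in the parsed data."""
    return df.loc[df["pid"] == pid, "track_uri"].tolist()

def playlist_summary(df, pid, features_df):
    """Summary statistics of each audio feature across a playlist's songs."""
    tracks = playlist_tracks(df, pid)
    found = features_df.index.intersection(tracks)   # ignore unresolved tracks
    return features_df.loc[found, FEATURES].describe()
```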
We looked at all nine of the music features mentioned above as possible parameters for our song suggestion model. To look at the general distribution of features for our parsed songs, we constructed boxplots for each feature as well as obtained summary statistics for each.
| Statistic | Acousticness | Danceability | Duration_ms | Energy | Instrumentalness | Liveness | Loudness | Speechiness | Tempo |
|---|---|---|---|---|---|---|---|---|---|
| Mean | 0.324306 | 0.556000 | 244180.2 | 0.603160 | 0.181495 | 0.208895 | -8.999104 | 0.088199 | 120.487872 |
| StD | 0.340081 | 0.180091 | 128201.5 | 0.256615 | 0.323783 | 0.188702 | 5.159350 | 0.107022 | 29.685628 |
| Min | 0.000000 | 0.000000 | 305600 | 0.000000 | 0.000000 | 0.000000 | -60.000000 | 0.000000 | 0.000000 |
| Max | 0.996000 | 0.436000 | 527976.8 | 1.000000 | 1.000000 | 1.000000 | 3.744000 | 0.967000 | 247.963000 |
| 25% | 0.019700 | 0.569000 | 188352.8 | 0.422000 | 0.000000 | 0.096300 | -11.030000 | 0.035100 | 97.410000 |
| Median | 0.176000 | 0.691000 | 225801.5 | 0.642000 | 0.000207 | 0.128000 | -7.659000 | 0.047000 | 120.025000 |
| 75% | 0.617000 | 0.990000 | 274297 | 0.818000 | 0.163000 | 0.264000 | -5.528000 | 0.086400 | 138.976250 |
From these boxplots, we see that the distributions for acousticness, danceability, and energy are relatively wide. This indicates that the songs in our dataset are especially varied with respect to these three parameters, suggesting they could be significant when developing our model to generate song suggestions. The other parameters (instrumentalness, liveness, and speechiness) do not have as wide a distribution, indicating that it could be difficult to differentiate songs from one another using these features. We observed similar trends for duration, tempo, and loudness.
We also looked at the lengths of playlists in our parsed dataset, both in number of songs and in total duration in minutes. We found that playlists tend to contain around 50 songs (the median is 49) and to run roughly four hours in total. This gives us a good idea of how long our suggested playlists should be.
| Statistic | Playlist Song Count | Playlist Duration (min) |
|---|---|---|
| Mean | 66.346 | 259.661 |
| StD | 53.669 | 214.273 |
| Median | 49.0 | 190.373 |
| Max | 376 | 10584.563 |
| Min | 5 | 1.625 |
To address our question of how to generate a new playlist for a new user given minimal user information, we first considered the data we would use, the transformations we would need to apply, the output we wanted, and the input needed to predict the output before deciding on a model type. Rewording our problem statement, our model must predict a playlist given minimal user information.
Our data are unconventional: playlists of songs are technically categorical data, but each unique song is its own “category,” which makes traditional categorical transformations such as one-hot encoding uninterpretable. Using the Spotify API, we can extract audio features for each song (see Data and EDA). The audio features transform each song into a set of numerical features better suited for traditional prediction techniques. We selected the audio features acousticness, danceability, energy, instrumentalness, liveness, speechiness, and tempo to analyze. We chose these seven features because they are rarely omitted from the Spotify audio features data (an omitted audio feature appears as a 0 in the Spotify data) and because all of them except tempo are normalized values. The unnormalized tempo led to several model variations, which we discuss below.
As discussed in the Data and EDA section, our dataset contains one million playlists. To reduce the computational power needed and the Spotify API calls made, we decided to train our model on the first 30,000 playlists and test on randomly-selected seed songs.
The output is a playlist of n songs, which means our model should predict n songs. To do so, our transformed song data (the audio features) should also keep a reference back to its song. This reference will be used to display the predicted playlist.
The input needs to be something our model can predict songs from, but it must be minimal to match our problem statement. Given these constraints, we decided that our input would be a “seed song” along with a playlist length. The seed song takes the form of a Spotify URI so that we can extract its audio features. We also input a playlist length to simplify the model’s predictive task; a separate model could be written to predict the playlist’s length, but we did not find this necessary to address our problem statement.
With one seed song, a natural approach is to create a playlist from the n songs closest to the seed. This motivates our first set of models: k-Nearest Neighbors based. Our second set of models is a logical comparison to the first: k-Means Clustering based. Let us first define each type of model, then discuss the differences, and finally, present our constructed models.
k-NN models are supervised machine learning algorithms used for classification or regression. Typically, the k nearest neighbors are used to predict a classification (most frequently occurring class among the k) or regression (mean of the k nearest neighbors). The k-NN algorithm is an instance of lazy learning, which means that computation is deferred until the model must output a prediction (“Machine Learning Basics with the K-Nearest Neighbors Algorithm”). One variation of a k-NN model is the choice of distance metric. A proper distance metric d must satisfy three properties: identity (d(x, y) = 0 if and only if x = y, with d non-negative), symmetry (d(x, y) = d(y, x)), and the triangle inequality (d(x, z) ≤ d(x, y) + d(y, z)).
Euclidean distance is commonly the selected distance metric for k-NN. However, cosine similarity may be effective in some cases. Cosine similarity does not respect the above properties, but may be used as a distance metric if appropriately treated within the model. The choice of distance metric presents the first variation in our set of k-NN models. We create one set of models using euclidean distance and a second set using cosine similarity.
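For two audio-feature vectors $\mathbf{a}$ (the seed song) and $\mathbf{b}$ (a candidate song), cosine similarity is

$$\operatorname{sim}(\mathbf{a}, \mathbf{b}) \;=\; \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}.$$

The quantity $1 - \operatorname{sim}(\mathbf{a}, \mathbf{b})$ behaves like a distance but violates the triangle inequality, which is why cosine similarity must be treated specially within the model: we rank candidates by similarity in descending order rather than by distance in ascending order.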
Due to the distance metric, k-NN typically performs best when the data are all on the same scale (Dhiraj, “Difference between K-Nearest Neighbor(K-NN) and K-Means Clustering”). This limitation raises a concern about training our k-NN models on data that include tempo. The other six features are on a [0, 1] scale, but tempo varies greatly and is hard to normalize because it has no common base or top value. This presents a second variation in our set of k-NN models: include vs. omit tempo from the training data.
In our set of k-NN based models, instead of computing an averaged prediction from the k nearest neighbors, our prediction will be the k nearest neighbors. Furthermore, k = n (where n is one of the input parameters).
In the future, possible improvements include creating a weighted k-NN model. The songs could be weighted by popularity/occurrence in the training set, how close they are to the original song, or by some other feature we wish to include (More on this in the final section).
k-Means Clustering is an unsupervised machine learning algorithm which aims to split n observations (songs) into k clusters based on the nearest mean. k-Means is an NP-hard problem, meaning it is computationally difficult to find the best answer (“K-Means Clustering”). NP-hard problems are believed not to be solvable in polynomial time, though this has not been proven (“NP-Hardness”). However, efficient algorithms which converge to a local optimum exist.
k-Means can work well even when data is not on the same scale. To be consistent with our k-NN based models, we will construct a set of k-Means models which include vs. omit tempo from the training data.
Since we want our model to predict a playlist, we will have to slightly modify our k-Means model. To do so, we will construct a k-Means model, then we will predict the cluster the seed song belongs to. This cluster will become the selected cluster. A random sample of n songs will be drawn from the cluster and will create the predicted playlist. We randomly select the songs from the predicted cluster so as to better test the clustering functionality of the k-Means model.
In the future, we could implement a smarter criterion for selecting songs within a cluster, such as running a k-NN model within the cluster to select the n songs. This variation is known as the Rocchio algorithm (“Rocchio Algorithm”). We did not test this variation because it blurs the line between k-NN and k-Means, which we wish to keep clear for the sake of our model comparisons.
There are a few clear differences between the k-NN and k-Means algorithms. First, k-NN solves a classification or regression problem while k-Means solves a clustering problem. Second, as explained above, k-NN is a lazy learner while k-Means is an eager learner. A lazy learner saves computation until absolutely needed as we observe in k-NN. An eager learner has a training step which creates and fits a model. Finally, as discussed above, k-NN functions much better if the data is all on the same scale while k-Means does not have this same constraint (Dhiraj).
Comparing these two sets of models allows us to compare a human-tailored selection model (k-NN based) to a computer-driven selection model (k-Means Clustering based). We will use this comparison as a testing metric discussed in the Results section.
Our k-NN models have two variants: distance metric and tempo. This leads to four k-NN models in total: (1) euclidean distance with tempo included, (2) euclidean distance with tempo excluded, (3) cosine similarity with tempo included, and (4) cosine similarity with tempo excluded.
We also tested a normalized version of tempo but found that the playlists constructed with it typically consisted of songs missing the audio features we aim to test. Missing these audio features breaks our testing metric, making those playlists hard to compare to the other models’, so we decided to omit tempo normalization.
Here is our general k-NN model:
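(What follows is a condensed, illustrative sketch rather than our exact implementation; names such as knn_playlist and BASE_COLS are ours.)

```python
import numpy as np
import pandas as pd

BASE_COLS = ["acousticness", "danceability", "energy",
             "instrumentalness", "liveness", "speechiness"]

def knn_playlist(songs_df, seed_features, n, metric="euclidean", use_tempo=False):
    """Indices of the n songs in songs_df closest to the seed song."""
    cols = BASE_COLS + (["tempo"] if use_tempo else [])
    X = songs_df[cols].to_numpy(dtype=float)
    s = np.array([seed_features[c] for c in cols], dtype=float)

    if metric == "cosine":
        # Higher similarity means more alike, so sort descending
        sims = X @ s / (np.linalg.norm(X, axis=1) * np.linalg.norm(s) + 1e-12)
        order = np.argsort(sims)[::-1]
    else:
        # Smaller euclidean distance means more alike, so sort ascending
        dists = np.linalg.norm(X - s, axis=1)
        order = np.argsort(dists)
    return songs_df.index[order[:n]]
```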
Our general k-NN model is used to construct each of the four models. The inputs are the dataframe of songs to train the model on, the seed song to create a playlist from, the length of the desired playlist, the desired distance metric, and a boolean representing whether or not to include tempo in the model.
We first define the desired audio feature columns, with and without tempo. We then convert the seed song to a dataframe of its audio features. If the distance metric is cosine similarity, we compare each song to the seed song using cosine similarity and sort the values in descending order, since a higher cosine similarity means a song is more similar to the seed. If the metric is euclidean distance, we compare each song to the seed using euclidean distance and sort in ascending order, since a smaller distance means a song is closer to the seed. In both cases, we take the n best values, then extract and return their indices. From these indices, we can recover the corresponding songs by matching against the songs dataframe.
Using the indices returned from the k-NN model, we print out the playlist as follows:
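(A sketch of the playlist printer, reusing the Spotipy client sp and BASE_COLS from the sketches above; names are ours.)

```python
def print_playlist(indices, songs_df, sp):
    """Print a human-readable playlist and return its audio features."""
    rows = []
    for idx in indices:
        song = songs_df.loc[idx]
        info = sp.track(song["track_uri"])            # name + artists from the API
        artists = ", ".join(a["name"] for a in info["artists"])
        print(f"{info['name']} - {artists}")
        rows.append(song[BASE_COLS + ["tempo"]])      # numerical view for testing
    return pd.DataFrame(rows)
```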
Looping through the indices, we locate each song in the songs dataframe. Then, we use the track URI to access the Spotify API and extract human-readable information, which we print to the screen; this includes the track name and artists. Simultaneously, we append the track’s audio features to a dataframe so that we also have a numerical view of the playlist, which will be used in testing.
Now, we use the k-NN model and the playlist printer to define the four k-NN models:
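(The four models are thin wrappers around the general function, sketched here with our own naming.)

```python
def knn_euclidean_tempo(songs_df, seed, n):
    return knn_playlist(songs_df, seed, n, metric="euclidean", use_tempo=True)

def knn_euclidean_no_tempo(songs_df, seed, n):
    return knn_playlist(songs_df, seed, n, metric="euclidean", use_tempo=False)

def knn_cosine_tempo(songs_df, seed, n):
    return knn_playlist(songs_df, seed, n, metric="cosine", use_tempo=True)

def knn_cosine_no_tempo(songs_df, seed, n):
    return knn_playlist(songs_df, seed, n, metric="cosine", use_tempo=False)
```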
In order to quickly run and analyze all four models, we create a function which runs all of the models and creates a dataframe of the mean audio feature values for each predicted playlist.
We also add the seed song’s audio features to the mean dataframe to have a comparison value. Furthermore, we fill NaN values in the mean dataframe with 0 to maintain uniformity with Spotify’s functionality (Spotify returns 0 for audio features that are not recorded).
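A sketch along these lines (the dictionary of models and its keys are ours):

```python
KNN_MODELS = {
    "knn_euclid_tempo":    knn_euclidean_tempo,
    "knn_euclid_no_tempo": knn_euclidean_no_tempo,
    "knn_cos_tempo":       knn_cosine_tempo,
    "knn_cos_no_tempo":    knn_cosine_no_tempo,
}

def run_all_knn(songs_df, seed, n, sp):
    """Run all four k-NN models and collect mean audio features per playlist."""
    means = {name: print_playlist(model(songs_df, seed, n), songs_df, sp).mean()
             for name, model in KNN_MODELS.items()}
    means["seed"] = pd.Series(seed)   # the seed song's features, for comparison
    # Spotify reports unrecorded audio features as 0, so mirror that convention
    return pd.DataFrame(means).T.fillna(0)
```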
We generated sample output from run_all_knn using Michael Bublé’s “It’s Beginning to Look a Lot Like Christmas” as the seed song.
Next, we have our k-Means Clustering model. The set up is similar to the k-NN model. We first have a general k-Means model. Then, we use this function to construct the two separate models. Finally, we have a function which runs both models, prints the predicted playlists, and returns a means dataframe for analysis and testing purposes.
Here is our general k-Means Clustering model:
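(A condensed sketch using scikit-learn’s KMeans; the function name, the random_state, and returning the fitted model for reuse are our choices.)

```python
from sklearn.cluster import KMeans

def kmeans_playlist(k, songs_df, seed_features, n, use_tempo=False, model=None):
    """Indices of n songs sampled from the seed song's predicted cluster."""
    cols = BASE_COLS + (["tempo"] if use_tempo else [])
    X = songs_df[cols]
    if model is None:
        # Eager learner: the expensive fit happens once and is reused afterwards
        model = KMeans(n_clusters=k, random_state=109).fit(X)
    clusters = model.predict(X)
    seed_cluster = model.predict([[seed_features[c] for c in cols]])[0]
    candidates = songs_df.index[clusters == seed_cluster]
    # Random sampling deliberately tests the clustering itself
    chosen = pd.Series(candidates).sample(min(n, len(candidates)),
                                          random_state=109)
    return chosen.to_numpy(), model
```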
Our general k-Means Clustering model is used to construct both k-Means models. The inputs are the number k of clusters, the dataframe of songs to train the model on, the seed song to create a playlist from, the length of the desired playlist, a boolean representing whether or not to include tempo in the model, and the k-Means model itself if it has previously been generated. Unlike k-NN, we only need to train the model once, because k-Means Clustering is an eager learner.
We first define the desired audio feature columns, with or without tempo. If a k-Means model has already been fit, we do not train another one; otherwise, we train and fit a k-Means model. We then use the model to predict the clusters for all of the songs, convert the seed song to a dataframe of its audio features, and predict the seed song’s corresponding cluster. Finally, we randomly select n songs from the seed song’s predicted cluster and extract and return their indices. From these indices, we can find the corresponding songs by matching against the songs dataframe. We use the playlist printer defined above for the k-Means models as well.
Now, we use the k-Means Clustering model and the playlist printer to define the two k-Means models:
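(Sketches of the two wrappers, with k = 50 as in our results; KMEANS_TEMPO and KMEANS_NO_TEMPO are the module-level caches introduced just below.)

```python
def kmeans_tempo(songs_df, seed, n):
    global KMEANS_TEMPO
    idx, KMEANS_TEMPO = kmeans_playlist(50, songs_df, seed, n,
                                        use_tempo=True, model=KMEANS_TEMPO)
    return idx

def kmeans_no_tempo(songs_df, seed, n):
    global KMEANS_NO_TEMPO
    idx, KMEANS_NO_TEMPO = kmeans_playlist(50, songs_df, seed, n,
                                           use_tempo=False, model=KMEANS_NO_TEMPO)
    return idx
```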
To store and reuse the k-Means models, we create two global variables. If we want to refit the models, we call the reset_models function:
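(A sketch of the caches and reset_models; initializing the caches to None so the first call trains each model is our choice.)

```python
KMEANS_TEMPO = None      # fitted k-Means model with tempo, cached after first fit
KMEANS_NO_TEMPO = None   # fitted k-Means model without tempo

def reset_models():
    """Discard the cached fits so that the next call retrains both models."""
    global KMEANS_TEMPO, KMEANS_NO_TEMPO
    KMEANS_TEMPO = None
    KMEANS_NO_TEMPO = None
```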
In order to quickly run and analyze both k-Means models, we create a function, analogous to run_all_knn, which runs both models and creates a dataframe of the mean audio feature values for each predicted playlist.
We generated sample output from run_all_kmeans using Michael Bublé’s “It’s Beginning to Look a Lot Like Christmas” as the seed song.
In the next section, we will test and explore the results produced from using these six models.
K, Dhiraj. “Difference between K-Nearest Neighbor(K-NN) and K-Means Clustering.” Medium, Medium, 29 Aug. 2019,
medium.com/@dhiraj8899/difference-between-k-nearest-neighbor-k-nn-and-k-means-clustering-d9a44859182f.
“K-Means Clustering.” Wikipedia, Wikimedia Foundation, 3 Dec. 2019,
en.wikipedia.org/wiki/K-means_clustering.
“Machine Learning Basics with the K-Nearest Neighbors Algorithm.” Accessed December 9, 2019.
https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761.
“NP-Hardness.” Wikipedia, Wikimedia Foundation, 20 Nov. 2019,
https://en.wikipedia.org/wiki/NP-hardness.
“Rocchio Algorithm.” Wikipedia, Wikimedia Foundation, 3 Nov. 2019,
https://en.wikipedia.org/wiki/Rocchio_algorithm.
Once we had created our playlists using our k-NN (euclidean distance/cosine similarity and tempo/no tempo) and k-Means models (tempo/no tempo), we created a method of testing to evaluate the quality of our playlists.
We began by using all six models to generate a playlist of length 49 (the median playlist length in our songs dataset) from a single seed song. Then, we calculated the mean audio features of each generated playlist and subtracted them from the audio features of our seed song, obtaining residual-like values for each playlist. For each playlist this gave six residuals, one for each of the audio features acousticness, danceability, energy, instrumentalness, liveness, and speechiness. Squaring and summing these six residuals in the spirit of a sum of squared errors (SSE), we obtained a metric of how similar a playlist’s average song is to the seed song. As with SSE, the smaller this value, the better.
We decided not to consider tempo as a relevant audio feature for this calculation. Since half of our models were trained with tempo as a parameter and half were not, considering tempo as a residual would automatically make the performance of tempo-included models better than tempo-excluded models.
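Concretely, writing $\bar{x}_f(P)$ for the mean of audio feature $f$ over the songs in a generated playlist $P$, and $x_f(s)$ for the seed song’s value of that feature, our metric is

$$\operatorname{SSE}(P, s) \;=\; \sum_{f \in F} \bigl(\bar{x}_f(P) - x_f(s)\bigr)^2,$$

where $F$ is the set of six features acousticness, danceability, energy, instrumentalness, liveness, and speechiness.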
Given that the objective of our project is to generate a cold start playlist from a song chosen by a new user, we decided against randomly selecting seed songs. Since not all songs are equally likely to be stumbled upon, we instead hand-selected seed songs, attempting to pick a variety of songs across different genres and styles.
Before evaluating the SSE of our playlists, we wanted to visualize the mean audio feature differences for a few selected seed songs to see if we could predict the best model. To do so, we compared the mean audio features of the playlists generated by each model to the audio features of the seed song.
In the visualizations below, each color represents a model type. Stripes denote a model that does not use tempo. The orange bar is the seed song.
Here, we visualize each model’s mean audio features using “It’s Beginning to Look a Lot Like Christmas” as the seed song.
We see that the k-NN playlists’ mean audio features track the seed song very closely. This makes sense, since our k-NN models are more “human tailored.” The k-Means model is, unsurprisingly, not very good: its “computer selection” method can change rapidly, since the model simply draws mildly related songs from the seed song’s predicted cluster.
Let’s view a few more using seed songs in different genres.
By analyzing the above visualizations, we start to form a hypothesis that the k-NN model using euclidean distance and no tempo performs the best (blue bar with stripes). Now that we have an initial hypothesis, let us apply our testing metric.
Below is a table of sums of squared error results for all models for each of the tested seed songs.
Model | k-NN, euclidean distance, tempo included | k-NN, euclidean distance, tempo excluded | k-NN, cosine similarity, tempo included | k-NN, cosine similarity, tempo excluded | k-means clustering, k=50, tempo included | k-means clustering, k=50, tempo excluded |
---|---|---|---|---|---|---|
Señorita | 0.004108 | 0.000153 | 0.000804 | 0.000218 | 0.163915 | 0.134497
Yo-Yo Ma Cello Suite No. 1 in G Major | 0.007471 | 0.000957 | 0.008852 | 0.006980 | 1.112812 | 1.121155 |
Suge | 0.055340 | 0.000941 | 0.010914 | 0.018576 | 0.332391 | 0.391957 |
bad guy | 0.045853 | 0.011441 | 0.015858 | 0.014628 | 0.139678 | 0.144949 |
SICKO MODE | 0.016450 | 0.000360 | 0.011922 | 0.006341 | 0.278779 | 0.157548 |
Despacito | 0.009528 | 0.000553 | 0.009744 | 0.003939 | 0.107131 | 0.063538 |
Cyanide | 0.078642 | 0.000545 | 0.005271 | 0.005738 | 0.236618 | 0.263430 |
Buy U a Drank (Shawty Snappin') | 0.026162 | 0.001649 | 0.130888 | 0.078595 | 0.214501 | 0.177378 |
My Type | 0.002327 | 0.000053 | 0.000194 | 0.001166 | 0.169776 | 0.160391 |
DDU-DU DDU-DU | 0.000560 | 0.000362 | 0.012985 | 0.004150 | 0.250433 | 0.299767 |
King of Kings | 0.004246 | 0.000084 | 0.006912 | 0.046792 | 0.116599 | 0.100181 |
Autumn Leaves | 0.087358 | 0.002616 | 0.026214 | 0.002551 | 0.267342 | 0.289967 |
The Star Wars Theme | 0.065607 | 0.002017 | 0.007384 | 0.001583 | 0.992940 | 1.000243 |
The Marriage of Figaro Overture | 0.005025 | 0.000967 | 0.002097 | 0.000734 | 1.299728 | 1.348139 |
All I Want for Christmas is You | 0.015787 | 0.000386 | 0.005668 | 0.034819 | 0.115138 | 0.171720 |
Can't Take my Eyes off of You | 0.013111 | 0.000671 | 0.000980 | 0.001516 | 0.134122 | 0.075223 |
No Woman, No Cry | 0.020687 | 0.002071 | 0.015922 | 0.001827 | 0.346313 | 0.351701 |
William Shatner speaks Rocket Man | 0.002954 | 0.000439 | 0.003379 | 0.005049 | 0.456568 | 0.442602 |
Grenade | 0.005015 | 0.000149 | 0.005484 | 0.000146 | 0.138764 | 0.112630 |
Country Girl (Shake It For Me) | 0.001753 | 0.000153 | 0.002200 | 0.009898 | 0.200039 | 0.193183 |
Ransom | 0.008899 | 0.000216 | 0.002312 | 0.004695 | 0.272664 | 0.240512 |
Magnolia | 0.094286 | 0.000410 | 0.001734 | 0.000132 | 0.252121 | 0.267089 |
FACE | 0.020528 | 0.005342 | 0.051708 | 0.083133 | 0.272206 | 0.239928 |
Inside Out | 0.028865 | 0.000131 | 0.008655 | 0.002886 | 0.218657 | 0.239411 |
Lovebug | 0.005819 | 0.000600 | 0.059192 | 0.028596 | 0.103388 | 0.116032 |
My Shot | 0.004700 | 0.000320 | 0.000535 | 0.000413 | 0.179155 | 0.239149 |
Moments Passed | 0.004245 | 0.000423 | 0.028958 | 0.015484 | 0.132276 | 0.145516 |
Come Down | 0.028447 | 0.001110 | 0.020129 | 0.058544 | 0.180081 | 0.157483 |
Emaline | 0.010707 | 0.000139 | 0.000105 | 0.000044 | 0.237566 | 0.137893 |
Say It Ain't So | 0.010820 | 0.000138 | 0.000593 | 0.022434 | 0.202545 | 0.169659 |
Der Geist hilft unser Schwachheit auf | 0.039614 | 0.001006 | 0.004548 | 0.005444 | 0.698867 | 0.637459 |
Vossi Bop | 0.000341 | 0.029178 | 0.001237 | 0.006090 | 0.119955 | 0.166315 |
The Thrill is Gone | 0.002741 | 0.029912 | 0.058425 | 0.055224 | 0.147720 | 0.132937 |
How Far I'll Go | 0.000447 | 0.006570 | 0.005086 | 0.011233 | 0.219445 | 0.334911 |
My Oh My | 0.002860 | 0.000111 | 0.009427 | 0.002148 | 0.148766 | 0.160077 |
Posthumous Forgiveness | 0.038095 | 0.001118 | 0.007383 | 0.049842 | 0.230584 | 0.286806 |
Desperado | 0.059450 | 0.000149 | 0.002446 | 0.000684 | 0.733834 | 0.723697 |
Sir Duke | 0.004620 | 0.000589 | 0.013601 | 0.033783 | 0.043316 | 0.056422 |
Happy Birthday | 0.002366 | 0.000236 | 0.005628 | 0.003489 | 0.488781 | 0.392939 |
Star Spangled Banner | 0.006668 | 0.000621 | 0.000980 | 0.000450 | 1.085056 | 1.087717 |
God is a woman | 0.004181 | 0.000324 | 0.000130 | 0.023923 | 0.209001 | 0.141342 |
God Save the Queen (Queen) | 0.042107 | 0.002340 | 0.019933 | 0.022709 | 0.643322 | 0.944059 |
O Canada | 0.031179 | 0.002123 | 0.002932 | 0.001426 | 0.425895 | 0.529212 |
Kyrie (Mozart Requiem) | 0.021656 | 0.001580 | 0.004743 | 0.007289 | 0.963742 | 1.284673 |
SLOW DANCING IN THE DARK | 0.004440 | 0.000289 | 0.025079 | 0.010463 | 0.074102 | 0.141662 |
Mean It | 0.006078 | 0.000449 | 0.005177 | 0.000655 | 0.170453 | 0.203931 |
A Sea Symphony, Movt 2 | 0.064007 | 0.002698 | 0.010987 | 0.002346 | 1.212871 | 1.232197 |
Thais Meditation | 0.033841 | 0.000303 | 0.004186 | 0.000392 | 0.817475 | 0.730178 |
Bogoroditse Devo | 0.036477 | 0.000653 | 0.002229 | 0.001378 | 1.160927 | 1.141981 |
To better understand the above table, let us visualize the SSE for a selection of seed songs.
First, let’s look at the SSE for all of the models. We see that the k-Means models perform far worse than the k-NN models. Furthermore, the k-Means model does exceptionally poorly on Yo-Yo Ma's Cello Suite No. 1 in G Major as well as the Star Wars Main Title, both of which are more orchestral and instrumental. This exposes a flaw in the k-Means model: it cannot capture instrumental tastes.
Let's now separate the above visualization and look closely at the k-NN models and the k-Means models.
Observe how the k-NN model using euclidean distance without tempo has the lowest SSE values. This once again points to our conclusion that it is the best-performing model.
In the next section, we will draw conclusions from the results presented.
Below is a table of the average sums of squared error for all models across all tested seed songs, sorted from best to worst (smallest to largest error).
Model | Average SSE |
---|---|
k-NN, euclidean distance, tempo excluded | 0.002360 |
k-NN, cosine similarity, tempo included | 0.013097 |
k-NN, cosine similarity, tempo excluded | 0.014297 |
k-NN, euclidean distance, tempo included | 0.022173 |
k-means clustering, k=50, tempo included | 0.382007 |
k-means clustering, k=50, tempo excluded | 0.393498 |
Let us visualize the above table of average SSE values to better judge the testing metric.
Now that we have evaluative data on our playlist-generating models, we can draw some conclusions on which of our models are best suited to solve the cold start problem. We will compare classes of models across our three types, k-NN euclidean distance, k-NN cosine similarity, and k-Means clustering, and across our two tempo statuses, included and excluded.
Note that in tempo-included models, the underlying model has an extra term for tempo that must be minimized (an extra dimension in our euclidean distance, cosine similarity, or k-Means algorithm). Minimizing that extra term to get songs that are closer in tempo will affect how similar the other audio features are to our seed song. Note, again, that for the aforementioned reasons, tempo is unnormalized (see Methods and Models). We can expect the impact of tempo’s inclusion to vary across model types.
With that being said, our best overall model in terms of our evaluative metric is the k-NN model that excludes tempo as a predictor and uses euclidean distance. With an average SSE of 0.002360, this model’s playlist songs had relevant audio features that were closest to the seed song on average, outperforming the next best model (k-NN cosine similarity, no tempo) by 0.010737.
However, when tempo is introduced as a predictor, the k-NN euclidean distance model performs nearly 10 times worse than before. This can be attributed to the fact that tempo was unnormalized. Since euclidean distance is especially sensitive to differences in scale, it makes sense that accuracy drops so dramatically when tempo is included: the algorithm overcorrects, matching tempo at the cost of matching the other audio features.
Though, as mentioned, k-NN models using cosine similarity performed significantly worse than the k-NN euclidean distance model, it is worth noting that they are much less sensitive to the inclusion of the unstandardized tempo variable with an absolute difference in SSE of 0.0012. This makes sense as cosine similarity, in general, is less sensitive to the inclusion of many parameters on different scales than euclidean distance.
The clear worst of the three methods, as expected, is k-Means clustering. k-Means performs over an order of magnitude worse than our other models. To improve this model, we could cross-validate for a better k or find a more sophisticated way to choose songs within a cluster. The k-Means model we have now, however, does not seem like the best way to construct a cold start playlist. It is interesting that the overall similarity SSE score actually goes down with the inclusion of tempo for the k-Means clustering algorithm. This makes sense upon further analysis, as another degree of freedom on which to create clusters should make the clusters more similar within group.
Overall, our k-NN model using euclidean distance and excluding tempo performs the best in terms of producing songs that have similar audio features to the seed song.
If the goal is to generate a playlist with songs that have similar audio features and similar tempos, one would be best served choosing the k-NN model using cosine similarity with tempo included based on the metric we created. However, a better way to solve that problem would be to include a properly-normalized version of tempo as a residual in the loss function. If evaluating playlists wasn't so computationally expensive, we would have re-run our models using this separate loss function.
A future extension of this project could consider weighting certain songs over others. For example, this model might work better if we were able to obtain the genre of the input song. Using that information, we could up-weight songs of the same genre when computing euclidean distance (essentially penalizing songs outside the genre), producing playlists that are likely more thematically similar.
We also considered weighting by how frequently songs show up in playlists. This would be advantageous if we wanted our playlists to prioritize popular songs over less popular songs (or vice-versa). We could do this using existing data, counting how many playlists a given song shows up in. Unfortunately, we did not have the bandwidth to explore weighting given the time constraints of this project. However, this is certainly something to be considered in the future.
K, Dhiraj. “Difference between K-Nearest Neighbor(K-NN) and K-Means Clustering.” Medium, Medium, 29 Aug. 2019,
medium.com/@dhiraj8899/difference-between-k-nearest-neighbor-k-nn-and-k-means-clustering-d9a44859182f.
Dieleman, Sander. “Recommending Music on Spotify with Deep Learning.” Accessed December 9, 2019.
https://benanne.github.io/2014/08/05/spotify-cnns.html.
Giacaglia, Giuliano. “Behind Spotify Recommendation Engine.” Medium, May 21, 2019.
https://medium.com/datadriveninvestor/behind-spotify-recommendation-engine-a9b5a27a935.
Johnson, Chris. “Algorithmic Music Recommendations at Spotify.” SlideShare.
https://www.slideshare.net/MrChrisJohnson/algorithmic-music-recommendations-at-spotify/3-What_is_SpotifyOn_demand_music.
Porter, “Spotify Is First to 100 Million Paid Subscribers.”
“Discover Weekly: How Spotify Is Changing the Way We Consume Music – Technology and Operations Management.”
“K-Means Clustering.” Wikipedia, Wikimedia Foundation, 3 Dec. 2019,
en.wikipedia.org/wiki/K-means_clustering.
“Machine Learning Basics with the K-Nearest Neighbors Algorithm.” Accessed December 9, 2019.
https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761.
“NP-Hardness.” Wikipedia, Wikimedia Foundation, 20 Nov. 2019,
https://en.wikipedia.org/wiki/NP-hardness.
“Rocchio Algorithm.” Wikipedia, Wikimedia Foundation, 3 Nov. 2019,
https://en.wikipedia.org/wiki/Rocchio_algorithm.