A Brief Look at Spotify

python
pandas
music
Author

Bryan Armpriest

Published

March 2, 2025

#Let's load up some data from Spotify so that we can take a look at it!

import pandas as pd

spotify = pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv')
spotify.info()
spotify
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198005 entries, 0 to 198004
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   pid            198005 non-null  int64 
 1   playlist_name  198005 non-null  object
 2   pos            198005 non-null  int64 
 3   artist_name    198005 non-null  object
 4   track_name     198005 non-null  object
 5   duration_ms    198005 non-null  int64 
 6   album_name     198005 non-null  object
dtypes: int64(3), object(4)
memory usage: 10.6+ MB
pid playlist_name pos artist_name track_name duration_ms album_name
0 0 Throwbacks 0 Missy Elliott Lose Control (feat. Ciara & Fat Man Scoop) 226863 The Cookbook
1 0 Throwbacks 1 Britney Spears Toxic 198800 In The Zone
2 0 Throwbacks 2 Beyoncé Crazy In Love 235933 Dangerously In Love (Alben für die Ewigkeit)
3 0 Throwbacks 3 Justin Timberlake Rock Your Body 267266 Justified
4 0 Throwbacks 4 Shaggy It Wasn't Me 227600 Hot Shot
... ... ... ... ... ... ... ...
198000 999998 ✝️ 6 Chris Tomlin Waterfall 209573 Love Ran Red
198001 999998 ✝️ 7 Chris Tomlin The Roar 220106 Love Ran Red
198002 999998 ✝️ 8 Crowder Lift Your Head Weary Sinner (Chains) 224666 Neon Steeple
198003 999998 ✝️ 9 Chris Tomlin We Fall Down 280960 How Great Is Our God: The Essential Collection
198004 999998 ✝️ 10 Caleb and Kelsey 10,000 Reasons / What a Beautiful Name 178189 10,000 Reasons / What a Beautiful Name

198005 rows × 7 columns


To start, I want to see if my favorite band is included in this DataFrame. I will do this two different ways: first by indexing (setting artist_name as the index and looking it up by label), and then by filtering. Both of the following code chunks return the same 483 rows; they differ only in whether artist_name ends up as the index or stays a regular column.

#Indexing: first set artist_name as the index and store the result in a new variable
spotify_name = spotify.set_index(['artist_name'])

#Now we just have to search for Coldplay!
spotify_name.loc[['Coldplay']]
pid playlist_name pos track_name duration_ms album_name
artist_name
Coldplay 31 Running 2.0 88 Hymn For The Weekend - Seeb Remix 212647 A Head Full Of Dreams Tour Edition
Coldplay 45 angst 43 The Scientist 309600 A Rush Of Blood To The Head
Coldplay 51 Kevin 27 Midnight 294666 Ghost Stories
Coldplay 85 Gym 1 Adventure Of A Lifetime 263786 A Head Full Of Dreams
Coldplay 85 Gym 2 Hymn For The Weekend 258826 A Head Full Of Dreams
... ... ... ... ... ... ...
Coldplay 999968 chill songs 89 A Sky Full of Stars 268466 Ghost Stories
Coldplay 999968 chill songs 90 See You Soon 171373 The Blue Room
Coldplay 999979 summer 2017 18 Paradise 278719 Mylo Xyloto
Coldplay 999984 Lake 9 Adventure Of A Lifetime 263786 A Head Full Of Dreams
Coldplay 999989 PARTAY 23 Hymn For The Weekend - Seeb Remix 212647 A Head Full Of Dreams Tour Edition

483 rows × 6 columns

#Filtering: use the query() method
coldplay = spotify.query("artist_name == 'Coldplay'")
coldplay
pid playlist_name pos artist_name track_name duration_ms album_name
1732 31 Running 2.0 88 Coldplay Hymn For The Weekend - Seeb Remix 212647 A Head Full Of Dreams Tour Edition
2755 45 angst 43 Coldplay The Scientist 309600 A Rush Of Blood To The Head
2973 51 Kevin 27 Coldplay Midnight 294666 Ghost Stories
4859 85 Gym 1 Coldplay Adventure Of A Lifetime 263786 A Head Full Of Dreams
4860 85 Gym 2 Coldplay Hymn For The Weekend 258826 A Head Full Of Dreams
... ... ... ... ... ... ... ...
196152 999968 chill songs 89 Coldplay A Sky Full of Stars 268466 Ghost Stories
196153 999968 chill songs 90 Coldplay See You Soon 171373 The Blue Room
196787 999979 summer 2017 18 Coldplay Paradise 278719 Mylo Xyloto
196980 999984 Lake 9 Coldplay Adventure Of A Lifetime 263786 A Head Full Of Dreams
197493 999989 PARTAY 23 Coldplay Hymn For The Weekend - Seeb Remix 212647 A Head Full Of Dreams Tour Edition

483 rows × 7 columns
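
To make the comparison between the two approaches concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data and its values are assumptions for illustration, not part of the Spotify file). It also includes a plain boolean mask, which is a third option not used in this post.

```python
import pandas as pd

# Toy stand-in for the Spotify data (values are made up for illustration)
toy = pd.DataFrame({
    "pid": [0, 0, 1, 2],
    "artist_name": ["Coldplay", "Drake", "Coldplay", "Rihanna"],
    "track_name": ["Yellow", "One Dance", "Paradise", "Umbrella"],
})

# Approach 1: move artist_name into the index, then look it up by label
by_index = toy.set_index("artist_name").loc[["Coldplay"]]

# Approach 2: keep the default index and filter with query()
by_query = toy.query("artist_name == 'Coldplay'")

# Approach 3 (not used in the post): a plain boolean mask
by_mask = toy[toy["artist_name"] == "Coldplay"]

# Put both results into the same layout before comparing them
same_rows = (
    by_index.reset_index()[toy.columns]
    .equals(by_query.reset_index(drop=True))
)
print(same_rows)
```

The rows match once the index is reset; the practical difference is that `query()` keeps the original row labels, which is handy when you want to trace a row back to the full DataFrame.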

Now, I’m curious to see how many times Coldplay appears in this particular dataset, which holds 198,005 track entries drawn from many different playlists.

I am also particularly interested in the number of unique track_names that appear within this DataFrame.

coldplay.shape[0] #there are 483 observations of Coldplay in this data!
483
spotify.shape[0]
198005
483/198005 #Coldplay accounts for about 0.24% of all rows (track entries), well under 1%. Note this is a share of rows, not of playlists. I suppose there is a very specific mood of a playlist that calls for Coldplay.
0.002439332340092422
coldplay['track_name'].nunique() #74 unique tracks
74
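
The ratio 483/198005 is a share of rows, where each row is one track in one playlist. A share of playlists would instead count distinct `pid` values. The sketch below shows the difference on a small made-up DataFrame (the `toy` data is an assumption for illustration); the same two lines should carry over to `spotify`, though I have not run them on the full data here.

```python
import pandas as pd

# Toy data: 4 playlists (pid 0-3), 7 track entries in total
toy = pd.DataFrame({
    "pid": [0, 0, 0, 1, 1, 2, 3],
    "artist_name": ["Coldplay", "Coldplay", "Drake", "Drake",
                    "Rihanna", "Coldplay", "Eminem"],
})

is_coldplay = toy["artist_name"] == "Coldplay"

# Share of rows: the mean of a boolean column is the fraction of True values
row_share = is_coldplay.mean()

# Share of playlists: distinct playlists containing Coldplay / all distinct playlists
playlist_share = toy.loc[is_coldplay, "pid"].nunique() / toy["pid"].nunique()

print(f"share of rows: {row_share:.3f}")
print(f"share of playlists: {playlist_share:.3f}")
```

The two numbers differ whenever an artist appears more than once in the same playlist, which is why the row share understates or overstates playlist coverage depending on the data.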

I also thought it would be interesting to see the mean position at which Coldplay songs sit in these playlists, especially compared with the average position across all observations in the DataFrame.

coldplay['pos'].mean() #Around the 45th position!
44.79710144927536
spotify['pos'].mean()
54.39170728011919

Finally, let’s look at the length of these songs.

spotify['duration_ms'].mean()
234740.84469079063
coldplay['duration_ms'].mean() #Coldplay songs are longer on average! Let's sort by duration to see which tracks stand out.
268694.7039337474
coldplay.sort_values(['duration_ms'], ascending = False)
#In this DataFrame, Coldplay track durations range from 136866 ms up to 465957 ms
pid playlist_name pos artist_name track_name duration_ms album_name
122515 1827 Now 0 Coldplay O 465957 Ghost Stories
59456 881 EDC 31 Coldplay Every Teardrop Is a Waterfall - Coldplay vs. S... 408828 Until Now
41930 631 house 11 Coldplay Every Teardrop Is a Waterfall - Coldplay vs. S... 408828 Until Now
4861 85 Gym 3 Coldplay Up&Up 405320 A Head Full Of Dreams
107839 1607 sleeps 29 Coldplay Gravity 380946 Talk
... ... ... ... ... ... ... ...
196153 999968 chill songs 90 Coldplay See You Soon 171373 The Blue Room
135923 999023 Pop songs 41 Coldplay Life In Technicolor 149133 Viva La Vida Or Death And All His Friends
59961 887 Reception 61 Coldplay Life In Technicolor 149133 Viva La Vida Or Death And All His Friends
9437 144 picks 25 Coldplay U.F.O. 137819 Mylo Xyloto
44608 665 recommendations !! 48 Coldplay Don't Panic 136866 Parachutes

483 rows × 7 columns
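
Milliseconds are hard to read at a glance, so here is a small sketch that converts them to seconds and to a minutes:seconds label. The helper `ms_to_min_sec` is a hypothetical name introduced for this example; the three durations are taken from the sorted output above.

```python
import pandas as pd

def ms_to_min_sec(ms: int) -> str:
    """Format a duration in milliseconds as M:SS, rounded to the nearest second."""
    total_seconds = (int(ms) + 500) // 1000  # integer rounding to the nearest second
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"

# Longest, second-longest, and shortest Coldplay durations from the output above
durations_ms = pd.Series([465957, 408828, 136866], name="duration_ms")

seconds = durations_ms / 1000               # plain seconds, as floats
formatted = durations_ms.map(ms_to_min_sec)  # readable M:SS labels

print(seconds.tolist())
print(formatted.tolist())
```

On the full data, the same `.map(ms_to_min_sec)` call could be applied to `coldplay['duration_ms']` to add a readable column.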

Looking back at my very brief look at Coldplay observations within the Spotify DataFrame, I learned quite a bit from the data.

  1. Coldplay accounted for a small share of the data: 483 of 198,005 rows, or about 0.24% of all track entries.
  2. There was quite a range in the length of the tracks, from about 137 seconds to about 466 seconds.
  3. On average, Coldplay songs sit earlier in playlists (around position 45) than songs overall (around position 54).

Second approach to the data: visualization using Seaborn

import seaborn as sns
import matplotlib.pyplot as plt
#Use pandas to identify the ten artists with the most tracks
sum_tracks_by_artist = spotify.groupby('artist_name').agg(sum_appearances = ('track_name', 'count')).reset_index()
top_10 = sum_tracks_by_artist.nlargest(10, 'sum_appearances', keep = 'all')
top_10_artists = top_10['artist_name'].tolist()
top_10_artists
#Next, we filter the DataFrame so that it includes only the artists in this top 10 list.
['Drake',
 'Kanye West',
 'Kendrick Lamar',
 'Rihanna',
 'The Weeknd',
 'Future',
 'Eminem',
 'Lil Uzi Vert',
 'Ed Sheeran',
 'The Chainsmokers']

Now, given the list of the top artists in this dataset, we can examine where their tracks are placed in playlists (the pos variable). We can visualize this with histograms, one panel per artist. All panels share the same Count axis on the left, which is simply the number of occurrences at a given position in a playlist.

One limitation: I was unable to order the panels by each artist's number of tracks. When the histograms are created, the panels instead follow the order in which the artists first appear in the DataFrame.

spotify_top_10 = spotify[spotify['artist_name'].isin(top_10_artists)]
spotify_top_10

(
 sns.FacetGrid(
       data = spotify_top_10,
       col = 'artist_name',
       hue = 'artist_name'
       )
 .map(sns.histplot, 'pos')
 )

Keep in mind the actual ranking of the artists, from most to fewest track appearances: ‘Drake’, ‘Kanye West’, ‘Kendrick Lamar’, ‘Rihanna’, ‘The Weeknd’, ‘Future’, ‘Eminem’, ‘Lil Uzi Vert’, ‘Ed Sheeran’, ‘The Chainsmokers’

In every one of these histograms, the distribution of playlist positions is right-skewed: tracks by these top 10 artists appear far more often near the start of a playlist than near the end. Some of that skew is expected for any artist, since every playlist has early positions while only long playlists have late ones. Still, the pattern makes sense. These are the artists with the most track appearances, so they are more likely to be recognized and thought of early when someone builds a playlist, or added early by an algorithm.
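
The skew can also be checked numerically instead of by eye. The sketch below computes the sample skewness of `pos` per artist with pandas' `groupby(...).skew()`; a positive value indicates a right-skewed distribution. The `toy` data is made up for illustration, and I have not run this on the full Spotify data.

```python
import pandas as pd

# Toy data: both artists have mostly early positions plus one late outlier
toy = pd.DataFrame({
    "artist_name": ["X"] * 5 + ["Y"] * 5,
    "pos": [0, 1, 2, 3, 40,
            0, 0, 1, 2, 30],
})

# Sample skewness of playlist position for each artist; > 0 means right-skewed
skew_by_artist = toy.groupby("artist_name")["pos"].skew()
print(skew_by_artist)
```

On the real data, the equivalent call would be `spotify_top_10.groupby('artist_name')['pos'].skew()`, and comparing it with `spotify['pos'].skew()` would show whether the top artists are more skewed than songs overall.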

As always, thanks for reading!