# Let's load up some data from Spotify so that we can take a look at it!
import pandas as pd
spotify = pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv')
spotify.info()
spotify
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198005 entries, 0 to 198004
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pid 198005 non-null int64
1 playlist_name 198005 non-null object
2 pos 198005 non-null int64
3 artist_name 198005 non-null object
4 track_name 198005 non-null object
5 duration_ms 198005 non-null int64
6 album_name 198005 non-null object
dtypes: int64(3), object(4)
memory usage: 10.6+ MB
| | pid | playlist_name | pos | artist_name | track_name | duration_ms | album_name |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Throwbacks | 0 | Missy Elliott | Lose Control (feat. Ciara & Fat Man Scoop) | 226863 | The Cookbook |
| 1 | 0 | Throwbacks | 1 | Britney Spears | Toxic | 198800 | In The Zone |
| 2 | 0 | Throwbacks | 2 | Beyoncé | Crazy In Love | 235933 | Dangerously In Love (Alben für die Ewigkeit) |
| 3 | 0 | Throwbacks | 3 | Justin Timberlake | Rock Your Body | 267266 | Justified |
| 4 | 0 | Throwbacks | 4 | Shaggy | It Wasn't Me | 227600 | Hot Shot |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 198000 | 999998 | ✝️ | 6 | Chris Tomlin | Waterfall | 209573 | Love Ran Red |
| 198001 | 999998 | ✝️ | 7 | Chris Tomlin | The Roar | 220106 | Love Ran Red |
| 198002 | 999998 | ✝️ | 8 | Crowder | Lift Your Head Weary Sinner (Chains) | 224666 | Neon Steeple |
| 198003 | 999998 | ✝️ | 9 | Chris Tomlin | We Fall Down | 280960 | How Great Is Our God: The Essential Collection |
| 198004 | 999998 | ✝️ | 10 | Caleb and Kelsey | 10,000 Reasons / What a Beautiful Name | 178189 | 10,000 Reasons / What a Beautiful Name |
198005 rows × 7 columns
To start, I want to see if my favorite band is included in this DataFrame. I will do this in two different ways: first by indexing (and searching), and then by filtering. Both of the following code chunks return the same 483 rows; the only difference is that the indexing version moves artist_name out of the columns and into the index.
# Indexing: first, set artist_name as the index and store the result in a new variable
spotify_name = spotify.set_index(['artist_name'])
# Now we just have to search for Coldplay!
spotify_name.loc[['Coldplay']]
| artist_name | pid | playlist_name | pos | track_name | duration_ms | album_name |
|---|---|---|---|---|---|---|
| Coldplay | 31 | Running 2.0 | 88 | Hymn For The Weekend - Seeb Remix | 212647 | A Head Full Of Dreams Tour Edition |
| Coldplay | 45 | angst | 43 | The Scientist | 309600 | A Rush Of Blood To The Head |
| Coldplay | 51 | Kevin | 27 | Midnight | 294666 | Ghost Stories |
| Coldplay | 85 | Gym | 1 | Adventure Of A Lifetime | 263786 | A Head Full Of Dreams |
| Coldplay | 85 | Gym | 2 | Hymn For The Weekend | 258826 | A Head Full Of Dreams |
| ... | ... | ... | ... | ... | ... | ... |
| Coldplay | 999968 | chill songs | 89 | A Sky Full of Stars | 268466 | Ghost Stories |
| Coldplay | 999968 | chill songs | 90 | See You Soon | 171373 | The Blue Room |
| Coldplay | 999979 | summer 2017 | 18 | Paradise | 278719 | Mylo Xyloto |
| Coldplay | 999984 | Lake | 9 | Adventure Of A Lifetime | 263786 | A Head Full Of Dreams |
| Coldplay | 999989 | PARTAY | 23 | Hymn For The Weekend - Seeb Remix | 212647 | A Head Full Of Dreams Tour Edition |
483 rows × 6 columns
# Filtering: use the query() method
coldplay = spotify.query("artist_name == 'Coldplay'")
coldplay
| | pid | playlist_name | pos | artist_name | track_name | duration_ms | album_name |
|---|---|---|---|---|---|---|---|
| 1732 | 31 | Running 2.0 | 88 | Coldplay | Hymn For The Weekend - Seeb Remix | 212647 | A Head Full Of Dreams Tour Edition |
| 2755 | 45 | angst | 43 | Coldplay | The Scientist | 309600 | A Rush Of Blood To The Head |
| 2973 | 51 | Kevin | 27 | Coldplay | Midnight | 294666 | Ghost Stories |
| 4859 | 85 | Gym | 1 | Coldplay | Adventure Of A Lifetime | 263786 | A Head Full Of Dreams |
| 4860 | 85 | Gym | 2 | Coldplay | Hymn For The Weekend | 258826 | A Head Full Of Dreams |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 196152 | 999968 | chill songs | 89 | Coldplay | A Sky Full of Stars | 268466 | Ghost Stories |
| 196153 | 999968 | chill songs | 90 | Coldplay | See You Soon | 171373 | The Blue Room |
| 196787 | 999979 | summer 2017 | 18 | Coldplay | Paradise | 278719 | Mylo Xyloto |
| 196980 | 999984 | Lake | 9 | Coldplay | Adventure Of A Lifetime | 263786 | A Head Full Of Dreams |
| 197493 | 999989 | PARTAY | 23 | Coldplay | Hymn For The Weekend - Seeb Remix | 212647 | A Head Full Of Dreams Tour Edition |
483 rows × 7 columns
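To check that the two approaches really agree, here is a minimal sketch on a tiny made-up DataFrame (the rows are illustrative and not taken from the Spotify file). It also shows a third common option, a boolean mask, which is my own addition rather than something used above.

```python
import pandas as pd

# Tiny made-up sample that reuses a few column names from the Spotify data
toy = pd.DataFrame({
    'pid': [0, 0, 1, 2],
    'artist_name': ['Coldplay', 'Drake', 'Coldplay', 'Rihanna'],
    'track_name': ['Yellow', 'One Dance', 'Paradise', 'Umbrella'],
})

# 1) Index-based lookup: artist_name becomes the index
by_index = toy.set_index('artist_name').loc[['Coldplay']]

# 2) query() with a string expression
by_query = toy.query("artist_name == 'Coldplay'")

# 3) Boolean mask (an extra option, not used in the post)
by_mask = toy[toy['artist_name'] == 'Coldplay']

# query() and the mask should give identical DataFrames
print(by_query.equals(by_mask))

# The index-based result holds the same tracks, minus the artist_name column
print(by_index['track_name'].tolist() == by_query['track_name'].tolist())
```

The index-based version is handy for repeated lookups by artist, while query() and the mask keep the original row labels, which makes it easier to trace rows back to the full DataFrame.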
Now, I’m curious to see how many times Coldplay appears in this particular dataset. (Playlist IDs run up to 999998, but the file itself holds 198,005 rows, so it covers far fewer than 1 million playlists.)
I am also particularly interested in the number of unique track_names that appear within this DataFrame.
coldplay.shape[0]  # there are 483 observations of Coldplay in this data!
483
spotify.shape[0]
198005
483/198005  # Coldplay accounts for less than 1% of all rows (about 0.24%). I suppose there is a very specific mood of a playlist that calls for Coldplay.
0.002439332340092422
coldplay['track_name'].nunique()  # 74 unique tracks
74
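One caveat about the ratio 483/198005: it is a share of rows (track entries), not a share of playlists. A playlist-level share would compare distinct pid values instead. Here is a minimal sketch of the difference on made-up rows (the toy data and the variable names are assumptions for illustration only):

```python
import pandas as pd

# Made-up rows: 3 playlists, 6 track entries in total
toy = pd.DataFrame({
    'pid': [0, 0, 0, 1, 1, 2],
    'artist_name': ['Coldplay', 'Coldplay', 'Drake', 'Drake', 'Rihanna', 'Coldplay'],
})
coldplay_toy = toy.query("artist_name == 'Coldplay'")

# Share of rows: the same kind of ratio as 483/198005
row_share = len(coldplay_toy) / len(toy)

# Share of playlists that contain at least one Coldplay track
playlist_share = coldplay_toy['pid'].nunique() / toy['pid'].nunique()

print(f"row share: {row_share:.2f}, playlist share: {playlist_share:.2f}")
```

On the real data, the playlist-level version would be coldplay['pid'].nunique() / spotify['pid'].nunique(). I have not run that here, so I am not quoting a number for it.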
I also thought it would be interesting to see the mean position at which Coldplay songs appear in these playlists (especially in comparison to the average position for all observations in the DataFrame).
coldplay['pos'].mean()  # Around the 45th position!
44.79710144927536
spotify['pos'].mean()
54.39170728011919
Finally, let’s look at the length of these songs.
spotify['duration_ms'].mean()
234740.84469079063
coldplay['duration_ms'].mean()  # Coldplay songs seem to be longer! Let's see if any in particular stand out by sorting.
268694.7039337474
coldplay.sort_values(['duration_ms'], ascending = False)
# In this DataFrame, Coldplay track lengths range from 136866 ms to 465957 ms
| | pid | playlist_name | pos | artist_name | track_name | duration_ms | album_name |
|---|---|---|---|---|---|---|---|
| 122515 | 1827 | Now | 0 | Coldplay | O | 465957 | Ghost Stories |
| 59456 | 881 | EDC | 31 | Coldplay | Every Teardrop Is a Waterfall - Coldplay vs. S... | 408828 | Until Now |
| 41930 | 631 | house | 11 | Coldplay | Every Teardrop Is a Waterfall - Coldplay vs. S... | 408828 | Until Now |
| 4861 | 85 | Gym | 3 | Coldplay | Up&Up | 405320 | A Head Full Of Dreams |
| 107839 | 1607 | sleeps | 29 | Coldplay | Gravity | 380946 | Talk |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 196153 | 999968 | chill songs | 90 | Coldplay | See You Soon | 171373 | The Blue Room |
| 135923 | 999023 | Pop songs | 41 | Coldplay | Life In Technicolor | 149133 | Viva La Vida Or Death And All His Friends |
| 59961 | 887 | Reception | 61 | Coldplay | Life In Technicolor | 149133 | Viva La Vida Or Death And All His Friends |
| 9437 | 144 | picks | 25 | Coldplay | U.F.O. | 137819 | Mylo Xyloto |
| 44608 | 665 | recommendations !! | 48 | Coldplay | Don't Panic | 136866 | Parachutes |
483 rows × 7 columns
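Milliseconds are hard to read at a glance, so here is a small sketch that converts the two extreme durations from the sorted table into seconds and minutes:seconds. The helper name to_min_sec is my own and is not part of pandas.

```python
import pandas as pd

# The longest and shortest Coldplay durations from the table above, in milliseconds
durations_ms = pd.Series([465957, 136866], index=['longest', 'shortest'])

# Milliseconds -> seconds
durations_s = durations_ms / 1000


def to_min_sec(ms):
    """Format a duration in milliseconds as 'm:ss' (hypothetical helper)."""
    total_seconds = int(round(ms / 1000))
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


print(durations_s.round(1).to_dict())
print(durations_ms.map(to_min_sec).to_dict())
```

The same idea applies to the whole column, for example by dividing coldplay['duration_ms'] by 1000 before sorting or summarizing.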
Looking back at this brief look at the Coldplay observations within the Spotify DataFrame, I learned quite a bit from the data.
- Coldplay accounted for a small share of the rows (483 of 198,005, which is less than 1%).
- There was quite a range in the length of the tracks, from about 137 seconds to about 466 seconds.
- On average, Coldplay tracks come somewhat earlier in playlists (mean position of about 45) than tracks overall (about 54).
Second approach to the data: With Visualization using Seaborn
import seaborn as sns
import matplotlib.pyplot as plt
# Use pandas to identify the ten artists with the most tracks
sum_tracks_by_artist = spotify.groupby('artist_name').agg(sum_appearances = ('track_name', 'count')).reset_index()
top_10 = sum_tracks_by_artist.nlargest(10, 'sum_appearances', keep = 'all')
top_10_artists = top_10['artist_name'].tolist()
top_10_artists
# Now we just need to filter the DataFrame so it includes JUST the rows for artists in this top 10 list.
['Drake',
'Kanye West',
'Kendrick Lamar',
'Rihanna',
'The Weeknd',
'Future',
'Eminem',
'Lil Uzi Vert',
'Ed Sheeran',
'The Chainsmokers']
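Two details of the top 10 step are worth a quick sketch on made-up data: keep='all' can return more rows than requested when there is a tie at the cutoff, and value_counts() gives the same per-artist counts in one step. The artist labels 'A' through 'D' are placeholders, not real data.

```python
import pandas as pd

# Made-up rows: A has 3 tracks, B and C have 2 each, D has 1
toy = pd.DataFrame({
    'artist_name': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D'],
    'track_name': ['t1', 't2', 't3', 't4', 't5', 't6', 't7', 't8'],
})

# Same pattern as above: count rows per artist, then take the largest
counts = (
    toy.groupby('artist_name')
       .agg(sum_appearances=('track_name', 'count'))
       .reset_index()
)

# keep='all' keeps every artist tied at the cutoff, so this can exceed 2 rows
top_2_all = counts.nlargest(2, 'sum_appearances', keep='all')

# keep='first' returns exactly 2 rows and drops the later tied artist
top_2_first = counts.nlargest(2, 'sum_appearances', keep='first')

print(sorted(top_2_all['artist_name']))
print(sorted(top_2_first['artist_name']))

# value_counts() produces the same counts, already sorted from most to fewest
same_counts = toy['artist_name'].value_counts()
print(same_counts.to_dict())
```

In the Spotify data the list came back with exactly ten names, so no tie at the cutoff seems to have occurred there.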
Now, given the list of the top artists in this dataset, we can examine where they’re placed in playlists (the pos variable). We can visualize this with histograms, one per artist. All of them share the same count axis on the left, which is simply the number of occurrences at a given position in the playlist.
The one catch: I was unable to arrange the panels by number of tracks. When the histograms were created, the order reverted to the order in which the artists first appear in the DataFrame.
spotify_top_10 = spotify[spotify['artist_name'].isin(top_10_artists)]
spotify_top_10
(
    sns.FacetGrid(
        data = spotify_top_10,
        col = 'artist_name',
        hue = 'artist_name'
    )
    .map(sns.histplot, 'pos')
)
Keeping in mind the ACTUAL order of the artists: ‘Drake’, ‘Kanye West’, ‘Kendrick Lamar’, ‘Rihanna’, ‘The Weeknd’, ‘Future’, ‘Eminem’, ‘Lil Uzi Vert’, ‘Ed Sheeran’, ‘The Chainsmokers’
Based on these histograms, every artist’s distribution of playlist positions is right-skewed: most of their tracks sit near the start of a playlist, with a long tail toward later positions. In other words, music by these top 10 artists tends to appear early in a playlist. This makes sense, since these are the 10 artists with the most tracks across the playlists. Popular artists are more likely to be recognized and thought of early on when someone builds a playlist, or to be added early by an algorithm.
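The right-skew reading comes from eyeballing the histograms. One way to back it up with a number is the sample skewness of pos per artist, where a positive value indicates a longer right tail. Here is a minimal sketch on made-up positions (the rows and the placeholder artist labels are assumptions, not Spotify data):

```python
import pandas as pd

# Made-up positions: mostly early in the playlist, with one late outlier each
toy = pd.DataFrame({
    'artist_name': ['A'] * 6 + ['B'] * 6,
    'pos': [0, 1, 1, 2, 5, 40, 0, 0, 2, 3, 4, 60],
})

# Sample skewness of pos within each artist; positive means right-skewed
skew_by_artist = toy.groupby('artist_name')['pos'].apply(lambda s: s.skew())
print(skew_by_artist)
```

Applied to spotify_top_10, the same groupby would give one skewness value per artist to compare against the histograms.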