#Let's load up some data from Spotify so that we can take a look at it!

import pandas as pd

spotify = pd.read_csv('https://bcdanl.github.io/data/spotify_all.csv')

spotify.info()
spotify

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198005 entries, 0 to 198004
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   pid            198005 non-null  int64 
 1   playlist_name  198005 non-null  object
 2   pos            198005 non-null  int64 
 3   artist_name    198005 non-null  object
 4   track_name     198005 non-null  object
 5   duration_ms    198005 non-null  int64 
 6   album_name     198005 non-null  object
dtypes: int64(3), object(4)
memory usage: 10.6+ MB

	pid	playlist_name	pos	artist_name	track_name	duration_ms	album_name
0	0	Throwbacks	0	Missy Elliott	Lose Control (feat. Ciara & Fat Man Scoop)	226863	The Cookbook
1	0	Throwbacks	1	Britney Spears	Toxic	198800	In The Zone
2	0	Throwbacks	2	Beyoncé	Crazy In Love	235933	Dangerously In Love (Alben für die Ewigkeit)
3	0	Throwbacks	3	Justin Timberlake	Rock Your Body	267266	Justified
4	0	Throwbacks	4	Shaggy	It Wasn't Me	227600	Hot Shot
...	...	...	...	...	...	...	...
198000	999998	✝️	6	Chris Tomlin	Waterfall	209573	Love Ran Red
198001	999998	✝️	7	Chris Tomlin	The Roar	220106	Love Ran Red
198002	999998	✝️	8	Crowder	Lift Your Head Weary Sinner (Chains)	224666	Neon Steeple
198003	999998	✝️	9	Chris Tomlin	We Fall Down	280960	How Great Is Our God: The Essential Collection
198004	999998	✝️	10	Caleb and Kelsey	10,000 Reasons / What a Beautiful Name	178189	10,000 Reasons / What a Beautiful Name

198005 rows × 7 columns


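To start, I want to see whether my favorite band is included in this DataFrame. I will do this in two different ways: first by setting an index and looking up a label, and then by filtering with the query() method. Both of the following code chunks return the same 483 rows; the only difference is that the index-based version moves artist_name out of the columns and into the index.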

#Indexing: first, set artist_name as the index and store the result in a new variable
spotify_name = spotify.set_index(['artist_name'])

#Now we just have to search for Coldplay!
spotify_name.loc[['Coldplay']]

	pid	playlist_name	pos	track_name	duration_ms	album_name
artist_name
Coldplay	31	Running 2.0	88	Hymn For The Weekend - Seeb Remix	212647	A Head Full Of Dreams Tour Edition
Coldplay	45	angst	43	The Scientist	309600	A Rush Of Blood To The Head
Coldplay	51	Kevin	27	Midnight	294666	Ghost Stories
Coldplay	85	Gym	1	Adventure Of A Lifetime	263786	A Head Full Of Dreams
Coldplay	85	Gym	2	Hymn For The Weekend	258826	A Head Full Of Dreams
...	...	...	...	...	...	...
Coldplay	999968	chill songs	89	A Sky Full of Stars	268466	Ghost Stories
Coldplay	999968	chill songs	90	See You Soon	171373	The Blue Room
Coldplay	999979	summer 2017	18	Paradise	278719	Mylo Xyloto
Coldplay	999984	Lake	9	Adventure Of A Lifetime	263786	A Head Full Of Dreams
Coldplay	999989	PARTAY	23	Hymn For The Weekend - Seeb Remix	212647	A Head Full Of Dreams Tour Edition

483 rows × 6 columns

#Filtering: use the query() method
coldplay = spotify.query("artist_name == 'Coldplay'")
coldplay

	pid	playlist_name	pos	artist_name	track_name	duration_ms	album_name
1732	31	Running 2.0	88	Coldplay	Hymn For The Weekend - Seeb Remix	212647	A Head Full Of Dreams Tour Edition
2755	45	angst	43	Coldplay	The Scientist	309600	A Rush Of Blood To The Head
2973	51	Kevin	27	Coldplay	Midnight	294666	Ghost Stories
4859	85	Gym	1	Coldplay	Adventure Of A Lifetime	263786	A Head Full Of Dreams
4860	85	Gym	2	Coldplay	Hymn For The Weekend	258826	A Head Full Of Dreams
...	...	...	...	...	...	...	...
196152	999968	chill songs	89	Coldplay	A Sky Full of Stars	268466	Ghost Stories
196153	999968	chill songs	90	Coldplay	See You Soon	171373	The Blue Room
196787	999979	summer 2017	18	Coldplay	Paradise	278719	Mylo Xyloto
196980	999984	Lake	9	Coldplay	Adventure Of A Lifetime	263786	A Head Full Of Dreams
197493	999989	PARTAY	23	Coldplay	Hymn For The Weekend - Seeb Remix	212647	A Head Full Of Dreams Tour Edition

483 rows × 7 columns
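To convince myself that the two approaches really select the same rows, here is a minimal sketch on a made-up toy DataFrame (the `toy` data and the variable names are assumptions for illustration, not part of the Spotify data). It also includes a plain boolean mask, which is a third way to write the same filter.

```python
import pandas as pd

# Toy stand-in for the Spotify data (values are made up for illustration).
toy = pd.DataFrame({
    "pid": [0, 0, 1, 1, 2],
    "artist_name": ["Coldplay", "Drake", "Coldplay", "Rihanna", "Drake"],
    "track_name": ["Yellow", "One Dance", "Clocks", "Umbrella", "Hotline Bling"],
})

# 1) Index-based lookup: artist_name becomes the index, so it is no longer a column.
by_index = toy.set_index("artist_name").loc[["Coldplay"]]

# 2) query(): keeps the original integer index and all columns.
by_query = toy.query("artist_name == 'Coldplay'")

# 3) Boolean mask: the same filter written as plain indexing.
by_mask = toy[toy["artist_name"] == "Coldplay"]

# Move artist_name back into the columns so the first result is comparable.
same_rows = (
    by_index.reset_index()[toy.columns]
    .reset_index(drop=True)
    .equals(by_query.reset_index(drop=True))
)
print(same_rows, by_query.equals(by_mask))
```

One practical difference: `.loc` raises a `KeyError` when the label is not in the index, while `query()` and the boolean mask simply return an empty DataFrame.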

Now, I’m curious to see how many times Coldplay appears in this particular dataset of 198,005 playlist entries.

I am also particularly interested in the number of unique track_names that appear within this DataFrame.

coldplay.shape[0] #there are 483 observations of Coldplay in this data!

spotify.shape[0]

483/198005 #Coldplay accounts for about 0.24% of all rows (track entries), well under 1%. I suppose there is a very specific playlist mood that calls for Coldplay.

0.002439332340092422
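The ratio above is a share of rows (track entries). If the question is instead what share of playlists contain Coldplay, one option is to count distinct pid values. This is only a sketch on made-up toy data; the `toy` DataFrame and the variable names are assumptions, and I have not run it on the real dataset.

```python
import pandas as pd

# Toy data: three playlists (pid 0, 1, 2) with eight track entries in total.
toy = pd.DataFrame({
    "pid": [0, 0, 0, 1, 1, 2, 2, 2],
    "artist_name": ["Coldplay", "Coldplay", "Drake", "Rihanna", "Drake",
                    "Coldplay", "Eminem", "Future"],
})

coldplay_rows = toy[toy["artist_name"] == "Coldplay"]

# Share of rows (track entries) that are Coldplay tracks.
row_share = len(coldplay_rows) / len(toy)

# Share of playlists that contain at least one Coldplay track.
playlist_share = coldplay_rows["pid"].nunique() / toy["pid"].nunique()

print(row_share, playlist_share)
```

The two numbers can differ a lot: a playlist with several Coldplay tracks counts several times in the first ratio but only once in the second.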

coldplay['track_name'].nunique() #74 unique tracks

I also thought it would be interesting to see the mean position at which Coldplay songs appear in these playlists (especially in comparison to the average position for all observations in the DataFrame).

coldplay['pos'].mean() #Around the 45th position!

44.79710144927536

spotify['pos'].mean()

54.39170728011919

Finally, let’s look at the length of these songs.

spotify['duration_ms'].mean()

234740.84469079063

coldplay['duration_ms'].mean() #Coldplay songs seem to be longer on average! Let's sort to see whether any tracks in particular stand out.

268694.7039337474
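The four separate mean() calls above can also be collected into one small comparison table. Here is a minimal sketch on made-up toy data (the `toy` DataFrame and the column labels "Coldplay" and "All tracks" are my own assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the Spotify data (values are made up for illustration).
toy = pd.DataFrame({
    "artist_name": ["Coldplay", "Coldplay", "Drake", "Rihanna"],
    "pos": [10, 30, 50, 70],
    "duration_ms": [300000, 260000, 200000, 220000],
})

cols = ["pos", "duration_ms"]

# One column of means for Coldplay rows, one for every row (Coldplay included).
comparison = pd.DataFrame({
    "Coldplay": toy.loc[toy["artist_name"] == "Coldplay", cols].mean(),
    "All tracks": toy[cols].mean(),
})
print(comparison)
```

Note that "All tracks" includes the Coldplay rows, which matches how the overall means are computed in this post.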

coldplay.sort_values(['duration_ms'], ascending = False)
#In this DataFrame, Coldplay track durations range from 465957 ms (longest) down to 136866 ms (shortest)

	pid	playlist_name	pos	artist_name	track_name	duration_ms	album_name
122515	1827	Now	0	Coldplay	O	465957	Ghost Stories
59456	881	EDC	31	Coldplay	Every Teardrop Is a Waterfall - Coldplay vs. S...	408828	Until Now
41930	631	house	11	Coldplay	Every Teardrop Is a Waterfall - Coldplay vs. S...	408828	Until Now
4861	85	Gym	3	Coldplay	Up&Up	405320	A Head Full Of Dreams
107839	1607	sleeps	29	Coldplay	Gravity	380946	Talk
...	...	...	...	...	...	...	...
196153	999968	chill songs	90	Coldplay	See You Soon	171373	The Blue Room
135923	999023	Pop songs	41	Coldplay	Life In Technicolor	149133	Viva La Vida Or Death And All His Friends
59961	887	Reception	61	Coldplay	Life In Technicolor	149133	Viva La Vida Or Death And All His Friends
9437	144	picks	25	Coldplay	U.F.O.	137819	Mylo Xyloto
44608	665	recommendations !!	48	Coldplay	Don't Panic	136866	Parachutes

483 rows × 7 columns
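Since the same track shows up once per playlist it appears in, the sorted output repeats tracks. A possible refinement, sketched here on a tiny toy DataFrame that reuses a few durations from the output above, is to keep one row per track and convert milliseconds to seconds. The `toy` data, the `duration_s` column, and the assumption that identical track names mean identical tracks are all mine.

```python
import pandas as pd

# A few (track_name, duration_ms) pairs taken from the output above; "O" is repeated on purpose.
toy = pd.DataFrame({
    "track_name": ["O", "Don't Panic", "O", "Up&Up"],
    "duration_ms": [465957, 136866, 465957, 405320],
})

# One row per track, assuming the same track_name always refers to the same recording.
unique_tracks = toy.drop_duplicates(subset="track_name")

# Convert to seconds for readability.
unique_tracks = unique_tracks.assign(duration_s=unique_tracks["duration_ms"] / 1000)

longest = unique_tracks.nlargest(1, "duration_s")
shortest = unique_tracks.nsmallest(1, "duration_s")
duration_range = unique_tracks["duration_s"].agg(["min", "max"])
print(duration_range)
```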

Looking back at this very brief exploration of the Coldplay observations within the Spotify DataFrame, I learned quite a bit from the data.

Coldplay accounts for a fairly small share of the track entries in the data (about 0.24%, well under 1%).
There was quite a range in the length of the tracks, from about 137 seconds to about 466 seconds.
On average, Coldplay tracks sit earlier in playlists (mean position of about 45) than tracks overall (about 54).

Second approach to the data: With Visualization using Seaborn

import seaborn as sns
import matplotlib.pyplot as plt

#Use pandas to identify the ten artists with the most track appearances
sum_tracks_by_artist = (
    spotify
    .groupby('artist_name')
    .agg(sum_appearances = ('track_name', 'count'))
    .reset_index()
)
#keep = 'all' keeps every artist tied at the cutoff, so ties could return more than 10 rows
top_10 = sum_tracks_by_artist.nlargest(10, 'sum_appearances', keep = 'all')
top_10_artists = top_10['artist_name'].tolist()
top_10_artists
#Next, filter the DataFrame so it includes only the artists in this top-10 list.

['Drake',
 'Kanye West',
 'Kendrick Lamar',
 'Rihanna',
 'The Weeknd',
 'Future',
 'Eminem',
 'Lil Uzi Vert',
 'Ed Sheeran',
 'The Chainsmokers']
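As a side note, a shorter way to get a similar ranking is value_counts(), which counts rows per artist and sorts them in descending order. This is a sketch on made-up toy data (the `toy` DataFrame and the `top_2` names are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the Spotify data (values are made up for illustration).
toy = pd.DataFrame({
    "artist_name": ["Drake", "Drake", "Drake", "Rihanna", "Rihanna", "Eminem"],
    "track_name": ["a", "b", "c", "d", "e", "f"],
})

# value_counts() returns appearances per artist, largest first.
top_2 = toy["artist_name"].value_counts().head(2)
top_2_artists = top_2.index.tolist()
print(top_2_artists)
```

Unlike nlargest() with keep = 'all', head() cuts off at exactly the requested number of rows, so artists tied at the cutoff may be dropped.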

Now, given the list of the top artists in this dataset, we can examine where they are placed in playlists (the pos variable). We can visualize this with histograms, one per artist. Every panel shares the same count axis on the left, which is simply the number of occurrences at a given position in the playlist.

The only catch is that I was unable to arrange the panels by number of appearances; when creating the histograms, the order reverted to the order in which the artists first appear in the DataFrame.

spotify_top_10 = spotify[spotify['artist_name'].isin(top_10_artists)]
spotify_top_10

(
 sns.FacetGrid(
       data = spotify_top_10,
       col = 'artist_name',
       hue = 'artist_name'
       )
 .map(sns.histplot, 'pos')
 )

Keep in mind the actual ranking of the artists by number of appearances: ‘Drake’, ‘Kanye West’, ‘Kendrick Lamar’, ‘Rihanna’, ‘The Weeknd’, ‘Future’, ‘Eminem’, ‘Lil Uzi Vert’, ‘Ed Sheeran’, ‘The Chainsmokers’.

In every one of these histograms, the distribution of playlist positions is right-skewed: most appearances of these artists fall near the start of a playlist. Some right skew is expected for any artist, because every playlist has early positions while only long playlists have late ones. Still, the pattern makes sense: these are the 10 artists with the most appearances in the data, so they are more likely to be recognized and thought of early in the creation of a playlist, or added early by an algorithm.

A Brief Look at Spotify


As always, thanks for reading!