Welcome! Today, we will refine our Billboard Christmas dataset, preparing it for data visualization. Start by loading the dataset into a Pandas DataFrame. This step will set a strong foundation for data cleaning by giving us a preview of the dataset's structure.
First, let's double-check the structure of our dataset:
Python1import pandas as pd 2 3df = pd.read_csv('billboard_christmas.csv') 4print(df.head())
The output of the above code will be:
Plain text1 url weekid ... month day 20 http://www.billboard.com/charts/hot-100/1958-1... 12/13/1958 ... 12 13 31 http://www.billboard.com/charts/hot-100/1958-1... 12/20/1958 ... 12 20 42 http://www.billboard.com/charts/hot-100/1958-1... 12/20/1958 ... 12 20 53 http://www.billboard.com/charts/hot-100/1958-1... 12/20/1958 ... 12 20 64 http://www.billboard.com/charts/hot-100/1958-1... 12/27/1958 ... 12 27 7 8[5 rows x 13 columns]
Take special note of the weekid
column. We'll be converting this into a datetime format to leverage datetime features in the next steps. Understanding data types will help us decode and work with data correctly.
Having a look at weekid
, let's convert it to a datetime format, which enables us to easily extract month and week details. Extracting these details will enhance your dataset with temporal features that can aid in identifying trends.
The following code snippet carries out these conversions:
Python1# Convert 'weekid' to datetime 2df['weekid'] = pd.to_datetime(df['weekid']) 3 4# Extract month and week of year from the date 5df['month'] = df['weekid'].dt.month 6df['week_of_year'] = df['weekid'].dt.isocalendar().week 7 8# Create a boolean feature for December 9df['is_december'] = df['month'] == 12 10 11# Print the head of the dataframe to see new columns 12print(df[['weekid', 'month', 'week_of_year', 'is_december']].head())
The output of the above code will be:
Plain text1[5 rows x 13 columns] 2 weekid month week_of_year is_december 30 1958-12-13 12 50 True 41 1958-12-20 12 51 True 52 1958-12-20 12 51 True 63 1958-12-20 12 51 True 74 1958-12-27 12 52 True
By converting weekid
and using .dt.month
and .dt.isocalendar().week
, we enrich the dataset with new dimensions for identifying seasonal patterns. The is_december
feature efficiently flags entries that occur in December, pivotal for holiday-focused analysis.
Ensuring data quality is crucial before any analysis. Let's assess missing values and potential duplicate records. Pandas offers methods to do this quickly:
Python1# Check for missing values 2print("Missing values by column:") 3print(df.isnull().sum()) 4 5# Check for duplicate rows 6print("\nDuplicate rows:", df.duplicated().sum())
The output of the above code will be:
Plain text1Missing values by column: 2url 0 3weekid 0 4week_position 0 5song 0 6performer 0 7songid 0 8instance 0 9previous_week_position 108 10peak_position 0 11weeks_on_chart 0 12year 0 13month 0 14day 0 15week_of_year 0 16is_december 0 17dtype: int64 18 19Duplicate rows: 0
This result indicates that our dataset has a few missing values for the previous_week_position
column and no duplicate rows, eliminating common data quality concerns and simplifying the subsequent analysis steps.
Next, let's ensure the uniformity of our text data for consistent results in analysis and visualization. We'll perform two main text standardization steps on song and performer names: removing extra spaces and converting the text to title case.
Python1# Strip whitespace and capitalize the first letter of each word in 'song' and 'performer' 2df['song'] = df['song'].str.strip().str.title() 3df['performer'] = df['performer'].str.strip().str.title() 4 5# Print the first few rows to see changes 6print(df[['song', 'performer']].head())
By using the .str.strip()
method, we eliminate any unnecessary spaces at the beginning or end of the text, and with .str.title()
, we ensure that each word starts with a capital letter. The output of the above code will be:
Plain text1 song performer 20 Run Rudolph Run Chuck Berry 31 Jingle Bell Rock Bobby Helms 42 Run Rudolph Run Chuck Berry 53 White Christmas Bing Crosby 64 Green Chri$Tma$ Stan Freberg
This transformation not only standardizes the format but also enhances readability and consistency, which is crucial for accurate text-based analyses or visualizations.
The final step is to save the cleaned dataset. A clean dataset will facilitate analysis and visualization and ensure reproducibility of results:
Python1# Create a clean copy and save to a new CSV file 2df_clean = df.copy() 3df_clean.to_csv('billboard_christmas_clean.csv', index=False) 4 5print("Data saved to 'billboard_christmas_clean.csv'")
The output of the above code will be:
Python1Data saved to 'billboard_christmas_clean.csv'
This message confirms that the cleaned dataset has been successfully saved to a new file, marking the completion of the data preparation phase and ensuring our data is ready for detailed analysis and visualization.
Great job! You've learned how to refine the billboard_christmas.csv
dataset, transforming raw data into a structured form, ripe for further analysis and visualization. This hands-on experience with Pandas strengthens your foundational data preparation skills and sets you up for success in the upcoming lessons. Practice with these tasks will solidify your understanding and enable fluid handling of similar datasets. Embrace these skills as you continue your journey in data engineering and analytics!