Lesson 2
Data Cleaning and Preparation with Billboard Christmas Dataset
Loading the Dataset and Data Type Assessment

Welcome! Today, we will refine our Billboard Christmas dataset, preparing it for data visualization. Start by loading the dataset into a Pandas DataFrame. This step will set a strong foundation for data cleaning by giving us a preview of the dataset's structure.

First, let's double-check the structure of our dataset:

Python
1import pandas as pd 2 3df = pd.read_csv('billboard_christmas.csv') 4print(df.head())

The output of the above code will be:

Plain text
1 url weekid ... month day 20 http://www.billboard.com/charts/hot-100/1958-1... 12/13/1958 ... 12 13 31 http://www.billboard.com/charts/hot-100/1958-1... 12/20/1958 ... 12 20 42 http://www.billboard.com/charts/hot-100/1958-1... 12/20/1958 ... 12 20 53 http://www.billboard.com/charts/hot-100/1958-1... 12/20/1958 ... 12 20 64 http://www.billboard.com/charts/hot-100/1958-1... 12/27/1958 ... 12 27 7 8[5 rows x 13 columns]

Take special note of the weekid column. We'll be converting this into a datetime format to leverage datetime features in the next steps. Understanding data types will help us decode and work with data correctly.

Date Conversion and Feature Creation

Having a look at weekid, let's convert it to a datetime format, which enables us to easily extract month and week details. Extracting these details will enhance your dataset with temporal features that can aid in identifying trends.

The following code snippet carries out these conversions:

Python
1# Convert 'weekid' to datetime 2df['weekid'] = pd.to_datetime(df['weekid']) 3 4# Extract month and week of year from the date 5df['month'] = df['weekid'].dt.month 6df['week_of_year'] = df['weekid'].dt.isocalendar().week 7 8# Create a boolean feature for December 9df['is_december'] = df['month'] == 12 10 11# Print the head of the dataframe to see new columns 12print(df[['weekid', 'month', 'week_of_year', 'is_december']].head())

The output of the above code will be:

Plain text
1[5 rows x 13 columns] 2 weekid month week_of_year is_december 30 1958-12-13 12 50 True 41 1958-12-20 12 51 True 52 1958-12-20 12 51 True 63 1958-12-20 12 51 True 74 1958-12-27 12 52 True

By converting weekid and using .dt.month and .dt.isocalendar().week, we enrich the dataset with new dimensions for identifying seasonal patterns. The is_december feature efficiently flags entries that occur in December, pivotal for holiday-focused analysis.

Data Quality Checks

Ensuring data quality is crucial before any analysis. Let's assess missing values and potential duplicate records. Pandas offers methods to do this quickly:

Python
1# Check for missing values 2print("Missing values by column:") 3print(df.isnull().sum()) 4 5# Check for duplicate rows 6print("\nDuplicate rows:", df.duplicated().sum())

The output of the above code will be:

Plain text
1Missing values by column: 2url 0 3weekid 0 4week_position 0 5song 0 6performer 0 7songid 0 8instance 0 9previous_week_position 108 10peak_position 0 11weeks_on_chart 0 12year 0 13month 0 14day 0 15week_of_year 0 16is_december 0 17dtype: int64 18 19Duplicate rows: 0

This result indicates that our dataset has a few missing values for the previous_week_position column and no duplicate rows, eliminating common data quality concerns and simplifying the subsequent analysis steps.

Standardizing Text Data

Next, let's ensure the uniformity of our text data for consistent results in analysis and visualization. We'll perform two main text standardization steps on song and performer names: removing extra spaces and converting the text to title case.

Python
1# Strip whitespace and capitalize the first letter of each word in 'song' and 'performer' 2df['song'] = df['song'].str.strip().str.title() 3df['performer'] = df['performer'].str.strip().str.title() 4 5# Print the first few rows to see changes 6print(df[['song', 'performer']].head())

By using the .str.strip() method, we eliminate any unnecessary spaces at the beginning or end of the text, and with .str.title(), we ensure that each word starts with a capital letter. The output of the above code will be:

Plain text
1 song performer 20 Run Rudolph Run Chuck Berry 31 Jingle Bell Rock Bobby Helms 42 Run Rudolph Run Chuck Berry 53 White Christmas Bing Crosby 64 Green Chri$Tma$ Stan Freberg

This transformation not only standardizes the format but also enhances readability and consistency, which is crucial for accurate text-based analyses or visualizations.

Saving the Cleaned Dataset

The final step is to save the cleaned dataset. A clean dataset will facilitate analysis and visualization and ensure reproducibility of results:

Python
1# Create a clean copy and save to a new CSV file 2df_clean = df.copy() 3df_clean.to_csv('billboard_christmas_clean.csv', index=False) 4 5print("Data saved to 'billboard_christmas_clean.csv'")

The output of the above code will be:

Python
1Data saved to 'billboard_christmas_clean.csv'

This message confirms that the cleaned dataset has been successfully saved to a new file, marking the completion of the data preparation phase and ensuring our data is ready for detailed analysis and visualization.

Lesson Summary

Great job! You've learned how to refine the billboard_christmas.csv dataset, transforming raw data into a structured form, ripe for further analysis and visualization. This hands-on experience with Pandas strengthens your foundational data preparation skills and sets you up for success in the upcoming lessons. Practice with these tasks will solidify your understanding and enable fluid handling of similar datasets. Embrace these skills as you continue your journey in data engineering and analytics!

Enjoy this lesson? Now it's time to practice with Cosmo!
Practice is how you turn knowledge into actual skills.