Data Cleaning and Preparation with Billboard Christmas Dataset

Lesson 2

Loading the Dataset and Data Type Assessment

Welcome! Today, we will refine our Billboard Christmas dataset, preparing it for data visualization. Start by loading the dataset into a Pandas DataFrame. This step will set a strong foundation for data cleaning by giving us a preview of the dataset's structure.

First, let's double-check the structure of our dataset:

Python
1import pandas as pd
2
3df = pd.read_csv('billboard_christmas.csv')
4print(df.head())

The output of the above code will be:

Plain text
1                                                 url      weekid  ...  month day
20  http://www.billboard.com/charts/hot-100/1958-1...  12/13/1958  ...     12  13
31  http://www.billboard.com/charts/hot-100/1958-1...  12/20/1958  ...     12  20
42  http://www.billboard.com/charts/hot-100/1958-1...  12/20/1958  ...     12  20
53  http://www.billboard.com/charts/hot-100/1958-1...  12/20/1958  ...     12  20
64  http://www.billboard.com/charts/hot-100/1958-1...  12/27/1958  ...     12  27
7
8[5 rows x 13 columns]

Take special note of the weekid column. We'll be converting this into a datetime format to leverage datetime features in the next steps. Understanding data types will help us decode and work with data correctly.

Date Conversion and Feature Creation

Having a look at weekid, let's convert it to a datetime format, which enables us to easily extract month and week details. Extracting these details will enhance your dataset with temporal features that can aid in identifying trends.

The following code snippet carries out these conversions:

Python
1# Convert 'weekid' to datetime
2df['weekid'] = pd.to_datetime(df['weekid'])
3
4# Extract month and week of year from the date
5df['month'] = df['weekid'].dt.month
6df['week_of_year'] = df['weekid'].dt.isocalendar().week
7
8# Create a boolean feature for December
9df['is_december'] = df['month'] == 12
10
11# Print the head of the dataframe to see new columns
12print(df[['weekid', 'month', 'week_of_year', 'is_december']].head())

The output of the above code will be:

Plain text
1[5 rows x 13 columns]
2      weekid  month  week_of_year  is_december
30 1958-12-13     12            50         True
41 1958-12-20     12            51         True
52 1958-12-20     12            51         True
63 1958-12-20     12            51         True
74 1958-12-27     12            52         True

By converting weekid and using .dt.month and .dt.isocalendar().week, we enrich the dataset with new dimensions for identifying seasonal patterns. The is_december feature efficiently flags entries that occur in December, pivotal for holiday-focused analysis.

Data Quality Checks

Ensuring data quality is crucial before any analysis. Let's assess missing values and potential duplicate records. Pandas offers methods to do this quickly:

Python
1# Check for missing values
2print("Missing values by column:")
3print(df.isnull().sum())
4
5# Check for duplicate rows
6print("\nDuplicate rows:", df.duplicated().sum())

The output of the above code will be:

Plain text
1Missing values by column:
2url                         0
3weekid                      0
4week_position               0
5song                        0
6performer                   0
7songid                      0
8instance                    0
9previous_week_position    108
10peak_position               0
11weeks_on_chart              0
12year                        0
13month                       0
14day                         0
15week_of_year                0
16is_december                 0
17dtype: int64
18
19Duplicate rows: 0

This result indicates that our dataset has a few missing values for the previous_week_position column and no duplicate rows, eliminating common data quality concerns and simplifying the subsequent analysis steps.

Standardizing Text Data

Next, let's ensure the uniformity of our text data for consistent results in analysis and visualization. We'll perform two main text standardization steps on song and performer names: removing extra spaces and converting the text to title case.

Python
1# Strip whitespace and capitalize the first letter of each word in 'song' and 'performer'
2df['song'] = df['song'].str.strip().str.title()
3df['performer'] = df['performer'].str.strip().str.title()
4
5# Print the first few rows to see changes
6print(df[['song', 'performer']].head())

By using the .str.strip() method, we eliminate any unnecessary spaces at the beginning or end of the text, and with .str.title(), we ensure that each word starts with a capital letter. The output of the above code will be:

Plain text
1               song     performer
20   Run Rudolph Run   Chuck Berry
31  Jingle Bell Rock   Bobby Helms
42   Run Rudolph Run   Chuck Berry
53   White Christmas   Bing Crosby
64   Green Chri$Tma$  Stan Freberg

This transformation not only standardizes the format but also enhances readability and consistency, which is crucial for accurate text-based analyses or visualizations.

Saving the Cleaned Dataset

The final step is to save the cleaned dataset. A clean dataset will facilitate analysis and visualization and ensure reproducibility of results:

Python
1# Create a clean copy and save to a new CSV file
2df_clean = df.copy()
3df_clean.to_csv('billboard_christmas_clean.csv', index=False)
4
5print("Data saved to 'billboard_christmas_clean.csv'")

The output of the above code will be:

Python
1Data saved to 'billboard_christmas_clean.csv'

This message confirms that the cleaned dataset has been successfully saved to a new file, marking the completion of the data preparation phase and ensuring our data is ready for detailed analysis and visualization.

Lesson Summary

Great job! You've learned how to refine the billboard_christmas.csv dataset, transforming raw data into a structured form, ripe for further analysis and visualization. This hands-on experience with Pandas strengthens your foundational data preparation skills and sets you up for success in the upcoming lessons. Practice with these tasks will solidify your understanding and enable fluid handling of similar datasets. Embrace these skills as you continue your journey in data engineering and analytics!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.