Lesson 1
Basic Data Cleaning with the Diamonds Dataset
Introduction to Data Cleaning

Hello! In this lesson, we will dive into the basic concepts of data cleaning using the Diamonds dataset from the seaborn library. Data cleaning is a crucial step in data preprocessing, ensuring that our data is ready for analysis by dealing with inconsistencies, errors, and missing values.

Data cleaning involves identifying and handling missing values, correcting errors, and ensuring consistency. By cleaning your data, you improve the quality of your analysis and the performance of machine learning models.

Quick Recap: Loading and Exploring

Let's quickly revisit how to load the dataset, explore its structure, and identify missing values. First, load the Diamonds dataset using the seaborn library:

Python
import seaborn as sns
import pandas as pd

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

View the first few rows to get an initial overview:

Python
print(diamonds.head())

Output:

Plain text
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75

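You can access a column using either diamonds['cut'] or diamonds.get('cut'). Both return the 'cut' column when it exists. If the column is missing, diamonds['cut'] raises a KeyError, while get returns None (or a default you supply), so check the result before calling methods such as head() on it.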

Python
print(diamonds['cut'].head())  # Or print(diamonds.get('cut').head())

Output:

Plain text
0      Ideal
1    Premium
2       Good
3    Premium
4       Good
Name: cut, dtype: category
Categories (5, object): ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']
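
The difference between the two access styles only shows up when the column is missing. Here is a minimal sketch of that case. It uses a small hand-built DataFrame as a stand-in for diamonds; the values and the misspelled column name 'colour' are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Tiny stand-in for the diamonds DataFrame (hypothetical values).
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut": ["Ideal", "Premium", "Good"],
})

# Bracket access raises KeyError when the column does not exist.
try:
    sample["colour"]
    bracket_raised = False
except KeyError:
    bracket_raised = True

# get() returns None (or the supplied default) instead of raising.
missing = sample.get("colour")
fallback = sample.get("colour", default="no such column")

print(bracket_raised)  # True
print(missing)         # None
print(fallback)        # no such column
```

Because get can return None, calling head() directly on its result fails for a missing column. An explicit `is None` check before using the result avoids that.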

Check the dimensions and basic statistics:

Python
print(diamonds.shape)
print(diamonds.describe())

Output:

Plain text
(53940, 10)
              carat         depth         table         price             x  \
count  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000
mean       0.797940     61.749405     57.457184   3932.799722      5.731157
std        0.474011      1.432621      2.234491   3989.439738      1.121761
min        0.200000     43.000000     43.000000    326.000000      0.000000
25%        0.400000     61.000000     56.000000    950.000000      4.710000
50%        0.700000     61.800000     57.000000   2401.000000      5.700000
75%        1.040000     62.500000     59.000000   5324.250000      6.540000
max        5.010000     79.000000     95.000000  18823.000000     10.740000

                  y             z
count  53940.000000  53940.000000
mean       5.734526      3.538734
std        1.142135      0.705699
min        0.000000      0.000000
25%        4.720000      2.910000
50%        5.710000      3.530000
75%        6.540000      4.040000
max       58.900000     31.800000
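
By default, describe() summarizes only the numeric columns, which is why categorical columns such as 'cut' do not appear in the statistics above. The sketch below shows one way to summarize categorical columns instead. It uses a small hand-built DataFrame with made-up values as a stand-in for diamonds.

```python
import pandas as pd

# Tiny stand-in for diamonds with two categorical columns (hypothetical values).
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29],
    "cut": pd.Categorical(["Ideal", "Premium", "Good", "Premium"]),
    "color": pd.Categorical(["E", "E", "E", "I"]),
})

# include=["category"] restricts the summary to categorical columns.
# The result reports count, unique, top (most frequent value), and freq.
categorical_summary = sample.describe(include=["category"])
print(categorical_summary)
```
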

Quick Recap: Identifying Missing Values

To identify missing values, use the isnull() method combined with sum(), which counts the null entries in each column:

Python
print(diamonds.isnull().sum())

This results in the following output:

Plain text
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64

The dataset has no missing values, so for demonstration we simulate one by setting the 'cut' value in the first row to None:

Python
diamonds.loc[0, 'cut'] = None
print(diamonds.isnull().sum())

The output now reflects the added null value:

Plain text
carat      0
cut        1
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64

The count for 'cut' is now 1, so isnull().sum() has detected the simulated missing value. The same check works for finding missing data in any column of the dataset.
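
The per-column counts tell you how many values are missing, but not which rows contain them. The sketch below shows one way to locate those rows. It uses a small hand-built DataFrame with made-up values as a stand-in for diamonds; `rows_with_missing` and `total_missing` are illustrative names.

```python
import pandas as pd

# Tiny stand-in for the diamonds DataFrame (hypothetical values).
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut": ["Ideal", "Premium", "Good"],
    "price": [326, 326, 327],
})

# Simulate a missing value in the first row, as in the lesson.
sample.loc[0, "cut"] = None

# Count missing values per column.
missing_per_column = sample.isnull().sum()

# Keep only the rows that contain at least one missing value.
rows_with_missing = sample[sample.isnull().any(axis=1)]

# Total number of missing cells in the whole DataFrame.
total_missing = int(sample.isnull().sum().sum())

print(missing_per_column)
print(rows_with_missing)
print(total_missing)  # 1
```
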

Handling Missing Values

There are several strategies to handle missing values, including dropping rows and filling in missing values. For simplicity, we'll focus on dropping rows with any null values.

To drop rows with missing values, we use the dropna() method:

Python
diamonds_cleaned = diamonds.dropna()
print(diamonds_cleaned.shape)

The output of the above code will be:

Plain text
(53939, 10)

This indicates we have successfully removed the row with the missing value, reducing our dataset from 53,940 rows to 53,939.

By default, dropna() removes every row that contains at least one null value and returns a new DataFrame; the original diamonds DataFrame is left unchanged. To confirm that no missing values remain in diamonds_cleaned, we check again:

Python
print(diamonds_cleaned.isnull().sum())

The output of the above code will be:

Plain text
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64

This confirms that there are no more missing values in our cleaned dataset, indicating a successful data cleaning process.
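
The other strategy mentioned at the start of this section, filling in missing values, keeps every row. The sketch below fills a missing 'cut' with the most frequent category. That choice of fill value is an assumption made for illustration, not a recommendation from this lesson, and the small DataFrame is a hand-built stand-in for diamonds.

```python
import pandas as pd

# Tiny stand-in for the diamonds DataFrame (hypothetical values).
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29],
    "cut": ["Ideal", "Premium", "Good", "Premium"],
})

# Simulate a missing value in the first row, as in the lesson.
sample.loc[0, "cut"] = None

# mode() ignores missing values; [0] takes the most frequent entry.
most_common_cut = sample["cut"].mode()[0]

# fillna() returns a new DataFrame; the dict limits the fill to 'cut'.
filled = sample.fillna({"cut": most_common_cut})

print(filled.shape)                  # (4, 2)
print(filled["cut"].isnull().sum())  # 0
```

Unlike dropna(), this keeps the row count unchanged, at the cost of introducing a value that was not actually observed.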

Lesson Summary

In this lesson, we've covered the basics of data cleaning, specifically focusing on identifying and handling missing values using the Diamonds dataset. You learned to:

  • Load and explore the Diamonds dataset.
  • Identify missing values.
  • Handle missing values by dropping rows with null values.

Keep practicing, and you'll be well-prepared for the next steps!
