Basic Data Cleaning with the Diamonds Dataset

Lesson 1

Introduction to Data Cleaning

Hello! In this lesson, we will dive into the basic concepts of data cleaning using the Diamonds dataset from the seaborn library. Data cleaning is a crucial step in data preprocessing, ensuring that our data is ready for analysis by dealing with inconsistencies, errors, and missing values.

Data cleaning involves identifying and handling missing values, correcting errors, and ensuring consistency. By cleaning your data, you improve the quality of your analysis and the performance of machine learning models.

Quick Recap: Loading and Exploring

Let's quickly revisit how to load the dataset, explore its structure, and identify missing values. First, load the Diamonds dataset using the seaborn library:

Python
import seaborn as sns
import pandas as pd

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

View the first few rows to get an initial overview:

Python
print(diamonds.head())

Output:

Plain text
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75

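You can access a column using either diamonds['cut'] or diamonds.get('cut'). Both return the 'cut' column when it exists. They differ when the column is missing: bracket access raises a KeyError, while get returns None (or a default value you supply). Keep in mind that chaining a method such as .head() onto a None result raises an AttributeError, so check the result of get before using it.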

Python
print(diamonds['cut'].head())  # Or print(diamonds.get('cut').head())

Output:

Plain text
0      Ideal
1    Premium
2       Good
3    Premium
4       Good
Name: cut, dtype: category
Categories (5, object): ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']
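To make the difference between the two access styles concrete, here is a minimal sketch. It uses a tiny stand-in DataFrame with made-up values rather than the full Diamonds dataset, and the column name 'colour' is a deliberately misspelled, hypothetical name chosen so that the lookup fails.

```python
import pandas as pd

# Tiny stand-in for the Diamonds data (illustrative values only).
df = pd.DataFrame({"carat": [0.23, 0.21], "cut": ["Ideal", "Premium"]})

# Bracket access raises KeyError when the column does not exist.
try:
    df["colour"]
    bracket_result = "found"
except KeyError:
    bracket_result = "KeyError"

# get() returns None (or the supplied default) instead of raising.
get_result = df.get("colour")
get_with_default = df.get("colour", default="missing")

print(bracket_result)    # KeyError
print(get_result)        # None
print(get_with_default)  # missing
```

Which style is preferable depends on the situation: a KeyError surfaces a typo immediately, while get is convenient when a column is genuinely optional.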

Check the dimensions and basic statistics:

Python
print(diamonds.shape)
print(diamonds.describe())

Output:

Plain text
(53940, 10)
              carat         depth         table         price             x  \
count  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000
mean       0.797940     61.749405     57.457184   3932.799722      5.731157
std        0.474011      1.432621      2.234491   3989.439738      1.121761
min        0.200000     43.000000     43.000000    326.000000      0.000000
25%        0.400000     61.000000     56.000000    950.000000      4.710000
50%        0.700000     61.800000     57.000000   2401.000000      5.700000
75%        1.040000     62.500000     59.000000   5324.250000      6.540000
max        5.010000     79.000000     95.000000  18823.000000     10.740000

                  y             z
count  53940.000000  53940.000000
mean       5.734526      3.538734
std        1.142135      0.705699
min        0.000000      0.000000
25%        4.720000      2.910000
50%        5.710000      3.530000
75%        6.540000      4.040000
max       58.900000     31.800000
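Notice that the summary above covers only seven of the ten columns: by default, describe() on a DataFrame with mixed types summarizes just the numeric ones, so cut, color, and clarity are left out. The sketch below shows this on a small stand-in DataFrame (made-up values, not the real dataset) and uses the include="all" argument to bring the categorical column back in.

```python
import pandas as pd

# Small stand-in with one numeric and one categorical column.
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": pd.Categorical(["Ideal", "Premium", "Premium"]),
})

# Default: numeric columns only.
numeric_summary = df.describe()

# include="all": every column, with NaN where a statistic does not apply.
full_summary = df.describe(include="all")

print(list(numeric_summary.columns))  # ['carat']
print(list(full_summary.columns))     # ['carat', 'cut']
```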

Quick Recap: Identifying Missing Values

To identify missing values, combine the isnull() method with sum(). isnull() marks each missing cell as True, and sum() then counts the True values in each column:

Python
print(diamonds.isnull().sum())

This results in the following output:

Plain text
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64

For demonstration, simulate a missing value:

Python
diamonds.loc[0, 'cut'] = None
print(diamonds.isnull().sum())

The output now reflects the added null value:

Plain text
carat      0
cut        1
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64

This output shows that after simulating a missing value in the 'cut' column, we successfully detect it using the isnull().sum() function, illustrating the method to find missing data within our dataset.
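Counting missing values per column tells you how many there are, but not where they are. As a minimal sketch, assuming a small stand-in DataFrame with made-up values, the same isnull() mask can also give a grand total and select the affected rows:

```python
import pandas as pd

# Stand-in data with one missing 'cut' value (illustrative only).
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": ["Ideal", None, "Premium"],
    "price": [326, 326, 334],
})

# Missing values per column, as in the lesson.
missing_per_column = df.isnull().sum()

# Total number of missing cells in the whole DataFrame.
total_missing = int(missing_per_column.sum())

# Rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column["cut"])      # 1
print(total_missing)                  # 1
print(list(rows_with_missing.index))  # [1]
```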

Handling Missing Values

There are several strategies to handle missing values, including dropping rows and filling in missing values. For simplicity, we'll focus on dropping rows with any null values.
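For comparison, the filling strategy mentioned above might look like the following sketch. It is only an illustration on a small stand-in DataFrame with made-up values; the choice of fill values (the most common category for 'cut', the median for 'price') is an assumption made for this example, not a recommendation from the lesson.

```python
import pandas as pd

# Stand-in data with one missing value in each column (illustrative only).
df = pd.DataFrame({
    "cut": ["Ideal", None, "Premium", "Ideal"],
    "price": [326.0, 326.0, None, 334.0],
})

# Assumed fill values: most frequent category and median price.
most_common_cut = df["cut"].mode()[0]
median_price = df["price"].median()

# fillna() with a dict fills each column with its own value
# and returns a new DataFrame.
filled = df.fillna({"cut": most_common_cut, "price": median_price})

print(filled["cut"].tolist())    # ['Ideal', 'Ideal', 'Premium', 'Ideal']
print(filled["price"].tolist())  # [326.0, 326.0, 326.0, 334.0]
```

Filling keeps every row but introduces values that were never observed, which is one reason dropping rows is the simpler starting point.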

To drop rows with missing values, we use the dropna() function:

Python
diamonds_cleaned = diamonds.dropna()
print(diamonds_cleaned.shape)

The output of the above code will be:

Plain text
(53939, 10)

The dataset shrinks from 53,940 rows to 53,939, so the one row with the missing value has been removed.

dropna() removes every row that contains at least one null value and returns a new DataFrame; the original diamonds DataFrame is left unchanged. To confirm that no missing values remain in the cleaned copy, we check again:

Python
print(diamonds_cleaned.isnull().sum())

The output of the above code will be:

Plain text
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64

This confirms that there are no more missing values in our cleaned dataset, indicating a successful data cleaning process.
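Two details of dropna() are worth seeing in isolation: it leaves the original DataFrame untouched, and its subset argument restricts the check to chosen columns. The sketch below assumes a small stand-in DataFrame with made-up values rather than the real dataset.

```python
import pandas as pd

# Stand-in data: row 1 lacks 'cut', row 2 lacks 'price' (illustrative only).
df = pd.DataFrame({
    "cut": ["Ideal", None, "Premium"],
    "price": [326.0, 326.0, None],
})

# Drop rows with a null in any column.
dropped_any = df.dropna()

# Drop rows only when 'cut' is null; a missing price is tolerated.
dropped_cut_only = df.dropna(subset=["cut"])

print(df.shape)                # (3, 2)  original is unchanged
print(dropped_any.shape)       # (1, 2)
print(dropped_cut_only.shape)  # (2, 2)
```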

Lesson Summary

In this lesson, we've covered the basics of data cleaning, specifically focusing on identifying and handling missing values using the Diamonds dataset. You learned to:

Load and explore the Diamonds dataset.
Identify missing values.
Handle missing values by dropping rows with null values.

Keep practicing, and you'll be well-prepared for the next steps!
