Hello! In this lesson, we will dive into the basic concepts of data cleaning using the Diamonds dataset from the seaborn library. Data cleaning is a crucial step in data preprocessing: it gets our data ready for analysis by identifying and handling missing values, correcting errors, and resolving inconsistencies.
Cleaner data improves the quality of your analysis and the performance of machine learning models.
Let's quickly revisit how to load the dataset, explore its structure, and identify missing values. First, load the Diamonds dataset using the seaborn library:

```python
import seaborn as sns
import pandas as pd

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')
```
View the first few rows to get an initial overview:
```python
print(diamonds.head())
```
Output:
```text
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
```
You can access a column with either `diamonds['cut']` or `diamonds.get('cut')`. Both return the 'cut' column. If the column is missing, `diamonds['cut']` raises a `KeyError`, while `get` returns `None` instead, so check the result before calling methods such as `head()` on it.

```python
print(diamonds['cut'].head())  # Or print(diamonds.get('cut').head())
```
Output:
```text
0      Ideal
1    Premium
2       Good
3    Premium
4       Good
Name: cut, dtype: category
Categories (5, object): ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']
```
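The difference between the two access styles only shows up when a column is absent. The sketch below illustrates it on a tiny hand-built DataFrame rather than the real dataset; the frame and the misspelled column name `colour` are hypothetical and exist only for this demonstration.

```python
import pandas as pd

# Hypothetical two-row stand-in for the diamonds data
df = pd.DataFrame({"carat": [0.23, 0.21], "cut": ["Ideal", "Premium"]})

# Bracket access raises KeyError when the column does not exist
try:
    df["colour"]
    bracket_failed = False
except KeyError:
    bracket_failed = True

# get returns None by default, or whatever is passed as `default`
missing = df.get("colour")
fallback = df.get("colour", default=pd.Series(dtype="object"))

print(bracket_failed)    # True
print(missing is None)   # True
print(len(fallback))     # 0
```

Because `get` can return `None`, chaining `.head()` directly onto it would fail with an `AttributeError` for a missing column, which is why a check or an explicit default is useful.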
Check the dimensions and basic statistics:
```python
print(diamonds.shape)
print(diamonds.describe())
```
Output:
```text
(53940, 10)
              carat         depth         table         price             x  \
count  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000
mean       0.797940     61.749405     57.457184   3932.799722      5.731157
std        0.474011      1.432621      2.234491   3989.439738      1.121761
min        0.200000     43.000000     43.000000    326.000000      0.000000
25%        0.400000     61.000000     56.000000    950.000000      4.710000
50%        0.700000     61.800000     57.000000   2401.000000      5.700000
75%        1.040000     62.500000     59.000000   5324.250000      6.540000
max        5.010000     79.000000     95.000000  18823.000000     10.740000

                  y             z
count  53940.000000  53940.000000
mean       5.734526      3.538734
std        1.142135      0.705699
min        0.000000      0.000000
25%        4.720000      2.910000
50%        5.710000      3.530000
75%        6.540000      4.040000
max       58.900000     31.800000
```
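The summary above lists only seven of the ten columns because `describe()` covers numeric columns by default. As a rough sketch, the `include` parameter can bring categorical columns such as 'cut' into the summary; the three-row DataFrame below is a hypothetical stand-in so the example runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical miniature frame with one numeric and one categorical column
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": pd.Categorical(["Ideal", "Premium", "Premium"]),
})

# Default: only numeric columns are summarized
numeric_summary = df.describe()

# include="all" adds count/unique/top/freq rows for non-numeric columns
full_summary = df.describe(include="all")

print(list(numeric_summary.columns))   # ['carat']
print(list(full_summary.columns))      # ['carat', 'cut']
print(full_summary.loc["top", "cut"])  # Premium
```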
To identify missing values, use the `isnull()` function combined with the `sum()` function:

```python
print(diamonds.isnull().sum())
```
This results in the following output:
```text
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64
```
For demonstration, simulate a missing value:
```python
diamonds.loc[0, 'cut'] = None
print(diamonds.isnull().sum())
```
The output now reflects the added null value:

```text
carat      0
cut        1
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64
```
This output shows that after simulating a missing value in the 'cut' column, `isnull().sum()` detects it: the count for 'cut' rises from 0 to 1.
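To see why this combination works, it helps to look at the two steps separately: `isnull()` produces a boolean mask, and `sum()` counts each `True` as 1. The following is a minimal sketch on a hypothetical three-row DataFrame, which also uses the same mask to pull out the affected rows.

```python
import pandas as pd

# Hypothetical frame with one missing value in each column
df = pd.DataFrame({
    "carat": [0.23, None, 0.29],
    "cut": ["Ideal", "Premium", None],
})

mask = df.isnull()        # True wherever a value is missing
per_column = mask.sum()   # each True counts as 1, summed per column

# Rows that contain at least one missing value
rows_with_nulls = df[mask.any(axis=1)]

print(per_column["carat"], per_column["cut"])  # 1 1
print(list(rows_with_nulls.index))             # [1, 2]
```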
There are several strategies to handle missing values, including dropping rows and filling in missing values. For simplicity, we'll focus on dropping rows with any null values.
To drop rows with missing values, we use the `dropna()` function:

```python
diamonds_cleaned = diamonds.dropna()
print(diamonds_cleaned.shape)
```
The output of the above code will be:
```text
(53939, 10)
```
`dropna()` returns a new DataFrame without the rows that contain null values, so the row with the simulated missing value is gone and the dataset shrinks from 53,940 rows to 53,939. To confirm that no missing values remain, we check again:
```python
print(diamonds_cleaned.isnull().sum())
```
The output of the above code will be:
```text
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64
```
This confirms that there are no more missing values in our cleaned dataset, indicating a successful data cleaning process.
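Filling, the other strategy mentioned earlier, keeps every row instead of discarding incomplete ones. This lesson does not cover it, but a minimal sketch might look like the following. The four-row DataFrame is hypothetical, and the choice of the median for a numeric column and the most frequent category for 'cut' is an assumption made for illustration, not a recommendation from the lesson.

```python
import pandas as pd

# Hypothetical frame with one missing numeric and one missing categorical value
df = pd.DataFrame({
    "carat": [0.23, None, 0.29, 0.31],
    "cut": pd.Categorical(["Ideal", "Premium", None, "Premium"]),
})

filled = df.copy()

# Numeric column: replace missing values with the column median
filled["carat"] = filled["carat"].fillna(filled["carat"].median())

# Categorical column: replace missing values with the most frequent category
filled["cut"] = filled["cut"].fillna(filled["cut"].mode()[0])

print(filled.shape)                       # (4, 2)
print(int(filled.isnull().sum().sum()))   # 0
```

Unlike `dropna()`, this keeps all four rows, at the cost of introducing values that were never observed.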
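In this lesson, we've covered the basics of data cleaning, focusing on identifying and handling missing values in the Diamonds dataset. You learned to load and explore the dataset, detect missing values with `isnull().sum()`, and drop incomplete rows with `dropna()`.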
Keep practicing, and you'll be well-prepared for the next steps!