Lesson 4
Data Cleaning and Transformation
Introduction

We step into the world of Data Cleaning and Transformation. Real-life data isn't always tidy; it often contains inconsistencies, missing data points, outliers, and even incorrect values! To extract meaningful insights or build reliable machine learning models, we first clean and transform the data.

In this session, we handle inconsistencies and outliers and apply various transformations to make data ready for analysis. Now, let's start this exploratory journey!

Why is Data Cleaning and Transformation Necessary?

Why clean and transform data? Simple: unclean or inconsistent data can skew analysis or predictions. Weather data with missing temperatures, for instance, can lead to misleading climate predictions. The real world is full of such examples of analysis gone awry due to unclean data.
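To make the weather example concrete, here is a minimal, hypothetical sketch (the temperature readings are invented for illustration): a single missing value, naively filled with 0 instead of being handled properly, noticeably drags the average temperature down.

Python

import pandas as pd
import numpy as np

# Hypothetical daily temperature readings in degrees Celsius, with one missing value
temps = pd.Series([21.5, 22.0, np.nan, 23.1, 22.4, 21.8, 22.9])

# pandas skips missing values when computing the mean
print(temps.mean())            # about 22.28, a reasonable estimate

# Naively treating the missing reading as 0 skews the result
print(temps.fillna(0).mean())  # about 19.10, a misleading "cold snap"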

Recognizing Inconsistencies in Data

Let's delve into spotting inconsistencies. For instance, XL, X-L, and xl all represent the same clothing size but are recorded differently. Python's pandas library comes in handy here.

Python
import pandas as pd

# hypothetical dataset of clothing sizes
sizes = ['XL', 'S', 'M', 'X-L', 'xl', 'S', 'L', 'XL', 'M']
df = pd.DataFrame(sizes, columns=['Size'])

# Use value_counts() to spot inconsistent values
print(df['Size'].value_counts())

Output:

XL     2
S      2
M      2
X-L    1
xl     1
L      1
dtype: int64

Dealing with Inconsistencies in Data

To sort out inconsistencies, replace the variant values with a single standard value using pandas' replace() method.

Python
df.replace(['X-L', 'xl'], 'XL', inplace=True)
print(df['Size'].value_counts())

Output:

XL    4
S     2
M     2
L     1
dtype: int64

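Listing every variant by hand works here, but it does not scale when there are many spellings. As an alternative sketch (assuming the variants differ only in letter case and hyphenation), you can normalize the strings themselves so that XL, X-L, and xl all collapse to XL:

Python

import pandas as pd

# Normalize the strings instead of enumerating every variant
sizes = ['XL', 'S', 'M', 'X-L', 'xl', 'S', 'L', 'XL', 'M']
df = pd.DataFrame(sizes, columns=['Size'])

# Upper-case and strip hyphens: 'X-L' and 'xl' both become 'XL'
df['Size'] = df['Size'].str.upper().str.replace('-', '', regex=False)
print(df['Size'].value_counts())  # XL is now counted 4 times, as before
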
Detecting and Filtering Outliers

Scanning for outliers, or exceptional values, is the next step. Outliers can distort the analytical outcome. One common method to detect outliers is using the Interquartile Range (IQR).

As a short reminder, the IQR method flags any value below $Q_1 - 1.5 \cdot IQR$ or above $Q_3 + 1.5 \cdot IQR$ as an outlier, where:

  • $Q_1$ – the first quartile (25th percentile)
  • $Q_3$ – the third quartile (75th percentile)
  • $IQR = Q_3 - Q_1$ – the interquartile range

Let's use the IQR method to identify and filter out outliers in a dataset.

Python
import pandas as pd

# A dataset with an outlier
data = [1, 1.2, 1.1, 1.05, 1.5, 1.4, 9]
df = pd.DataFrame(data, columns=['Values'])

# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['Values'].quantile(0.25)
Q3 = df['Values'].quantile(0.75)

# Calculate IQR
IQR = Q3 - Q1

# Define the acceptable range (1.5 * IQR rule)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out outliers
no_outliers_df = df[(df['Values'] >= lower_bound) & (df['Values'] <= upper_bound)]
print(no_outliers_df)

Output:

   Values
0    1.00
1    1.20
2    1.10
3    1.05
4    1.50
5    1.40

The value 9 is considered an outlier and is excluded from the filtered dataset.
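If you want to see the cutoffs themselves, the short sketch below recomputes the same quantities (repeated here so it runs on its own) and prints the acceptable range; for this data it is roughly 0.51 to 2.01, so 9 clearly falls outside it.

Python

import pandas as pd

# Same dataset as above
values = pd.Series([1, 1.2, 1.1, 1.05, 1.5, 1.4, 9], name='Values')

Q1 = values.quantile(0.25)    # 1.075
Q3 = values.quantile(0.75)    # 1.45
IQR = Q3 - Q1                 # 0.375

lower_bound = Q1 - 1.5 * IQR  # 0.5125
upper_bound = Q3 + 1.5 * IQR  # 2.0125
print(lower_bound, upper_bound)

# Only 9 lies outside the range, so only it is flagged
print(values[(values < lower_bound) | (values > upper_bound)])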

Data Transformation

Now, data transformation is required when data needs adjustment to suit a specific analysis or model-building exercise. The need might be to bring skewed data closer to normality or to harmonize the differing scales of variables. For this purpose, the scikit-learn library comes in handy. Though this course is not about scikit-learn, it is widely used alongside pandas DataFrames, so we will take a brief look at it:

Python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Creating a dataframe
data = {'Feature1': [0.5, 0.6, 0.9], 'Feature2': [10, 12, 15]}
df = pd.DataFrame(data)

# Initializing StandardScaler
scaler = StandardScaler()

# Scaling just 'Feature2'
df['Feature2_scaled'] = scaler.fit_transform(df[['Feature2']])

print("\nOriginal Data with Scaled Feature2 Column:")
print(df)

Output:

Original Data with Scaled Feature2 Column:
   Feature1  Feature2  Feature2_scaled
0       0.5        10        -1.135550
1       0.6        12        -0.162221
2       0.9        15         1.297771

StandardScaler works simply: create the StandardScaler object, then call its fit_transform method on the data. We select the column with double square brackets [['Feature2']] so that it is treated as a DataFrame (required by scikit-learn transformers) rather than a Series. In the output, we see a new column, Feature2_scaled, which contains the values of Feature2 scaled so that their mean is 0 and their standard deviation is 1.
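To demystify what fit_transform does here, the sketch below reproduces the scaled values by hand. One detail worth knowing: StandardScaler uses the population standard deviation (ddof=0), which differs from pandas' std(), whose default is ddof=1.

Python

import numpy as np

# Standardize Feature2 = [10, 12, 15] manually
feature2 = np.array([10, 12, 15], dtype=float)

mean = feature2.mean()      # about 12.333
std = feature2.std(ddof=0)  # population standard deviation, as StandardScaler uses

scaled = (feature2 - mean) / std
print(scaled)  # approximately [-1.135550, -0.162221, 1.297771]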

Lesson Summary and Practice

Kudos! You've completed the Data Cleaning and Transformation lesson. You learned about handling data inconsistencies and outliers and performing data transformations using Python's pandas and scikit-learn libraries.

Practice exercises are next, focusing on consolidating concepts through application. They will reinforce your understanding and hone your skills. So, ready, set, explore!
