Lesson 4

We step into the world of **Data Cleaning and Transformation**. Real-life data isn't always tidy; it has inconsistencies, missing data points, outliers, and even incorrect data! To extract meaningful insights or build reliable machine learning models, we clean and transform data.

In this session, we handle inconsistencies and outliers and apply various transformations to make the data ready for analysis. Now, let's start this exploratory journey!

Why clean and transform data? Simple: unclean or inconsistent data can skew analysis or predictions. Weather data with missing temperatures, for instance, can lead to misleading climate predictions. The real world is full of such examples of analysis gone awry due to unclean data.
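As a tiny illustration of the missing-temperature problem, here is a sketch (with a hypothetical `temp_c` column, not from the lesson's datasets) showing how `pandas` can count and fill gaps before analysis:

```python
import pandas as pd

# Hypothetical weather readings with two missing temperatures
temps = pd.DataFrame({'temp_c': [21.5, None, 19.8, None, 22.1]})

# Count the gaps, then fill them with the column mean
print(temps['temp_c'].isna().sum())   # number of missing readings
temps['temp_c'] = temps['temp_c'].fillna(temps['temp_c'].mean())
print(temps['temp_c'].isna().sum())   # no gaps remain
```

Whether filling with the mean is appropriate depends on the data; dropping rows with `dropna()` is another common choice.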

Let's delve into spotting inconsistencies. For instance, `XL`, `X-L`, and `xl` represent the same clothing size but are reported differently. Python's `pandas` library comes in handy here.

```python
import pandas as pd

# hypothetical dataset of clothing sizes
sizes = ['XL', 'S', 'M', 'X-L', 'xl', 'S', 'L', 'XL', 'M']
df = pd.DataFrame(sizes, columns=['Size'])

# Use value_counts() to spot inconsistent values
print(df['Size'].value_counts())
```

Output:

```
XL     2
S      2
M      2
X-L    1
xl     1
L      1
dtype: int64
```

To sort out inconsistencies, replace them with a standard value.

```python
df.replace(['X-L', 'xl'], 'XL', inplace=True)
print(df['Size'].value_counts())
```

Output:

```
XL    4
S     2
M     2
L     1
dtype: int64
```
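Listing every variant by hand gets tedious as datasets grow. As a rough sketch of a more scalable alternative (not part of the lesson's own code), pandas string methods can normalize case and punctuation in one pass:

```python
import pandas as pd

sizes = ['XL', 'S', 'M', 'X-L', 'xl', 'S', 'L', 'XL', 'M']
df = pd.DataFrame(sizes, columns=['Size'])

# Uppercase everything and drop hyphens, so 'X-L' and 'xl' both become 'XL'
df['Size'] = df['Size'].str.upper().str.replace('-', '', regex=False)
print(df['Size'].value_counts())
```

This catches any future variant like `x-l` without extending a replacement list.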

Scanning for outliers, or exceptional values, is the next step. Outliers can distort the analytical outcome. One common method to detect outliers is using the Interquartile Range (IQR).

As a short reminder, the IQR method flags any value below $Q_1 - 1.5 \cdot IQR$ or above $Q_3 + 1.5 \cdot IQR$ as an outlier, where:

- $Q_1$ – The first quartile
- $Q_3$ – The third quartile
- $IQR$ – The Interquartile Range

Let's use the IQR method to identify and filter out outliers in a dataset.

```python
import pandas as pd

# A dataset with an outlier
data = [1, 1.2, 1.1, 1.05, 1.5, 1.4, 9]
df = pd.DataFrame(data, columns=['Values'])

# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['Values'].quantile(0.25)
Q3 = df['Values'].quantile(0.75)

# Calculate IQR
IQR = Q3 - Q1

# Define the acceptable range (1.5 * IQR rule)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out outliers
no_outliers_df = df[(df['Values'] >= lower_bound) & (df['Values'] <= upper_bound)]
print(no_outliers_df)
```

Output:

```
   Values
0    1.00
1    1.20
2    1.10
3    1.05
4    1.50
5    1.40
```

The value `9` is considered an outlier and is excluded from the filtered dataset.

Now, data transformation is required when data needs adjustment to suit a specific analysis or model-building exercise. The need might be to bring skewed data to normality or harmonize differing scales of variables. For this purpose, the `scikit-learn` library comes in handy. Though this course is not about this library, it is widely used with pandas DataFrames, so we will take a look at it:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Creating a dataframe
data = {'Feature1': [0.5, 0.6, 0.9], 'Feature2': [10, 12, 15]}
df = pd.DataFrame(data)

# Initializing StandardScaler
scaler = StandardScaler()

# Scaling just 'Feature2'
df['Feature2_scaled'] = scaler.fit_transform(df[['Feature2']])

print("\nOriginal Data with Scaled Feature2 Column:")
print(df)
```

Output:

```
Original Data with Scaled Feature2 Column:
   Feature1  Feature2  Feature2_scaled
0       0.5        10        -1.135550
1       0.6        12        -0.162221
2       0.9        15         1.297771
```

The standard scaler works simply: create the `StandardScaler` object, then call its `fit_transform` method on the data. We select the column with double square brackets, `[['Feature2']]`, so that it is treated as a DataFrame (required by scikit-learn transformers) rather than a Series. In the output, we see a new column, `Feature2_scaled`, containing the values of `Feature2` scaled so their mean is `0` and standard deviation is `1`.
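To see where those scaled numbers come from, the same standardization can be reproduced by hand with plain `pandas`. This is a quick check, not part of the lesson's code, and it assumes (as `StandardScaler` does) the population standard deviation, i.e. `ddof=0`:

```python
import pandas as pd

s = pd.Series([10, 12, 15], name='Feature2')

# Standardize by hand: subtract the mean, divide by the
# population standard deviation (ddof=0, matching StandardScaler)
scaled = (s - s.mean()) / s.std(ddof=0)
print(scaled.round(6))
```

The result has mean `0` and standard deviation `1`, matching the `Feature2_scaled` column above. Note that `pandas` defaults to the sample standard deviation (`ddof=1`), which is why `ddof=0` must be passed explicitly.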

Kudos! You've completed the **Data Cleaning and Transformation** lesson. You learned about handling data inconsistencies and outliers and performing data transformations using Python's `pandas` and `scikit-learn` libraries.

Practice exercises are next, focusing on consolidating concepts through application. They will reinforce your understanding and hone your skills. So, ready, set, explore!