Lesson 3

Comprehensive Preprocessing With Multiple Techniques: Part 1

Lesson Introduction

Imagine you are cleaning your room and organizing items step-by-step. Data preprocessing is similar! In this lesson, we'll prepare a dataset for analysis by integrating multiple preprocessing techniques. Our goal is to make the data clean and ready for useful insights.

Drop Unnecessary Columns

Not all columns are useful for our analysis. Some are redundant or irrelevant. For example, alive duplicates survived, class duplicates pclass, and embark_town duplicates embarked. The who, adult_male, and alone columns are derived from other columns, and deck is missing for most passengers. Let's drop these columns.

Python
import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')

# Drop unnecessary columns
columns_to_drop = ['deck', 'embark_town', 'alive', 'class', 'who', 'adult_male', 'alone']
titanic = titanic.drop(columns=columns_to_drop)

# Display the DataFrame after dropping columns
print(titanic.head())

The output is:

   survived  pclass     sex   age  sibsp  parch     fare embarked
0         0       3    male  22.0      1      0   7.2500        S
1         1       1  female  38.0      1      0  71.2833        C
2         1       3  female  26.0      0      0   7.9250        S
3         1       1  female  35.0      1      0  53.1000        S
4         0       3    male  35.0      0      0   8.0500        S

We use the .drop() method and pass the list of column names to remove through its columns argument. The method returns a new DataFrame, so we assign the result back to titanic.
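
To see this behavior in isolation, here is a minimal sketch on a toy DataFrame (the frame and its column names a, b, c are made up for illustration and are not part of the Titanic data). It also shows the optional errors="ignore" argument, which the lesson's code does not use.

```python
import pandas as pd

# Hypothetical toy frame, not the Titanic data
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# drop returns a new DataFrame; df itself is left unchanged
trimmed = df.drop(columns=["b", "c"])

# errors="ignore" skips names that are not present instead of raising KeyError
safe = df.drop(columns=["b", "not_a_column"], errors="ignore")

print(list(df.columns))       # original columns are still there
print(list(trimmed.columns))  # only the column we kept
print(list(safe.columns))     # the unknown name was silently skipped
```

Using errors="ignore" can be handy when the same cleanup code runs on datasets that do not all share the same columns, at the cost of hiding typos in column names.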

Handle Missing Values

Data often has missing values, which are problematic for many algorithms. In our Titanic dataset, we can fill missing values with reasonable substitutes like the median for numerical columns and the mode for categorical columns.

Python
# Fill missing values in 'age' with the median value
titanic['age'] = titanic['age'].fillna(titanic['age'].median())

# Fill missing values in 'embarked' with the mode value
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])

# Fill missing values in 'fare' with the median value
titanic['fare'] = titanic['fare'].fillna(titanic['fare'].median())

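Here, we use the fillna method to replace missing values (NaN) with a specified value. It works on a single column (a Series) or on a whole DataFrame. You can pass a single value or a dictionary that maps column names to different substitutes, and the value itself can come from an aggregation such as the median or mode, as we do here. Note that mode() returns a Series, because several values can tie for most frequent, so we take its first element with [0].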
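
The dictionary form mentioned above can be sketched as follows. This is an illustrative example on a small hand-made DataFrame (the values are invented), showing how one fillna call can apply a different substitute to each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "embarked": ["S", None, "S"],
})

# Map each column name to the value that should replace its missing entries
fill_values = {
    "age": df["age"].median(),            # median skips NaN by default
    "embarked": df["embarked"].mode()[0], # most frequent category
}

# A single call fills every listed column with its own substitute
filled = df.fillna(value=fill_values)
print(filled)
```

For the titanic DataFrame, the column-by-column code in this lesson does the same job; the dictionary form is simply more compact when many columns need filling.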

Let's check if it worked.

Python
# Check for any remaining missing values
print(titanic.isnull().sum())

This line prints the number of missing values in each column of the titanic DataFrame. The isnull() method returns a new DataFrame of the same shape, with True where a value is missing and False where it is present. Summing these booleans treats True as 1 and False as 0, so each column's sum is its count of missing values. Any positive sum means that column still has missing values.

The output is:

survived    0
pclass      0
sex         0
age         0
sibsp       0
parch       0
fare        0
embarked    0
dtype: int64

We see zeros everywhere, indicating that there are no more missing values in the DataFrame.
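
To make the two steps of that check visible, here is a small sketch on a toy DataFrame that does contain missing values (the frame and its columns x and y are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", None]})

# Step 1: a boolean frame of the same shape, True where a value is missing
mask = df.isnull()

# Step 2: summing down each column counts the True entries (True == 1)
missing_per_column = mask.sum()

print(mask)
print(missing_per_column)
```

Here each column reports a positive count, which is the signal that more cleaning is needed.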

Encode Categorical Values

Categorical values need to be converted into numbers for most algorithms. For example, the sex and embarked columns in our dataset are categorical. We'll use the get_dummies function to encode these columns.

Python
# Encode categorical values
titanic = pd.get_dummies(titanic, columns=['sex', 'embarked'], dtype='int')

# Display the DataFrame after encoding
print(titanic.head())

Note the dtype='int' parameter. It makes the new indicator columns hold 0 or 1. Without it, recent pandas versions (2.0 and later) fill these columns with False and True instead.

The output is:

   survived  pclass   age  sibsp  parch     fare  sex_female  sex_male  embarked_C  embarked_Q  embarked_S
0         0       3  22.0      1      0   7.2500           0         1           0           0           1
1         1       1  38.0      1      0  71.2833           1         0           1           0           0
2         1       3  26.0      0      0   7.9250           1         0           0           0           1
3         1       1  35.0      1      0  53.1000           1         0           0           0           1
4         0       3  35.0      0      0   8.0500           0         1           0           0           1
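
The encoding step can also be tried on a tiny example. The sketch below uses a made-up one-column DataFrame and additionally shows drop_first=True, an optional get_dummies argument that the lesson's code does not use; it is included only to illustrate that one indicator column per feature is redundant.

```python
import pandas as pd

# Hypothetical toy frame with a single categorical column
df = pd.DataFrame({"sex": ["male", "female", "female"]})

# One indicator column per category, stored as integers 0/1
as_int = pd.get_dummies(df, columns=["sex"], dtype=int)

# Every row has exactly one indicator set, so the columns always sum to 1
row_sums = as_int.sum(axis=1)

# Optional variation: drop the first category to remove the redundant column
compact = pd.get_dummies(df, columns=["sex"], dtype=int, drop_first=True)

print(as_int)
print(compact)
```

Whether to drop the redundant column depends on the algorithm you plan to use, so treat drop_first as a choice to revisit rather than a rule.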

Scale Numerical Values

Scaling numerical values is crucial, especially for algorithms that rely on the distance between data points. We will standardize the age and fare columns so they have a mean of 0 and a standard deviation of 1.

Python
# Scale numerical values
titanic['age'] = (titanic['age'] - titanic['age'].mean()) / titanic['age'].std()
titanic['fare'] = (titanic['fare'] - titanic['fare'].mean()) / titanic['fare'].std()

# Display the DataFrame after scaling
print(titanic.head())

The output is:

   survived  pclass       age  sibsp  parch      fare  sex_female  sex_male  embarked_C  embarked_Q  embarked_S
0         0       3 -0.530005      1      0 -0.502445           0         1           0           0           1
1         1       1  0.571433      1      0  0.786845           1         0           1           0           0
2         1       3 -0.254888      0      0 -0.488854           1         0           0           0           1
3         1       1  0.396745      1      0  0.420730           1         0           0           0           1
4         0       3  0.396745      0      0 -0.486337           0         1           0           0           1
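
Since the same formula is applied to both columns, it can be wrapped in a small helper. The function name standardize below is our own choice for illustration, not a pandas function, and the sample values are just the five fares shown earlier in the lesson. The sketch also checks that the result has a mean of about 0 and a standard deviation of about 1.

```python
import pandas as pd

def standardize(series: pd.Series) -> pd.Series:
    """Return (x - mean) / std, using pandas' default sample std (ddof=1)."""
    return (series - series.mean()) / series.std()

# A few fare values taken from the first rows shown in this lesson
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])
scaled = standardize(fares)

# After standardizing, the mean is (numerically) 0 and the std is 1
print(scaled.mean())
print(scaled.std())
```

Keep in mind that pandas' std() divides by n - 1 by default, so results can differ slightly from tools that divide by n.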

Lesson Summary

Congratulations! You've cleaned and prepared the Titanic dataset using multiple preprocessing techniques. Here's a quick recap:

  • Loaded and inspected the dataset.
  • Dropped unnecessary columns to focus on valuable data.
  • Handled missing values to ensure the dataset is complete.
  • Encoded categorical values to make them usable by algorithms.
  • Scaled numerical values to improve model performance.

Now it's time to put your newfound skills to the test! In the upcoming practice session, you'll apply these preprocessing techniques to another dataset. This hands-on experience will solidify your understanding and give you confidence in tackling data preprocessing in real-world scenarios. Let's get started!
