Lesson 6

Engineering New Features for Better Predictions

Intro to Feature Engineering

Welcome to this lesson on Feature Engineering! Today, we'll explore how to derive new features from our existing data to enhance our predictive models. These derived features could provide more insightful information that our original data might not capture directly.

Feature Engineering is an essential part of machine learning, and it's the process of using domain knowledge to create features that make machine learning algorithms work. Although modern machine learning methods can automatically derive features, manually combining existing features – based on human intuition and industry expertise – can often produce better results.

Why is Feature Engineering vital? Consider this parallel: Artistic talent won't help a painter without paints, and a high-quality dataset may be useless without proper features. The process of Feature Engineering ensures you have the 'right paint' to create your masterpiece!

Let's use the Titanic dataset as an example. We could create a new feature, age_group, to categorize age into different groups, or another feature, family_size, by adding sibsp (number of siblings/spouses aboard) and parch (number of parents/children aboard). Let's dive in!

Creating New Features

We'll start by creating the family_size feature. This is simply the sibsp and parch features added together, plus one for the passenger themselves. Why create it? Family size might have a significant impact on a person's chance of survival. For instance, a person with a big family might have gotten confused and lost in the crowd, or might have spent time looking for family members, delaying their escape.

Python
# Load the data
import seaborn as sns

titanic_df = sns.load_dataset('titanic')

# Create a new feature, 'family_size'
titanic_df['family_size'] = titanic_df['sibsp'] + titanic_df['parch'] + 1
print(titanic_df.head())
"""
   survived  pclass     sex   age  ...  embark_town  alive  alone  family_size
0         0       3    male  22.0  ...  Southampton     no  False            2
1         1       1  female  38.0  ...    Cherbourg    yes  False            2
2         1       3  female  26.0  ...  Southampton    yes   True            1
3         1       1  female  35.0  ...  Southampton    yes  False            2
4         0       3    male  35.0  ...  Southampton     no   True            1

[5 rows x 16 columns]
"""

After executing the code above, you'll notice an extra column, family_size, in the dataset, representing each passenger's family size. For instance, the first passenger (row 0) has a family size of 2 because sibsp is 1 (one sibling or spouse aboard), and the third passenger (row 2) has a family size of 1 (traveling alone).
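To see whether a derived feature like family_size carries any signal, one simple check is to compare the survival rate across its values. The sketch below does this on a tiny hand-made table so that it runs without downloading anything. The rows in toy_df are invented for illustration and are not real Titanic records; the same groupby call could be applied to titanic_df.

```python
import pandas as pd

# Toy rows made up for illustration; these are NOT real Titanic records.
toy_df = pd.DataFrame({
    'sibsp':    [1, 1, 0, 0, 3, 0],
    'parch':    [0, 0, 0, 0, 2, 0],
    'survived': [0, 1, 1, 1, 0, 0],
})

# Same derivation as in the lesson:
# siblings/spouses + parents/children + the passenger
toy_df['family_size'] = toy_df['sibsp'] + toy_df['parch'] + 1

# The mean of a 0/1 column is the fraction of survivors in each group
survival_by_family_size = toy_df.groupby('family_size')['survived'].mean()
print(survival_by_family_size)
```

On the real dataset, a clear difference between groups would suggest the feature is worth keeping, although very small groups give noisy rates.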

Creating Categorical Features

Another common operation in feature engineering is the creation of categorical features. Categories can sometimes expose a pattern more clearly than raw continuous values, for example when the effect of age on survival is not a straight line. Here, we could categorize age into different age groups. We can use the cut() function from pandas, which segments and sorts data values into bins, making it a convenient way to turn a continuous variable into a categorical one. The function compares each value in the input against the bin edges you supply and assigns it to the bin whose interval contains it.

Python
# Import pandas
import pandas as pd

# Define the bin edges
age_bins = [0, 12, 18, 30, 45, 100]

# Define the bin labels
age_labels = ['Child', 'Teenager', 'Young Adult', 'Middle Age', 'Senior']

# Create the age group feature
titanic_df['age_group'] = pd.cut(titanic_df['age'], bins=age_bins, labels=age_labels)

# Show the first few rows of the data
print(titanic_df.head())
"""
   survived  pclass     sex   age  ...  alive  alone  family_size    age_group
0         0       3    male  22.0  ...     no  False            2  Young Adult
1         1       1  female  38.0  ...    yes  False            2   Middle Age
2         1       3  female  26.0  ...    yes   True            1  Young Adult
3         1       1  female  35.0  ...    yes  False            2   Middle Age
4         0       3    male  35.0  ...     no   True            1   Middle Age

[5 rows x 17 columns]
"""

Here, the pd.cut() function sorts each age into one of the bins. The bins argument defines the bin edges, and the labels argument names the resulting bins, so five labels are needed for the six edges. By default each bin includes its right edge but not its left one, so an age of exactly 12 is labelled Child rather than Teenager. In the output, you'll notice a new column, age_group, categorizing passengers into different age groups.
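The behavior of pd.cut() at the bin edges and with missing values is easy to check on a handful of values. The following is a minimal sketch that reuses the lesson's bin edges and labels. The ages in sample_ages are hypothetical, chosen to land on the edges and to include one missing value.

```python
import numpy as np
import pandas as pd

age_bins = [0, 12, 18, 30, 45, 100]
age_labels = ['Child', 'Teenager', 'Young Adult', 'Middle Age', 'Senior']

# Hypothetical ages chosen to hit bin edges, plus one missing value
sample_ages = pd.Series([0.42, 12, 12.5, 18, 30, 45, 80, np.nan])

# Default: bins are closed on the right, e.g. (0, 12], (12, 18], ...
right_closed = pd.cut(sample_ages, bins=age_bins, labels=age_labels)
print(right_closed.tolist())

# right=False closes bins on the left instead, e.g. [0, 12), [12, 18), ...
left_closed = pd.cut(sample_ages, bins=age_bins, labels=age_labels, right=False)
print(left_closed.tolist())
```

With the default setting, the edge values 12, 18, 30, and 45 fall into the lower of the two neighboring groups. With right=False they move up to the higher group. In both cases the missing age stays missing rather than being assigned to a bin.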

Let's check the distribution of the age_group to verify that the transformation was successful:

Python
# Check the distribution of the 'age_group' column
print(titanic_df['age_group'].value_counts())
"""
age_group
Young Adult    270
Middle Age     202
Senior         103
Teenager        70
Child           69
Name: count, dtype: int64
"""

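The output lists how many passengers fall into each age group, with Young Adult the largest. The counts add up to 714 rather than the full 891 rows, because passengers with a missing age also have a missing age_group, and value_counts() leaves out missing values by default.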
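When checking a binned column, it can help to count the missing values explicitly, since value_counts() hides them unless asked. Here is a small sketch using the lesson's bins on a hypothetical list of ages (not taken from the dataset), two of which are missing.

```python
import numpy as np
import pandas as pd

age_bins = [0, 12, 18, 30, 45, 100]
age_labels = ['Child', 'Teenager', 'Young Adult', 'Middle Age', 'Senior']

# Hypothetical ages for illustration; two of them are missing
sample_ages = pd.Series([5, 22, 26, 35, 35, np.nan, 60, np.nan])
age_group = pd.cut(sample_ages, bins=age_bins, labels=age_labels)

# Default: missing values are left out of the counts
counts = age_group.value_counts()

# dropna=False adds a row for the missing values
counts_with_missing = age_group.value_counts(dropna=False)

# Direct count of rows that did not get an age group
n_missing = age_group.isna().sum()

print(counts)
print(counts_with_missing)
print(n_missing)
```

Applied to titanic_df['age_group'], the same calls would show how many passengers have no age group at all, which is worth knowing before feeding the column to a model.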

Summing It Up

In this lesson, we've learned to engineer new features from our existing data using Python, which can improve the performance of the machine learning models we train later. We've added a family_size feature by adding up sibsp and parch plus one for the passenger, and we've created a new categorical feature, age_group, by segmenting ages and defining labels for each segment. We've also learned how to use the cut function from pandas to achieve this, defining bin edges that make these segments meaningful.

Both family_size and age_group are practical because they describe demographic traits that can influence survival chances. For example, larger families might have a lower chance of survival due to difficulties keeping the family together during the sinking, or certain age groups might have a higher or lower survival rate.

Feature engineering is often a vital step in the real world because you're not always provided with the most predictive features at the start. Sometimes, you'll have to create them yourself!

Let's move on to apply what you've learned and get your hands 'dirty' on some data. The more you practice, the better your intuition for engineering new features will become. Happy Coding!
