Welcome to this lesson on Feature Engineering! Today, we'll explore how to derive new features from our existing data to enhance our predictive models. These derived features could provide more insightful information that our original data might not capture directly.
Feature Engineering is an essential part of machine learning, and it's the process of using domain knowledge to create features that make machine learning algorithms work. Although modern machine learning methods can automatically derive features, manually combining existing features – based on human intuition and industry expertise – can often produce better results.
Why is Feature Engineering vital? Consider this parallel: Artistic talent won't help a painter without paints, and a high-quality dataset may be useless without proper features. The process of Feature Engineering ensures you have the 'right paint' to create your masterpiece!
Let's use the Titanic dataset as an example. We could create a new feature, `age_group`, to categorize age into different groups, or another feature, `family_size`, by adding `sibsp` (number of siblings/spouses aboard) and `parch` (number of parents/children aboard). Let's dive in!
We'll start by creating the `family_size` feature. This is simply the `sibsp` and `parch` features added together, plus one (the passenger themself). You might wonder why we are creating the `family_size` feature. The reason is that family size can have a significant impact on a person's chance of survival. For instance, a person with a large family might have gotten lost in the crowd, or might have searched for their family members, delaying their escape.
```python
# Load the data
import seaborn as sns

titanic_df = sns.load_dataset('titanic')

# Create a new feature, 'family_size'
titanic_df['family_size'] = titanic_df['sibsp'] + titanic_df['parch'] + 1
print(titanic_df.head())
"""
   survived  pclass     sex   age  ...  embark_town  alive  alone  family_size
0         0       3    male  22.0  ...  Southampton     no  False            2
1         1       1  female  38.0  ...    Cherbourg    yes  False            2
2         1       3  female  26.0  ...  Southampton    yes   True            1
3         1       1  female  35.0  ...  Southampton    yes  False            2
4         0       3    male  35.0  ...  Southampton     no   True            1

[5 rows x 16 columns]
"""
```
After executing the code above, you'll notice an extra column, `family_size`, in the dataset, representing each passenger's family size. For instance, the first passenger (Mr. Owen Harris) has a family size of 2 (one spouse aboard), and Miss Laina has a family size of 1 (she traveled alone).
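As a quick sanity check on the arithmetic, here is a minimal, self-contained sketch using made-up values (not rows from the real dataset):

```python
import pandas as pd

# Illustrative values only -- not actual Titanic passengers
df = pd.DataFrame({'sibsp': [1, 0, 3], 'parch': [0, 0, 2]})

# family_size = siblings/spouses + parents/children + the passenger themself
df['family_size'] = df['sibsp'] + df['parch'] + 1
print(df['family_size'].tolist())  # [2, 1, 6]
```

A passenger with one sibling/spouse and no parents/children gets a family size of 2, and a solo traveler gets 1, matching what we saw above.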
Another common operation in feature engineering is the creation of categorical features. Categories often carry more meaning than raw continuous values. For instance, we could categorize `age` into different age groups. We can use the `cut()` function from pandas, which segments and sorts data values into bins, making it an efficient way to transform continuous variables into categorical counterparts: each value in the input array is assigned to the bin whose edges contain it.
```python
# Import pandas
import pandas as pd

# Define the bin edges
age_bins = [0, 12, 18, 30, 45, 100]

# Define the bin labels
age_labels = ['Child', 'Teenager', 'Young Adult', 'Middle Age', 'Senior']

# Create the age group feature
titanic_df['age_group'] = pd.cut(titanic_df['age'], bins=age_bins, labels=age_labels)

# Show the first few rows of the data
print(titanic_df.head())
"""
   survived  pclass     sex   age  ...  alive  alone  family_size    age_group
0         0       3    male  22.0  ...     no  False            2  Young Adult
1         1       1  female  38.0  ...    yes  False            2   Middle Age
2         1       3  female  26.0  ...    yes   True            1  Young Adult
3         1       1  female  35.0  ...    yes  False            2   Middle Age
4         0       3    male  35.0  ...     no   True            1   Middle Age

[5 rows x 17 columns]
"""
```
Here, the `pd.cut()` function is used to segregate array elements into different bins. The `bins` argument defines the bin edges, and the `labels` argument sets the label names for the resulting bins. In the output, you'll notice a new column, `age_group`, categorizing passengers into different age groups.
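One detail worth knowing: by default, `pd.cut()` treats bin edges as right-inclusive, so with the edges above a 12-year-old falls in 'Child' while a 13-year-old is a 'Teenager'. A minimal sketch with hand-picked ages illustrates this:

```python
import pandas as pd

# Ages chosen to sit exactly on and just past the bin edges
ages = pd.Series([12, 13, 30, 31])
groups = pd.cut(ages, bins=[0, 12, 18, 30, 45, 100],
                labels=['Child', 'Teenager', 'Young Adult', 'Middle Age', 'Senior'])

# Bins are (0, 12], (12, 18], (18, 30], (30, 45], (45, 100]
print(groups.tolist())  # ['Child', 'Teenager', 'Young Adult', 'Middle Age']
```

If you prefer left-inclusive bins, `pd.cut()` accepts a `right=False` argument to flip this behavior.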
Let's check the distribution of `age_group` to verify that the transformation was successful:
```python
# Check the distribution of the 'age_group' column
print(titanic_df['age_group'].value_counts())
"""
age_group
Young Adult    270
Middle Age     202
Senior         103
Teenager        70
Child           69
Name: count, dtype: int64
"""
```
You'll see that each age group has a specific count of passengers in the dataset belonging to that group.
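Note that these counts sum to 714, while the Titanic dataset has 891 rows: `age` is missing for some passengers, `pd.cut()` leaves those values as `NaN`, and `value_counts()` excludes `NaN` by default. A small self-contained sketch of that behavior:

```python
import pandas as pd
import numpy as np

# One of the three ages is missing
ages = pd.Series([22.0, np.nan, 35.0])
groups = pd.cut(ages, bins=[0, 12, 18, 30, 45, 100],
                labels=['Child', 'Teenager', 'Young Adult', 'Middle Age', 'Senior'])

# pd.cut leaves missing inputs as NaN rather than assigning a bin
print(groups.isna().sum())  # 1
```

Depending on your model, you may want to fill those missing ages first or treat "unknown age" as its own category.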
In this lesson, we've learned to engineer new features from existing data using Python, improving the subsequent performance of our machine learning models. We added a `family_size` feature by summing `sibsp` and `parch`, and we created a new categorical feature, `age_group`, by segmenting ages and defining labels for each segment. Along the way, we learned how to use the `cut()` function from pandas, defining bin edges that make these segments meaningful.
Both `family_size` and `age_group` are demographic representations that can influence survival chances. For example, larger families might have a lower chance of survival due to difficulties keeping the family together during the sinking, and certain age groups might have a higher or lower survival rate.
Feature engineering is often a vital step in the real world because you're not always provided with the most predictive features at the start. Sometimes, you'll have to create them yourself!
Now, let's apply what you've learned and get your hands dirty with some data. The more you practice, the better your intuition for engineering new features will become. Happy coding!