Mastering Feature Selection with SelectFromModel in Scikit-learn

Lesson 5

Introduction

Welcome back! In this lesson, we will look at another powerful technique for feature selection: SelectFromModel. It is particularly useful when you have a model that can be trained on your data and you want to keep only the features that the model itself considers most important.

SelectFromModel is a meta-transformer that works with any estimator that exposes a per-feature importance after fitting, through an attribute such as coef_ or feature_importances_. You can also set a threshold: SelectFromModel keeps the features whose importance is greater than or equal to this threshold and discards the rest.

In essence, SelectFromModel does the heavy lifting of identifying and keeping the right features based on the model's importance criterion, which saves you from ranking and filtering features by hand.
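To make the two importance attributes concrete, here is a minimal sketch on synthetic data. The data, the variable names, and the choice of RandomForestRegressor as the second estimator are assumptions for illustration only; they are not part of the lesson's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): only features 0 and 2 drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] - 3.0 * X_demo[:, 2]

# LinearRegression exposes coef_; RandomForestRegressor exposes feature_importances_.
linear_selector = SelectFromModel(LinearRegression()).fit(X_demo, y_demo)
forest_selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=0)
).fit(X_demo, y_demo)

# Boolean masks: True marks a feature that the selector keeps.
linear_mask = linear_selector.get_support()
forest_mask = forest_selector.get_support()
print("Linear model keeps:", linear_mask)
print("Random forest keeps:", forest_mask)
```

Both selectors read a different attribute, yet SelectFromModel treats them the same way: it turns the attribute into one importance score per feature and compares each score with the threshold.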

Exploring the California Housing Dataset

Dimensionality reduction and feature selection are easier to understand when we apply them to a real dataset. In this lesson, we will work with the California Housing dataset that ships with scikit-learn's dataset utilities.

Let's begin by loading and briefly exploring the dataset:

Python
# Import necessary libraries
from sklearn.datasets import fetch_california_housing

# Load California housing dataset
housing = fetch_california_housing()
X = housing.data
Y = housing.target

The California housing dataset describes census block groups in California using eight variables, such as median income, average number of rooms, and population. Each of these variables may influence the target, the median house value, to a different degree, which makes the dataset a good candidate for feature selection.

After loading the dataset, you would normally split it into training and testing sets with the train_test_split function from Scikit-learn, so that the model is trained on one part and evaluated on the other. We will skip this step in this lesson and work with the full X and Y, focusing only on applying SelectFromModel for feature selection.
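For reference, the split that this lesson skips could look like the sketch below. The toy arrays and the 25% test size are assumptions chosen only to keep the example small and self-contained.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for a feature matrix and target (hypothetical data, 20 samples).
X_toy = np.arange(40, dtype=float).reshape(20, 2)
y_toy = np.arange(20, dtype=float)

# Hold out 25% of the rows for evaluation; random_state makes the split repeatable.
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)
print(X_toy_train.shape, X_toy_test.shape)
```

When you do split the data, fit both the model and the selector on the training part only, so the test part stays unseen.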

Creating and Training a Linear Regression Model

Before moving on to feature selection, let's understand why we need a model at all. A fitted model tells us how much each feature contributes to predicting the target variable, and SelectFromModel uses exactly this information to decide which features to keep.

For demo purposes, we'll use a simple Linear Regression model, which learns the relationship between the independent variables (features) and the dependent variable (housing prices).

Let's fit our model:

Python
from sklearn.linear_model import LinearRegression

# Fit a linear regression model on the full dataset (no train/test split in this lesson)
lr = LinearRegression()
lr.fit(X, Y)

Great! Now that we have a trained Linear Regression model, let's use it with SelectFromModel to perform feature selection.

Performing Feature Selection using SelectFromModel

Let's use our model with SelectFromModel, which takes the absolute values of the Linear Regression coefficients as the feature importances. Features whose importance is at or above the threshold are kept. When no threshold is given, the mean importance is used.

Here's how we do it:

Python
from sklearn.feature_selection import SelectFromModel

# Apply SelectFromModel; fit() trains a fresh copy of lr on the data it receives
sfm = SelectFromModel(lr)
sfm.fit(X, Y)

With these steps, we instructed SelectFromModel to analyze all features in our dataset, determine their importance through our Linear Regression model, and select features that are considered important by our model.
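The selection rule can be reproduced by hand, which may help demystify it. The sketch below uses synthetic data with known coefficients (an assumption for illustration) and compares a manual calculation with the fitted selector's threshold_ and get_support() values.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients (hypothetical, for illustration only).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 5))
true_coefs = np.array([1.5, -0.2, 0.0, -2.5, 0.3])
y_demo = X_demo @ true_coefs

selector = SelectFromModel(LinearRegression()).fit(X_demo, y_demo)

# Manual version of the rule: importance = |coefficient|, threshold = mean importance.
importances = np.abs(selector.estimator_.coef_)
manual_threshold = importances.mean()
manual_mask = importances >= manual_threshold

print("Threshold used by the selector:", selector.threshold_)
print("Threshold computed by hand:    ", manual_threshold)
print("Selected feature indices:", selector.get_support(indices=True))
```

Note that the feature with coefficient -2.5 is kept: the sign is ignored, and only the magnitude counts.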

Interpreting and Validating Selected Features

Finally, we arrive at the step where we uncover the most important features in our dataset according to SelectFromModel. We do this by calling sfm.get_support(indices=True), which returns an array with indices of features that are important. These indices are used to get the corresponding feature names from the dataset.

Let's unveil our most important features:

Python
# Printing the names of the most important features
for feature_list_index in sfm.get_support(indices=True):
    print(housing.feature_names[feature_list_index])

The output will be:

Plain text
MedInc
AveBedrms
Latitude
Longitude

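This output indicates that, of the eight features in the California Housing dataset, Median Income (MedInc), Average Bedrooms (AveBedrms), Latitude, and Longitude have the largest coefficient magnitudes, so SelectFromModel identifies them as the most important features for predicting housing prices. Because the features were not scaled, the size of a coefficient partly reflects the units of its feature, so treat this ranking as specific to this model. Filtering in this way allows for a more focused analysis and model training with these key features.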

Congratulations, you now have a list of the most important features, which you can use to train a simpler and potentially more accurate model. Reducing your data dimensionality using SelectFromModel doesn't look challenging now, does it?
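Once a selector is fitted, its transform method returns only the selected columns, ready for further training. Here is a minimal sketch on synthetic data; the data and the feature names are hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): features 0 and 2 dominate the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * X_demo[:, 1] - 2.0 * X_demo[:, 2]
demo_feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

selector = SelectFromModel(LinearRegression()).fit(X_demo, y_demo)

# transform() drops every column that was not selected.
X_reduced = selector.transform(X_demo)
kept_names = [demo_feature_names[i] for i in selector.get_support(indices=True)]

print("Original shape:", X_demo.shape)
print("Reduced shape: ", X_reduced.shape)
print("Kept features: ", kept_names)
```

The reduced array can be passed to any estimator in place of the original feature matrix.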

Applying Custom Thresholds

The default threshold for SelectFromModel is the mean of the feature importances, but you can set your own threshold using the threshold parameter. For instance, if you want to set a threshold of 0.6, you can do so by:

Python
sfm = SelectFromModel(lr, threshold=0.6)
sfm.fit(X, Y)

for feature_list_index in sfm.get_support(indices=True):
    print(housing.feature_names[feature_list_index])

The output will be:

Plain text
AveBedrms
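Besides a number, the threshold parameter also accepts strings such as "median" or a scaled form like "1.25*mean", and the max_features parameter caps how many features are kept. The sketch below tries these options on synthetic data with known coefficients (hypothetical data, for illustration only).

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients (an assumption for illustration).
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([3.0, 0.1, -2.0, 0.5, 1.0])

# Keep features whose importance is at or above the median importance.
median_selector = SelectFromModel(LinearRegression(), threshold="median")
# Keep features whose importance is at or above 1.25 times the mean importance.
scaled_selector = SelectFromModel(LinearRegression(), threshold="1.25*mean")
# Ignore the threshold and keep only the single most important feature.
top_selector = SelectFromModel(LinearRegression(), threshold=-np.inf, max_features=1)

median_kept = median_selector.fit(X_demo, y_demo).get_support(indices=True)
scaled_kept = scaled_selector.fit(X_demo, y_demo).get_support(indices=True)
top_kept = top_selector.fit(X_demo, y_demo).get_support(indices=True)

print("median threshold keeps:   ", median_kept)
print("1.25*mean threshold keeps:", scaled_kept)
print("max_features=1 keeps:     ", top_kept)
```

A stricter threshold keeps fewer features, so it is worth comparing model quality at a few settings before fixing one.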

Lesson Summary

In today's lesson, we took a close look at feature selection using Scikit-learn's SelectFromModel. We started by understanding why feature selection matters and how SelectFromModel helps achieve it. We then used the California Housing dataset to walk through each step, from training the model to extracting the most important features with SelectFromModel.

Theory alone won't do much good unless you get your hands dirty with code. The upcoming exercises are designed to solidify your understanding of this topic. Remember, practice is the key. Happy coding!
