Welcome back! In this lesson, we will delve into another powerful technique for feature selection: `SelectFromModel`. This technique is particularly useful when you have a trained model and want to select the most important features based on the model's importance criterion.
`SelectFromModel` is a meta-transformer that can be used with any estimator that assigns an importance to each feature through a specific attribute (such as `coef_` or `feature_importances_`). You can then set a threshold, and `SelectFromModel` will keep only the features whose importance meets or exceeds it (for coefficients, it is the absolute value that counts).
In essence, `SelectFromModel` does the heavy lifting of identifying and choosing the right features based on the model's importance criterion, which is a significant advantage for any machine learning practitioner!
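To make this concrete, here is a minimal sketch. The synthetic dataset and the random forest estimator are illustrative choices and are not part of this lesson's main example; they simply show that any estimator exposing `feature_importances_` (or `coef_`) can be wrapped:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Toy data: 100 samples, 5 features
X_toy, y_toy = make_regression(n_samples=100, n_features=5, random_state=0)

# Wrap any estimator that exposes feature_importances_ or coef_
selector = SelectFromModel(RandomForestRegressor(n_estimators=50, random_state=0))
X_reduced = selector.fit_transform(X_toy, y_toy)

print(X_reduced.shape)  # fewer columns than X_toy
```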
Discussing dimensionality reduction and feature selection in the abstract only goes so far; these ideas become far more concrete and comprehensible when applied to a real dataset. For this lesson, we will work with the California Housing dataset available in scikit-learn's collection of datasets.
Let's begin by loading and briefly exploring the dataset:
```python
# Import necessary libraries
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
housing = fetch_california_housing()
X = housing.data
Y = housing.target
```
The California Housing dataset contains data on variables like the median income, average number of rooms, and population for housing blocks across California. Each of these variables may exert a distinct influence on the median house price, our target variable in this dataset. Managing a dataset like this makes feature selection quite necessary!
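If you'd like to see these variables for yourself, a quick inspection works well:

```python
# A quick look at what we just loaded
print(housing.feature_names)
# ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
#  'Population', 'AveOccup', 'Latitude', 'Longitude']

print(X.shape)  # (20640, 8): 20,640 housing blocks, 8 features
```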
After loading the dataset, we could split the data into training and testing sets using the `train_test_split` function from scikit-learn, training the model on the training data and evaluating it on the testing data. We will skip that step in this lesson and focus solely on applying `SelectFromModel` for feature selection.
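For reference, a typical split might look like the following; the `test_size` and `random_state` values are illustrative, and the rest of this lesson fits on the full `X` and `Y`:

```python
from sklearn.model_selection import train_test_split

# Shown for reference only -- this lesson works with the full X and Y
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
```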
Before moving on to feature selection, let's understand why we need a model at all. The model determines the importance of each feature in predicting the target variable, and `SelectFromModel` then uses that importance to select the most important features. Hence, the model acts as a guiding light in our feature selection journey.
For demo purposes, we'll use a simple Linear Regression model, which will learn the relationship between the independent variables (features) and the dependent variable (housing prices).
Let's fit our model:
```python
from sklearn.linear_model import LinearRegression

# Fit a linear regression model on the data
lr = LinearRegression()
lr.fit(X, Y)
```
Great! Now that we have a trained Linear Regression model, let's use it with `SelectFromModel` to perform feature selection.
`SelectFromModel` will use the Linear Regression's coefficients to determine the importance of each feature: features whose absolute coefficient value meets the threshold are considered important.
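As an optional sanity check, not required for the lesson, we can peek at the absolute coefficients that `SelectFromModel` will compare against its threshold:

```python
import numpy as np

# Inspect the absolute coefficients of the fitted model
for name, coef in zip(housing.feature_names, np.abs(lr.coef_)):
    print(f"{name}: {coef:.4f}")
```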
Here's how we do it:
```python
from sklearn.feature_selection import SelectFromModel

# Apply SelectFromModel using the trained estimator
sfm = SelectFromModel(lr)
sfm.fit(X, Y)
```
With these steps, we instructed `SelectFromModel` to analyze all features in our dataset, determine their importance through our Linear Regression model, and select the ones the model deems important.
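The fitted selector can also reduce the feature matrix directly via its `transform` method; here's a short sketch:

```python
# Keep only the columns corresponding to the selected features
X_important = sfm.transform(X)

print(X.shape, "->", X_important.shape)  # (20640, 8) -> (20640, 4)
```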
Finally, we arrive at the step where we uncover the most important features in our dataset according to `SelectFromModel`. We do this by calling `sfm.get_support(indices=True)`, which returns an array of the indices of the important features. These indices are then used to look up the corresponding feature names.
Let's unveil our most important features:
```python
# Print the names of the most important features
for feature_list_index in sfm.get_support(indices=True):
    print(housing.feature_names[feature_list_index])
```
The output will be:
```text
MedInc
AveBedrms
Latitude
Longitude
```
This output indicates that among all the features in the California Housing dataset, Median Income (`MedInc`), Average Bedrooms (`AveBedrms`), `Latitude`, and `Longitude` were identified as the most important when determining housing prices. Filtering down to these key features allows for a more focused analysis and model training.
Congratulations, you now have a list of the most important features that you can use to train a more efficient and accurate model. Reducing your data's dimensionality with `SelectFromModel` doesn't look so challenging now, does it?
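For instance, here is one sketch of retraining on just the selected columns. Evaluating on the training data, as done here, is an illustrative shortcut for brevity rather than proper practice:

```python
# Retrain a fresh model on only the selected features
X_selected = sfm.transform(X)
lr_selected = LinearRegression().fit(X_selected, Y)

# Compare training-set R^2 scores of the two models
print(f"R^2 on all features:      {lr.score(X, Y):.4f}")
print(f"R^2 on selected features: {lr_selected.score(X_selected, Y):.4f}")
```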
The default threshold for `SelectFromModel` is the mean of the feature importances, but you can set your own threshold using the `threshold` parameter. For instance, to use a threshold of 0.6:
```python
sfm = SelectFromModel(lr, threshold=0.6)
sfm.fit(X, Y)

for feature_list_index in sfm.get_support(indices=True):
    print(housing.feature_names[feature_list_index])
```
The output will be:
```text
AveBedrms
```
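The `threshold` parameter also accepts strings such as `"median"`, `"mean"`, or a scaled version like `"1.5*mean"`. For example:

```python
# Keep features whose importance is at least the median importance
sfm_median = SelectFromModel(lr, threshold="median")
sfm_median.fit(X, Y)

# Keep features whose importance is at least 1.5x the mean importance
sfm_scaled = SelectFromModel(lr, threshold="1.5*mean")
sfm_scaled.fit(X, Y)
```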
In today's lesson, we dove deep into feature selection using scikit-learn's `SelectFromModel`. We started by understanding why feature selection is significant and how `SelectFromModel` helps achieve it, then used the California Housing dataset to demonstrate each step, from training the model to extracting the most important features.
Dwelling on theory won't do much good unless you get your hands dirty with code. Hence, the upcoming exercises are carefully designed to solidify your understanding of this topic. Remember, practice is the key. Happy coding!