Greetings! Welcome to an exciting lesson on feature selection and dimensionality reduction, a foundational topic in machine learning and data science. Today, we will delve into a variance-based approach to feature selection for high-dimensional data. We will explore the importance of feature selection, understand the concept of variance, and implement feature selection using `VarianceThreshold` on a synthetic dataset.
The variance of a feature is a statistical measure that describes how spread out the values of that feature are. It is one of the most fundamental quantities in statistical data analysis.
In the context of feature selection, a feature with low variance (close to zero) likely carries little information. For instance, consider a dataset of students with a 'nationality' variable where 99% of students come from India. The 'nationality' feature will have very low variance, as almost all observations are 'India'; it is near-constant and therefore would not improve the model's performance.
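To make this concrete, here is a minimal sketch (the column and its values are hypothetical, with the dominant nationality encoded as 0 and all others as 1) showing how a near-constant feature yields a variance close to zero:

```python
import numpy as np

# Hypothetical 'nationality' column for 1,000 students: 99% share the same value
nationality = np.array([0] * 990 + [1] * 10)

# Variance of a 99%-constant binary feature: 0.01 * 0.99 = 0.0099, close to zero
print(nationality.var())
```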
Variance-based feature selection should be used when you suspect that some features are near-constant and may not be informative for the model.
Scikit-learn provides the `VarianceThreshold` transformer to remove all features whose variance does not meet a given threshold. By removing these low-variance features, we decrease the number of input dimensions.
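Note that the threshold is a parameter you choose; if you omit it, scikit-learn's default of 0.0 removes only features that are completely constant. Here is a quick sketch of the API on a toy array:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0],
              [0.0, 1.5],
              [0.0, 3.0]])  # the first column is constant, the second varies

# With the default threshold of 0.0, only zero-variance features are dropped
selector = VarianceThreshold()
print(selector.fit_transform(X))  # keeps only the second column
```

Also keep in mind that variance depends on a feature's scale: features measured on larger scales naturally have larger variances, which matters when picking a threshold.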
To demonstrate our feature selection and dimensionality reduction concepts, let's start by generating a synthetic dataset. For many machine learning concepts, especially those related to data preprocessing and manipulation, synthetic datasets can be a useful tool for learning and exploration.
First, we'll need to import `pandas`, `numpy`, and `VarianceThreshold` from `sklearn.feature_selection`. We are going to use `pandas` and `numpy` to create a DataFrame with ten distinct features, each composed of random numbers.
```python
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold

np.random.seed(36)
```
Next, we generate a DataFrame with ten features, multiplying some of them by different constants so that their variances differ.
```python
data = pd.DataFrame(data={
    "feature_1": np.random.rand(1000),
    "feature_2": np.random.rand(1000) * 10,
    "feature_3": np.random.rand(1000),
    "feature_4": np.random.rand(1000) * 100,
    "feature_5": np.random.rand(1000),
    "feature_6": np.random.rand(1000) * 0.1,
    "feature_7": np.random.rand(1000),
    "feature_8": np.random.rand(1000) * 0.01,
    "feature_9": np.random.rand(1000),
    "feature_10": np.random.rand(1000) * 50,
})

print("Original data shape: ", data.shape)  # (1000, 10)
```
The output of the above code confirms that our dataset has 1,000 rows and 10 columns.
Here, we assume that all features in our data are numerical and that there is no missing data.
After generating the data, let's apply `VarianceThreshold` and see how it impacts the dimensionality of our data.
```python
# We use VarianceThreshold to perform the feature selection.
# With threshold=0.1, any feature whose variance is less than 0.1 will be removed.
selector = VarianceThreshold(threshold=0.1)

# Fit the selector to the data and transform it in one step
data_values = data.values
data_values_reduced = selector.fit_transform(data_values)

# Print the shape of the reduced data
print("Reduced data shape: ", data_values_reduced.shape)  # (1000, 3)
```
The output shows that the shape of the reduced data is (1000, 3) after applying the variance threshold: the dimensionality of our dataset has been reduced from 10 features to 3, meaning only three features met the variance threshold and were kept.
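To see why exactly those three features survived, you can inspect each feature's variance directly (continuing with the data generated above). A uniform random variable on [0, 1] has variance 1/12 ≈ 0.083, just below our 0.1 threshold, so only the features multiplied by large constants clear it:

```python
# Inspect per-feature variances to see which ones exceed the 0.1 threshold
print(data.var())
```

(A small caveat: pandas' `.var()` computes the sample variance with ddof=1, while `VarianceThreshold` uses the population variance; with 1,000 rows the difference is negligible.)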
Now, it would also be beneficial to know which features have been retained after the feature selection process. For this, we can utilize the `get_support` method of the `VarianceThreshold` object.
```python
# Get the names of the features that were kept.
# get_support(indices=True) returns the integer indices of the selected features,
# which we use to look up the corresponding column names.
kept_features = data.columns[selector.get_support(indices=True)]
print("Kept Features: ", kept_features)
```
The output of the above code will be:
```
Kept Features:  Index(['feature_2', 'feature_4', 'feature_10'], dtype='object')
```
This shows the names of the features that were kept after applying the variance threshold. It tells us which features contain enough variance to potentially improve the performance of a machine learning model.
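Since `fit_transform` returns a plain NumPy array, the column labels are lost. One convenient pattern (a sketch, not the only way) is to rebuild a labeled DataFrame from the reduced array using the kept feature names:

```python
# Rebuild a labeled DataFrame from the reduced NumPy array
data_reduced = pd.DataFrame(data_values_reduced, columns=kept_features)
print(data_reduced.head())
```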
You've now learned how to implement `VarianceThreshold` for feature selection and dimensionality reduction. We've established the importance of dimensionality reduction, introduced feature selection, walked you through the concept of variance, and performed variance-based feature selection with `VarianceThreshold` on a synthetic dataset.
Remember, practice is key to gaining a good command of these concepts! I recommend experimenting with different variance thresholds and observing how they affect the number and selection of features. This will strengthen your understanding of feature selection in your own data science and machine learning projects. Happy learning!
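As a starting point for that experimentation, here is a small sketch (the threshold values are arbitrary choices for illustration) that sweeps several candidate thresholds over the dataset from this lesson and reports how many features each one keeps:

```python
# Try a range of thresholds and count how many features survive each one
for threshold in [0.01, 0.05, 0.1, 1.0, 10.0, 100.0]:
    selector = VarianceThreshold(threshold=threshold)
    reduced = selector.fit_transform(data.values)
    print(f"threshold={threshold}: kept {reduced.shape[1]} feature(s)")
```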