Hello, and welcome to this lesson on univariate statistical tests for feature selection in machine learning. How you handle a dataset's features can drastically affect your model's performance: by intelligently choosing the most relevant features, we aim to improve accuracy, reduce overfitting, and shorten training time. To achieve this, we will use a key technique known as univariate feature selection, applying SelectKBest to pick the most informative features from our dataset. By the end of this session, you will know how to use univariate feature selection in Python and appreciate its strengths and limitations.
Univariate statistical tests examine each feature independently to determine the strength of the relationship between the feature and the response variable. These tests are simple to run and understand and often provide good intuition about your features. The scikit-learn library provides the SelectKBest class, which uses a set of statistical tests to select a specific number of features.
The SelectKBest class simply retains the k features of X with the highest scores. In this lesson, we'll use the chi-squared statistical test for non-negative features to select the k best features. The chi-square test is used to determine whether there's a significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.
We'll use the Iris dataset from Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, as the dataset for this tutorial. The Iris dataset ships with scikit-learn, so it doesn't require downloading any file from an external website. It's a beginner-friendly dataset that contains measurements for 150 iris flowers from three different species.
The dataset contains five attributes - sepal length, sepal width, petal length, petal width, and species. Species is our target variable, while the rest measure particular characteristics (features) of individual Iris flowers.
Here's how you load the dataset:
```python
from sklearn.datasets import load_iris

# Loading the dataset
iris = load_iris()
X, y = iris.data, iris.target

print(X.shape, y.shape)  # (150, 4) (150,)
```
This output indicates that our dataset has 150 samples, each with 4 feature variables, and 150 target values representing the species of each Iris plant.
load_iris() is a function that returns a data object. We then unpack this object into our features (X) and classes (y).
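If you'd like to see which measurement each column corresponds to, the returned object also carries human-readable names for the features and the classes. Here's a quick check, a small sketch reusing load_iris from the snippet above:

```python
from sklearn.datasets import load_iris

# Reload the dataset and inspect the names of the features and classes
iris = load_iris()

print(iris.feature_names)
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.target_names)
# ['setosa' 'versicolor' 'virginica']
```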
Selecting features using the chi-square statistical test in scikit-learn involves the following steps:
1. Feature Selection: We use the chi-square statistical test for non-negative features to select the k best features from the Iris dataset. The SelectKBest class is used to choose those features, with k=2 indicating that we would like to select the top 2 features that are most related to the output variable.
2. Fit the Model: Now that we have our SelectKBest instance, we can train (fit) it on our dataset.
3. Get Selected Features: Once the model is trained, we can use the get_support method to retrieve a mask of the selected features.
4. Print Selected Features' Scores: Finally, we want to visually inspect the scores of our selected features.

Here's how you implement this:
```python
from sklearn.feature_selection import SelectKBest, chi2

# Performing feature selection
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)

# Print the index of selected features and their scores
selected_features = selector.get_support(indices=True)
scores = selector.scores_

print("Selected Features: ", selected_features)  # [2 3]
print("Scores: ", scores)  # [ 10.81782088   3.7107283  116.31261309  67.0483602 ]
```
This output reveals that features at index positions 2 and 3 (petal length and petal width) have the highest chi-square scores and are thus selected as the most relevant features for our prediction model.
The chi-square statistic is used to determine the strength of the relationship between the variables. A higher chi-square statistic indicates a stronger relationship.
The chi-square test statistic is calculated as follows:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed frequency and $E_i$ is the expected frequency in each category. The chi-square statistic follows a chi-square distribution, and the p-value is calculated using this distribution. The higher the chi-square statistic, the more significant the relationship between the variables.
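To make the formula concrete, here is a small hand-worked example with made-up observed and expected counts (not taken from the Iris data), computed directly in Python:

```python
# Toy illustration of the chi-square statistic with made-up counts:
# we observed 18 and 22 samples in two categories but expected 20 in each.
observed = [18, 22]
expected = [20, 20]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_square)  # (4/20) + (4/20) = 0.4
```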
While the chi-square scores indicate the strength of the relationship between the variables, the p-value is used to determine the significance of the relationship. The p-value is a measure of the probability that an observed difference could have occurred just by random chance. A lower p-value indicates a more significant relationship between the variables.
The p-value is calculated using the chi-square distribution. If the p-value is less than the significance level (usually 0.05), we reject the null hypothesis and conclude that there's a significant relationship between the variables. We calculate the p-value using the chi2 distribution in Python:
```python
from scipy.stats import chi2

# The survival function gives P(X > score) under a chi-square distribution
p_values = chi2.sf(scores, 1)
print("P-values: ", p_values)
# [1.00527740e-03 5.40637961e-02 4.05982042e-27 2.64927865e-16]
```
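As a side note, the SelectKBest instance fitted earlier also stores p-values computed during fitting in its pvalues_ attribute. scikit-learn derives these from the number of classes in the target, so they may differ from the one-degree-of-freedom calculation above:

```python
# p-values computed by scikit-learn's chi2 scorer during fitting
# (degrees of freedom are handled internally based on the number of classes,
# so values can differ from the manual df=1 calculation above)
print("P-values from SelectKBest: ", selector.pvalues_)
```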
We can interpret these p-values as follows: at a significance level of 0.05, sepal length (p ≈ 0.001), petal length (p ≈ 4.06e-27), and petal width (p ≈ 2.65e-16) all show a significant relationship with the species, while sepal width (p ≈ 0.054) falls just above the threshold and is not significant.
While the chi-square scores indicate the strength of the relationship between the variables, the p-values indicate the significance of the relationship. A high chi-square score doesn't necessarily mean a significant relationship if the p-value is high. Therefore, it's essential to consider both the chi-square scores and p-values when interpreting the results.
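A convenient way to keep both numbers in view is to print each feature's name, chi-square score, and p-value side by side, reusing the iris, scores, and p_values variables from the snippets above:

```python
# Show each feature's chi-square score next to its p-value
for name, score, p in zip(iris.feature_names, scores, p_values):
    print(f"{name}: score={score:.2f}, p-value={p:.3g}")
```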
While Univariate Feature Selection is an excellent way to filter out irrelevant features, it's important to be aware of its limitations:

- The chi-square test only works with non-negative feature values, so it can't handle data that contains negative inputs without preprocessing.
- Each feature is scored independently, so the method ignores interactions between features and may select several heavily correlated (and therefore redundant) features.
Understanding these limitations can guide your decision-making when choosing the most suitable feature selection technique for your particular dataset.
Today, we learned about the power of univariate feature selection and how it improves the effectiveness and efficiency of our machine learning models. We delved into the concept of univariate selection and examined how to implement it using scikit-learn's SelectKBest class to select the most informative features.
However, keep in mind the technique's limitations: the chi-square test can't handle negative input values, and because features are scored independently, heavily correlated features may all be selected.
The best way to solidify the concepts you've learned in this lesson is through practice. So, let's proceed to some exercises that will allow you to implement univariate feature selection on various datasets. This critical practice will prepare you for more advanced feature selection and dimensionality reduction techniques in future lessons. Onwards!