Today, we will explore Logistic Regression, a powerful and efficient machine learning algorithm for binary classification tasks that is especially useful in text classification. Our goal is to help you grasp the principles of Logistic Regression, build a Logistic Regression model to classify text messages, and validate the performance of this model. Let's dive right in!
Logistic Regression is a statistical method that we use for binary classification problems. Unlike linear regression, which predicts a continuous output, logistic regression is designed to predict the probability of a particular class or event. It produces a logistic curve, which is limited to values between 0 and 1.
The logistic function, also known as the sigmoid function, maps any real-valued number into a range between 0 and 1. This function forms the foundation of logistic regression and is also a key element in neural networks, which lie at the heart of deep learning.
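To make this concrete, here is a minimal sketch of the sigmoid function in Python; the helper name sigmoid and the sample inputs are just for illustration.

Python
import numpy as np

def sigmoid(z):
    # Map any real-valued input into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1
print(sigmoid(np.array([-5, 0, 5])))  # roughly [0.0067, 0.5, 0.9933]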
Logistic regression is widely used in machine learning, and most of its applications involve binary classification. A classic use case is predicting whether an email is spam or not. Logistic regression has both advantages and drawbacks: it is computationally efficient, easy to implement, and highly interpretable. On the other hand, its decision surface is linear, so it cannot solve non-linear problems and tends to underperform when the true decision boundary is non-linear.
Our first step is to load the SMS Spam Collection dataset. After that, we will preprocess the data to make it suitable for our model.
Our preprocessing will include splitting the data into a training set and a testing set with a stratified train/test split. Then, we will convert the input features (the message column) from text into a numerical format that our machine can understand. Lastly, we will define our output labels (the label column).
Even though the loading and preprocessing steps are crucial, we won't delve too much into them as you've already got a solid understanding of these points from previous lessons.
So, let's just have a look at the code needed to perform these steps using Python and Scikit-learn.
Python
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import datasets

# Load the dataset
spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
spam_dataset = pd.DataFrame(spam_dataset)

# Preprocess the data
X = spam_dataset["message"]
Y = spam_dataset["label"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)
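As an optional sanity check (not part of the lesson's required code), you could print the shapes of the vectorized matrices and the vocabulary size, reusing the variables defined above:

Python
# Optional: inspect the result of the CountVectorizer step
print("Training matrix shape:", X_train_count.shape)  # (number of training messages, vocabulary size)
print("Test matrix shape:", X_test_count.shape)        # (number of test messages, vocabulary size)
print("Vocabulary size:", len(count_vectorizer.vocabulary_))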
Having preprocessed the data, we can now initialize and train a Logistic Regression model.
The Logistic Regression model in Scikit-learn is initialized with the LogisticRegression() function. We can subsequently train the model using the fit() function.

Here, you pass to the fit() function the input features of the training dataset and the corresponding labels. During training, the model tries to discover relationships between the features and labels that can be used for making predictions.
The Logistic Regression model initializer has a few hyperparameters that you can adjust to optimize the model's performance. Commonly adjusted hyperparameters include 'C' and 'penalty'. 'C' is the inverse of the regularization strength: smaller values specify stronger regularization. The 'penalty' parameter specifies the norm used in the penalization.
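As an illustration, here is a sketch of how you might set these hyperparameters; the specific values are arbitrary examples, not tuned choices. In this lesson, however, we will stick with the defaults, as the next code block shows.

Python
from sklearn.linear_model import LogisticRegression

# Illustrative values only: a smaller C means stronger regularization,
# and 'l2' is the norm used for the penalty term
tuned_model = LogisticRegression(C=0.5, penalty='l2', random_state=42)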
Python
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
logistic_regression_model = LogisticRegression(random_state=42)

# Train the model
logistic_regression_model.fit(X_train_count, Y_train)
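Because logistic regression is highly interpretable, you could optionally inspect the learned coefficients to see which words push a prediction toward the positive class. This snippet reuses the vectorizer and model defined above; the class order is given by classes_, and get_feature_names_out() assumes a recent Scikit-learn version.

Python
import numpy as np

# Pair each vocabulary word with its learned coefficient; larger coefficients
# push predictions toward the second entry of logistic_regression_model.classes_
feature_names = count_vectorizer.get_feature_names_out()
coefficients = logistic_regression_model.coef_[0]
top_indices = np.argsort(coefficients)[-10:]
print("Class order:", logistic_regression_model.classes_)
print("Ten most indicative words:", feature_names[top_indices])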
Once the model has been trained, we can use it to classify new, unseen messages. Classification is done with the predict() function, which takes as input the features of the test dataset and returns the predicted labels.

To evaluate the quality of the model, we compare its predictions to the actual labels of the test dataset. Here, we calculate accuracy as our evaluation metric using the accuracy_score function from Scikit-learn's metrics module.
Python
from sklearn import metrics

# Make predictions
y_pred = logistic_regression_model.predict(X_test_count)

# Calculate and print the accuracy
accuracy = metrics.accuracy_score(Y_test, y_pred)
print(f"Accuracy of Logistic Regression Classifier: {accuracy:.2f}")
The output of the above code will be:
Plain text
Accuracy of Logistic Regression Classifier: 0.98
This output signifies a very high accuracy: our Logistic Regression model correctly classifies about 98% of the test messages as 'spam' or 'not spam', demonstrating its effectiveness on the text classification task at hand.
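To see the classifier in action on a brand-new message, you could run a short snippet like the one below; the example message is made up for illustration and reuses the fitted vectorizer and model from above.

Python
# Classify a new, unseen message
new_message = ["Congratulations! You have won a free prize. Reply now to claim."]
new_message_count = count_vectorizer.transform(new_message)
print(logistic_regression_model.predict(new_message_count))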
Today, you have learned how to apply Logistic Regression, an essential machine learning algorithm, to classify text data. Specifically, you learned how logistic regression works, carried out necessary data preprocessing, trained a logistic regression model, and evaluated its accuracy.
Now comes the fun part — hands-on practice! Up next are several practice exercises that will challenge you to implement what you've learned in different scenarios and with varying datasets. These exercises will reinforce your understanding and broaden your skills in modelling and classifying text data with Logistic Regression. Enjoy the exploration!