Logistic regression is a classification algorithm used to predict the probability that an observation belongs to a particular class. Unlike linear regression, which predicts a continuous number, logistic regression predicts a discrete outcome: yes or no, true or false, or in data terms, class 0 or class 1.
For example, imagine you have data on whether an email is spam or not. Logistic regression can help you predict whether a new email is spam based on its content.
You'll use logistic regression when you need to classify data into categories. Real-life examples include:
- Predicting if a student will pass or fail
- Determining whether an email is spam
- Diagnosing whether a patient has a certain disease
Logistic regression works by fitting a logistic function (also known as the sigmoid function) to the data. The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the linear combination of input features:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

Here, $\beta_0$ is the intercept (bias term), and $\beta_1, \dots, \beta_n$ are the weights (coefficients) associated with the input features $x_1, \dots, x_n$.
The sigmoid function outputs a probability value between 0 and 1. If this probability is greater than a certain threshold (commonly 0.5), the outcome is class 1 (e.g., the email is spam). Otherwise, it is class 0 (e.g., the email is not spam). The model adjusts the weights during training to minimize the difference between predicted and actual class labels.
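To make this concrete, here is a minimal sketch of the sigmoid and the 0.5 threshold rule in plain NumPy (the variable names here are illustrative, not part of any library):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to a probability between 0 and 1
    return 1 / (1 + np.exp(-z))

# A few example values of z, the linear combination of features
z = np.array([-3.0, 0.0, 2.5])
probs = sigmoid(z)
print(probs)                      # [0.04742587 0.5        0.92414182]
print((probs > 0.5).astype(int))  # [0 0 1] -- thresholded at 0.5
```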
Here is how the sigmoid function fits the classification data:

(Figure: the sigmoid curve fit to binary classification data.)
Before we can train a logistic regression model, we'll need some data. Scikit-Learn, a popular Python library for machine learning, provides many built-in datasets. For this lesson, we'll use the wine dataset, where the task is to predict the class of a wine from its chemical properties.
```python
from sklearn.datasets import load_wine

# Load real dataset
X, y = load_wine(return_X_y=True)
print(X.shape, y.shape)  # (178, 13) (178,)
```
Here, `X` represents the features (input data), and `y` represents the labels (output data) we want to predict. The wine dataset is well-known for this kind of task.
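If you're curious what those 13 features actually are, calling `load_wine()` without `return_X_y=True` returns an object that also carries the feature and class names; a quick optional peek:

```python
from sklearn.datasets import load_wine

# The full dataset object includes metadata alongside the arrays
wine = load_wine()
print(wine.feature_names[:3])  # ['alcohol', 'malic_acid', 'ash']
print(wine.target_names)       # ['class_0' 'class_1' 'class_2']
```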
If we trained our model on all the data, we wouldn't know how well it performs on unseen data. So, we'll split the data into training and testing sets using the `train_test_split` function from Scikit-Learn.
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Scale the features for better convergence
X = StandardScaler().fit_transform(X)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)  # (106, 13) (72, 13) (106,) (72,)
```
In this example:

- `test_size=0.4` means 40% of the data will be used for testing, and 60% for training.
- `random_state=42` ensures that we get the same split each time we run the code, making our results reproducible.
Splitting the data this way ensures that we can test how well our model generalizes to new data.
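One optional refinement worth knowing about: `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both splits, which can matter for small or imbalanced datasets. A sketch of the alternative call:

```python
from sklearn.model_selection import train_test_split

# stratify=y preserves the class distribution in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
```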
Next, we'll train our model on the training data using the `fit` method. Training a model simply means finding the best parameters (weights) that map inputs to outputs accurately.
```python
from sklearn.linear_model import LogisticRegression

# Train the logistic regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
```
The `max_iter` parameter in logistic regression (and other iterative algorithms) sets the maximum number of iterations the algorithm will perform during optimization. It controls how long the algorithm keeps iterating to converge on an optimal set of coefficients. The choice of `max_iter` affects both convergence and computational efficiency; in practice, it's often helpful to start with the default value and adjust it based on the specific dataset and the observed convergence behavior.
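One way to see whether `max_iter` was large enough: after fitting, the model's `n_iter_` attribute reports how many iterations the solver actually used. If it hits the `max_iter` ceiling (scikit-learn also emits a `ConvergenceWarning` in that case), the model may not have converged. A quick check:

```python
# How many iterations the solver actually needed (an array, one entry per fit)
print(log_reg.n_iter_)  # e.g. [14] -- comfortably below max_iter=1000
```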
During this training process, the model learns to distinguish between the classes, i.e., whether a wine belongs to class 0, 1, or 2.
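If you want to peek at what was learned, the fitted model exposes its parameters directly: `coef_` holds one weight vector per class and `intercept_` holds the bias terms:

```python
# One row of weights per class, one column per feature
print(log_reg.coef_.shape)       # (3, 13)
print(log_reg.intercept_.shape)  # (3,)
```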
Once the model is trained, we can use it to make predictions on the test data and evaluate its performance. We'll use the `predict` method to make predictions and the `accuracy_score` function to calculate the accuracy of our model.
```python
from sklearn.metrics import accuracy_score

# Making predictions on the test data
y_pred = log_reg.predict(X_test)

# Calculating the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Logistic Regression model: {accuracy:.2f}")  # 0.99
```
The `accuracy_score` function compares the true labels (`y_test`) with the predicted labels (`y_pred`) and calculates the fraction of correct predictions. This gives us an idea of how well our model generalizes to new, unseen data. Accuracy is the simplest metric for evaluating predictions; we'll use it throughout this course and learn about more specialized, complex metrics in the next one.
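Because logistic regression is probabilistic at heart, you can also inspect the class probabilities themselves with `predict_proba`; the `predict` method simply picks the class with the highest probability. A short sketch (the printed values are illustrative):

```python
# Per-class probabilities for the first test sample
proba = log_reg.predict_proba(X_test[:1])
print(proba)           # e.g. [[0.01 0.98 0.01]] -- one column per class
print(proba.argmax())  # index of the most likely class
```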
Congratulations! You've just taken your first step into the world of logistic regression. Here's a quick recap of what we've covered:
- Logistic regression is used for classification tasks.
- Logistic regression works by fitting a logistic (sigmoid) function to the data.
- We loaded a dataset from Scikit-Learn.
- We split the dataset into training and testing sets.
- We initialized and trained a logistic regression model.
- We made predictions and calculated the model's accuracy.
Up next, you'll get hands-on experience with logistic regression in CodeSignal's environment. You'll practice loading data, splitting it, training a logistic regression model yourself, making predictions, and evaluating its performance. This hands-on practice will solidify your understanding and prepare you for more advanced topics. Happy coding!