Lesson 1

Logistic Regression Basics

Introduction to Logistic Regression

Logistic regression is a classification algorithm that predicts the probability that an observation belongs to a class. Unlike linear regression, which predicts a continuous number, logistic regression is used to predict a discrete outcome: yes or no, true or false, or, in data terms, class 0 or class 1.

For example, imagine you have data on whether an email is spam or not. Logistic regression can help you predict whether a new email is spam based on its content.

You'll use logistic regression when you need to classify data into categories. Real-life examples include:

  • Predicting if a student will pass or fail
  • Determining whether an email is spam
  • Diagnosing whether a patient has a certain disease

How Logistic Regression Works

Logistic regression works by fitting a logistic function (also known as the sigmoid function) to the data. The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the linear combination of input features:

$$z = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n$$

Here, $w_0$ is the intercept (bias term), and $w_1, w_2, \dots, w_n$ are the weights (coefficients) associated with the input features $x_1, x_2, \dots, x_n$.

The sigmoid output is a probability between 0 and 1. If this probability is greater than a chosen threshold (commonly 0.5), the predicted outcome is class 1 (e.g., the email is spam); otherwise, it is class 0 (e.g., the email is not spam). During training, the model adjusts the weights to minimize the log loss (cross-entropy), which penalizes predicted probabilities that are far from the actual class labels.
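
The two formulas and the threshold rule above can be sketched in a few lines of NumPy. The weights and feature values below are hypothetical numbers chosen only for illustration; they are not learned from any dataset.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and one sample, chosen only for illustration
w0 = -1.0                   # intercept (bias term)
w = np.array([2.0, -0.5])   # weights w_1, w_2
x = np.array([1.5, 2.0])    # feature values x_1, x_2

z = w0 + np.dot(w, x)       # linear combination: -1 + 3 - 1 = 1.0
probability = sigmoid(z)    # sigmoid(1.0), about 0.73
predicted_class = int(probability > 0.5)  # 1, because 0.73 > 0.5

print(f"z = {z:.2f}, probability = {probability:.2f}, class = {predicted_class}")
# z = 1.00, probability = 0.73, class = 1
```

A score of exactly 0 maps to a probability of 0.5, so with the 0.5 threshold the sign of z decides the predicted class.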

[Figure: the sigmoid function fitted to the classification data]

Example of Loading a Dataset

Before we can train a logistic regression model, we'll need some data. Scikit-Learn, a popular Python library for machine learning, provides many built-in datasets. For this lesson, we'll use the wine dataset, where the task is to predict the class of a wine from its chemical properties.

```python
from sklearn.datasets import load_wine

# Load real dataset
X, y = load_wine(return_X_y=True)
print(X.shape, y.shape)  # (178, 13) (178,)
```

Here, X holds the features (input data), and y holds the labels (output data) we want to predict. As the printed shapes show, the dataset has 178 samples with 13 features each.
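
Before training, it can help to look at what the features and labels actually contain. The sketch below calls load_wine without return_X_y, which returns an object that also carries the feature and class names; the variable name wine is just an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_wine

# Without return_X_y=True, load_wine returns a Bunch with extra metadata
wine = load_wine()
X, y = wine.data, wine.target

print(wine.feature_names[:3])  # names of the first three chemical features
print(wine.target_names)       # names of the wine classes
print(np.bincount(y))          # samples per class: [59 71 48]
```

The class counts show that the labels take three values (0, 1, and 2) and that the classes are roughly, but not perfectly, balanced.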

Splitting the Dataset

If we trained our model on all the data, we wouldn't know how well it performs on unseen data. So, we'll split the data into training and testing sets using the train_test_split function from Scikit-Learn. We'll also standardize the features with StandardScaler first, which helps the solver converge.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Standardize the features to help the solver converge
X = StandardScaler().fit_transform(X)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)  # (106, 13) (72, 13) (106,) (72,)
```

In this example:

  • test_size=0.4 means 40% of the data will be used for testing, and 60% for training.
  • random_state=42 ensures that we get the same split each time we run the code, making our results reproducible.

Splitting the data this way ensures that we can test how well our model generalizes to new data.
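
The lesson's code standardizes the full dataset before splitting, which is fine for a first example but lets the test rows influence the scaling statistics. A stricter variant, sketched below as one possible alternative rather than the lesson's method, splits first and fits the scaler on the training rows only. The stratify=y argument and the *_scaled variable names are additions for illustration; stratifying keeps the class proportions similar in both sets.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Split first, so the test rows play no part in computing the scaling statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

# Learn the mean and standard deviation from the training rows only,
# then apply the same transformation to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)  # (106, 13) (72, 13)
```

Because the split and the scaling differ from the lesson's code, a model trained on these arrays may report a slightly different accuracy.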

Training the Logistic Regression Model

Next, we'll train our model on the training data using the fit method. Training a model simply means finding the best parameters (weights) that map inputs to outputs accurately.

```python
from sklearn.linear_model import LogisticRegression

# Training the logistic regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
```

The max_iter parameter sets the maximum number of iterations the solver may perform while optimizing the weights. If the solver reaches this limit before converging, Scikit-Learn emits a ConvergenceWarning, and the resulting coefficients may not be optimal. The default is 100; in practice, it's often helpful to start there and raise the value (or scale the features) if the warning appears.
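
To see the effect of max_iter, the sketch below deliberately makes convergence hard: it leaves the wine features unscaled and allows only 5 iterations. These settings are chosen purely for demonstration, and the names caught and hit_limit are illustrative.

```python
import warnings

from sklearn.datasets import load_wine
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)  # deliberately left unscaled

# Record warnings instead of printing them, so we can inspect them afterwards
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model = LogisticRegression(max_iter=5).fit(X, y)

# True if the solver stopped at the iteration limit before converging
hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)

print("Iterations used:", model.n_iter_)
print("Stopped before converging:", hit_limit)
```

The n_iter_ attribute reports how many iterations the solver actually ran, which is a quick way to check whether a chosen max_iter left any headroom.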

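During training, the model learns to distinguish between the three wine classes, labeled 0, 1, and 2. Although the introduction above describes the two-class case, Scikit-Learn's LogisticRegression handles more than two classes automatically.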
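
After fitting, the learned parameters can be inspected directly. The sketch below repeats the lesson's preparation steps so that it runs on its own, then prints the shapes of the fitted attributes.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same preparation as in the lesson: load, scale, split
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(log_reg.classes_)          # class labels seen during training: [0 1 2]
print(log_reg.coef_.shape)       # one row of 13 weights per class: (3, 13)
print(log_reg.intercept_.shape)  # one intercept per class: (3,)
```

Each row of coef_ plays the role of the weights w_1 through w_13 for one class, and the matching entry of intercept_ plays the role of w_0.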

Making Predictions and Calculating Accuracy

Once the model is trained, we can use it to make predictions on the test data and evaluate its performance. We'll use the predict method to make predictions and the accuracy_score function to calculate the accuracy of our model.

```python
from sklearn.metrics import accuracy_score

# Making predictions on the test data
y_pred = log_reg.predict(X_test)

# Calculating the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Logistic Regression model: {accuracy:.2f}")  # 0.99
```

The accuracy_score function compares the true labels (y_test) with the predicted labels (y_pred) and calculates the fraction of correct predictions. This gives us an idea of how well our model generalizes to new, unseen data. Accuracy is the simplest metric for evaluating predictions. We will use it in this course and learn about more specific, complex metrics in the next one.
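
To connect the predicted labels back to probabilities, the sketch below uses predict_proba, which returns one probability per class for each sample, and recomputes accuracy by hand as the fraction of matching labels. It repeats the lesson's setup so that it runs on its own; manual_accuracy and proba are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same preparation and training as in the lesson
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
proba = log_reg.predict_proba(X_test)  # shape (72, 3); each row sums to 1

# Accuracy is simply the fraction of predictions that match the true labels
manual_accuracy = np.mean(y_pred == y_test)

print("Class probabilities for the first test sample:", proba[0].round(3))
print("Predicted class for the first test sample:", y_pred[0])
print(f"Manual accuracy: {manual_accuracy:.2f}")
print(f"accuracy_score:  {accuracy_score(y_test, y_pred):.2f}")
```

The predicted label for each sample is the class with the highest probability, so predict can be thought of as predict_proba followed by picking the largest entry in each row.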

Lesson Summary

Congratulations! You've just taken your first step into the world of logistic regression. Here's a quick recap of what we've covered:

  • Logistic regression is used for classification tasks.
  • Logistic regression works by fitting a logistic (sigmoid) function to the data.
  • We loaded a dataset from Scikit-Learn.
  • We split the dataset into training and testing sets.
  • We initialized and trained a logistic regression model.
  • We made predictions and calculated the model's accuracy.

Up next, you'll get hands-on experience with logistic regression in CodeSignal's environment. You'll practice loading data, splitting it, training a logistic regression model yourself, making predictions, and evaluating its performance. This hands-on practice will solidify your understanding and prepare you for more advanced topics. Happy coding!
