Lesson 4
Understanding Logistic Regression and Its Implementation Using Gradient Descent
Introduction

Welcome to our new lesson on Logistic Regression and its implementation using the Gradient Descent technique. Having familiarized yourself with the fundamentals of Regression Analysis and the operation of Gradient Descent in optimizing regression models, we'll now address a different kind of problem: Classification. While Regression Analysis is suitable for predicting continuous variables, when predicting categories such as whether an email is spam or not, we need specially designed tools — one of them being Logistic Regression.

In this lesson, we'll guide you through the basic concepts that define Logistic Regression, focusing on its unique components like the Sigmoid function and Log-Likelihood. Eventually, we'll utilize Python to engineer a straightforward Logistic Regression model using Gradient Descent. By the end of this lesson, you will have broadened your theoretical understanding of another vital machine learning concept and enhanced your practical Python coding skills.

Classification: From Linear Regression to Logistic Regression

So far, we've dealt with tasks where a continuous output needs prediction based on one or more input variables - these tasks are known as regression tasks. There is, however, another category of tasks known as classification tasks, where the objective is to predict a categorical outcome. These categories are often binary, like "spam"/"not spam" for an email or "malignant"/"benign" for a tumor. The models we've studied so far are not optimal for predicting categorical outcomes - for example, it isn't intuitive to understand what it means for an email to be "0.67" spam. Enter Logistic Regression - a classification algorithm that can predict the probability of a binary outcome.

Sigmoid Function: the Heart of Logistic Regression

While Linear Regression makes predictions by directly calculating the output, Logistic Regression does it differently. Instead of directly predicting the output, Logistic Regression calculates a raw model output, then transforms it using the sigmoid function, mapping it to a range between 0 and 1, thus making it a probability.

The Sigmoid function is defined as S(x) = \frac{1}{1+e^{-x}}

We can implement it like this:

Python
import numpy as np

def sigmoid(z):
    # Map any real-valued input to the (0, 1) range
    return 1 / (1 + np.exp(-z))

Plotted, the Sigmoid function traces a smooth S-shaped curve between 0 and 1: for a large positive input, S(x) is close to 1, and for a large negative input, it is close to 0. This behavior makes the Sigmoid function a perfect fit when we want to classify emails into two categories: "spam" or "not-spam".
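To see this numerically, here is a quick check using the sigmoid function defined above (the printed values are approximate):

Python
print(sigmoid(10))   # ≈ 0.99995: a large positive input maps close to 1
print(sigmoid(0))    # 0.5: zero maps to the midpoint between the classes
print(sigmoid(-10))  # ≈ 0.0000454: a large negative input maps close to 0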

Understanding Logistic Regression

The mathematical form of Logistic Regression can be expressed as follows:

P(Y=1 \mid x) = \frac{1}{1+e^{-(β_0+β_1x)}}

Where:

  • P(Y=1 \mid x) is the probability of the event Y=1 given x.
  • β_0 and β_1 are the parameters of the model.
  • x is the input variable.
  • β_0 + β_1x is the linear combination of parameters and feature(s).
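To make this concrete, here is a minimal sketch that evaluates this formula for a single input. The parameter values β_0 = -3 and β_1 = 0.5 are purely illustrative, not fitted to any data:

Python
import numpy as np

# Hypothetical, illustrative parameter values
beta_0, beta_1 = -3.0, 0.5
x = 8.0  # a single input value

# P(Y=1 | x) is the sigmoid of the linear combination beta_0 + beta_1 * x
p = 1 / (1 + np.exp(-(beta_0 + beta_1 * x)))
print(p)  # ≈ 0.731, i.e. about a 73% predicted probability that Y = 1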

Log-Likelihood in Logistic Regression plays a role similar to that of the Least Squares method in Linear Regression. Maximum likelihood estimation chooses the parameter values that maximize the likelihood of the observations we collected; in Logistic Regression, we work with the logarithm of that likelihood, the log-likelihood, which is easier to handle numerically and, because the logarithm is monotonic, leads to the same optimum.
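Written out for a dataset of n observations, the log-likelihood being maximized takes the standard form:

\ell(β_0, β_1) = \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]

where \hat{p}_i is the predicted probability for the i-th observation. The cost function introduced next is simply the negative of this quantity, averaged over the data, so maximizing the log-likelihood and minimizing the cost are the same task.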

The Cost Function in Logistic Regression

We've seen the least squares cost function in Linear Regression. However, in Logistic Regression, the cost function is defined differently.

The cost function for a single training instance can be expressed as:

-[y \log(\hat{p}) + (1-y) \log(1-\hat{p})]

Where \hat{p} denotes the predicted probability.

We can implement it like this:

Python
def cost_function(h, y):
    # Average cross-entropy cost over all training instances
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

When plotted, the cost curve shows why this function makes sense: -\log(t) approaches 0 as t approaches 1, so the cost is close to 0 when the predicted probability is near the actual target. Conversely, the cost grows towards infinity as t approaches 0, which means that predicting a probability close to 0 for a positive instance is heavily penalized. This shape of the cost function also raises the question of threshold selection. You might wonder why we often assign a probability above 0.5 to Category 1 and below 0.5 to Category 0; this is simply a convention for binary classification and can be adjusted based on the problem at hand.
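To see this penalization in action, here is a small sketch that evaluates cost_function (defined above) for a single positive instance (y = 1) at different predicted probabilities; the values in the comments are approximate:

Python
y = np.array([1.0])  # a single positive instance

print(cost_function(np.array([0.9]), y))   # ≈ 0.105: confident and correct, low cost
print(cost_function(np.array([0.5]), y))   # ≈ 0.693: uncertain prediction, moderate cost
print(cost_function(np.array([0.01]), y))  # ≈ 4.605: confident but wrong, heavily penalized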

Implementing Logistic Regression with Gradient Descent

As we already know, Gradient Descent is an efficient technique for finding the minimum of a function, and because the Logistic Regression cost function is convex, the minimum it finds is the global one. We use Gradient Descent to find the parameter values that result in the smallest cost. Here's a simple Python implementation of a Logistic Regression model:

Python
def logistic_regression(X, y, num_iterations, learning_rate):
    # Add intercept to X
    intercept = np.ones((X.shape[0], 1))
    X = np.concatenate((intercept, X), axis=1)

    # Weights initialization
    theta = np.zeros(X.shape[1])

    for i in range(num_iterations):
        z = np.dot(X, theta)
        h = sigmoid(z)
        gradient = np.dot(X.T, (h - y)) / y.size
        theta -= learning_rate * gradient

        z = np.dot(X, theta)
        h = sigmoid(z)
        loss = cost_function(h, y)

        if i % 10000 == 0:
            print(f'Loss: {loss}\t')

    return theta

In this code:

  • The sigmoid() function computes the sigmoid of the input value.
  • The cost_function() computes the cost for given inputs and outputs using the weights.
  • The logistic_regression() applies Gradient Descent to Logistic Regression to find the optimum weights for minimizing the cost.
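The key step inside the loop is gradient = np.dot(X.T, (h - y)) / y.size. For m training examples, the gradient of the cost with respect to the parameters is \frac{1}{m} X^T (h - y), where h is the vector of predicted probabilities, so subtracting learning_rate * gradient from theta moves the parameters in the direction that decreases the cost.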

This simple function can serve as a Logistic Regression model for classifying emails as "spam" or "not-spam".

Applying Logistic Regression with Gradient Descent

Now we can define the prediction functions: predict_prob returns the predicted probability, and predict converts that probability into a class label using a threshold:

Python
def predict_prob(X, theta):
    # Add intercept to X
    intercept = np.ones((X.shape[0], 1))
    X = np.concatenate((intercept, X), axis=1)
    return sigmoid(np.dot(X, theta))

def predict(X, theta, threshold=0.5):
    return predict_prob(X, theta) >= threshold
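Putting the pieces together, here is an end-to-end sketch on a tiny made-up dataset. The feature values, labels, and hyperparameters below are purely illustrative and are not taken from any real spam data:

Python
import numpy as np

# Tiny illustrative dataset: one feature per email and binary labels (1 = spam)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Train with illustrative hyperparameters; the loss is printed every 10000 iterations
theta = logistic_regression(X, y, num_iterations=50000, learning_rate=0.1)

print(theta)                                   # learned [intercept, weight]
print(predict_prob(np.array([[2.0]]), theta))  # predicted probability of class 1 for a new input
print(predict(np.array([[4.5]]), theta))       # class label at the default 0.5 threshold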
Lesson Summary and Practice

That wraps up our lesson on the fundamentals of Logistic Regression and its Python implementation using Gradient Descent. Throughout this lesson, we've highlighted the differences between regression and classification tasks, introduced Logistic Regression as a classification algorithm, and elaborated on the components that define it.

You'll have ample opportunities to refine these skills in our forthcoming practice exercises. Remember, the more you practice, the more fluent you'll become. So, practice away and have fun doing it!
