Welcome back, learners! Having grasped the subtleties of the Wine Quality Dataset and understood the implementation of the Linear Regression Model, we are now embarking on our journey through the Logistic Regression Model. A key player in the machine learning universe, Logistic Regression is indispensable in supervised learning problems, particularly binary classification.
As you may recall from prior lessons, Linear Regression is effective for regression problems; for classification problems, however, Logistic Regression takes the spotlight. We'll understand why as we predict the binary outcome of wine quality - either good or not good - using our Wine Quality Dataset and its physicochemical properties. Let's delve into the concept of Logistic Regression, breaking down its theory, internal mechanisms, design, and implementation.
Contrary to its name, Logistic Regression is a classification algorithm used to estimate the probabilities of a binary response based on one or more predictor (also known as independent) variables. It is particularly beneficial for binary outcomes, meaning situations with only two possible results.
Now, let's bring this concept to life by relating it to our Wine Dataset. Our goal is to predict wine quality, which, as you may remember, ranges from 0 to 10. To keep things simple and focus on a binary classification problem, let's classify the wines as good (a quality rating of 7 or above) and not good (a quality rating below 7). Therefore, we will be using Logistic Regression to predict whether the quality of a specific type of wine is 'good' or 'not good' based on its physicochemical features.
In Logistic Regression, all of this is achieved by using a logistic function, which limits the unbounded outcome of the linear equation to a number between 0 and 1. Also known as the Sigmoid function, this logistic function is an S-shaped curve that maps any real-valued number into a value falling within these bounds. The function is defined as follows:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

In this equation, $z$ represents the output of a linear combination of feature values and their corresponding coefficients:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

In this equation:

- The $\beta$ (Beta) terms are the model's parameters, signifying the influence of each input feature (denoted by $x$) on the predicted outcome.
- The $x$ terms represent the independent predictor variables.
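To make the mapping concrete, here is a minimal NumPy sketch of the Sigmoid applied to a linear combination; the coefficients and feature values are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued number into the (0, 1) interval."""
    return 1 / (1 + np.exp(-z))

# Hypothetical coefficients (beta_1..beta_3) and feature values, for illustration only
beta = np.array([0.5, -1.2, 2.0])
intercept = -0.3                     # beta_0
x = np.array([1.1, 0.4, 0.7])

z = intercept + beta @ x             # linear combination of features and coefficients
print(sigmoid(z))                    # a probability between 0 and 1
```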
To understand the intricacies of Logistic Regression, we need to unpack the mathematical marvel that it is. The Logistic or Sigmoid function forms the backbone of Logistic Regression. Once we compute the predicted probability $\hat{p}$ using the Sigmoid function, we can assign classes by defining a threshold (which is generally 0.5):

- If $\hat{p} \geq 0.5$, the label for the example is `1` (or `Good` in our case).
- If $\hat{p} < 0.5$, the label for the example is `0` (or `Not Good` in our case).
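In code, this classification step is a single comparison against the threshold; a tiny self-contained sketch, with the probability value chosen for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

p_hat = sigmoid(1.17)                 # example linear-combination output
label = 1 if p_hat >= 0.5 else 0      # 1 = Good, 0 = Not Good
print(round(p_hat, 3), label)         # 0.763 1
```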
The next critical component in Logistic Regression is the cost function. Unlike in Linear Regression, we can't use Mean Squared Error as the cost function, because the logistic function would introduce a non-linear term, making the cost function non-convex. In Logistic Regression, the cost function is defined as:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

Where:

- $\theta$ represents the parameters we must determine using an optimization algorithm to minimize the cost function.
- $m$ is the number of samples.
- $y^{(i)}$ and $x^{(i)}$ represent the target and input of each sample, respectively.
- $h_\theta(x^{(i)})$ is the logistic function that computes the predicted probability that $y^{(i)} = 1$.
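To make the formula concrete, here is a minimal NumPy sketch of this cross-entropy cost, assuming `y_true` holds binary targets and `p_pred` holds the predicted probabilities $h_\theta(x^{(i)})$ (the sample values are made up):

```python
import numpy as np

def cross_entropy_cost(y_true, p_pred):
    """Binary cross-entropy: the cost minimized in Logistic Regression."""
    m = len(y_true)
    return -(1 / m) * np.sum(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.7, 0.6])
print(cross_entropy_cost(y_true, p_pred))  # low when predictions match targets
```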
While discussing the cost function, it's also crucial to consider optimization algorithms like Gradient Descent, which are used to find the parameters that minimize this cost.
Disclaimer: in most scenarios, you don't have to remember and implement the cost function yourself, as there are plenty of libraries (e.g., `scikit-learn`) that provide a built-in implementation of Logistic Regression. However, it's still essential to understand the high-level concepts and what's being optimized.
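Still, seeing one update helps build that intuition. Below is a minimal NumPy sketch of a single Gradient Descent step on the cross-entropy cost; the learning rate and toy data are placeholders, not a production implementation:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: 4 samples, 2 features (placeholder values)
X = np.array([[0.5, 1.2], [1.0, 0.3], [0.2, 0.8], [1.5, 1.1]])
y = np.array([1, 0, 0, 1])

theta = np.zeros(X.shape[1])      # parameters to learn
learning_rate = 0.1

# One Gradient Descent step: theta := theta - lr * gradient of the cost
p = sigmoid(X @ theta)            # current predicted probabilities
gradient = X.T @ (p - y) / len(y) # gradient of the cross-entropy cost
theta -= learning_rate * gradient
print(theta)
```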
Before we dive deeper, let's discuss the underlying assumptions that guide a Logistic Regression Model. These assumptions serve as rules of thumb when modeling Logistic Regression:
- Each observation is independent of others: This means the outcome or probability of success ($p$ in our logistic function) for one example neither influences nor is influenced by the outcomes of other examples.
- There is no multicollinearity among explanatory variables: In simple terms, the input variables should not be too highly correlated with each other. Strong correlation implies that they carry similar information to the model, which is redundant.
- The input variables have a linear relationship with the log odds: Although the outcome in logistic regression is a binary variable, logistic regression stipulates that the input variables are linearly related to the log odds, $\log\left(\frac{p}{1-p}\right)$, and hence to the logit of the probability $p$.
Violating these assumptions may result in inaccurate models and misinterpretations. Therefore, validating these assumptions while modeling Logistic Regression is essential.
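As a quick, informal check of the multicollinearity assumption, we can inspect pairwise feature correlations. This sketch assumes `X` is the feature DataFrame we build in the implementation below, and the 0.8 cutoff is an illustrative rule of thumb rather than a hard rule:

```python
# Inspect pairwise correlations among the physicochemical features;
# values close to +1 or -1 hint at multicollinearity worth investigating
corr = X.corr()
mask = (corr.abs() > 0.8) & (corr.abs() < 1.0)  # exclude the diagonal (always 1.0)
print(corr.where(mask).stack())                 # feature pairs with |correlation| > 0.8
```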
Let's transition from understanding the Logistic Regression concept to its design and implementation. Using Python and `scikit-learn`, we'll see how to design a Logistic Regression model:
- Specify the hypothesis or function the model should learn. In Logistic Regression, this is the Sigmoid function.
- Define an error, cost, or loss function we aim to minimize. For Logistic Regression, the cost function is the Cross-Entropy Loss.
- Define a learning algorithm that optimizes the parameters of the hypothesis to fit the model to the training data. In our case, it's the Gradient Descent algorithm.

Let's look at a quick implementation using `scikit-learn`:
```python
# Import necessary libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the wine dataset
import datasets
import pandas as pd
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)

# Convert the multi-class problem to a binary one
red_wine['quality'] = red_wine['quality'].apply(lambda x: 1 if x >= 7 else 0)

# Split the dataset into features and target variable
X = red_wine.drop('quality', axis=1)
y = red_wine['quality']

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create a Logistic Regression object
lr = LogisticRegression()

# Train the model using the training sets
lr.fit(X_train, y_train)

# Print the learned parameters
print(lr.coef_, lr.intercept_)
"""
[[-0.02641816 -3.24280912 -0.04024957  0.07795443 -1.26020881  0.02151089
  -0.01866486 -1.04040183 -2.50766981  2.00156001  0.9266963 ]]
[-1.77875604]
"""
```
In the script above, we create a `LogisticRegression` object and use the `fit` function to train it on the training sets, `X_train` and `y_train`. The learned parameters of the Logistic function can be printed as shown in the last line. The `coef_` attribute gives the coefficients for the different features (the $\beta_1, \dots, \beta_n$ terms), while `intercept_` provides the intercept term ($\beta_0$).
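To make the coefficients easier to read, we can pair each one with its feature name; a small follow-up to the script above:

```python
import pandas as pd

# Pair each learned coefficient with its feature name for readability
coef_table = pd.Series(lr.coef_[0], index=X.columns).sort_values()
print(coef_table)
```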
With the fundamentals and design in place, let's dive into an example to see how the Logistic Regression model predicts the wine quality for our test dataset, and evaluate its performance:
```python
# Make predictions on the test dataset
y_pred = lr.predict(X_test)

# Import metrics module for accuracy calculation
from sklearn import metrics

# Model accuracy
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))
# Accuracy: 0.8875
```
In the above code block, we use the `predict` function to predict the wine quality for the test dataset. The `accuracy_score()` function from the `metrics` module of the `sklearn` library is used to calculate the accuracy of our logistic regression model. The function takes in the actual and predicted wine qualities and returns the proportion of correct predictions.
Now that we have our trained Logistic Regression model, we might wonder how to interpret its output. The output of the model includes the coefficients (also known as weights) of each feature and a bias (also known as the intercept). Each coefficient represents the change in the log odds of the outcome for a one-unit change in the corresponding feature.
For example, if the coefficient of a feature, say `pH`, is 0.5, it indicates that for each unit increase in `pH`, keeping other features constant, the odds of our outcome (whether the wine quality is good) would increase by a factor of $e^{0.5} \approx 1.65$.
Evaluating the performance of a model is crucial to assess its usability and reliability. We evaluate our Logistic Regression model's performance using several important metrics. Let's define a few key metrics:
- Confusion Matrix: This table describes the performance of a classification model, representing actual and predicted classifications in terms of true positives, false positives, true negatives, and false negatives.
- Accuracy: The ratio of correctly predicted observations to total observations. `Accuracy = (True Positives + True Negatives) / Total Observations`.
- Precision: The ratio of correctly predicted positive observations to the total predicted positives. `Precision = True Positives / (True Positives + False Positives)`.
- Recall (Sensitivity): The ratio of correctly predicted positive observations to all observations in the actual positive class. `Recall = True Positives / (True Positives + False Negatives)`.
- F1 Score: The harmonic mean of Precision and Recall. `F1 Score = 2 * Precision * Recall / (Precision + Recall)`.
- ROC-AUC: The area under the Receiver Operating Characteristic curve. It indicates how well the model can distinguish between classes.
Here is how these metrics can be calculated using sklearn:
```python
from sklearn import metrics

# Model Accuracy
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))
# Accuracy: 0.8875

# Model Precision
print("Precision: ", metrics.precision_score(y_test, y_pred))
# Precision: 0.5172413793103449

# Model Recall
print("Recall: ", metrics.recall_score(y_test, y_pred))
# Recall: 0.2727272727272727

# Model F1-Score
print("F1 Score: ", metrics.f1_score(y_test, y_pred))
# F1 Score: 0.3571428571428571

# Model AUC
print("AUC: ", metrics.roc_auc_score(y_test, y_pred))
# AUC: 0.6198930481283422
```
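The confusion matrix itself isn't printed in the block above; here is a short sketch that computes it, along with AUC from predicted probabilities, which is often preferred over hard labels:

```python
from sklearn import metrics

# Confusion matrix: rows are actual classes, columns are predicted classes
print(metrics.confusion_matrix(y_test, y_pred))

# AUC computed from predicted probabilities of the positive class
y_proba = lr.predict_proba(X_test)[:, 1]
print("AUC (from probabilities): ", metrics.roc_auc_score(y_test, y_proba))
```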
The `accuracy_score()` function calculates the model's accuracy by comparing the actual output in `y_test` with the predicted output in `y_pred`. Similarly, the `precision_score()`, `recall_score()`, `f1_score()`, and `roc_auc_score()` functions calculate their respective metrics.
That's a wrap for the Logistic Regression Model! We've explored the landscape of Logistic Regression, unpacked its internals, understood the designing process, and implemented it on our Wine Quality Dataset to predict wine quality. You've successfully navigated through the critical components, theoretical aspects, and practical application of Logistic Regression. Well done!
We've also dissected the performance evaluation process of our model in detail. This process is instrumental in assessing and enhancing the model's fit and precision. Model performance reveals strengths and areas for improvement, enabling you to create a reliable and efficient model.
Now that you have gathered knowledge, it's time to start applying it! Gear up for some hands-on practice exercises where you can apply your understanding of Logistic Regression and solidify your learning. These exercises will fortify your knowledge and help you understand the nuances of Logistic Regression and its Python implementation. Ready, set, go!