Logistic Regression with the Iris Dataset

Lesson 3

Lesson Overview

Welcome back! In this session, we're going to delve into the realm of Logistic Regression. Our primary materials for this exercise are Python, the Scikit-learn library, and the renowned Iris dataset. The end goal is the mastery of the theory and practical implementation of logistic regression for classification tasks and a solid comprehension of its utility and applications.

Understanding Logistic Regression

Logistic Regression is your go-to statistical model for binary classification tasks. For example, if you need to classify objects, such as distinguishing an apple from an orange based on features like color or size, logistic regression can do the job. Don't let the term 'regression' in its name mislead you, though. Unlike linear regression that predicts a continuous outcome, logistic regression works in the realm of probabilities, making it superb for dichotomous outcomes, like yes/no or true/false decisions.

The magic behind logistic regression is the sigmoid function, also known as the logistic function. It's a mathematical function that takes any real-valued number and squeezes it into a range between 0 and 1. In mathematical terms, the sigmoid function is defined as:

$S(x) = \frac{1}{1 + e^{-x}}$

Where:

$S(x)$ is the output between 0 and 1 (the probability estimate),
$e$ is the base of natural logarithms,
And $x$ is the input to the function (your data).

With the sigmoid function, we can convert linear regression outputs into probabilities that a certain instance belongs to a particular class. In logistic regression, the probabilities determining class membership are modeled as a function of predictor variables. This model then predicts the probability that a data point falls into one of the two categories being studied. When we apply this to the Iris dataset, the model would estimate the probability of a flower being of a certain Iris species based on its features, such as petal and sepal measurements.

Imagine the process like flipping a coin with a twist—the logistic model informs us with a certain probability whether we should expect heads or tails, based on the coin's features and past coin flips. The outcome isn't just a guess; it's a statistically informed prediction based on the data at hand.

Data Preprocessing

Before jumping into the modeling process, we need to ensure our dataset is structured and clean. We'll be using the Iris dataset, composed of measurements for numerous flowers from each of three species of Iris: Iris setosa, Iris virginica, and Iris versicolor. Let's load and inspect it:

Python
1# Importing the iris dataset from sklearn
2from sklearn.datasets import load_iris
3
4# Loading the dataset
5iris = load_iris()
6
7# Printing a description
8print(iris.DESCR)

The output of the code will provide a full description of the Iris dataset, including its features, number of instances, and summary statistics.

Once loaded and understood, we need to partition the data into a training set, which the model will learn from, and a test set, used to evaluate the model's accuracy:

Python
1from sklearn.model_selection import train_test_split
2
3# Splitting the data into a training set and a test set 
4X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

Implementing Logistic Regression

With data in place, it's time to initialize a logistic regression model. Scikit-learn simplifies this process, providing robust pseudo-English ways to implement most machine learning models, including logistic regression:

Python
1from sklearn.linear_model import LogisticRegression
2
3# Instantiating a Logistic Regression model
4lr_model = LogisticRegression(solver='liblinear', multi_class='ovr')
5
6# Fitting the model with our training data
7lr_model.fit(X_train, y_train)

This process creates a logistic regression model and trains it with our features (X_train) and our target variable (y_train).

Interpreting Model Outcomes

Once fitted, we can use the model to predict the outcome for our test data:

Python
1# Generating predictions
2predictions = lr_model.predict(X_test)

The predict() function returns the predicted classes according to the fitted model. By comparing the predicted output to the actual values, we can understand how well the model performed.

Python
1# Comparing predictions with actual values
2print("Predictions: ", predictions)
3print("Actual: ", y_test)

output:


1Predictions:  [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
2Actual:  [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]

The output of the code consists of arrays of predicted and actual target values for comparison.

Evaluating the Logistic Regression Model

After training the Logistic Regression model, it's important to evaluate its performance to understand how well it's likely to perform on unseen data. One straightforward method for this is using the score method provided by Scikit-learn. This method returns the mean accuracy on the given test data and labels - in simpler terms, it tells you the proportion of correctly predicted instances.

Here's how you can use it:

Python
1# Evaluating the model on the training set
2train_accuracy = lr_model.score(X_train, y_train)
3print("Training Accuracy:", train_accuracy)
4
5# Evaluating the model on the test set
6test_accuracy = lr_model.score(X_test, y_test)
7print("Test Accuracy:", test_accuracy)

output:

Python
1Training Accuracy: 0.9583333333333334
2Test Accuracy: 1.0

Interpreting the Results:

Training Accuracy: This is the accuracy of the model on the data it was trained on. A high training accuracy typically indicates that the model has learned the training data well.
Test Accuracy: This is the accuracy of the model on the test data, which it has not seen during training. This is a more robust indicator of how well the model will perform on new, unseen data.

It's crucial to compare these two metrics. If your training accuracy is much higher than your test accuracy, it might be a sign that the model is overfitting to the training data. Conversely, if both accuracies are low, the model might be under-fitting.

Limitations and Caveats of Logistic Regression

While logistic regression is commonly used and easy to comprehend, it does have a few limitations. Primarily, it is best suited for binary classification problems. For instance, predicting whether an email is spam or not is a binary (two-class) problem. While logistic regression can be used for multiple categories with a bit of modification (like what we did in this lesson with the "ovr" strategy: one vs. rest), it is important to remember that it isn't the default use of logistic regression. A key assumption with logistic regression is that observations should be independent of each other. Therefore, repeated measurements or matched data should not be modeled with logistic regression.

Lesson Summary & Practice

Well done! You've journeyed through the theory of logistic regression, applied it to a real-world dataset, and interpreted the prediction outcome. You now understand the limitations and special conditions for using logistic regression models. In upcoming practice exercises, you'll gain more hands-on experience with logistic regression and other datasets. Practicing is the most reliable way to cement your understanding of these concepts, so continue experimenting with data and models. An exciting learning journey lies ahead!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.