Making Predictions and Evaluating Model Performance

Lesson 3

Welcome back! As we continue our journey through PySpark MLlib, we've reached an exciting stage where your logistic regression model, trained on the Iris dataset, will now be put to the test. In this lesson, we focus on generating predictions using your model and understanding how well it performs. Predicting outcomes is a cornerstone of machine learning, and evaluating model performance with reliable metrics is crucial for making informed decisions. By the end of this lesson, you will be adept at deriving predictions from your models and evaluating their effectiveness using accuracy metrics.

Preparing the Model

Before diving into predictions, let's make sure your Spark environment is ready. We covered this setup in previous lessons, but it is worth repeating: a consistent environment makes your results easier to reproduce. We will reuse the preprocessed data and the logistic regression setup from the last lesson, fitting the model on the training data again.

Recall how we initialized the Spark session and had the train and test datasets ready:

Python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from preprocess_data import preprocess_data

# Initialize a Spark session
spark = SparkSession.builder.appName("ModelEvaluation").getOrCreate()

# Preprocess the dataset
train_data, test_data = preprocess_data(spark, "iris.csv")

# Initialize the logistic regression model with specified feature and label columns
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Fit the logistic regression model to the training data
lr_model = lr.fit(train_data)

This snippet recreates the setup we need before proceeding to predictions and evaluation.

Generating Predictions with the Trained Model

With your logistic regression model trained, predicting outcomes on unseen data is the next logical step. Utilizing the transform method on our test dataset allows the model to generate predictions based on the learned patterns.

Python
# Make predictions on the test set
predictions = lr_model.transform(test_data)

# Display the first 5 rows of the predictions DataFrame
predictions.show(5)

When you run this code, you'll observe an output similar to the following, which showcases sample predictions, including features, true labels, raw predictions, probabilities, and final predicted labels:

Plain text
+-----------------+-----+--------------------+--------------------+----------+
|         features|label|       rawPrediction|         probability|prediction|
+-----------------+-----+--------------------+--------------------+----------+
|[4.4,3.0,1.3,0.2]|  0.0|[54.8316681361287...|[1.0,5.8046421116...|       0.0|
|[4.6,3.2,1.4,0.2]|  0.0|[59.0189626418402...|[1.0,1.7180155135...|       0.0|
|[4.6,3.6,1.0,0.2]|  0.0|[76.5904949565625...|[1.0,1.3749985323...|       0.0|
|[4.8,3.1,1.6,0.2]|  0.0|[52.2432044937557...|[1.0,1.6438342987...|       0.0|
|[4.9,3.1,1.5,0.1]|  0.0|[55.1240971942048...|[1.0,8.6038688129...|       0.0|
+-----------------+-----+--------------------+--------------------+----------+

This output provides a comprehensive view of the predicted data:

features: The input features used for prediction, corresponding to each sample in the test dataset.
label: The actual class label for each sample, serving as a point of reference for evaluating prediction accuracy.
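rawPrediction: The raw per-class scores produced by the logistic regression model before they are converted into class probabilities. A larger score means the model leans more strongly toward that class.
probability: The class probabilities derived from the raw scores. For a binary problem this is the logistic function; for a multiclass problem such as three-class Iris it is the softmax function, its multiclass generalization. Each entry is the estimated likelihood of one class for the given sample.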
prediction: The predicted class label for each sample, determined by selecting the class with the highest probability score.

By reviewing these columns, you can better understand how your model arrives at its predictions and compare them against actual class labels to assess performance.
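To make the relationship between these columns concrete, here is a minimal plain-Python sketch of how raw per-class scores can be turned into probabilities with softmax and then into a predicted label by picking the most probable class. It is an illustration of the idea, not Spark's actual implementation, and the raw scores are hypothetical values chosen for the example rather than taken from the output above.

```python
import math


def softmax(raw_scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    # Subtract the maximum score first so that exp() cannot overflow.
    max_score = max(raw_scores)
    exps = [math.exp(s - max_score) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(probabilities):
    """Return the index of the most probable class as a float,
    mirroring the format of the prediction column."""
    return float(probabilities.index(max(probabilities)))


# Hypothetical raw scores for one sample with three classes.
raw_prediction = [4.0, 1.0, -2.0]
probability = softmax(raw_prediction)
prediction = predict_class(probability)

print("probability:", [round(p, 4) for p in probability])
print("prediction:", prediction)
```

Because the first score is the largest, the first class receives the highest probability and the predicted label is 0.0. The very large raw scores in the table above explain why the first probability there is displayed as 1.0.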

Evaluating Model Performance with MulticlassClassificationEvaluator

After generating predictions, it's important to evaluate how well our model performs. PySpark provides a useful tool called MulticlassClassificationEvaluator, which helps us measure various aspects of a model's performance. In this lesson, we will focus on accuracy, a key metric that gives the proportion of correct predictions out of all predictions made.

Here’s how you use it in your code:

Python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Initialize the evaluator with the desired settings
evaluator = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)

# Compute the accuracy of the model on the test data
accuracy = evaluator.evaluate(predictions)

# Output the calculated accuracy of the model
print("Model Accuracy:", accuracy)

Here, labelCol specifies the column with the actual labels, predictionCol is the column with the predictions, and metricName="accuracy" tells the evaluator to calculate the accuracy metric.
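If you want to see what the accuracy metric actually measures, the following plain-Python sketch computes it by hand: the number of samples where the prediction matches the label, divided by the total number of samples. The function name and the sample values are hypothetical and only serve the illustration; they are not part of PySpark.

```python
def accuracy_score(labels, predictions):
    """Fraction of positions where the predicted label equals the true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical labels and predictions for six samples (illustrative only).
true_labels = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
predicted = [0.0, 0.0, 1.0, 2.0, 2.0, 2.0]

# Five of the six predictions match, so the result is 5 / 6.
print("Model Accuracy:", accuracy_score(true_labels, predicted))
```

On a Spark DataFrame, one way to sanity-check the evaluator's result along the same lines would be `predictions.filter("label = prediction").count() / predictions.count()`.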

When you run this code, you might see an output like this:

Plain text
Model Accuracy: 1.0

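This indicates that the model predicted every label in the test set correctly, an accuracy of 100%. Iris is a small dataset with well-separated classes, so a perfect score on a small test split is not unusual, and your exact value may differ depending on how the data was split.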
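A single accuracy number does not show which classes a model confuses when it does make mistakes. As a rough sketch of how you might look closer, the plain-Python example below tallies how often each (label, prediction) pair occurs. The helper name and the sample values are hypothetical and chosen so that one misclassification is visible.

```python
from collections import Counter


def confusion_counts(labels, predictions):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(labels, predictions))


# Hypothetical labels and predictions for six samples (illustrative only).
true_labels = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
predicted = [0.0, 0.0, 1.0, 2.0, 2.0, 2.0]

counts = confusion_counts(true_labels, predicted)
for (label, prediction), n in sorted(counts.items()):
    marker = "" if label == prediction else "  <- misclassified"
    print(f"label={label} prediction={prediction} count={n}{marker}")
```

On the predictions DataFrame from this lesson, a comparable breakdown could be obtained with `predictions.groupBy("label", "prediction").count().show()`.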

Summary and Preparation for Practice Exercises

In this lesson, you gained the ability to make predictions using a trained logistic regression model and evaluate its performance using a fundamental accuracy metric. These skills are pivotal for dealing with real-world datasets, where prediction accuracy can significantly impact outcomes and solutions.

Now, I encourage you to dive into the practice exercises that follow this lesson. These exercises will solidify your understanding and help you experiment with different configurations. As we progress, mastering these skills will expand your proficiency in PySpark's MLlib and broaden your machine learning capabilities. Keep up the great work!
