Welcome back! As we continue our journey through PySpark MLlib, we've reached an exciting stage where your logistic regression model, trained on the Iris dataset, will now be put to the test. In this lesson, we focus on generating predictions using your model and understanding how well it performs. Predicting outcomes is a cornerstone of machine learning, and evaluating model performance with reliable metrics is crucial for making informed decisions. By the end of this lesson, you will be adept at deriving predictions from your models and evaluating their effectiveness using accuracy metrics.
Before diving into predictions, let’s make sure your Spark environment is ready. We covered this setup in previous lessons, but it is worth repeating: a consistent environment makes your results easier to reproduce. We will reuse the preprocessed data and the trained logistic regression model from the last lesson.
Recall how we initialized the Spark session and had the train and test datasets ready:
```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from preprocess_data import preprocess_data

# Initialize a Spark session
spark = SparkSession.builder.appName("ModelEvaluation").getOrCreate()

# Preprocess the dataset
train_data, test_data = preprocess_data(spark, "iris.csv")

# Initialize the logistic regression model with specified feature and label columns
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Fit the logistic regression model to the training data
lr_model = lr.fit(train_data)
```
This snippet gives us the setup we need before moving on to predictions and evaluation.
With your logistic regression model trained, predicting outcomes on unseen data is the next logical step. Calling the transform method on the test dataset makes the model generate predictions based on the patterns it learned.
```python
# Make predictions on the test set
predictions = lr_model.transform(test_data)

# Display the first 5 rows of the predictions DataFrame
predictions.show(5)
```
When you run this code, you'll observe an output similar to the following, which showcases sample predictions, including features, true labels, raw predictions, probabilities, and final predicted labels:
```text
+-----------------+-----+--------------------+--------------------+----------+
|         features|label|       rawPrediction|         probability|prediction|
+-----------------+-----+--------------------+--------------------+----------+
|[4.4,3.0,1.3,0.2]|  0.0|[54.8316681361287...|[1.0,5.8046421116...|       0.0|
|[4.6,3.2,1.4,0.2]|  0.0|[59.0189626418402...|[1.0,1.7180155135...|       0.0|
|[4.6,3.6,1.0,0.2]|  0.0|[76.5904949565625...|[1.0,1.3749985323...|       0.0|
|[4.8,3.1,1.6,0.2]|  0.0|[52.2432044937557...|[1.0,1.6438342987...|       0.0|
|[4.9,3.1,1.5,0.1]|  0.0|[55.1240971942048...|[1.0,8.6038688129...|       0.0|
+-----------------+-----+--------------------+--------------------+----------+
```
This output provides a comprehensive view of the predicted data:
- features: The input features used for prediction, corresponding to each sample in the test dataset.
- label: The actual class label for each sample, serving as a point of reference for evaluating prediction accuracy.
- rawPrediction: The raw, unnormalized score the model computes for each class before it is converted into probabilities; a higher score means the model favors that class more strongly.
- probability: The raw scores converted into class probabilities (with the logistic function for two classes, or its multiclass generalization, softmax, for more than two), showing the likelihood of each class for a given sample.
- prediction: The predicted class label for each sample, determined by selecting the class with the highest probability score.
By reviewing these columns, you can better understand how your model arrives at its predictions and compare them against actual class labels to assess performance.
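To make the relationship between the last three columns concrete, here is a minimal plain-Python sketch. It is not Spark's internal implementation: it assumes a multiclass model, where raw scores are turned into probabilities with the softmax function, and it uses made-up raw scores rather than values from the table above. The function names softmax and predict_label are illustrative and are not part of PySpark.

```python
import math


def softmax(raw_scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    # Subtract the largest score first so math.exp cannot overflow.
    highest = max(raw_scores)
    exps = [math.exp(score - highest) for score in raw_scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict_label(probabilities):
    """Return the index of the most probable class as a float label."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return float(best_index)


# Hypothetical raw scores for one sample with three classes (not real model output)
raw_prediction = [5.0, 1.0, -6.0]
probability = softmax(raw_prediction)
prediction = predict_label(probability)

print(probability)
print(prediction)
```

Because the first raw score is the largest, the first class receives the highest probability and becomes the predicted label, mirroring how the rows in the table end up with a prediction of 0.0.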
After generating predictions, it's important to evaluate how well our model performs. PySpark provides a useful tool called MulticlassClassificationEvaluator, which can compute several performance metrics. In this lesson, we will focus on accuracy, a key metric that tells us the proportion of correct predictions out of all predictions made.
Here’s how you use it in your code:
```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Initialize the evaluator with the desired settings
evaluator = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)

# Compute the accuracy of the model on the test data
accuracy = evaluator.evaluate(predictions)

# Output the calculated accuracy of the model
print("Model Accuracy:", accuracy)
```
Here, labelCol specifies the column with the actual labels, predictionCol is the column with the predictions, and metricName="accuracy" tells the evaluator to calculate the accuracy metric.
When you run this code, you might see an output like this:
```text
Model Accuracy: 1.0
```
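This indicates that the model predicted every label in the test data correctly, for an accuracy of 100%. Your exact figure may differ depending on how the data was split, and because the Iris test set is small, a perfect score here does not guarantee the same result on other data.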
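Accuracy itself is a simple ratio: the number of rows where the prediction equals the label, divided by the total number of rows. The plain-Python sketch below reproduces that calculation on a small hand-made list of (label, prediction) pairs. The data and the function name compute_accuracy are illustrative, and this is not how the evaluator is implemented internally.

```python
def compute_accuracy(pairs):
    """Fraction of (label, prediction) pairs where the prediction matches the label."""
    if not pairs:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)


# Hypothetical (label, prediction) pairs: four correct, one wrong
results = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 2.0), (0.0, 0.0)]
accuracy = compute_accuracy(results)

print("Model Accuracy:", accuracy)  # prints: Model Accuracy: 0.8
```

On the Spark side, the same ratio could be checked by hand with predictions.filter("label = prediction").count() / predictions.count(), which should agree with the evaluator's accuracy value.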
In this lesson, you gained the ability to make predictions using a trained logistic regression model and evaluate its performance using a fundamental accuracy metric. These skills are pivotal for dealing with real-world datasets, where prediction accuracy can significantly impact outcomes and solutions.
Now, I encourage you to dive into the practice exercises that follow this lesson. These exercises will solidify your understanding and help you experiment with different configurations. As we progress, mastering these skills will expand your proficiency in PySpark's MLlib and broaden your machine learning capabilities. Keep up the great work!