Welcome! In this lesson, we're going to delve into an essential part of the data analysis and Machine Learning process — Model Evaluation — and specifically focus on understanding various evaluation metrics. In the language of Machine Learning, models
are mathematical formulas, or algorithms, that process your input data to calculate the result for the task they're designed to perform. To ensure a model's predictions are accurate and reliable, we evaluate it against a set of standards or criteria, known as evaluation metrics.
Our primary goal in this lesson is to understand these metrics and learn how to apply them using Python and Sklearn on the Iris dataset. Given the numerous machine learning models available, knowing how to calculate and interpret these evaluation metrics will be crucial in selecting the most suitable model for any task. So, let's dive in!
In the fascinating world of Machine Learning, we often encounter a question -- "How well is our model performing?". The response to this question is provided through the process of model evaluation. Model evaluation allows us to quantify our model's performance, essentially telling us how 'good' or 'bad' it is.
A standard method for model evaluation is splitting our data into Training and Test Sets. By training our model on the Training Set and then testing it on the Test Set, we ensure that our evaluation is unbiased and indicative of how the model will perform on new, unseen data.
The concept of Cross-Validation further refines this process. In Cross-Validation, we divide our dataset into 'K' parts, or folds. We then train our model 'K' times, each time using a different fold as our Test Set. This yields 'K' performance scores, which we average to get a final score.
Let's take a look at how this plays out in code:
```python
from sklearn.model_selection import train_test_split

# Split the data into Training and Test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Print the size of our training and test sets
print("Number of instances in Training set: ", len(X_train))
print("Number of instances in Test set: ", len(X_test))
```
The output will be:
```
Number of instances in Training set: 105
Number of instances in Test set: 45
```
In this example, we split our original dataset, represented by `X` and `y`, into Training and Test Sets. We use 70% of the data (the Training Set) to train our model and the remaining 30% (the Test Set) to test our model's performance.
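The same data can also be evaluated with cross-validation. Here is a minimal sketch using Scikit-Learn's `cross_val_score`, assuming the same `X` and `y` as above and an illustrative choice of 5 folds:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# 5-fold cross-validation: each fold takes a turn as the Test Set,
# and the five resulting scores are averaged into a single estimate
model = LogisticRegression(max_iter=200)
scores = cross_val_score(model, X, y, cv=5)

print("Score per fold: ", scores)
print("Average score: ", scores.mean())
```

Each score in `scores` comes from training on four folds and testing on the held-out fifth, mirroring the 'K' performance scores described earlier.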
When dealing with regression problems (where the output is a numeric or continuous value), we use metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). MAE gives the average absolute difference between our predictions and the actual values, MSE gives the average squared difference, and RMSE is simply the square root of MSE, which brings the error back into the same units as the target.
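To make these definitions concrete, here is a tiny hand-rolled sketch that computes all three with NumPy; the `y_true` and `y_hat` values are made up purely for illustration:

```python
import numpy as np

# Illustrative numbers, not taken from the Iris example
y_true = np.array([3.0, 1.5, 2.0, 4.0])
y_hat = np.array([2.5, 1.0, 2.5, 5.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))   # average absolute difference -> 0.625
mse = np.mean(errors ** 2)      # average squared difference  -> 0.4375
rmse = np.sqrt(mse)             # square root of MSE          -> ~0.661

print(mae, mse, rmse)
```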
Let's apply these metrics to a simple Linear Regression model:
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
from math import sqrt

# Instantiate and train a Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict test set labels and calculate errors
y_pred = lr_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = sqrt(mse)

print('Mean Absolute Error: ', mae)
print('Mean Squared Error: ', mse)
print('Root Mean Squared Error: ', rmse)
```
The output will be:
```
Mean Absolute Error: 1.23632
Mean Squared Error: 2.37823
Root Mean Squared Error: 1.54128
```
In this Python snippet, we create a Linear Regression model using `LinearRegression()`, train it on our training data with `fit()`, make predictions on the test data with `predict()`, and calculate MAE, MSE, and RMSE using the corresponding Scikit-Learn functions.
In classification tasks, where the model's output is a category or class, we use metrics such as Accuracy, Precision, Recall, and the F1 Score.
Let's learn what each of these metrics is:
- Accuracy: Measures the proportion of correct predictions (both true positives and true negatives) among the total number of cases examined. It is a straightforward metric for overall success but can be misleading in cases of class imbalance.
- Precision: Measures the proportion of predicted positives that are actually positive. It is particularly important when the consequences of false positives are significant.
- Recall: Also known as sensitivity, this metric measures the proportion of actual positives that are correctly identified. It is particularly important when the consequences of false negatives are significant.
- F1 Score: The harmonic mean of Precision and Recall, providing a balanced measure between the two. It is most useful when we need a single metric that reflects both false positives and false negatives. An ideal F1 Score is 1, indicating perfect precision and recall, while 0 is the worst.
These metrics collectively offer a nuanced view of a model's performance, particularly in situations where certain types of errors are more consequential than others.
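For a quick feel of how Precision, Recall, and the F1 Score relate, here is a small sketch that computes them directly from hypothetical counts of true positives, false positives, and false negatives (the numbers are invented for illustration):

```python
# Hypothetical counts for one positive class
tp, fp, fn = 40, 10, 5

precision = tp / (tp + fp)                           # 0.8    -> share of predicted positives that were right
recall = tp / (tp + fn)                              # ~0.889 -> share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(precision, recall, f1)
```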
Additionally, we employ the Confusion Matrix — a table that compares predicted labels against actual labels, showing exactly where a classification model succeeds and where it confuses one class for another.
Let's examine this by training and evaluating a Logistic Regression model.
```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression

# Instantiate and train a Logistic Regression model
log_model = LogisticRegression(max_iter=200)
log_model.fit(X_train, y_train)

# Predict test set labels and calculate scores
y_pred = log_model.predict(X_test)
accuracy = log_model.score(X_test, y_test)
precision = precision_score(y_test, y_pred, average='micro')
recall = recall_score(y_test, y_pred, average='micro')
f1 = f1_score(y_test, y_pred, average='micro')

print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 Score: ", f1)
```
The output will be:
```
Accuracy: 0.97777
Precision: 0.97777
Recall: 0.97777
F1 Score: 0.97777
```
In this block of code, we created a Logistic Regression model with `LogisticRegression()`, fitted it to the training data, predicted the labels of the test set, and calculated Accuracy, Precision, Recall, and the F1 Score for the model on the test data. Note that with `average='micro'`, precision, recall, and F1 are aggregated over all classes; for a single-label multiclass problem like Iris, this makes them equal to accuracy, which is why all four printed values match.
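The snippet above imports `confusion_matrix` but never displays it. As a small follow-up, assuming the same `y_test` and `y_pred` from the Logistic Regression example, you can print it like this:

```python
from sklearn.metrics import confusion_matrix

# Rows correspond to actual classes, columns to predicted classes;
# off-diagonal entries count the misclassified samples
cm = confusion_matrix(y_test, y_pred)
print(cm)
```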
Lastly, when working with a decision tree model, we often look at Accuracy and the Gini Index. Accuracy measures the fraction of correct predictions on the test set, while the Gini Index quantifies the impurity of the samples at a node and is the criterion the tree uses to decide its splits during training.
Let's explore this with an example:
```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Instantiate and train a Decision Tree model
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)

# Predict test set labels and calculate accuracy
y_pred = tree_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy: ", accuracy)
```
The output will be:
```
Accuracy: 0.95555
```
Here, we created a Decision Tree classifier and calculated its accuracy. Give yourself a round of applause if your model has high accuracy!
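If you'd like to see the Gini impurity the tree actually works with, here is a minimal sketch; it assumes the same `X_train` and `y_train` as above and relies on Scikit-Learn's fitted `tree_` attribute:

```python
from sklearn.tree import DecisionTreeClassifier

# 'gini' is the default splitting criterion; we set it explicitly for clarity
gini_tree = DecisionTreeClassifier(criterion='gini', random_state=1)
gini_tree.fit(X_train, y_train)

# Gini impurity of every node in the fitted tree (0.0 means a perfectly pure node)
print(gini_tree.tree_.impurity)
```

Lower impurities at the leaves indicate that the tree's splits have separated the classes cleanly.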
Bravo! You've actively learned about and implemented various model evaluation metrics, such as Mean Absolute Error, Mean Squared Error, Root Mean Squared Error for regression models, and Accuracy, Precision, Recall, F1 Score for classification models. Applying these metrics using Python and Sklearn should bring you closer to selecting the most suitable model for your datasets.
Now, it's time to apply what you've learned. Delve into practice exercises that will help you consolidate these core concepts and polish your newfound skills. Always remember that understanding and learning are key to success in your machine learning journey. Onwards and upwards!