Lesson 4

Welcome! Today, we're learning to evaluate your *machine learning model's performance*. Evaluating your model is crucial because it tells you how well it will predict new data it hasn't seen before. In simpler terms, it tells you if your model is good at its job.

We will focus on one metric – **Mean Squared Error (MSE)**. This metric is like a report card for your model, showing its prediction accuracy. By the end of this lesson, you'll know how to calculate it and understand what it means.

Let's review what we've done so far. We have been working with synthetic data representing house areas and their prices. We used this data to train a simple *linear regression model* to predict house prices based on their area. Here's a reminder snippet:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate synthetic data
np.random.seed(42)
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)  # House area in square feet
base_price = 50000
price_per_sqft = 200
noise = np.random.normal(0, 25000, num_samples)  # Adding some noise
price = base_price + (area * price_per_sqft) + noise

# Create DataFrame
df = pd.DataFrame({'Area': area, 'Price': price})

# Extract features and target variable
X = df['Area'].values.reshape(-1, 1)
y = df['Price'].values

# Initialize and train the model
model = LinearRegression()
model.fit(X, y)
```

With our model trained, we can evaluate its performance.

**Mean Squared Error (MSE)** measures how far off our model's predictions are from the actual values. It’s like checking how precise your aim is in darts. The lower the MSE, the better.

Steps to calculate MSE:

- Make predictions using your model.
- Calculate the difference between actual and predicted prices for each house.
- Square these differences.
- Find the average of these squared differences.

Mathematically, it looks like this: $\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ where:

- $N$ is the number of data points,
- $y_i$ is the actual value for the $i$-th data point,
- $\hat{y}_i$ is the predicted value for the $i$-th data point.
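
The steps and formula above can be sketched directly in NumPy. Here's a minimal example with made-up actual and predicted prices (the values are hypothetical, just to make the arithmetic easy to follow):

```python
import numpy as np

# Hypothetical actual vs. predicted prices for three houses
y_actual = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 195.0])

# Step 2: differences between actual and predicted values
errors = y_actual - y_pred      # [-10.  10.   5.]

# Step 3: square the differences
squared = errors ** 2           # [100. 100.  25.]

# Step 4: average the squared differences
mse = squared.mean()            # (100 + 100 + 25) / 3 = 75.0

print(mse)  # 75.0
```

Notice that squaring makes every error positive, so over- and under-predictions can't cancel each other out.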

Let's visualize it with a plot:

Here, the green vertical lines show the distance between each actual price and the model's prediction. Squaring all these distances and averaging them gives the MSE metric!
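
A minimal sketch of how such a plot could be drawn (using a small synthetic sample with the same structure as our house-price data; the filename `mse_residuals.png` is just an example):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Small synthetic sample, same structure as the lesson's data
np.random.seed(42)
area = np.random.uniform(500, 3500, 20)
price = 50000 + area * 200 + np.random.normal(0, 25000, 20)

X = area.reshape(-1, 1)
model = LinearRegression().fit(X, price)
predicted = model.predict(X)

# Actual prices as dots, the model's predictions as a line
plt.scatter(area, price, label="Actual price")
plt.plot(np.sort(area), predicted[np.argsort(area)], color="red", label="Prediction")

# Green vertical lines: the distance between actual and predicted values
plt.vlines(area, predicted, price, color="green", linewidth=1)

plt.xlabel("Area (sq ft)")
plt.ylabel("Price ($)")
plt.legend()
plt.savefig("mse_residuals.png")
```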

Here's the code:

```python
from sklearn.metrics import mean_squared_error

# Make predictions
y_train_predict = model.predict(X)

# Calculate MSE
mse = mean_squared_error(y, y_train_predict)

print(f"Mean Squared Error: {mse:.2f}")
```

This will output something like:

```
Mean Squared Error: 504115352.48
```

In real life, MSE helps you understand if your model's predictions are close to the actual prices. For example, if predicting toy prices, an MSE of 1000 means predictions are off by about 31.62 (since $\sqrt{1000} \approx 31.62$).
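
Because MSE is in squared units, taking its square root brings the error back to the original price units; this quantity is known as the root mean squared error (RMSE). A quick check of the figure above:

```python
import numpy as np

mse = 1000.0
rmse = np.sqrt(mse)  # typical prediction error, back in the original units

print(round(float(rmse), 2))  # 31.62
```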

So, what does our MSE score tell us? Lower values are better. If your MSE is high, it means your model's predictions are not accurate. For example, when predicting toy prices, an MSE of 10 might be great, but an MSE of 1000 means the model is often very wrong.

Understanding these metrics helps you improve your model. If your MSE score is high, you might need to consider other features, preprocess your data differently, or even choose a different model!

Today, you learned to evaluate your model's performance using **Mean Squared Error (MSE)**. MSE measures how close your model’s predictions are to actual values.

Understanding these metrics helps you assess and improve your model. Next, we'll practice using MSE to evaluate models on CodeSignal!