Lesson 2

Hey there! Welcome to another enlightening session on predictive modeling, where we're diving into Regression Models, specifically Polynomial Regression, using Python and the sklearn library. Think of Polynomial Regression as an extension of Linear Regression that can model the relationship between two variables, i.e., a predictor (x) and a response (y), as an nth-degree polynomial. By the end of this lesson, the main goal is for you to have practical knowledge of how Polynomial Regression works and how to implement it in Python using sklearn.

Let's start!

First and foremost, let's try to understand what Polynomial Regression really entails. At its core, Polynomial Regression extends simple linear regression by adding extra predictors, derived by raising each of the original predictors to a power. This extension enables us to capture relationships between the variables that are not merely linear.

Suppose you're trying to estimate the price of a house. While the price depends on its size, the relationship isn't linear: the price does not increase proportionally with the size. This is where Polynomial Regression comes in!

However, one must be cautious. Use of a very high degree polynomial can lead to complex models which might result in overfitting.
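To make this caution concrete, here's a small sketch using a hypothetical five-point dataset (the same values used later in this lesson). As the degree grows, the training error collapses; with five points, a degree-4 polynomial can pass through them exactly, which says nothing about how well it would generalize to new data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset (hypothetical values)
X = np.array([5, 6, 7, 8, 9], dtype=float).reshape(-1, 1)
Y = np.array([4, 6, 10, 15, 23], dtype=float)

# Training error (RSS) shrinks as the degree grows; degree 4 can
# interpolate all five points exactly -- a classic overfitting warning sign.
for degree in (1, 2, 4):
    X_poly = PolynomialFeatures(degree=degree).fit_transform(X)
    model = LinearRegression().fit(X_poly, Y)
    rss = np.sum((Y - model.predict(X_poly)) ** 2)
    print("degree", degree, "-> training RSS:", round(rss, 4))
```

A lower training RSS at a higher degree is not evidence of a better model; it often just means the curve is bending to chase individual points.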

In Polynomial Regression, the relationship between the independent variable `x` and the dependent variable `y` is modeled as an nth-degree polynomial. We build on simple linear regression by adding extra predictors, derived by raising each of the original predictors to a power; this introduces the concept of the polynomial degree. The degree is the highest power of any predictor in the model, and it is fundamentally different from the number of features: the degree describes the highest exponent, while the number of features refers to the distinct variables in the dataset.

Consider the equation for a single predictor variable model: $y = b_0 + b_1x + b_2x^2$

where:

- $y$ is the variable we aim to predict,
- $x$ is the predictor,
- $b_0$ is the y-intercept,
- $b_1x$ is the linear term,
- $b_2x^2$ is the quadratic term, adding curvature and thus a nonlinear aspect.

This equation demonstrates a polynomial of 2nd degree; despite having a single feature (`x`), we raise it to powers to create multiple predictors ($x$, $x^2$). In scenarios with multiple features, each feature can similarly be raised to powers to create a richer set of predictors. Consequently, while model complexity increases with the degree due to the added curvature, it's crucial to distinguish this complexity from the actual number of features in the dataset.

Through the adaptive nature of Polynomial Regression, we can capture complex, nonlinear relationships by carefully balancing the polynomial degree against the dataset's intrinsic dimensionality. This balance is key to strong predictive performance while avoiding the pitfalls of overfitting.

In Polynomial Regression, finding the best-fit polynomial curve involves minimizing the discrepancies between the observed values and the predictions through the **Least Squares Method**. This approach optimizes the model's coefficients to reduce the sum of squared residuals, focusing on accurately representing the data.

The optimization function, tailored for Polynomial Regression, is defined as:

$F(b_0, b_1, \ldots, b_n) = \sum^{m}_{i=1}(y_i - (b_0 + b_1x_i + b_2x_i^2 + \ldots + b_nx_i^n))^2$

Where:

- $F(b_0, b_1, \ldots, b_n)$ is the residual sum of squares (RSS),
- $y_i$ and $x_i$ are the observed values and predictor values, respectively,
- $b_0, b_1, \ldots, b_n$ are the model coefficients,
- $m$ is the number of observations.

Minimizing the RSS ensures the model's predictions closely align with the actual data, effectively capturing the non-linear relationship via a fitted polynomial curve.
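As a sanity check, we can evaluate this RSS by hand in NumPy. The sketch below uses the small dataset and the rounded coefficients that appear later in this lesson ($b_0 = 22.34$, $b_1 = -8.3$, $b_2 = 0.93$), so the exact value is illustrative:

```python
import numpy as np

# Small dataset and rounded quadratic coefficients (illustrative values)
x = np.array([5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([4.0, 6.0, 10.0, 15.0, 23.0])
b0, b1, b2 = 22.34, -8.3, 0.93

# RSS: sum of squared gaps between observations and polynomial predictions
predictions = b0 + b1 * x + b2 * x**2
rss = np.sum((y - predictions) ** 2)
print(round(rss, 4))  # a small RSS means the curve tracks the data closely
```

The Least Squares Method simply searches for the coefficient values that make this quantity as small as possible.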

Let's walk through preparing our data for implementing Polynomial Regression. We create a simple dataset, then reshape it for compatibility with sklearn. The simplification to a handmade dataset allows us to avert the complexity of real-world data, focusing purely on the mechanics of Polynomial Regression.

Here's how we go about it:

```python
import numpy as np

# Creating our dataset
X = np.array([5, 6, 7, 8, 9])
Y = np.array([4, 6, 10, 15, 23])

# Reshaping to 2D for compatibility with sklearn
X = X.reshape(-1, 1)
```

With our data ready and well-prepared, we can traverse the path of implementing and analyzing Polynomial Regression.

Implementing Polynomial Regression in sklearn requires us first to transform our input variables into polynomial features. This transformation step is vital, as it equips a linear model with the capability to fit more complex relationships by introducing higher-order terms of the independent variables.

```python
from sklearn.preprocessing import PolynomialFeatures

# Creating polynomial features
poly_features = PolynomialFeatures(degree=2)

# Transforming the input features
X_poly = poly_features.fit_transform(X)

# Displaying the transformation
print("Original feature:", X[0])
print("Transformed polynomial feature:", X_poly[0])
```

In the output, observe how the original feature is expanded into its polynomial counterpart, enhancing the model's capability to decipher more complex patterns in the data.

```
Original feature: [5]
Transformed polynomial feature: [ 1.  5. 25.]
```

With our input features now transformed into polynomial features, the next step is to train a Linear Regression model on these new, transformed features. This allows us to capture the nonlinear relationships between the variables through a fundamentally linear model with polynomial capacities. By fitting our model with the transformed polynomial features and the target variable `Y`, we enable our Linear Regression framework to interpret and adapt to a nonlinear association.

```python
from sklearn.linear_model import LinearRegression

# Training the Linear Regression model on polynomial features
model = LinearRegression()
model.fit(X_poly, Y)

# Printing model coefficients and intercept
print("Coefficients:", np.round(model.coef_, 2))    # Prints: [ 0.   -8.3   0.93]
print("Intercept:", np.round(model.intercept_, 2))  # Prints: 22.34
```

Displaying the coefficients and the intercept offers us a glimpse into the way each feature shapes our predictions, enriching our grasp of the model's inner workings. The output gives us $b_0 = 22.34$ (the intercept), $b_1 = -8.3$, and $b_2 = 0.93$. With these values, we can construct our polynomial equation as follows:

$y = 22.34 - 8.3x + 0.93x^2$

This equation allows us to visualize the mathematical relationship that the model has identified between the features and the target variable, underpinning the predictive logic of our polynomial regression model.
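As a quick check that this equation is really what the model computes, we can evaluate it by hand and compare with `model.predict` on a new input (here x = 10, an assumed test value). Using the unrounded `intercept_` and `coef_`, the two results match:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Refit the same model as in the lesson
X = np.array([5, 6, 7, 8, 9]).reshape(-1, 1)
Y = np.array([4, 6, 10, 15, 23])
poly = PolynomialFeatures(degree=2)
model = LinearRegression().fit(poly.fit_transform(X), Y)

# Evaluate y = b0 + b1*x + b2*x^2 manually at x = 10
b0 = model.intercept_
_, b1, b2 = model.coef_  # coef_[0] pairs with the constant term and is 0
x_new = 10.0
manual = b0 + b1 * x_new + b2 * x_new**2

# Compare against sklearn's prediction for the same point
sklearn_pred = model.predict(poly.transform([[x_new]]))[0]
print(round(manual, 2), round(sklearn_pred, 2))
```

Plugging a value into the fitted equation and calling `predict` are the same computation, which is exactly what makes the model interpretable.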

After developing our Polynomial Regression model, it's insightful to visualize how well it fits our data. To do so, we generate a smooth curve representing our model's predictions: we create a dense range of input values (`X_smooth`), transform them into polynomial features, and use our model to predict the corresponding `Y` values. Plotting this smooth curve alongside the actual data points offers a clear visualization of the model's performance.

```python
import matplotlib.pyplot as plt

# Creating more points for a smoother curve
X_smooth = np.linspace(X.min(), X.max(), 300).reshape(-1, 1)  # Reshape for transformation
X_poly_smooth = poly_features.transform(X_smooth)  # Transforming for polynomial regression
Y_pred_smooth = model.predict(X_poly_smooth)  # Predicting the values

# Plotting the smoother regression line
plt.scatter(X.flatten(), Y, color='blue')  # Plotting the actual data points
plt.plot(X_smooth, Y_pred_smooth, color='red')
plt.title("Polynomial Regression")
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(True)
plt.show()
```

This plot vividly illustrates the non-linear relationship captured by our Polynomial Regression model, differing substantially from what we'd expect with Simple Linear Regression.

Whew! There's a lot to digest; good work getting here! In this lesson, we've learned the essence of Polynomial Regression, prepared our data, implemented the regression, and visualized our model's fit.

Please remember, hands-on practice is crucial. In subsequent lessons, we'll delve into real-world exercises, fortifying our grasp on today's knowledge.

Let's keep pushing forward in mastering Polynomial Regression. Happy Coding!