Unfurling the scroll of knowledge further, we welcome you to another insightful journey into the dynamic realm of machine learning! Today, we aim to delve into the fascinating topic of model improvement and fine-tuning. We will refine the models we've constructed so far, enhancing their reliability and accuracy in predicting outcomes based on given data.
The art of model improvement and fine-tuning forms the crux of machine learning. Even with a preliminary model already developed to analyze data and predict outcomes, there's always room for refinement. By fine-tuning hyperparameters or applying more advanced techniques, we can markedly improve a model's performance and obtain more accurate, reliable predictions.
Today's discussion will focus on our existing linear and logistic regression models, which predict wine quality based on physicochemical properties. We aim to make these models even more proficient at their tasks. Therefore, you are in the right place if you're curious about achieving more precise predictions and seeking knowledge to optimize machine-learning models for real-world applications. Let's dive right in!
A natural starting point for our discussion on model improvement and fine-tuning is the pair of concepts known as overfitting and underfitting. In machine learning language, a model is said to be overfitting when it performs exceedingly well on the training data but fails to generalize to new, unseen data. Conversely, underfitting occurs when a model performs poorly on both the training and test data: it fails to capture the underlying pattern in the data.
To better understand these twin pitfalls of model training, let's use an analogy with learning a new sport. Suppose you're learning to play golf. If you practice solely on one course under specific conditions (say, calm weather and flat terrain), there's a pitfall: you may only perform well under those exact conditions. If an upcoming tournament is held on another course or in harsh weather, your performance could suffer. This scenario is akin to overfitting in machine learning, where a model performs well on its specific training data but fails to generalize to new or different data.
On the other hand, imagine you are learning golf but not practicing enough or focusing only on specific aspects (say, only working on the driving range and not practicing putting). In this case, your overall golfing skills won't be as complete. This reflects underfitting, where the model fails to capture the full scope of patterns in the data, resulting in poor performance on both the training and the test data.
To make these concepts tangible, let's look at a Python example:
```python
# Exemplify overfitting and underfitting
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Generate some data
np.random.seed(0)
x = np.random.rand(40, 1) ** 2
y = (10 - 1. / (x.ravel() + 0.1)) + np.random.randn(40)

# Define a function to fit the model
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))

# Fit polynomials of increasing degree and plot them over the data
X_test = np.linspace(-0.1, 1.1, 500)[:, None]
plt.figure(figsize=(12, 6))
plt.scatter(x.ravel(), y, color='black')
for degree in [1, 3, 30]:
    y_test = PolynomialRegression(degree).fit(x, y).predict(X_test)
    plt.plot(X_test.ravel(), y_test, label='degree={0}'.format(degree))
plt.xlim(-0.1, 1.0)
plt.ylim(-2, 12)
plt.legend(loc='best')

# Show the plot
plt.show()
```
In the plot above, we observe the same data being fitted by polynomials of degree 1 (underfitting), 3 (a good fit), and 30 (overfitting). The model with degree 1 is too simple and doesn't capture the pattern in the data well, indicating underfitting. The model with degree 30 captures too much of the noise, leading to overfitting. The model with degree 3 strikes the right balance.
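We can also diagnose overfitting and underfitting numerically rather than just visually. The following is a quick sketch of ours (not part of the original example) that reuses `x`, `y`, and `PolynomialRegression` from above, holds out a validation set, and compares training and validation error for each degree; the split proportion is an arbitrary choice. An underfit model shows high error on both sets, while an overfit model shows low training error but much higher validation error:

```python
# Quantify fit quality by comparing training and validation error (sketch)
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hold out part of the data as a validation set
x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in [1, 3, 30]:
    model = PolynomialRegression(degree).fit(x_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(x_tr))
    val_mse = mean_squared_error(y_val, model.predict(x_val))
    print(f"degree={degree}: train MSE={train_mse:.2f}, validation MSE={val_mse:.2f}")
```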
So, how can we address overfitting and underfitting in our wine quality models? Let's explore this next!
An essential factor in the performance of machine learning models is the set of parameters that govern the learning process. You can compare these parameters to the strategic decisions in a chess game: they determine the course and dynamics of play. In machine learning, there are two types of such decision-making elements: model parameters and hyperparameters.
Model parameters are attributes learned from the training data itself, such as the weights in a linear regression model. Hyperparameters, on the other hand, are set before training and guide the learning process; the learning rate in gradient descent is one example.
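To make the distinction concrete, here is a small illustrative sketch of ours (with made-up toy data): the arguments we pass to the model's constructor are hyperparameters we choose up front, while the weights exposed after fitting are model parameters learned from the data:

```python
# Hyperparameters vs. model parameters (illustrative sketch)
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 20 samples, 2 features, binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# C and max_iter are hyperparameters: we set them before training
model = LogisticRegression(C=1.0, max_iter=200)

# The weights and intercept are model parameters: learned during fitting
model.fit(X_toy, y_toy)
print("Learned weights:", model.coef_)
print("Learned intercept:", model.intercept_)
```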
Hyperparameters mold the strategy of the learning process. Therefore, optimizing hyperparameters—an endeavor known as Hyperparameter Optimization—becomes crucial for improving a machine learning model. We'll look into two popular techniques: Grid Search and Random Search.
Grid Search: In grid search, we specify a subset of the hyperparameter space as a grid. We then evaluate the model performance for each point in the grid and choose the most effective hyperparameters, i.e., the ones that produce the best model based on a pre-selected measure.
Random Search: While grid search offers a straightforward and exhaustive way to search through hyperparameters, it can be computationally expensive. As an alternative, random search samples random points in the hyperparameter space and evaluates them for a fixed number of iterations, randomizing the search process. We'll see a sketch of random search right after the grid search example below.
Let's see how we can fine-tune the `C` and `penalty` hyperparameters of a logistic regression model:
```python
import numpy as np
import pandas as pd
import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Load the dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)

# Separate features and target
X = red_wine.drop(columns='quality')
y = pd.cut(red_wine['quality'], bins=[0, 6.5, 10], labels=['bad', 'good'])

# Standardize the features
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Create a Logistic Regression model ('saga' supports both l1 and l2 penalties)
logistic = LogisticRegression(solver='saga', tol=0.01)
pipe = make_pipeline(logistic)

# Set up the grid
param_grid = {
    'logisticregression__C': np.logspace(-2, 2, 5),
    'logisticregression__penalty': ['l1', 'l2'],
}

# Initiate grid search with 5-fold cross-validation
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=0)
grid.fit(X_train, y_train)

# Print the best parameters
print('Best parameters: ', grid.best_params_)
# Best parameters: {'logisticregression__C': 0.09999999999999999, 'logisticregression__penalty': 'l2'}
```
The `GridSearchCV` function performs an exhaustive search over the specified parameter grid and returns the most effective parameters, i.e., those that produce the model with the highest cross-validated score.
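For comparison, random search can be run on the same pipeline with scikit-learn's `RandomizedSearchCV`. The following is a minimal sketch of ours, reusing `pipe`, `X_train`, and `y_train` from above; the sampling distribution and number of iterations are arbitrary choices:

```python
# A sketch of random search on the same pipeline (assumes the variables above)
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'logisticregression__C': loguniform(1e-2, 1e2),  # sample C on a log scale
    'logisticregression__penalty': ['l1', 'l2'],
}

# Evaluate 10 random combinations with 5-fold cross-validation
random_search = RandomizedSearchCV(
    pipe, param_distributions=param_distributions,
    n_iter=10, cv=5, random_state=0,
)
random_search.fit(X_train, y_train)
print('Best parameters: ', random_search.best_params_)
```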
While hyperparameter tuning is undoubtedly beneficial and can improve model performance, there can be situations when advanced techniques are required for further improvement. For instance, let's talk about Regularization, one of the techniques used to curb overfitting.
Regularization: By adding a penalty term to the cost function that grows with the size of the weights, regularization discourages the learning of overly complex models. It prompts the model to keep the weights as small as possible, simplifying it and curbing overfitting.
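As a rough sketch of the idea (using generic notation, not scikit-learn's exact internal formulation), an L2-regularized cost function looks like

$$J(\mathbf{w}) = \text{Loss}(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_2^2,$$

where the loss measures the fit to the training data and the term $\lambda \lVert \mathbf{w} \rVert_2^2$ penalizes large weights. In scikit-learn's logistic regression, the hyperparameter `C` is the inverse of the regularization strength (roughly $1/\lambda$), so smaller `C` means stronger regularization. Let's see how this works with our logistic regression model: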
```python
from sklearn.linear_model import LogisticRegressionCV

# Create a Logistic Regression model with L2 regularization and built-in cross-validation
logistic_l2 = LogisticRegressionCV(cv=5, random_state=0, penalty='l2').fit(X_train, y_train)

# Print the best C_
print('Best C: ', logistic_l2.C_[0])
# Best C: 0.046415888336127774
```
In this step, we use `LogisticRegressionCV`, a logistic regression with built-in cross-validation that finds the optimal `C` parameter. Here, `penalty='l2'` (the default) indicates that we are using L2 regularization, which encourages the coefficient estimates, or simply the 'weights', to be small.
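To check whether the tuned, regularized model actually helps, we can compare its accuracy on the training and test sets; this is a quick follow-up sketch of ours using the variables defined earlier:

```python
# Compare training and test accuracy for the regularized model (sketch)
print('Train accuracy: ', logistic_l2.score(X_train, y_train))
print('Test accuracy: ', logistic_l2.score(X_test, y_test))
# A large gap between the two would suggest overfitting
```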
Now that we have a deeper understanding of model improvement mechanisms (overfitting and underfitting, hyperparameter optimization through grid and random search, and advanced techniques like regularization), we can put them to the test. Consider enhancing our wine quality prediction models: the goal is to predict whether a wine is 'good' or 'bad' based on its physicochemical properties, keeping in mind the correlations between those properties.
Congratulations on enhancing your machine-learning knowledge! In this lesson, we have covered:

- Overfitting and underfitting, and how to spot them
- The distinction between model parameters and hyperparameters
- Hyperparameter optimization with grid search and random search
- Regularization as a technique to curb overfitting
By applying all these techniques to our linear and logistic regression models, we've improved the prediction of wine quality based on various physicochemical features. With these newly acquired skills, you've taken another step towards becoming a machine learning marvel!
Ready to get your hands dirty with some real practice? We have several exercises where you'll apply what you've learned in today's session. These exercises reinforce your learning and prepare you for real-world machine-learning problems. Fine-tuning a machine learning model is above all a practical skill, and the key to becoming proficient lies in practice. Onwards to the exercises, and happy coding!