Hello and welcome to this thrilling session where we dive deep into the world of machine learning! Today, we'll build upon our knowledge of predictive modeling and explore a special tool known as ensemble methods. We'll focus on how ensemble methods harness the power of multiple algorithms to create a better, stronger algorithm, significantly improving the accuracy of our predictions.
This lesson aims to guide you through the theory and practical implementation of ensemble methods in Python. We'll learn and practice two popular ensemble techniques—RandomForest and GradientBoosting—and apply them to the Breast Cancer Wisconsin Dataset, a binary classification dataset frequently used for education and research in machine learning.
So, roll up your sleeves and let's embark on this journey into the realm of ensemble methods!
Ensemble methods rely on a simple yet potent concept in machine learning — combining the decisions from multiple weak learning algorithms to construct a powerful and robust model. We refer to a model that performs just slightly better than random guessing as a weak learner in the realm of machine learning. Ensemble methods bring together these weak models to create a stronger learner. To simplify this, let's envision a team-building exercise where each team member is proficient at one specific task. By working together, they can combine their skills and effectively complete complex projects.
In this lesson, we'll specifically focus on two popular ensemble techniques: bagging and boosting.
- Bagging: The name is short for Bootstrap AGGregatING. It works by constructing multiple bootstrap samples of the original dataset and training a weak learner on each one. The individual predictions for a given input are then combined (usually by averaging for regression problems or by voting for classification problems) to produce the final prediction.
- Boosting: This method operates a little differently from bagging. It works sequentially, with each new model correcting the errors made by the previous ones. This continuous focus on misclassified examples often leads to improved model performance, but it can also induce overfitting because boosting pays considerable attention to outliers. (Both ideas are sketched in code right after this list.)
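To make bagging and boosting concrete, here is a minimal sketch using scikit-learn's generic BaggingClassifier and AdaBoostClassifier wrappers, both of which build on decision trees by default. The dataset split and parameter values below are illustrative choices, not part of this lesson's main examples.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Illustrative split of the same dataset we will use later in this lesson
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging: by default, BaggingClassifier trains decision trees on bootstrap
# samples of the training data and combines their votes
bagging = BaggingClassifier(n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)

# Boosting: by default, AdaBoostClassifier trains shallow decision trees
# sequentially, each one focusing more on the examples the previous ones got wrong
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)
boosting.fit(X_train, y_train)

print('Bagging accuracy:', bagging.score(X_test, y_test))
print('Boosting accuracy:', boosting.score(X_test, y_test))
```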
Bagging primarily helps decrease variance, while boosting primarily helps decrease bias, and both typically result in better predictions than any single weak learner. For instance, if we want to predict whether it's going to rain tomorrow and our three models say 'Yes', 'No', and 'Yes', a majority vote means our ensemble predicts 'Yes'.
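Here is what that majority-vote step looks like in plain Python; the three predictions below are the hypothetical model outputs from the rain example above.

```python
from collections import Counter

# Hypothetical outputs from three individual models
predictions = ['Yes', 'No', 'Yes']

# Majority vote: the most common prediction becomes the ensemble's answer
ensemble_prediction = Counter(predictions).most_common(1)[0][0]
print(ensemble_prediction)  # Yes
```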
Let's look at two ensemble methods, RandomForest and GradientBoosting, which use bagging and boosting, respectively, to improve the performance of decision tree classifiers.
RandomForest is a versatile and widely-used machine learning algorithm that is part of the larger family of ensemble methods. As the name suggests, a "forest" consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes the model’s final prediction.
Let's now see how to implement a RandomForest classifier in Python using the Breast Cancer Wisconsin dataset and the scikit-learn library:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100, max_features='sqrt')  # Create a RandomForestClassifier object
rfc.fit(X_train, y_train)  # Fit the classifier on the training set

# Predict on the testing set
y_pred = rfc.predict(X_test)
```
This code trains a RandomForest classifier on the Breast Cancer Wisconsin Dataset and makes predictions on the testing set. The `n_estimators` parameter sets the number of trees to be constructed in the forest, and `max_features` determines the number of features to consider when looking for the best split. You can also adjust other parameters such as `max_depth` (the maximum depth of each tree) and `min_samples_split` (the minimum number of samples required to split a node), among others.
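To get a feel for how these parameters affect the model, one simple experiment is to train a few forests with different settings and compare their test accuracy. The specific values below are arbitrary illustrative choices, and the snippet reuses the X_train/X_test split created above.

```python
# Try a few illustrative hyperparameter combinations (values are arbitrary examples)
for n_trees, depth in [(10, 2), (100, 5), (200, None)]:
    model = RandomForestClassifier(
        n_estimators=n_trees,
        max_depth=depth,        # None lets each tree grow until its leaves are pure
        max_features='sqrt',
        random_state=42,
    )
    model.fit(X_train, y_train)
    print(f'n_estimators={n_trees}, max_depth={depth}: '
          f'accuracy={model.score(X_test, y_test):.3f}')
```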
Gradient Boosting is another powerful ensemble learning method. Unlike RandomForest, which grows an entire forest of full-sized trees independently, Gradient Boosting grows trees sequentially; each new tree helps correct the errors made by the trees trained before it.
To understand this, imagine a situation where a student is taught by a series of teachers, with each teacher correcting the work done by the previous one, constantly improving the student's knowledge. This is essentially how Boosting works!
Let's see how to implement Gradient Boosting in Python using the Breast Cancer Wisconsin dataset and the scikit-learn library:
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)  # Create a GradientBoostingClassifier object
gbc.fit(X_train, y_train)  # Fit the classifier on the training set

# Predict on the testing set
y_pred = gbc.predict(X_test)
```
This code trains a GradientBoosting classifier on the same dataset and makes predictions on the testing set. The `learning_rate` parameter controls how much each tree contributes to the final outcome, while `n_estimators` and `max_depth` control the number of sequential trees to be built and the depth of each tree, respectively.
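To see how learning_rate and the trees interact, a quick sketch (again with arbitrary illustrative values) is to compare a few boosted models on the same split:

```python
# Compare a few learning rates (values chosen only for illustration)
for lr in [0.01, 0.1, 1.0]:
    model = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=lr,   # smaller values shrink each tree's contribution
        max_depth=3,
        random_state=42,
    )
    model.fit(X_train, y_train)
    print(f'learning_rate={lr}: accuracy={model.score(X_test, y_test):.3f}')
```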
Last, but certainly not least, it's time to evaluate the performance of our models. The accuracy score is one of the simplest ways to evaluate a classification model, as it represents the fraction of correct predictions out of all predictions:
```python
# Train a DecisionTreeClassifier for comparison
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

# Calculate the accuracy of RandomForest
rfc_score = rfc.score(X_test, y_test)

# Calculate the accuracy of GradientBoosting
gbc_score = gbc.score(X_test, y_test)

# Calculate the accuracy of DecisionTree
dtc_score = dtc.score(X_test, y_test)

# Print scores
print('RandomForest Accuracy:', rfc_score)
print('GradientBoosting Accuracy:', gbc_score)
print('DecisionTree Accuracy:', dtc_score)
```
Output:

```
RandomForest Accuracy: 0.9649122807017544
GradientBoosting Accuracy: 0.956140350877193
DecisionTree Accuracy: 0.9385964912280702
```
This snippet first calculates and then prints the accuracy scores of the RandomForest and GradientBoosting classifiers, as well as a vanilla DecisionTree for comparison. As you can see, RandomForest performs best here, and GradientBoosting also improves on the single decision tree.
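If you prefer to compute the same metric directly from predictions rather than with the score method, scikit-learn's accuracy_score gives an equivalent result. The snippet below recomputes predictions on the test set for the two ensembles trained earlier.

```python
from sklearn.metrics import accuracy_score

# accuracy_score returns the same fraction of correct predictions as .score()
rfc_accuracy = accuracy_score(y_test, rfc.predict(X_test))
gbc_accuracy = accuracy_score(y_test, gbc.predict(X_test))

print('RandomForest Accuracy:', rfc_accuracy)
print('GradientBoosting Accuracy:', gbc_accuracy)
```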
Congratulations! You've now successfully completed the lesson and learned about the power of ensemble learning algorithms. You have a firm understanding of ensemble techniques, RandomForest, GradientBoosting classifiers, and their practical implementation.
In the exercises that follow, you'll be given the opportunity to further challenge your newly acquired skills by applying these concepts to different datasets. Brace yourself for a thrilling roller-coaster ride into the realm of ensemble methods!