Lesson 3
Training a Simple Model
Welcome to Training a Simple Model

Hello again! In the last lesson, we learned how to split a dataset into training and testing sets using the caret package in R. Now we are ready to step into the next phase of our machine learning journey: training a simple model.

What You'll Learn

In this lesson, you'll discover how to train a Linear Support Vector Machine (SVM) model using the caret package. Specifically, you will learn to:

  • Understand the purpose and basic concept of a Linear SVM.
  • Train a Linear SVM model on your training dataset.
  • Display and interpret some basic model details.

This process is straightforward and builds nicely on what you’ve learned so far.

Why It Matters

Training your first machine learning model is a significant milestone. The Linear SVM is a powerful and commonly used model in machine learning for classification tasks. By mastering the basics of training this model, you'll gain essential skills that will serve as a foundation for more advanced machine learning techniques and algorithms. This is where your data preprocessing and dataset splitting efforts come together to create a predictive model.

Understanding the Basic Concept of a Linear SVM

A Linear Support Vector Machine (SVM) is an algorithm used primarily for classification tasks. The basic idea is to find a hyperplane that best separates the classes in the feature space. In two dimensions this boundary is a line, in three dimensions a plane, and in higher dimensions a hyperplane. The objective is to maximize the margin between the classes, which helps the model generalize better to unseen data. Linear SVMs are effective when the data is linearly separable, meaning that a straight line (or hyperplane) can separate the classes.
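To make the idea concrete, here is a minimal sketch that fits a linear SVM directly with the kernlab package (the engine behind caret's svmLinear method) on just two species and two features, so the separating hyperplane is simply a line in two dimensions. The choice of features here is purely illustrative, not part of the lesson's workflow.

R
# Illustration only: a linear SVM on two classes and two features,
# so the "hyperplane" is just a line in 2-D
library(kernlab)

data(iris)
two_class <- droplevels(subset(iris, Species != "virginica"))

fit <- ksvm(Species ~ Petal.Length + Petal.Width,
            data = two_class,
            kernel = "vanilladot",  # linear kernel
            C = 1)

fit  # printing shows the number of support vectors and the training error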

Step 1: Preparing the Dataset

Before we train the Linear SVM model, let's quickly prepare our dataset by loading it and splitting it into training and testing sets. This process was covered in the previous lesson.

R
# Load the caret package and the iris dataset
library(caret)
data(iris)

# For reproducibility
set.seed(123)

# Splitting data into train and test sets
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE, times = 1)
irisTrain <- iris[trainIndex, ]
irisTest <- iris[-trainIndex, ]
  • data(iris) loads the iris dataset.
  • set.seed(123) ensures reproducibility.
  • createDataPartition(iris$Species, p = 0.7, list = FALSE, times = 1) splits the dataset, stratifying by Species, with 70% allocated for training; a quick check of the resulting split is shown below.
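Because createDataPartition samples within each level of Species, the class proportions are preserved in both subsets. Assuming the code above has been run, a quick sanity check looks like this:

R
# Quick sanity check on the split
table(irisTrain$Species)   # class counts in the training set
table(irisTest$Species)    # class counts in the test set
nrow(irisTrain)            # about 70% of the 150 rows (105 here)
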
Step 2: Training the Linear SVM Model

Now, let's train the Linear SVM model using the train function from the caret package. This function allows us to specify the model formula (the relationship between the target variable and the predictors), the dataset to use, and the method for training the model. In this case, we use svmLinear to train a linear SVM.

R
# Training a Linear SVM model
model <- train(Species ~ ., data = irisTrain, method = "svmLinear")
  • Species ~ . specifies that Species is the target variable we aim to predict, and . means using all other variables in the dataset as predictors.
  • data = irisTrain indicates that we are using the training subset of the iris dataset.
  • method = "svmLinear" tells caret to use a Linear SVM for training.
Step 3: Displaying the Model Details

Once the model is trained, it's important to inspect and interpret its basic details.

R
# Display the model details
print(model)

The print function displays a summary of the trained model, including its resampled accuracy and other useful metrics. This helps you understand how well the model was trained on the dataset.
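Printing is the simplest view, but the object returned by train is list-like, and a couple of its components are handy when you want the same numbers programmatically. A small sketch:

R
# Pull details out of the train object directly
model$results     # data frame of tuning parameters with Accuracy and Kappa
model$finalModel  # the underlying fitted SVM used for predictions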

Step 4: Interpreting the Output

The output of the print(model) command will provide details such as the accuracy of the model and the parameters used for tuning. Here’s a rough idea of what you might see:

Support Vector Machines with Linear Kernel 

105 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 105, 105, 105, 105, 105, 105, ... 
Resampling results:

  Accuracy  Kappa    
  0.962487  0.9428477

Tuning parameter 'C' was held constant at a value of 1
  • Accuracy: The proportion of correct predictions, averaged over the bootstrap resamples of the training data.
  • Kappa: Cohen's kappa, which measures agreement between the predicted and true labels beyond what would be expected by chance; it is a useful complement to accuracy for classification.

Note that these scores are estimated from the training data only (via the bootstrap resamples); the held-out test set is not used here.
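In fact, the reported accuracy and kappa are averages over the 25 bootstrap resamples. If you want to see the individual resample scores behind those averages, they are stored on the model object:

R
# Per-resample metrics behind the summary above
head(model$resample)
colMeans(model$resample[, c("Accuracy", "Kappa")])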

Conclusion

Ready to see your efforts come to life? Let's start the practice section and begin training our first model together.

Enjoyed this lesson? Now it's time to practice with Cosmo!
Practice is how you turn knowledge into actual skills.