Welcome! This lesson paves your path toward understanding machine learning and the powerful Python library `sklearn` (scikit-learn).
Machine learning, an application of artificial intelligence, enables systems to learn and improve from data without being explicitly programmed. It plays a key role in many areas, such as autonomous vehicles, voice recognition systems, and recommendation engines.
Suppose you aim to predict housing prices. This is a standard supervised learning problem: you train a model on past data so it can make predictions about new data. With `sklearn`, you can import the data, preprocess it, select an algorithm (like linear regression), train the model on the training data, and make predictions, all without implementing any algorithm by hand.
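To make this workflow concrete, here is a minimal sketch of those steps, using synthetic data from `make_regression` as a stand-in for real housing records (the data here is illustrative, not a real housing dataset):

```python
# A minimal sketch of the supervised learning workflow described above.
# Synthetic "housing-like" data keeps the example self-contained; a real
# project would load actual historical prices instead.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 200 fictional houses, each described by 3 numeric features
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# Hold some data back so we can evaluate on examples the model never saw
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)          # train the model on past data
predictions = model.predict(X_test)  # predict prices for unseen houses
```

Notice that no algorithm was implemented by hand: the library supplies the model, and we only wire the steps together.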
Datasets form the backbone of machine learning. In this course, we'll use the Iris dataset, which consists of measurements — namely, sepal length, sepal width, petal length, and petal width — for 150 flowers representing three species of iris.
Sklearn provides an easy-to-use `load_iris` function to import the Iris dataset. Let's see how it works:
```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
```
Here, the `load_iris()` function loads the dataset and assigns it to the `iris` variable. We then separate the dataset into `X` for the features and `y` for the target.
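To see what these variables actually contain, you can print a few rows (a quick sketch; the values mentioned in the comments come from the dataset itself):

```python
# X is a NumPy array of shape (150, 4); y holds integer labels 0, 1, 2
# corresponding to the three iris species.
print(X[:3])  # first three rows of the feature matrix
print(y[:3])  # their labels; the first 50 samples all belong to class 0
```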
Furthermore, you can print a more detailed description of the dataset using its `DESCR` attribute, as follows:
```python
print(iris.DESCR)
```
Output:
```
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

The mean sepal length is 5.84 cm. The petal width varies from 0.1 cm to
2.5 cm, indicating a large range.

This is perhaps the best-known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day.
```
This code prints a detailed description of the dataset and its attributes.
With the data loaded, let's explore it using Python and `sklearn`. 'Features' and 'target' are two critical terms here. The features are the attributes of the Iris flower: sepal length, sepal width, petal length, and petal width. The target is the species of the flower, which we aim to predict based on these features.
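The `iris` object also carries human-readable names for both, which makes the distinction concrete:

```python
# Names of the four features and the three target classes
print(iris.feature_names)  # ['sepal length (cm)', 'sepal width (cm)', ...]
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
```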
The `data` and `target` attributes of the `iris` object hold the feature matrix and the response vector, respectively. Their `shape` property tells us about their dimensionality: how many examples we have and how many features each example consists of.
Python1print("Data shape: ", iris.data.shape) # Prints (150, 4) 2print("Targets shape: ", iris.target.shape) # Prints (150,)
Output:
```
Data shape: (150, 4)
Targets shape: (150,)
```
Before feeding our data to a machine learning model, we must split it into a training set and a test set. The training set teaches our model, while the test set evaluates its performance. Sklearn makes this split convenient with the `train_test_split` function from its `model_selection` module.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size: ", len(X_train))  # Prints 120
print("Test set size: ", len(X_test))       # Prints 30
```
Output:
```
Training set size: 120
Test set size: 30
```
Here, the `train_test_split` function has divided our data into a training set containing 80% of the original data and a test set containing the remaining 20%, as requested by `test_size=0.2`.
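As a quick sanity check, you can count how many samples of each species ended up in the training set. Because the split is random, the classes may not be perfectly balanced; `stratify=y` is an optional `train_test_split` argument if you want the original class proportions preserved:

```python
import numpy as np

# Count training samples per class (three counts summing to 120).
# With random_state=42 the split is reproducible but not stratified.
print(np.bincount(y_train))
```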
Each machine learning model in `sklearn` is represented as a Python class. These classes share a common interface with methods for building the model (`fit`), making predictions (`predict`), and evaluating the model's performance (`score`).
In the next, more concrete lesson, you'll see how to apply these methods after selecting a specific type of machine learning model. For now, understand that using any of these models looks something like this:
```python
# model = SomeModel(args)
# model.fit(X_train, y_train)
# predictions = model.predict(X_test)
# score = model.score(X_test, y_test)
```
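As a preview, here is that skeleton filled in with one possible model. The k-nearest neighbors classifier is an arbitrary choice for illustration; the next lesson will select and explain a model properly:

```python
# Hypothetical instantiation of the fit/predict/score pattern above,
# using KNeighborsClassifier purely as an example model.
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)           # build the model from the training data
predictions = model.predict(X_test)   # predict species for the test set
score = model.score(X_test, y_test)   # mean accuracy on the test set
print(score)
```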
Congratulations! With the knowledge acquired in this lesson, you now understand what `sklearn` is, how to import data with it, how to prepare data for machine learning tasks, and the rudimentary structure of `sklearn` models. The upcoming lessons will build on this foundation by introducing more specific machine learning models and optimization tricks. Keep practicing and keep learning as we take our first steps into the exciting world of machine learning!