Welcome to our next course, Introduction to Supervised Machine Learning, a plunge into the intriguing world of Supervised Machine Learning seasoned with a savory twist. Throughout this journey, your senses will be filled with a balanced mix of theory, hands-on exercises, and real-world case studies as we strive to perfect the coveted technique of predicting wine quality.
In this first lesson of the course, you will explore the renowned Wine Quality dataset. This dataset, sourced from the UCI Machine Learning Repository, provides information about various wines and their quality ratings.
A thorough understanding of your dataset is essential before developing machine learning models. A comprehensive dataset review empowers us to identify potential features that can significantly influence output variables. This process is akin to familiarizing oneself with a novel's characters before delving into the plot; possessing nuanced knowledge of the dataset makes the subsequent modeling phase more coherent.
In the spirit of curiosity, the Wine Quality dataset paves the way for us to explore a real-world problem: determining wine quality based on its physicochemical characteristics. As budding machine learning practitioners, this experience enlivens our learning journey by engaging us in practical applications within an accessible context. So, shall we make a toast to learning and dive right in?
As the name suggests, the Wine Quality dataset contains data on wines, specifically the physicochemical properties of red and white variants of Portuguese "Vinho Verde" wine. The dataset consists of 12 variables, including `quality`, the target variable. Here's a quick summary of the columns:
fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol
quality (score between 0 and 10)
Now, let's learn how to load the dataset. As mentioned in the course brief, we'll use the `datasets` Python library, which makes it easy to load a variety of datasets. This specific dataset is already available in the CodeSignal environment.
```python
import datasets
import pandas as pd

# Loading Dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
white_wine = datasets.load_dataset('codesignal/wine-quality', split='white')
red_wine = pd.DataFrame(red_wine)
white_wine = pd.DataFrame(white_wine)

# Checking the shape of the dataset
print("Red Wine Dataset Shape: ", red_wine.shape)  # Red Wine Dataset Shape: (1599, 12)
print("White Wine Dataset Shape: ", white_wine.shape)  # White Wine Dataset Shape: (4898, 12)
```
In the snippet above, we load the red and white wine datasets separately, convert each to a pandas DataFrame, and then print their sizes using the `shape` attribute, which holds the number of rows and columns.
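As a minimal sketch of how `shape` behaves, the example below uses a hypothetical three-row table (`toy_wine`, not part of the real dataset) so that it runs without the CodeSignal environment. It shows that `shape` is read without parentheses and can be unpacked into a row count and a column count.

```python
import pandas as pd

# Hypothetical three-row stand-in for one of the wine tables (illustrative values only)
toy_wine = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5],
    "pH": [3.51, 3.20, 3.26],
    "quality": [5.0, 5.0, 6.0],
})

# shape is an attribute (no parentheses) holding the tuple (n_rows, n_columns)
n_rows, n_cols = toy_wine.shape
print(f"{n_rows} rows, {n_cols} columns")  # 3 rows, 3 columns
```

Applied to the real data, the same unpacking would give 1599 rows and 12 columns for the red wine table.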
Digging deeper, we can examine various features, their types, statistical summaries, and unique value counts for a richer understanding. The Python code below checks the data types of the features.
```python
# Check Red Wine Dataset data types
print("Red Wine Dataset Data Types:")
print(red_wine.dtypes)
"""
Red Wine Dataset Data Types:
fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                 float64
dtype: object
"""

# Check White Wine Dataset data types
print("\nWhite Wine Dataset Data Types:")
print(white_wine.dtypes)
"""
the structure is the same as in the red wine dataset
"""
```
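The output above shows `quality` stored as `float64` even though the scores are whole numbers. As an optional follow-up, and only a sketch on a hypothetical toy table (`toy_wine`), the snippet below checks that every score is a whole number before casting the column to an integer type. Whether you want this conversion depends on the model you plan to train later.

```python
import pandas as pd

# Hypothetical stand-in data; the real tables are loaded in the lesson code
toy_wine = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5],
    "quality": [5.0, 5.0, 6.0],
})

# Confirm that no score has a fractional part before casting
all_whole = (toy_wine["quality"] % 1 == 0).all()
if all_whole:
    toy_wine["quality"] = toy_wine["quality"].astype("int64")

print(toy_wine.dtypes)
```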
Next, we'll obtain a brief stats summary and unique value count using Python:
```python
# Describing Red Wine Dataset
print("Red Wine Dataset Description:")
print(red_wine.describe())
"""
Red Wine Dataset Description:
       fixed acidity  volatile acidity  ...      alcohol      quality
count    1599.000000       1599.000000  ...  1599.000000  1599.000000
mean        8.319637          0.527821  ...    10.422983     5.636023
std         1.741096          0.179060  ...     1.065668     0.807569
min         4.600000          0.120000  ...     8.400000     3.000000
25%         7.100000          0.390000  ...     9.500000     5.000000
50%         7.900000          0.520000  ...    10.200000     6.000000
75%         9.200000          0.640000  ...    11.100000     6.000000
max        15.900000          1.580000  ...    14.900000     8.000000

[8 rows x 12 columns]
"""

# Unique values
print("\nUnique values in Red Wine Dataset:")
print(red_wine.nunique())
"""
Unique values in Red Wine Dataset:
fixed acidity            96
volatile acidity        143
citric acid              80
residual sugar           91
chlorides               153
free sulfur dioxide      60
total sulfur dioxide    144
density                 436
pH                       89
sulphates                96
alcohol                  65
quality                   6
dtype: int64
"""

# Describing White Wine Dataset
print("\nWhite Wine Dataset Description:")
print(white_wine.describe())
"""
White Wine Dataset Description:
       fixed acidity  volatile acidity  ...      alcohol      quality
count    4898.000000       4898.000000  ...  4898.000000  4898.000000
mean        6.854788          0.278241  ...    10.514267     5.877909
std         0.843868          0.100795  ...     1.230621     0.885639
min         3.800000          0.080000  ...     8.000000     3.000000
25%         6.300000          0.210000  ...     9.500000     5.000000
50%         6.800000          0.260000  ...    10.400000     6.000000
75%         7.300000          0.320000  ...    11.400000     6.000000
max        14.200000          1.100000  ...    14.200000     9.000000

[8 rows x 12 columns]
"""

# Unique values
print("\nUnique values in White Wine Dataset:")
print(white_wine.nunique())
"""
Unique values in White Wine Dataset:
fixed acidity            68
volatile acidity        125
citric acid              87
residual sugar          310
chlorides               160
free sulfur dioxide     132
total sulfur dioxide    251
density                 890
pH                      103
sulphates                79
alcohol                 103
quality                   7
dtype: int64
"""
```
Executing the above Python script generates a statistical summary for each feature in the dataset and counts the unique values, thus shedding light on the diversity of the datasets.
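Because pandas abbreviates wide tables with `...`, the printed summaries above hide most of the columns. One way to see every feature is to transpose the summary so that each feature becomes a row. The sketch below assumes a hypothetical four-row table (`toy_wine`) in place of the real data.

```python
import pandas as pd

# Hypothetical stand-in data (illustrative values only)
toy_wine = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.0],
    "quality": [5.0, 6.0, 5.0, 6.0],
})

# Transposing puts one feature per row, with the statistics as columns
summary = toy_wine.describe().T
print(summary)

# nunique() counts the distinct values in each column
unique_counts = toy_wine.nunique()
print(unique_counts)
```

With the real tables, `red_wine.describe().T` should list all 12 features, one per row, instead of a truncated view.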
It is crucial to check if our data contain missing values, as these can significantly affect the outcomes of our data analysis and model accuracy. Here's how to check for missing data:
```python
# Check missing values in Red Wine Dataset
print("Missing values in Red Wine Dataset:")
print(red_wine.isnull().sum())  # No column contains null values

# Check missing values in White Wine Dataset
print("\nMissing values in White Wine Dataset:")
print(white_wine.isnull().sum())  # No column contains null values
```
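Since the wine tables contain no missing values, the check above prints only zeros. To show what the same call reports when gaps do exist, here is a small sketch on a hypothetical table (`toy_wine`) with two values deliberately set to NaN.

```python
import pandas as pd

nan = float("nan")

# Hypothetical table with two deliberately missing entries
toy_wine = pd.DataFrame({
    "pH": [3.51, nan, 3.26],
    "alcohol": [9.4, 9.8, nan],
    "quality": [5.0, 5.0, 6.0],
})

# isnull() marks each missing cell as True; sum() counts them per column
missing_per_column = toy_wine.isnull().sum()
print(missing_per_column)

# A single total is handy as a quick "is anything missing?" check
total_missing = int(missing_per_column.sum())
print("Total missing values:", total_missing)  # Total missing values: 2
```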
Let's go one step further and visualize the target variable, `quality`. We'll use the `matplotlib` library to generate histograms of wine quality for the red and white wine datasets.
```python
import matplotlib.pyplot as plt

# Plot for Red Wine
plt.hist(red_wine.quality, bins=10, color='red', alpha=0.7)
plt.xlabel('Quality')
plt.ylabel('Count')
plt.title('Quality Distribution for Red Wine')
plt.show()

# Plot for White Wine
plt.hist(white_wine.quality, bins=10, color='skyblue', alpha=0.7)
plt.xlabel('Quality')
plt.ylabel('Count')
plt.title('Quality Distribution for White Wine')
plt.show()
```
These histograms visualize the count of wine samples at each quality score, providing insight into how the quality of the wine is distributed.
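The per-score counts behind a histogram can also be checked numerically. The sketch below uses a hypothetical handful of scores (`quality`) rather than the real column. It relies on `value_counts()` to count the samples at each score and on `sort_index()` to order the result by score instead of by frequency.

```python
import pandas as pd

# Hypothetical quality scores standing in for a real quality column
quality = pd.Series([5.0, 6.0, 5.0, 7.0, 6.0, 5.0], name="quality")

# Count the samples at each score, ordered from lowest to highest score
counts = quality.value_counts().sort_index()
print(counts)
```

On the real data, `red_wine['quality'].value_counts().sort_index()` would return six rows, matching the six unique quality values reported earlier.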
By the end of this lesson, you will have a solid understanding of the Wine Quality dataset, including how to load it with the `datasets` Python library.
This understanding sets the foundation for upcoming lessons, in which we'll put on our data scientist hats and begin predicting wine quality!
Are you ready to get hands-on with the Wine Quality dataset? Up next are practice exercises designed to deepen your understanding of datasets and Python programming. These exercises play a pivotal role in the learning process, enabling you to apply the concepts you've learned and strengthen your newfound knowledge. So, grab a glass of your favorite 'vinho' and let's get rolling!