Hello and welcome! In today's lesson, we will focus on visualizing the distribution of diamond prices using histograms and Kernel Density Estimates (KDE). This visualization is a crucial part of Exploratory Data Analysis (EDA) and helps us uncover patterns in our data.
By the end of this lesson, you will be able to create a histogram, overlay it with a KDE, and interpret the resulting visualization effectively.
A histogram is a type of bar plot that groups data points into specified ranges (bins) and then displays the number of points that fall into each bin. This makes histograms useful for understanding the distribution, central tendency, and variability of your data. Here is a simple example, with the corresponding figure below:
```python
import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
plt.hist(data, bins=4)
plt.show()
```
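To make the binning step concrete, here is a minimal pure-Python sketch of equal-width binning. The helper name histogram_counts is hypothetical and not part of Matplotlib. It assumes the data contain at least two distinct values and, as Matplotlib does, treats the last bin as closed on the right.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals.

    Hypothetical helper for illustration; assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    # bins + 1 edges delimit `bins` intervals.
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        # The maximum value would otherwise land in a nonexistent extra bin,
        # so clamp it into the last one (closed on the right).
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts, edges


data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
counts, edges = histogram_counts(data, bins=4)
print(edges)   # [1.0, 1.75, 2.5, 3.25, 4.0]
print(counts)  # [1, 2, 3, 4]
```

Each count becomes the height of one bar, and each pair of neighbouring edges gives that bar's position and width.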
The hist() function in Matplotlib takes several parameters:
- x: The data array for which the histogram will be generated.
- bins: The number of intervals the data range is divided into. It can also be a sequence defining the bin edges.
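- range: A (min, max) tuple giving the lower and upper limits of the bins. Values outside this range are ignored.
- density: If True, it normalizes the histogram so that the total area of the bars equals 1, forming a probability density.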
- cumulative: If True, it computes a cumulative histogram.
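The sketch below tries each of these parameters on the small dataset from the earlier example. It is an illustration rather than a recommended layout: the 2x2 grid of subplots and the variable names are choices made here, and hist() is called through the Axes objects, which accept the same parameters as plt.hist().

```python
import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

fig, axes = plt.subplots(2, 2, figsize=(10, 6))

# bins as explicit edges: two bins, [1, 2.5) and [2.5, 4].
edge_counts, _, _ = axes[0, 0].hist(data, bins=[1, 2.5, 4])
axes[0, 0].set_title("bins=[1, 2.5, 4]")

# range widens the binned interval to 0..5, split into 5 unit-wide bins.
range_counts, range_edges, _ = axes[0, 1].hist(data, bins=5, range=(0, 5))
axes[0, 1].set_title("bins=5, range=(0, 5)")

# density=True rescales the bar heights so the total bar area is 1.
dens, dens_edges, _ = axes[1, 0].hist(data, bins=4, density=True)
axes[1, 0].set_title("density=True")

# cumulative=True makes each bar a running total of the counts so far.
cum_counts, _, _ = axes[1, 1].hist(data, bins=4, cumulative=True)
axes[1, 1].set_title("cumulative=True")

fig.tight_layout()

print(list(edge_counts))   # counts per explicit bin
print(list(range_counts))  # counts per unit-wide bin
print(list(cum_counts))    # running totals
```

Call plt.show() afterwards to display the figure. Each call also returns the bar heights and bin edges, which is a convenient way to check what a parameter actually did.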
A Kernel Density Estimate (KDE) is an estimate of the probability density function of a continuous variable. Unlike a histogram, a KDE is a smooth curve, as presented below, so its shape does not depend on where the bin edges happen to fall (it does depend on the chosen bandwidth, however). This can offer a clearer picture of the overall shape of the data.
```python
import matplotlib.pyplot as plt
import seaborn as sns

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

sns.kdeplot(data)
plt.show()  # needed to display the plot when running as a script
```
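To show where the smooth curve comes from, here is a minimal pure-Python sketch of a Gaussian KDE: one small bell curve is centred on every data point and the curves are averaged. The function name gaussian_kde and the fixed bandwidth of 0.5 are assumptions made for illustration; seaborn derives its default bandwidth from the data instead.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples`.

    Illustrative sketch with a fixed, hand-picked bandwidth.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump centred on each sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
density = gaussian_kde(data, bandwidth=0.5)

# Evaluate on a grid that extends well past the data on both sides.
step = 0.01
grid = [-2 + i * step for i in range(901)]  # -2.00 .. 7.00
area = sum(density(x) * step for x in grid)

print(abs(area - 1.0) < 1e-3)    # the curve integrates to roughly 1
print(density(4) > density(1))   # more samples at 4 than at 1
```

Because every sample contributes a bump, values that occur often (such as 4 here) produce a higher curve than values that occur rarely.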
The kdeplot() function takes several parameters:
- data: The dataset array for which the KDE plot will be generated.
- bw_adjust: A factor that adjusts the bandwidth of the kernel, affecting smoothness. Higher values increase smoothness.
- fill: If True, it fills the area under the KDE curve with a color. Older seaborn versions called this parameter shade, which is now deprecated in favor of fill.
- clip: A tuple defining the range within which to restrict the plot.
- cumulative: If True, it plots the cumulative distribution.
In short, the two plots complement each other:
- Histograms are ideal for understanding the frequency of data points in different ranges.
- KDEs provide a smoothed-out representation of the data distribution, which can make it easier to identify trends that are not immediately evident in a histogram.
Let's now load the diamonds dataset and plot our first histogram for the 'price' column. We will use sns.histplot() from the seaborn library, which makes it easy to plot a histogram and customize it as needed.
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Downloads the dataset on first use, so this needs an internet connection.
diamonds = sns.load_dataset('diamonds')

plt.figure(figsize=(10, 6))
sns.histplot(diamonds['price'], bins=50)
plt.title('Distribution of Diamond Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
```
- Bins: The number of intervals the data range is divided into. Selecting the right number of bins is crucial: too few bins hide detail, while too many make the plot noisy.
- Figure Size: Adjusted using figsize for better readability.
- Labels: Titles and labels make the histogram more informative.
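To get a feel for how the number of bins changes the picture, the sketch below bins the same values three different ways. The prices are synthetic right-skewed numbers generated here as a stand-in for diamonds['price'], and histogram_counts is a hypothetical helper, so the printed figures only illustrate the trend.

```python
import random


def histogram_counts(values, bins):
    """Equal-width bin counts (hypothetical helper; last bin closed on the right)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


# Synthetic right-skewed "prices": an assumption, not the real diamonds data.
rng = random.Random(0)
prices = [rng.lognormvariate(8.0, 0.8) for _ in range(1000)]

summary = {}
for bins in (5, 50, 500):
    counts = histogram_counts(prices, bins)
    empty = sum(1 for c in counts if c == 0)
    summary[bins] = (max(counts), empty)
    print(f"bins={bins:>3}  tallest bar={max(counts):>4}  empty bins={empty}")
```

With very few bins, most values pile into one or two bars and the shape is lost. With very many bins, the bars become short and many are empty, so the plot looks noisy. A value in between usually shows the shape best.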
To enhance our histogram, we can overlay it with a KDE. The bars show how many diamonds fall into each price range, while the KDE curve smooths over the bin boundaries and makes the overall shape of the distribution easier to see.
```python
plt.figure(figsize=(10, 6))
sns.histplot(diamonds['price'], kde=True, bins=50)
plt.title('Distribution of Diamond Prices with KDE')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
```
- KDE Line: This line gives us a smoothed estimate of the data's distribution.
- Combining Histograms and KDEs: The bars show the counts in each bin, while the curve shows the overall shape, so the two together give a more complete view than either alone.
Customization is key to making our plots not only informative but also presentable. We can modify colors, edge colors, grid lines, and more to improve the plot's readability and aesthetics.
```python
plt.figure(figsize=(10, 6))
sns.histplot(diamonds['price'], kde=True, bins=50, color='blue', edgecolor='black')
plt.title('Customized Distribution of Diamond Prices with KDE')
plt.xlabel('Diamond Price ($)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
```
- Color and Edgecolor: Help distinguish adjacent bars from each other and from the background.
- Axis Labels and Title: Provide clearer context, such as the unit of the price axis.
- Grid Lines: Improve readability by making it easier to follow trends and read off values.
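Two further customizations are sometimes helpful: formatting the tick labels as dollar amounts and marking a reference value such as the median. The sketch below shows one possible way to do both with Matplotlib. The prices are synthetic values generated here as a stand-in for diamonds['price'], and the colors and styles are arbitrary choices.

```python
import random
import statistics

import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter

# Synthetic right-skewed "prices": an assumption, not the real diamonds data.
rng = random.Random(0)
prices = [rng.lognormvariate(8.0, 0.8) for _ in range(1000)]
median_price = statistics.median(prices)

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(prices, bins=50, color="steelblue", edgecolor="black")

# Show tick labels as dollar amounts with thousands separators.
dollar_format = StrMethodFormatter("${x:,.0f}")
ax.xaxis.set_major_formatter(dollar_format)

# Mark the median with a dashed vertical line.
median_line = ax.axvline(
    median_price,
    color="red",
    linestyle="--",
    label=f"Median: ${median_price:,.0f}",
)

ax.set_title("Distribution of Prices with Median Marked")
ax.set_xlabel("Price")
ax.set_ylabel("Frequency")
ax.grid(True, axis="y", alpha=0.3)  # light horizontal grid lines only
ax.legend()
```

Call plt.show() to display the figure. Restricting the grid to the y-axis keeps the bars easy to compare without cluttering the plot.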
Finally, let's discuss how to interpret our histogram and KDE plot. If the tallest bars sit at the lower end of the price range and the bar heights decline gradually as prices increase, lower-priced diamonds are more common. The KDE line shows the same pattern as a smooth curve.
- Central Tendency: Identify where most of the data points center, indicated by the peak of the histogram and KDE.
- Skewness: Notice if the data is skewed to the left (negatively) or right (positively). For example, the diamond prices may be positively skewed if most diamonds are cheaper, with a long tail for higher-priced diamonds.
- Spread and Variability: Examine the width of the histogram and KDE to understand the variability in diamond prices.
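These three ideas can also be checked numerically. The sketch below computes the mean, median, standard deviation, and a moment-based skewness with Python's standard library. The prices are synthetic right-skewed values generated here as a stand-in for diamonds['price'], so the numbers it prints are not those of the real dataset.

```python
import random
import statistics

# Synthetic right-skewed "prices": an assumption, not the real diamonds data.
rng = random.Random(0)
prices = [rng.lognormvariate(8.0, 0.8) for _ in range(1000)]

mean = statistics.fmean(prices)
median = statistics.median(prices)
stdev = statistics.pstdev(prices)  # population standard deviation

# Moment-based skewness: the average cubed z-score.
# Positive values indicate a longer right tail, negative a longer left tail.
skewness = sum(((p - mean) / stdev) ** 3 for p in prices) / len(prices)

print(f"mean={mean:,.0f}  median={median:,.0f}  stdev={stdev:,.0f}")
print(f"skewness={skewness:.2f}")
print(mean > median)  # the long right tail pulls the mean above the median
```

In a positively skewed distribution, the few very high prices pull the mean above the median, which matches the long right tail seen in the plot.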
Great job! Today, you've learned how to visualize the distribution of diamond prices using histograms and KDE. You can now load datasets, plot histograms, add KDE overlays, customize your plots, and interpret the visualizations effectively.
These skills are vital for exploratory data analysis, allowing you to uncover hidden patterns and distributions in your data. Practice these techniques with other dataset columns to master the art of data visualization, and let's continue our journey in data science!