Lesson 5

Visualizing Distributions with Histograms Using Seaborn

Setting the Stage

Can you recall from our last lesson how we used Seaborn to make our plots more aesthetically pleasing? We'll continue our journey with Seaborn in this lesson, but this time, we'll explore a different type of visualization - histograms.

Histograms are powerful graphical representations that allow us to inspect the underlying frequency distribution (shape) of a continuous or discrete data set. This is particularly useful when we want to visualize the distribution of a variable over a range of values.

Why is understanding the data distribution important, you might ask? In data analytics and statistics, many statistical tests and models assume that the data follows a particular distribution pattern. A histogram gives us a quick visual check of these assumptions. In other words, knowing our data well sets the stage for more complex analyses later on.

This lesson will take you further into Seaborn's capabilities. We'll cover how to create and customize histograms, offering a sharper lens to inspect our Titanic dataset.

Diving into Histograms

Let's illustrate a histogram using the passenger ages (age) from titanic_df. As we saw in our previous lessons, there were a variety of ages amongst the passengers that should make for an interesting distribution.

Seaborn provides a function called histplot for creating histograms. Here's the basic syntax:

Python
import seaborn as sns

titanic_df = sns.load_dataset('titanic')

sns.histplot(data=titanic_df, x='age', kde=True)

In the code snippet above, we tell Seaborn to plot the age column of our titanic_df DataFrame. Setting kde=True additionally draws a Kernel Density Estimation (KDE) curve, which estimates the probability density function of the variable age (more on this shortly).

This delivers a histogram that shows the distribution of passenger ages. The x-axis represents the ages, and the y-axis represents the number of passengers with ages within the corresponding bin of the histogram.

[Figure: histogram of passenger ages with a KDE curve overlaid]
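To make the idea of bins and counts concrete, here is a minimal sketch that computes the numbers behind a histogram with NumPy instead of drawing it. The ages below are hypothetical values invented for illustration, not the Titanic data, and the choice of seven bins over the range 0 to 70 is likewise an assumption.

```python
import numpy as np

# Hypothetical ages, invented for illustration (not taken from the Titanic data).
ages = np.array([2, 9, 18, 21, 22, 24, 27, 29, 30, 31, 35, 38, 42, 47, 54, 63])

# Split the range 0-70 into 7 equal-width bins and count the values in each bin.
counts, edges = np.histogram(ages, bins=7, range=(0, 70))

# Print a tiny text histogram: one row per bin, one '#' per observation.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>3.0f} to {right:<3.0f} {'#' * int(count)} ({int(count)})")
```

Each row corresponds to one bar: the bin edges play the role of the x-axis, and the count is the bar's height on the y-axis. In this sample, the bin from 20 to 30 holds five of the sixteen ages, so it would be the tallest bar.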

Understanding Kernel Density Estimation (KDE)

You may wonder what the smooth, continuous line overlaying our histogram represents. This smooth line, created by turning on the kde parameter in histplot, is a Kernel Density Estimate (KDE) plot that provides a smooth estimate of our distribution.

The KDE is useful when we want to derive a smooth, continuous function from our discrete observations. Often, this can make the output much more interpretable, aiding in the presentation of our data. However, remember that KDEs are just estimates. The true distribution of your data may be different, especially if you have a small number of observations.
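If you are curious how such a curve can be computed, the sketch below builds a Gaussian KDE by hand with NumPy: it places one small bell curve on every observation and averages them. The ages are hypothetical, the helper name gaussian_kde_curve is made up for this example, and the bandwidth uses Scott's rule of thumb as an assumption. Seaborn's own curve may use different settings, so treat this as an illustration of the idea rather than a reproduction of histplot.

```python
import numpy as np

# Hypothetical ages, invented for illustration (not taken from the Titanic data).
ages = np.array([2, 9, 18, 21, 22, 24, 27, 29, 30, 31, 35, 38, 42, 47, 54, 63],
                dtype=float)


def gaussian_kde_curve(samples, grid, bandwidth):
    """Average one Gaussian bump per observation, evaluated at each grid point."""
    # z[i, j] is the scaled distance between grid point i and sample j.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return bumps.sum(axis=1) / (len(samples) * bandwidth)


# Assumption: Scott's rule of thumb for the bandwidth (the width of each bump).
bandwidth = ages.std(ddof=1) * len(ages) ** (-1 / 5)

# Evaluate the curve on a grid that extends a little beyond the data.
grid = np.linspace(ages.min() - 4 * bandwidth, ages.max() + 4 * bandwidth, 500)
density = gaussian_kde_curve(ages, grid, bandwidth)

# A density should enclose an area of about 1; check with a simple Riemann sum.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(f"bandwidth = {bandwidth:.2f}, area under curve = {area:.3f}")
```

A smaller bandwidth makes the curve follow individual observations more closely, while a larger one smooths them out. This is one reason a KDE drawn from few observations should be read with caution.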

Here is one more example of using KDE in action!

Python
sns.histplot(data=titanic_df, x='fare', kde=True, color='green')

[Figure: green histogram of passenger fares with a KDE curve overlaid]

As you can see, the KDE gives us a smooth curve that fits our observations, providing a pleasing and easy-to-understand representation of our distribution.

Customizing Your Histogram

Like all plots in Seaborn, histograms are highly customizable. Let's look at improving the readability of our histogram by adding more bins and labeling our axes.

Python
import matplotlib.pyplot as plt

# Use 30 bins (by default, Seaborn chooses the number of bins automatically)
sns.histplot(data=titanic_df, x='age', bins=30, kde=True)

# Give your plot a descriptive title
plt.title('Age Distribution Among Titanic Passengers')

# Label your axes
plt.xlabel('Age')
plt.ylabel('Number of Passengers')
plt.show()

[Figure: 30-bin histogram of passenger ages with a KDE curve, a title, and axis labels]

Our histogram is instantly more legible, offering an improved perception of the underlying distribution. The increased number of bins provides a more defined structure for our data, while the informative title and labels allow us to understand the plot without any additional context.
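The effect of the bins parameter can also be seen in numbers. The following sketch, which again uses hypothetical ages rather than the Titanic data, counts the same values with 5, 10, and 30 bins and records how wide each bin is and how many bins end up empty.

```python
import numpy as np

# Hypothetical ages, invented for illustration (not taken from the Titanic data).
ages = np.array([2, 9, 18, 21, 22, 24, 27, 29, 30, 31, 35, 38, 42, 47, 54, 63])

summary = {}
for n_bins in (5, 10, 30):
    # Equal-width bins spanning the minimum to the maximum of the data.
    counts, edges = np.histogram(ages, bins=n_bins)
    summary[n_bins] = {
        "width": float(edges[1] - edges[0]),   # width of one bin
        "tallest": int(counts.max()),          # height of the tallest bar
        "empty": int(np.sum(counts == 0)),     # bins with no observations
    }
    print(n_bins, summary[n_bins])
```

More bins means narrower bins and finer detail, but with a small sample many of those bins stay empty and the shape becomes noisy. The right number of bins therefore depends on how much data you have.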

Further Customization of Histograms Using Seaborn

Seaborn's histplot function enables the drawing of histograms with rich features. Here are some additional useful parameters that you should know about:

  • hue: Use this parameter when you have a categorical column whose groups you want to compare on the same histogram. For example, hue="sex" tells Seaborn to color the bars of the age distribution differently for male and female passengers.
  • multiple: This parameter is used with hue to change how the different categories are displayed on the histogram. The default is "layer", but you could set it to "stack", "fill" or "dodge".
    • "layer": Draw one histogram per category. Each histogram is a separate layer, and the layers are superimposed on each other.
    • "stack": Draw one histogram, stacking the bars of each category on top of one another.
    • "fill": Draw one histogram in which every bin is stretched to the full height of the plot, so each category's segment shows its share of that bin (like a percentage plot, where the whole bar is 100%).
    • "dodge": Draw the bars of each category side by side within each bin, narrowing and shifting them so that every category remains visible.
  • palette: This parameter allows you to change the colors used for the different categories.
  • binwidth: This parameter allows you to set the width of the bins rather than the number of bins. This can be useful for a more direct control of the granularity of the histogram.
  • element: By default, the histogram is made of bars, but you could set this parameter to "step" or "poly" to change the appearance of the histogram.

Let's look at examples demonstrating some of these parameters:

Python
# A histogram using 'hue', 'multiple', and 'palette'
sns.histplot(data=titanic_df, x='age', hue="sex", multiple="stack", palette="pastel")

[Figure: stacked histogram of passenger ages, colored by sex with the pastel palette]
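To see what "stack" and "fill" do to the bar heights, here is a sketch of the underlying arithmetic in NumPy. The two age arrays are hypothetical groups invented for illustration, and the code mimics the idea rather than Seaborn's internals.

```python
import numpy as np

# Hypothetical ages for two groups, invented for illustration.
male = np.array([4, 19, 22, 25, 28, 31, 34, 40, 52, 61], dtype=float)
female = np.array([2, 16, 18, 24, 27, 29, 35, 38], dtype=float)

# Both groups share the same bin edges so their bars line up.
edges = np.arange(0, 71, 10)
male_counts, _ = np.histogram(male, bins=edges)
female_counts, _ = np.histogram(female, bins=edges)

# "stack": the second group's bars sit on top of the first group's bars,
# so the top of each stacked bar is the sum of both counts.
stacked_top = male_counts + female_counts

# "fill": every bin is rescaled so that the groups' shares add up to 1.
# Bins with no observations are left at 0 to avoid dividing by zero.
male_share = np.divide(male_counts, stacked_top,
                       out=np.zeros(len(stacked_top)), where=stacked_top > 0)
female_share = np.divide(female_counts, stacked_top,
                         out=np.zeros(len(stacked_top)), where=stacked_top > 0)

print("male counts:  ", male_counts)
print("female counts:", female_counts)
print("stacked tops: ", stacked_top)
print("male share:   ", np.round(male_share, 2))
```

With "stack", the overall outline still shows the total distribution, whereas "fill" discards the totals and shows only the proportion of each group within a bin.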

Python
# A histogram using 'binwidth' and 'element'
sns.histplot(data=titanic_df, x="age", binwidth=1, element="step", color="purple")

[Figure: purple step histogram of passenger ages with one-year-wide bins]
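The difference between bins and binwidth is that one fixes how many bins there are and the other fixes how wide each bin is. The sketch below shows one way to turn a bin width into bin edges with NumPy. The ages and the width of 5 are hypothetical, and starting the first bin at the smallest value is an assumption, so Seaborn may align its edges differently.

```python
import numpy as np

# Hypothetical ages, invented for illustration (not taken from the Titanic data).
ages = np.array([2, 9, 18, 21, 22, 24, 27, 29, 30, 31, 35, 38, 42, 47, 54, 63])

binwidth = 5  # every bin covers 5 years

# Assumption: the first bin starts at the smallest observation.
start = ages.min()

# Use as many bins of this width as are needed to reach the largest value.
n_bins = int(np.ceil((ages.max() - start) / binwidth))
edges = start + binwidth * np.arange(n_bins + 1)

counts, _ = np.histogram(ages, bins=edges)
print("number of bins:", n_bins)
print("edges:", edges)
print("counts:", counts)
```

Fixing the width is convenient when the unit is meaningful, such as one-year or five-year age groups, because the bins stay comparable even if the range of the data changes.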

Wrapping Up

Congratulations! You have now extended your Seaborn skills to draw and customize histograms, one of the first tools to reach for in exploratory data analysis. As we learned, a histogram shows the distribution of a numerical variable and is among the simplest ways to represent one. Thanks to the customization options in Seaborn and Matplotlib, we've seen how easy it is to make a histogram more understandable.

What's next?

Have you satisfied your curiosity about the age distribution on the Titanic? Are you starting to feel the power of what you can do with histograms? Fantastic! Take this excitement with you to the next activity, which cements your understanding of histograms by letting you create and edit them on your own.
