Are you ready for another session? Today, we are taking a step further into the world of data visualization by learning how to use box plots. A box plot gives a snapshot of a dataset's distribution and flags potential outliers, all in one plot!
Box plots are crucial in understanding the `Titanic` dataset, particularly in discovering relationships between survival rates, passenger classes, and fares. This can answer our central question: How did the passenger class and fare correlate with survival?
A box plot, also known as a box-and-whisker plot, is a standardized way of displaying the data distribution based on a five-number summary: the minimum, the first quartile, the sample median, the third quartile, and the maximum. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the interquartile range.
We can create a box plot using the `boxplot()` function in the Python Seaborn library. First, let's start with `pclass` (passenger class) against `fare`:
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
titanic_df = sns.load_dataset('titanic')

# Create a box plot
sns.boxplot(x='pclass', y='fare', data=titanic_df)
plt.title('Fares vs Passenger Classes')
plt.show()
```
In the box plot:
- The box represents the interquartile range (i.e., 25th to 75th percentile) of the fares in each passenger class.
- The line in the middle of the box is the median fare price in that class.
- The whiskers (lines extending from the box) reach the lowest and highest fares that lie within 1.5 times the interquartile range below the lower quartile and above the upper quartile.
- Any points beyond the whiskers can be considered outliers in the fare distribution within each class.
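To make the rules in this list concrete, here is a minimal sketch that computes the same quantities by hand using only the standard library. The helper name `box_plot_stats` and the sample fares are made up for illustration; they are not part of Seaborn or the Titanic dataset, and the 1.5 multiplier is simply the rule described above.

```python
from statistics import quantiles

def box_plot_stats(values, whis=1.5):
    """Return the quantities a box plot draws for one group of values."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between data points,
    # the same interpolation NumPy's percentile function uses by default.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers stop at the most extreme observations inside the fences.
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Made-up fares for illustration only, not taken from the Titanic dataset.
sample_fares = [7, 8, 8, 10, 13, 15, 26, 30, 80]
stats = box_plot_stats(sample_fares)
print(stats)
```

Here the box spans 8 to 26, the upper fence sits at 26 + 1.5 × 18 = 53, so the upper whisker stops at 30 and the fare of 80 is drawn as an individual outlier point.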
A great feature of box plots in Seaborn is the `hue` parameter, which adds a third dimension of categorical data. For instance, we can differentiate the passengers who survived from those who didn't on the same `pclass` vs `fare` plot:
```python
sns.boxplot(x='pclass', y='fare', hue='survived', data=titanic_df)
plt.title('Fares vs Passenger Classes Differentiated by Survival')
plt.show()
```
This plot compares the fare distributions of survivors and non-survivors side by side within each passenger class.
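If you want numbers to read alongside the grouped plot, one option is to tabulate the median fare for each class and survival combination, since the median is the line drawn inside each box. The sketch below assumes pandas is available; the helper name `median_fare_by_group` and the tiny `toy_df` frame are hypothetical and contain made-up values, so treat the printed figures as a demonstration of the shape of the result, not as Titanic statistics.

```python
import pandas as pd

def median_fare_by_group(df):
    """Median fare for each (pclass, survived) pair as a pclass x survived table."""
    return df.groupby(["pclass", "survived"])["fare"].median().unstack("survived")

# Tiny made-up frame for illustration only; with the real data you would
# pass the titanic_df loaded earlier with sns.load_dataset('titanic').
toy_df = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "survived": [0, 0, 1, 1, 0, 0, 1, 1],
    "fare":     [50.0, 70.0, 80.0, 100.0, 7.0, 9.0, 8.0, 12.0],
})
table = median_fare_by_group(toy_df)
print(table)
```

Each row of the table corresponds to one group of boxes on the plot, and each column to one hue level.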
There are many ways you can modify your box plot to better suit your needs, such as:
- `orient`: if set to `"h"`, it changes the box plot orientation from vertical to horizontal. Alternatively, you can swap the `x` and `y` values in the box plot configuration.
- `width`: adjusts the width of the boxes.
- `palette`: modifies the color palette.
- `linewidth`: adjusts the width of the lines that outline the plot elements.
Let's try them:
```python
sns.boxplot(
    x='survived', y='fare',
    hue='pclass', data=titanic_df,
    palette='Set3', linewidth=1.5
)
plt.title('Survival and Passenger Classes by Fare')
plt.show()
```
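The `orient` and `width` options from the list can be sketched in a similar way. This is only an illustration: the helper name `horizontal_fare_boxplot` and the small `toy_df` frame are hypothetical, and the fares in it are made up. Because `pclass` is stored as a number, the sketch both swaps `x` and `y` and states the orientation explicitly.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def horizontal_fare_boxplot(df, box_width=0.5):
    """Draw the fare distribution of each passenger class as horizontal boxes."""
    ax = sns.boxplot(
        x="fare", y="pclass",  # numeric variable on x, categories on y
        orient="h",            # pclass is numeric, so state the orientation
        width=box_width,       # narrower than Seaborn's 0.8 default
        data=df,
    )
    ax.set_title("Fares by Passenger Class (horizontal)")
    return ax

# Made-up values for illustration only; pass titanic_df to plot the real data.
toy_df = pd.DataFrame({
    "pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "fare":   [80.0, 60.0, 120.0, 20.0, 13.0, 26.0, 7.0, 8.0, 15.0],
})
ax = horizontal_fare_boxplot(toy_df)
plt.show()
```

Horizontal boxes tend to be easier to read when the category labels are long or when there are many categories.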
In addition to the above, Seaborn's `boxplot()` function has more parameters that you can use to enhance your box plots and tailor them to your needs. Let's dive a bit deeper!
- `order`: changes the display order of the categorical levels; pass the levels in the order you want.
- `hue_order`: similar to the `order` parameter, `hue_order` changes the display order of your hue variable levels.
- `color`: a single color to use for all boxes.
- `saturation`: the proportion of the original color saturation used for the box fills (0.75 by default). Lower values give more muted colors, and 1 draws the colors exactly as specified.
- `dodge`: when hue nesting is used, whether elements should be shifted along the categorical axis.
- `fliersize`: size of the markers used to indicate outlier observations.
Here is how these parameters can be used in a sample box plot:
```python
sns.boxplot(
    x='pclass', y='fare',
    hue='survived',
    data=titanic_df,
    palette='Set3', linewidth=1.5,
    order=[3, 1, 2], hue_order=[1, 0],
    # With hue and palette both set, palette decides the box colors,
    # so color has no visible effect in this call.
    color='skyblue', saturation=0.7,
    dodge=True, fliersize=5
)
plt.title('Fares vs Passenger Classes Differentiated by Survival')
plt.show()
```
Congratulations on mastering another vital tool in data visualization: the box plot! With this helpful visualization technique, you can now gain insights into the distribution of your dataset and detect any outliers.
In this lesson, we also learned about the importance of inspecting relationships between variables with our `Titanic` dataset, focusing on variables like passenger class, fare, and survival. This way, we can observe patterns and reach more precise conclusions about survival rates and contributing factors like fare and class.
Next up, we have some practice to give you hands-on experience with data visualization, further refining your skills. Remember, practice is a crucial step towards mastering these concepts and developing your skills further!