Welcome to the next guide on our remarkable voyage! Continuing our discussion of multivariate data visualization, we'll introduce scatter plots, one of the most powerful tools for visualizing the relationship between variables. We'll walk through scatter plots for different variable pairs in our Titanic dataset and then go a step further by measuring how these variables correlate.
Why is it important to understand the correlation among variables? Imagine we want to know whether passengers in the higher classes were more likely to survive. Or maybe we are interested in whether the fare paid for a ticket correlates with survival on the Titanic. Finding correlations among variables helps us generate hypotheses, create insightful visualizations, and eventually enables efficient predictive modeling.
By the end of this lesson, you'll be conversant with how scatter plots and correlation techniques can be used to explore and visualize relationships between different features present in a multivariate dataset.
A scatter plot is a versatile visualization tool that can disclose the relationship, if any exists, between two variables. Each point on the plot represents an observation in the dataset, with its position along the X and Y axes representing the values of two variables.
Let's start with a scatter plot depicting the relationship between `age` and `fare`.
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Display Scatter Plot of Age vs Fare
sns.scatterplot(x='age', y='fare', data=titanic)
plt.title("Age vs Fare")
plt.show()
```
In the `scatterplot()` function:

- `x` is the data along the horizontal axis.
- `y` is the data along the vertical axis.
- `data` is the data source; it is needed here because `x` and `y` are given as column names.

Looking at the scatter plot, there seems to be no apparent correlation between `age` and `fare`. But what if we bring another variable, passenger class (`pclass`), into our analysis? We might hypothesize that higher-class passengers (1st or 2nd) paid higher fares regardless of age.
Using the `hue` parameter, we can visualize this by color-coding our scatter plot. Setting `hue='pclass'` gives data points from different passenger classes different colors:
```python
sns.scatterplot(x='age', y='fare', hue='pclass', data=titanic)
plt.title("Age vs Fare (Separate colors for Passenger Class)")
plt.show()
```
- `hue`: you can think of it as a third dimension of the data; it sets the color of each data point using an additional variable.
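The hypothesis that higher classes paid more can also be checked numerically. The following is a minimal sketch, assuming a DataFrame with `pclass` and `fare` columns like the Titanic one; the helper name `median_fare_by_class` is hypothetical, and the toy rows are invented for illustration, not taken from the dataset.

```python
import pandas as pd


def median_fare_by_class(df: pd.DataFrame) -> pd.Series:
    """Return the median fare for each passenger class, ordered by class."""
    return df.groupby("pclass")["fare"].median().sort_index()


# Toy rows invented for illustration only (NOT real Titanic records).
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "age": [38.0, 54.0, 27.0, 14.0, 22.0, 35.0],
    "fare": [70.0, 50.0, 20.0, 30.0, 7.0, 8.0],
})

fare_by_class = median_fare_by_class(toy)
print(fare_by_class)
```

Passing the `titanic` DataFrame loaded earlier instead of `toy` would give the per-class medians for the real data.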
To add further dimensions to your scatter plot, you can use different marker styles for different categories and different sizes to represent another numerical variable. Let's try adding styles based on `sex` and sizes based on `fare`.
```python
sns.scatterplot(x='age', y='fare', hue='pclass', style='sex',
                size='fare', sizes=(20, 200), data=titanic)
plt.title("Age vs Fare (Separate markers for Sex and Sizes for Fare)")
plt.show()
```
Here, `style` has been used to draw different markers for `male` and `female`, and `size` has been used to vary the point size with `fare`. `sizes=(20, 200)` sets the range of sizes used to scale the scatter plot points. By adding both `style` and `size`, we achieve a four-variable scatter plot in a two-dimensional space.
- `style`: draws different markers on the plot for different categories.
- `size`: sets the size of each marker using an additional variable, adding another layer of information to the two-dimensional plot.

While scatter plots may visually hint at correlations, we need correlation coefficients to quantify their extent. A correlation coefficient is a numerical measure of the statistical relationship between two variables. It ranges from -1 to 1, where:
- `+1` represents a perfect positive linear relationship between variables,
- `-1` represents a perfect negative linear relationship between variables,
- `0` suggests no linear relationship between variables.

Let's determine the correlation between all numeric variables in the Titanic dataset. For this, we'll use the `corr()` method of pandas:
```python
# Correlation of all numeric variables in the Titanic dataset
corr_vals = titanic.corr(numeric_only=True)
print(corr_vals)
```
This code outputs:
```
          survived    pclass       age     sibsp     parch      fare
survived  1.000000 -0.338481 -0.077221 -0.035322  0.081629  0.257307
pclass   -0.338481  1.000000 -0.369226  0.083081  0.018443 -0.549500
age      -0.077221 -0.369226  1.000000 -0.308247 -0.189119  0.096067
sibsp    -0.035322  0.083081 -0.308247  1.000000  0.414838  0.159651
parch     0.081629  0.018443 -0.189119  0.414838  1.000000  0.216225
fare      0.257307 -0.549500  0.096067  0.159651  0.216225  1.000000
```
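This output gives the correlation coefficients for all pairs of numerical variables in the dataset. For example, `pclass` and `fare` show the strongest relationship (about -0.55): passengers in 1st class, the lowest `pclass` value, tended to pay higher fares.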
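To answer the opening questions about survival, one option is to rank the variables by how strongly they correlate with `survived`. The sketch below rebuilds the matrix from the values printed above so that it runs on its own; with the real data you would use `corr_vals` directly. The names `with_survived` and `ranked` are illustrative.

```python
import pandas as pd

# Correlation matrix copied from the output shown above.
cols = ["survived", "pclass", "age", "sibsp", "parch", "fare"]
values = [
    [1.000000, -0.338481, -0.077221, -0.035322, 0.081629, 0.257307],
    [-0.338481, 1.000000, -0.369226, 0.083081, 0.018443, -0.549500],
    [-0.077221, -0.369226, 1.000000, -0.308247, -0.189119, 0.096067],
    [-0.035322, 0.083081, -0.308247, 1.000000, 0.414838, 0.159651],
    [0.081629, 0.018443, -0.189119, 0.414838, 1.000000, 0.216225],
    [0.257307, -0.549500, 0.096067, 0.159651, 0.216225, 1.000000],
]
corr_vals = pd.DataFrame(values, index=cols, columns=cols)

# Drop the trivial self-correlation, then sort by absolute strength.
with_survived = corr_vals["survived"].drop("survived")
ranked = with_survived.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships, such as the one with `pclass`, at the top instead of the bottom.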
The `corr()` method of pandas calculates the pairwise correlation of columns, excluding NA/null values. It operates on `Series` as well as `DataFrame` objects. We use `numeric_only=True` to show correlation only for numeric columns (of `int`, `float`, and `bool` type).
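Kudos! You've now entered the world of multivariate analysis, learned about scatter plots, and understood the correlation of variables. To recap, we covered scatter plots with seaborn, adding further dimensions with `hue`, `style`, and `size`, and computing correlation coefficients with the pandas `corr()` method.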
With these skills in your repertoire, you can explore more intricate relationships among variables, gain insightful knowledge, and represent it effectively.
Next up are several real-world exercises on the Titanic dataset to help you solidify your understanding and get comfortable with multivariate analysis. Keep practicing!