Welcome to the next step in our exploration journey, where we dive deeper into using heatmaps for correlation analysis. Correlation analysis is a key method for understanding the relationship between two or more variables: when one variable changes, how does the other tend to change?
Heatmaps are a powerful visual tool that lets us examine and understand complex correlations and interdependencies across multiple variables. They are widely used for exploring the correlations between features and visualizing correlation matrices.
Correlation analysis and visualization using heatmaps provide vital insights, especially in real-world scenarios where we need to understand how multiple features relate to a target. For instance, in our Titanic dataset, we will uncover interdependencies between variables such as `age`, `fare`, `pclass`, and `survived`.
We start by loading the Titanic dataset using Seaborn, the data visualization library:
```python
import seaborn as sns

# Load Titanic dataset
titanic_df = sns.load_dataset('titanic')
```
In Python, correlation analysis can be quickly performed using the `corr()` method available in the Pandas library. Applying it to a DataFrame returns the correlation matrix. Each cell in the matrix holds the correlation coefficient, which measures the statistical relationship between a pair of variables.
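To make the meaning of a single cell concrete, here is a minimal sketch that computes the Pearson coefficient by hand and compares it with the value `corr()` returns (Pearson is the default method of `corr()`). The `hours` and `score` columns and their values are invented purely for illustration and are not part of the Titanic dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not from the Titanic dataset)
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0],
    "score": [52.0, 55.0, 61.0, 70.0, 72.0],
})

# Pearson r = sum of products of deviations, divided by the
# square root of the product of the summed squared deviations
x = df["hours"].to_numpy()
y = df["score"].to_numpy()
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# The same number sits in the off-diagonal cell of the correlation matrix
r_pandas = df.corr().loc["hours", "score"]

print(r_manual, r_pandas)
```

The two printed numbers should agree up to floating-point rounding, and the diagonal of the matrix is always 1 because every variable is perfectly correlated with itself.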
Let's move ahead and calculate the correlation matrix for our Titanic dataset:
```python
# Calculate correlation matrix
correlation_matrix = titanic_df.corr(numeric_only=True)

print(correlation_matrix)
"""
            survived    pclass       age  ...      fare  adult_male     alone
survived    1.000000 -0.338481 -0.077221  ...  0.257307   -0.557080 -0.203367
pclass     -0.338481  1.000000 -0.369226  ... -0.549500    0.094035  0.135207
age        -0.077221 -0.369226  1.000000  ...  0.096067    0.280328  0.198270
sibsp      -0.035322  0.083081 -0.308247  ...  0.159651   -0.253586 -0.584471
parch       0.081629  0.018443 -0.189119  ...  0.216225   -0.349943 -0.583398
fare        0.257307 -0.549500  0.096067  ...  1.000000   -0.182024 -0.271832
adult_male -0.557080  0.094035  0.280328  ... -0.182024    1.000000  0.404744
alone      -0.203367  0.135207  0.198270  ... -0.271832    0.404744  1.000000

[8 rows x 8 columns]
"""
```
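A common next step is to read off a single column of the matrix, for example to see which features are most strongly related to `survived`. The sketch below copies the `survived` column from the printed output above into a Series so that it runs on its own; with the dataset loaded, the same Series could be obtained with `correlation_matrix['survived'].drop('survived')`. The variable names are illustrative.

```python
import pandas as pd

# Correlations with `survived`, copied from the printed matrix above
survived_corr = pd.Series({
    "pclass": -0.338481,
    "age": -0.077221,
    "sibsp": -0.035322,
    "parch": 0.081629,
    "fare": 0.257307,
    "adult_male": -0.557080,
    "alone": -0.203367,
})

# Sort by absolute value so strong negative correlations rank
# alongside strong positive ones
ranked = survived_corr.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```

Sorting by absolute value matters here: the sign only tells us the direction of the relationship, while the magnitude tells us its strength.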
Correlation coefficients in the matrix describe the relationships between variables and lie in the range -1 to 1. When two features have a high positive correlation, their values tend to rise and fall together. When they have a negative correlation, one variable tends to fall as the other rises. A correlation close to 0 indicates little or no linear relationship between the variables.
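These three cases can be seen on a tiny example. The following sketch uses made-up columns (all names and values are invented for illustration) that rise with `x`, fall with `x`, and depend on `x` in a purely non-linear way.

```python
import pandas as pd

x = [-2, -1, 0, 1, 2]

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "x": x,
    "rises_with_x": [2 * v + 1 for v in x],  # increasing linear function of x
    "falls_with_x": [-3 * v for v in x],     # decreasing linear function of x
    "x_squared": [v ** 2 for v in x],        # depends on x, but not linearly
})

# Correlation of every column with x
corr_with_x = toy.corr()["x"]

print(corr_with_x)
```

The last column is a useful reminder: `x_squared` is completely determined by `x`, yet its correlation with `x` is essentially zero, because the coefficient only captures linear relationships.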
Seaborn is a versatile Python library that enriches Matplotlib plots by providing a high-level interface for creating a variety of informative and attractive statistical graphics. Among them, a powerful tool is the heatmap plot. Heatmap plots display numeric tabular data where the cells are colored depending on the contained value.
Let's visualize our correlation matrix as a heatmap:
```python
import matplotlib.pyplot as plt

# Create a heatmap
sns.heatmap(correlation_matrix, annot=True)

# Show plot
plt.show()
```
The argument `annot=True` in the `heatmap()` function writes the data value into each cell, providing instant insights.
The `heatmap()` function offers a lot of parameters that can be useful for customization according to our requirements:
- `cbar`: If `True`, draw a colorbar.
- `vmin`, `vmax`: Establish the colormap limits.
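A few other `heatmap()` parameters are often handy for correlation matrices: `fmt` controls the number format of the annotations, `linewidths` draws lines between cells, `square` makes every cell square, and `mask` hides cells where the mask is `True`. The sketch below combines them on a small made-up DataFrame (the columns `a`, `b`, `c` are invented for illustration) and hides the redundant upper triangle of the symmetric matrix.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 1, 4, 3, 6, 5],
    "c": [6, 5, 4, 3, 2, 1],
})
corr = toy.corr()

# True marks cells to hide: here, everything strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

ax = sns.heatmap(
    corr,
    mask=mask,        # hide the duplicated upper triangle
    annot=True,       # write the value into each visible cell
    fmt=".2f",        # two decimal places in the annotations
    linewidths=0.5,   # thin lines separating the cells
    square=True,      # square cells
    vmin=-1,
    vmax=1,
)

plt.show()
```

Because a correlation matrix is symmetric, masking one triangle removes no information and makes the plot easier to scan.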
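Let's create a heatmap with a color bar and the color limits fixed to the full correlation range (`cbar=True` is the default; it is written out here for clarity):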
```python
# Create a heatmap
sns.heatmap(correlation_matrix, annot=True, cbar=True, vmin=-1, vmax=1)

# Show plot
plt.show()
```
Here is the result:
We can use the `cmap` parameter to define a colormap for the heatmap. The colormap can help us perceive the strength of the correlations between the variables at a glance:
```python
# Create a heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')

plt.show()
```
Here is the result:
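The `coolwarm` colormap used here is a diverging colormap: its colors diverge from a neutral color at the midpoint to two contrasting colors at the extremes. For a correlation matrix, the neutral color should sit at 0 and the scale should run from -1 to +1, matching the range of the correlation coefficient. By default, seaborn infers the color limits from the data, so pass `vmin=-1, vmax=1` (or `center=0`) to guarantee this.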
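One detail worth checking is where the color limits actually end up. The sketch below draws the same small matrix twice, once with the limits left to seaborn and once with `vmin=-1, vmax=1`, and reads the limits back from the plotted mesh. The matrix values are made up for illustration, and reading the limits via `ax.collections[0].get_clim()` relies on the heatmap mesh being the first collection on a fresh axes.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy correlation-like matrix, invented for illustration; its minimum is -0.5
labels = ["a", "b", "c"]
corr = pd.DataFrame(
    [[1.0, 0.8, -0.5],
     [0.8, 1.0, 0.2],
     [-0.5, 0.2, 1.0]],
    index=labels,
    columns=labels,
)

fig, (ax_default, ax_fixed) = plt.subplots(1, 2, figsize=(10, 4))

# Limits left to seaborn: inferred from the smallest and largest values
sns.heatmap(corr, annot=True, cmap="coolwarm", ax=ax_default)
ax_default.set_title("Limits inferred from data")

# Limits fixed to the full correlation range, so the midpoint is 0
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax_fixed)
ax_fixed.set_title("vmin=-1, vmax=1")

# Read the color limits back from each plotted mesh
default_limits = ax_default.collections[0].get_clim()
fixed_limits = ax_fixed.collections[0].get_clim()
print(default_limits, fixed_limits)

plt.show()
```

With inferred limits the neutral color lands at the midpoint of the observed values rather than at 0, which can make weak positive correlations look negative. Fixing the limits keeps heatmaps comparable across datasets.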
Alternatively, you can build a color map on your own:
```python
# Building a color map
color_map = sns.diverging_palette(220, 20, as_cmap=True)
sns.heatmap(correlation_matrix, annot=True, cmap=color_map)

plt.show()
```
In the call `sns.diverging_palette(220, 20, as_cmap=True)`, the arguments `220` and `20` are hues given in degrees on the color wheel, which runs from 0 to 360. `220` refers to a blue hue, and `20` refers to an orange. `as_cmap=True` means the output will be a matplotlib colormap object that can be used with matplotlib and seaborn plotting functions.
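To see what `as_cmap` changes, the following sketch calls `diverging_palette` both ways and inspects the results. Without `as_cmap=True` the function returns a discrete list of colors; with it, the result is a colormap that can be sampled at any position between 0 and 1. The variable names are illustrative.

```python
import matplotlib.colors as mcolors
import seaborn as sns

# Discrete palette: a list of RGB colors (six by default)
palette = sns.diverging_palette(220, 20)

# Continuous colormap: a matplotlib Colormap object
color_map = sns.diverging_palette(220, 20, as_cmap=True)

print(len(palette))
print(isinstance(color_map, mcolors.Colormap))

# A colormap is callable: positions in [0, 1] map to RGBA tuples
low_end = color_map(0.0)   # color at the negative extreme (hue 220)
middle = color_map(0.5)    # color at the center of the palette
high_end = color_map(1.0)  # color at the positive extreme (hue 20)
print(low_end, middle, high_end)
```

The discrete form is convenient for categorical plots, while the colormap form is what `heatmap()` needs for continuous values such as correlation coefficients.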
Congratulations! You've just learned how to perform correlation analysis and effectively communicate the insights from your analysis using heatmaps in Python. You've also explored how color mapping techniques can amplify the readability of your plots and provide instant insights into the relationships between variables.
Understanding and capturing the correlation between different variables is crucial in exploratory data analysis and can help you shape significant insights.
Each of these concepts builds on the last. It's time to put them into practice: the exercises that follow will give you hands-on experience with correlation analysis and heatmaps and strengthen these skills.