Hello! In today's lesson, we will dive into the concept of correlation and focus specifically on highlighting certain correlation values within the `diamonds` dataset.
Correlation is a statistical measure that describes the extent to which two variables change together. Understanding correlations is crucial in data analysis as it helps us identify relationships between different variables.
For example:
- Positive Correlation: As one variable increases, the other also increases (e.g., height and weight).
- Negative Correlation: As one variable increases, the other decreases (e.g., speed and travel time).
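To make the two cases concrete, here is a minimal sketch using small made-up tables. The column names and numbers are hypothetical, invented for illustration only. It relies on pandas' `Series.corr`, which computes the Pearson coefficient by default.

```python
import pandas as pd

# Hypothetical measurements, invented for this example
people = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [52, 58, 69, 77, 88],
})
trips = pd.DataFrame({
    "speed_kmh": [40, 60, 80, 100, 120],
    "travel_time_h": [3.0, 2.0, 1.5, 1.2, 1.0],
})

# Both columns rise together, so the coefficient is positive
r_positive = people["height_cm"].corr(people["weight_kg"])

# Travel time falls as speed rises, so the coefficient is negative
r_negative = trips["speed_kmh"].corr(trips["travel_time_h"])

print(f"height vs. weight:     {r_positive:.3f}")
print(f"speed vs. travel time: {r_negative:.3f}")
```

The sign of each result tells you the direction of the relationship, and its distance from 0 tells you how closely the points follow a straight line.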
By the end of this lesson, you will be able to compute, mask, and visually represent these correlations to get a clearer picture of the underlying data relationships.
Let's compute the correlation matrix for our prepared `diamonds` dataset. As mentioned before, the correlation matrix is a table showing correlation coefficients between many variables. Each cell in the table shows the correlation between two variables.
You might be familiar with the process by now, but here's how to compute and display the correlation matrix using pandas:
```python
import seaborn as sns
import pandas as pd

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Encode the categorical columns as integer codes so they can be correlated
diamonds['cut'] = diamonds['cut'].astype('category').cat.codes
diamonds['color'] = diamonds['color'].astype('category').cat.codes
diamonds['clarity'] = diamonds['clarity'].astype('category').cat.codes

# Calculate the correlation matrix
correlation_matrix = diamonds.corr()

# Display the correlation matrix
print(correlation_matrix)
```
Output:
```text
            carat       cut      color    clarity      depth      table      price  \
carat    1.000000  0.134967   0.291437   0.352841   0.028224   0.181618   0.921591
cut      0.134967  1.000000   0.020519   0.189175   0.218055   0.433405   0.053491
color    0.291437  0.020519   1.000000  -0.025631   0.047279   0.026465   0.172511
clarity  0.352841  0.189175  -0.025631   1.000000   0.067384   0.160327   0.146800
depth    0.028224  0.218055   0.047279   0.067384   1.000000  -0.295779  -0.010647
table    0.181618  0.433405   0.026465   0.160327  -0.295779   1.000000   0.127134
price    0.921591  0.053491   0.172511   0.146800  -0.010647   0.127134   1.000000
x        0.975094  0.125565   0.270287   0.371999  -0.025289   0.195344   0.884435
y        0.951722  0.121462   0.263584   0.358420  -0.029341   0.183760   0.865421
z        0.953387  0.149323   0.268227   0.366952   0.094924   0.150929   0.861249

                x          y         z
carat     0.975094   0.951722  0.953387
cut       0.125565   0.121462  0.149323
color     0.270287   0.263584  0.268227
clarity   0.371999   0.358420  0.366952
depth    -0.025289  -0.029341  0.094924
table     0.195344   0.183760  0.150929
price     0.884435   0.865421  0.861249
x         1.000000   0.974701  0.970772
y         0.974701   1.000000  0.952006
z         0.970772   0.952006  1.000000
```
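In the matrix, correlation coefficients range from -1 to 1. Values close to 1 imply a strong positive correlation, while values close to -1 imply a strong negative correlation. Values near 0 imply little to no linear correlation. By default, `corr()` computes the Pearson coefficient, which measures linear association, and each variable's correlation with itself on the diagonal is always 1.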
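A note on the encoding step that precedes `corr()`: `.cat.codes` numbers the categories in whatever order the categorical column defines. The sketch below uses a small hypothetical series of cut grades (not the actual dataset) to show that a plain string column gets alphabetical codes, while an explicitly ordered categorical keeps a meaningful ranking. The `quality_order` list is an assumption made for illustration, so check `.cat.categories` on your own data before interpreting correlations that involve encoded columns.

```python
import pandas as pd

# Hypothetical cut grades, invented for this example
cuts = pd.Series(["Ideal", "Fair", "Very Good", "Premium", "Good"])

# Plain strings become categories sorted alphabetically:
# Fair=0, Good=1, Ideal=2, Premium=3, Very Good=4
alphabetical_codes = cuts.astype("category").cat.codes

# Assumed quality ranking from worst to best (an illustrative choice)
quality_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
ordered_dtype = pd.CategoricalDtype(categories=quality_order, ordered=True)

# With an explicit order, the codes follow the ranking instead
ordered_codes = cuts.astype(ordered_dtype).cat.codes

print(pd.DataFrame({
    "cut": cuts,
    "alphabetical": alphabetical_codes,
    "ordered": ordered_codes,
}))
```

Because a correlation coefficient treats the codes as numbers, the result is only interpretable when larger codes really do mean "more" of something.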
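To enhance visibility, we'll mask correlation values within a specified range (e.g., -0.3 to 0.3). Masking hides the weak correlations so we can focus on the stronger relationships.
We'll use the `map` function in pandas (`DataFrame.map`, available in pandas 2.1 and later; older versions provide the same behavior as `applymap`) to mask these values: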
```python
# Masking the correlation matrix for values within the range -0.3 to 0.3
mask = correlation_matrix.map(lambda x: -0.3 < x < 0.3)
corr_to_plot = correlation_matrix.mask(mask)

# Display the masked correlation matrix
print(corr_to_plot)
```
Output:
```text
            carat       cut  color  ...         x         y         z
carat    1.000000       NaN    NaN  ...  0.975094  0.951722  0.953387
cut           NaN  1.000000    NaN  ...       NaN       NaN       NaN
color         NaN       NaN    1.0  ...       NaN       NaN       NaN
clarity  0.352841       NaN    NaN  ...  0.371999  0.358420  0.366952
depth         NaN       NaN    NaN  ...       NaN       NaN       NaN
table         NaN  0.433405    NaN  ...       NaN       NaN       NaN
price    0.921591       NaN    NaN  ...  0.884435  0.865421  0.861249
x        0.975094       NaN    NaN  ...  1.000000  0.974701  0.970772
y        0.951722       NaN    NaN  ...  0.974701  1.000000  0.952006
z        0.953387       NaN    NaN  ...  0.970772  0.952006  1.000000
```
With the mask applied, we'll only see the correlations with absolute values greater than 0.3.
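The same mask can also be built without an element-wise lambda. The sketch below is an alternative under stated assumptions, not the lesson's own method: it uses a small hypothetical correlation matrix (values invented for illustration) and the vectorized comparison `abs() < 0.3`, which selects the same cells as `-0.3 < x < 0.3` and does not depend on which pandas version provides `map`.

```python
import pandas as pd

# Hypothetical 3x3 correlation matrix, invented for this example
labels = ["carat", "price", "depth"]
corr = pd.DataFrame(
    [[1.00, 0.92, 0.03],
     [0.92, 1.00, -0.45],
     [0.03, -0.45, 1.00]],
    index=labels,
    columns=labels,
)

# True wherever the correlation is weak (absolute value below 0.3)
weak = corr.abs() < 0.3

# mask() replaces the cells marked True with NaN
masked = corr.mask(weak)

print(masked)
```

Here only the `carat`/`depth` pair is weak, so those two symmetric cells become `NaN` and every other value is kept.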
Finally, let's visualize the masked correlation matrix with a heatmap. Heatmaps are a great way to represent data, providing an easily interpretable and visually appealing view of our correlations.
Here's how to create and display a heatmap:
```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.heatmap(corr_to_plot, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlations with Absolute Values > 0.3')
plt.show()
```
The output of the above code is a heatmap showing the `diamonds` dataset's correlations with absolute values greater than 0.3. This heatmap helps you quickly identify the pairs of variables that have either a notable positive or a notable negative correlation with each other.
In this heatmap:
- The color gradient represents the strength of the correlation.
- We use annotations (`annot=True`) to display the correlation values directly in the heatmap.
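One optional refinement, offered as a sketch rather than as part of the lesson's code: when no limits are given, seaborn scales the colors to the range of values actually present. In the masked matrix printed earlier, every value that survives the mask is positive, so the cool end of `coolwarm` would mark the smallest shown value rather than a negative one. Passing `vmin=-1` and `vmax=1` pins the color scale to the full correlation range, so the midpoint of the colormap sits at 0. The small matrix below is hypothetical, invented for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical masked correlation matrix (NaN marks the masked cells)
labels = ["carat", "price", "depth"]
corr_masked = pd.DataFrame(
    [[1.00, 0.92, np.nan],
     [0.92, 1.00, -0.45],
     [np.nan, -0.45, 1.00]],
    index=labels,
    columns=labels,
)

fig, ax = plt.subplots(figsize=(6, 4))

# vmin/vmax fix the color scale to [-1, 1] regardless of the data shown
sns.heatmap(
    corr_masked,
    annot=True,
    cmap="coolwarm",
    vmin=-1,
    vmax=1,
    linewidths=0.5,
    ax=ax,
)
ax.set_title("Masked correlations on a fixed -1 to 1 color scale")

# plt.show()  # uncomment to display the figure in an interactive session
```

With the scale fixed, warm cells are always positive and cool cells are always negative, which makes heatmaps from different datasets directly comparable.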
By focusing on significant correlations, you can better understand the relationships within your dataset, making your data analysis more insightful.
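Alongside the heatmap, it can help to list the notable pairs as a ranked table. The following is a minimal sketch under stated assumptions: the matrix is a hypothetical example, the 0.3 threshold mirrors the one used in this lesson, and the upper-triangle trick (via NumPy's `triu`) is one common way to avoid listing each symmetric pair twice.

```python
import numpy as np
import pandas as pd

# Hypothetical correlation matrix, invented for this example
labels = ["carat", "price", "depth"]
corr = pd.DataFrame(
    [[1.00, 0.92, 0.03],
     [0.92, 1.00, -0.45],
     [0.03, -0.45, 1.00]],
    index=labels,
    columns=labels,
)

# Keep only the cells above the diagonal so each pair appears once
above_diagonal = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(above_diagonal).stack()

# Keep pairs at or beyond the threshold, strongest first
strong_pairs = pairs[pairs.abs() >= 0.3].sort_values(
    key=lambda s: s.abs(), ascending=False
)

print(strong_pairs)
```

The result is a Series indexed by variable pairs, which is easier to scan or export than a large grid when the dataset has many columns.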
Today, you learned what correlation is and how to compute and visualize it using the `diamonds` dataset. We covered:
- Converting categorical variables to numeric codes.
- Computing the correlation matrix.
- Masking values within a specified range.
- Creating a heatmap to visualize significant correlations.
These skills will help you in identifying and focusing on meaningful relationships in your data, improving the quality of your analyses. Now, it's time to practice these concepts with some exercises to reinforce your understanding and boost your data analysis capabilities.
Great job, and keep up the good work!