Highlighting Correlation Values Within a Range

Lesson 5

Introduction to Correlation and Its Importance

Hello! In today's lesson, we will dive into the concept of correlation and focus on highlighting the stronger correlations in the diamonds dataset by hiding the values that fall within a chosen range.

Correlation is a statistical measure that describes the extent to which two variables change together. Understanding correlations is crucial in data analysis as it helps us identify relationships between different variables.

For example:

Positive Correlation: As one variable increases, the other also increases (e.g., height and weight).
Negative Correlation: As one variable increases, the other decreases (e.g., speed and travel time).
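
To make the two cases concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. Pearson's coefficient is what pandas' corr() uses by default. The helper name pearson_r and all of the numbers are made up for illustration; the travel times assume a fixed 120 km trip.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Illustrative, made-up data (not real measurements)
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]

speeds_kmh = [40, 60, 80, 100, 120]
hours_for_120_km = [120 / s for s in speeds_kmh]  # faster speed, shorter trip

r_positive = pearson_r(heights_cm, weights_kg)
r_negative = pearson_r(speeds_kmh, hours_for_120_km)
print(round(r_positive, 3), round(r_negative, 3))
```

The first coefficient comes out close to 1 and the second is clearly negative, matching the two cases described above.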

By the end of this lesson, you will be able to compute, mask, and visually represent these correlations to get a clearer picture of the underlying data relationships.

Computing the Correlation Matrix

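Let's compute the correlation matrix for the diamonds dataset. As mentioned before, the correlation matrix is a table of correlation coefficients in which each cell shows the correlation between the variable in its row and the variable in its column. Because cut, color, and clarity are categorical, we first convert them to integer codes so that they can be included in the calculation.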

You might be familiar with the process by now, but here's how to compute and display the correlation matrix using pandas:

Python
import seaborn as sns
import pandas as pd

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Encode the categorical columns as integer codes so corr() can include them
diamonds['cut'] = diamonds['cut'].astype('category').cat.codes
diamonds['color'] = diamonds['color'].astype('category').cat.codes
diamonds['clarity'] = diamonds['clarity'].astype('category').cat.codes

# Calculate the correlation matrix
correlation_matrix = diamonds.corr()

# Display the correlation matrix
print(correlation_matrix)

Output:

Plain text
            carat       cut     color   clarity     depth     table     price  \
carat    1.000000  0.134967  0.291437  0.352841  0.028224  0.181618  0.921591   
cut      0.134967  1.000000  0.020519  0.189175  0.218055  0.433405  0.053491   
color    0.291437  0.020519  1.000000 -0.025631  0.047279  0.026465  0.172511   
clarity  0.352841  0.189175 -0.025631  1.000000  0.067384  0.160327  0.146800   
depth    0.028224  0.218055  0.047279  0.067384  1.000000 -0.295779 -0.010647   
table    0.181618  0.433405  0.026465  0.160327 -0.295779  1.000000  0.127134   
price    0.921591  0.053491  0.172511  0.146800 -0.010647  0.127134  1.000000   
x        0.975094  0.125565  0.270287  0.371999 -0.025289  0.195344  0.884435   
y        0.951722  0.121462  0.263584  0.358420 -0.029341  0.183760  0.865421   
z        0.953387  0.149323  0.268227  0.366952  0.094924  0.150929  0.861249   

                x         y         z  
carat    0.975094  0.951722  0.953387  
cut      0.125565  0.121462  0.149323  
color    0.270287  0.263584  0.268227  
clarity  0.371999  0.358420  0.366952  
depth   -0.025289 -0.029341  0.094924  
table    0.195344  0.183760  0.150929  
price    0.884435  0.865421  0.861249  
x        1.000000  0.974701  0.970772  
y        0.974701  1.000000  0.952006  
z        0.970772  0.952006  1.000000
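
One detail of the encoding step is worth a closer look: the integer codes produced by .cat.codes follow the order of the categories. For plain strings, pandas sorts the categories alphabetically, while a column that is already categorical keeps its stored order. The sketch below uses a small made-up column (the labels and the "worst to best" order are assumptions for illustration) to show how the codes change once an explicit order is declared.

```python
import pandas as pd

# Hypothetical toy column; the labels are illustrative only
grades = pd.Series(["Ideal", "Fair", "Very Good", "Good"])

# Default: categories are sorted alphabetically (Fair, Good, Ideal, Very Good)
default_codes = grades.astype("category").cat.codes
print(default_codes.tolist())

# Explicit order: codes follow the order we declare (worst to best here)
quality = pd.CategoricalDtype(["Fair", "Good", "Very Good", "Ideal"], ordered=True)
ordered_codes = grades.astype(quality).cat.codes
print(ordered_codes.tolist())
```

Since a correlation on codes treats them as a numeric scale, the result is only meaningful when the code order reflects a real ordering of the categories.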

In the matrix, correlation coefficients range from -1 to 1. Values close to 1 imply a strong positive correlation, while values close to -1 imply a strong negative correlation. Values near 0 imply little to no correlation.
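
One way to apply this reading in practice is to rank the variable pairs by the absolute value of their coefficient, so that strong positive and strong negative pairs both rise to the top. The following is a minimal sketch, not part of the lesson's own code; it uses a four-variable subset of the matrix above so that it runs without downloading the dataset.

```python
import numpy as np
import pandas as pd

labels = ["carat", "depth", "table", "price"]
# Subset of the diamonds correlation matrix printed above
corr = pd.DataFrame(
    [
        [1.000000, 0.028224, 0.181618, 0.921591],
        [0.028224, 1.000000, -0.295779, -0.010647],
        [0.181618, -0.295779, 1.000000, 0.127134],
        [0.921591, -0.010647, 0.127134, 1.000000],
    ],
    index=labels,
    columns=labels,
)

# True only above the diagonal: each pair appears once, self-correlations are skipped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Keep the upper-triangle cells and reshape them into a (row, column) Series
pairs = corr.where(upper).stack().dropna()

# Sort by absolute value, strongest relationship first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

In this subset, carat and price come first, followed by the negative depth and table pair.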

Masking Values in the Correlation Matrix

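To make the stronger relationships easier to see, we'll mask the correlation values that fall within a specified range (e.g., -0.3 to 0.3). Masking replaces those weak correlations with NaN so that they drop out of the display.

We'll use the DataFrame.map method (available in pandas 2.1 and later; older versions call it applymap) to build a Boolean mask, and then DataFrame.mask to replace the flagged cells with NaN: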

Python
# Masking the correlation matrix for values within the range -0.3 to 0.3
mask = correlation_matrix.map(lambda x: -0.3 < x < 0.3)
corr_to_plot = correlation_matrix.mask(mask)

# Display the masked correlation matrix
print(corr_to_plot)

Output:

Plain text
            carat       cut  color  ...         x         y         z
carat    1.000000       NaN    NaN  ...  0.975094  0.951722  0.953387
cut           NaN  1.000000    NaN  ...       NaN       NaN       NaN
color         NaN       NaN    1.0  ...       NaN       NaN       NaN
clarity  0.352841       NaN    NaN  ...  0.371999  0.358420  0.366952
depth         NaN       NaN    NaN  ...       NaN       NaN       NaN
table         NaN  0.433405    NaN  ...       NaN       NaN       NaN
price    0.921591       NaN    NaN  ...  0.884435  0.865421  0.861249
x        0.975094       NaN    NaN  ...  1.000000  0.974701  0.970772
y        0.951722       NaN    NaN  ...  0.974701  1.000000  0.952006
z        0.953387       NaN    NaN  ...  0.970772  0.952006  1.000000

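With the mask applied, only correlations with an absolute value of at least 0.3 remain; every weaker value is replaced by NaN. The diagonal stays at 1 because each variable is perfectly correlated with itself.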
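
The same range mask can also be built without a per-cell function. The sketch below is an alternative, not the lesson's method: it compares the whole matrix against both bounds at once, which also works on pandas versions that predate DataFrame.map. The helper name mask_range and its low and high parameters are hypothetical, and the small matrix is a subset of the values printed earlier.

```python
import pandas as pd

labels = ["carat", "depth", "table", "price"]
# Subset of the diamonds correlation matrix shown earlier in this lesson
corr = pd.DataFrame(
    [
        [1.000000, 0.028224, 0.181618, 0.921591],
        [0.028224, 1.000000, -0.295779, -0.010647],
        [0.181618, -0.295779, 1.000000, 0.127134],
        [0.921591, -0.010647, 0.127134, 1.000000],
    ],
    index=labels,
    columns=labels,
)


def mask_range(matrix, low=-0.3, high=0.3):
    """Return a copy with every value strictly between low and high set to NaN."""
    inside = (matrix > low) & (matrix < high)  # Boolean DataFrame, True = hide
    return matrix.mask(inside)


masked = mask_range(corr)
print(masked)
```

Keeping the bounds as parameters makes it easy to try a different range, for example mask_range(corr, -0.5, 0.5), without rewriting the comparison.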

Visualizing the Correlation Matrix with a Heatmap

Finally, let's visualize the masked correlation matrix with a heatmap. Heatmaps are a great way to represent data, providing an easily interpretable and visually appealing view of our correlations.

Here's how to create and display a heatmap:

Python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.heatmap(corr_to_plot, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlations with Absolute Values > 0.3')
plt.show()

The output of the above code is a heatmap of the diamonds dataset's correlations whose absolute values are at least 0.3; the masked (NaN) cells are left blank. This makes it quick to spot the pairs of variables that have a moderate or strong positive or negative correlation with each other.

In this heatmap:

The color gradient encodes the correlation value: with the coolwarm palette, cooler colors mark lower values and warmer colors mark higher ones.
We use annotations (annot=True) to display the correlation values directly in the heatmap.

By focusing on significant correlations, you can better understand the relationships within your dataset, making your data analysis more insightful.
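
By default, sns.heatmap fits the color scale to the values that are actually plotted, so after masking the coolest color may correspond to a value near 0.3 rather than to a negative correlation. A possible variation, sketched below as an assumption-laden example rather than the lesson's own plot, pins the scale to the full -1 to 1 range with the vmin and vmax parameters. The small matrix, the title text, and the Agg backend line (which only lets the script run without a display) are choices made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

labels = ["carat", "depth", "table", "price"]
# Subset of the diamonds correlation matrix shown earlier in this lesson
corr = pd.DataFrame(
    [
        [1.000000, 0.028224, 0.181618, 0.921591],
        [0.028224, 1.000000, -0.295779, -0.010647],
        [0.181618, -0.295779, 1.000000, 0.127134],
        [0.921591, -0.010647, 0.127134, 1.000000],
    ],
    index=labels,
    columns=labels,
)

# Hide the cells whose correlation lies strictly between -0.3 and 0.3
corr_to_plot = corr.mask((corr > -0.3) & (corr < 0.3))

fig, ax = plt.subplots(figsize=(6, 4))
sns.heatmap(
    corr_to_plot,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    vmin=-1,  # pin the color scale to the full correlation range
    vmax=1,
    linewidths=0.5,
    ax=ax,
)
ax.set_title("Correlations outside (-0.3, 0.3)")
```

With the scale fixed this way, the midpoint of the palette always sits at 0, so colors stay comparable between heatmaps built with different masks. Call plt.show() or fig.savefig(...) afterwards to view or save the figure.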

Lesson Summary and Practice

Today, you learned what correlation is and how to compute and visualize it using the diamonds dataset. We covered:

Converting categorical variables to numeric codes.
Computing the correlation matrix.
Masking values within a specified range.
Creating a heatmap to visualize significant correlations.

These skills will help you in identifying and focusing on meaningful relationships in your data, improving the quality of your analyses. Now, it's time to practice these concepts with some exercises to reinforce your understanding and boost your data analysis capabilities.

Great job, and keep up the good work!
