Hello and welcome! In today's lesson, we'll explore correlations in the diamonds dataset visually, using hue and size in scatter plots and a density heatmap overlay. These visualization methods help you see how several features relate to one another at once, so you can draw better-supported insights from the data.
Correlation analysis is essential in data science because it quantifies the strength and direction of the relationship between two variables. Understanding these correlations helps with feature selection and with building more accurate predictive models. Three common correlation coefficients are:
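- Pearson Correlation: Measures the strength of a linear relationship.
- Spearman Correlation: Measures monotonic relationships, based on the ranks of the values.
- Kendall Correlation: Measures ordinal association, based on how consistently pairs of observations are ordered.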
In this lesson, we will focus on the Pearson correlation, which is commonly used for continuous data.
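As a rough illustration of what the Pearson coefficient computes, the sketch below evaluates it by hand on a small made-up table and compares the result with pandas' built-in `Series.corr`. The `toy` data and the `pearson_r` helper are hypothetical names introduced here, not part of the lesson's dataset. The Spearman value is obtained by ranking the values first and then applying the same formula.

```python
import numpy as np
import pandas as pd

# Toy stand-in data (made up for illustration, not taken from diamonds)
toy = pd.DataFrame({
    "carat": [0.2, 0.3, 0.5, 0.7, 1.0, 1.5],
    "price": [300, 400, 900, 1800, 4000, 9000],
})


def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson r: co-variation of x and y divided by the product of their spreads."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


manual = pearson_r(toy["carat"], toy["price"])
builtin = toy["carat"].corr(toy["price"])  # Pearson is the default method
spearman = pearson_r(toy["carat"].rank(), toy["price"].rank())

print(f"Pearson (manual): {manual:.4f}")
print(f"Pearson (pandas): {builtin:.4f}")
print(f"Spearman (ranks): {spearman:.4f}")
```

On this toy data the relationship is perfectly monotonic but not linear, so the rank-based value reaches 1 while the Pearson value stays below it.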
First, let's load the `diamonds` dataset and preprocess it by converting categorical variables into numerical values for easier plotting and analysis.
```python
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Convert categorical variables for easier plotting
diamonds['cut'] = diamonds['cut'].astype('category').cat.codes
diamonds['color'] = diamonds['color'].astype('category').cat.codes
diamonds['clarity'] = diamonds['clarity'].astype('category').cat.codes

print(diamonds.head())
```
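By converting `cut`, `color`, and `clarity` into numerical codes, we make these features easier to handle when plotting and calculating correlations. Each code is the integer position of the label in that column's category order, so treat the codes as ordered labels rather than measured quantities.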
The output of the above code will be:
```text
   carat  cut  color  clarity  depth  table  price     x     y     z
0   0.23    0      1        6   61.5   55.0    326  3.95  3.98  2.43
1   0.21    1      1        5   59.8   61.0    326  3.89  3.84  2.31
2   0.23    3      1        3   56.9   65.0    327  4.05  4.07  2.31
3   0.29    1      5        4   62.4   58.0    334  4.20  4.23  2.63
4   0.31    3      6        6   63.3   58.0    335  4.34  4.35  2.75
```
This output displays the first five rows of the `diamonds` dataset after converting `cut`, `color`, and `clarity` into numerical codes, making it ready for correlation analysis and plotting.
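Because the integer codes replace the original labels, it can help to keep a lookup from code back to label before overwriting a column. The sketch below shows the idea on a small hand-built categorical. The category order used here is an assumption made for illustration, so check `diamonds['cut'].cat.categories` on your own copy of the data before relying on it.

```python
import pandas as pd

# Hand-built example column; the category order below is assumed for illustration
cut = pd.Series(
    pd.Categorical(
        ["Ideal", "Premium", "Good", "Premium", "Good"],
        categories=["Ideal", "Premium", "Very Good", "Good", "Fair"],
    )
)

# Save the code -> label lookup before replacing labels with codes
code_to_label = dict(enumerate(cut.cat.categories))
codes = cut.cat.codes

print(code_to_label)
print(codes.tolist())

# Map codes back to labels, e.g. for a readable legend or table
labels_again = codes.map(code_to_label)
print(labels_again.tolist())
```

Keeping such a lookup makes it easier to interpret a legend or a correlation that is expressed in terms of codes.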
A scatter plot can reveal the relationship between two continuous variables. By mapping additional variables to hue (point color) and size (point size), we can show more dimensions in the same plot.
```python
# Scatter plot of carat vs. price colored by cut with size representing clarity
plt.figure(figsize=(10, 6))
sns.scatterplot(x='carat', y='price', hue='cut', size='clarity', palette='viridis', data=diamonds)
plt.title('Scatter Plot of Carat vs. Price With Hues and Sizes')
plt.xlabel('Carat')
plt.ylabel('Price')
# The legend lists both the hue (cut) and size (clarity) entries
plt.legend(title='Cut (hue) / Clarity (size)')
plt.show()
```
This scatter plot shows how `carat` and `price` are related, while also showing how `cut` (point color) and `clarity` (point size) vary across that relationship. The hue and size encodings add layers of information, letting you see whether cut and clarity shift prices at a given carat weight.
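One way to put a number on what the hue encoding suggests is to compute the carat-price correlation separately within each cut level. This is a minimal sketch on made-up values. `corr_by_group` is a hypothetical helper, and with the real data you would pass the `diamonds` frame instead of `toy`.

```python
import pandas as pd

# Made-up rows standing in for diamonds; cut is already an integer code
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.9, 1.2, 0.3, 0.6, 1.0, 1.4],
    "price": [500, 1200, 3900, 7000, 400, 1500, 3000, 5200],
    "cut":   [0, 0, 0, 0, 3, 3, 3, 3],
})


def corr_by_group(df: pd.DataFrame, group: str, x: str, y: str) -> pd.Series:
    """Pearson correlation between x and y, computed within each level of group."""
    return pd.Series(
        {level: rows[x].corr(rows[y]) for level, rows in df.groupby(group)},
        name=f"corr({x}, {y})",
    )


per_cut = corr_by_group(toy, "cut", "carat", "price")
print(per_cut)
```

If the per-group values differ noticeably, that hints that the grouping variable changes how strongly the two plotted features move together.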
To better understand the density of points in the scatter plot, we can overlay a density heatmap using a KDE plot. This combination provides a richer visualization of data concentration areas.
```python
# Scatter plot with heatmap overlay
plt.subplots(figsize=(10, 6))
sns.scatterplot(x='carat', y='price', hue='cut', size='clarity', palette='viridis', data=diamonds)
# Filled 2D KDE drawn semi-transparently on top of the points
sns.kdeplot(x=diamonds['carat'], y=diamonds['price'], cmap='Reds', fill=True, alpha=0.3)
plt.title('Enhanced Scatter Plot with Density Overlay')
plt.xlabel('Carat')
plt.ylabel('Price')
# The legend lists both the hue (cut) and size (clarity) entries
plt.legend(title='Cut (hue) / Clarity (size)')
plt.show()
```
This enhanced scatter plot with a density overlay shows where the data points are most concentrated, with darker red regions marking higher concentrations of diamonds. Because the red density shading contrasts with the viridis-colored points, clusters in the data are easier to spot.
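If you want a numeric counterpart to the shaded regions, a plain 2D histogram gives bin counts directly. Note that this swaps the smooth KDE for simple binning, so it is a coarser, related view rather than the same computation. The synthetic data and the `densest_bin` helper below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: most points cluster at low carat and low price
carat = np.concatenate([rng.normal(0.4, 0.05, 900), rng.normal(1.5, 0.2, 100)])
price = np.concatenate([rng.normal(800, 100, 900), rng.normal(9000, 1500, 100)])


def densest_bin(x, y, bins=10):
    """Return (count, x_range, y_range) for the most populated 2D histogram bin."""
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    i, j = np.unravel_index(np.argmax(counts), counts.shape)
    return (
        int(counts[i, j]),
        (float(x_edges[i]), float(x_edges[i + 1])),
        (float(y_edges[j]), float(y_edges[j + 1])),
    )


count, carat_range, price_range = densest_bin(carat, price)
print(count, carat_range, price_range)
```

With the real data you could pass `diamonds['carat']` and `diamonds['price']` to the same helper and compare the reported ranges with the darkest region of the overlay.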
Congratulations! You've learned how to explore correlations visually using scatter plots with hue, size, and a density overlay. These skills are vital for data-driven decision-making, enabling you to identify and interpret complex relationships within your dataset.
In our next practice exercises, you'll apply these techniques to further solidify your understanding and enhance your data analysis skills. Keep practicing and exploring to become proficient in visualizing and interpreting data correlations!