Hello and welcome! In today’s lesson, we’re diving into advanced scatter plot customization using the diamonds dataset in Python. By the end of this lesson, you will have learned how to customize scatter plots to reveal complex patterns and provide better insights into your data. We'll do this by adjusting aesthetics like colors, markers, and transparency, as well as incorporating additional details using size and hue.
Changing the marker size with the `s` parameter can make the scatter plot more readable, especially when dealing with overlapping points.
```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Creating a new feature 'volume' (x * y * z)
diamonds['volume'] = diamonds['x'] * diamonds['y'] * diamonds['z']

# Filter out unrealistic volumes
diamonds = diamonds[(diamonds['volume'] > 0) & (diamonds['volume'] < 300)]

# Scatter plot with adjusted marker size
plt.figure(figsize=(10, 6))
sns.scatterplot(x='volume', y='price', data=diamonds, s=100, alpha=0.4)
plt.title('Scatter Plot of Volume vs. Price (Adjusted Marker Size)')
plt.xlabel('Volume')
plt.ylabel('Price')
plt.show()
```
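For context, Matplotlib (which seaborn draws with) interprets `s` as the marker area in points squared, so doubling the marker width means quadrupling `s`. The sketch below illustrates this on a small synthetic stand-in for the diamonds columns. The `diameter_to_s` helper and the `demo` frame are hypothetical names introduced for illustration and are not part of the lesson's code.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def diameter_to_s(diameter_pt):
    """Convert a desired marker width in points to an `s` value (points squared)."""
    return diameter_pt ** 2


# Synthetic stand-in for the 'volume' and 'price' columns (not the real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"volume": rng.uniform(30, 300, 200)})
demo["price"] = 50 * demo["volume"] + rng.normal(0, 1500, 200)

# Same data drawn twice: markers roughly 5 pt wide and roughly 10 pt wide
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, width in zip(axes, (5, 10)):
    s_value = diameter_to_s(width)
    sns.scatterplot(x="volume", y="price", data=demo, s=s_value, alpha=0.4, ax=ax)
    ax.set_title(f"s={s_value} (about {width} pt wide)")
```

Call `plt.show()` afterwards to display the two panels side by side. With many overlapping points, a smaller `s` combined with a low `alpha` tends to keep dense regions legible.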
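The `marker` parameter sets the symbol drawn for the points. It applies one symbol to every point in the call, so it helps distinguish between different types of data points mainly when several scatter plots, each with its own marker, share the same axes.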
```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Creating a new feature 'volume' (x * y * z)
diamonds['volume'] = diamonds['x'] * diamonds['y'] * diamonds['z']

# Filter out unrealistic volumes
diamonds = diamonds[(diamonds['volume'] > 0) & (diamonds['volume'] < 300)]

# Scatter plot with adjusted marker style
plt.figure(figsize=(10, 6))
sns.scatterplot(x='volume', y='price', data=diamonds, marker='x', alpha=0.6, s=100)
plt.title('Scatter Plot of Volume vs. Price (Marker Style X)')
plt.xlabel('Volume')
plt.ylabel('Price')
plt.show()
```
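One way to use marker styles to tell groups apart is to draw each group with its own `scatterplot` call on shared axes. The following is a minimal sketch under that assumption, using synthetic data with two made-up cut groups. The `demo` frame and the `marker_for_cut` mapping are hypothetical and only stand in for the real dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in with two hypothetical cut groups (not the real data)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "volume": rng.uniform(30, 300, 200),
    "cut": rng.choice(["Ideal", "Fair"], size=200),
})
demo["price"] = 50 * demo["volume"] + rng.normal(0, 1500, 200)

# One scatterplot call per group, each with its own marker symbol
marker_for_cut = {"Ideal": "o", "Fair": "x"}
fig, ax = plt.subplots(figsize=(8, 5))
for cut, marker in marker_for_cut.items():
    subset = demo[demo["cut"] == cut]
    sns.scatterplot(x="volume", y="price", data=subset, marker=marker,
                    s=60, alpha=0.6, label=cut, ax=ax)
ax.legend(title="Cut")
```

Because both calls target the same `ax`, the groups share one set of axes and one legend. Call `plt.show()` to display the figure.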
Beyond basic aesthetics, the `hue` parameter can add another layer of information to your scatter plot. It colors the data points by the categories of a variable, so color represents an additional dimension of the data. Here is an example that passes the `'cut'` feature to the `hue` parameter to differentiate between the cut categories:
```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Creating a new feature 'volume' (x * y * z)
diamonds['volume'] = diamonds['x'] * diamonds['y'] * diamonds['z']

# Filter out unrealistic volumes
diamonds = diamonds[(diamonds['volume'] > 0) & (diamonds['volume'] < 300)]

# Scatter plot using hue to differentiate cut categories
plt.figure(figsize=(10, 6))
sns.scatterplot(x='volume', y='price', hue='cut', data=diamonds, alpha=0.6, s=100)
plt.title('Scatter Plot of Volume vs. Price (Hue by Cut)')
plt.xlabel('Volume')
plt.ylabel('Price')
plt.legend(title='Cut', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
```
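When the hue variable is stored as plain strings, the order of the categories in the legend may not be the one you want. The sketch below shows one possible way to control it with the `hue_order` parameter of `scatterplot`. The synthetic `demo` frame and the `cut_order` list are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Desired legend order for the (hypothetical) cut categories
cut_order = ["Fair", "Good", "Ideal"]

# Synthetic stand-in for the diamonds columns (not the real data)
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "volume": rng.uniform(30, 300, 300),
    "cut": rng.choice(cut_order, size=300),
})
demo["price"] = 50 * demo["volume"] + rng.normal(0, 1500, 300)

fig, ax = plt.subplots(figsize=(8, 5))
# hue_order fixes both the color assignment and the legend order
sns.scatterplot(x="volume", y="price", hue="cut", hue_order=cut_order,
                data=demo, alpha=0.6, s=60, ax=ax)
ax.legend(title="Cut", bbox_to_anchor=(1.05, 1), loc="upper left")
```

Fixing the order this way also keeps colors consistent across several plots of the same variable.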
Passing the `'carat'` variable to the `size` parameter scales each marker by the diamond's carat value, as in the following example. Note that this is different from the `s` parameter, which sets one constant marker size for all the points in the plot. The `sizes` parameter sets the range of marker sizes, given as a (minimum, maximum) pair.
```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Creating a new feature 'volume' (x * y * z)
diamonds['volume'] = diamonds['x'] * diamonds['y'] * diamonds['z']

# Filter out unrealistic volumes
diamonds = diamonds[(diamonds['volume'] > 0) & (diamonds['volume'] < 300)]

# Scatter plot with marker sizes representing carat values
plt.figure(figsize=(10, 6))
sns.scatterplot(x='volume', y='price', hue='cut', size='carat', sizes=(20, 200), data=diamonds, alpha=0.6)
plt.title('Scatter Plot of Volume vs. Price (Size by Carat)')
plt.xlabel('Volume')
plt.ylabel('Price')
plt.legend(title='Cut', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
```
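To make the role of `sizes` concrete, the following sketch plots synthetic data and then inspects the marker sizes that were actually drawn. It assumes a numeric size variable, for which the smallest value should map to the lower bound of `sizes` and the largest to the upper bound. The `demo` frame is a hypothetical stand-in for the diamonds data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for 'carat', 'volume', and 'price' (not the real data)
rng = np.random.default_rng(3)
demo = pd.DataFrame({"carat": rng.uniform(0.2, 2.0, 200)})
demo["volume"] = 160 * demo["carat"] + rng.normal(0, 5, 200)
demo["price"] = 4000 * demo["carat"] + rng.normal(0, 800, 200)

fig, ax = plt.subplots(figsize=(8, 5))
sns.scatterplot(x="volume", y="price", size="carat", sizes=(20, 200),
                data=demo, alpha=0.6, ax=ax)

# The first collection on the axes holds the plotted points
marker_sizes = ax.collections[0].get_sizes()
print("smallest marker:", marker_sizes.min())
print("largest marker:", marker_sizes.max())
```

The printed extremes should match the `sizes=(20, 200)` range, while `s` would have produced a single value for every point.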
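Adding regression lines can help to identify trends. This is done using the `regplot` function, which takes the same primary parameters (`x`, `y`, and `data`) as the other plotting functions. Setting `scatter=False` draws only the fitted line, so it can be layered on top of an existing scatter plot.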
```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Creating a new feature 'volume' (x * y * z)
diamonds['volume'] = diamonds['x'] * diamonds['y'] * diamonds['z']

# Filter out unrealistic volumes
diamonds = diamonds[(diamonds['volume'] > 0) & (diamonds['volume'] < 300)]

# Scatter plot (hue by cut, size by carat) with one overall regression line on top
plt.figure(figsize=(10, 6))
sns.scatterplot(x='volume', y='price', hue='cut', size='carat', sizes=(20, 200), data=diamonds, alpha=0.6)
sns.regplot(x='volume', y='price', data=diamonds, scatter=False, color='gray')
plt.title('Scatter Plot with Regression Line')
plt.xlabel('Volume')
plt.ylabel('Price')
plt.legend(title='Cut', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
```
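To see what the regression line represents, the sketch below compares the line drawn by `regplot` with a least-squares fit computed directly with NumPy. It assumes `regplot`'s default settings, which fit a straight line by ordinary least squares, and it uses a synthetic `demo` frame as a hypothetical stand-in for the diamonds data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the 'volume' and 'price' columns (not the real data)
rng = np.random.default_rng(4)
demo = pd.DataFrame({"volume": rng.uniform(30, 300, 200)})
demo["price"] = 50 * demo["volume"] + rng.normal(0, 1500, 200)

fig, ax = plt.subplots(figsize=(8, 5))
# ci=None skips the bootstrapped confidence band, so only the line is drawn
sns.regplot(x="volume", y="price", data=demo, scatter=False, ci=None,
            color="gray", ax=ax)

# Slope of the drawn line, taken from its two end points
line_x, line_y = ax.lines[0].get_data()
drawn_slope = (line_y[-1] - line_y[0]) / (line_x[-1] - line_x[0])

# Ordinary least-squares slope and intercept computed directly with NumPy
ols_slope, ols_intercept = np.polyfit(demo["volume"], demo["price"], deg=1)
print(f"regplot slope: {drawn_slope:.3f}, polyfit slope: {ols_slope:.3f}")
```

The two slopes should agree up to floating-point error, which suggests the line summarizes the overall linear trend rather than any per-category relationship.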
In this lesson, you mastered advanced scatter plot customization techniques, including adjusting aesthetic properties, using size and hue to encode additional information, and adding regression lines. These skills are essential for better data representation and uncovering deeper insights.
Next, we'll have practice exercises to help solidify these concepts and further enhance your data visualization skills. Customizing scatter plots in such detail will make your data storytelling more effective and impactful!