Lesson 4
Data Correlation
Lesson Introduction

Welcome to today's lesson on data correlation! Data correlation is crucial in data analysis as it helps us understand how different variables relate to each other. Our goal today is to learn how to find and interpret correlations in a dataset using the Pandas library in Python.

Imagine you're a detective trying to figure out if two clues are connected. Similarly, in data analysis, correlation helps us determine if two numerical variables have any relationship. By the end of this lesson, you'll be able to find these relationships in your dataset and understand what they mean.

Understanding Correlation

Let's start with what correlation is. Correlation is a statistical measure that describes how much two variables change together. Here are the two main types of correlation:

  • Positive Correlation: When one variable increases, the other tends to increase. For example, students who study more hours tend to earn higher exam scores.
  • Negative Correlation: When one variable increases, the other tends to decrease. For instance, people who spend more time watching TV tend to spend less time studying.

Isn't it fascinating how numbers tell stories? Let's dive in!
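
As a quick illustration of the two cases above, the sketch below builds a small made-up dataset and checks the sign of each correlation with the Pandas corr method. The column names hours_studied, exam_score, and tv_hours and all of the values are hypothetical, chosen only for demonstration.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
students = pd.DataFrame({
    'hours_studied': [1, 2, 3, 4, 5],
    'exam_score': [52, 60, 71, 80, 88],
    'tv_hours': [5, 4, 4, 2, 1],
})

# Positive correlation: both variables rise together
positive = students['hours_studied'].corr(students['exam_score'])

# Negative correlation: one rises while the other falls
negative = students['hours_studied'].corr(students['tv_hours'])

print(f'hours_studied vs exam_score: {positive:.2f}')
print(f'hours_studied vs tv_hours: {negative:.2f}')
```

The first coefficient comes out positive and the second negative, matching the two definitions.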

Setting Up a Dataset

Before finding correlations, we need data. Let's use a small sample dataset of five houses that records each one's price, size, number of bedrooms, and age.

Python
import pandas as pd

# Sample dataset creation
data = {
    'Price': [300000, 450000, 200000, 350000, 500000],
    'Size': [1500, 2000, 1000, 1700, 2200],
    'Bedrooms': [3, 4, 2, 3, 4],
    'Age': [20, 15, 40, 10, 5]
}

# Create DataFrame
houses = pd.DataFrame(data)

# Display the DataFrame
print(houses)

Output:

    Price  Size  Bedrooms  Age
0  300000  1500         3   20
1  450000  2000         4   15
2  200000  1000         2   40
3  350000  1700         3   10
4  500000  2200         4    5

Correlation Calculation

Once we have our data, we can find correlations using the corr method. Called on a DataFrame, it calculates the correlation coefficient between each pair of numeric columns, using the Pearson coefficient by default.

Python
# Finding the correlation between numerical variables
correlation_matrix = houses.corr()
print(correlation_matrix)

Output:

             Price      Size  Bedrooms       Age
Price     1.000000  0.993562  0.976221 -0.875890
Size      0.993562  1.000000  0.975000 -0.921651
Bedrooms  0.976221  0.975000  1.000000 -0.840511
Age      -0.875890 -0.921651 -0.840511  1.000000

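The corr method returns a correlation matrix: each cell holds the correlation coefficient for the two variables named by its row and column. The matrix is symmetric, and the diagonal is always 1 because every variable is perfectly correlated with itself.

Let's interpret the results. Each coefficient ranges from -1 to 1, and the closer it is to either end, the stronger the relationship: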

  • A value of 1 means perfect positive correlation.
  • A value of -1 means perfect negative correlation.
  • A value of 0 means no linear correlation.
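
To make these three reference values concrete, here is a minimal sketch using made-up series. The names x, y_up, y_down, and y_curve are illustrative and not part of the lesson's dataset. Note that a coefficient of 0 only rules out a straight-line relationship: y_curve is fully determined by x, yet its correlation with x is 0.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y_up = 2 * x + 1          # rises in a straight line with x
y_down = -3 * x + 10      # falls in a straight line as x rises
y_curve = (x - 3) ** 2    # U-shaped: depends on x, but not linearly

r_up = x.corr(y_up)
r_down = x.corr(y_down)
r_curve = x.corr(y_curve)

# Rounded to hide tiny floating-point differences
print(round(r_up, 6))
print(round(r_down, 6))
print(round(r_curve, 6))
```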

For example, Price and Size have a correlation coefficient of 0.99, a strong positive relationship: larger houses tend to have higher prices. Price and Bedrooms are also strongly positively correlated (0.98). Price and Age have a coefficient of -0.88, a strong negative relationship: in this dataset, newer houses tend to be more expensive. Keep in mind that correlation describes how variables move together; it does not show that one causes the other.
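
One way to read these relationships off the matrix, sketched here with the same sample data, is to select the Price column, drop its self-correlation, and sort what remains. The variable name price_correlations is just an illustrative choice.

```python
import pandas as pd

houses = pd.DataFrame({
    'Price': [300000, 450000, 200000, 350000, 500000],
    'Size': [1500, 2000, 1000, 1700, 2200],
    'Bedrooms': [3, 4, 2, 3, 4],
    'Age': [20, 15, 40, 10, 5],
})

correlation_matrix = houses.corr()

# Correlation of every other column with Price, strongest positive first
price_correlations = (
    correlation_matrix['Price']
    .drop('Price')
    .sort_values(ascending=False)
)
print(price_correlations.round(2))
```

With this data the order is Size, Bedrooms, Age, which matches the Price row of the matrix.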

Finding Correlation Between Two Columns

If you're interested in the correlation between just two columns instead of the entire dataset, you can call the corr method on one column and pass the other column as its argument.

Python
# Finding correlation between Price and Size
correlation_price_size = houses['Price'].corr(houses['Size'])
print(f'Correlation between Price and Size: {correlation_price_size}')

Output:

Correlation between Price and Size: 0.9935620234193304

This example shows a strong positive correlation between Price and Size, meaning larger houses tend to have higher prices. This method is useful when you want to focus on the relationship between specific pairs of variables.
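
For readers curious about what corr computes here, the following sketch reproduces the Pearson coefficient (the default method in Pandas) by hand: the sum of the products of each variable's deviations from its mean, divided by the square root of the product of their summed squared deviations. The helper name pearson is hypothetical, written only for this illustration.

```python
import numpy as np
import pandas as pd

houses = pd.DataFrame({
    'Price': [300000, 450000, 200000, 350000, 500000],
    'Size': [1500, 2000, 1000, 1700, 2200],
})

def pearson(x, y):
    """Pearson correlation computed directly from its definition."""
    dx = x - x.mean()   # deviations of x from its mean
    dy = y - y.mean()   # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

manual = pearson(houses['Price'], houses['Size'])
builtin = houses['Price'].corr(houses['Size'])

print(f'manual:  {manual:.6f}')
print(f'builtin: {builtin:.6f}')
```

Both values should agree with the coefficient reported above, about 0.9936.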

Handling Missing Data

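It's important to check for missing data before finding correlations, because missing values can affect the results. By default, corr skips missing values and uses only the rows where both columns have data, so different coefficients may be based on different subsets of rows. An alternative is to fill the gaps first, using one of the approaches covered in the corresponding lesson of the previous course. Let's recall one: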

Python
# Handling missing data (example with house prices)
# Fill missing values with the mean of each column
houses = houses.fillna(houses.mean())

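This line replaces any missing values in the DataFrame with the mean of their respective columns, so every row is included in the correlation calculation. Mean filling is an approximation, though: it can weaken the measured correlations, especially when many values are missing.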
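
The effect of this choice can be checked on a small example. The sketch below assumes, purely for illustration, that one Size value is missing from the sample data. It compares the coefficient Pandas computes by default, which skips rows where either value is missing, with the coefficient after mean filling.

```python
import numpy as np
import pandas as pd

# Same sample data, with one Size value removed for illustration
houses = pd.DataFrame({
    'Price': [300000, 450000, 200000, 350000, 500000],
    'Size': [1500, np.nan, 1000, 1700, 2200],
})

# Count missing values per column before deciding how to handle them
print(houses.isna().sum())

# Default behavior: corr ignores rows where either value is missing
r_skipped = houses['Price'].corr(houses['Size'])

# Mean imputation: the gap is filled with the mean of the remaining sizes
filled = houses.fillna(houses.mean())
r_filled = filled['Price'].corr(filled['Size'])

print(f'Skipping the incomplete row: {r_skipped:.3f}')
print(f'After filling with the mean: {r_filled:.3f}')
```

In this example the filled coefficient comes out lower (roughly 0.90 versus 0.99), because the filled-in house has a high price but only an average size. Which approach is appropriate depends on how much data is missing and why.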

Lesson Summary

In this lesson, we explored data correlation and its importance. We learned how to use the corr method in Pandas to find correlations and interpret the coefficients. We also covered handling missing data.

Now, it's time for hands-on practice! You'll apply what you've learned by finding correlations in different datasets. This will help you solidify your understanding and gain practical experience. Let's get started!
