Welcome to today's lesson on data correlation! Data correlation is crucial in data analysis as it helps us understand how different variables relate to each other. Our goal today is to learn how to find and interpret correlations in a dataset using the Pandas library in Python.
Imagine you're a detective trying to figure out if two clues are connected. Similarly, in data analysis, correlation helps us determine if two numerical variables have any relationship. By the end of this lesson, you'll be able to find these relationships in your dataset and understand what they mean.
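Let's start with what correlation is. Correlation is a statistical measure that describes how much two variables change together. Here are the two main types of correlation:
- Positive correlation: as one variable increases, the other tends to increase as well.
- Negative correlation: as one variable increases, the other tends to decrease.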
Isn't it fascinating how numbers tell stories? Let's dive in!
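To make the definition concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand (this is the coefficient Pandas calculates by default). The function name `pearson_r` and the study, exam, and gaming numbers are hypothetical, invented only for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two variables move together around their means
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(variation_x * variation_y)

# Hypothetical data: hours studied, exam score, and hours spent gaming
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 65, 70, 79]
hours_gaming = [9, 7, 6, 4, 2]

r_positive = pearson_r(hours_studied, exam_score)    # both rise together
r_negative = pearson_r(hours_studied, hours_gaming)  # one rises, the other falls

print(f'Studying vs. score:  {r_positive:.3f}')
print(f'Studying vs. gaming: {r_negative:.3f}')
```

The first value comes out close to 1 and the second close to -1, matching the two types of correlation described above.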
Before finding correlations, we need data. Let's use a simple dataset of house prices with information about different houses, their prices, sizes, the number of bedrooms, etc.
```python
import pandas as pd

# Sample dataset creation
data = {
    'Price': [300000, 450000, 200000, 350000, 500000],
    'Size': [1500, 2000, 1000, 1700, 2200],
    'Bedrooms': [3, 4, 2, 3, 4],
    'Age': [20, 15, 40, 10, 5]
}

# Create DataFrame
houses = pd.DataFrame(data)

# Display the DataFrame
print(houses)
```
Output:
```
    Price  Size  Bedrooms  Age
0  300000  1500         3   20
1  450000  2000         4   15
2  200000  1000         2   40
3  350000  1700         3   10
4  500000  2200         4    5
```
Once we have our data, we can find correlations using the `corr` method. This method calculates the correlation coefficient between each pair of columns.
```python
# Finding the correlation between numerical variables
correlation_matrix = houses.corr()
print(correlation_matrix)
```
Output:
```
             Price      Size  Bedrooms       Age
Price     1.000000  0.993562  0.976221 -0.875890
Size      0.993562  1.000000  0.975000 -0.921651
Bedrooms  0.976221  0.975000  1.000000 -0.840511
Age      -0.875890 -0.921651 -0.840511  1.000000
```
The `corr` method returns a correlation matrix that shows the correlation coefficients between each pair of variables.
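Because the correlation matrix is itself a DataFrame, individual values can be looked up by row and column label. The sketch below rebuilds the same sample data so it runs on its own; the variable names `price_size` and `price_ranking` are just illustrative choices.

```python
import pandas as pd

houses = pd.DataFrame({
    'Price': [300000, 450000, 200000, 350000, 500000],
    'Size': [1500, 2000, 1000, 1700, 2200],
    'Bedrooms': [3, 4, 2, 3, 4],
    'Age': [20, 15, 40, 10, 5]
})
correlation_matrix = houses.corr()

# Look up a single coefficient by row and column label
price_size = correlation_matrix.loc['Price', 'Size']
print(f'Price vs. Size: {price_size:.4f}')

# Rank the other columns by their correlation with Price
# (drop the Price-to-Price entry, which is always 1)
price_ranking = correlation_matrix['Price'].drop('Price').sort_values(ascending=False)
print(price_ranking)
```

The matrix is symmetric, so `correlation_matrix.loc['Size', 'Price']` gives the same value.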
Let's interpret the results. The values in the correlation matrix are called correlation coefficients, and they always fall between `-1` and `1`:
- `1` means perfect positive correlation.
- `-1` means perfect negative correlation.
- `0` means no linear correlation.
For example, `Price` and `Size` have a correlation coefficient of about `0.99`, meaning they have a strong positive relationship: larger houses tend to have higher prices. `Price` and `Bedrooms` (about `0.98`) follow the same pattern. There is also a strong negative correlation between `Price` and `Age` (about `-0.88`), meaning newer houses tend to be more expensive.
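When a matrix has many columns, it can help to turn the coefficients into words. The helper below is a rough sketch: the function name `describe_correlation` and the cutoffs of 0.7 and 0.3 are assumptions (common rules of thumb, not fixed standards), so adjust them to suit your field.

```python
import pandas as pd

def describe_correlation(r, strong=0.7, moderate=0.3):
    """Label a coefficient using rule-of-thumb cutoffs (assumed, not standard)."""
    if r == 0:
        return 'no linear correlation'
    direction = 'positive' if r > 0 else 'negative'
    size = abs(r)
    if size >= strong:
        strength = 'strong'
    elif size >= moderate:
        strength = 'moderate'
    else:
        strength = 'weak'
    return f'{strength} {direction}'

houses = pd.DataFrame({
    'Price': [300000, 450000, 200000, 350000, 500000],
    'Size': [1500, 2000, 1000, 1700, 2200],
    'Bedrooms': [3, 4, 2, 3, 4],
    'Age': [20, 15, 40, 10, 5]
})

# Describe how each column relates to Price
price_correlations = houses.corr()['Price'].drop('Price')
labels = {column: describe_correlation(r) for column, r in price_correlations.items()}
for column, label in labels.items():
    print(f'Price vs. {column}: {label}')
```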
If you're interested in the correlation between just two columns instead of the entire dataset, you can use the `corr` method directly on those columns.
```python
# Finding correlation between Price and Size
correlation_price_size = houses['Price'].corr(houses['Size'])
print(f'Correlation between Price and Size: {correlation_price_size}')
```
Output:
```
Correlation between Price and Size: 0.9935620234193304
```
This example shows a strong positive correlation between `Price` and `Size`, meaning larger houses tend to have higher prices. This method is useful when you want to focus on the relationship between specific pairs of variables.
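As a sanity check, the same pairwise coefficient can be computed with NumPy's `corrcoef` function. This is an optional cross-check rather than part of the lesson's workflow, and it assumes NumPy is installed alongside Pandas.

```python
import numpy as np
import pandas as pd

houses = pd.DataFrame({
    'Price': [300000, 450000, 200000, 350000, 500000],
    'Size': [1500, 2000, 1000, 1700, 2200]
})

# Pairwise correlation with Pandas
r_pandas = houses['Price'].corr(houses['Size'])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
r_numpy = np.corrcoef(houses['Price'], houses['Size'])[0, 1]

print(f'Pandas: {r_pandas:.6f}')
print(f'NumPy:  {r_numpy:.6f}')
```

Both approaches should agree up to floating-point rounding.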
It's important to handle missing data before finding correlations, as missing values can affect the results. One way to do this, covered in the corresponding lesson of the previous course, is to fill the gaps with each column's mean. Let's recall it:
```python
# Handling missing data (example with house prices)
houses = houses.fillna(houses.mean())  # Fill missing values with the mean of each column
```
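This line replaces any missing values in the DataFrame with the mean of their respective columns, so every row has a value for every column when the correlations are computed. Note that `corr` skips missing values pair by pair by default, so filling them in changes which rows are counted and can shift the coefficients.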
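To see how missing data plays out in practice, the sketch below removes one `Size` value from the sample data (a hypothetical gap introduced purely for illustration) and compares the correlation before and after filling it with the column mean.

```python
import numpy as np
import pandas as pd

# Sample data with one Size value deliberately missing (hypothetical gap)
houses = pd.DataFrame({
    'Price': [300000, 450000, 200000, 350000, 500000],
    'Size': [1500, np.nan, 1000, 1700, 2200],
    'Age': [20, 15, 40, 10, 5]
})

# Count missing values per column before deciding how to handle them
missing_counts = houses.isna().sum()
print(missing_counts)

# Without filling: the row with the missing Size is left out of the calculation
r_skipped = houses['Price'].corr(houses['Size'])

# With filling: the gap is replaced by the mean of the remaining Size values
filled = houses.fillna(houses.mean(numeric_only=True))
r_filled = filled['Price'].corr(filled['Size'])

print(f'Skipping the missing row: {r_skipped:.3f}')
print(f'Filling with the mean:    {r_filled:.3f}')
```

In this small example the filled-in version gives a noticeably lower coefficient, because the imputed size does not follow the price of that house. It is worth checking how much your results depend on the strategy you choose.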
In this lesson, we explored data correlation and its importance. We learned how to use the `corr` method in Pandas to find correlations and interpret the coefficients. We also covered handling missing data.
Now, it's time for hands-on practice! You'll apply what you've learned by finding correlations in different datasets. This will help you solidify your understanding and gain practical experience. Let's get started!