Welcome to another informative lesson. Today, we're diving deep into the domain of outliers: how to detect and handle them effectively using Python. As always, we'll use our Titanic dataset to illustrate these concepts.
Why are outliers significant, you might wonder? Outliers are anomalous or unusual values that significantly deviate from other observations. They can adversely impact the performance of our machine-learning models by introducing bias or skewness. Detecting outliers helps us maintain our dataset's integrity by ensuring all data falls within a reasonable range of values.
Going back to our Titanic example: what if some passengers had absurdly high ages, like 800, or an impossible fare of $50,000? We can't just ignore these anomalies. We must deal with them appropriately, ensuring our models learn from accurate, realistic data.
A commonly used method to detect outliers in a dataset is the Z-score method. Given a set of values, the Z-score of a value is the distance between that value and the dataset's mean, expressed in terms of the standard deviation.
A Z-score of 0 indicates that the data point equals the mean. A Z-score of 1.0 indicates a value one standard deviation above the mean, and a Z-score of -1.0 a value one standard deviation below it. Z-scores with a larger absolute value denote values farther from the mean (and potential outliers).
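As a quick worked example, the sketch below computes Z-scores for a small made-up list of numbers. The `z_scores` helper and the sample values are illustrative assumptions, not part of the Titanic analysis. It uses the population standard deviation (`ddof=0`) so the arithmetic is easy to check by hand; pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), so its results differ slightly.

```python
import numpy as np

def z_scores(values, ddof=0):
    """Return the Z-score of every value (hypothetical helper for illustration)."""
    arr = np.asarray(values, dtype=float)
    return (arr - arr.mean()) / arr.std(ddof=ddof)

# Made-up sample: the mean is 5 and the population standard deviation is 2
sample = [2, 4, 4, 4, 5, 5, 7, 9]
sample_z = z_scores(sample)
print(sample_z)
```

Here the value 9 sits two standard deviations above the mean (Z-score 2.0), while the two 5s equal the mean (Z-score 0).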
Let's use this method to detect any potential outliers in the `age` feature of our Titanic dataset. We take the absolute value of each Z-score, so ages far below the mean would be flagged just like ages far above it.
```python
import numpy as np
import pandas as pd
import seaborn as sns

# Load the dataset
titanic_df = sns.load_dataset('titanic')

# Calculate Z-scores
titanic_df['age_zscore'] = np.abs((titanic_df.age - titanic_df.age.mean()) / titanic_df.age.std())

# Get rows of outliers according to the Z-score method (using a threshold of 3)
outliers_zscore = titanic_df[(titanic_df['age_zscore'] > 3)]
print(outliers_zscore)
"""
     survived  pclass   sex   age  ...  embark_town  alive  alone  age_zscore
630         1       1  male  80.0  ...  Southampton    yes   True    3.462699
851         0       3  male  74.0  ...  Southampton     no   True    3.049660

[2 rows x 16 columns]
"""
```
In the code snippet above, each Z-score is the distance between an `age` value and the mean age (`titanic_df.age.mean()`), expressed in standard deviations (`titanic_df.age.std()`). We add the results as a new column, `age_zscore`, to our dataframe. Values above 3 are treated as potential outliers. Rows with a missing age get a missing Z-score and are never flagged.
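The same logic can be wrapped in a small reusable function. The following is a minimal sketch, not the lesson's own code: the name `flag_zscore_outliers` and the synthetic ages (including the impossible value 800 mentioned earlier) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def flag_zscore_outliers(series, threshold=3.0):
    """Return a boolean mask marking values whose |Z-score| exceeds threshold.

    Hypothetical helper: missing values get a NaN Z-score and are never flagged.
    """
    z = (series - series.mean()) / series.std()  # pandas skips NaN in mean/std
    return z.abs() > threshold

# Synthetic ages: 21..40, one impossible value, and one missing value
ages = pd.Series(list(range(21, 41)) + [800, np.nan], dtype=float)
mask = flag_zscore_outliers(ages)
print(ages[mask])
```

Only the value 800 is flagged. The missing entry is left alone because a comparison with NaN evaluates to False.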
Another method to detect outliers is the Interquartile Range (IQR) method. The IQR is the range between the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile). An outlier is any value that falls below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`.
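To make the rule concrete, here is a small sketch on made-up numbers. The `iqr_bounds` helper and its `k` parameter are hypothetical names introduced for this example. It relies on pandas' default linear interpolation in `Series.quantile`.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) outlier fences (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values: 1..9 plus one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper = iqr_bounds(values)
print(lower, upper)
print(values[(values < lower) | (values > upper)])
```

For this sample, Q1 is 3.25 and Q3 is 7.75, so the IQR is 4.5 and the fences land at -3.5 and 14.5. Only the value 100 falls outside them.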
Let's detect outliers in the `age` column of the Titanic dataset using this method:
```python
# Calculate IQR
Q1 = titanic_df['age'].quantile(0.25)
Q3 = titanic_df['age'].quantile(0.75)
IQR = Q3 - Q1

# Define Bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Get rows of outliers according to IQR method
outliers_iqr = titanic_df[(titanic_df['age'] < lower_bound) | (titanic_df['age'] > upper_bound)]
print(outliers_iqr)
"""
     survived  pclass   sex   age  ...  embark_town  alive  alone  age_zscore
33          0       2  male  66.0  ...  Southampton     no   True    2.498943
54          0       1  male  65.0  ...    Cherbourg     no  False    2.430103
96          0       1  male  71.0  ...    Cherbourg     no   True    2.843141
116         0       3  male  70.5  ...   Queenstown     no   True    2.808721
280         0       3  male  65.0  ...   Queenstown     no   True    2.430103
456         0       1  male  65.0  ...  Southampton     no   True    2.430103
493         0       1  male  71.0  ...    Cherbourg     no   True    2.843141
630         1       1  male  80.0  ...  Southampton    yes   True    3.462699
672         0       2  male  70.0  ...  Southampton     no   True    2.774301
745         0       1  male  70.0  ...  Southampton     no  False    2.774301
851         0       3  male  74.0  ...  Southampton     no   True    3.049660

[11 rows x 16 columns]
"""
```
Here, we first calculate `Q1` and `Q3`, the 25th and 75th percentiles of the `age` field, respectively. The `IQR` is simply the difference between `Q3` and `Q1`. Outliers are defined as any age below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`. This method flags 11 rows, compared with 2 for the Z-score method with a threshold of 3.
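The two methods need not agree. The sketch below, which uses made-up values rather than the Titanic data, runs both on the same small sample. In a sample this small, a single extreme value inflates both the mean and the standard deviation, so its own Z-score stays below 3, while the quartile-based fences are barely affected.

```python
import pandas as pd

# Made-up values: 1..9 plus one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Z-score method with a threshold of 3
z = (values - values.mean()) / values.std()
zscore_flagged = values[z.abs() > 3]

# IQR method with the usual 1.5 multiplier
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_flagged = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Number of values flagged by each method
print(len(zscore_flagged), len(iqr_flagged))
```

On this sample the Z-score method flags nothing, while the IQR method flags the value 100.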
After identifying outliers, you'll have to decide what to do with them—whether to keep them, discard them, or modify them. Regardless of how you identify outliers, applying the most suitable handling technique is crucial.
In data cleaning, there's no one-size-fits-all rule when it comes to dealing with outliers—your decision should depend on the dataset and the specific problem you're working on. Sometimes, removing outliers can improve your model's accuracy. Other times, outliers might be crucial, and removing them could lead to inaccurate models or conclusions.
You might deal with outliers by dropping them or by replacing them. Dropping looks like this:
```python
# Using the Z-score method
titanic_df = titanic_df[titanic_df['age_zscore'] <= 3]

# Using the IQR method
titanic_df = titanic_df[(titanic_df['age'] >= lower_bound) & (titanic_df['age'] <= upper_bound)]
```
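Here, we exclude rows where the age lies in the outlier zone according to the chosen outlier detection method. Note that both filters also drop rows with a missing age, because comparisons with NaN evaluate to False.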
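If rows with a missing age should survive the filter, one option is to keep them explicitly. This is a minimal sketch on a synthetic frame standing in for the Titanic data. The values and the `keep` and `cleaned` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic frame standing in for the Titanic data (values are made up)
df = pd.DataFrame({"age": [22.0, 35.0, np.nan, 41.0, 29.0, 800.0, 30.0, 27.0]})

q1, q3 = df["age"].quantile(0.25), df["age"].quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep rows that are inside the bounds OR have no age recorded
keep = df["age"].between(lower_bound, upper_bound) | df["age"].isna()
cleaned = df[keep].copy()
print(cleaned)
```

The row with age 800 is removed, while the row with a missing age stays and can be handled separately, for example by imputation.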
Replacing looks like this:

```python
# using mean
titanic_df.loc[titanic_df['age_zscore'] > 3, 'age'] = titanic_df['age'].mean()

# using median
titanic_df.loc[(titanic_df['age'] < lower_bound) | (titanic_df['age'] > upper_bound), 'age'] = titanic_df['age'].median()
```
In these examples, outliers are replaced by the mean or median value of the `age` column. Which replacement value to use depends on the particularities of your dataset. The median is often the safer choice because, unlike the mean, it is barely affected by the outliers themselves.
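Another way to modify outliers, not covered in the examples above, is capping: values beyond the IQR fences are pulled back to the nearest fence instead of being replaced by a central value. The sketch below uses pandas' `Series.clip` on made-up ages and is only one possible approach.

```python
import pandas as pd

# Made-up ages with one impossible value
ages = pd.Series([22.0, 35.0, 41.0, 29.0, 800.0, 30.0, 27.0])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values outside the fences; values inside them are left unchanged
capped = ages.clip(lower=lower_bound, upper=upper_bound)
print(capped.tolist())
```

Here the fences are 13 and 53, so the value 800 becomes 53 and every other age is untouched.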
Congratulations! Now, you know how to identify and handle outliers in a dataset using Python. You've also got a glimpse of how these skills apply to real-world problems, like improving accuracy for machine learning models.
Remember, handling outliers is more of an art than a science. Your strategies will largely depend on your data and the problem you're trying to solve.
Note that outliers are not always 'bad' or 'undesirable'. In certain scenarios, outliers can provide significant and meaningful insights into the matter you're investigating. It is crucial to consider their effect on your specific task and process them accordingly.
Having absorbed all the concepts, you're ready to delve into some hands-on practice to cement your learning. Remember the golden rule of mastering anything — 'Practice makes perfect.'