Welcome, friends! Today, we're tackling "Data Binning," a key data preprocessing technique that sorts raw values into a manageable number of groups. We'll learn what data binning is, understand why it matters, and implement it using the Pandas library.
Imagine a shopkeeper sorting different types of fruit into separate baskets. That’s much like what binning is. In data preprocessing, binning converts continuous values into categorical bins or groups, thus simplifying data analysis.
Datasets with numerous variables can contain complex relationships that may distort analysis results. Binning groups similar values together, which simplifies the dataset and reduces the impact of small errors in individual observations. It can also help when handling missing values and when reducing the effect of outliers.
Pandas offers functions such as `cut()` and `qcut()` for binning purposes. Let's dive into a practical example.
```python
import pandas as pd

# Sample data: one column of ages
df = pd.DataFrame({'ages': [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]})

# Bin edges and one label per interval between consecutive edges
bins = [17, 25, 35, 60, 100]
labels = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
df['categories'] = pd.cut(df['ages'], bins, labels=labels)
```
In the example above, we used the `pd.cut()` function to divide a set of ages into distinct age groups, or bins. This approach maps a wide range of ages onto a small number of groups, simplifying data analysis. In this particular case, ages 18-25 go in the "Youth" bin, ages 26-35 in the "YoungAdult" bin, and so on.
When using the `pd.cut()` function, note that the bins it creates are right-closed intervals by default: they include their right endpoint but exclude their left endpoint. So, in our example, an age of 25 falls into the "Youth" bin (17, 25], not into the "YoungAdult" bin (25, 35].
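To see what the closed side changes, here is a minimal sketch that bins three ages sitting exactly on the edges. It relies on the `right` parameter of `pd.cut()`, which this lesson does not otherwise use; passing `right=False` makes the intervals left-closed, such as [25, 35). The `boundary_ages` series is made up for illustration.

```python
import pandas as pd

# Same edges and labels as in the lesson's example.
bins = [17, 25, 35, 60, 100]
labels = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

# Hypothetical ages chosen to sit exactly on the bin edges.
boundary_ages = pd.Series([25, 35, 60])

# Default (right=True): intervals are (17, 25], (25, 35], ...
right_closed = pd.cut(boundary_ages, bins, labels=labels)

# right=False: intervals become [17, 25), [25, 35), ...
left_closed = pd.cut(boundary_ages, bins, labels=labels, right=False)

print(list(right_closed))  # ['Youth', 'YoungAdult', 'MiddleAged']
print(list(left_closed))   # ['YoungAdult', 'MiddleAged', 'Senior']
```

Which side to close is a modeling choice; the only thing that matters is knowing which bin an edge value ends up in.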
In the dataset, each age is now mapped to a category via the "categories" column.
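Not every value necessarily gets a category. The sketch below uses a few made-up ages to show what `pd.cut()` does with values it cannot place: anything outside the edges, or missing to begin with, comes back as a missing category.

```python
import numpy as np
import pandas as pd

bins = [17, 25, 35, 60, 100]
labels = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

# Hypothetical ages: 16 lies below the lowest edge, 17 sits exactly on it
# (excluded, because intervals are right-closed by default), 18 is a
# regular value, and np.nan is a missing value.
ages = pd.Series([16, 17, 18, np.nan])
categories = pd.cut(ages, bins, labels=labels)

print(categories.isna().tolist())  # [True, True, False, True]
```

So it's worth checking that the edges cover the full range of the data before relying on the resulting column.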
Let's print out the resulting bins by choosing all the ages for each category separately:
```python
# Iterate over the categories in their defined order (a set has no guaranteed order).
for category in df['categories'].cat.categories:
    print(f"{category}: {list(df[df['categories'] == category]['ages'])}")

'''Output:
Youth: [20, 22, 25, 21, 23]
YoungAdult: [27, 31, 32]
MiddleAged: [37, 45, 41]
Senior: [61]
'''
```
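As an alternative to filtering the DataFrame once per category, the same listing can be sketched with `groupby()`. This is a different technique from the loop above, not a replacement for it; `observed=True` is used here so that only categories that actually occur are listed.

```python
import pandas as pd

df = pd.DataFrame({'ages': [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]})
bins = [17, 25, 35, 60, 100]
labels = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
df['categories'] = pd.cut(df['ages'], bins, labels=labels)

# Collect the ages of each category into a list, in category order.
ages_by_category = df.groupby('categories', observed=True)['ages'].apply(list)

for category, ages in ages_by_category.items():
    print(f"{category}: {ages}")
```

Grouping scans the data once, which tends to matter more as the number of categories grows.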
Now let's consider an example of the `qcut()` function.
```python
df['categories'] = pd.qcut(df['ages'], q=4)
```
Unlike the `cut()` function, which uses the bin edges we supply, the `qcut()` function chooses the edges from the quantiles of the data so that each bin contains nearly the same number of observations.
Let's print it using the same method:
```python
'''Output:
(19.999, 22.75]: [20, 22, 21]
(22.75, 29.0]: [25, 27, 23]
(29.0, 38.0]: [37, 31, 32]
(38.0, 61.0]: [61, 45, 41]
'''
```
As you can see, the bin boundaries are placed so that all the bins contain the same number of values: here, three of the twelve ages each.
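The difference between the two functions is easiest to see by counting the values per bin. Here is a small sketch on the same ages; the variable names `cut_counts` and `qcut_counts` are just for illustration.

```python
import pandas as pd

ages = pd.Series([20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32])

# Fixed edges (pd.cut): bin sizes depend on where the data happens to fall.
cut_counts = pd.cut(ages, [17, 25, 35, 60, 100]).value_counts().sort_index()

# Quantile-based edges (pd.qcut): edges are chosen to balance the counts.
qcut_counts = pd.qcut(ages, q=4).value_counts().sort_index()

print(cut_counts.tolist())   # [5, 3, 3, 1]
print(qcut_counts.tolist())  # [3, 3, 3, 3]
```

With `cut()` the intervals are fixed and the counts vary; with `qcut()` the counts are balanced and the intervals vary.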
As with the `cut()` function, you can specify labels for the bins created by `qcut()`.
```python
labels = ["Q1", "Q2", "Q3", "Q4"]
df['quartile_categories'] = pd.qcut(df['ages'], q=4, labels=labels)
```
In this example, we divided the ages into four bins holding an equal number of values (quartiles) and labeled them Q1, Q2, Q3, and Q4. The labels make it easier to see where each bin sits in the data distribution.
We can print this with:
```python
# The labels are plain strings, so sorting them gives Q1..Q4 in order.
for category in sorted(set(df['quartile_categories'])):
    print(f"{category}: {list(df[df['quartile_categories'] == category]['ages'])}")

'''Output:
Q1: [20, 22, 21]
Q2: [25, 27, 23]
Q3: [37, 31, 32]
Q4: [61, 45, 41]
'''
```
As you can see, providing labels to bins makes the quartiles easier to understand and interpret. It's beneficial when dealing with data where the quartiles have a specific meaning or significance.
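One trade-off of labels is that the numeric boundaries no longer appear in the output. If you still need them, a minimal sketch is to ask `qcut()` for the edges with its `retbins` parameter, which this lesson has not used so far; the `quartiles` and `edges` names are illustrative.

```python
import pandas as pd

ages = pd.Series([20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32])
labels = ["Q1", "Q2", "Q3", "Q4"]

# retbins=True returns the computed bin edges alongside the labeled result.
quartiles, edges = pd.qcut(ages, q=4, labels=labels, retbins=True)

# Pair each label with the lower and upper edge of its bin.
for label, low, high in zip(labels, edges[:-1], edges[1:]):
    print(f"{label}: {low} to {high}")
```

The inner edges should match the interval boundaries from the unlabeled `qcut()` output: 22.75, 29.0, and 38.0.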
Excellent! We've covered data binning, understood its importance, and implemented it using Pandas. This knowledge lays a solid foundation for your data preprocessing journey. Next come the exercises, where you'll apply your new skills to different datasets and build proficiency through practice. Happy coding!