Lesson 5

Welcome, friends! Today, we're tackling **"Data Binning,"** a key data preprocessing technique that categorizes raw data into manageable groups. We aim to learn the concept of data binning, understand its significance, and implement it using the `Pandas`

library.

Imagine a shopkeeper sorting different types of fruit into separate baskets. That’s much like what binning is. In data preprocessing, binning converts continuous values into categorical bins or groups, thus simplifying data analysis.

Datasets with numerous variables can lead to complex relationships that may distort analysis results. Binning groups similar data together, simplifying the dataset and reducing the impact of individual observation errors. It's indispensable for handling missing values and reducing outlier effects.

`Pandas`

offers functions such as `cut()`

and `qcut()`

for binning purposes. Let's dive into a practical example.

Python`1import pandas as pd 2 3 4df = pd.DataFrame({'ages': [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]}) 5 6bins = [17, 25, 35, 60, 100] 7labels = ["Youth", "YoungAdult", "MiddleAged", "Senior"] 8df['categories'] = pd.cut(df['ages'], bins, labels=labels)`

In the example above, we utilized the `pd.cut()`

function to divide a set of ages into distinct age groups or bins. This approach allows us to categorize a wide range of ages into a selected age group, simplifying data analysis. In this particular case, we have ages `18-25`

in the `"Youth"`

bin, ages `26-35`

in the `"YoungAdult"`

bin, and so on.

While using the `pd.cut()`

function, it's noteworthy that the bins we create are generally right-closed intervals. It means they include their right endpoint but exclude their left endpoint. So, in our example, an age of `25`

falls into the "Youth" bin `(17;25]`

, not in the "YoungAdult" bin `(25;35]`

.

The way it looks in the dataset is that each age is mapped to a category via the `"categories"`

column.

Let's print out the resulting bins by choosing all the ages for each category separately:

Python`1for category in set(df['categories']): 2 print(f"{category}: {list(df[df['categories'] == category]['ages'])}") 3 4'''Output: 5Youth: [20, 22, 25, 21, 23] 6YoungAdult: [27, 31, 32] 7MiddleAged: [37, 45, 41] 8Senior: [61] 9'''`

Now let's consider an example of the `qcut()`

function.

Python`1df['categories'] = pd.qcut(df['ages'], q=4)`

Unlike the `cut()`

function, the `qcut()`

function aims to divide the data into bins such that each bin contains nearly the same number of observations.

Let's print it using the same method:

Python`1'''Output: 2(22.75, 29.0]: [25, 27, 23] 3(19.999, 22.75]: [20, 22, 21] 4(38.0, 61.0]: [61, 45, 41] 5(29.0, 38.0]: [37, 31, 32] 6'''`

As you can see, bins' boundaries are adjusted so that all the bins contain the same number of values.

Like with the `cut()`

function, you can specify labels for the bins created by `qcut()`

.

Python`1labels = ["Q1", "Q2", "Q3", "Q4"] 2df['quartile_categories'] = pd.qcut(df['ages'], q=4, labels=labels)`

In this example, we have divided the ages into 4 equal-sized bins (quartiles), and we have labeled these quartiles as Q1, Q2, Q3, and Q4. The labels make understanding each bin's place in data distribution easier.

We can print this with:

Python`1for category in sorted(set(df['quartile_categories'])): 2 print(f"{category}: {list(df[df['quartile_categories'] == category]['ages'])}") 3 4'''Output: 5Q1: [20, 22, 21] 6Q2: [25, 23, 27] 7Q3: [31, 37, 32] 8Q4: [45, 41, 61] 9'''`

As you can see, providing labels to bins makes the quartiles easier to understand and interpret. It's beneficial when dealing with data where the quartiles have a specific meaning or significance.

Excellent! We’ve covered data binning, understood its importance, and implemented it using `Pandas`

. This knowledge lays a solid foundation for your data preprocessing journey. Let's fast-forward to the exercises, where you'll apply your newly acquired ability to different datasets. By practicing, you'll consolidate your understanding and improve your proficiency in data binning. Let's move on to the exercises! Happy coding!