Mastering Sorting Values within Pandas DataFrame

Advanced Preprocessing and Collecting Techniques, Lesson 4

Topic Overview and Actualization

Today, we will explore sorting within a DataFrame using Python's pandas. We will delve into the sort_values() function, covering single and multi-column sorting and handling missing values.

Sample Dataset

Let's consider this small dataset containing statistics of basketball players:

Python
import pandas as pd

df = pd.DataFrame({
    'Player': ['L. James', 'K. Durant', 'M. Jordan', 'S. Curry', 'K. Bryant'],
    'Points': [27.0, 26.0, 32.0, 24.0, 26.0],
    'Assists': [5.7, 4.7, 4.2, 6.6, 7.4]
})

Note that there is a tie in Points between "K. Durant" and "K. Bryant".

Learning How to Sort

We can sort DataFrame values using the sort_values() function.

Python
sorted_df = df.sort_values(by='Points', ascending=False)
print(sorted_df)
'''Output:
      Player  Points  Assists
2  M. Jordan    32.0      4.2
0   L. James    27.0      5.7
1  K. Durant    26.0      4.7
4  K. Bryant    26.0      7.4
3   S. Curry    24.0      6.6
'''

In this example, the DataFrame is sorted by the Points column in descending order: the by parameter names the column to sort on, and ascending=False reverses the default ascending order. This lists the most successful players first in terms of average points scored.
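As a small complementary sketch (using the same sample data, with Assists chosen here only for illustration), the snippet below relies on two defaults of sort_values(): ascending is True unless stated otherwise, and the method returns a new DataFrame rather than modifying the original.

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['L. James', 'K. Durant', 'M. Jordan', 'S. Curry', 'K. Bryant'],
    'Points': [27.0, 26.0, 32.0, 24.0, 26.0],
    'Assists': [5.7, 4.7, 4.2, 6.6, 7.4]
})

# ascending defaults to True, so this orders players from fewest to most assists
by_assists = df.sort_values(by='Assists')
print(by_assists['Player'].tolist())

# sort_values() returned a new DataFrame; df keeps its original row order
print(df['Player'].tolist())
```

Assigning the result to a new name, as the lesson does with sorted_df, keeps the unsorted data available for later steps.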

Sorting by Multiple Columns

In the previous example, the tied players "K. Durant" and "K. Bryant" appear in their original index order, with the lower index first. pandas does not guarantee this in general, because the default sorting algorithm (quicksort) is not stable. A more reasonable approach is to resolve ties explicitly by another characteristic of the players, for example by putting players with higher Assists first.

It is possible by providing multiple sorting columns.

Python
sorted_df = df.sort_values(by=['Points', 'Assists'], ascending=False)
print(sorted_df)
'''Output:
      Player  Points  Assists
2  M. Jordan    32.0      4.2
0   L. James    27.0      5.7
4  K. Bryant    26.0      7.4
1  K. Durant    26.0      4.7
3   S. Curry    24.0      6.6
'''

In this example, the DataFrame is sorted by column Points and any ties are resolved by the column Assists.
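If the goal is instead to keep tied rows in their original order, one option is the kind parameter of sort_values(). The sketch below assumes a single-column sort (the pandas documentation notes that kind is only applied in that case) and requests a stable algorithm, so rows with equal Points keep their relative order from the input.

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['L. James', 'K. Durant', 'M. Jordan', 'S. Curry', 'K. Bryant'],
    'Points': [27.0, 26.0, 32.0, 24.0, 26.0],
    'Assists': [5.7, 4.7, 4.2, 6.6, 7.4]
})

# kind='stable' asks for a stable sorting algorithm: rows with equal Points
# stay in the same relative order they had in df (index 1 before index 4)
stable_df = df.sort_values(by='Points', ascending=False, kind='stable')
print(stable_df)
```

A second sort column, as shown above, is usually the clearer choice when there is a meaningful tie-breaker; a stable sort is mainly useful when the existing row order already carries meaning.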

Sorting by Multiple Columns in Different Order

Let's alter the behavior and handle ties by sorting players alphabetically:

Python
sorted_df = df.sort_values(by=['Points', 'Player'], ascending=False)
print(sorted_df)
'''Output:
      Player  Points  Assists
2  M. Jordan    32.0      4.2
0   L. James    27.0      5.7
1  K. Durant    26.0      4.7
4  K. Bryant    26.0      7.4
3   S. Curry    24.0      6.6
'''

Because ascending=False applies to both columns, player names are also sorted in descending order, which puts tied players in reverse alphabetical order. To fix this, we can pass a list to ascending with one value per column in by, so that 'Points' and 'Player' are sorted in different directions:

Python
sorted_df = df.sort_values(by=['Points', 'Player'], ascending=[False, True])
print(sorted_df)
'''Output:
      Player  Points  Assists
2  M. Jordan    32.0      4.2
0   L. James    27.0      5.7
4  K. Bryant    26.0      7.4
1  K. Durant    26.0      4.7
3   S. Curry    24.0      6.6
'''

'Points' is still sorted in descending order, but 'Player' is now sorted in ascending order.
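The overview also mentions missing values. As a minimal sketch, the snippet below uses the na_position parameter of sort_values(), which controls whether rows with NaN in the sort column go last (the default) or first. Note that the missing value is an assumption made for illustration: one Points entry from the sample data is blanked out artificially, and df_missing is a hypothetical name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Player': ['L. James', 'K. Durant', 'M. Jordan', 'S. Curry', 'K. Bryant'],
    'Points': [27.0, 26.0, 32.0, 24.0, 26.0],
    'Assists': [5.7, 4.7, 4.2, 6.6, 7.4]
})

# Hypothetical: blank out one value purely to illustrate NaN handling
df_missing = df.copy()
df_missing.loc[3, 'Points'] = np.nan

# Default na_position='last': the NaN row goes to the end, even when descending
nan_last = df_missing.sort_values(by='Points', ascending=False)
print(nan_last)

# na_position='first' moves rows with a missing sort value to the top
nan_first = df_missing.sort_values(by='Points', ascending=False, na_position='first')
print(nan_first)
```

In both calls the rows with actual values are still ordered by Points in descending order; only the placement of the row with the missing value changes.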

Lesson Summary

Great job! You've revisited pandas DataFrames and learned how to sort their values with the sort_values() function, by a single column or by multiple columns, with a separate sort direction for each column when needed. Now, get ready to hone these skills with some coding exercises. Happy coding!