Today, we will explore sorting within a DataFrame using Python's `pandas`. We will delve into the `sort_values()` method, covering single and multi-column sorting and handling missing values.
Let's consider this small dataset containing statistics of basketball players:
```python
import pandas as pd

# Per-player scoring and assist averages
df = pd.DataFrame({
    'Player': ['L. James', 'K. Durant', 'M. Jordan', 'S. Curry', 'K. Bryant'],
    'Points': [27.0, 26.0, 32.0, 24.0, 26.0],
    'Assists': [5.7, 4.7, 4.2, 6.6, 7.4]
})
```
Note that there is a tie in `Points` between "K. Durant" and "K. Bryant".
We can sort DataFrame values using the `sort_values()` method.
```python
# Sort by Points, highest first
sorted_df = df.sort_values(by='Points', ascending=False)
print(sorted_df)
'''Output:
      Player  Points  Assists
2  M. Jordan    32.0      4.2
0   L. James    27.0      5.7
1  K. Durant    26.0      4.7
4  K. Bryant    26.0      7.4
3   S. Curry    24.0      6.6
'''
```
In this example, the DataFrame is sorted by column `Points` in descending order using the `by` and `ascending` parameters. Thus, we can list the most successful players in terms of average points scored.
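To make the role of `by` and `ascending` concrete, here is a minimal sketch that picks the two highest scorers. It relies on two documented behaviors: `ascending` defaults to `True`, and `sort_values()` returns a new DataFrame rather than changing the original. The variable name `top2` is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['L. James', 'K. Durant', 'M. Jordan', 'S. Curry', 'K. Bryant'],
    'Points': [27.0, 26.0, 32.0, 24.0, 26.0],
    'Assists': [5.7, 4.7, 4.2, 6.6, 7.4]
})

# Sort descending by Points and keep the first two rows
top2 = df.sort_values(by='Points', ascending=False).head(2)
print(top2['Player'].tolist())  # ['M. Jordan', 'L. James']

# The original DataFrame keeps its row order
print(df['Player'].tolist()[0])  # 'L. James'
```

If the sorted order should replace the original, reassigning the result (`df = df.sort_values(...)`) is usually the clearest option.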
In the previous example, the tied players "K. Durant" and "K. Bryant" appear in their original index order, with the lower index first. Note that the default sorting algorithm (`kind='quicksort'`) does not guarantee this, so the order of tied rows should not be relied on. Do you agree that a more reasonable decision would be to resolve ties explicitly by another characteristic, for example by putting players with a higher `Assists` value first? This is possible by passing multiple sorting columns.
```python
# Sort by Points first; break ties using Assists (both descending)
sorted_df = df.sort_values(by=['Points', 'Assists'], ascending=False)
print(sorted_df)
'''Output:
      Player  Points  Assists
2  M. Jordan    32.0      4.2
0   L. James    27.0      5.7
4  K. Bryant    26.0      7.4
1  K. Durant    26.0      4.7
3   S. Curry    24.0      6.6
'''
```
In this example, the DataFrame is sorted by column `Points`, and any ties are resolved by the column `Assists`.
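When no tie-breaking column is appropriate and tied rows should simply keep their original relative order, one option is to request a stable sorting algorithm. The sketch below assumes a single-column sort and uses the `kind='stable'` option of `sort_values()`; the variable name `stable_df` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['L. James', 'K. Durant', 'M. Jordan', 'S. Curry', 'K. Bryant'],
    'Points': [27.0, 26.0, 32.0, 24.0, 26.0],
    'Assists': [5.7, 4.7, 4.2, 6.6, 7.4]
})

# A stable sort keeps tied rows in the order they had before sorting,
# so K. Durant (index 1) stays ahead of K. Bryant (index 4)
stable_df = df.sort_values(by='Points', ascending=False, kind='stable')
print(stable_df['Player'].tolist())
```

According to the pandas documentation, the `kind` option only applies when sorting on a single column or label, so it complements rather than replaces multi-column sorting.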
Let's alter the behavior and handle ties by sorting players alphabetically:
```python
# Sort by Points, then by Player name (both descending)
sorted_df = df.sort_values(by=['Points', 'Player'], ascending=False)
print(sorted_df)
'''Output:
      Player  Points  Assists
2  M. Jordan    32.0      4.2
0   L. James    27.0      5.7
1  K. Durant    26.0      4.7
4  K. Bryant    26.0      7.4
3   S. Curry    24.0      6.6
'''
```
Because `ascending=False` applies to every column in `by`, player names are also sorted in descending order, which results in reverse alphabetical order. To fix this, we can pass a list to `ascending` with one value per column, so that `'Points'` and `'Player'` are sorted in different directions:
```python
# Points descending, Player ascending (alphabetical) for ties
sorted_df = df.sort_values(by=['Points', 'Player'], ascending=[False, True])
print(sorted_df)
'''Output:
      Player  Points  Assists
2  M. Jordan    32.0      4.2
0   L. James    27.0      5.7
4  K. Bryant    26.0      7.4
1  K. Durant    26.0      4.7
3   S. Curry    24.0      6.6
'''
```
`'Points'` is still sorted in descending order, but `'Player'` is now sorted in ascending order.
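Missing values are the remaining case to consider. The following is a minimal sketch, assuming a made-up player ("Rookie X", not part of the dataset above) whose `Points` value is missing. It uses the `na_position` parameter of `sort_values()`, which accepts `'first'` or `'last'` and defaults to `'last'`.

```python
import pandas as pd

# Hypothetical data: 'Rookie X' is invented for illustration and has no Points value
df_na = pd.DataFrame({
    'Player': ['L. James', 'K. Durant', 'Rookie X'],
    'Points': [27.0, 26.0, float('nan')],
    'Assists': [5.7, 4.7, 3.1]
})

# Default behavior: rows with missing values go last, even when sorting descending
na_last = df_na.sort_values(by='Points', ascending=False)
print(na_last['Player'].tolist())   # ['L. James', 'K. Durant', 'Rookie X']

# na_position='first' moves rows with missing values to the top instead
na_first = df_na.sort_values(by='Points', ascending=False, na_position='first')
print(na_first['Player'].tolist())  # ['Rookie X', 'L. James', 'K. Durant']
```

Note that `na_position` is independent of `ascending`: the sort direction only affects the rows that actually have a value.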
Great job! You've revisited pandas DataFrames, practiced sorting DataFrame values with the `sort_values()` method, and learned about sorting by single or multiple columns. Now, get ready to hone these skills with some coding exercises. Happy coding!