Sorting Data Frames in R: Mastering the Order Function

Lesson 4

Topic Overview

Today, we will look at how to sort data in an R data.frame. The focus will be on the order() function, covering both single- and multi-column sorting and how to handle ties in our data.

Sample Dataset

Let's consider the following concise dataset of basketball players and their stats:

R
df <- data.frame(
  Player = c('L. James', 'K. Durant', 'M. Jordan', 'S. Curry', 'K. Bryant'),
  Points = c(27.0, 26.0, 32.0, 24.0, 26.0),
  Assists = c(5.7, 4.7, 4.2, 6.6, 7.4)
)

In this dataset, we observe a tie in the Points column between 'K. Durant' and 'K. Bryant'.

Learning How to Sort

We can sort the values in a DataFrame using the order() function in R.

R
sorted_df <- df[order(-df$Points), ]
print(sorted_df)


     Player Points Assists
3 M. Jordan     32     4.2
1  L. James     27     5.7
2 K. Durant     26     4.7
5 K. Bryant     26     7.4
4  S. Curry     24     6.6

This code sorts the DataFrame by the Points column in descending order. By default, order() returns the row positions that arrange its argument in ascending order, so negating the numeric Points values reverses the result. Rows that tie, such as 'K. Durant' and 'K. Bryant', keep their original relative order. Also note the comma after the order() call: it is part of the indexing. We index rows by order(-df$Points), and the column index after the comma is left empty, meaning we select all the columns.

Now, we can easily identify the players with the highest average points scored.
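To see what order() contributes on its own, here is a minimal sketch that applies it to a plain vector holding the same Points values. The names points and idx are introduced here only for illustration.

```r
# Points values from the lesson's dataset
points <- c(27.0, 26.0, 32.0, 24.0, 26.0)

# order() returns row positions, not the sorted values themselves
idx <- order(-points)
print(idx)          # 3 1 2 5 4

# Indexing with those positions gives the values in descending order
print(points[idx])  # 32 27 26 26 24
```

The positions 3, 1, 2, 5, 4 are the same row labels that appear on the left of the sorted output above.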

Sorting by Multiple Columns

When values tie, order() accepts additional arguments that act as tie-breakers: rows that tie on the first argument are ordered by the second, and so on. Let's resolve the tie between 'K. Durant' and 'K. Bryant' using the Assists column.

R
sorted_df <- df[order(-df$Points, -df$Assists), ]
print(sorted_df)


     Player Points Assists
3 M. Jordan     32     4.2
1  L. James     27     5.7
5 K. Bryant     26     7.4
2 K. Durant     26     4.7
4  S. Curry     24     6.6

In this code, the DataFrame is sorted first by the Points column, then by the Assists column. The negative sign indicates descending order for both columns.
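When every column should be sorted in descending order, one alternative to negating each column is order()'s decreasing argument. The sketch below rebuilds the lesson's dataset and checks that both approaches give the same row order. The names idx_minus and idx_flag are introduced here only for illustration.

```r
df <- data.frame(
  Player = c('L. James', 'K. Durant', 'M. Jordan', 'S. Curry', 'K. Bryant'),
  Points = c(27.0, 26.0, 32.0, 24.0, 26.0),
  Assists = c(5.7, 4.7, 4.2, 6.6, 7.4)
)

# Approach from the lesson: negate each numeric column
idx_minus <- order(-df$Points, -df$Assists)

# Alternative: a single decreasing = TRUE applies to all sort keys
idx_flag <- order(df$Points, df$Assists, decreasing = TRUE)

print(identical(idx_minus, idx_flag))  # TRUE
sorted_df <- df[idx_flag, ]
print(sorted_df)
```

A single decreasing = TRUE cannot mix ascending and descending columns, which is why the minus sign remains handy for numeric data.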

Sorting by Multiple Columns in Different Order

Instead of resolving ties based on the Assists column, we can mix things up a bit by sorting alphabetically by player names in case of ties.

R
sorted_df <- df[order(-df$Points, df$Player), ]
print(sorted_df)


     Player Points Assists
3 M. Jordan     32     4.2
1  L. James     27     5.7
5 K. Bryant     26     7.4
2 K. Durant     26     4.7
4  S. Curry     24     6.6

Here, the DataFrame is sorted by the Points column in descending order and the player names in ascending order. Now, in the event of a tie in points, the players are listed alphabetically.
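The minus sign only works on numeric columns, so it cannot reverse a character column such as Player. One option, sketched below, is to give decreasing one flag per column, which order() supports when method = "radix". Note that the radix method compares strings in the C locale, which may differ from your locale's alphabetical order for some characters. The names idx_mixed and idx_both_desc are introduced here only for illustration.

```r
df <- data.frame(
  Player = c('L. James', 'K. Durant', 'M. Jordan', 'S. Curry', 'K. Bryant'),
  Points = c(27.0, 26.0, 32.0, 24.0, 26.0),
  Assists = c(5.7, 4.7, 4.2, 6.6, 7.4)
)

# Points descending, Player ascending: one flag per sort key
idx_mixed <- order(df$Points, df$Player,
                   decreasing = c(TRUE, FALSE), method = "radix")
print(df[idx_mixed, ])

# Points descending, Player descending: flip the second flag
idx_both_desc <- order(df$Points, df$Player,
                       decreasing = c(TRUE, TRUE), method = "radix")
print(df[idx_both_desc, ])
```

The first call reproduces the ordering shown above. The second places 'K. Durant' before 'K. Bryant' within the tie on Points.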

Lesson Summary

Fantastic work! You have deepened your understanding of data.frames in R, mastered sorting data using the order() function, and learned how single or multiple columns can be sorted. It's time to solidify your understanding by practicing with various datasets. Happy R programming!
