Hello! Today we're diving into indexing and selecting data in pandas, a crucial part of data manipulation and analysis. Indexing is how we locate rows by their labels or positions, while selecting is how we pick out specific rows, columns, or cells.
We'll work through how to index and select data in pandas using hands-on examples. Let's begin!
In pandas, an index is more or less the address of your data. By default, pandas assigns integer labels to the rows, but we can set any column as the index. This effectively turns it into an identifier for the rows.
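To make the default concrete, here is a minimal sketch that builds a small DataFrame without specifying an index and then inspects what pandas created. The variable name `people` and the sample values are illustrative only.

```python
import pandas as pd

# Illustrative data; no index is specified, so pandas creates one
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "John"],
    "Age": [25, 22, 30],
})

# The default index is a RangeIndex that counts up from 0
print(people.index)
print(list(people.index))  # [0, 1, 2]
```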
Let's walk through the DataFrame methods `set_index()`, `reset_index()`, and `rename()`, starting with `set_index()`:
```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "John"],
    "Age": [25, 22, 30],
    "City": ["New York", "Los Angeles", "Chicago"]
})

# Use the "Name" column as the row index
df.set_index("Name", inplace=True)
print(df)
# Output:
#        Age         City
# Name
# Alice   25     New York
# Bob     22  Los Angeles
# John    30      Chicago
```
Data is then accessed through the index with the `loc[]` indexer for label-based indexing and the `iloc[]` indexer for integer position-based indexing, both of which we will investigate later.
The `inplace` parameter is common to many pandas DataFrame methods. If `inplace` is set to `True`, the changes are applied directly to the target DataFrame and the method returns `None`. Otherwise, the target DataFrame is left unchanged, and a modified copy is returned.
However, it is important to note that `inplace` is widely discouraged in modern pandas code, and deprecating it for many methods has been proposed. The more future-proof style is to reassign the result:

```python
df = df.set_index("Name")
```
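The difference between the two styles can be checked directly. The following is a minimal sketch, assuming a pandas version in which `set_index()` still accepts `inplace`; the names `original`, `indexed`, and `modified` are illustrative.

```python
import pandas as pd

original = pd.DataFrame({
    "Name": ["Alice", "Bob", "John"],
    "Age": [25, 22, 30],
})

# Without inplace, set_index() returns a new DataFrame
# and leaves the original untouched
indexed = original.set_index("Name")
print(list(original.columns))  # ['Name', 'Age']
print(list(indexed.columns))   # ['Age']

# With inplace=True, the DataFrame itself is modified
# and the call returns None
modified = original.copy()
result = modified.set_index("Name", inplace=True)
print(result)               # None
print(modified.index.name)  # Name
```

Because the in-place call returns `None`, writing `df = df.set_index("Name", inplace=True)` would replace `df` with `None`, which is a common source of bugs.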
To reset the index back to the default integer index, use `reset_index()`:
```python
# Move "Name" back to a regular column and restore the default index
df.reset_index(inplace=True)
print(df)
# Output:
#     Name  Age         City
# 0  Alice   25     New York
# 1    Bob   22  Los Angeles
# 2   John   30      Chicago
```
After the reset, the former index is an ordinary column again, so renaming it is simply renaming that column. This is done with the `rename()` method:
```python
df.rename(columns={"Name": "Student Name", "Age": "Student Age"}, inplace=True)
print(df)
# Output:
#   Student Name  Student Age         City
# 0        Alice           25     New York
# 1          Bob           22  Los Angeles
# 2         John           30      Chicago
```
Here, we provide a dictionary where the key is the old name, and the value is the new name.
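`rename()` can also relabel rows through its `index` argument, and `rename_axis()` changes the name of the index itself. The sketch below assumes a small DataFrame that still uses the names as its index; `students` and the replacement labels are hypothetical.

```python
import pandas as pd

students = pd.DataFrame({
    "Name": ["Alice", "Bob", "John"],
    "Age": [25, 22, 30],
}).set_index("Name")

# Rename a single row label; labels missing from the mapping are left alone
relabeled = students.rename(index={"Bob": "Robert"})
print(list(relabeled.index))    # ['Alice', 'Robert', 'John']

# Rename the index itself (its name, not its labels)
renamed_axis = students.rename_axis("Student Name")
print(renamed_axis.index.name)  # Student Name
```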
pandas provides the `loc[]` and `iloc[]` indexers for accessing data in a DataFrame in a manner similar to array indexing: `loc[]` selects by label, and `iloc[]` selects by integer position.
Let's understand this with an example:
```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "John", "Robert", "Ann"],
    "Age": [25, 22, 30, 28, 32],
    "City": ["New York", "Los Angeles", "Chicago", "San Francisco", "Houston"]
})

df.set_index("Name", inplace=True)

# Label-based: rows "Alice" and "John", columns "Age" and "City"
print(df.loc[["Alice", "John"], ["Age", "City"]])
# Output:
#        Age      City
# Name
# Alice   25  New York
# John    30   Chicago

# Position-based: rows 1 and 3, columns 0 and 1
print(df.iloc[[1, 3], [0, 1]])
# Output:
#         Age           City
# Name
# Bob      22    Los Angeles
# Robert   28  San Francisco
```
Note that we set the `"Name"` column as the index. With `loc[]`, we use labels (the names in the index and the column names) to select the required data. With `iloc[]`, we use integer positions for both rows and columns, which works much like indexing a 2D NumPy array.
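One practical difference between the two shows up when slicing: a label slice with `loc[]` includes both endpoints, while a position slice with `iloc[]` excludes the stop position, as in plain Python. The following sketch rebuilds the same sample data to illustrate this along with single-cell access; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "John", "Robert", "Ann"],
    "Age": [25, 22, 30, 28, 32],
    "City": ["New York", "Los Angeles", "Chicago", "San Francisco", "Houston"],
}).set_index("Name")

# Single cell: row label + column label vs. row position + column position
age_by_label = df.loc["John", "Age"]
age_by_position = df.iloc[2, 0]
print(age_by_label, age_by_position)  # 30 30

# Label slice: both "Alice" and "John" are included
label_slice = df.loc["Alice":"John"]
print(list(label_slice.index))     # ['Alice', 'Bob', 'John']

# Position slice: position 2 is excluded
position_slice = df.iloc[0:2]
print(list(position_slice.index))  # ['Alice', 'Bob']
```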
Congrats on completing this lesson! You've learned how to index and select data in pandas using tools like `set_index()`, `reset_index()`, `rename()`, `loc[]`, and `iloc[]`.
Next up are some practice exercises. These exercises will help solidify what you've learned in this lesson. It's crucial to practice when learning new programming skills.
In the next lesson, we will dive deeper into pandas and cover more useful features. Stay tuned!