Welcome to our session on creating new columns in Pandas. Today, we'll build on our data handling skills as we learn how to create new columns in our `DataFrame`. This ability is crucial for data cleaning and manipulation, enabling us to derive new fields from our existing data.
By the end of this session, you'll be adept at adding new columns with static values, generating new columns through operations with existing columns, and creating new columns based on specific conditions.
Creating new columns is key for data analysis. Consider a `DataFrame` of prices and quantities of goods sold. We might want to get the total sales, which is `price * quantity`.
```python
import pandas as pd

# Creating DataFrame: items, prices, quantities sold
df = pd.DataFrame({"Item": ["Apples", "Bananas", "Oranges"], "Price": [1.5, 0.5, 0.75], "Quantity": [10, 20, 30]})
# Create new column "Total" which is Price * Quantity
df["Total"] = df["Price"] * df["Quantity"]
print(df)
#       Item  Price  Quantity  Total
# 0   Apples   1.50        10   15.0
# 1  Bananas   0.50        20   10.0
# 2  Oranges   0.75        30   22.5
```
In this code, we create a new `"Total"` column. For DataFrames, this works much like adding a new key to a dictionary: it's that easy!
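As a hedged aside, the dictionary-style assignment above modifies `df` in place. If we would rather leave the original untouched, pandas also offers `DataFrame.assign`, which returns a new DataFrame carrying the extra column. The sketch below assumes the same goods data; the names `sales` and `with_total` are only illustrative.

```python
import pandas as pd

# Same goods data as above, under an illustrative name
sales = pd.DataFrame({
    "Item": ["Apples", "Bananas", "Oranges"],
    "Price": [1.5, 0.5, 0.75],
    "Quantity": [10, 20, 30],
})

# assign() returns a copy with the new column; `sales` itself is unchanged
with_total = sales.assign(Total=sales["Price"] * sales["Quantity"])

print(list(sales.columns))       # ['Item', 'Price', 'Quantity']
print(list(with_total.columns))  # ['Item', 'Price', 'Quantity', 'Total']
```

Which style to use is largely a matter of taste: bracket assignment is the shortest, while `assign` is convenient when chaining several steps together.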
Adding a new column with a static value is quite simple: assign a single value, and every row receives it. For example, we could add a `Location` column for a group of employees working in the same location; here we do the same for our goods.
```python
# Add "Location" column with static value
df["Location"] = "New York"
print(df)
#       Item  Price  Quantity  Total  Location
# 0   Apples   1.50        10   15.0  New York
# 1  Bananas   0.50        20   10.0  New York
# 2  Oranges   0.75        30   22.5  New York
```
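To make the contrast with per-row values concrete, here is a small sketch. The `goods` DataFrame and the `Aisle` and `Shelf` columns are hypothetical and exist only for this illustration. A single value is repeated for every row, a list must supply exactly one value per row, and a list of the wrong length raises a `ValueError`.

```python
import pandas as pd

goods = pd.DataFrame({"Item": ["Apples", "Bananas", "Oranges"]})

# A single (scalar) value is repeated for every row
goods["Location"] = "New York"

# A list supplies one value per row, in row order (hypothetical aisle numbers)
goods["Aisle"] = [3, 3, 7]

# A list of the wrong length is rejected rather than padded
try:
    goods["Shelf"] = ["A", "B"]  # only 2 values for 3 rows
    length_error = None
except ValueError as exc:
    length_error = str(exc)

print(goods)
print(length_error)
```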
We can also create new columns based on conditions applied to the values of existing columns. For example, if we have a `DataFrame` of student scores, we can create a column that flags whether each student's score is above 40.
Here's how we can do this:
```python
import numpy as np

# DataFrame with student names and their scores
df = pd.DataFrame({"Student": ["Alice", "Bob", "Charlie"], "Score": [42, 37, 56]})
# Create new column "Status" that is "Pass" if Score > 40 else "Fail"
df["Status"] = np.where(df["Score"] > 40, "Pass", "Fail")
print(df)
#    Student  Score Status
# 0    Alice     42   Pass
# 1      Bob     37   Fail
# 2  Charlie     56   Pass
```
The `np.where` function takes three arguments: a condition, a value to use where the condition is true, and a value to use where it is false. In this example, the condition is `df["Score"] > 40`, which is checked row by row. Where it is true, the new `"Status"` column gets the value `"Pass"`; otherwise it gets the value `"Fail"`.
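It can help to split the one-liner into its intermediate steps. The sketch below uses the same student data; the variable names `scores`, `passed`, and `labels` are introduced only for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"Student": ["Alice", "Bob", "Charlie"], "Score": [42, 37, 56]})

# Step 1: the comparison yields one True/False value per row
passed = scores["Score"] > 40
print(passed.tolist())  # [True, False, True]

# Step 2: np.where picks "Pass" where True and "Fail" where False
labels = np.where(passed, "Pass", "Fail")
print(labels.tolist())  # ['Pass', 'Fail', 'Pass']

# Step 3: the resulting array becomes the new column
scores["Status"] = labels
```

Note that `>` is a strict comparison, so a score of exactly 40 would be labelled "Fail" here.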
In this session, we've covered how to create new columns in a `DataFrame` with static values, through operations on existing columns, and based on conditions. The more you practice, the more natural these techniques will become. Looking forward to our exercise session!