Creating New Columns in Pandas

Introduction to Data Cleaning and Transformation: Lesson 3

Introduction

Welcome to our session on creating new columns in Pandas. Today, we'll build on our data handling skills by learning how to add new columns to a DataFrame. This ability is crucial for data cleaning and manipulation, since it lets us derive new fields from the data we already have.

By the end of this session, you'll be adept at adding new columns with static values, generating new columns through operations with existing columns, and creating new columns based on specific conditions.

Why Creating New Columns Is Important

Creating new columns is key for data analysis. Consider a DataFrame of prices and quantities of goods sold. We might want to get the total sales, which is price * quantity.

Python
import pandas as pd

# Creating DataFrame: items, prices, quantities sold
df = pd.DataFrame({"Item": ["Apples", "Bananas", "Oranges"], "Price": [1.5, 0.5, 0.75], "Quantity": [10, 20, 30]})
# Create new column "Total" which is Price * Quantity
df["Total"] = df["Price"] * df["Quantity"]
print(df)
#       Item  Price  Quantity  Total
# 0   Apples   1.50        10   15.0
# 1  Bananas   0.50        20   10.0
# 2  Oranges   0.75        30   22.5

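In this code, we create a new "Total" column. The multiplication is applied row by row, so each "Total" value is that row's price times its quantity. For DataFrames, adding a column works much like adding a new key to a dictionary: it's that easy!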
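
As a side note, here is a minimal sketch of an alternative way to derive the same column. It uses the DataFrame's assign method, which returns a new DataFrame and leaves the original untouched. The name with_total is our own choice for illustration and is not part of the lesson's code.

```python
import pandas as pd

df = pd.DataFrame({"Item": ["Apples", "Bananas", "Oranges"], "Price": [1.5, 0.5, 0.75], "Quantity": [10, 20, 30]})

# assign returns a copy with the extra column; df itself is not modified.
# "with_total" is a hypothetical name chosen for this sketch.
with_total = df.assign(Total=df["Price"] * df["Quantity"])

print(with_total["Total"].tolist())  # [15.0, 10.0, 22.5]
print("Total" in df.columns)         # False
```

The bracket syntax changes the DataFrame in place, while assign may be handy when we'd rather keep the original as it is.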

Adding New Columns with Static Values

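Adding a new column with a static value is simple: assign a single value to the new column name, and Pandas repeats it for every row. This is useful whenever all rows share the same value, such as a Location column for a group of employees working in the same location. Here, we add such a column to our sales DataFrame.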

Python
# Add "Location" column with static value
df["Location"] = "New York"
print(df)
#       Item  Price  Quantity  Total  Location
# 0   Apples   1.50        10   15.0  New York
# 1  Bananas   0.50        20   10.0  New York
# 2  Oranges   0.75        30   22.5  New York
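
To make the difference between a single value and a list of values concrete, here is a small sketch. The "Aisle" and "Shelf" columns are hypothetical and serve only as illustration. A single value is repeated for every row, whereas a list has to supply exactly one value per row.

```python
import pandas as pd

df = pd.DataFrame({"Item": ["Apples", "Bananas", "Oranges"]})

# A single value is repeated for every row.
df["Location"] = "New York"
print(df["Location"].tolist())  # ['New York', 'New York', 'New York']

# A list must contain one value per row ("Aisle" is a made-up column).
df["Aisle"] = [1, 2, 3]

# A list of the wrong length is rejected ("Shelf" is a made-up column).
try:
    df["Shelf"] = ["A", "B"]
except ValueError as err:
    print("ValueError:", err)
```

Because the failed assignment raises an error, the "Shelf" column is never added to the DataFrame.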

New Columns Based on Conditions

We can create new columns based on conditions from the values of the existing columns. For example, if we have a DataFrame of student scores, we can create a column that flags whether the student's score is above 40.

Here's how we can do this:

Python
import numpy as np
import pandas as pd

# DataFrame with student names and their scores
df = pd.DataFrame({"Student": ["Alice", "Bob", "Charlie"], "Score": [42, 37, 56]})
# Create new column "Status" that is "Pass" if Score > 40 else "Fail"
df["Status"] = np.where(df["Score"] > 40, "Pass", "Fail")
print(df)
#    Student  Score Status
# 0    Alice     42   Pass
# 1      Bob     37   Fail
# 2  Charlie     56   Pass

The np.where function takes three arguments: a condition, a value to use where the condition is true, and a value to use where it is false. The condition is evaluated element by element, so each row is handled independently. In this example, the condition is df["Score"] > 40. Rows where it is true get the value "Pass" in the new "Status" column, and all other rows get "Fail".
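
To see the two steps separately, here is a minimal sketch that first builds the condition and then passes it to np.where. The names scores, mask, and status are our own, chosen for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.Series([42, 37, 56], name="Score")

# Step 1: the comparison yields one boolean per row.
mask = scores > 40
print(mask.tolist())    # [True, False, True]

# Step 2: np.where picks "Pass" where the mask is True
# and "Fail" where it is False.
status = np.where(mask, "Pass", "Fail")
print(status.tolist())  # ['Pass', 'Fail', 'Pass']
```

The result is a NumPy array with one entry per row, which is why it can be assigned directly as a new column.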

Lesson Summary and Upcoming Practice

So far, we've covered how to create new columns in a DataFrame with static values, through operations with existing columns, and based on conditions. The more you practice, the better your understanding will get. Looking forward to our exercise session!
