Are you ready to delve deeper into the core functionalities of Pandas, one of the most popular Python libraries in data science? Today, we will focus on learning about the Pandas DataFrame and data manipulation, the backbone of many data operation tasks. We're going to work with the Titanic dataset, a fascinating example containing real-world data that will keep you engaged as we navigate through the lesson.
Pandas DataFrames bring versatility and power to the table when it comes to data manipulation. Think of a DataFrame as Excel, but on steroids, capable of handling large and complex datasets. The processing abilities of DataFrames are crucial for cleaning, transforming, and analyzing datasets in earnest.
Say you've got a dataset, like the Titanic, but some data is missing. Or perhaps there are some anomalies you'd want to filter out. Or you need specific segments of data to examine a particular hypothesis. How would you do it? By mastering Pandas DataFrames, you'd be well equipped to tackle these tasks!
A Pandas DataFrame is a two-dimensional labeled data structure capable of holding data of various types: integers, floats, strings, Python objects, and more. It is the most commonly used Pandas object.
Let's start simply by creating a DataFrame from a dictionary:
```python
import pandas as pd

data_dict = {"Name": ["John", "Anna", "Peter"],
             "Age": [28, 24, 33],
             "City": ["New York", "Los Angeles", "Berlin"]}

df = pd.DataFrame(data_dict)

print(df)

"""
    Name  Age         City
0   John   28     New York
1   Anna   24  Los Angeles
2  Peter   33       Berlin
"""
```
Each key-value pair in the dictionary corresponds to a column in the resulting DataFrame: the key becomes the column label, and the associated list supplies that column's values. In other words, the DataFrame constructor turns the dictionary into a two-dimensional table where keys become column names and each list becomes a column. Here, "John", "Anna", and "Peter" have ages 28, 24, and 33, respectively, and they live in "New York", "Los Angeles", and "Berlin".
To inspect the structure and properties of a DataFrame, we have a range of functions at our disposal. Here are some commonly used ones:
- df.head(n): Returns the first n rows of the DataFrame df.
- df.tail(n): Returns the last n rows of the DataFrame df.
- df.shape: Returns a tuple representing the dimensions (number_of_rows, number_of_columns) of the DataFrame df.
- df.columns: Returns an Index containing the column labels of the DataFrame df.
- df.dtypes: Returns a Series with the data type of each column.

Let's see these functions in action below:
```python
print(df.head(2))   # Print the first two rows
print(df.tail(2))   # Print the last two rows
print(df.shape)     # Print the dimensions of df (rows, columns): (3, 3)
print(df.columns)   # Print column labels: Index(['Name', 'Age', 'City'], dtype='object')
print(df.dtypes)    # Print the data type of each column:
# Name    object
# Age      int64
# City    object
# dtype: object
```
These commands help us understand the basic shape and types of the DataFrame. df.head(2) prints the first two rows, [John, 28, New York] and [Anna, 24, Los Angeles]. df.tail(2) prints the last two rows, [Anna, 24, Los Angeles] and [Peter, 33, Berlin]. df.shape gives us the dimensions of the DataFrame, here (3, 3), indicating 3 rows and 3 columns. df.columns prints the column names [Name, Age, City], and df.dtypes gives us the data type of each column.
The apply() function in Pandas is a versatile tool for manipulating DataFrame values. It allows us to apply a function (either a Python built-in function or a custom function) along the DataFrame's axes, either row-wise or column-wise.
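For instance, here is a minimal sketch (using a small made-up DataFrame named nums, not part of this lesson's data) of how the axis argument controls whether the function sees columns or rows:

```python
import pandas as pd

# Hypothetical numeric DataFrame used only to illustrate apply() along each axis
nums = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

print(nums.apply(sum))          # column-wise (the default, axis=0): a -> 6, b -> 60
print(nums.apply(sum, axis=1))  # row-wise (axis=1): 11, 22, 33
```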
Let's demonstrate this by adding a new column to our DataFrame, which represents whether a person is considered youthful, by applying a lambda function to the "Age" column.
Lambda functions in Python are small anonymous functions defined with the lambda keyword. They can take any number of arguments but can only have one expression. They are particularly useful when you need to pass a small function as an argument.
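As a quick illustration (a throwaway sketch with a hypothetical function named double, unrelated to the Titanic example), the lambda below behaves exactly like the equivalent def function:

```python
# A lambda and an equivalent named function
double = lambda x: x * 2

def double_def(x):
    return x * 2

print(double(5), double_def(5))  # 10 10
```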
Python1df["IsYouthful"] = df["Age"].apply(lambda age: "Yes" if age < 30 else "No") 2print(df) 3 4""" 5 Name Age City IsYouthful 60 John 28 New York Yes 71 Anna 24 Los Angeles Yes 82 Peter 33 Berlin No 9"""
In the above example, we used a lambda function that takes an age as an argument and returns "Yes" if the age is less than 30 and "No" otherwise.
Pandas provides various ways to combine DataFrames, one of which is concat(). As the name implies, concat() combines DataFrame objects along a particular axis.
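As a minimal sketch (with two small made-up DataFrames, left and right, that are not part of the lesson's data), the default axis=0 stacks rows on top of each other, while axis=1 places the DataFrames side by side as columns:

```python
import pandas as pd

# Two tiny DataFrames to show concatenation along columns (axis=1)
left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [3, 4]})

print(pd.concat([left, right], axis=1))
"""
   a  b
0  1  3
1  2  4
"""
```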
Let's create a new DataFrame and concatenate it with our existing DataFrame:
```python
df2 = pd.DataFrame({"Name": ["Megan"], "Age": [34], "City": ["San Francisco"], "IsYouthful": ["No"]})

df_concatenated = pd.concat([df, df2], ignore_index=True)

print(df_concatenated)

"""
    Name  Age           City IsYouthful
0   John   28       New York        Yes
1   Anna   24    Los Angeles        Yes
2  Peter   33         Berlin         No
3  Megan   34  San Francisco         No
"""
```
Did you notice the ignore_index=True parameter? When set to True, it resets the index in the resulting DataFrame, so the indices in the result are in increasing order starting from 0.
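To see why this matters, here is a quick sketch (the variable name df_keep_index is just for illustration) of what happens without ignore_index: each DataFrame keeps its original index labels, so the label 0 appears twice in the combined result.

```python
df_keep_index = pd.concat([df, df2])  # no ignore_index
print(df_keep_index.index.tolist())   # [0, 1, 2, 0] -- Megan's row keeps its original label 0
```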
Pandas provides several ways to locate elements in a DataFrame.
The simplest way to select a column in a DataFrame is by label:
```python
print(df['column_name'])     # select a single column (returns a Series)
print(df[['col1', 'col2']])  # select multiple columns (returns a DataFrame)
```
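With the DataFrame we built earlier, that might look like this:

```python
print(df["Name"])            # the "Name" column as a Series
print(df[["Name", "City"]])  # the "Name" and "City" columns as a DataFrame
```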
To select elements within the DataFrame by integer location, we use the iloc indexer. iloc works like Python list slicing: it accepts integer inputs and slice notation. The general syntax is df.iloc[row_selection, column_selection]:
For example, if we wish to select the value in the second row (indexed at 1) and the first column (indexed at 0):
```python
df.iloc[1, 0]    # The value in the second row, first column (0-based indexing): 'Anna'
df.iloc[:2, :2]  # The first two rows and the first two columns
```
Let's dive in and see how Pandas can be applied to real-life datasets! To do this, we will use the Titanic dataset provided by the Seaborn library and walk through some quick examples of how you can start analyzing it with Pandas.
Seaborn provides a convenient function that loads the dataset directly into a Pandas DataFrame:
```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset into a Pandas DataFrame
titanic = sns.load_dataset('titanic')

# Look at the first 3 rows of the DataFrame
print(titanic.head(3))

"""
   survived  pclass     sex   age  ... deck  embark_town alive  alone
0         0       3    male  22.0  ...  NaN  Southampton    no  False
1         1       1  female  38.0  ...    C    Cherbourg   yes  False
2         1       3  female  26.0  ...  NaN  Southampton   yes   True

[3 rows x 15 columns]
"""
```
Since titanic is just a Pandas DataFrame, you can apply to it any of the operations we've learned before!
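For example, here is a short sketch that reuses functions from this lesson on the titanic DataFrame (the is_child column is made up for illustration; the column names come from the printout above):

```python
print(titanic.shape)                    # (number_of_rows, 15)
print(titanic[["sex", "age"]].head(2))  # select two columns, then peek at the first rows

# Add a column with apply() and a lambda; rows with a missing age fall into "no"
titanic["is_child"] = titanic["age"].apply(lambda a: "yes" if a < 18 else "no")
print(titanic[["age", "is_child"]].head(3))
```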
To recap, we have learned about the Pandas DataFrame, one of the fundamental building blocks for data manipulation in Python. We explored how to create a DataFrame, gain insights into its structure, and manipulate it with functions like apply() and concat(). We also covered the basics of lambda functions and their use inside apply(). Lastly, we looked at locating elements in a DataFrame and saw how all of this functionality applies to real-life data, using Seaborn's Titanic dataset as an example. The ability to work with DataFrames is essential for data analysis and will form the foundation for the more complex data operations you will learn.
Remember, learning by doing is key to mastering these new skills. Are you ready to apply what you've learned today and get some more hands-on practice? As you proceed, keep in mind that knowing how to manipulate and analyze data with Pandas is a powerful tool for your journey in data science. Let's dive into the upcoming practice exercises and put your learning to the test!