Mastering Pandas - A Deep Dive into DataFrames and Data Manipulation

Lesson 3

Preparing for the Dive into Pandas

Are you ready to delve deeper into the core functionality of Pandas, one of the most popular Python libraries in data science? Today we focus on the Pandas DataFrame and on data manipulation, the backbone of many data tasks. Along the way we will work with the Titanic dataset, an engaging example built from real-world data.

Pandas DataFrames bring versatility and power to data manipulation. Think of a DataFrame as Excel on steroids, capable of handling large and complex datasets. These capabilities are crucial for cleaning, transforming, and analyzing data.

Say you've got a dataset, like the Titanic one, in which some values are missing. Or perhaps there are anomalies you'd want to filter out, or you need specific segments of the data to examine a particular hypothesis. How would you do it? By mastering Pandas DataFrames, you'd be well-equipped to tackle these tasks!

Initiation to Pandas DataFrame

Pandas DataFrame is a two-dimensional labeled data structure capable of holding data of various types—integers, floats, strings, Python objects, and more. It's generally the most commonly used Pandas object.

Let's start simply by creating a DataFrame from a dictionary:

Python
import pandas as pd

data_dict = {"Name": ["John", "Anna", "Peter"],
             "Age": [28, 24, 33],
             "City": ["New York", "Los Angeles", "Berlin"]}

df = pd.DataFrame(data_dict)

print(df)

"""
    Name  Age         City
0   John   28     New York
1   Anna   24  Los Angeles
2  Peter   33       Berlin
"""

Each key-value pair in the dictionary corresponds to a column in the resulting DataFrame: the key becomes the column label, and the value (here a list) supplies that column's values. The lists must all have the same length, since each position across the lists forms one row. Here "John", "Anna", and "Peter" have ages 28, 24, and 33, respectively, and they live in "New York", "Los Angeles", and "Berlin".

DataFrame Characteristics

To inspect the structure and properties of a DataFrame, we have a range of methods and attributes at our disposal. Here are some commonly used ones:

df.head(n): Returns the first n rows of the DataFrame df.
df.tail(n): Returns the last n rows of the DataFrame df.
df.shape: Returns a tuple representing the dimensions (number_of_rows, number_of_columns) of the DataFrame df.
df.columns: Returns an index containing column labels of the DataFrame df.
df.dtypes: Returns a series with the data type of each column.

Let's see them in action below:

Python
print(df.head(2))  # Print first two rows
print(df.tail(2))  # Print last two rows
print(df.shape)    # Print dimensions of the df (rows, columns): (3, 3)
print(df.columns)  # Print column labels: Index(['Name', 'Age', 'City'], dtype='object')
print(df.dtypes)   # Print data types of each column:
# Name    object
# Age      int64
# City    object
# dtype: object

These commands help us understand the basic shape and types of the DataFrame. df.head(2) returns the first 2 rows, [John, 28, New York] and [Anna, 24, Los Angeles]. df.tail(2) returns the last 2 rows, [Anna, 24, Los Angeles] and [Peter, 33, Berlin]. df.shape gives the dimensions of the DataFrame, here (3, 3), meaning 3 rows and 3 columns. df.columns holds the column names [Name, Age, City], and df.dtypes gives the data type of each column.

Using λ (Lambda) for DataFrame Manipulation

The apply() method in Pandas is a versatile tool for manipulating values. On a DataFrame, it applies a function (either a Python built-in or a custom function) along one of the axes, column by column or row by row. On a single column (a Series), it applies the function to each value.

Let's demonstrate this by adding a new column to our DataFrame that records whether a person is considered youthful, by applying a lambda function to the "Age" column.

Lambda functions (λ) in Python are small anonymous functions defined with the lambda keyword. They can take any number of arguments but can contain only one expression. They are particularly useful when you need to pass a small function as an argument.

Python
df["IsYouthful"] = df["Age"].apply(lambda age: "Yes" if age < 30 else "No")
print(df)

"""
    Name  Age         City IsYouthful
0   John   28     New York        Yes
1   Anna   24  Los Angeles        Yes
2  Peter   33       Berlin         No
"""

In the above example, we used a lambda function that takes an age as an argument and returns "Yes" if the age is less than 30 and "No" otherwise.
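As a rough sketch of how apply() and lambdas relate, the snippet below rebuilds the small DataFrame so it runs on its own, checks that a named function gives the same result as the lambda, and then calls apply() on the whole DataFrame with axis=1 to work row by row. The helper name is_youthful and the "Summary" column are illustrative only and are not part of the lesson's data.

```python
import pandas as pd

# Rebuild the lesson's small DataFrame so this snippet runs on its own
df = pd.DataFrame({"Name": ["John", "Anna", "Peter"],
                   "Age": [28, 24, 33],
                   "City": ["New York", "Los Angeles", "Berlin"]})

# Hypothetical named helper, equivalent to the lambda used above
def is_youthful(age):
    return "Yes" if age < 30 else "No"

from_lambda = df["Age"].apply(lambda age: "Yes" if age < 30 else "No")
from_named = df["Age"].apply(is_youthful)
print(from_lambda.equals(from_named))  # True

# With axis=1, apply() passes each row (as a Series) to the function
df["Summary"] = df.apply(lambda row: f"{row['Name']} ({row['City']})", axis=1)
print(df["Summary"].tolist())
# ['John (New York)', 'Anna (Los Angeles)', 'Peter (Berlin)']
```

A lambda is convenient for one-off expressions like these; a named function is usually easier to read and reuse once the logic grows beyond a single expression.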

The Mighty Concat

Pandas provides various ways to combine DataFrames, one of which is concat(). As the name implies, concat() combines DataFrame objects along a particular axis.

Let's create a new DataFrame and concatenate it with our existing DataFrame:

Python
df2 = pd.DataFrame({"Name": ["Megan"], "Age": [34], "City": ["San Francisco"], "IsYouthful": ["No"]})

df_concatenated = pd.concat([df, df2], ignore_index=True)

print(df_concatenated)

"""
    Name  Age           City IsYouthful
0   John   28       New York        Yes
1   Anna   24    Los Angeles        Yes
2  Peter   33         Berlin         No
3  Megan   34  San Francisco         No
"""

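Did you notice the ignore_index=True parameter? When set to True, concat() discards the original index labels of the inputs and assigns a fresh index to the result. So, in the resulting DataFrame, the index runs in increasing order starting from 0. Without it, the row from df2 would keep its original label 0, and the result would contain a duplicate index label.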
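To make the effect of ignore_index concrete, here is a minimal sketch that concatenates two small DataFrames both ways and compares the resulting index labels. The names df_a, df_b, kept, and reset are made up for this illustration.

```python
import pandas as pd

# Two small illustrative frames, each with its own default index starting at 0
df_a = pd.DataFrame({"Name": ["John", "Anna"], "Age": [28, 24]})
df_b = pd.DataFrame({"Name": ["Megan"], "Age": [34]})

kept = pd.concat([df_a, df_b])                      # original labels are kept
reset = pd.concat([df_a, df_b], ignore_index=True)  # labels are renumbered

print(kept.index.tolist())   # [0, 1, 0]
print(reset.index.tolist())  # [0, 1, 2]
```

Duplicate labels such as the repeated 0 in the first result are allowed, but they can make label-based lookups return more than one row, which is why ignore_index=True is often preferred when the original labels carry no meaning.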

Locating Elements in a Pandas DataFrame

Pandas provides several ways to locate elements in a DataFrame.

The simplest way to select a column in a DataFrame is by label:

Python
print(df['column_name'])     # select a single column
print(df[['col1', 'col2']])  # select multiple columns

To select elements within the DataFrame by integer position, we use the iloc indexer. It works like Python list slicing: it accepts integers and slice notation, and positions are counted from 0. The general syntax is df.iloc[row_selection, column_selection].

For example, if we wish to select the value in the second row (indexed at 1) and the first column (indexed at 0):

Python
df.iloc[1, 0]    # Select the value in the second row and the first column (positions are 0-based)
df.iloc[:2, :2]  # Select the first two rows and the first two columns
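As an illustration, the selections above can be tried on the small DataFrame from earlier in the lesson. The snippet rebuilds that DataFrame so it runs on its own; the variable name subset is introduced here only for the example.

```python
import pandas as pd

# Rebuild the lesson's small DataFrame so this snippet runs on its own
df = pd.DataFrame({"Name": ["John", "Anna", "Peter"],
                   "Age": [28, 24, 33],
                   "City": ["New York", "Los Angeles", "Berlin"]})

# Selection by label
print(df["Name"].tolist())         # ['John', 'Anna', 'Peter']
print(df[["Name", "City"]].shape)  # (3, 2)

# Selection by integer position (0-based)
print(df.iloc[1, 0])               # Anna

subset = df.iloc[:2, :2]           # first two rows, first two columns
print(subset.shape)                # (2, 2)
print(subset.columns.tolist())     # ['Name', 'Age']
```

As with list slicing, the end of an iloc slice is exclusive, so :2 covers positions 0 and 1.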

Exploring Practical Application of Pandas: Titanic Dataset from Seaborn

Let's dive in and see how Pandas can be applied to real-life datasets! To show this, we will use the Titanic dataset provided by the Seaborn library and show you some quick examples of how you can start analyzing it using Pandas.

Seaborn provides a function that loads the dataset directly into a Pandas DataFrame:

Python
import pandas as pd
import seaborn as sns

# Load the titanic dataset into a Pandas DataFrame
titanic = sns.load_dataset('titanic')

# Look at the first 3 rows of the DataFrame
print(titanic.head(3))

"""
   survived  pclass     sex   age  ...  deck  embark_town  alive  alone
0         0       3    male  22.0  ...   NaN  Southampton     no  False
1         1       1  female  38.0  ...     C    Cherbourg    yes  False
2         1       3  female  26.0  ...   NaN  Southampton    yes   True

[3 rows x 15 columns]
"""

Because titanic is just a Pandas DataFrame, every operation we've covered in this lesson applies to it!
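As a small sketch of that idea, the snippet below applies a few of the lesson's operations to a stand-in DataFrame. To keep it runnable without Seaborn, it hand-copies only the first four columns of the three rows printed above; the name sample and the derived "is_youthful" column are assumptions made for this example, not part of the dataset.

```python
import pandas as pd

# Stand-in for titanic.head(3): the first four columns of the rows printed above
sample = pd.DataFrame({"survived": [0, 1, 1],
                       "pclass": [3, 1, 3],
                       "sex": ["male", "female", "female"],
                       "age": [22.0, 38.0, 26.0]})

print(sample.shape)       # (3, 4)
print(sample.iloc[1, 2])  # female

# Hypothetical derived column, reusing the under-30 rule from earlier in the lesson
sample["is_youthful"] = sample["age"].apply(lambda age: "Yes" if age < 30 else "No")
print(sample["is_youthful"].tolist())  # ['Yes', 'No', 'Yes']
```

The same calls should work unchanged on the full titanic DataFrame. One caveat: the full dataset has missing ages, and because a comparison with NaN evaluates to False, those rows would be labelled "No" by this rule.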

Lesson Summary

To recap, we have learned about the Pandas DataFrame, one of the fundamental building blocks for data manipulation in Python. We explored how to create a DataFrame, inspect its structure, and manipulate it with apply() and concat(). We also covered the basics of lambda functions and how to use them inside apply(). Lastly, we looked at how to locate elements in a DataFrame and saw that all of this applies to real-life datasets, using Seaborn's Titanic dataset as an example. The ability to work with DataFrames is essential for data analysis and will form the foundation for the more complex data operations you will learn.

Remember, learning by doing is key to mastering these new skills. Are you ready to apply what you've learned today in some hands-on exercises? As you proceed, keep in mind that knowing how to manipulate and analyze data with Pandas is a powerful tool for your journey in data science. Let's move on to the upcoming practice exercises and put your learning to the test!
