Lesson 4
Comprehensive Analysis With Multiple Techniques: Part 2
Lesson Introduction

Welcome to our lesson on integrating multiple techniques for comprehensive data analysis! Today, we'll dive deep into the Titanic dataset, using powerful functions and methods from pandas and numpy to uncover valuable insights. The goal is to learn how to combine techniques like groupby, merge, and pivot tables for thorough analysis.

Integrating multiple techniques is like preparing a delicious meal: you combine several ingredients to create a rich, flavorful dish. Similarly, combining data analysis techniques helps extract deeper insights from data.

Let's start stepping through our code!

Combining Groupby and Aggregation: Part 1

First, we'll group the data by class and sex and calculate the mean values. Grouping helps us understand patterns within subgroups.

Python
import seaborn as sns
import pandas as pd
import numpy as np

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Group by 'class' and 'sex' with observed=True to specify the exact behavior expected
class_sex_grouping = titanic.groupby(['class', 'sex'], observed=True).agg({
    'survived': 'mean',      # Mean survival rate
    'fare': 'mean',          # Mean fare
    'age': ['mean', 'std']   # Mean and standard deviation of age
}).reset_index()

Calling reset_index converts the row index created by the groupby operation (a multi-level index with the levels class and sex) back into regular columns of the DataFrame. Without it, class and sex would remain index levels, which complicates later steps such as merging and makes the result harder to read. Because age is aggregated with a list of functions, the column labels are still multi-level at this point.
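To make the effect of reset_index concrete, here is a minimal sketch on a tiny made-up DataFrame (the `toy` data is an assumption for illustration, not the Titanic dataset):

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'class': ['First', 'First', 'Third', 'Third'],
    'sex': ['female', 'male', 'female', 'male'],
    'survived': [1, 0, 1, 0],
})

# Without reset_index: 'class' and 'sex' are levels of the row index
grouped = toy.groupby(['class', 'sex']).agg({'survived': 'mean'})
print(list(grouped.index.names))  # ['class', 'sex']
print(list(grouped.columns))      # ['survived']

# With reset_index: the group keys become ordinary columns
flat = grouped.reset_index()
print(list(flat.columns))         # ['class', 'sex', 'survived']
```

After the reset, the group keys can be used like any other column, for example as a merge key.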

Combining Groupby and Aggregation: Part 2

After grouping, we'll simplify the multi-level columns for readability.

Python
# Simplify multi-level columns
class_sex_grouping.columns = ['class', 'sex', 'survived_mean', 'fare_mean', 'age_mean', 'age_std']

print(class_sex_grouping)

Output:

    class     sex  survived_mean   fare_mean   age_mean    age_std
0   First  female       0.968085  106.125798  34.611765  13.612052
1   First    male       0.368852   67.226127  41.281386  15.139570
2  Second  female       0.921053   21.970121  28.722973  12.872702
3  Second    male       0.157407   19.741782  30.740707  14.793894
4   Third  female       0.500000   16.118810  21.750000  12.729964
5   Third    male       0.135447   12.661633  26.507589  12.159514

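The output shows that first-class passengers had higher survival rates and paid higher mean fares than third-class passengers of the same sex. For example, about 96.8% of women in first class survived, compared with 50% of women in third class.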
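The column simplification above assigns six names by hand. One alternative is to build the flat names from the multi-level labels themselves. The sketch below uses a small made-up DataFrame (`toy` and `summary` are hypothetical names), not the lesson's data:

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'class': ['First', 'First', 'Third', 'Third'],
    'age': [30.0, 40.0, 20.0, 24.0],
    'fare': [80.0, 60.0, 10.0, 8.0],
})

summary = toy.groupby('class').agg({
    'fare': 'mean',
    'age': ['mean', 'std'],
}).reset_index()

# Each column label is now a tuple such as ('class', '') or ('age', 'std')
print(summary.columns.tolist())

# Join the non-empty parts of each tuple with an underscore
summary.columns = ['_'.join(part for part in col if part) for col in summary.columns]
print(summary.columns.tolist())  # ['class', 'fare_mean', 'age_mean', 'age_std']
```

Deriving the names this way keeps them in step with the aggregation dictionary, whereas a hand-written list has to be updated whenever an aggregation is added or reordered.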

Creating a Pivot Table

After grouping and aggregating our data, we'll create a pivot table that cross-tabulates each statistic, with classes as rows and sexes as columns.

Python
# Pivot table with observed=True for grouping to avoid FutureWarning
pivot_table = class_sex_grouping.pivot_table(
    index='class',
    columns='sex',
    values=['survived_mean', 'fare_mean', 'age_mean', 'age_std'],
    observed=True
)

print(pivot_table)
#        survived_mean
# sex           female      male
# class
# First       0.968085  0.368852
# Second      0.921053  0.157407
# Third       0.500000  0.135447

# fare_mean
# Analogous

# age_mean
# Analogous

# age_std
# Analogous

The pivot table allows us to easily compare survival rates, fare means, and age statistics across different classes and genders.
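When a pivot table holds several value columns, a single statistic can be pulled out by its top-level column label. The following sketch assumes a small made-up summary table (`toy` and its numbers are illustrative, not the lesson's results):

```python
import pandas as pd

# Toy summary, made up for illustration only
toy = pd.DataFrame({
    'class': ['First', 'First', 'Third', 'Third'],
    'sex': ['female', 'male', 'female', 'male'],
    'survived_mean': [0.9, 0.4, 0.5, 0.1],
    'fare_mean': [100.0, 70.0, 16.0, 12.0],
})

wide = toy.pivot_table(
    index='class',
    columns='sex',
    values=['survived_mean', 'fare_mean'],
)

# Selecting a top-level label returns a plain class-by-sex table
survival = wide['survived_mean']
print(survival)

# A single cell: the toy survival rate for first-class women
print(survival.loc['First', 'female'])  # 0.9
```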

Adding a Conditional Column

We'll add a new column to indicate whether a passenger is a child. This helps us understand survival rates among children.

Python
# Adding an 'is_child' column: whether the passenger is a child (age < 18)
titanic['is_child'] = titanic['age'] < 18
print(titanic['is_child'])
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# ...

Adding the is_child column allows further analysis considering passengers' age groups.
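One detail of the comparison is worth seeing in isolation: a missing age compares as False, so passengers without a recorded age are not flagged as children. The sketch below uses a made-up Series, and the alternative at the end is only one possible way to keep missing ages separate; it is not part of the lesson's code:

```python
import numpy as np
import pandas as pd

# Toy ages, made up for illustration only; the last one is missing
ages = pd.Series([5.0, 40.0, np.nan], name='age')

# NaN < 18 evaluates to False, so a missing age counts as "not a child"
is_child = ages < 18
print(is_child.tolist())  # [True, False, False]

# Possible alternative (an assumption): keep missing ages as missing
is_child_or_missing = (ages < 18).where(ages.notna())
print(is_child_or_missing.isna().tolist())  # [False, False, True]
```

Whether unknown ages should count as adults or be left out depends on the question being asked, so the choice is worth making deliberately.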

Analysis of Survival Rates by Class and Age Group

Next, let's analyze survival rates by class and whether the passenger is a child.

Python
# Analyze survival rates by class and whether the passenger is a child or not
survival_by_class_child = titanic.pivot_table(
    'survived', index='class', columns='is_child', aggfunc='mean',
    observed=True
)

print(survival_by_class_child)
# is_child     False     True
# class
# First     0.612745  0.916667
# Second    0.409938  0.913043
# Third     0.217918  0.371795

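The output shows that children had higher survival rates than other passengers in every class. The True column holds the survival rates for children. The False column holds the rates for everyone else: adults, plus passengers whose age is missing, because a missing age compared with 18 evaluates to False.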
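A mean on its own does not show how many passengers stand behind each cell. As a hedged sketch, the same pivot can be run with aggfunc='count' to see group sizes next to the rates. The data below is made up for illustration:

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'class': ['First', 'First', 'First', 'Third', 'Third', 'Third'],
    'is_child': [False, False, True, False, True, True],
    'survived': [1, 0, 1, 0, 1, 0],
})

# Survival rate per (class, is_child) cell
rates = toy.pivot_table('survived', index='class', columns='is_child', aggfunc='mean')

# Number of passengers per cell, using the same layout
counts = toy.pivot_table('survived', index='class', columns='is_child', aggfunc='count')

print(rates)
print(counts)
```

A rate computed from a very small group deserves more caution than one computed from a large group.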

Merging Datasets for Comprehensive View

We’ll merge our grouped data with child survival data for a comprehensive dataset.

Python
# Merge this pivot table with the original grouped data for a comprehensive view
comprehensive_view = pd.merge(
    class_sex_grouping,
    survival_by_class_child,
    on='class',
    how='left'
)

print(comprehensive_view)

The output is:

    class     sex  survived_mean  ...    age_std     False      True
0   First  female       0.968085  ...  13.612052  0.612745  0.916667
1   First    male       0.368852  ...  15.139570  0.612745  0.916667
2  Second  female       0.921053  ...  12.872702  0.409938  0.913043
3  Second    male       0.157407  ...  14.793894  0.409938  0.913043
4   Third  female       0.500000  ...  12.729964  0.217918  0.371795
5   Third    male       0.135447  ...  12.159514  0.217918  0.371795
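In the merge above, class is a regular column in the grouped data but the row index of the pivot table; the `on` argument of pd.merge accepts either a column name or an index level name. The sketch below reproduces that shape with made-up frames (`left`, `right`, and their values are illustrative assumptions) and adds the optional `validate` check:

```python
import pandas as pd

# Toy frames, made up for illustration only
left = pd.DataFrame({
    'class': ['First', 'First', 'Third', 'Third'],
    'sex': ['female', 'male', 'female', 'male'],
    'survived_mean': [0.9, 0.4, 0.5, 0.1],
})
right = pd.DataFrame(
    {'adult_rate': [0.6, 0.2], 'child_rate': [0.9, 0.4]},
    index=pd.Index(['First', 'Third'], name='class'),  # 'class' is the index here
)

# 'class' is a column on the left and an index level on the right.
# validate='many_to_one' raises an error if a class appears twice on the right.
merged = pd.merge(left, right, on='class', how='left', validate='many_to_one')
print(merged)
```

Each class-level row on the right is repeated for every matching row on the left, which is why the class-level rates appear once per sex in the merged output.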

Merging datasets combines various insights into one comprehensive analysis. Additionally, we can rename the True and False columns from the survival_by_class_child dataframe for clarity:

Python
# Rename the columns for clarity
comprehensive_view.rename(columns={False: 'adult_survival_rate', True: 'child_survival_rate'}, inplace=True)

Note that the rename method takes a dictionary that maps the old column names to the new ones. Here the old names are the boolean values False and True, not strings.
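As a small sketch of the same renaming without inplace=True, the call below returns a new DataFrame and leaves the original untouched. The `view` frame is made up for illustration:

```python
import pandas as pd

# Toy frame with boolean column labels, made up for illustration only
view = pd.DataFrame({
    'class': ['First', 'Third'],
    False: [0.6, 0.2],
    True: [0.9, 0.4],
})

# rename returns a new DataFrame by default; columns not in the mapping keep their names
renamed = view.rename(columns={False: 'adult_survival_rate', True: 'child_survival_rate'})

print(list(renamed.columns))  # ['class', 'adult_survival_rate', 'child_survival_rate']
print(list(view.columns))     # ['class', False, True]
```

Assigning the result to a new name makes it easy to compare the frame before and after the change.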

Lesson Summary and Practice Introduction

Today, you learned how to integrate multiple data analysis techniques into one comprehensive analysis. We loaded the Titanic dataset, grouped and aggregated it by class and sex, built pivot tables, added a conditional is_child column, compared survival rates by class and age group, and merged the results into a single view.

Now, it's time to practice. In the next session, you'll work on similar exercises with different datasets or parameters. Happy coding!
