Comprehensive Analysis With Multiple Techniques: Part 2

Lesson 4

Lesson Introduction

Welcome to our lesson on integrating multiple techniques for comprehensive data analysis! Today, we'll dive deep into the Titanic dataset, using powerful functions and methods from pandas and numpy to uncover valuable insights. The goal is to learn how to combine techniques like groupby, merge, and pivot tables for thorough analysis.

Integrating multiple techniques is like preparing a delicious meal: you combine several ingredients to create a rich, flavorful dish. Similarly, combining data analysis techniques helps extract deeper insights from data.

Let's start stepping through our code!

Combining Groupby and Aggregation: Part 1

First, we'll group the data by class and sex and calculate the mean values. Grouping helps us understand patterns within subgroups.

Python
import seaborn as sns
import pandas as pd
import numpy as np

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Group by 'class' and 'sex'; observed=True keeps only the category combinations present in the data
class_sex_grouping = titanic.groupby(['class', 'sex'], observed=True).agg({
    'survived': 'mean',  # Mean survival rate
    'fare': 'mean',      # Mean fare
    'age': ['mean', 'std']  # Mean and standard deviation of age
}).reset_index()

Calling reset_index converts the group keys, which the groupby operation places in a multi-level index, back into regular columns of the DataFrame. Without it, class and sex would remain index levels, which can complicate further data manipulation and hurt readability.
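To see the effect of reset_index in isolation, here is a minimal sketch on a tiny made-up DataFrame (the values are illustrative and are not taken from the Titanic dataset):

```python
import pandas as pd

# Toy data standing in for the Titanic columns (values are made up)
df = pd.DataFrame({
    'class': ['First', 'First', 'Third', 'Third'],
    'sex': ['female', 'male', 'female', 'male'],
    'fare': [100.0, 60.0, 15.0, 10.0],
})

# After groupby, the group keys live in a two-level index
grouped = df.groupby(['class', 'sex'])['fare'].mean()
print(list(grouped.index.names))  # ['class', 'sex']

# reset_index moves the keys back into ordinary columns
flat = grouped.reset_index()
print(list(flat.columns))         # ['class', 'sex', 'fare']
```

Passing as_index=False to groupby is another common way to keep the keys as columns.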

Combining Groupby and Aggregation: Part 2

After grouping, we'll simplify the multi-level columns for readability.

Python
# Simplify multi-level columns
class_sex_grouping.columns = ['class', 'sex', 'survived_mean', 'fare_mean', 'age_mean', 'age_std']

print(class_sex_grouping)

Output:


    class     sex  survived_mean   fare_mean   age_mean    age_std
0   First  female       0.968085  106.125798  34.611765  13.612052
1   First    male       0.368852   67.226127  41.281386  15.139570
2  Second  female       0.921053   21.970121  28.722973  12.872702
3  Second    male       0.157407   19.741782  30.740707  14.793894
4   Third  female       0.500000   16.118810  21.750000  12.729964
5   Third    male       0.135447   12.661633  26.507589  12.159514

The output shows that first-class passengers had higher survival rates and paid higher average fares than third-class passengers, for both women and men.
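Assigning a hard-coded list of column names works, but it depends on the names being listed in exactly the right order. As an alternative sketch (not the lesson's own method), the flat names can be derived from the multi-level labels themselves. The data below is made up for illustration:

```python
import pandas as pd

# Toy data (values are made up)
df = pd.DataFrame({
    'class': ['First', 'First', 'Third', 'Third'],
    'survived': [1, 0, 1, 0],
    'age': [30.0, 40.0, 20.0, 26.0],
})

grouped = df.groupby('class').agg({
    'survived': 'mean',
    'age': ['mean', 'std'],
}).reset_index()

# Each column label is a tuple such as ('age', 'mean') or ('class', '');
# join the non-empty parts with an underscore
grouped.columns = [
    '_'.join(part for part in col if part) for col in grouped.columns
]
print(list(grouped.columns))
# ['class', 'survived_mean', 'age_mean', 'age_std']
```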

Creating a Pivot Table

After grouping and aggregating our data, we'll create a pivot table to summarize and cross-tabulate our datasets dynamically.

Python
# Pivot table with observed=True for grouping to avoid FutureWarning
pivot_table = class_sex_grouping.pivot_table(
    index='class',
    columns='sex',
    values=['survived_mean', 'fare_mean', 'age_mean', 'age_std'],
    observed=True
)

print(pivot_table)
#              survived_mean
# sex               female      male
# class
# First          0.968085   0.368852
# Second         0.921053   0.157407
# Third          0.500000   0.135447

#              fare_mean
# Analogous

#              age_mean
# Analogous

#              age_std
# Analogous

The pivot table allows us to easily compare survival rates, fare means, and age statistics across different classes and genders.
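Because the pivoted result has two column levels (the statistic, then sex), one statistic can be selected as a block and compared across sexes. The following is a small sketch on made-up summary values, not on the real output above:

```python
import pandas as pd

# Toy summary table (values are made up)
summary = pd.DataFrame({
    'class': ['First', 'First', 'Third', 'Third'],
    'sex': ['female', 'male', 'female', 'male'],
    'survived_mean': [0.9, 0.4, 0.5, 0.1],
    'fare_mean': [100.0, 60.0, 15.0, 10.0],
})

wide = summary.pivot_table(
    index='class',
    columns='sex',
    values=['survived_mean', 'fare_mean'],
)

# Selecting the top-level label returns a block with 'female' and 'male' columns
survival = wide['survived_mean']

# Difference in survival rate between women and men, per class
gap = survival['female'] - survival['male']
print(gap)
```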

Adding a Conditional Column

We'll add a new column to indicate whether a passenger is a child. This helps us understand survival rates among children.

Python
# Adding an 'is_child' column: whether the passenger is a child (age < 18)
titanic['is_child'] = titanic['age'] < 18
print(titanic['is_child'])
# 0      False
# 1      False
# 2      False
# 3      False
# 4      False
# ...

Adding the is_child column allows further analysis considering passengers' age groups.
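One subtlety is worth keeping in mind: comparing a missing age with 18 evaluates to False, so passengers with an unknown age are labeled as non-children. The sketch below uses a made-up three-value Series to show this, along with one possible way (an assumption on our part, not part of the lesson's code) to keep "unknown" as a distinct value with pandas' nullable boolean dtype:

```python
import numpy as np
import pandas as pd

# Made-up ages: a child, an adult, and a missing value
ages = pd.Series([5.0, 40.0, np.nan])

# A plain comparison silently maps the missing age to False
naive = ages < 18
print(naive.tolist())     # [True, False, False]

# Optional: keep "unknown" distinct using the nullable boolean dtype
is_child = (ages < 18).astype('boolean')
is_child[ages.isna()] = pd.NA
print(is_child.tolist())  # [True, False, <NA>]
```

Which behavior is appropriate depends on the question being asked. By default, grouping operations typically leave out rows whose key is missing, so the second version would exclude unknown-age passengers from both groups.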

Analysis of Survival Rates by Class and Age Group

Next, let's analyze survival rates by class and whether the passenger is a child.

Python
# Analyze survival rates by class and whether the passenger is a child or not
survival_by_class_child = titanic.pivot_table(
    'survived', index='class', columns='is_child', aggfunc='mean',
    observed=True
)

print(survival_by_class_child)
# is_child         False     True
# class
# First     0.612745  0.916667
# Second    0.409938  0.913043
# Third     0.217918  0.371795

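In this output, children had higher survival rates than other passengers in every class. The True column holds the survival rates for children. The False column holds the rates for everyone else: adults, plus any passengers whose age is missing, because a comparison with a missing value evaluates to False.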
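A mean computed over a small group can be misleading, so it often helps to look at group sizes next to the rates. Here is a minimal sketch using groupby with named aggregation on made-up data. The column names rate and n are our own choices:

```python
import pandas as pd

# Toy data (values are made up)
df = pd.DataFrame({
    'class': ['First', 'First', 'First', 'Third', 'Third', 'Third'],
    'is_child': [False, False, True, False, True, True],
    'survived': [1, 0, 1, 0, 1, 0],
})

# Survival rate and number of passengers for each (class, is_child) group
stats = df.groupby(['class', 'is_child'])['survived'].agg(
    rate='mean',
    n='count',
)
print(stats)
```

In this toy data the ('First', True) group has a survival rate of 1.0, but it rests on a single passenger.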

Merging Datasets for Comprehensive View

We’ll merge our grouped data with child survival data for a comprehensive dataset.

Python
# Merge this pivot table with the original grouped data for a comprehensive view
comprehensive_view = pd.merge(
    class_sex_grouping,
    survival_by_class_child,
    on='class',
    how='left'
)

print(comprehensive_view)

The output is:


    class     sex  survived_mean  ...    age_std     False      True
0   First  female       0.968085  ...  13.612052  0.612745  0.916667
1   First    male       0.368852  ...  15.139570  0.612745  0.916667
2  Second  female       0.921053  ...  12.872702  0.409938  0.913043
3  Second    male       0.157407  ...  14.793894  0.409938  0.913043
4   Third  female       0.500000  ...  12.729964  0.217918  0.371795
5   Third    male       0.135447  ...  12.159514  0.217918  0.371795
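The merge above joins a column named class on the left with an index level named class on the right, so each class-level row is repeated for every matching row on the left. The sketch below reproduces that pattern on made-up data and adds the validate argument of pd.merge as an optional safety check (our addition, not part of the lesson's code):

```python
import pandas as pd

# Left: one row per (class, sex); values are made up
left = pd.DataFrame({
    'class': ['First', 'First', 'Third', 'Third'],
    'sex': ['female', 'male', 'female', 'male'],
    'survived_mean': [0.9, 0.4, 0.5, 0.1],
})

# Right: one row per class, keyed by an index named 'class',
# like the result of a pivot table
right = pd.DataFrame(
    {'child_rate': [0.9, 0.4]},
    index=pd.Index(['First', 'Third'], name='class'),
)

# validate='many_to_one' raises an error if 'class' is not unique on the right
merged = pd.merge(left, right, on='class', how='left', validate='many_to_one')
print(merged)
```

With how='left', every row of the left table is kept even if its class has no match on the right. In that case the added columns would contain missing values.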

Merging combines these insights into one table. Because the join is on class, each class-level survival rate repeats on both the female and male rows. For clarity, we can also rename the True and False columns that came from the survival_by_class_child DataFrame:

Python
# Rename the columns for clarity
comprehensive_view.rename(columns={False: 'adult_survival_rate', True: 'child_survival_rate'}, inplace=True)

Note that the rename method takes a dictionary mapping the old column names to the new column names.
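As a small sketch of the same idea without inplace=True, rename can also return a new DataFrame and leave the original untouched. The table below is made up for illustration:

```python
import pandas as pd

# Toy table with boolean column labels, like the pivot table result (values are made up)
rates = pd.DataFrame(
    {False: [0.6, 0.2], True: [0.9, 0.4]},
    index=['First', 'Third'],
)

# Without inplace=True, rename returns a new DataFrame
renamed = rates.rename(columns={
    False: 'adult_survival_rate',
    True: 'child_survival_rate',
})

print(list(renamed.columns))  # ['adult_survival_rate', 'child_survival_rate']
print(list(rates.columns))    # [False, True]
```

Labels that do not appear in the mapping are left unchanged.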

Lesson Summary and Practice Introduction

Today, you learned how to integrate multiple data analysis techniques to conduct a comprehensive analysis. We started by loading the dataset, then grouped and aggregated the data, simplified the multi-level columns, created pivot tables, added a conditional column, analyzed survival rates by class and age group, and merged the results for broader insights.

Now, it's time to practice. In the next session, you'll work on similar exercises with different datasets or parameters. Happy coding!
