Welcome to our lesson on integrating multiple techniques for comprehensive data analysis! Today, we'll dive deep into the Titanic dataset, using powerful functions and methods from `pandas` and `numpy` to uncover valuable insights. The goal is to learn how to combine techniques like `groupby`, `merge`, and `pivot` tables for thorough analysis.
Integrating multiple techniques is like preparing a delicious meal: you combine several ingredients to create a rich, flavorful dish. Similarly, combining data analysis techniques helps extract deeper insights from data.
Let's start stepping through our code!
First, we'll group the data by `class` and `sex` and calculate the mean values. Grouping helps us understand patterns within subgroups.
```python
import seaborn as sns
import pandas as pd
import numpy as np

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Group by 'class' and 'sex'; observed=True keeps only the category
# combinations that actually occur in the data ('class' is categorical)
class_sex_grouping = titanic.groupby(['class', 'sex'], observed=True).agg({
    'survived': 'mean',      # Mean survival rate
    'fare': 'mean',          # Mean fare
    'age': ['mean', 'std']   # Mean and standard deviation of age
}).reset_index()
```
Using `reset_index` here is necessary to convert the multi-level index (created by the groupby operation) back into regular columns of the DataFrame. Without resetting the index, the resulting DataFrame would have `class` and `sex` as index levels, which can complicate further data manipulation and readability.
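The effect of `reset_index` can be seen in isolation on a small table. The sketch below uses a made-up four-row DataFrame (not the real Titanic data) that only borrows the column names from this lesson.

```python
import pandas as pd

# Toy data standing in for the Titanic columns (values are invented).
toy = pd.DataFrame({
    "class": ["First", "First", "Third", "Third"],
    "sex": ["female", "male", "female", "male"],
    "survived": [1, 0, 1, 0],
})

# Without reset_index: the group keys form a two-level row index.
indexed = toy.groupby(["class", "sex"])["survived"].mean()
print(list(indexed.index.names))   # ['class', 'sex']

# With reset_index: the group keys become ordinary columns again.
flat = indexed.reset_index()
print(list(flat.columns))          # ['class', 'sex', 'survived']
```

Once the keys are regular columns, they can be passed directly to `merge` or `pivot_table` by name.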
After grouping, we'll simplify the multi-level columns for readability.
```python
# Simplify multi-level columns
class_sex_grouping.columns = ['class', 'sex', 'survived_mean', 'fare_mean', 'age_mean', 'age_std']

print(class_sex_grouping)
```
Output:
```
    class     sex  survived_mean   fare_mean   age_mean    age_std
0   First  female       0.968085  106.125798  34.611765  13.612052
1   First    male       0.368852   67.226127  41.281386  15.139570
2  Second  female       0.921053   21.970121  28.722973  12.872702
3  Second    male       0.157407   19.741782  30.740707  14.793894
4   Third  female       0.500000   16.118810  21.750000  12.729964
5   Third    male       0.135447   12.661633  26.507589  12.159514
```
The output shows that first-class passengers had higher survival rates and paid higher fares than third-class passengers of the same sex.
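Hard-coding the new column names works, but the list has to be kept in sync with the aggregation by hand. As one possible alternative, the two column levels can be joined programmatically. The sketch below runs on invented toy data, and the underscore-joined naming scheme is just one convention, chosen to match the names used in this lesson.

```python
import pandas as pd

# Toy data (invented values), aggregated the same way as in the lesson.
toy = pd.DataFrame({
    "class": ["First", "First", "Third", "Third"],
    "sex": ["female", "female", "male", "male"],
    "survived": [1, 0, 1, 0],
    "age": [30.0, 40.0, 20.0, 22.0],
})

grouped = toy.groupby(["class", "sex"]).agg({
    "survived": "mean",
    "age": ["mean", "std"],
}).reset_index()

# Each column label is a tuple such as ('age', 'std') or ('class', '').
# Join the non-empty parts with an underscore to get flat names.
grouped.columns = ["_".join(part for part in col if part) for col in grouped.columns]

print(list(grouped.columns))
# ['class', 'sex', 'survived_mean', 'age_mean', 'age_std']
```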
After grouping and aggregating our data, we'll create a pivot table to summarize and cross-tabulate our datasets dynamically.
```python
# Pivot table with observed=True for grouping to avoid FutureWarning
pivot_table = class_sex_grouping.pivot_table(
    index='class',
    columns='sex',
    values=['survived_mean', 'fare_mean', 'age_mean', 'age_std'],
    observed=True
)

print(pivot_table)
#        survived_mean
# sex           female      male
# class
# First       0.968085  0.368852
# Second      0.921053  0.157407
# Third       0.500000  0.135447

# fare_mean
# Analogous

# age_mean
# Analogous

# age_std
# Analogous
```
The pivot table allows us to easily compare survival rates, fare means, and age statistics across different classes and genders.
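To make the reshaping step concrete, here is a minimal sketch on a toy long-format table (invented numbers, only two classes and two value columns). It shows how the result gets two-level columns of the form (value name, sex) and how a single cell can be looked up.

```python
import pandas as pd

# Toy long-format table: one row per (class, sex) pair, invented values.
long_form = pd.DataFrame({
    "class": ["First", "First", "Third", "Third"],
    "sex": ["female", "male", "female", "male"],
    "survived_mean": [0.9, 0.4, 0.5, 0.1],
    "fare_mean": [100.0, 70.0, 16.0, 12.0],
})

# Rows become classes, columns become (value name, sex) pairs.
wide = long_form.pivot_table(
    index="class",
    columns="sex",
    values=["survived_mean", "fare_mean"],
)

# Select one block of the table by its top-level column name.
print(wide["survived_mean"])

# Look up a single cell with a (value name, sex) tuple.
print(wide.loc["First", ("fare_mean", "female")])   # 100.0
```

Because each (class, sex) pair appears only once here, the default `mean` aggregation leaves the values unchanged.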
We'll add a new column to indicate whether a passenger is a child. This helps us understand survival rates among children.
```python
# Adding an 'is_child' column: whether the passenger is a child (age < 18)
titanic['is_child'] = titanic['age'] < 18
print(titanic['is_child'])
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# ...
```
Adding the `is_child` column allows further analysis considering passengers' age groups.
Next, let's analyze survival rates by class and whether the passenger is a child.
```python
# Analyze survival rates by class and whether the passenger is a child or not
survival_by_class_child = titanic.pivot_table(
    'survived', index='class', columns='is_child', aggfunc='mean',
    observed=True
)

print(survival_by_class_child)
# is_child     False     True
# class
# First     0.612745  0.916667
# Second    0.409938  0.913043
# Third     0.217918  0.371795
```
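The table shows that children had better survival rates than adults in every class. The `False` column holds the survival rates for adults, and the `True` column holds the survival rates for children. Note that a missing age compares as `False`, so passengers whose age is unknown are counted in the `False` column as well.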
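The way `age < 18` treats missing ages is easy to check on a three-value Series. The sketch below also shows one possible way to keep unknown ages as missing instead of labeling them as non-children. This is an illustration under the assumption that you want a nullable flag, not a step from the lesson's own analysis.

```python
import numpy as np
import pandas as pd

# Three toy ages: a child, an adult, and an unknown age.
ages = pd.Series([5.0, 40.0, np.nan], name="age")

# NaN < 18 evaluates to False, so the unknown age is grouped with non-children.
is_child = ages < 18
print(is_child.tolist())   # [True, False, False]

# Alternative sketch: use pandas' nullable boolean dtype and mark
# unknown ages as missing rather than False.
is_child_nullable = (ages < 18).astype("boolean")
is_child_nullable[ages.isna()] = pd.NA
print(is_child_nullable.isna().tolist())   # [False, False, True]
```

With the nullable version, rows with an unknown age can be excluded or reported separately before computing survival rates.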
We’ll merge our grouped data with child survival data for a comprehensive dataset.
```python
# Merge this pivot table with the original grouped data for a comprehensive view
comprehensive_view = pd.merge(
    class_sex_grouping,
    survival_by_class_child,
    on='class',
    how='left'
)

print(comprehensive_view)
```
The output is:
```
    class     sex  survived_mean  ...    age_std     False      True
0   First  female       0.968085  ...  13.612052  0.612745  0.916667
1   First    male       0.368852  ...  15.139570  0.612745  0.916667
2  Second  female       0.921053  ...  12.872702  0.409938  0.913043
3  Second    male       0.157407  ...  14.793894  0.409938  0.913043
4   Third  female       0.500000  ...  12.729964  0.217918  0.371795
5   Third    male       0.135447  ...  12.159514  0.217918  0.371795
```
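This merge is many-to-one: several rows per class on the left, one row per class on the right, so each class-level value is repeated for both sexes. A minimal sketch on invented toy tables shows the pattern. The `validate="many_to_one"` argument is an optional safeguard added here for illustration, and the right-hand table keeps `class` as a regular column for simplicity.

```python
import pandas as pd

# Left table: one row per (class, sex) pair (invented values).
per_class_sex = pd.DataFrame({
    "class": ["First", "First", "Third", "Third"],
    "sex": ["female", "male", "female", "male"],
    "survived_mean": [0.9, 0.4, 0.5, 0.1],
})

# Right table: one row per class (invented values).
per_class = pd.DataFrame({
    "class": ["First", "Third"],
    "child_survival_rate": [0.9, 0.4],
})

# how="left" keeps every row of the left table;
# validate raises an error if 'class' is not unique on the right.
merged = pd.merge(
    per_class_sex,
    per_class,
    on="class",
    how="left",
    validate="many_to_one",
)

print(merged)
```

The result has the same four rows as the left table, with the class-level rate copied onto each matching row.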
Merging datasets combines various insights into one comprehensive analysis. Additionally, we can rename the `True` and `False` columns from the `survival_by_class_child` DataFrame for clarity:
```python
# Rename the columns for clarity
comprehensive_view.rename(columns={False: 'adult_survival_rate', True: 'child_survival_rate'}, inplace=True)
```
Note that the `rename` method takes a dictionary mapping the old column names to the new column names.
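The dictionary keys must match the existing labels exactly, which in this case are the boolean values `False` and `True` rather than strings. As a small sketch on an invented two-row table, the same mapping can also be applied without `inplace=True`, in which case `rename` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Toy table whose column labels are the booleans False and True (invented values).
rates = pd.DataFrame(
    {False: [0.6, 0.2], True: [0.9, 0.4]},
    index=["First", "Third"],
)

# Without inplace=True, rename returns a renamed copy.
renamed = rates.rename(
    columns={False: "adult_survival_rate", True: "child_survival_rate"}
)

print(list(renamed.columns))   # ['adult_survival_rate', 'child_survival_rate']
print(list(rates.columns))     # [False, True]
```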
Today, you learned how to integrate multiple data analysis techniques to conduct a comprehensive analysis. We started by loading the dataset, then grouped and aggregated the data, created pivot tables, added a conditional column, compared survival rates for children and adults by class, and merged the results for broader insights.
Now, it's time to practice. In the next session, you'll work on similar exercises with different datasets or parameters. Happy coding!