Unleashing Descriptive Statistics on Titanic Data with NumPy and Pandas

Lesson 4

Topic Introduction and Actualization

Welcome to the next leg of our journey with the Titanic Survival Data: wielding the power of descriptive statistics with NumPy and Pandas! In this lesson, we will cover how to use both libraries to perform descriptive statistical analysis on our dataset. By the end of this lesson, you will be able to calculate measures of central tendency such as the mean, median, and mode, and to interpret measures of variability, quartiles, and percentiles.

Why should we care about learning descriptive statistics? Well, simply put, descriptive statistics provide powerful, informative summaries of our data, allowing us to understand the nature and distribution of our data even before embarking on any form of machine learning or data prediction. Armed with this understanding, we are better equipped to carry out accurate analyses and produce meaningful insights from our data. Ready to investigate the Titanic dataset more thoroughly? Then, let's dive in!

Descriptive Statistics

Descriptive statistics are appropriately named, as they provide insights into the main features of our data. Let's start with the Titanic dataset and calculate some basic statistics for the age of passengers: the mean, median, and mode.

Python
import numpy as np
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic_df = sns.load_dataset('titanic')

mean_age = titanic_df['age'].mean()
median_age = titanic_df['age'].median()
mode_age = titanic_df['age'].mode()[0]

print(f"Mean age: {mean_age}") # Mean age: 29.69911764705882
print(f"Median age: {median_age}") # Median age: 28.0
print(f"Mode age: {mode_age}") # Mode age: 24.0

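The code calculates and displays the mean (average), median (middle value), and mode (most frequently occurring value) of the age column. Pandas skips missing ages when computing these, and because mode() returns a Series containing every tied value, [0] selects the first one. These are measures of central tendency, and they give us a general picture of the age distribution of passengers aboard the Titanic.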
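Since the age column contains missing values, it may help to see how the two libraries treat them. The sketch below uses a small, made-up sample of ages (the values and variable names are illustrative assumptions, not taken from the Titanic data) to compare the Pandas methods with their NumPy counterparts.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of ages (not Titanic data); NaN marks a missing value
sample_ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan, 22.0, 54.0])

# Pandas skips NaN by default
pandas_mean = sample_ages.mean()
pandas_median = sample_ages.median()
pandas_mode = sample_ages.mode()[0]  # mode() returns a Series; take the first value

# Plain NumPy functions propagate NaN, so this result is NaN
values = sample_ages.to_numpy()
naive_mean = np.mean(values)

# The NaN-aware NumPy variants ignore missing values
numpy_mean = np.nanmean(values)
numpy_median = np.nanmedian(values)

print(f"Pandas mean: {pandas_mean:.2f}, median: {pandas_median}, mode: {pandas_mode}")
print(f"NumPy mean without NaN handling: {naive_mean}")
print(f"NumPy nanmean: {numpy_mean:.2f}, nanmedian: {numpy_median}")
```

In this sketch the Pandas results and the np.nanmean / np.nanmedian results agree, while np.mean on the raw array returns NaN. That is the main thing to watch for when switching between the two libraries.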

Measures of Variability: Standard Deviation

Apart from measures of central tendency, there is another important class of measures in statistics: measures of dispersion (variability). One of the most common ways to gauge the variability in a dataset is the standard deviation, which measures how much the values vary around the mean. A low standard deviation indicates values clustered close to the mean, while a high standard deviation indicates a wider spread. For our Titanic dataset, we can calculate the standard deviation of age as follows:

Python
# Standard deviation
# np.std defaults to ddof=0 (population standard deviation)
std_dev_age = np.std(titanic_df['age'])

print(f"Standard deviation of age: {std_dev_age}") # Standard deviation of age: 14.516321150817316

Running the provided Python code will calculate and print the standard deviation of the age field in the Titanic dataset, thereby giving you a sense of how much the ages of passengers varied.
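One detail worth keeping in mind is that NumPy and Pandas use different defaults here: np.std divides by n (ddof=0, the population standard deviation), whereas the Pandas std() method divides by n - 1 (ddof=1, the sample standard deviation). The following is a minimal sketch on a small, made-up set of values (an assumption for illustration, not Titanic data) showing the difference and how to make the two agree.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow (mean is 5)
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# NumPy default: population standard deviation (divide by n)
population_std = np.std(values.to_numpy())

# Pandas default: sample standard deviation (divide by n - 1)
sample_std_pd = values.std()

# Passing ddof=1 to NumPy reproduces the Pandas default
sample_std_np = np.std(values.to_numpy(), ddof=1)

print(f"Population std (NumPy default): {population_std}")
print(f"Sample std (Pandas default): {sample_std_pd}")
print(f"Sample std (NumPy, ddof=1): {sample_std_np}")
```

For a dataset as large as the Titanic passenger list the two values differ only slightly, but it is worth stating which one you report.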

Delving Deeper into Data: Quartiles and Percentiles

Let's dig deeper and look at how quartiles and percentiles divide data into segments. A percentile is the value below which a given percentage of the data falls, and the quartiles are the percentiles that cut the data into four equal parts. The 25th percentile, for example, is the first quartile, and the 75th percentile is the third quartile.

Python
# Quartiles and percentiles
# Using Numpy
Q1_age_np = np.percentile(titanic_df['age'].dropna(), 25) # dropna is being used to drop NA values
Q3_age_np = np.percentile(titanic_df['age'].dropna(), 75)

print(f"First quartile of age (Numpy): {Q1_age_np}")
print(f"Third quartile of age (Numpy): {Q3_age_np}")

# Output:
# First quartile of age (Numpy): 20.125
# Third quartile of age (Numpy): 38.0

# Using Pandas
Q1_age_pd = titanic_df['age'].quantile(0.25)
Q3_age_pd = titanic_df['age'].quantile(0.75)

print(f"First quartile of age (Pandas): {Q1_age_pd}")
print(f"Third quartile of age (Pandas): {Q3_age_pd}")

# Output:
# First quartile of age (Pandas): 20.125
# Third quartile of age (Pandas): 38.0

The code first calculates and prints the first and third quartiles of the age column using NumPy, then repeats the calculation using Pandas, giving the same results. Note that np.percentile needs the missing values dropped first, whereas quantile() skips them automatically. With these quartiles, we can immediately understand more about the age distribution of passengers on board the Titanic. For instance, we now know that about 50% of the passengers with a recorded age were between Q1_age_np (around 20 years old) and Q3_age_np (38 years old).
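The distance between the first and third quartiles is known as the interquartile range (IQR), and it describes the spread of the middle half of the data. As a rough sketch, the example below computes all three quartiles and the IQR on a small, made-up sample of ages (the values and variable names are assumptions for illustration, not Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical sample of ages (not Titanic data); NaN marks a missing value
sample_ages = pd.Series([4.0, 18.0, 21.0, 24.0, 28.0, 30.0, 38.0, 45.0, 62.0, np.nan])

# NumPy: drop missing values first, then request several percentiles at once
clean = sample_ages.dropna().to_numpy()
q1_np, q2_np, q3_np = np.percentile(clean, [25, 50, 75])

# Pandas: quantile() skips NaN and also accepts a list of fractions
quartiles_pd = sample_ages.quantile([0.25, 0.5, 0.75])

# Interquartile range: the spread of the middle half of the data
iqr = q3_np - q1_np

# Share of non-missing values lying between Q1 and Q3 (inclusive)
share_inside = ((clean >= q1_np) & (clean <= q3_np)).mean()

print(f"Q1: {q1_np}, median (Q2): {q2_np}, Q3: {q3_np}")
print(f"IQR: {iqr}")
print(f"Share of values between Q1 and Q3: {share_inside:.2f}")
print(quartiles_pd)
```

The second quartile is the median, so it matches what median() would return on the same data. With a sample this small, the share of values between Q1 and Q3 is only roughly one half; it gets closer to 50% as the dataset grows.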

Wrapping Up

Congratulations! You have now added descriptive statistics to your Python data toolkit. You have learned how to use NumPy and Pandas to dig into your dataset and compute useful measures such as the mean, median, mode, quartiles, percentiles, and standard deviation. These are the ABCs of exploratory data analysis and provide a powerful first step into the realm of statistical analysis and data science.

With this foundation laid down, you now have what it takes to conduct more complex statistical analyses and to engage and succeed in even more advanced fields of Data Science.

Ready to Practice?

Now it's time to consolidate your knowledge and master the science of statistics! Try out some practice problems and exercises that will help solidify all you have learned and equip you with the skills to extract more insights from our Titanic dataset! Remember, the more you practice, the more you learn!
