Welcome to the next leg of our journey with the Titanic Survival Data: wielding the power of descriptive statistics with NumPy and Pandas! In this lesson, we will cover how to use both libraries to perform descriptive statistical analysis on our dataset. By the end of this lesson, you will be able to calculate measures of central tendency such as the mean, median, and mode, and to interpret measures of variability, quartiles, and percentiles.
Why should we care about learning descriptive statistics? Well, simply put, descriptive statistics provide powerful, informative summaries of our data, allowing us to understand the nature and distribution of our data even before embarking on any form of machine learning or data prediction. Armed with this understanding, we are better equipped to carry out accurate analyses and produce meaningful insights from our data. Ready to investigate the Titanic dataset more thoroughly? Then, let's dive in!
Descriptive statistics are appropriately named, as they provide insights into the main features of our data. Let's start with the Titanic dataset and calculate some basic statistics for the age of passengers: the mean, median, and mode.
```python
import numpy as np
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic_df = sns.load_dataset('titanic')

mean_age = titanic_df['age'].mean()
median_age = titanic_df['age'].median()
mode_age = titanic_df['age'].mode()[0]

print(f"Mean age: {mean_age}")  # Mean age: 29.69911764705882
print(f"Median age: {median_age}")  # Median age: 28.0
print(f"Mode age: {mode_age}")  # Mode age: 24.0
```
The code calculates and displays the mean (average), median (middle value), and mode (most frequently occurring value) of the age column. These are measures of central tendency, and they give us a general picture of the age distribution of passengers aboard the Titanic.
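To see how these three measures can disagree, here is a minimal sketch on a small, made-up Series (the values are hypothetical, not taken from the Titanic data). It also suggests why the code above indexes `mode()` with `[0]`: Pandas returns every tied mode as a Series sorted in ascending order, so `[0]` simply picks the smallest one.

```python
import numpy as np
import pandas as pd

# Hypothetical ages (not Titanic data): one missing value and one outlier (80)
ages = pd.Series([22, 22, 30, 30, 35, 80, np.nan])

mean_age = ages.mean()      # NaN is skipped by default (skipna=True)
median_age = ages.median()  # middle of the 6 non-missing values
modes = ages.mode()         # all most-frequent values, sorted ascending

print(f"Mean: {mean_age}")         # Mean: 36.5
print(f"Median: {median_age}")     # Median: 30.0
print(f"Modes: {modes.tolist()}")  # Modes: [22.0, 30.0]
```

Notice that the single outlier pulls the mean well above the median, which is one reason to look at both before summarizing a column.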
Apart from measures of central tendency, there is another important class of measurement in statistics: measures of dispersion (variability). One of the most common ways to gauge the variability in a dataset is the standard deviation, which measures how much the values in a dataset vary around the mean. A low standard deviation indicates that the values are clustered close to the mean, while a higher standard deviation indicates a wider spread around the mean. For our Titanic dataset, we can calculate the standard deviation of age as follows:
```python
# Standard deviation
std_dev_age = np.std(titanic_df['age'])

print(f"Standard deviation of age: {std_dev_age}")  # Standard deviation of age: 14.516321150817316
```
Running the provided Python code will calculate and print the standard deviation of the age field in the Titanic dataset, thereby giving you a sense of how much the ages of passengers varied.
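One detail worth keeping in mind is that NumPy and Pandas use different defaults here: `np.std` computes the population standard deviation (`ddof=0`), while the Pandas `.std()` method computes the sample standard deviation (`ddof=1`). The value printed above follows the `ddof=0` convention, so calling `titanic_df['age'].std()` would give a slightly larger number. The sketch below illustrates the difference on a small, hypothetical set of values, and also shows how a plain NumPy array with a missing value behaves.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to check by hand
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(values.to_numpy())         # divides by n (ddof=0)
sample_std = values.std()                          # divides by n - 1 (ddof=1)
sample_std_np = np.std(values.to_numpy(), ddof=1)  # NumPy, matching Pandas

print(f"Population std (ddof=0): {population_std}")  # 2.0
print(f"Sample std (ddof=1): {sample_std:.4f}")      # 2.1381

# On a plain NumPy array, a single NaN propagates into the result
with_missing = np.array([2.0, 4.0, np.nan])
nan_result = np.std(with_missing)        # nan
nanstd_result = np.nanstd(with_missing)  # 1.0, NaN ignored

print(f"np.std with NaN: {nan_result}")
print(f"np.nanstd with NaN: {nanstd_result}")
```

Which convention is appropriate depends on whether you treat the data as the whole population or as a sample from it; the important part is to know which one a given call uses.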
Let's dig deeper and look at how quartiles and percentiles divide the data into segments. The p-th percentile is the value below which roughly p% of the observations fall, and the quartiles split the sorted data into four equal parts. The 25th percentile, for example, is the first quartile, the 50th percentile is the median, and the 75th percentile is the third quartile.
```python
# Quartiles and percentiles
# Using Numpy
Q1_age_np = np.percentile(titanic_df['age'].dropna(), 25)  # dropna is being used to drop NA values
Q3_age_np = np.percentile(titanic_df['age'].dropna(), 75)

print(f"First quartile of age (Numpy): {Q1_age_np}")
print(f"Third quartile of age (Numpy): {Q3_age_np}")

# Output:
# First quartile of age (Numpy): 20.125
# Third quartile of age (Numpy): 38.0

# Using Pandas
Q1_age_pd = titanic_df['age'].quantile(0.25)
Q3_age_pd = titanic_df['age'].quantile(0.75)

print(f"First quartile of age (Pandas): {Q1_age_pd}")
print(f"Third quartile of age (Pandas): {Q3_age_pd}")

# Output:
# First quartile of age (Pandas): 20.125
# Third quartile of age (Pandas): 38.0
```
The code first calculates and prints the first and third quartiles for the age column of our Titanic dataset using NumPy. It then repeats the calculation using Pandas, giving the same results. With these quartiles, we can immediately understand more about the age distribution of passengers on board the Titanic. For instance, we now know that about 50% of the passengers with a recorded age were between Q1_age_np (around 20 years old) and Q3_age_np (38 years old).
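The distance between the first and third quartiles is known as the interquartile range (IQR), and it is another handy measure of variability. Here is a minimal sketch on a small, hypothetical Series (again, not the Titanic data) that computes both quartiles with each library and derives the IQR. It also illustrates why the NumPy version above calls `dropna()` first, while the Pandas `quantile` method skips missing values on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.Series([10, 20, 30, 40, 50, np.nan])

# NumPy: drop missing values first
q1_np = np.percentile(ages.dropna(), 25)
q3_np = np.percentile(ages.dropna(), 75)

# Pandas: quantile() ignores missing values
q1_pd = ages.quantile(0.25)
q3_pd = ages.quantile(0.75)

# Interquartile range: the spread of the middle half of the data
iqr = q3_pd - q1_pd

# Without dropna(), the NaN propagates through np.percentile
q1_with_nan = np.percentile(ages.to_numpy(), 25)

print(f"Q1 (NumPy): {q1_np}, Q3 (NumPy): {q3_np}")    # Q1 (NumPy): 20.0, Q3 (NumPy): 40.0
print(f"Q1 (Pandas): {q1_pd}, Q3 (Pandas): {q3_pd}")  # Q1 (Pandas): 20.0, Q3 (Pandas): 40.0
print(f"IQR: {iqr}")                                  # IQR: 20.0
print(f"Q1 without dropna: {q1_with_nan}")            # Q1 without dropna: nan
```

Applied to the Titanic ages, the same subtraction (`Q3_age_pd - Q1_age_pd`) would give the width of the age band that holds the middle half of passengers with a recorded age.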
Congratulations! You have now added descriptive statistics to your Python data toolkit. You have learned how to use NumPy and Pandas to dig into your dataset and compute useful measures such as the mean, median, mode, quartiles, percentiles, and standard deviation. These are the ABCs of exploratory data analysis and provide a powerful first step into the realm of statistical analysis and data science.
With this foundation laid down, you now have what it takes to conduct more complex statistical analyses and to engage and succeed in even more advanced fields of Data Science.
Now it's time to consolidate your knowledge and master the science of statistics! Try out some practice problems and exercises that will help solidify all you have learned and equip you with the skills to extract more insights from our Titanic dataset! Remember, the more you practice, the more you learn!