Diving into Descriptive Statistics of The Titanic Dataset

Lesson 2

Introduction to Descriptive Statistics with the Titanic Dataset

Welcome back! This lesson is all about descriptive statistics and understanding the various characteristics of the Titanic dataset.

So, why do we need to study statistics when dealing with data? Well, statistics is a branch of mathematics dealing with data collection, organization, and interpretation. In data science, we use statistics to extract meaningful insights and knowledge from data.

Statistics helps us manage a dataset's complexity by reducing it to a simpler summary. Those summaries guide how we present and visualize the data, and they help us spot issues such as missing values or outliers before they reach an analysis or a machine learning model.

Take our current dataset, for instance, which comprises various demographics and passenger information; wouldn't it be interesting to know the average age or to gauge the variety in travelers' fares? Our lesson will focus on extracting these primary statistical features from our dataset, helping us better comprehend the Titanic voyage.

Overview of Descriptive Statistics

Descriptive statistics summarize and organize the characteristics of a dataset. A dataset is a collection of responses or observations from a sample or an entire population.

In pandas, the describe() method calculates basic statistics for every numeric column. That includes continuous variables such as age and fare, which can take any value within a range, as well as discrete ones such as pclass. Its output lists the count, mean, standard deviation (std), minimum, quartiles, and maximum.

First, let's import the library we will be using and load the dataset:

Python
import seaborn as sns

# Load the dataset
titanic = sns.load_dataset('titanic')

# Show the first few rows of data
print(titanic.head())

The output of head() will look like this:

Markdown
   survived  pclass     sex   age  ...  deck  embark_town  alive  alone
0         0       3    male  22.0  ...   NaN  Southampton     no  False
1         1       1  female  38.0  ...     C    Cherbourg    yes  False
2         1       3  female  26.0  ...   NaN  Southampton    yes   True
3         1       1  female  35.0  ...     C  Southampton    yes  False
4         0       3    male  35.0  ...   NaN  Southampton     no   True

The describe() function can then be executed as follows:

Python
# Generate descriptive statistics
titanic_stats = titanic.describe()
print(titanic_stats)

The output of describe() will look like this:

Markdown
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Here, describe() generates descriptive statistics that summarize the central tendency, dispersion, and shape of each numeric column's distribution, excluding NaN values. That is why the count for age is 714 rather than 891: rows with a missing age are skipped.
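
To see the NaN handling in isolation, here is a minimal sketch on a tiny hand-made DataFrame instead of the Titanic data. The values and the name toy are made up for illustration; the point is only that count reflects the non-missing entries and that the mean is computed over those entries alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the Titanic dataset): one age is missing
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

toy_stats = toy.describe()

# count ignores the NaN in 'age', so the two columns report different counts
age_count = toy_stats.loc["count", "age"]
fare_count = toy_stats.loc["count", "fare"]

# The mean of 'age' is taken over the three known ages only
age_mean = toy_stats.loc["mean", "age"]

print(toy_stats)
print("Known ages:", int(age_count), "of", len(toy))
```

Comparing a column's count with the number of rows is a quick way to find out how many values are missing before you trust the other statistics.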

What Else?

Notice that the categorical columns, like 'sex' or 'class', are missing from the output. By default, describe() only includes columns with numerical data.

If you want to include all columns, you need to pass include='all' as an argument. Here is how to do it:

Python
# Generate descriptive statistics for all columns
titanic_stats = titanic.describe(include='all')
print(titanic_stats)

Note that for categorical variables, the output has different rows: unique, top, and freq. 'unique' shows the number of distinct values in the column, 'top' shows the most frequent value, and 'freq' shows how many times that value appears. Statistics that don't apply to a column, such as the mean of 'sex', appear as NaN.
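
As a rough illustration of what unique, top, and freq mean, the sketch below calls describe() on a small made-up column of strings and then derives the same numbers by hand. The Series sex and its values are hypothetical, not the real Titanic column.

```python
import pandas as pd

# Hypothetical toy column (not the real Titanic data)
sex = pd.Series(["male", "female", "male", "male", "female"], name="sex")

# For non-numeric data, describe() reports count, unique, top, and freq
sex_summary = sex.describe()
print(sex_summary)

# The same numbers, derived by hand
counts = sex.value_counts()
n_unique = sex.nunique()      # number of distinct values
top_value = counts.idxmax()   # most frequent value
top_freq = counts.max()       # how often the most frequent value appears

print("unique:", n_unique, "top:", top_value, "freq:", top_freq)
```

If two values tie for most frequent, describe() still reports only one of them as top, so value_counts() is the safer tool when you need the full picture.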

Unveiling The Spread

Variability, also known as dispersion, is the extent to which data points differ from the center. Two commonly used measures are the range and interquartile range (IQR).

The range is the difference between a dataset's maximum and minimum values. However, it's sensitive to outliers; extremely high or low values can skew the range. Here's how you calculate the range for the age column of the Titanic dataset:

Python
# Calculate the range of the age column
age_range = titanic['age'].max() - titanic['age'].min()
print('Age Range:', age_range) # Age Range: 79.58

The IQR measures dispersion as the width of the interval that holds the middle 50% of your data: the third quartile (Q3) minus the first quartile (Q1). It's usually a more robust measure of dispersion than the range because values in the tails, including outliers, have little effect on it. Here's how you can calculate it:

Python
# Calculate the IQR
q1 = titanic['age'].quantile(0.25)
q3 = titanic['age'].quantile(0.75)
iqr = q3 - q1
print('Age IQR:', iqr) # Age IQR: 17.875
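
To make the difference in outlier sensitivity concrete, here is a small sketch that computes both measures on a made-up list of ages, before and after appending one extreme value. The data and the helper names value_range and interquartile_range are assumptions for illustration; quantile() is used with its default linear interpolation.

```python
import pandas as pd

def value_range(s):
    """Difference between the largest and smallest value."""
    return s.max() - s.min()

def interquartile_range(s):
    """Width of the interval holding the middle 50% of the values."""
    return s.quantile(0.75) - s.quantile(0.25)

# Hypothetical ages, then the same ages plus a single outlier
ages = pd.Series([20, 22, 25, 28, 30, 35, 40])
ages_with_outlier = pd.concat([ages, pd.Series([80])], ignore_index=True)

range_before = value_range(ages)
range_after = value_range(ages_with_outlier)
iqr_before = interquartile_range(ages)
iqr_after = interquartile_range(ages_with_outlier)

print("Range:", range_before, "->", range_after)
print("IQR:  ", iqr_before, "->", iqr_after)
```

On this toy data the single outlier triples the range, while the IQR moves only modestly, which is the behavior the paragraph above describes.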

Determining The Central Position

Measures of central tendency locate the center of your dataset. The two most common are the mean and the median.

The mean, or average, is the sum of all data points divided by the number of data points. Like the range, it is sensitive to extreme values.

Python
# Calculate the mean
mean_age = titanic['age'].mean()
print('Mean Age:', mean_age) # Mean Age: 29.69911764705882

The median is the middle value once the data is arranged in numerical order; with an even number of values, it is the average of the two middle ones. pandas handles the sorting for you.

Python
# Calculate the median
median_age = titanic['age'].median()
print('Median Age:', median_age) # Median Age: 28.0
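
The two measures can disagree noticeably when a column is skewed. In the describe() output earlier, the mean fare (about 32.20) is more than double the median fare (about 14.45). The sketch below reproduces that effect on a handful of made-up fares and checks the pandas results against the definitions; the values are hypothetical and chosen so that one ticket is far more expensive than the rest.

```python
import pandas as pd

# Hypothetical fares with one very expensive ticket
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 512.33])

mean_fare = fares.mean()
median_fare = fares.median()

# Mean by definition: sum of the values divided by how many there are
manual_mean = fares.sum() / fares.count()

# Median by definition: sort, then take the middle value
# (this shortcut is valid here because the number of values is odd)
sorted_fares = sorted(fares)
manual_median = sorted_fares[len(sorted_fares) // 2]

print("Mean fare:  ", mean_fare)
print("Median fare:", median_fare)
```

Because a single large value pulls the mean upward but leaves the middle value in place, the median is often the better summary of a "typical" value for skewed data such as fares.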

Wrapping Up

You've just taken your first steps into the realm of descriptive statistics! In this lesson, you've learned about the usefulness of statistics in data analysis and how we can summarize our Titanic dataset via central tendency and dispersion measures.

Understanding these statistical characteristics gives you a sound foundation for building meaningful data visualizations and, later on, for making effective predictions from the dataset.

Ready to Practice?

With the theory covered, let's put it into practice! The practice exercises will help you revisit everything you learned in this lesson while drawing statistical insights from our Titanic dataset.
