Deep Exploration of the Titanic Dataset: Features and Characteristics

Lesson 1

Starting the Voyage: Exploring the Titanic Dataset

Welcome to our course, Intro to Data Visualization with Titanic, a hands-on introduction to the techniques and methods of data visualization in Python. The course works through a real-world dataset to show how visualization helps you understand data and where it applies in today's data-driven world.

In the first lesson of this course, we will explore the Titanic dataset available from Seaborn. It contains demographic and travel information for 891 passengers, both survivors and non-survivors, out of the more than 2,200 people on board the Titanic.

Understanding the data we're working with is foundational in data analysis: it lets us gain better insights, spot potential errors, and form a reliable basis for more intricate analysis. How long this step takes depends on the characteristics of the dataset and on what we want to learn from it.

So, let's dive in and explore the Titanic dataset to better understand the people who sailed on the Titanic.

Insight into Features of the Titanic Dataset

We begin our voyage into the dataset by briefly going over its main features (columns):

survived: Whether the passenger survived (0 = No; 1 = Yes).
pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd).
sex: Sex of the passenger (male or female).
age: Age of the passenger in years (stored as a float).
sibsp: Number of siblings/spouses aboard.
parch: Number of parents/children aboard.
fare: Passenger fare (in British pounds).
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).
... and more!

With these attributes in mind, let's load the Titanic dataset from Seaborn and look at its first rows.

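Python
import seaborn as sns

# Load the Titanic dataset into a pandas DataFrame
# (Seaborn downloads it on first use, so this needs an internet connection)
titanic_df = sns.load_dataset('titanic')

# Show the first five rows of the DataFrame
print(titanic_df.head())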

The head() call prints the following table (pandas abbreviates the middle columns as ...):

Plain text
   survived  pclass     sex   age  ...  deck  embark_town  alive  alone
0         0       3    male  22.0  ...   NaN  Southampton     no  False
1         1       1  female  38.0  ...     C    Cherbourg    yes  False
2         1       3  female  26.0  ...   NaN  Southampton    yes   True
3         1       1  female  35.0  ...     C  Southampton    yes  False
4         0       3    male  35.0  ...   NaN  Southampton     no   True

Each row here represents a different passenger on the ship, while each column holds one feature. Some columns, such as deck, embark_town, alive, and alone, go beyond the features listed above.
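To make the row-and-column layout concrete without downloading anything, the sketch below rebuilds the five printed rows as a small DataFrame. The name sample_df is hypothetical, and only the eight columns visible in the truncated output are included; the columns hidden behind ... are deliberately left out.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the head() output shown in this lesson.
# Only the columns visible in the printed table are reproduced.
sample_df = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "sex": ["male", "female", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "deck": [np.nan, "C", np.nan, "C", np.nan],
    "embark_town": ["Southampton", "Cherbourg", "Southampton",
                    "Southampton", "Southampton"],
    "alive": ["no", "yes", "yes", "yes", "no"],
    "alone": [False, False, True, False, True],
})

# One row = one passenger: look up the passenger at index 1
second_passenger = sample_df.loc[1]
print(second_passenger)

# One column = one feature: pull out every passenger's age
ages = sample_df["age"]
print(ages.tolist())
```

The same two lookups, .loc[...] for a row and ["column"] for a feature, work unchanged on the full titanic_df.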

Diving Deeper: Examining More Characteristics

Our dataset (titanic_df) is a pandas DataFrame, which comes with many built-in methods and attributes that we can use to inspect the data:

head(n): Returns the first n rows of the DataFrame (5 if n is omitted).
tail(n): Returns the last n rows of the DataFrame (5 if n is omitted).
shape: An attribute holding the number of rows and columns as a tuple.
info(): Prints a concise summary of the DataFrame, including column types and non-null counts.
describe(): Generates descriptive statistics that summarize the central tendency, dispersion, and shape of each numeric column's distribution.

Each of these functions offers a different perspective on the Titanic dataset:

Python
# Print the first five entries
print(titanic_df.head())

# Print the last five entries
print(titanic_df.tail())

# Print the shape of the DataFrame
print(titanic_df.shape)
# Output: (891, 15)

# Print a concise summary of the DataFrame
titanic_df.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
"""

# Print the descriptive statistics of the DataFrame
print(titanic_df.describe())
"""
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
"""

The output shows:

head() returns the first five rows, the same ones shown earlier.
tail() returns the last five rows of the DataFrame.
shape is (891, 15), indicating the DataFrame has 891 rows and 15 columns.
info() prints a concise summary, including the data type and the number of non-null entries for each column.
describe() provides a statistics table for the DataFrame's numerical columns.
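By default, describe() summarizes only the numeric columns of a mixed DataFrame, which is why text columns such as sex are absent from the statistics table. Calling describe() on a non-numeric column gives a different summary. The sketch below uses a hypothetical five-value stand-in (the sex values from the first five rows) so that it runs without loading the dataset.

```python
import pandas as pd

# Hypothetical stand-in: the 'sex' values from the five head() rows shown earlier
sex = pd.Series(["male", "female", "female", "female", "male"], name="sex")

# For non-numeric data, describe() reports count, unique, top, and freq
sex_summary = sex.describe()
print(sex_summary)
```

Here top is the most frequent value and freq is how often it occurs; on the real data the equivalent call would be titanic_df['sex'].describe().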

The non-null counts reported by info() reveal that the dataset contains missing values: age has 714 entries out of 891, embarked and embark_town have 889, and deck has only 203. We will learn to handle missing values in later lessons.
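The number of missing values per column follows directly from the info() output: subtract each non-null count from the 891 rows. A minimal sketch of that arithmetic, with the counts copied by hand from the summary in this lesson (the variable names are illustrative):

```python
import pandas as pd

total_rows = 891  # first element of titanic_df.shape

# Non-null counts copied from the info() output in this lesson
non_null = pd.Series({"age": 714, "embarked": 889,
                      "deck": 203, "embark_town": 889})

# Missing entries = total rows minus non-null entries
missing = total_rows - non_null
missing_pct = (missing / total_rows * 100).round(1)

print(pd.DataFrame({"missing": missing, "percent": missing_pct}))
```

On the loaded DataFrame itself, titanic_df.isna().sum() returns the missing count for every column directly.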

Deeper Dive with DataFrame Functionality

The value_counts() function can also be quite helpful in understanding the distribution of categorical data. For example, if you want to count how many male and female passengers were on the Titanic, you could use this command:

Python
print(titanic_df['sex'].value_counts())

# Output (the header and footer lines vary slightly across pandas versions)
"""
male      577
female    314
Name: sex, dtype: int64
"""
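Raw counts are often easier to interpret as shares. value_counts() accepts normalize=True for this. The sketch below rebuilds a stand-in for the sex column from the two counts reported in this lesson, so it runs without loading the dataset; the Series is hypothetical, not the real column.

```python
import pandas as pd

# Hypothetical stand-in for titanic_df['sex'], rebuilt from the counts above
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="sex")

# normalize=True returns proportions instead of raw counts
shares = sex.value_counts(normalize=True)
print(shares.round(3))
```

Roughly 65% of the passengers in the dataset are male and 35% are female.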

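The nunique() and unique() methods also come in handy for identifying unique entries within your dataset. The former gives the count of unique entries, and the latter gives the actual unique entries. Note that nunique() ignores missing values by default, while unique() includes NaN, which is why the count below is 3 even though four entries are listed.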

Python
# Print the count of unique entries in 'embarked' column
print(titanic_df['embarked'].nunique()) # Output: 3

# Print the unique entries in 'embarked' column
print(titanic_df['embarked'].unique()) # Output: ['S' 'C' 'Q' nan]
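How missing values are treated is controlled by the dropna parameter, which both nunique() and value_counts() accept. The sketch below uses a hypothetical five-value miniature of the embarked column rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the 'embarked' column, with one missing value
embarked = pd.Series(["S", "C", "S", "Q", np.nan], name="embarked")

n_ports = embarked.nunique()                   # NaN is ignored by default
n_with_nan = embarked.nunique(dropna=False)    # NaN counted as its own value
counts = embarked.value_counts(dropna=False)   # keeps a row for NaN

print(n_ports, n_with_nan)
print(counts)
```

Passing dropna=False is a quick way to see missing entries alongside the regular categories.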

These additional methods make your exploratory data analysis even more powerful!

Wrapping Up

Congratulations! You've now learned to explore and understand the Titanic dataset's basic features and characteristics using Python and pandas. We looked at the dataset's contents and got a first picture of the Titanic's passengers and their tragic journey. This groundwork sets the foundation for more advanced data visualizations.

In this lesson, we learned how to:

Load a dataset using Seaborn.
Explore the dataset using the various built-in functions provided by Pandas.

Practice Ahead!

We encourage you to apply what you've learned in this beginner-friendly exploration. Take the time to explore the dataset further: check the missing values, investigate the descriptive statistics, and try using other functionalities of Pandas.

Good luck with your journey in data visualization! Happy sailing!
