Lesson 1

Deep Exploration of the Titanic Dataset: Features and Characteristics

Starting the Voyage: Exploring the Titanic Dataset

Welcome to our course, Intro to Data Visualization with Titanic - an in-depth exploration into the techniques and methodologies of data visualization using Python. This course is designed to provide you with comprehensive insights into real-world scenarios, helping you understand the invaluable concept of data visualization and its applications in today's data-driven world.

In the first lesson of this course, we will explore the properties of the Titanic dataset available from Seaborn. It contains demographic and travel information for 891 passengers, survivors and non-survivors alike, out of the roughly 2,200 people on board the Titanic.

Understanding the data we're working with is foundational in data analysis because it lets us gain better insights into it and spot potential errors. It also gives us a reliable basis for more intricate analysis later. How long this exploration takes depends on the characteristics of the dataset and on what we want to learn from it.

So, let's dive in and explore the Titanic dataset to learn more about the people who sailed on the Titanic.

Insight into Features of the Titanic Dataset

We begin our voyage into the dataset by briefly going over its main features (columns):

  • survived: Whether the passenger survived (0 = No; 1 = Yes).
  • pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd).
  • sex: Sex of the passenger (male or female).
  • age: Age of the passenger (float number).
  • sibsp: Number of siblings/spouses aboard.
  • parch: Number of parents/children aboard.
  • fare: Passenger fare (in British pounds).
  • embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).
  • ... and more!
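Several of the features above are stored as codes rather than readable labels. As a minimal sketch of how those codes could be translated, the example below applies `Series.map` to a small hand-built sample. The sample rows and the names `port_names`, `outcome_names`, `port`, and `survived_label` are assumptions made for illustration and are not part of the Seaborn dataset.

```python
import pandas as pd

# Hand-built sample for illustration only (not the full Titanic dataset).
sample = pd.DataFrame({
    "survived": [0, 1, 1],
    "pclass": [3, 1, 3],
    "embarked": ["S", "C", "S"],
})

# Lookup tables copied from the feature descriptions in the list above.
port_names = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
outcome_names = {0: "No", 1: "Yes"}

# Series.map replaces each code with the matching label from the dictionary.
sample["port"] = sample["embarked"].map(port_names)
sample["survived_label"] = sample["survived"].map(outcome_names)

print(sample)
```

Keeping the original coded columns next to the labeled ones makes it easy to check that the mapping did what you expected.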

With these attributes in mind, let's load the Titanic dataset from Seaborn and look at its first few rows.

    import seaborn as sns

    titanic_df = sns.load_dataset('titanic')

    # Show the first five entries of the DataFrame
    print(titanic_df.head())

The output of the head command is in the following table:

       survived  pclass     sex   age  ...  deck  embark_town  alive  alone
    0         0       3    male  22.0  ...   NaN  Southampton     no  False
    1         1       1  female  38.0  ...     C    Cherbourg    yes  False
    2         1       3  female  26.0  ...   NaN  Southampton    yes   True
    3         1       1  female  35.0  ...     C  Southampton    yes  False
    4         0       3    male  35.0  ...   NaN  Southampton     no   True

Each row here represents a different passenger on the ship, while each column corresponds to one feature, including those described above.
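To make the row and column view concrete, here is a minimal sketch that pulls out one passenger (a row) and one feature (a column). It uses a hand-built sample that mirrors a few columns of the five rows shown above, so it runs without downloading the dataset. The names `sample`, `first_passenger`, and `ages` are assumptions chosen for illustration.

```python
import pandas as pd

# Hand-built sample mirroring part of the head() output above.
sample = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "sex": ["male", "female", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

# A row is one passenger: .loc[0] returns the row with index label 0.
first_passenger = sample.loc[0]
print(first_passenger)

# A column is one feature: square brackets return it as a Series.
ages = sample["age"]
print(ages)
```

The same two access patterns work on `titanic_df` once it has been loaded.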

Diving Deeper: Examining More Characteristics

Our dataset (titanic_df) is a Pandas DataFrame, and it comes with many built-in methods and attributes that we can use to inspect the data:

  • head(n): Displays the first n entries of the DataFrame.
  • tail(n): Displays the last n entries of the DataFrame.
  • shape: Returns the number of rows and columns of the DataFrame.
  • info(): Provides a concise summary of the DataFrame.
  • describe(): Generates descriptive statistics that summarize the central tendency, dispersion, and shape of each numerical column's distribution.
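As a small illustration of the `n` argument and of `shape`, the sketch below runs `head`, `tail`, and `shape` on a hand-built five-row sample. The sample and the variable names are assumptions for illustration. Note that `shape` is an attribute, so it is written without parentheses.

```python
import pandas as pd

# Hand-built sample for illustration only (not the full Titanic dataset).
sample = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "sex": ["male", "female", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

first_two = sample.head(2)     # first 2 rows instead of the default 5
last_one = sample.tail(1)      # last row only
n_rows, n_cols = sample.shape  # attribute: a (rows, columns) tuple

print(first_two)
print(last_one)
print(n_rows, n_cols)
```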

Each of these functions offers a different perspective on the Titanic dataset:

    # Print the first five entries
    print(titanic_df.head())

    # Print the last five entries
    print(titanic_df.tail())

    # Print the shape of the DataFrame
    print(titanic_df.shape)
    # Output: (891, 15)

    # Print a concise summary of the DataFrame
    titanic_df.info()
    """
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 15 columns):
     #   Column       Non-Null Count  Dtype
    ---  ------       --------------  -----
     0   survived     891 non-null    int64
     1   pclass       891 non-null    int64
     2   sex          891 non-null    object
     3   age          714 non-null    float64
     4   sibsp        891 non-null    int64
     5   parch        891 non-null    int64
     6   fare         891 non-null    float64
     7   embarked     889 non-null    object
     8   class        891 non-null    category
     9   who          891 non-null    object
     10  adult_male   891 non-null    bool
     11  deck         203 non-null    category
     12  embark_town  889 non-null    object
     13  alive        891 non-null    object
     14  alone        891 non-null    bool
    dtypes: bool(2), category(2), float64(2), int64(4), object(5)
    memory usage: 80.7+ KB
    """

    # Print the descriptive statistics of the DataFrame
    print(titanic_df.describe())
    """
             survived      pclass         age       sibsp       parch        fare
    count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
    mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
    std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
    min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
    25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
    50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
    75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
    max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
    """

The output shows:

  • The head command outputs the first five rows, as in the table shown earlier.
  • The tail command outputs the last five rows of the DataFrame.
  • The shape command returns (891, 15), indicating the DataFrame has 891 rows and 15 columns.
  • The info command prints a concise summary, including the number of non-null entries and the data type of each column.
  • The describe command provides a statistics table for the DataFrame's numerical columns.

You will notice from the info() summary that the dataset contains missing values: age has 714 non-null entries out of 891, embarked has 889, and deck has only 203. We will learn to handle missing values in later lessons.
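One common way to count missing values directly is to combine `isna()` with `sum()`. The sketch below is a minimal example on a hand-built sample whose values and gaps are made up for illustration. The names `missing_per_column` and `missing_share` are likewise assumptions.

```python
import numpy as np
import pandas as pd

# Hand-built sample with made-up gaps (illustration only, not real Titanic rows).
sample = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "embarked": ["S", "C", np.nan, "S", "Q"],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

# isna() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# mean() of the same boolean table gives the share of missing values per column.
missing_share = sample.isna().mean()
print(missing_share)
```

Running `titanic_df.isna().sum()` on the loaded dataset should agree with the non-null counts reported by `info()`.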

Deeper Dive with DataFrame Functionality

The value_counts() function can also be quite helpful in understanding the distribution of categorical data. For example, if you want to count how many male and female passengers were on the Titanic, you could use this command:

    print(titanic_df['sex'].value_counts())

    """
    male      577
    female    314
    Name: sex, dtype: int64
    """
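Raw counts are often easier to compare as proportions. As a minimal sketch, the example below rebuilds a `sex` column from the counts printed above (577 male, 314 female) and passes `normalize=True` to `value_counts()`. Rebuilding the column this way is an assumption made so the example runs without loading the dataset.

```python
import pandas as pd

# Rebuild a 'sex' column from the counts shown in the lesson (illustration only).
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="sex")

# normalize=True returns each category's share of the total instead of its count.
shares = sex.value_counts(normalize=True)
print(shares.round(3))
```

On this data, roughly 65% of the passengers are male and 35% are female.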

The nunique() and unique() functions could also come in handy to identify unique entries within your dataset. The former gives the count of unique entries, and the latter gives the actual unique entries.

    # Print the count of unique entries in 'embarked' column
    print(titanic_df['embarked'].nunique())  # Output: 3

    # Print the unique entries in 'embarked' column
    print(titanic_df['embarked'].unique())  # Output: ['S' 'C' 'Q' nan]
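Notice that nunique() reports 3 while unique() lists four entries. This is because nunique() skips missing values by default, whereas unique() includes them. The sketch below shows the difference on a short hand-built Series. The sample values are an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built sample with one missing port (illustration only).
embarked = pd.Series(["S", "C", "S", "Q", np.nan], name="embarked")

# By default, nunique() ignores missing values.
print(embarked.nunique())

# dropna=False counts the missing value as its own entry.
print(embarked.nunique(dropna=False))

# value_counts() accepts the same option to show how many values are missing.
print(embarked.value_counts(dropna=False))
```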

Together, value_counts(), nunique(), and unique() give you a quick picture of the categories in a column, which makes your exploratory data analysis more effective.

Wrapping Up

Congratulations! You've now learned to explore and understand the Titanic dataset's basic features and characteristics using Python and Pandas. We looked at the dataset's columns, data types, summary statistics, and missing values. This groundwork sets the foundation for the more advanced data visualizations in later lessons.

In this lesson, we learned how to:

  • Load a dataset using Seaborn.
  • Explore the dataset using the various built-in functions provided by Pandas.

Practice Ahead!

We encourage you to apply what you've learned in this beginner-friendly exploration. Take the time to explore the dataset further: check the missing values, investigate the descriptive statistics, and try using other functionalities of Pandas.

Good luck with your journey in data visualization! Happy sailing!
