Welcome to the first step in our course. This lesson demonstrates how to load and explore the Airline dataset using Python, showcasing its basic structure and notable features.
Understanding the dataset you're working with is the first key step in any data science project. Exploring the dataset helps detect trends, outliers, incorrect data, and much more. As a data scientist, it is essential to understand which questions your data can answer and which it cannot. Let's dive in and explore!
Our dataset, called the "Flights" dataset, is one of the example datasets available through the Seaborn library. It provides a monthly tally of airline passengers from 1949 to 1960.
The Flights dataset comprises three columns:
- `year`: The year in which the passenger count was recorded.
- `month`: The month in which the passenger count was recorded.
- `passengers`: The number of passengers who traveled in that month of that year.
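Let's load the dataset in Python. You can load this dataset, along with Seaborn's other example datasets, using the `load_dataset()` function. Note that `load_dataset()` fetches the data from Seaborn's online repository, so it needs an internet connection the first time it runs: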
```python
import seaborn as sns

# Load the Flights dataset
flights_df = sns.load_dataset('flights')

# Display the first five records
print(flights_df.head())
"""
   year month  passengers
0  1949   Jan         112
1  1949   Feb         118
2  1949   Mar         132
3  1949   Apr         129
4  1949   May         121
"""

# Display the first 10 records
print(flights_df.head(10))

# Display the last five records
print(flights_df.tail())
```
Running the above script loads the "Flights" dataset into a pandas DataFrame and displays the first five records, the first ten records, and the last five records, in that order. As the output shows, each row represents one month of one year, with columns specifying the year, the month, and the number of passengers.
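To see how `head()` and `tail()` behave without downloading anything, here is a minimal sketch on a small stand-in DataFrame. The name `sample_df` is an assumption made for illustration; its five rows are copied from the output shown above, and it is not the full Flights dataset.

```python
import pandas as pd

# Stand-in data (assumption): only the five 1949 rows printed earlier,
# not the full 144-row Flights dataset.
sample_df = pd.DataFrame({
    "year": [1949, 1949, 1949, 1949, 1949],
    "month": ["Jan", "Feb", "Mar", "Apr", "May"],
    "passengers": [112, 118, 132, 129, 121],
})

# head(n) returns the first n rows; tail(n) returns the last n rows.
first_two = sample_df.head(2)
last_two = sample_df.tail(2)
print(first_two)
print(last_two)

# With no argument, both methods default to five rows.
print(len(sample_df.head()))
```

Passing an explicit row count is handy when the default of five rows shows too little or too much of the data.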
Now, let's delve a little deeper into the structure of our data. Our DataFrame `flights_df` has a specific shape, that is, a certain number of rows and columns. You can retrieve it using the `shape` attribute, which returns a tuple of the form (number of rows, number of columns).
Additionally, you can use the `info()` method to get a quick description of the data, including the number of non-null entries in each column and the column data types.
```python
# Get the dimensions of the dataset
print('Shape of the dataset:', flights_df.shape)
# Output: Shape of the dataset: (144, 3)

# Get more information about the dataset
flights_df.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   year        144 non-null    int64
 1   month       144 non-null    category
 2   passengers  144 non-null    int64
dtypes: category(1), int64(2)
memory usage: 2.9 KB
"""
```
This will print out the number of entries, columns, column names, their data types, and the count of non-null entries per column, telling us whether our data has any missing entries. In this case, our dataset is complete and contains no missing values.
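If you prefer an explicit check over reading the `info()` printout, one possible approach is to count missing values per column with `isna().sum()`. The sketch below uses a hypothetical stand-in frame, `sample_df`, built from the five rows printed earlier; the copy named `with_gap` contains a deliberately blanked value that is artificial and does not exist in the real dataset.

```python
import pandas as pd

# Stand-in data (assumption): the five 1949 rows printed earlier in the lesson.
sample_df = pd.DataFrame({
    "year": [1949, 1949, 1949, 1949, 1949],
    "month": pd.Categorical(["Jan", "Feb", "Mar", "Apr", "May"]),
    "passengers": [112, 118, 132, 129, 121],
})

# shape is a plain tuple, so it can be unpacked directly.
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows, {n_cols} columns")

# Count missing values per column; all zeros means the frame is complete.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Artificial example: blank out one value to see how a gap would show up.
with_gap = sample_df.copy()
with_gap["passengers"] = with_gap["passengers"].astype("float64")
with_gap.loc[1, "passengers"] = float("nan")
print(with_gap.isna().sum())
```

On the complete frame every count is zero, while the artificial copy reports one missing value in `passengers`.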
We always want more! It is time to dig a little deeper into the dataset. A quick way to summarize the numerical fields in your dataset is the `describe()` method, which provides a statistical summary of the numerical columns.
```python
# Explore basic statistics of the dataset
print(flights_df.describe())
"""
              year  passengers
count   144.000000  144.000000
mean   1954.500000  280.298611
std       3.464102  119.966317
min    1949.000000  104.000000
25%    1951.750000  180.000000
50%    1954.500000  265.500000
75%    1957.250000  360.500000
max    1960.000000  622.000000
"""
```
For each numeric column, `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum. You will see from the output that the years range from 1949 to 1960, and that the median number of passengers, shown in the 50% row, is 265.5. Quite insightful already, isn't it?
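As a quick sanity check on how to read the summary, the sketch below compares the 50% row of `describe()` with a direct call to `median()`. It runs on a hypothetical stand-in frame, `sample_df`, holding only the five rows printed earlier, so its numbers differ from the full-dataset summary above.

```python
import pandas as pd

# Stand-in data (assumption): the five 1949 rows printed earlier in the lesson.
sample_df = pd.DataFrame({
    "year": [1949, 1949, 1949, 1949, 1949],
    "month": pd.Categorical(["Jan", "Feb", "Mar", "Apr", "May"]),
    "passengers": [112, 118, 132, 129, 121],
})

summary = sample_df.describe()

# By default describe() keeps only the numeric columns,
# so the categorical 'month' column does not appear.
print(list(summary.columns))

# The '50%' row is the median, so both values below should agree.
median_from_describe = summary.loc["50%", "passengers"]
median_direct = sample_df["passengers"].median()
print(median_from_describe, median_direct)
```

The same pattern, `summary.loc[row_label, column_name]`, can be used to pull any single statistic out of the summary table.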
Congratulations on completing your first exploration of the Flights dataset! You have loaded the data and now have a better understanding of its structure, its overall shape, and its key summary statistics.
Throughout this lesson, we have covered:
- Loading the Airline dataset using the `load_dataset()` function in Seaborn.
- Getting the dataset's shape and summary with the `shape` attribute and the `info()` method.
- Applying basic descriptive statistics to understand your data better, using the `describe()` method.
By doing this, we're laying a foundation for the subsequent steps: cleaning and manipulating this data, then visualizing and modeling it. This initial exploration leaves us better prepared to uncover trends in air travel!
Are you ready to delve deeper? In the following practice session, you will have a chance to practice your skills and explore the dataset further. Use the knowledge you gained in this lesson to uncover more insights and expand your understanding of the dataset. Let's get to it!