Welcome back! In this lesson, we're going to dig deeper and explore the SMS Spam Collection dataset. We'll learn how to find out more about the dataset, such as unique value counts and some basic statistics. Understanding these details is hugely important when working on Natural Language Processing (NLP) tasks, as they can drive the preprocessing and modeling steps.
To get more details about the DataFrame, such as the datatypes of the columns and non-null counts, you can use the `info()` method. This method prints information about a DataFrame, including the index dtype and columns, non-null counts, and memory usage.
```python
# Show detailed information about the dataset
print(df.info())
```
The output of the above code will be:
```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   label    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
```
This output shows the DataFrame structure, detailing that it has two columns (`label` and `message`) with 5572 entries each. Both columns consist of objects (`dtype: object`), which means they are stored as strings in pandas, and there are no null values in either column, since the "Non-Null Count" is 5572 for both the `label` and `message` columns.
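If you ever want to confirm the missing-value counts directly rather than reading them off the `info()` output, a minimal sketch like the one below (assuming the same `df` DataFrame) sums the nulls per column:

```python
# Count missing values in each column (expected to be 0 for both)
print(df.isnull().sum())
```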
An essential preliminary step in data exploration is identifying the names of the columns in the DataFrame. Knowing the column names aids in efficiently accessing and manipulating data. Use the `columns` attribute to list all column names in the DataFrame:
```python
# List all column names
print(df.columns)
```
This simple line of code will output the names of the columns in your dataset, making it easier for you to reference specific data points as you continue your analysis:
```text
Index(['label', 'message'], dtype='object')
```
Understanding the column names in your dataset is crucial for applying specific data manipulation and analysis techniques effectively.
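Once you know the column names, selecting a single column is straightforward. As a quick illustration (again assuming our `df` DataFrame), the sketch below peeks at the first few messages:

```python
# Select the 'message' column by name and show the first five rows
print(df['message'].head())
```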
Now that we have a basic understanding of the structure of the data, let's learn more about its content. We can use the `nunique()` function to count the number of unique messages in the 'message' column and the `unique()` function to find the unique labels in the 'label' column.
```python
# Count the number of unique messages and labels
print("Unique messages:", df['message'].nunique())
# Returns the unique labels
print("Labels:", df['label'].unique())
```
The output of the above code will be:
```text
Unique messages: 5169
Labels: ['ham' 'spam']
```
This output indicates that there are 5169 unique messages in the dataset, and the 'label' column contains two unique values, 'ham' and 'spam', which represent non-spam and spam messages, respectively. This information is critical in understanding the diversity and distribution of the dataset.
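Unique values alone don't tell you how balanced the two classes are. A small sketch (assuming the same `df`) counts how many messages fall under each label:

```python
# Count how many messages carry each label
print(df['label'].value_counts())
```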
Finally, let's get some basic statistics about the data. Pandas provides the `describe()` function, which by default summarizes the numerical columns; since both of our columns are of type `object`, it instead reports the count, the number of unique values, the most frequent value, and that value's frequency for each column.
```python
# Display basic statistics
print(df.describe())
```
The output of the above code will be:
```text
       label                 message
count   5572                    5572
unique     2                    5169
top      ham  Sorry, I'll call later
freq    4825                      30
```
This output details the basic statistics for both columns: each has 5572 entries, the 'label' column holds 2 unique values with 'ham' appearing 4825 times, and the 'message' column contains 5169 unique messages, the most common being "Sorry, I'll call later", which appears 30 times. This summary gives insight into the repetitive nature of some messages in the dataset.
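If you're curious which other messages repeat, a short sketch (assuming the same `df`) lists the most frequent ones:

```python
# Show the five most frequently repeated messages and their counts
print(df['message'].value_counts().head())
```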
In this lesson, we've taken a deeper look at our dataset using Python and pandas. We've learned how to use the `info()`, `nunique()`, `unique()`, and `describe()` methods, along with the `columns` attribute, to get more information about our dataset. Understanding the composition of the dataset is a very important step while working on NLP tasks. In our next exercises, we'll practice implementing these methods in order to reinforce what we've learned. Keep up the great work!