Welcome to the first lesson: meeting our dataset! You will learn how to load a text dataset in Python, how to convert the loaded dataset into a pandas DataFrame, and how to run some initial explorations with the pandas library.
The dataset we will work with in this lesson is the popular SMS Spam Collection dataset, which is widely used in text classification tasks in the field of Natural Language Processing (NLP).
To load the SMS Spam Collection dataset, which is hosted on the CodeSignal platform, we will use the `load_dataset` function from the `datasets` library, as demonstrated in this code snippet:
```python
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
```
After loading the dataset, let's convert it to a pandas DataFrame for more convenient handling.
A pandas DataFrame is a two-dimensional labeled data structure whose columns can hold different types. It is the most commonly used pandas object and is well suited to data wrangling and analysis, with built-in arithmetic operations and aggregations. We'll convert our `sms_spam` data into a pandas DataFrame.
The code snippet to perform this conversion is as follows:
```python
import pandas as pd

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])
```
The data stored under 'train' in the loaded dataset is passed to the `pd.DataFrame()` constructor, which turns it into a pandas DataFrame.
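To get a feel for what the constructor does with row-like records without downloading anything, here is a minimal sketch. The `rows` list and its messages are made up for illustration and only stand in for the real data under `sms_spam['train']`.

```python
import pandas as pd

# Hypothetical stand-in rows (made-up messages, not taken from the real dataset).
rows = [
    {"label": "ham", "message": "Are we still meeting at noon?"},
    {"label": "spam", "message": "You have won a prize, reply now to claim"},
    {"label": "ham", "message": "See you at lunch"},
]

# Each dictionary becomes one row; the dictionary keys become the column names.
toy_df = pd.DataFrame(rows)

print(list(toy_df.columns))  # column names taken from the dictionary keys
print(toy_df.shape)          # (number of rows, number of columns)
```

The real dataset should behave the same way in this respect: one row per message, with one column per field.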
One of the first steps in working with any dataset is to find out what it contains. The easiest way to get a quick idea of a DataFrame is to call its `head()` method, which shows the first few rows.
The `head()` method returns the first n rows, where n is passed as an argument. If no argument is passed, it returns the first 5 rows by default.
This is how you can use the `head()` method to preview the initial entries of the DataFrame:
```python
# Preview the first entries of the DataFrame
print(df.head())
```
The output of the above code will be:
```text
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
```
This output demonstrates the structure of the DataFrame containing the SMS data. Each row represents a distinct message, with the 'label' column indicating whether the message is spam (`spam`) or not (`ham`) and the 'message' column containing the text of the message.
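As a small sketch of how the row count passed to `head()` changes the preview, the example below uses a hypothetical six-row DataFrame with made-up messages in place of the real `df`. It also counts the values in the 'label' column with `value_counts()`, which is one possible next step for exploration rather than something this lesson requires.

```python
import pandas as pd

# Hypothetical stand-in for df; labels and messages are made up for illustration.
toy_df = pd.DataFrame({
    "label": ["ham", "ham", "spam", "ham", "ham", "spam"],
    "message": [
        "Are we still meeting at noon?",
        "Running late, start without me",
        "You have won a prize, reply now to claim",
        "See you at lunch",
        "Call me when you get home",
        "Limited offer, text YES to subscribe",
    ],
})

first_two = toy_df.head(2)     # explicit n: only the first 2 rows
default_rows = toy_df.head()   # no argument: the first 5 rows

# Count how many messages carry each label.
label_counts = toy_df["label"].value_counts()

print(first_two)
print(len(default_rows))
print(label_counts)
```

On the real DataFrame, the same calls would be `df.head(2)` and `df['label'].value_counts()`.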
Congratulations on finishing the lesson! You loaded a dataset using the Python library `datasets` and explored it using pandas. Along the way, you saw why data matters in Natural Language Processing and how to load it.
By now, you should feel more comfortable working with NLP datasets and pandas DataFrames. The next step is to practice what you've learned: take a different dataset, load it using the `datasets` library, and inspect it using pandas.
Remember, the key to mastering these skills is constant application and practice. Keep going, and you'll be amazed by how much you can accomplish!