Exploring Text Data: Loading and Examining the SMS Spam Collection Dataset

Lesson 1

Lesson Overview

Welcome to the first lesson: meeting our dataset! You will learn how to load a text dataset using the Python programming language, how to perform some initial explorations using the pandas library, and finally how to convert the loaded dataset into a pandas DataFrame.

The dataset we will work with in this lesson is the popular SMS Spam Collection dataset, which is widely used in text classification tasks in the field of Natural Language Processing (NLP).

Loading Dataset via Python Library datasets

To load our SMS Spam Collection dataset, we will use the load_dataset function from the datasets library to load our dataset hosted in the CodeSignal platform, as demonstrated in this code snippet:

Python
1from datasets import load_dataset
2
3# Load the SMS Spam Collection dataset
4sms_spam = load_dataset('codesignal/sms-spam-collection')

After loading the dataset, let's proceed to convert it to a pandas DataFrame for more convenient handling.

Converting the Loaded Dataset to pandas DataFrame

Pandas' DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is generally the most commonly used pandas object, perfect for data wrangling, manipulation and data analysis with integrated arithmetic operations and aggregations. We'll be converting our sms_spam data into a pandas DataFrame.

The code snippet to perform this conversion is as follows:

Python
1import pandas as pd
2
3# Convert to pandas DataFrame for convenient handling
4df = pd.DataFrame(sms_spam['train'])

The data stored under 'train' in the loaded dataset is converted into a pandas dataframe using the pd.DataFrame() function.

Exploring Initial Entries Using pandas head() Method

One of the first steps in working with any dataset is to know what the dataset contains. The easiest way to get a quick idea about the DataFrame is to use the head() method to show the first few rows.

The head() function is used to get the first n rows. The number of rows to select is passed as an argument. If no argument is passed, by default it returns the first 5 rows.

This is how you can use the head() method to preview the initial entries of the DataFrame:

Python
1# Preview the first entries of the DataFrame
2print(df.head())

The output of the above code will be:

Plain text
1  label                                            message
20   ham  Go until jurong point, crazy.. Available only ...
31   ham                      Ok lar... Joking wif u oni...
42  spam  Free entry in 2 a wkly comp to win FA Cup fina...
53   ham  U dun say so early hor... U c already then say...
64   ham  Nah I don't think he goes to usf, he lives aro...

This output demonstrates the structure of the DataFrame containing the SMS data. Each row represents a distinct message, with the 'label' column indicating whether the message is spam (spam) or not (ham) and the 'message' column containing the text of the message.

Lesson Summary and Practice

Congratulations on finishing the lesson! During this lesson, you loaded a dataset using the Python library datasets and explored it using pandas. The importance of data in Natural Language Processing and how to load it, was highlighted by you.

By now, you should feel more comfortable working with NLP datasets and pandas DataFrames. The next step for you is to practice what you've learnt. Take a different dataset and load it using the datasets library and inspect it using pandas.

Remember, the key to mastering these skills is by constant application and practice. Keep going, and you'll be amazed by how much you can accomplish!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.