Hello and welcome! In this lesson, we'll dive into a crucial step in text data preprocessing in Natural Language Processing (NLP) — Removing stop words using Python and NLTK.
Stop words usually refer to the most commonly used words in a language. Despite their high frequency, these words carry little meaningful information and are often filtered out of text data. By the end of this lesson, you'll understand what stop words are, why removing them matters in NLP, and how to strip them from your own text data.
The big question is: what exactly are stop words? You can look at stop words as background noise in your text data. They're words like `is`, `the`, `in`, and `and`: words that don't carry a lot of meaning on their own.
So, why do we want to remove these stop words? Machine learning models look for signals to make decisions. If we leave in common words that appear in nearly every document, we aren't giving our model much useful information. Hence, early in our preprocessing pipeline, we remove these words to let our model focus on the words that may indicate something more distinctive.
To remove these stop words, we first have to identify them. To do this efficiently, we can leverage a resource from the NLTK library: its built-in English stop words list. Before we can use this list, however, it's important to ensure that the `stopwords` package is downloaded. This is achieved using the command `nltk.download('stopwords')`. Once downloaded, we can access the list of commonly agreed-upon stop words in the English language by calling `nltk.corpus.stopwords.words('english')`. This function returns a comprehensive list that we can use to filter out stop words from our text data.
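As a quick, self-contained sketch of just this step (separate from the full pipeline below), you can download the package and peek at the list. The exact contents and length may vary slightly between NLTK versions:

```python
import nltk

# Fetch the stop words corpus (a no-op if it's already downloaded)
nltk.download('stopwords')

from nltk.corpus import stopwords

# Load the commonly agreed-upon English stop words
stop_words = stopwords.words('english')

print(len(stop_words))   # around 179 in recent NLTK versions
print(stop_words[:5])    # e.g., ['i', 'me', 'my', 'myself', 'we']
```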
To better illustrate this, let's walk through the code block below:
```python
import pandas as pd
from datasets import load_dataset
import nltk

# Download the tokenizer models and the stop words list
nltk.download('punkt')
nltk.download('stopwords')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Load the SMS Spam Collection dataset and convert it to a DataFrame
sms_spam = load_dataset('codesignal/sms-spam-collection')
df = pd.DataFrame(sms_spam['train'])

# Break each message into individual tokens (words and punctuation)
df['tokens'] = df['message'].apply(lambda x: word_tokenize(x))

# NLTK's built-in English stop words list
stop_words = stopwords.words('english')

print("Some stop words:", stop_words[:3])

# Keep only the tokens that are not stop words
# (note that this comparison is case-sensitive)
df['filtered_tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])

print(df['filtered_tokens'].head())
```
In this block of code, we start by importing the required libraries and downloading the necessary NLTK packages. Next, we convert our `sms_spam` dataset into a pandas DataFrame, which makes it easier to handle.
We then proceed to tokenize the `message` column of our DataFrame and store our tokens in a new column, `tokens`. Tokenizing involves breaking down our text data into individual components. At this point, we don't worry about any stop words; our primary focus is on breaking down the sentences into words.
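To make this concrete, here's a minimal sketch of what `word_tokenize` does to a single, made-up message:

```python
import nltk
nltk.download('punkt')  # tokenizer models required by word_tokenize

from nltk.tokenize import word_tokenize

# A made-up message, just for illustration
sample = "Free entry in a wkly comp to win FA Cup tickets!"
print(word_tokenize(sample))
# ['Free', 'entry', 'in', 'a', 'wkly', 'comp', 'to', 'win', 'FA', 'Cup', 'tickets', '!']
```

Notice that stop words like `in`, `a`, and `to` are still present at this stage; removing them is the next step.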
We then define our stop words using NLTK's built-in English stop words list and assign this list to a variable, `stop_words`. Following that, we print three stop words as examples.
Now that we have both our tokens and our stop words, we can proceed to remove any stop words from our tokens. We use a lambda function to compare each word in our tokens to our list of stop words. If the word is a stop word, we filter it out.
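To see this filtering step in isolation, here's a small self-contained sketch using an invented token list:

```python
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

stop_words = stopwords.words('english')

# An invented token list, just for illustration
tokens = ['the', 'quick', 'brown', 'fox', 'is', 'in', 'the', 'garden']

# Keep only tokens that are not in the stop words list
filtered = [word for word in tokens if word not in stop_words]
print(filtered)  # ['quick', 'brown', 'fox', 'garden']
```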
We apply this process to our DataFrame, and our final output is a DataFrame with another new column, `filtered_tokens`. This column contains our tokenized messages, sans the stop words.
The output will be:
```
Some stop words: ['i', 'me', 'my']

0    [Go, jurong, point, ,, crazy, .., Available, b...
1    [Ok, lar, ..., Joking, wif, u, oni, ...]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3    [U, dun, say, early, hor, ..., U, c, already, ...
4    [Nah, I, n't, think, goes, usf, ,, lives, arou...
Name: filtered_tokens, dtype: object
```
This output shows the first few entries of the `filtered_tokens` column from our DataFrame, demonstrating the result of removing stop words from the tokenized messages. Each entry corresponds to tokenized, filtered text from our initial dataset, showcasing how common stop words are excluded, leaving more meaningful words.
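One detail worth noticing: capitalized tokens like `I` and `U` survive the filter because NLTK's list is all lowercase and our comparison is case-sensitive. A common refinement, sketched below as an optional extra (not part of the pipeline above), is to lowercase each token before checking membership; converting the list to a set also speeds up the lookups:

```python
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

# Optional refinement: a set for faster lookups, plus lowercasing
# each token so capitalized stop words like 'I' are caught too
stop_set = set(stopwords.words('english'))

tokens = ['Nah', 'I', 'do', "n't", 'think', 'he', 'goes', 'to', 'usf']
filtered = [word for word in tokens if word.lower() not in stop_set]
print(filtered)  # ['Nah', "n't", 'think', 'goes', 'usf']
```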
Phew! We made it. You should now understand what stop words are, why they're important in Natural Language Processing, and why we remove them. You can now remove these stop words from your text data using Python and the NLTK library, which is a pretty neat skill to have.
Remember to keep practicing. Challenge yourself with different text data and try to remove the stop words. There's no better way to learn than through constant hands-on experience. So, go ahead and start analyzing some data!