Hello and welcome! In this lesson, we'll dive into a crucial step in text data preprocessing in Natural Language Processing (NLP) — Removing stop words using Python and NLTK.
Stop words usually refer to the most commonly used words in a language. Despite their high frequency, these words carry little meaningful information and are often filtered out of text data. By the end of this lesson, you'll understand what stop words are, why removing them matters in NLP, and how to strip them from your own text data.
The big question is: what exactly are stop words? You can look at stop words as background noise in your text data. They're words like `is`, `the`, `in`, and `and`: words that don't carry a lot of meaning on their own.
So, why do we want to remove these stop words? Machine learning models look for signals to make decisions. If we leave in common words that appear in nearly every document, we aren't giving our model much useful information. Hence, early in our preprocessing pipeline, we remove these words to let our model focus on the words that may indicate something more distinctive.
To remove these stop words, we first have to identify them. To do this efficiently, we can leverage a resource from the NLTK library: its built-in English stop words list. Before we can use this list, however, it's important to ensure that the `stopwords` package is downloaded. This is achieved using the command `nltk.download('stopwords')`. Once downloaded, we can access the list of commonly agreed-upon stop words in the English language by calling `nltk.corpus.stopwords.words('english')`. This function returns a comprehensive list that we can use to filter out stop words from our text data.
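As a quick, self-contained sketch of just this step (separate from the full pipeline below), you can download the package and peek at the list. The exact contents and length may vary slightly between NLTK versions:

```python
import nltk

# Fetch the stop words corpus (a no-op if it's already downloaded)
nltk.download('stopwords')

from nltk.corpus import stopwords

# Load the commonly agreed-upon English stop words
stop_words = stopwords.words('english')

print(len(stop_words))   # around 179 in recent NLTK versions
print(stop_words[:5])    # e.g., ['i', 'me', 'my', 'myself', 'we']
```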
To better illustrate this, let's walk through the code block below:
```python
import pandas as pd
from datasets import load_dataset
import nltk

# Download the tokenizer models and the stop words list
nltk.download('punkt')
nltk.download('stopwords')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Load the SMS Spam Collection dataset and convert it to a DataFrame
sms_spam = load_dataset('codesignal/sms-spam-collection')
df = pd.DataFrame(sms_spam['train'])

# Break each message into individual tokens (words and punctuation)
df['tokens'] = df['message'].apply(lambda x: word_tokenize(x))

# NLTK's built-in English stop words list
stop_words = stopwords.words('english')

print("Some stop words:", stop_words[:3])

# Keep only the tokens that are not stop words
# (note that this comparison is case-sensitive)
df['filtered_tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])

print(df['filtered_tokens'].head())
```
In this block of code, we start by importing the required libraries and downloading the necessary NLTK packages. Next, we convert our `sms_spam` dataset into a pandas DataFrame, which makes it easier to handle.
We then proceed to tokenize the `message` column of our DataFrame and store our tokens in a new column, `tokens`. Tokenizing involves breaking down our text data into individual components. At this point, we don't worry about any stop words; our primary focus is on breaking down the sentences into words.
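To make this concrete, here's a minimal sketch of what `word_tokenize` does to a single, made-up message:

```python
import nltk
nltk.download('punkt')  # tokenizer models required by word_tokenize

from nltk.tokenize import word_tokenize

# A made-up message, just for illustration
sample = "Free entry in a wkly comp to win FA Cup tickets!"
print(word_tokenize(sample))
# ['Free', 'entry', 'in', 'a', 'wkly', 'comp', 'to', 'win', 'FA', 'Cup', 'tickets', '!']
```

Notice that stop words like `in`, `a`, and `to` are still present at this stage; removing them is the next step.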
We then define our stop words using NLTK's built-in English stop words list and assign this list to a variable, `stop_words`. Following that, we print three stop words as examples.
Now that we have both our tokens and our stop words, we can proceed to remove any stop words from our tokens. We use a lambda function to compare each word in our tokens to our list of stop words. If the word is a stop word, we filter it out.
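To see this filtering step in isolation, here's a small self-contained sketch using an invented token list:

```python
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

stop_words = stopwords.words('english')

# An invented token list, just for illustration
tokens = ['the', 'quick', 'brown', 'fox', 'is', 'in', 'the', 'garden']

# Keep only tokens that are not in the stop words list
filtered = [word for word in tokens if word not in stop_words]
print(filtered)  # ['quick', 'brown', 'fox', 'garden']
```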
We apply this process to our DataFrame, and our final output is a DataFrame with another new column, `filtered_tokens`. This column contains our tokenized messages, sans the stop words.
The output will be:
```
Some stop words: ['i', 'me', 'my']

0    [Go, jurong, point, ,, crazy, .., Available, b...
1    [Ok, lar, ..., Joking, wif, u, oni, ...]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3    [U, dun, say, early, hor, ..., U, c, already, ...
4    [Nah, I, n't, think, goes, usf, ,, lives, arou...
Name: filtered_tokens, dtype: object
```
This output shows the first few entries of the `filtered_tokens` column from our DataFrame, demonstrating the result of removing stop words from the tokenized messages. Each entry corresponds to tokenized, filtered text from our initial dataset, showcasing how common stop words are excluded, leaving more meaningful words.
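One detail worth noticing: capitalized tokens like `I` and `U` survive the filter because NLTK's list is all lowercase and our comparison is case-sensitive. A common refinement, sketched below as an optional extra (not part of the pipeline above), is to lowercase each token before checking membership; converting the list to a set also speeds up the lookups:

```python
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

# Optional refinement: a set for faster lookups, plus lowercasing
# each token so capitalized stop words like 'I' are caught too
stop_set = set(stopwords.words('english'))

tokens = ['Nah', 'I', 'do', "n't", 'think', 'he', 'goes', 'to', 'usf']
filtered = [word for word in tokens if word.lower() not in stop_set]
print(filtered)  # ['Nah', "n't", 'think', 'goes', 'usf']
```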
Phew! We made it. You should now understand what stop words are, why they're important in Natural Language Processing, and why we remove them. You can now remove these stop words from your text data using Python and the NLTK library, which is a pretty neat skill to have.
Remember to keep practicing. Challenge yourself with different text data and try to remove the stop words. There's no better way to learn than through constant hands-on experience. So, go ahead and start analyzing some data!