Removing Stop Words and Stemming in Text Preprocessing

Lesson 3

Introduction

Hello and welcome to this lesson on Removing Stop Words and Stemming! In this lesson, we will dive deep into two essential steps to prepare text data for machine learning models: removing stop words and stemming. These techniques will help us improve the efficiency and accuracy of our models. Let's get started!

Understanding Stop Words

Stop words in Natural Language Processing (NLP) are the most common words in a language. Examples include "and", "the", and "is": words that carry little meaning on their own and are often removed to speed up processing without losing crucial information. Python's Natural Language Toolkit (NLTK) provides a predefined list of stop words. Let's have a look:

Python
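import nltk
from nltk.corpus import stopwords

# Download the stop word lists (only needed once per environment)
nltk.download('stopwords', quiet=True)

# Defining the stop words (a set gives fast membership checks)
stop_words = set(stopwords.words('english'))

# Print 5 stop words
examples_of_stopwords = list(stop_words)[:5]
print(f"Examples of stop words: {examples_of_stopwords}")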

The output of the above code will look similar to the following. The exact words vary between runs because a set has no fixed order:

Plain text
Examples of stop words: ['or', 'some', 'couldn', 'hasn', 'after']

Here, stopwords.words('english') returns a list of English stop words, which we convert to a set. Depending on the nature of your text data, you may sometimes need to add domain-specific stop words to this collection.
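As a minimal sketch of adding domain-specific stop words, the snippet below merges a base set with a few extra terms. To keep it runnable without downloading NLTK data, base_stop_words is a small hand-written stand-in for set(stopwords.words('english')), and the domain terms ("subject", "re", "fwd") are hypothetical choices for email-like text.

```python
# Stand-in for set(stopwords.words('english')); an assumption made so the
# sketch runs without NLTK corpus downloads.
base_stop_words = {"the", "is", "and", "a", "of", "to"}

# Hypothetical domain-specific stop words for email-like text.
domain_stop_words = {"subject", "re", "fwd"}

# Set union returns a combined set and leaves the base set untouched.
custom_stop_words = base_stop_words | domain_stop_words

tokens = ["subject", "the", "game", "is", "tonight"]
filtered_tokens = [token for token in tokens if token not in custom_stop_words]
print(filtered_tokens)  # ['game', 'tonight']
```

In real code, you would replace base_stop_words with the NLTK set and choose the extra terms by inspecting which frequent words in your own corpus carry no useful signal.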

Introduction to Stemming

Stemming is a technique that reduces a word to its root form. Although the stemmed word may not always be a real or grammatically correct word in English, it does help to consolidate different forms of the same word to a common base form, reducing the complexity of text data. This simplification leads to quicker computation and potentially better performance when implementing Natural Language Processing (NLP) algorithms, as there are fewer unique words to consider.

For example, the words "run", "runs", "running" might all be stemmed to the common root "run". This helps our algorithm understand that these words are related and they carry a similar semantic meaning.

Let's illustrate this with Porter Stemmer, a well-known stemming algorithm from the NLTK library:

Python
from nltk.stem import PorterStemmer

# Stemming with NLTK Porter Stemmer
stemmer = PorterStemmer()

stemmed_word = stemmer.stem('running')
print(f"Stemmed word: {stemmed_word}")

The output of the above code will be:

Plain text
Stemmed word: run

The PorterStemmer class comes with the stem method that takes in a word and returns its root form. In this case, "running" is correctly stemmed to its root word "run". This form of preprocessing, although it may lead to words that are not recognizable, is a standard practice in text preprocessing for NLP tasks.
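To see the consolidation described earlier, the following sketch stems several forms of "run" along with two words whose stems are not dictionary words. It uses only PorterStemmer, which needs no extra data downloads; the word list is an illustrative choice.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words: three forms of "run", plus two whose stems are not real words
words = ["run", "runs", "running", "example", "removal"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The three forms of "run" share a single stem, while "example" and "removal" become "exampl" and "remov".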

Stop Words Removal and Stemming in Action

Having understood stop words and stemming, let's develop a function that removes stop words and applies stemming to a given text. We will tokenize the text (split it into individual words) and apply these transformations word by word.

Python
from nltk.tokenize import word_tokenize

# Reuses stop_words and stemmer from the earlier snippets.
# word_tokenize needs NLTK's Punkt tokenizer data, e.g. nltk.download('punkt')
# (recent NLTK releases use 'punkt_tab' instead).
def remove_stopwords_and_stem(text):
    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if word not in stop_words]
    return " ".join(filtered_text)

example_text = "This is an example text to demonstrate the removal of stop words and stemming."

print(f"Original Text: {example_text}")
print(f"Processed Text: {remove_stopwords_and_stem(example_text)}")

The output of the above code will be:

Plain text
Original Text: This is an example text to demonstrate the removal of stop words and stemming.
Processed Text: thi exampl text demonstr remov stop word stem .

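The remove_stopwords_and_stem function tokenizes the text, drops stop words, and stems what remains. Two details in the output are worth noticing. First, "This" was not removed: the stop word list is lowercase and the comparison is case-sensitive, so the capitalized token passed the filter and was then stemmed to "thi". Second, the period survives as its own token because word_tokenize separates punctuation from words.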
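If you would rather have capitalized stop words such as "This" removed as well, one option is to lowercase each token before the lookup and to skip tokens that are not purely alphabetic. The sketch below is a variant under those assumptions, not the function used in this lesson: the name remove_stopwords_and_stem_ci is hypothetical, it takes pre-split tokens, and the small stop word set stands in for NLTK's full list so that no data downloads are needed.

```python
from nltk.stem import PorterStemmer

def remove_stopwords_and_stem_ci(tokens, stop_words, stemmer):
    """Drop stop words case-insensitively, skip non-alphabetic tokens, stem the rest."""
    return [
        stemmer.stem(token)
        for token in tokens
        if token.isalpha() and token.lower() not in stop_words
    ]

# Small stand-in for NLTK's English stop word list (assumption for this sketch)
sample_stop_words = {"this", "is", "an", "to", "the", "of", "and"}

# Tokens as word_tokenize would typically split the example sentence
sample_tokens = ["This", "is", "an", "example", "text", "to", "demonstrate",
                 "the", "removal", "of", "stop", "words", "and", "stemming", "."]

result = remove_stopwords_and_stem_ci(sample_tokens, sample_stop_words, PorterStemmer())
print(" ".join(result))  # exampl text demonstr remov stop word stem
```

Keep in mind that dropping non-alphabetic tokens also discards numbers and tokens such as email addresses, which may or may not be what you want for a given task.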

Stop Words Removal and Stemming on a Dataset

Let's apply these concepts to a real-world text dataset: the 20 Newsgroups dataset.

Python
from sklearn.datasets import fetch_20newsgroups

# Fetching 20 newsgroups dataset (downloaded on first use)
newsgroups = fetch_20newsgroups(subset='all')

# Limit to first 100 data points for efficient code execution
newsgroups_data = newsgroups['data'][:100]

# Apply the function defined above to each document
processed_newsgroups_data = [remove_stopwords_and_stem(text) for text in newsgroups_data]

# Print first 100 characters of first document
print("First 100 characters of first processed document:")
print(processed_newsgroups_data[0][:100])

The output of the above code will be:

Plain text
First 100 characters of first processed document:
from : mamatha devineni ratnam < mr47+ @ andrew.cmu.edu > subject : pen fan reaction organ : post of

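This process can take a while for large datasets. The result gives a machine learning model a smaller vocabulary to work with, although, as the sample shows, email headers and punctuation tokens are still present and would need separate cleaning steps.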
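One rough way to check that stemming shrinks the vocabulary is to count unique tokens before and after. The sketch below does this on three made-up sentences, splitting on whitespace as a simple stand-in for word_tokenize; the same comparison could be run on the newsgroups documents.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Made-up sentences for illustration only
docs = ["run runs running", "stop stops stopped", "word words"]

# Whitespace split is a simple stand-in for word_tokenize
raw_vocab = {token for doc in docs for token in doc.split()}
stemmed_vocab = {stemmer.stem(token) for token in raw_vocab}

print(f"Unique tokens before stemming: {len(raw_vocab)}")
print(f"Unique tokens after stemming: {len(stemmed_vocab)}")
```

A smaller count after stemming means several surface forms were merged into one stem, which is the reduction in unique words that stemming is meant to provide.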

Summary and Conclusion

And that's a wrap! In today's lesson, we've learned about stop words and stemming as crucial steps in text preprocessing for machine learning models. We've used Python's NLTK library to work with stop words and perform stemming. We have processed some example sentences and a real-world dataset to practice these concepts.

As we proceed to more advanced NLP tasks, preprocessing techniques like removing stop words and stemming will serve as a solid foundation. In the upcoming lessons, we will delve deeper into handling missing text data and learn about reshaping textual data for analysis. Let's keep going!
