Hello and welcome to this lesson on Removing Stop Words and Stemming! In this lesson, we will dive deep into two essential steps to prepare text data for machine learning models: removing stop words and stemming. These techniques will help us improve the efficiency and accuracy of our models. Let's get started!
Stop words in Natural Language Processing (NLP) refer to the most common words in a language. Examples include "and", "the", "is", and others that do not provide significant meaning and are often removed to speed up processing without losing crucial information. For this purpose, Python's Natural Language Tool Kit (NLTK) provides a pre-defined list of stop words. Let's have a look:
```python
from nltk.corpus import stopwords

# Defining the stop words
stop_words = set(stopwords.words('english'))

# Print 5 stop words
examples_of_stopwords = list(stop_words)[:5]
print(f"Examples of stop words: {examples_of_stopwords}")
```
The output of the above code will be:
```
Examples of stop words: ['or', 'some', 'couldn', 'hasn', 'after']
```
Here, the `stopwords.words('english')` function returns a list of English stop words. You might sometimes need to add domain-specific stop words to this list based on the nature of your text data.
Stemming is a technique that reduces a word to its root form. Although the stemmed word may not always be a real or grammatically correct word in English, it does help to consolidate different forms of the same word to a common base form, reducing the complexity of text data. This simplification leads to quicker computation and potentially better performance when implementing Natural Language Processing (NLP) algorithms, as there are fewer unique words to consider.
For example, the words "run", "runs", "running" might all be stemmed to the common root "run". This helps our algorithm understand that these words are related and they carry a similar semantic meaning.
Let's illustrate this with Porter Stemmer, a well-known stemming algorithm from the NLTK library:
```python
from nltk.stem import PorterStemmer

# Stemming with NLTK Porter Stemmer
stemmer = PorterStemmer()

stemmed_word = stemmer.stem('running')
print(f"Stemmed word: {stemmed_word}")
```
The output of the above code will be:
```
Stemmed word: run
```
The `PorterStemmer` class provides a `stem` method that takes in a word and returns its root form. In this case, "running" is correctly stemmed to its root "run". Although this form of preprocessing may produce tokens that are not recognizable English words, it is a standard practice in text preprocessing for NLP tasks.
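To see how related forms collapse to a shared stem (and how some stems are not real words), here is a small sketch stemming several surface forms at once:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Related forms collapse to shared roots; note that "studi"
# is not a real English word, which is expected behavior.
words = ["run", "runs", "running", "studies", "studying"]
stems = [stemmer.stem(w) for w in words]
print(stems)
```

The first three forms all map to "run", so a downstream model treats them as one token rather than three.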
Having understood stop words and stemming, let's develop a function that removes stop words and applies stemming to a given text. We will tokenize the text (split it into individual words) and apply these transformations word by word.
```python
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')

def remove_stopwords_and_stem(text):
    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if word not in stop_words]
    return " ".join(filtered_text)

example_text = "This is an example text to demonstrate the removal of stop words and stemming."

print(f"Original Text: {example_text}")
print(f"Processed Text: {remove_stopwords_and_stem(example_text)}")
```
The output of the above code will be:
```
Original Text: This is an example text to demonstrate the removal of stop words and stemming.
Processed Text: thi exampl text demonstr remov stop word stem .
```
The `remove_stopwords_and_stem` function does the required processing and returns the cleaned-up text.
Let's implement the above concepts on a real-world text dataset – the 20 Newsgroups Dataset.
```python
from sklearn.datasets import fetch_20newsgroups

# Fetching 20 newsgroups dataset
newsgroups_data = fetch_20newsgroups(subset='all')

# Limit to first 100 documents for efficient code execution
newsgroups_data = newsgroups_data['data'][:100]

processed_newsgroups_data = [remove_stopwords_and_stem(text) for text in newsgroups_data]

# Print first 100 characters of first document
print("First 100 characters of first processed document:")
print(processed_newsgroups_data[0][:100])
```
The output of the above code will be:
```
First 100 characters of first processed document:
from : mamatha devineni ratnam < mr47+ @ andrew.cmu.edu > subject : pen fan reaction organ : post of
```
This process can take a while for large datasets, but the output will be much cleaner and easier for a machine learning model to work with.
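One concrete way to see the cleanup is to count unique tokens before and after stemming. The toy corpus below is purely illustrative, but the effect is the same on real data: inflected forms collapse into shared stems, shrinking the vocabulary a model has to learn.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Toy corpus: seven distinct surface forms
tokens = ["running", "runs", "run", "runner", "stemming", "stemmed", "stems"]

unique_before = len(set(tokens))
unique_after = len(set(stemmer.stem(t) for t in tokens))

print(unique_before, unique_after)  # 7 3
```

Seven distinct tokens reduce to three stems ("run", "runner", "stem"), and on a corpus of thousands of documents this kind of reduction compounds significantly.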
And that's a wrap! In today's lesson, we've learned about stop words and stemming as crucial steps in text preprocessing for machine learning models. We've used Python's NLTK library to work with stop words and perform stemming. We have processed some example sentences and a real-world dataset to practice these concepts.
As we proceed to more advanced NLP tasks, preprocessing techniques like removing stop words and stemming will serve as a solid foundation. In the upcoming lessons, we will delve deeper into handling missing text data and learn about reshaping textual data for analysis. Let's keep going!