Hello, and welcome to today's lesson on n-grams! If you've ever wondered how language models or text classifiers capture context or word order in text, the answer is often today's hero: n-grams. In this lesson, we'll explore the magic of n-grams and why they are so useful for processing textual data. Specifically, we'll learn how to create n-grams from text data using Python, covering unigrams and bigrams.
In Natural Language Processing, when we analyze text, it's often beneficial to consider not only individual words but also sequences of words. Looking at sequences helps us grasp the context better, and this is where n-grams come in handy.
An n-gram is a contiguous sequence of n items from a given sample of text or speech. The 'n' stands for the number of words in the sequence. For instance, in "I love dogs," a 1-gram (or unigram) is just one word, like "love." A 2-gram (or bigram) would be a sequence of 2 words, like "I love" or "love dogs".
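Before we reach for any library, it can help to see the idea in plain Python. The short sketch below is purely for demonstration (the make_ngrams helper isn't something we'll use later): it slides a window of size n over the tokens of our example sentence.

```python
# A minimal, hand-rolled illustration: slide a window of size n over the tokens.
# The make_ngrams helper is purely for demonstration and isn't used later.
def make_ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love dogs".split()
print(make_ngrams(tokens, 1))  # ['I', 'love', 'dogs'] -> unigrams
print(make_ngrams(tokens, 2))  # ['I love', 'love dogs'] -> bigrams
```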
N-grams help preserve the sequential information, or context, in text data and contribute significantly to many language models and text classifiers.
Before we can create n-grams, we need clean, structured text data. The text needs to be cleaned and preprocessed into a desirable format, after which it can be used for feature extraction or modeling.
Here's some already familiar code that cleans our text by removing stop words and stemming the remaining words. The steps include lower-casing the text, removing punctuation, removing uninformative words (stopwords), and reducing the remaining words to their base, or stemmed, form.
```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Setup from the earlier preprocessing lessons (a Porter stemmer is assumed here)
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Function to clean text and perform stemming
def clean_text(text):
    text = text.lower()                      # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)   # Remove email addresses
    text = re.sub(r'http\S+', '', text)      # Remove URLs
    text = re.sub(r'\W', ' ', text)          # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)          # Remove digits
    text = re.sub(r'\s\s+', ' ', text)       # Remove extra spaces

    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if word not in stop_words]

    return " ".join(filtered_text)
```
Python's sklearn library provides an accessible way to generate n-grams. The CountVectorizer class in the sklearn.feature_extraction.text module converts a collection of texts into a matrix of token counts and allows us to specify the type of n-grams we want.
Let's set up our vectorizer as a preliminary step towards creating n-grams:
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))  # Generate unigrams and bigrams
```
The ngram_range=(1, 2) parameter instructs our vectorizer to generate n-grams where n ranges from 1 to 2. So, the CountVectorizer will generate both unigrams and bigrams. If we wanted unigrams, bigrams, and trigrams, we could use ngram_range=(1, 3).
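To see the effect of ngram_range on something concrete before touching the real dataset, here is a small sketch on a made-up two-sentence corpus (the sentences and variable names are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up two-sentence corpus, purely for illustration
toy_docs = ["we love dogs", "dogs love parks"]

toy_vectorizer = CountVectorizer(ngram_range=(1, 2))
toy_counts = toy_vectorizer.fit_transform(toy_docs)

print(toy_vectorizer.get_feature_names_out())
# ['dogs' 'dogs love' 'love' 'love dogs' 'love parks' 'parks' 'we' 'we love']
print(toy_counts.toarray())  # one row of counts per document
```

Notice how the vocabulary mixes single words and two-word phrases, exactly as the ngram_range asks.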
Now that we've set up our n-gram-generating machine, let's use it on some real-world data.
```python
from sklearn.datasets import fetch_20newsgroups

# Fetch the 20 newsgroups dataset and restrict to the first 100 records for performance
newsgroups_data = fetch_20newsgroups(subset='all')['data'][:100]

# Clean and preprocess the newsgroup data
cleaned_data = [clean_text(data) for data in newsgroups_data]
```
Applying the vectorizer to our cleaned text data will create the n-grams:
```python
# Apply the CountVectorizer on the cleaned data to create n-grams
X = vectorizer.fit_transform(cleaned_data)

# Retrieve the n-gram feature names learned by the vectorizer
features = vectorizer.get_feature_names_out()

# Display the shape of X
print("Shape of X with n-grams: ", X.shape)

# Print the total number of features
print("Total number of features: ", len(features))

# Print features from index 100 to 110
print("Features from index 100 to 110: ", features[100:111])
```
The output of the above code will be:
```
Shape of X with n-grams: (100, 16246)
Total number of features: 16246
Features from index 100 to 110: ['accid figur' 'accid worri' 'accomod' 'accomod like' 'accord'
 'accord document' 'accord lynn' 'accord mujanov' 'accord previou'
 'account' 'account curiou']
```
The shape of X is (100, 16246), indicating we have a high-dimensional feature space. The first number, 100, represents the number of documents or records in the dataset (here it's 100, as we limited our fetch to the first 100 records), whereas 16246 represents the unique n-grams, or features, created from all 100 documents.
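If you'd like to peek at which n-grams actually occur in a single document, a small snippet like the following works; it is purely illustrative, and the exact n-grams and counts you see will depend on the dataset slice and preprocessing.

```python
import numpy as np

# Purely illustrative: peek at a few n-grams that occur in the first cleaned document.
row = X[0].toarray().ravel()          # dense count vector for document 0
nonzero_idx = np.nonzero(row)[0]      # indices of n-grams present in this document
print([(features[i], int(row[i])) for i in nonzero_idx[:5]])
```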
By printing features[100:111], we get a glimpse of our features, where each string represents an n-gram from our cleaned text data. The returned n-grams ['accid figur', 'accid worri', 'accomod', ...] include both unigrams (single words like accomod and account) and bigrams (two-word phrases like accid figur and accid worri).
As you can see, generating n-grams adds a new level of complexity to our analysis, as we now have multiple types of features or tokens: unigrams and bigrams. You can experiment with the ngram_range parameter in CountVectorizer to include trigrams or higher-level n-grams, depending on your specific context and requirements. Remember, each choice will have implications for the complexity and interpretability of your models, and it's always a balance between the two.
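As a quick, purely illustrative check, you can split the feature list into unigrams and bigrams by counting spaces, since CountVectorizer joins the words of an n-gram with single spaces:

```python
# Purely illustrative breakdown of the feature space by n-gram size.
# CountVectorizer joins the words of an n-gram with single spaces,
# so counting spaces tells us how many words each feature contains.
unigrams = [f for f in features if " " not in f]
bigrams = [f for f in features if f.count(" ") == 1]

print("Number of unigrams:", len(unigrams))
print("Number of bigrams: ", len(bigrams))
```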
Congratulations, you've finished today's lesson on n-grams! We've explored what n-grams are and their importance in text classification. We then moved on to preparing data for creating n-grams before we dived into generating them using Python's CountVectorizer class in the sklearn library.
Now, it's time to get hands-on. Try generating trigrams or 4-grams from the same cleaned newsgroups data and notice the differences. Practicing these skills will not only reinforce the concepts learned in this lesson but also enable you to understand when and how much context is needed for certain tasks.
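If you'd like a starting point for that exercise, here is a minimal sketch; it assumes cleaned_data and CountVectorizer are still in scope from earlier in the lesson, and the variable names are just suggestions.

```python
# A minimal starting sketch (variable names are just suggestions),
# assuming cleaned_data and CountVectorizer are still in scope from earlier.
trigram_vectorizer = CountVectorizer(ngram_range=(1, 3))  # unigrams, bigrams, and trigrams
X_tri = trigram_vectorizer.fit_transform(cleaned_data)

print("Shape with up to trigrams:", X_tri.shape)  # expect many more features than before
```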
As always, happy learning!