In the world of text analysis, transforming raw data into a format that is computer-friendly while preserving the essential information for further processing is crucial. One of the simplest yet most versatile methods to do this is the Bag-of-Words representation, or BoW for short.
BoW is essentially a method to extract features from text. Imagine you have a big bag filled with words. These words can come from anywhere: a book, a website, or, in our case, movie reviews from the IMDb movie reviews dataset. For each document or sentence, the BoW representation contains how many times each word appears. Most importantly, in this "bag," we don't care about the order of the words, only how often each one occurs.
Consider this simple example with three sentences:
1. The cat sat on the mat.
2. The cat sat near the mat.
3. The cat played with a ball.
Using a BoW representation, our table would look like this:
| Sentence | the | cat | sat | on | mat | near | played | with | a | ball |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
Each sentence (document) corresponds to a row, and each unique word is a column. The values in the cells represent the word count in the given sentence.
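Before reaching for a library, it's worth seeing how little code this counting actually takes. Here's a minimal sketch (assuming simple lowercasing and period-stripping as the only preprocessing) that builds the rows of the table above by hand with Python's `collections.Counter`:

```python
from collections import Counter

sentences = ['The cat sat on the mat.',
             'The cat sat near the mat.',
             'The cat played with a ball.']

# Lowercase, strip the period, split on whitespace, then count
for i, sentence in enumerate(sentences, start=1):
    words = sentence.lower().replace('.', '').split()
    print(i, dict(Counter(words)))
```

Each printed dictionary corresponds to one row of the table, with the zero counts left implicit.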
We can start practising the Bag-of-Words model by using Scikit-learn's `CountVectorizer` on the exact same three sentences:
```python
from sklearn.feature_extraction.text import CountVectorizer

# Simple example sentences
sentences = ['The cat sat on the mat.',
             'The cat sat near the mat.',
             'The cat played with a ball.']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print('Feature names:')
print(vectorizer.get_feature_names_out())
print('Bag of Words Representation:')
print(X.toarray())
```
The output of the above code will be:
```text
Feature names:
['ball' 'cat' 'mat' 'near' 'on' 'played' 'sat' 'the' 'with']
Bag of Words Representation:
[[0 1 1 0 1 0 1 2 0]
 [0 1 1 1 0 0 1 2 0]
 [1 1 0 0 0 1 0 1 1]]
```
From the output, you'll notice that Scikit-learn's `CountVectorizer` has done essentially the same thing as our manual process: each row corresponds to a sentence and each column to a unique word. There are two small differences, though. The columns are sorted alphabetically, and the word "a" is missing, because `CountVectorizer`'s default tokenizer ignores single-character tokens.
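If you're curious which column each word was assigned to, `CountVectorizer` exposes this mapping through its `vocabulary_` attribute:

```python
# Maps each word in the vocabulary to its column index in the matrix
print(vectorizer.vocabulary_)
# e.g. {'the': 7, 'cat': 1, 'sat': 6, 'on': 4, 'mat': 2, ...}
```

The indices match the alphabetical order of the feature names printed above.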
Now that we know what Bag-of-Words is and what it does, let's apply it to our dataset:
```python
import nltk
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('movie_reviews')
reviews = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]
```
In the code snippet above, we use Python's NLTK module to download the `movie_reviews` corpus, a collection of 2,000 IMDb movie reviews, and load the raw text of each review into a list.
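Before vectorizing, a quick sanity check on what we just loaded never hurts:

```python
print(len(reviews))                 # 2000 reviews in total
print(movie_reviews.categories())   # ['neg', 'pos']
print(reviews[0][:100])             # first 100 characters of the first review
```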
Next, we'll again use Scikit-learn's `CountVectorizer` to apply the BoW method to our reviews:
```python
vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(reviews)

print(f"The shape of our Bag-of-Words is: {bag_of_words.shape}")
```
The output of the above code will be:
```text
The shape of our Bag-of-Words is: (2000, 39659)
```
The output indicates that the result is a matrix with 2,000 rows (one per movie review) and 39,659 columns (one per unique word across all reviews). The entries in this matrix are word counts.
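Note that `fit_transform` returns a SciPy sparse matrix rather than a dense array: a typical review uses only a tiny fraction of the 39,659-word vocabulary, so storing all the zeros explicitly would be wasteful. Here's a small sketch of how you could measure that sparsity yourself:

```python
# nnz is the number of explicitly stored (non-zero) entries
total_cells = bag_of_words.shape[0] * bag_of_words.shape[1]
density = bag_of_words.nnz / total_cells
print(f"Non-zero entries: {bag_of_words.nnz} ({density:.2%} of all cells)")
```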
Let's decode what's inside the `bag_of_words` matrix:
```python
feature_names = vectorizer.get_feature_names_out()
first_review_word_counts = bag_of_words[0].toarray()[0]
```
Here, we retrieve the feature names (the unique words across all reviews) from our `CountVectorizer`. Then we get the word counts for a specific review, in our case the first one; since `bag_of_words[0]` is a sparse row, we call `.toarray()[0]` to turn it into a flat array of counts.
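You can also look up how often one particular word appears in that review. As a hypothetical example, assuming the word 'film' occurs somewhere in the corpus (so it's in the vocabulary), the lookup goes through `vocabulary_`:

```python
# vocabulary_ maps a word to its column index; 'film' is a hypothetical example
film_index = vectorizer.vocabulary_['film']
print(f"'film' appears {first_review_word_counts[film_index]} times in the first review")
```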
Next, let's find out which word occurs most often in the first review:
```python
max_count_index = first_review_word_counts.argmax()
most_used_word = feature_names[max_count_index]

print(f"The most used word is '{most_used_word}' with a count of {first_review_word_counts[max_count_index]}")
```
Running the above code would output something like:
```text
The most used word is 'the' with a count of 38
```
The output shows the most used word in the first review and its count. The script finds the index of the word with the highest count in the first review, then uses this index to look up the corresponding word in `feature_names`. This demonstrates how we can identify the most used word in a specific review using the Bag-of-Words model.
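The same idea extends beyond a single word. `argsort` returns the column indices ordered by count, so a short sketch like the following would list, say, the five most frequent words in that review:

```python
import numpy as np

# Indices of the five largest counts, most frequent first
top_indices = np.argsort(first_review_word_counts)[::-1][:5]
for idx in top_indices:
    print(feature_names[idx], first_review_word_counts[idx])
```

Common function words like 'the' tend to dominate such lists, which is one reason stop-word removal is often applied before building a BoW model.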
Congratulations! You've successfully made it through this lesson. Today, you've learned how to implement a significant concept in the world of text classification: the Bag-of-Words method. You've not only understood the theory behind it, but you've also applied it to a real-world dataset using Python. You even used it to extract insights about word frequency, a crucial aspect of many text classification problems.
As we move forward in the upcoming lessons, we'll take what you've learned today, build on top of it, and continue our journey to understand and apply more advanced text classification techniques. Remember, practice makes perfect, so try to apply what you've learned today on different text data on your own. Happy coding, and see you in the next lesson!