Hello and welcome! Today's lesson introduces a crucial component of text feature engineering: tokenization. Tokenization is a pre-processing step used in text classification that transforms raw text into units of meaning known as tokens. By breaking text down into these smaller pieces, we give machine learning models something they can actually work with to understand the text. Our goal in this lesson is to apply tokenization to a raw text dataset (the IMDB movie review dataset) and understand how it benefits the text classification process.
Text tokenization is a pre-processing step in which a text string is split into individual units (tokens). In most cases, these tokens are words, digits, or punctuation marks. For instance, consider this text: "I love Python." After tokenization, the sentence is split into ['I', 'love', 'Python', '.'], with each word and punctuation mark becoming a separate token.
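To make that concrete, here is a minimal sketch (assuming NLTK is installed) contrasting naive whitespace splitting with NLTK's word_tokenize; note how only the latter separates the punctuation into its own token:

```python
import nltk
from nltk import word_tokenize

nltk.download('punkt')  # tokenizer models used by word_tokenize (newer NLTK versions may need 'punkt_tab')

text = "I love Python."

# Naive whitespace splitting keeps the period attached to the last word
print(text.split())         # ['I', 'love', 'Python.']

# word_tokenize treats the punctuation mark as a separate token
print(word_tokenize(text))  # ['I', 'love', 'Python', '.']
```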
Text tokenization plays a foundational role in text classification and many Natural Language Processing (NLP) tasks. Most machine learning algorithms require numerical input, which means we can't feed raw text directly into them. This is where tokenization steps in: it breaks the text into individual tokens, which can then be transformed into a numerical form (via techniques like Bag-of-Words, TF-IDF, etc.) that the algorithms can process.
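As a rough illustration of that tokens-to-numbers step (a minimal sketch using only the standard library, not the Bag-of-Words implementation covered later in the course), we can map each token to how often it appears:

```python
from collections import Counter

# Tokens produced by a tokenizer (pre-tokenized here for simplicity)
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# A crude numerical view of the text: each token mapped to its frequency,
# which is the intuition behind the Bag-of-Words representation
token_counts = Counter(tokens)
print(token_counts)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```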
Before we tackle our dataset, let's understand how tokenization works with a simple example. Python and NLTK (the Natural Language Toolkit), a comprehensive library built specifically for NLP tasks, make tokenization simple and efficient. For our example, suppose we have a sentence: "The cat is on the mat." Let's tokenize it:
```python
import nltk
from nltk import word_tokenize

nltk.download('punkt')  # tokenizer models required by word_tokenize

text = "The cat is on the mat."
tokens = word_tokenize(text)
print(tokens)
```
The output of the above code will be:
```text
['The', 'cat', 'is', 'on', 'the', 'mat', '.']
```
For the purpose of this lesson, we'll use the movie reviews dataset provided in the NLTK corpus, a collection of IMDB movie reviews along with their associated binary sentiment polarity labels. The corpus contains 2,000 reviews, evenly split into 1,000 positive and 1,000 negative reviews. However, for these lessons, we will focus on using the first 100 reviews.
It's important to note that the IMDB dataset provided in the NLTK corpus has been preprocessed. The text is already lowercased, and common punctuation is typically separated from the words. This pre-cleaning makes the dataset well-suited for the tokenization process we'll be exploring.
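If you'd like to confirm the corpus structure for yourself, the quick check below (not required for the rest of the lesson) lists the sentiment categories and the number of reviews in each:

```python
import nltk
from nltk.corpus import movie_reviews

nltk.download('movie_reviews')

# The corpus groups reviews into 'neg' and 'pos' sentiment categories
print(movie_reviews.categories())         # ['neg', 'pos']
print(len(movie_reviews.fileids('pos')))  # count of positive reviews
print(len(movie_reviews.fileids('neg')))  # count of negative reviews
```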
Let's get these reviews and print a few of them:
```python
import nltk
from nltk.corpus import movie_reviews

nltk.download('movie_reviews')

movie_reviews_ids = movie_reviews.fileids()[:100]
review_texts = [movie_reviews.raw(fileid) for fileid in movie_reviews_ids]
print("First movie review:\n", review_texts[0][:260])
```
Note that we're only printing the first 260 characters of the first review to prevent lengthy output.
The output of the above code will be:
```text
First movie review:
 plot : two teen couples go to a church party , drink and then drive .
they get into an accident .
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares .
what's the deal ?
watch the movie and " sorta " find out . .
```
Now it's time to transform our data. For this, we will apply tokenization to all 100 of our movie reviews.
```python
from nltk import word_tokenize

# Tokenize every review; this relies on the 'punkt' models downloaded earlier
tokenized_reviews = [word_tokenize(review) for review in review_texts]
```
So, what changes did tokenization bring to our data? Each review, which was initially a single long string of text, is now a list of individual tokens (words, punctuation marks, etc.) that collectively represent the review. In other words, our dataset has evolved from a list of strings into a list of lists.
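You can verify this structural change with a quick check (a small sketch that assumes the review_texts and tokenized_reviews variables from the snippets above):

```python
# Before tokenization each review was a single string;
# after tokenization each review is a list of tokens
print(type(review_texts[0]))       # <class 'str'>
print(type(tokenized_reviews[0]))  # <class 'list'>
print(len(tokenized_reviews))      # 100 reviews in total
print(len(tokenized_reviews[0]))   # number of tokens in the first review
```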
```python
for i, review in enumerate(tokenized_reviews[:3]):
    print(f"\n Review {i+1} first 10 tokens:\n", review[:10])
```
The output of the above code will be:
```text
 Review 1 first 10 tokens:
 ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party']

 Review 2 first 10 tokens:
 ['the', 'happy', 'bastard', "'s", 'quick', 'movie', 'review', 'damn', 'that', 'y2k']

 Review 3 first 10 tokens:
 ['it', 'is', 'movies', 'like', 'these', 'that', 'make', 'a', 'jaded', 'movie']
```
Well done! Today, you learned about the fundamental concept of text tokenization and its importance in text classification. You also applied tokenization to the IMDB movie reviews dataset using Python and NLTK. Your text data is now transformed into a form that machine learning models can digest much more easily.
As you advance in the course, you will refine this dataset further for your text classification objectives. We are laying the foundation one brick at a time, and tokenization was a sturdy one! Upcoming lessons will build upon this understanding. You'll harness this tokenized data to generate Bag-of-Words representations, implement TF-IDF representations, handle sparse features, and apply dimensionality reduction.
Remember, practice consolidates learning. Make sure to reinforce your knowledge by working through the code samples and applying these concepts in context. Don't forget to use your creativity to modify the code and observe the outcomes. Happy learning!