Welcome to our new lesson on Tokenization. Tokenization is a common text-preprocessing step in Natural Language Processing (NLP). It transforms raw text into a more usable format by breaking it down into individual words, or tokens. Our lesson uses Python, NLTK (the Natural Language Toolkit), and the pandas library for data handling. We will apply tokenization to the SMS Spam Collection dataset that you are already familiar with. Let's get started!
Tokenization is the process of splitting a sequence of text into separate pieces called tokens, usually words. When reading text, our brain automatically picks out individual words, using spaces, punctuation, and other separators, and understands the context. For computers, the process isn't that straightforward: they need to be taught how language is structured, and that's where tokenization comes into play.
Tokenization plays a key role in various NLP tasks including text classification, language modeling, and sentiment analysis. For instance, if we train a machine learning model to classify spam messages, tokenization helps split a message into individual words. Each word becomes a feature for our model to learn from.
One of the challenges of tokenization is handling contractions. For example, a naive tokenizer that simply splits on punctuation may break the word "don't" into "don", "'" and "t", which loses the word's meaning. To mitigate this, we may need additional steps, or a smarter tokenizer such as NLTK's word_tokenize, to handle contractions appropriately.
NLTK, or the Natural Language Toolkit, is a Python library that provides tools for handling human language data. It supplies easy-to-use interfaces to over 50 corpora and lexical resources, along with modules such as the nltk.tokenize package, which offers several tokenizer functions including word_tokenize, sent_tokenize, and more. Since our focus is on word_tokenize, it's important to understand why we often need to execute nltk.download('punkt') before we start tokenizing.
Before using word_tokenize for the first time, you might need to download the punkt package using nltk.download('punkt'). This package contains a pre-trained model that helps NLTK split ordinary text into tokens effectively: it is tuned for detecting sentence boundaries, taking into account language peculiarities such as abbreviations, and word_tokenize relies on it as a first step before breaking each sentence into words.
```python
import nltk
from nltk.tokenize import word_tokenize

# Ensure necessary packages are downloaded for tokenization
nltk.download('punkt')
```
Downloading punkt is necessary because word_tokenize relies on this model, built with an unsupervised machine learning algorithm, to distinguish between the different parts of a sentence, such as words and punctuation. Without it, word_tokenize cannot run, and NLTK raises a LookupError asking you to download the resource.
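Since punkt is, at its core, a sentence-boundary model, a brief illustrative sketch (with a made-up example message) shows the related sent_tokenize function, which depends on the same download:

```python
from nltk.tokenize import sent_tokenize

sample = "Free entry in a weekly competition! Text WIN to 80086 now."

# punkt detects the sentence boundaries before any word-level splitting
print(sent_tokenize(sample))
# ['Free entry in a weekly competition!', 'Text WIN to 80086 now.']
```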
As you already know, the SMS Spam Collection dataset can be loaded directly into a pandas DataFrame for convenient handling.
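As a reminder, a minimal loading sketch might look like this; the file name SMSSpamCollection and the headerless, tab-separated two-column layout (label, message) are assumptions about how the dataset is stored on your machine:

```python
import pandas as pd

# Assumed path and layout: the raw dataset is a headerless, tab-separated
# file with a label column ("ham"/"spam") and a message column
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None,
                 names=['label', 'message'])

print(df.head())
```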
As a first step, let's convert all messages to lowercase to ensure uniformity; otherwise, "hello" and "Hello" would be treated as different tokens.
```python
# Convert all messages to lowercase for uniformity
df['processed_message'] = df['message'].apply(lambda x: x.lower())
```
Then we will implement tokenization using the nltk.tokenize.word_tokenize() function.
```python
# Tokenize the messages into individual words
df['tokens'] = df['processed_message'].apply(lambda x: word_tokenize(x))
print(df['tokens'].head())
```
The output of the above code will be:
```text
0    [go, until, jurong, point, ,, crazy, .., avail...
1    [ok, lar, ..., joking, wif, u, oni, ...]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, ..., u, c, alrea...
4    [nah, i, do, n't, think, he, goes, to, usf, ,,...
Name: tokens, dtype: object
```
This output clearly demonstrates tokenization in action: each message is split into a list of components, or "tokens". Notice that punctuation marks such as commas and ellipses also appear as tokens; removing them is typically handled in later preprocessing steps. This step is critical for preparing text data for further analysis in NLP tasks.
Today, we learned about the concept of tokenization and its importance in the context of Natural Language Processing. Using the power of the nltk library in Python, we explored how tokens, the individual pieces of text, can be extracted from raw text data for further processing. Now, it's your turn to practice and refine your tokenization skills with a series of exercises. Remember, the more you practice, the better you become at working with Natural Language Processing tasks! Happy learning!