In this lesson, we'll sharpen our understanding of tokenization, exploring more advanced aspects such as handling special token types. Using the Reuters dataset and spaCy, our versatile NLP library, we'll go beyond basic tokenization, implementing strategies to handle punctuation, numbers, non-alphabetic characters, and stopwords. This lesson aims to deepen our NLP expertise and make text preprocessing even more effective.
Firstly, let's revisit tokenization. In our previous lesson, we introduced tokenization as the process of splitting up text into smaller pieces, called tokens. These tokens work as the basic building blocks in NLP, enabling us to process and analyze text more efficiently. It's like slicing a cake into pieces to serve, where each slice or token represents a piece of the overall content (the cake).
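As a quick warm-up, here is a minimal sketch of that slicing step with spaCy (assuming the small English model `en_core_web_sm` is installed; the sample sentence is just an invented example):

```python
import spacy

# Load the small English pipeline
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Tokenize a short sentence; each token is one "slice" of the text
doc = nlp("Tokenization slices text into pieces!")
print([token.text for token in doc])
# ['Tokenization', 'slices', 'text', 'into', 'pieces', '!']
```

Notice that the exclamation mark comes out as its own token; that detail is exactly what the rest of this lesson builds on.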
Different types of tokens exist, each serving a unique purpose in NLP. Today, we will explore four types: punctuation tokens, numerical tokens, non-alphabetic tokens, and stopword tokens.
Knowing the types of tokens we are working with is fundamental for successful NLP tasks. Let's take a closer look at each one; a short sketch after the list shows how spaCy flags all four types:
- Punctuation Tokens: These are tokens composed of punctuation marks such as full stops, commas, and exclamation marks. Although often disregarded, punctuation can hold significant meaning and affect how a text is interpreted.
- Numerical Tokens: These represent numbers found in the text. Depending on the context, numerical tokens can provide valuable information or act as noise that you might want to filter out.
- Non-Alphabetic Tokens: Such tokens consist of characters that are not letters, including digits, punctuation, symbols, and whitespace. Note that this category overlaps with the previous two.
- Stopword Tokens: Generally, these are common words like 'is', 'at', 'which', and 'on'. In many NLP tasks, stopwords are filtered out because they often provide little to no meaningful information.
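To make these four categories concrete, here is a small sketch (reusing the `en_core_web_sm` model, with an invented sample sentence) that runs spaCy's corresponding check for each category on every token:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cake cost 20 dollars, and it was delicious!")

# Show which of the four checks fire for each token;
# alpha=False is what marks a non-alphabetic token
for token in doc:
    print(f"{token.text!r:12} punct={token.is_punct} num={token.like_num} "
          f"alpha={token.is_alpha} stop={token.is_stop}")
```

Note that the categories overlap: ',' is both a punctuation token and a non-alphabetic token, and '20' is both numerical and non-alphabetic.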
Special tokens like those above often need to be treated differently depending on the task at hand. For instance, while punctuation might be critical for sentiment analysis (imagine an exclamation mark expressing excitement), you may wish to ignore it when performing tasks like topic identification, as the short example below shows.
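Here is a minimal sketch of that idea (with an invented sentence): a sentiment model would be fed the raw text, exclamation marks and all, while for a topic-style task we can simply drop the punctuation:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Great news!!! Profits are up again!")

# For sentiment analysis we would keep '!!!' (it signals intensity),
# but for topic identification we can safely drop the punctuation
topic_tokens = [token.text for token in doc if not token.is_punct]
print(topic_tokens)  # ['Great', 'news', 'Profits', 'are', 'up', 'again']
```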
spaCy provides us with simple and efficient token attributes for handling these token types. For instance, with `token.is_punct` we can filter all punctuation tokens out of our token list. Similarly, we can use `token.like_num` to identify numerical tokens (it also matches spelled-out numbers such as 'ten'), `not token.is_alpha` to select non-alphabetic tokens, and `token.is_stop` to identify stopword tokens.
Let's now run our example code and see these methods in action.
```python
# Import necessary modules
import spacy
from nltk.corpus import reuters

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Get the raw text of the first document in the Reuters dataset
doc_id = reuters.fileids()[0]
doc_text = reuters.raw(doc_id)

# Pass the raw document to the nlp object
doc = nlp(doc_text)

# Get all punctuation tokens in the document
punctuation_tokens = [token.text for token in doc if token.is_punct]
print('Punctuation Tokens: ', punctuation_tokens)

# Get all numerical tokens in the document
numerical_tokens = [token.text for token in doc if token.like_num]
print('Numerical Tokens: ', numerical_tokens)

# Get all non-alphabetic tokens in the document
non_alpha_tokens = [token.text for token in doc if not token.is_alpha]
print('Non-Alphabetic Tokens: ', non_alpha_tokens)

# Get all stopword tokens in the document
stopword_tokens = [token.text for token in doc if token.is_stop]
print('Stopword Tokens: ', stopword_tokens)

# Extract the unique non-stopword alphabetic tokens from the document
# (token.is_alpha already rules out punctuation, so no extra check is needed)
non_stop_alpha_tokens = list(set(token.text for token in doc
                                 if not token.is_stop and token.is_alpha))

print('\nUnique non-stopword, non-punctuation alphabetic tokens: ', non_stop_alpha_tokens)
```
Running the code produces output along these lines (lists truncated for brevity; the exact order of the final set-derived list will vary between runs):
```
Punctuation Tokens:  ['.', '.', '.', ',', '.', ',', ',', ',', ',', '.', ...]
Numerical Tokens:  ['1986', '15', '5', '5', '15', ...]
Non-Alphabetic Tokens:  ['1986', '15', '.', '.', '5', ',', '5', '15', ',', ...]
Stopword Tokens:  ['for', 'the', 'to', 'of', 'and', ...]

Unique non-stopword, non-punctuation alphabetic tokens:  ['ASIAN', 'EXPORTERS', 'trade', ...]
```
This output demonstrates the classification of tokens into different categories using spaCy: punctuation, numerical, non-alphabetic, and stopword tokens. It also showcases the extraction of the unique non-stopword, non-punctuation alphabetic tokens, illustrating a common preprocessing step in NLP.
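To see where this preprocessing typically leads, here is a minimal follow-up sketch (assuming, as in the example above, that the NLTK Reuters corpus and the `en_core_web_sm` model are available) that counts the surviving content words:

```python
from collections import Counter

import spacy
from nltk.corpus import reuters

nlp = spacy.load("en_core_web_sm")
doc = nlp(reuters.raw(reuters.fileids()[0]))

# Keep only alphabetic, non-stopword tokens, case-folded so that
# 'Trade' and 'trade' are counted together
cleaned = [token.text.lower() for token in doc
           if token.is_alpha and not token.is_stop]

# The most frequent surviving tokens hint at what the document is about
print(Counter(cleaned).most_common(5))
```

Because stopwords and punctuation are already filtered out, the top counts reflect content words rather than glue words like 'the' and 'of'.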
Great work getting through this lesson! Today, we boosted our understanding of tokenization in NLP, exploring different types of tokens and strategies to handle them. We also dove deep into special token types such as punctuation, numerical tokens, non-alphabetic tokens, and stopwords, understanding why and when they matter in NLP applications.
Now, it's time to cement this knowledge with some hands-on practice. Up next are exercises that will require you to implement the techniques we covered today. Don't worry: solving these tasks is crucial for mastering token classification and will build a strong foundation for more complex NLP tasks. Let's dive in!