In this lesson, we'll sharpen our understanding of tokenization, exploring more advanced aspects such as handling special token types. Using the Reuters dataset and spaCy, our versatile NLP library, we'll go beyond basic tokenization, implementing strategies to handle punctuation, numbers, non-alphabetic characters, and stopwords. This lesson aims to deepen our NLP expertise and make text preprocessing even more effective.
Firstly, let's revisit tokenization. In our previous lesson, we introduced tokenization as the process of splitting up text into smaller pieces, called tokens. These tokens work as the basic building blocks in NLP, enabling us to process and analyze text more efficiently. It's like slicing a cake into pieces to serve, where each slice or token represents a piece of the overall content (the cake).
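As a quick warm-up, here is a minimal sketch of that slicing step with spaCy (assuming the small English model `en_core_web_sm` is installed; the sample sentence is just an invented example):

```python
import spacy

# Load the small English pipeline
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Tokenize a short sentence; each token is one "slice" of the text
doc = nlp("Tokenization slices text into pieces!")
print([token.text for token in doc])
# ['Tokenization', 'slices', 'text', 'into', 'pieces', '!']
```

Notice that the exclamation mark comes out as its own token; that detail is exactly what the rest of this lesson builds on.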
Different types of tokens exist, each serving a unique purpose in NLP. Today, we will explore four types: punctuation tokens, numerical tokens, non-alphabetic tokens, and stopword tokens.
Knowing the types of tokens we are working with is fundamental for successful NLP tasks. Let's take a closer look at each one; a short sketch after the list shows how spaCy flags all four types:
- Punctuation Tokens: These are tokens composed of punctuation marks such as full stops, commas, and exclamation marks. Although often disregarded, punctuation can hold significant meaning and affect how a text is interpreted.
- Numerical Tokens: These represent numbers found in the text. Depending on the context, numerical tokens can provide valuable information or act as noise that you might want to filter out.
- Non-Alphabetic Tokens: Such tokens consist of characters that are not letters, including digits, punctuation, symbols, and whitespace. Note that this category overlaps with the previous two.
- Stopword Tokens: Generally, these are common words like 'is', 'at', 'which', and 'on'. In many NLP tasks, stopwords are filtered out because they often provide little to no meaningful information.
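To make these four categories concrete, here is a small sketch (reusing the `en_core_web_sm` model, with an invented sample sentence) that runs spaCy's corresponding check for each category on every token:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cake cost 20 dollars, and it was delicious!")

# Show which of the four checks fire for each token;
# alpha=False is what marks a non-alphabetic token
for token in doc:
    print(f"{token.text!r:12} punct={token.is_punct} num={token.like_num} "
          f"alpha={token.is_alpha} stop={token.is_stop}")
```

Note that the categories overlap: ',' is both a punctuation token and a non-alphabetic token, and '20' is both numerical and non-alphabetic.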
Special tokens like those above often need to be treated differently depending on the task at hand. For instance, while punctuation might be critical for sentiment analysis (imagine an exclamation mark expressing excitement), you may wish to ignore it when performing tasks like topic identification, as the short example below shows.
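Here is a minimal sketch of that idea (with an invented sentence): a sentiment model would be fed the raw text, exclamation marks and all, while for a topic-style task we can simply drop the punctuation:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Great news!!! Profits are up again!")

# For sentiment analysis we would keep '!!!' (it signals intensity),
# but for topic identification we can safely drop the punctuation
topic_tokens = [token.text for token in doc if not token.is_punct]
print(topic_tokens)  # ['Great', 'news', 'Profits', 'are', 'up', 'again']
```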
spaCy provides us with simple and efficient token attributes for handling these token types. For instance, with `token.is_punct` we can filter all punctuation tokens out of our token list. Similarly, we can use `token.like_num` to identify numerical tokens (it also matches spelled-out numbers such as 'ten'), `not token.is_alpha` to select non-alphabetic tokens, and `token.is_stop` to identify stopword tokens.
Let's now run our example code and see these methods in action.
```python
# Import necessary modules
import spacy
from nltk.corpus import reuters

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Get the raw text of the first document in the Reuters dataset
doc_id = reuters.fileids()[0]
doc_text = reuters.raw(doc_id)

# Pass the raw document to the nlp object
doc = nlp(doc_text)

# Get all punctuation tokens in the document
punctuation_tokens = [token.text for token in doc if token.is_punct]
print('Punctuation Tokens: ', punctuation_tokens)

# Get all numerical tokens in the document
numerical_tokens = [token.text for token in doc if token.like_num]
print('Numerical Tokens: ', numerical_tokens)

# Get all non-alphabetic tokens in the document
non_alpha_tokens = [token.text for token in doc if not token.is_alpha]
print('Non-Alphabetic Tokens: ', non_alpha_tokens)

# Get all stopword tokens in the document
stopword_tokens = [token.text for token in doc if token.is_stop]
print('Stopword Tokens: ', stopword_tokens)

# Extract the unique non-stopword alphabetic tokens from the document
# (token.is_alpha already rules out punctuation, so no extra check is needed)
non_stop_alpha_tokens = list(set(token.text for token in doc
                                 if not token.is_stop and token.is_alpha))

print('\nUnique non-stopword, non-punctuation alphabetic tokens: ', non_stop_alpha_tokens)
```
Running the code produces output along these lines (lists truncated for brevity; the exact order of the final set-derived list will vary between runs):
```
Punctuation Tokens:  ['.', '.', '.', ',', '.', ',', ',', ',', ',', '.', ...]
Numerical Tokens:  ['1986', '15', '5', '5', '15', ...]
Non-Alphabetic Tokens:  ['1986', '15', '.', '.', '5', ',', '5', '15', ',', ...]
Stopword Tokens:  ['for', 'the', 'to', 'of', 'and', ...]

Unique non-stopword, non-punctuation alphabetic tokens:  ['ASIAN', 'EXPORTERS', 'trade', ...]
```
This output demonstrates the classification of tokens into different categories using spaCy: punctuation, numerical, non-alphabetic, and stopword tokens. It also showcases the extraction of the unique non-stopword, non-punctuation alphabetic tokens, illustrating a common preprocessing step in NLP.
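To see where this preprocessing typically leads, here is a minimal follow-up sketch (assuming, as in the example above, that the NLTK Reuters corpus and the `en_core_web_sm` model are available) that counts the surviving content words:

```python
from collections import Counter

import spacy
from nltk.corpus import reuters

nlp = spacy.load("en_core_web_sm")
doc = nlp(reuters.raw(reuters.fileids()[0]))

# Keep only alphabetic, non-stopword tokens, case-folded so that
# 'Trade' and 'trade' are counted together
cleaned = [token.text.lower() for token in doc
           if token.is_alpha and not token.is_stop]

# The most frequent surviving tokens hint at what the document is about
print(Counter(cleaned).most_common(5))
```

Because stopwords and punctuation are already filtered out, the top counts reflect content words rather than glue words like 'the' and 'of'.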
Great work getting through this lesson! Today, we boosted our understanding of tokenization in NLP, exploring different types of tokens and strategies to handle them. We also dove deep into special token types such as punctuation, numerical tokens, non-alphabetic tokens, and stopwords, understanding why and when they matter in NLP applications.
Now, it's time to cement this knowledge with some hands-on practice. Up next are exercises that will require you to implement the techniques we covered today. Don't worry: solving these tasks is crucial for mastering token classification and will build a strong foundation for more complex NLP tasks. Let's dive in!