Welcome, data enthusiasts! In this lesson, we will continue our journey into the world of Natural Language Processing (NLP), with an introduction to deep learning for text classification. To harness the power of deep learning, it's important to start with proper data preparation. That's why we will focus today on text preprocessing, shifting from Scikit-learn, which we used previously in this course, to the powerful TensorFlow library.
The goal of this lesson is to leverage TensorFlow for textual data preparation and understand how it differs from the methods we used earlier. We will implement tokenization, convert tokens into sequences, learn how to pad these sequences to a consistent length, and transform categorical labels into integer labels to feed into our deep learning model. Let's dive in!
TensorFlow is an open-source library developed by Google, encompassing a comprehensive ecosystem of tools, libraries, and resources that facilitate machine learning and deep learning tasks, including NLP. As with any machine learning task, data preprocessing is a key step in NLP as well.
A significant difference between text preprocessing with TensorFlow and with libraries like Scikit-learn lies in the approach to tokenization and sequence generation. TensorFlow incorporates a highly efficient tokenization process, handling both tokenization and sequence generation within the same library. Let's understand how this process works.
Tokenization is a foundational step in NLP, where sentences or texts are segmented into individual words or tokens. This process facilitates the comprehension of the language structure and produces meaningful units of text that serve as input for numerous machine learning algorithms.
In TensorFlow, we utilize the Tokenizer class for tokenization. A unique feature of TensorFlow's tokenizer is its robust handling of 'out-of-vocabulary' (OOV) words, or words not present in the tokenizer's word index. By specifying the oov_token parameter, we can assign a special token, <OOV>, to represent these OOV words.
Let's look at a practical example of tokenization:
from tensorflow.keras.preprocessing.text import Tokenizer

sentence = "Love is a powerful entity."
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts([sentence])
word_index = tokenizer.word_index
print(word_index)
Output:
{'<OOV>': 1, 'love': 2, 'is': 3, 'a': 4, 'powerful': 5, 'entity': 6}
In this example, tokenizer.fit_on_texts([...]) examines the text it receives and constructs a vocabulary from the unique words found within. Specifically, for the sentence provided, it generates a word index, where each unique word is assigned a distinct integer value. Notice that words are lowercased and punctuation is stripped by default, which is why "Love" appears as 'love' and the trailing period of "entity." is dropped. Importantly, this vocabulary is built exclusively from the text data passed to fit_on_texts(...), ensuring that tokenization aligns precisely with the text's lexical composition. Future texts processed by this tokenizer will be tokenized according to this vocabulary, with any unknown words being represented by the <OOV> token.
Through this mechanism, TensorFlow's Tokenizer effectively prepares text data for subsequent machine learning tasks by mapping words to consistent integer values while gracefully handling words not encountered during the initial vocabulary construction.
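As a quick aside, the lowercasing and punctuation removal come from the Tokenizer's default arguments (lower=True and a filters string containing most punctuation characters). The following minimal sketch, using a couple of made-up sentences, shows how differently cased and punctuated variants collapse into a single vocabulary entry:

from tensorflow.keras.preprocessing.text import Tokenizer

# With the defaults lower=True and a punctuation-stripping filters string,
# "Love,", "LOVE" and "love!" all become the same token 'love'.
demo_tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
demo_tokenizer.fit_on_texts(["Love, LOVE, love!", "A powerful entity."])

print(demo_tokenizer.word_index)   # 'love' appears only once in the index
print(demo_tokenizer.word_counts)  # but its count reflects all three occurrences

If you ever need to preserve case or punctuation, you can override the lower and filters arguments when constructing the Tokenizer.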
After tokenization, the next step is to represent text as sequences of integers. Sequences are lists of integers where each integer corresponds to a token in the dictionary created during tokenization. This conversion process translates natural language text into structured data that can be input into a machine learning model.
Let's convert a couple of sentences into sequences to demonstrate how words that are and are not in the vocabulary are handled.
sentences = [sentence, "very powerful"]
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
Output:
[[2, 3, 4, 5, 6], [1, 5]]
Our original sentence "Love is a powerful entity." has been converted into the sequence [2, 3, 4, 5, 6], where each number directly corresponds to a word in our word index. The second sequence, [1, 5], demonstrates how the Tokenizer handles words that are not part of the initial vocabulary (OOV words) using the specified oov_token="<OOV>".
In the sequence [1, 5] for the input "very powerful", the word “very” is not found in the tokenizer's word index, thus it is labeled as token 1, which we designated as the <OOV> token. The word “powerful”, being recognized in the vocabulary, retains its assigned index 5. This illustrates TensorFlow's capability to manage unknown words gracefully, using the OOV token to ensure continuous processing of text data even when faced with unfamiliar tokens.
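If you want to sanity-check a sequence, the same Tokenizer can also map integers back to words. Here is a small optional sketch using the tokenizer and sequences from above; sequences_to_texts and the index_word dictionary are the reverse counterparts of texts_to_sequences and word_index:

# Decode the integer sequences back into (lowercased, punctuation-free) text
print(tokenizer.sequences_to_texts(sequences))
# Expected: ['love is a powerful entity', '<OOV> powerful']

# index_word maps integers back to words, the inverse of word_index
print(tokenizer.index_word[1], tokenizer.index_word[5])
# Expected: <OOV> powerful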
Deep learning models require input data of a consistent shape. In the context of NLP, this means all texts must be represented by the same number of tokens. Padding ensures this by adding zeros to shorter sequences so that they match the length of the longest sequence.
Here's how we pad sequences in TensorFlow:
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequences = pad_sequences(sequences, padding='post')
print(padded_sequences)
The output will be:
[[2 3 4 5 6]
 [1 5 0 0 0]]
After adding the "very powerful" sentence to our sequences, we apply padding. Our original sentence "Love is a powerful entity." stays [2 3 4 5 6], since it is already the longest sequence, with each number corresponding to a word in our word index. The second sequence, [1 5 0 0 0], has 0s appended at the end so that both sequences share the same length: “very” is again represented by the <OOV> token 1, and “powerful” retains its index 5. Because we passed padding='post', the zeros are added at the end of each shorter sequence, giving every input the consistent shape that deep learning models require while still accommodating texts of variable length.
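In practice, you will often cap sequences at a fixed length rather than padding everything to the longest text, especially with large datasets. The maxlen and truncating arguments of pad_sequences control this; the snippet below is a brief sketch applied to the sequences from our example:

# Pad or truncate every sequence to exactly 4 tokens:
# shorter sequences get 0s appended, longer ones are cut off at the end.
capped = pad_sequences(sequences, maxlen=4, padding='post', truncating='post')
print(capped)
# Expected:
# [[2 3 4 5]
#  [1 5 0 0]]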
Finally, let's implement the entire preprocessing workflow with a limited set of data from the Reuters-21578 text categorization dataset.
# Import necessary libraries
import nltk
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import reuters

# Download the reuters dataset from nltk
nltk.download('reuters', quiet=True)

# Limiting the data for quick execution
categories = reuters.categories()[:3]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Tokenize the text data, using TensorFlow's Tokenizer class
tokenizer = Tokenizer(num_words=500, oov_token="<OOV>")
tokenizer.fit_on_texts(text_data)
sequences = tokenizer.texts_to_sequences(text_data)

# Padding sequences for uniform input shape
X = pad_sequences(sequences, padding='post')

# Translating categories into numerical labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(categories_data)

print("Shape of X: ", X.shape)
print("Shape of Y: ", y.shape)
The output will be:
Shape of X:  (2477, 2380)
Shape of Y:  (2477,)
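To see what the integer labels in y stand for, you can inspect the fitted LabelEncoder. The sketch below assumes the label_encoder and y variables from the code above; the exact category names depend on which three categories reuters.categories() returns first:

# classes_ lists the original category names;
# their positions are the integer labels used in y.
print(label_encoder.classes_)

# inverse_transform converts integer labels back to category names
print(label_encoder.inverse_transform(y[:5]))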
Great work! You've successfully ventured into TensorFlow for text preprocessing, an essential step in leveraging the true potential of deep learning for text classification. You've seen how tokenization, sequence creation, and padding can be swiftly handled in TensorFlow, a key difference from the methods we used in Scikit-learn. These foundations will serve you well as we move forward in our NLP journey. Up next, we're diving deeper into building Neural Network Models for Text Classification!