Lesson 4

Text Preprocessing for Deep Learning with TensorFlow

Introduction to Deep Learning for Text Classification

Welcome, data enthusiasts! In this lesson, we will continue our journey into the world of Natural Language Processing (NLP), with an introduction to deep learning for text classification. To harness the power of deep learning, it's important to start with proper data preparation. That's why we will focus today on text preprocessing, shifting from Scikit-learn, which we used previously in this course, to the powerful TensorFlow library.

The goal of this lesson is to leverage TensorFlow for textual data preparation and to understand how it differs from the methods we used earlier. We will implement tokenization, convert tokens into sequences, learn how to pad these sequences to a consistent length, and transform categorical labels into integer labels that can be fed into our deep learning model. Let's dive in!

Understanding TensorFlow and its Role in Text Preprocessing

TensorFlow is an open-source library developed by Google, encompassing a comprehensive ecosystem of tools, libraries, and resources that facilitate machine learning and deep learning tasks, including NLP. As with any machine learning task, preprocessing of your data is a key step in NLP as well.

A significant difference between text preprocessing with TensorFlow and with libraries like Scikit-learn lies in the approach to tokenization and sequence generation. TensorFlow incorporates a highly efficient tokenization process, handling both tokenization and sequence generation within the same library. Let's understand how this process works.
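
To make this contrast concrete, here is a brief illustrative sketch (not part of the lesson's own code) comparing a typical Scikit-learn vectorizer, CountVectorizer, with TensorFlow's Tokenizer: the former produces a bag-of-words count matrix, while the latter produces ordered integer sequences.

Python
# Illustrative comparison: bag-of-words counts vs. ordered integer sequences
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["Love is a powerful entity."]

vectorizer = CountVectorizer()
print(vectorizer.fit_transform(texts).toarray())  # word counts; word order is lost

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
print(tokenizer.texts_to_sequences(texts))        # integers that preserve word order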

Tokenizing Text Data

Tokenization is a foundational step in NLP, where sentences or texts are segmented into individual words or tokens. This process facilitates the comprehension of the language structure and produces meaningful units of text that serve as input for numerous machine learning algorithms.

In TensorFlow, we utilize the Tokenizer class for tokenization. A unique feature of TensorFlow's tokenizer is its robust handling of 'out-of-vocabulary' (OOV) words, or words not present in the tokenizer's word index. By specifying the oov_token parameter, we can assign a special token, <OOV>, to represent these OOV words.

Let's look at a practical example of tokenization:

Python
from tensorflow.keras.preprocessing.text import Tokenizer

sentence = "Love is a powerful entity."
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts([sentence])
word_index = tokenizer.word_index
print(word_index)

Output:

Plain text
{'<OOV>': 1, 'love': 2, 'is': 3, 'a': 4, 'powerful': 5, 'entity': 6}

In this example, tokenizer.fit_on_texts([...]) examines the text it receives and constructs a vocabulary from the unique words found within. Specifically, for the sentence provided, it generates a word index, where each unique word is assigned a distinct integer value. Importantly, this vocabulary is built exclusively from the text data passed to fit_on_texts(...), ensuring that tokenization aligns precisely with the text's lexical composition. For instance, future texts processed by this tokenizer will be tokenized according to this vocabulary, with any unknown words being represented by the <OOV> token.

Through this mechanism, TensorFlow's Tokenizer effectively prepares text data for subsequent machine learning tasks by mapping words to consistent integer values while gracefully handling words not encountered during the initial vocabulary construction.
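
One detail worth noting is the num_words argument used above: the word index keeps every word the tokenizer has seen, but when converting text to sequences only words whose index is smaller than num_words are used; everything else is mapped to the <OOV> token. Here is a minimal illustrative sketch (the tiny vocabulary and num_words=3 are chosen only to make the effect visible).

Python
# Minimal sketch of the num_words behavior (illustrative values, not the lesson's data)
from tensorflow.keras.preprocessing.text import Tokenizer

small_tokenizer = Tokenizer(num_words=3, oov_token="<OOV>")
small_tokenizer.fit_on_texts(["love love is powerful"])

# word_index still records every word seen during fitting
print(small_tokenizer.word_index)  # e.g. {'<OOV>': 1, 'love': 2, 'is': 3, 'powerful': 4}

# but only indices below num_words survive in sequences; the rest become <OOV> (index 1)
print(small_tokenizer.texts_to_sequences(["love is powerful"]))  # e.g. [[2, 1, 1]]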

Converting Text to Sequences

After tokenization, the next step is to represent text as sequences of integers. Sequences are lists of integers where each integer corresponds to a token in the dictionary created during tokenization. This conversion process translates natural language text into structured data that can be input into a machine learning model.

Let's convert two sentences into sequences to demonstrate how words that are and are not in the vocabulary are handled.

Python
sentences = [sentence, "very powerful"]
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

Output:

Plain text
[[2, 3, 4, 5, 6], [1, 5]]

Our original sentence "Love is a powerful entity." has been converted into a sequence [2, 3, 4, 5, 6], and each number directly corresponds to a word in our word index. Looking at the second sequence, [1, 5], it effectively demonstrates how the Tokenizer handles words that are not part of the initial vocabulary (OOV words) using the specified oov_token="<OOV>".

In the sequence [1, 5] for the input "very powerful", the word “very” is not found in the tokenizer's word index, thus it is labeled as token 1, which we designated as the <OOV> token. The word “powerful”, being recognized in the vocabulary, retains its assigned index 5. This illustrates TensorFlow's capability to manage unknown words gracefully, using the OOV token to ensure continuous processing of text data even when faced with unfamiliar tokens.
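
To make this mapping easier to inspect, the word index can be inverted so that sequences are decoded back into words. The following is a small illustrative sketch, assuming the tokenizer and sequences variables from the examples above.

Python
# Sketch: decode sequences back to words (assumes `tokenizer` and `sequences` from above)
reverse_word_index = {index: word for word, index in tokenizer.word_index.items()}

for seq in sequences:
    print([reverse_word_index.get(token, "?") for token in seq])
# The second sequence should decode to ['<OOV>', 'powerful'], confirming
# that "very" was replaced by the OOV token.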

Padding Sequences for Consistent Input Shape

Deep learning models require input data of a consistent shape. In the context of NLP, this means all texts must be represented by the same number of tokens. Padding ensures this by adding zeros to shorter sequences until they match the length of the longest sequence. By default, pad_sequences adds these zeros at the beginning of a sequence; passing padding='post' adds them at the end instead, which is what we use below.

Here's how we pad sequences in TensorFlow:

Python
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequences = pad_sequences(sequences, padding='post')
print(padded_sequences)

The output will be:

Plain text
[[2 3 4 5 6]
 [1 5 0 0 0]]

In this updated example, after adding the "very powerful" sentence to our sequences, we apply padding. The original sentence "Love is a powerful entity." remains [2 3 4 5 6], since it is already the longest sequence. The second sequence becomes [1 5 0 0 0]: the word “very” is still mapped to the <OOV> token 1 and “powerful” keeps its index 5, while zeros are appended at the end (because we used padding='post') so that both sequences share the same length. This is how TensorFlow's padding function accommodates variable-length texts while maintaining the consistent input shape that deep learning models require.
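
In practice, you often want to control the padded length explicitly rather than letting the longest sequence determine it. pad_sequences accepts a maxlen argument for this, along with a truncating option for sequences that are too long. Below is a brief sketch using the sequences from above; the maxlen value of 3 is just an illustrative choice.

Python
# Sketch: enforcing a fixed length with maxlen (assumes `sequences` from above)
from tensorflow.keras.preprocessing.sequence import pad_sequences

capped = pad_sequences(sequences, maxlen=3, padding='post', truncating='post')
print(capped)
# [[2 3 4]
#  [1 5 0]]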

Implementing Text Preprocessing with TensorFlow

Finally, let's implement the entire preprocessing workflow with a limited set of data from the Reuters-21578 text categorization dataset.

Python
# Import necessary libraries
import nltk
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import reuters

# Download the reuters dataset from nltk
nltk.download('reuters', quiet=True)

# Limiting the data for quick execution
categories = reuters.categories()[:3]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Tokenize the text data, using TensorFlow's Tokenizer class
tokenizer = Tokenizer(num_words=500, oov_token="<OOV>")
tokenizer.fit_on_texts(text_data)
sequences = tokenizer.texts_to_sequences(text_data)

# Padding sequences for uniform input shape
X = pad_sequences(sequences, padding='post')

# Translating categories into numerical labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(categories_data)

print("Shape of X: ", X.shape)
print("Shape of Y: ", y.shape)

The output will be:

Plain text
Shape of X: (2477, 2380)
Shape of Y: (2477,)
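
If you want to see which integer stands for which Reuters category, the fitted encoder exposes that mapping. The snippet below is a small sketch assuming the label_encoder and y variables from the workflow above.

Python
# Sketch: inspect the label encoding (assumes `label_encoder` and `y` from above)
print(label_encoder.classes_)                  # category names; position = integer label
print(label_encoder.inverse_transform(y[:5]))  # decode the first few labels back to names
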
Conclusion

Great work! You've successfully ventured into TensorFlow for text preprocessing, an essential step in leveraging the true potential of deep learning for text classification. You've seen how tokenization, sequence creation and padding can be swiftly handled in TensorFlow, a key difference from methods we used in Scikit-learn. These foundations will serve you well as we move forward in our NLP journey. Up next, we're diving deeper into building Neural Network Models for Text Classification!
