Hello and welcome to today's lesson! We are now embarking on an exciting journey into the field of Neural Networks, significant players in the Natural Language Processing (NLP) arena. Neural Networks implicitly capture the structure of the data, which is especially valuable for text, given its sequential nature. Remember how our ensemble models did a good job on the Reuters-21578 Text Categorization Collection? Now, imagine how much higher we can push performance by using these powerful models.
Before discussing Neural Networks in detail, let's recall the code we have already executed:
```python
# Importing libraries
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np
import nltk
from nltk.corpus import reuters

nltk.download('reuters', quiet=True)

# Loading and preparing the Reuters-21578 Text Categorization Collection dataset
categories = reuters.categories()[:3]
documents = reuters.fileids(categories)
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Tokenizing and padding sequences
tokenizer = Tokenizer(num_words=500, oov_token="<OOV>")
tokenizer.fit_on_texts(text_data)
sequences = tokenizer.texts_to_sequences(text_data)
X = pad_sequences(sequences, padding='post')

# Label Encoding
y = LabelEncoder().fit_transform(categories_data)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
```
So far, we have preprocessed our text data and transformed it into a format suitable for input into models. We have our train and test datasets ready, which means we are all set to dive into creating our Neural Network model for text classification.
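Before building the model, it can help to sanity-check the arrays we just created. This is an optional check, not part of the lesson's required code; the exact shapes printed depend on your local copy of the Reuters corpus.

```python
# Optional sanity check of the preprocessed data (shapes depend on your corpus download)
print(X_train.shape, X_test.shape)   # (num_train_docs, max_seq_len), (num_test_docs, max_seq_len)
print(np.unique(y))                  # the three integer-encoded categories, e.g. [0 1 2]
```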
When dealing with text data, our neural network usually starts with an Embedding layer. This layer converts the tokenized text into dense vector representations that the neural network can work with. The embedding matrix learned by this layer captures information about words and their contextual meanings.
Here's our simple, initial neural network model with the embedding layer:
```python
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=500, output_dim=16),
])
```
Notice the parameters we passed to the embedding layer: `input_dim` and `output_dim`. The `input_dim` is set to 500, matching the vocabulary size we set with `num_words` in our tokenizer. The `output_dim` sets how many dimensions we want in the dense vector representing each word; we set it to 16.
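To make these parameters concrete, here is a small optional sketch that passes a couple of padded sequences through the embedding-only model defined above; the exact sequence length in the printed shape depends on your preprocessing.

```python
# Optional sketch: each token ID in a padded sequence becomes a 16-dimensional vector
sample_batch = X_train[:2]        # two padded sequences of token IDs
embedded = model(sample_batch)    # shape: (2, sequence_length, 16)
print(embedded.shape)
```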
Still, the model is not yet complete. Let's add the next layer.
Next, we will use a pooling layer, `GlobalAveragePooling1D`. This layer reduces the dimensionality of the model's representation by averaging the word vectors across each sequence, effectively creating one overall context vector per text sequence, a necessary step before predicting the text category.

Our model with the `GlobalAveragePooling1D` layer now looks like this:
```python
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=500, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
])
```
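If you want to see what the pooling layer is doing, here is an optional sketch (assuming the two-layer model above) that compares the model's pooled output with an average of the embedding vectors computed by hand; without masking, `GlobalAveragePooling1D` simply averages over the sequence axis, so the two should match.

```python
# Optional sketch: the pooled output equals the mean of the embedding vectors over the sequence axis
emb_out = model.layers[0](X_train[:1])         # embedding output: (1, sequence_length, 16)
pooled = model(X_train[:1])                    # after pooling: (1, 16)
manual = tf.reduce_mean(emb_out, axis=1)       # averaging by hand
print(np.allclose(pooled.numpy(), manual.numpy()))  # True
```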
Our last layer is a Dense output layer with three units and a 'softmax' activation function. Three is the number of output categories we selected from the Reuters dataset, and the 'softmax' activation ensures the predicted probabilities across all categories sum to 1.
Lastly, we compile our model with the 'sparse_categorical_crossentropy' loss function, the 'adam' optimizer, and 'accuracy' as the metric. We then train the model for 10 epochs on our training set and evaluate it on the test set:
```python
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=500, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(3, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))
loss, accuracy = model.evaluate(X_test, y_test)

print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")
```
The output of the above code will be:
```text
Test Loss: 0.22081851959228516
Test Accuracy: 0.9556451439857483
```
This output indicates that our Neural Network model for text classification trained successfully on the Reuters dataset, achieving high accuracy with low loss. The combination of an Embedding layer, `GlobalAveragePooling1D`, and a Dense output layer allows the model to effectively represent and categorize text sequences.
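Once the model is trained, we can also use it directly for predictions. The snippet below is a hypothetical usage example: it predicts probabilities for the first few test documents and converts them back to integer-encoded categories, also illustrating that each row of softmax probabilities sums to 1.

```python
# Hypothetical usage: predict categories for the first five test documents
probs = model.predict(X_test[:5])       # shape (5, 3); each row sums to 1 thanks to softmax
predicted = np.argmax(probs, axis=1)    # integer-encoded category per document
print(predicted)
print(y_test[:5])                       # true integer-encoded categories for comparison
```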
We use `sparse_categorical_crossentropy` as our loss function because our labels are integers. In multi-class classification tasks where labels are not one-hot encoded (which would require `categorical_crossentropy`), `sparse_categorical_crossentropy` handles the labels more efficiently and straightforwardly: it expects integer labels and computes the loss between the true and predicted labels, guiding the model's optimization.
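As a quick illustration of this point, the following optional sketch (using made-up labels and probabilities, not the Reuters data) shows that `sparse_categorical_crossentropy` on integer labels gives the same values as `categorical_crossentropy` on the one-hot encoded versions of those labels.

```python
# Optional sketch: sparse vs. one-hot crossentropy on the same (made-up) predictions
y_true_int = np.array([0, 2, 1])                       # integer labels, as in our dataset
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.3, 0.5],
                   [0.1, 0.7, 0.2]])                   # made-up softmax outputs
sparse_loss = tf.keras.losses.sparse_categorical_crossentropy(y_true_int, y_pred)
dense_loss = tf.keras.losses.categorical_crossentropy(tf.one_hot(y_true_int, depth=3), y_pred)
print(np.allclose(sparse_loss.numpy(), dense_loss.numpy()))  # True
```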
Congratulations on taking a big step in your NLP journey! You've learned how to prepare and use Neural Networks for text classification efficiently. You've come a long way, and now is the perfect time to apply these concepts. In the next set of exercises, you will get to practice and consolidate your learning. Practice is crucial: it deepens our understanding of the concepts and gives us the confidence to handle real-world datasets and tasks. Remember, you're just one lesson away from unlocking the power of the Simple RNN, which we will cover in our next class. Let's get practicing!