Welcome! Today, we're going to take a deep dive into the concept of TF-IDF and its crucial role in Text Classification. TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a numerical statistic that reflects how important a word is in a document within a corpus of documents. The TF-IDF value increases proportionally to the number of times a word appears in the document but is counterbalanced by the frequency of the word in the corpus, helping to adjust for the fact that some words appear more frequently in general.
TF-IDF is used in information retrieval and text mining to help identify the key words that contribute the most to a document's relevance. In simple terms, a word that appears often in a specific document but rarely in the other documents of the corpus is significant and receives a high TF-IDF score.
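Before turning to a library implementation, it helps to see that arithmetic spelled out. Below is a minimal sketch of the classic textbook formulation, using a tiny made-up corpus and hand-rolled helper functions (all purely illustrative, not part of any library); scikit-learn, which we use shortly, adds smoothing and normalization, so its exact numbers differ:

```python
import math

# A tiny illustrative corpus (made up for this sketch)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and dogs play in the garden",
]

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)            # how often the term appears in this document

def idf(term, corpus):
    docs_with_term = sum(term in d.split() for d in corpus)
    return math.log(len(corpus) / docs_with_term)    # rarer terms get a larger IDF

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("cat", docs[0], docs))   # ~0.18: 'cat' is frequent here but rare in the corpus
print(tf_idf("the", docs[0], docs))   # 0.0: 'the' appears in every document, so its IDF is 0
```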
Now, let's understand it in practice.
In the Python ecosystem, scikit-learn is a widely used library offering various machine learning methods, along with utilities for pre-processing data, cross-validation, and other related tasks. One of the utilities it provides for text processing is TfidfVectorizer.
Let's walk through each line of the code:
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
```
We first import the necessary libraries. Next, we set up a small list of text documents:
```python
sentences = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]
```
We then create an instance of the TfidfVectorizer class and fit the vectorizer to our set of documents:
```python
vectorizer = TfidfVectorizer()
vectorizer.fit(sentences)
```
"The fitting process" involves tokenization and learning the vocabulary. The text documents are tokenized into a set of tokens, and the vocabulary, which is a set of all tokens, is learned. At this point, we have effectively transformed our sentences into a numerical format that our machine can understand!
We can now print out the vocabulary and the Inverse Document Frequency (IDF) for each word in the vocabulary:
```python
print(f'Vocabulary: {vectorizer.vocabulary_}\n')
print(f'IDF: {vectorizer.idf_}\n')
```
The output looks something like this:
```
Vocabulary: {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}

IDF: [1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]
```
The 'Vocabulary' maps each distinct word to a unique integer index, which is the word's column position in the vectors we are about to build. The 'IDF' values are the computed Inverse Document Frequencies for each word, listed in that index order; they measure how distinctive a word is across the corpus. From these outputs, we get an important inference: terms that occur in all or most documents (such as 'is' and 'the') have lower IDF scores, showing less importance. On the other hand, terms that occur in fewer documents have higher IDF scores, indicating they may be more important or distinctive in our text data.
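If you're curious where those IDF numbers come from: with TfidfVectorizer's default settings (smooth_idf=True), each value is computed as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is how many documents contain the term. Here is a small sketch that verifies this against the fitted vectorizer (it assumes the vectorizer and sentences defined above):

```python
import numpy as np

# Reuse the fitted vectorizer's own tokenizer so our counts match its vocabulary
analyze = vectorizer.build_analyzer()
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

n = len(sentences)
df = np.array([sum(term in analyze(s) for s in sentences) for term in terms])

# Smoothed IDF used by TfidfVectorizer's defaults: idf = ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n) / (1 + df)) + 1

print(np.allclose(idf, vectorizer.idf_))  # True
```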
Next, let's transform one of the text documents to a sparse vector of TF-IDF values:
```python
vector = vectorizer.transform([sentences[0]])
```
This step encodes the first sentence using TF-IDF scores. Simply put, each word in the sentence is translated into a numerical value generated by the TF-IDF computation; that value represents the word's relevance or significance within the document, relative to the rest of the corpus.
Finally, let's print out the resulting vector and its shape:
```python
print('Shape:', vector.shape)
print('Array:', vector.toarray())
```
The output reveals the shape of our encoded array and the TF-IDF score associated with each word in our sentence:
```
Shape: (1, 9)
Array: [[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
```
In the array, the TF-IDF scores follow the order of the vocabulary indices. So, for instance, the word 'and' (index 0 in the vocabulary) has a score of 0.0 because it does not occur in the sentence, while the word 'this' (index 8) has a score of 0.38408524, which reflects the relevance of 'this' within our sentence.
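To see that pairing explicitly, we can line up the vocabulary terms with their scores. This small sketch uses get_feature_names_out, which is available in recent scikit-learn versions (older versions used get_feature_names):

```python
terms = vectorizer.get_feature_names_out()   # terms in vocabulary-index order
scores = vector.toarray()[0]

# Print each term alongside its TF-IDF score for the first sentence
for term, score in zip(terms, scores):
    print(f'{term}: {score:.4f}')
```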
This way, we have transformed human language into a numerical representation that our machine can understand and learn from!
Moving beyond these simple sentences, let's apply the same process to the IMDB movie reviews dataset available in the NLTK library. This gives us a real-world scenario where TfidfVectorizer is used for a text classification task, in this case movie review classification.
```python
import nltk
from nltk.corpus import movie_reviews

nltk.download('movie_reviews')

reviews = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

print('Shape:', X.shape)
```
After applying the TfidfVectorizer to the movie reviews dataset, the output shows the dimensions of the resulting matrix: the number of reviews by the number of unique words across all reviews:
```
Shape: (2000, 39659)
```
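The lesson itself stops at vectorization, but it is worth sketching how these TF-IDF features would typically feed a classifier. The snippet below is an illustrative sketch, not part of the lesson's required code: it assumes X and movie_reviews from above, derives the 'pos'/'neg' labels from the corpus categories, and picks LogisticRegression simply as one reasonable baseline.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Each review's sentiment label ('pos' or 'neg') comes from its NLTK category
labels = [movie_reviews.categories(fileid)[0] for fileid in movie_reviews.fileids()]

# Hold out 20% of the reviews for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print('Test accuracy:', clf.score(X_test, y_test))
```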
The matrix X itself is worth a closer look. In large text datasets like ours, it will have many zero entries because most words won't appear in any given review. Storing all these zeros would be highly memory-intensive and inefficient, so instead a sparse matrix is used, in which only the non-zero elements are stored. This is the storage format used for X, which holds all the TF-IDF vectors.
Let's look into the structure of this sparse matrix a bit more:
Python1print("Total non-zero elements in the matrix X: ", len(X.data)) 2print("Length of the column indices array in X: ", len(X.indices)) 3print("Length of the row pointer array in X: ", len(X.indptr))
The outputs will look like this:
```
Total non-zero elements in the matrix X: 666842
Length of the column indices array in X: 666842
Length of the row pointer array in X: 2001
```
Here:
X.data: This array holds all the non-zero elements in our matrix, so its length is the total number of non-zero elements.
X.indices: This array holds the column (word) index for each non-zero element. It is exactly as long as X.data and tells us which word each stored value corresponds to.
X.indptr: This is the "row pointer" array. It has one more element than the number of rows in the matrix. Each value marks where the corresponding row starts in the X.data and X.indices arrays, which lets us locate which stored values belong to which review.
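To tie these three arrays together, here is a small illustrative sketch that walks one row of the sparse matrix by hand (the review index and the number of entries shown are arbitrary choices):

```python
row = 0                                          # inspect the first review (arbitrary choice)
start, end = X.indptr[row], X.indptr[row + 1]    # this row's slice of X.data and X.indices
words = vectorizer.get_feature_names_out()

# Print the first five stored (non-zero) TF-IDF entries of that review
for col, value in zip(X.indices[start:end][:5], X.data[start:end][:5]):
    print(f'{words[col]}: {value:.4f}')
```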
Congratulations! You've just learned about the concept of TF-IDF, how to apply TfidfVectorizer to text data in Python using the scikit-learn library, and how to interpret the resulting output. Additionally, you've been introduced to sparse matrices, a helpful concept when handling large text datasets, and seen how such matrices are represented.
In the coming practice exercises, you will apply these concepts independently, which will solidify your understanding of how TF-IDF fits into text classification tasks. Keep up the excellent work!