Hello again! Today, we're taking a closer look at Term Frequency-Inverse Document Frequency, or TF-IDF
. As you may recall from earlier lessons, TF-IDF
is a statistical measure that tells us how important a word is to a document in a collection or corpus.
Here, we explore TF-IDF
scores and their relevance, as they can guide us in identifying words that carry significant value in determining the context or theme of a document. We'll use Python's Scikit-learn library, a machine learning tool that comes with built-in capabilities for calculating TF-IDF
. Let's embark on this exciting journey of analyzing TF-IDF
scores on our SMS Spam Collection dataset.
Now that we learned how to compute the TF-IDF
matrix, we would like to determine which words (or features) have the highest scores. These words are often the most descriptive or differentiating words in the corpus!
Let's understand how the following code snippet helps identify these top features:
Python1# ... Previously loading the dataset and converting to Dataframe 2 3import pandas as pd 4import numpy as np 5from sklearn.feature_extraction.text import TfidfVectorizer 6 7# Initialize the TF-IDF Vectorizer 8vectorizer = TfidfVectorizer() 9 10# Transform the messages into a matrix of TF-IDF features 11X_tfidf = vectorizer.fit_transform(df['message']).toarray() 12 13# Get the feature names 14feature_names = vectorizer.get_feature_names_out() 15 16# Find the indices of the top 10 features based on global maximum TF-IDF scores 17top_features_idx = np.argsort(X_tfidf.max(axis=0))[-10:] 18 19# Extract the top 10 features 20top_features = feature_names[top_features_idx] 21 22# Print the top 10 TF-IDF features 23print("Top 10 TF-IDF features:") 24print(top_features)
In this code, we first transformed our dataset's messages into a matrix of TF-IDF
features, assigning greater weight to the most significant words. Then, by extracting the feature names and calculating the indices of the top 10 features with the highest TF-IDF
scores, we identified and printed the words most characteristic of our corpus. This procedure allows us to discern the terms that uniquely define the content of our dataset, showcasing the power of TF-IDF
in highlighting keywords.
The output of the above code will be:
Plain text1Top 10 TF-IDF features: 2['anytime' 'yup' '645' 'where' 'ok' 'alrite' 'thank' 'okie' 'thanx' 'nite']
This output showcases the eclectic mix of words that have the highest TF-IDF
scores within our SMS Spam Collection dataset. It's interesting to note the presence of both common words and seemingly random or unique terms. This underlines the importance of TF-IDF
in distinguishing relevant terms in a specific context, even when those terms might not seem immediately relevant at a glance.
Remarkable job! Today, we have learnt how to extract insights from TF-IDF
scores. We delved deeper into the meaning of these scores, and we coded a Python script using the Scikit-learn
library to calculate TF-IDF
. Furthermore, we built on this by writing code to identify the top words based on their TF-IDF
scores.
While it may seem like a lot, remember that practice makes perfect. Continue working with different corpora to get comfortable with the process. Use the upcoming practice exercises to reinforce your understanding and enhance your skills. Knowing how to interpret and analyze TF-IDF
scores forms the backbone of numerous advanced NLP tasks including document classification, sentiment analysis, topic modeling, and many more. Keep going!