Welcome to your lesson on Basic TF-IDF Vectorization! In the world of Natural Language Processing (NLP), converting text into numerical representations such as vectors is crucial. Today, we will explore one of the most popular techniques, Term Frequency-Inverse Document Frequency (TF-IDF), and see how we can implement it in Python using the scikit-learn library. By the end of the lesson, you will know how to vectorize a real-world text dataset, the SMS Spam Collection.
In the era of data-driven decision making, much of the valuable information comes in textual form — think of social media posts, medical records, and news articles. Text mining is the practice of deriving valuable insights from text. However, machine learning algorithms, which help process and understand this text, usually require input data in numerical format.
Here is where text vectorization steps in, converting text into a set of numerical values, each corresponding to a particular word or even a group of words. This conversion opens the doors for performing various operations such as finding similarity between documents, document classification, sentiment analysis, and many more.
Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used to evaluate the importance of a word in the context of a set of documents, known as a corpus. Its importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
Term Frequency (TF): It measures how frequently a word occurs in a document. Common ways to calculate TF include the raw count of the word in the document, the raw count normalized by the document's length (relative frequency), and logarithmically scaled counts.
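As a quick illustration (a sketch, not part of the lesson's own code), the raw-count and length-normalized TF variants can be computed with nothing but the standard library:

```python
from collections import Counter

def term_frequencies(document):
    """Compute raw and length-normalized term frequencies for one document."""
    tokens = document.lower().split()
    raw = Counter(tokens)                 # raw count of each word
    total = len(tokens)
    relative = {word: count / total for word, count in raw.items()}
    return raw, relative

raw, relative = term_frequencies("the car is driven on the road the")
print(raw["the"])       # raw count of 'the'
print(relative["the"])  # relative frequency of 'the'
```

Real vectorizers use more careful tokenization than `split()`, but the counting logic is the same.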
Inverse Document Frequency (IDF): It measures the importance of a word in the entire corpus. The beauty of IDF is that it diminishes the weight of terms that occur very frequently in the data set and increases the weight of terms that occur rarely. Traditionally, the IDF of a word such as 'learn' is calculated using the formula:

IDF('learn') = log(N / df('learn'))

where:

- N is the total number of documents in the corpus, and
- df('learn') is the number of documents that contain the word 'learn'.
However, this straightforward calculation might lead to a division-by-zero issue if a term does not exist in any document. To avoid this, scikit-learn applies a modified, smoothed version of the formula:

IDF('learn') = ln((1 + N) / (1 + df('learn'))) + 1
This adjusted formula adds 1 to both the total number of documents and the document frequency of the term, which prevents division by zero, and adds 1 to the result, which ensures no term ends up with an IDF of exactly zero. Such an approach not only prevents mathematical errors but also reflects an intuitive understanding: every term should have a minimum level of importance simply by existing in the language used across the corpus.
Finally, the TF-IDF value for a word in a document is calculated as the product of its TF (with one of the variations mentioned above chosen based on the application) and its IDF. This balanced mix serves to highlight words that are uniquely significant to individual documents, providing a robust way to represent text data numerically for various machine learning and natural language processing applications.
TfidfVectorizer, offered by the sklearn.feature_extraction.text module, is a comprehensive tool for transforming text into a meaningful numerical representation. It not only calculates TF-IDF scores but also streamlines text preprocessing by lowercasing the text, tokenizing it into individual words, stripping punctuation as part of tokenization, and optionally removing stop words.
The tokenization process thus plays a critical role in filtering and preparing the text data, making it ready for vectorization without extensive manual preprocessing.
Example of using TfidfVectorizer:

```python
# Example corpus
corpus = ['The car is driven on the road.',  # Note the punctuation
          'The bus is driven on the street!']

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

# Vectorize the corpus, including lowercasing, tokenization and punctuation removal
X_tfidf = vectorizer.fit_transform(corpus)

# Display the feature names and the TF-IDF matrix
print(vectorizer.get_feature_names_out())
print(X_tfidf.toarray())
```
In the output, the results form a matrix where each row represents a different document from our corpus, and each column corresponds to a unique token identified across all documents. It is important to note that punctuation has been removed, ensuring our focus is solely on the words themselves:
```text
['bus' 'car' 'driven' 'is' 'on' 'road' 'street' 'the']

[[0.         0.42471719 0.30218978 0.30218978 0.30218978 0.42471719 0.         0.60437955]
 [0.42471719 0.         0.30218978 0.30218978 0.30218978 0.         0.42471719 0.60437955]]
```
The TfidfVectorizer's ability to remove punctuation and tokenize the text simplifies the preprocessing requirements, making it straightforward to turn raw text into a format that's ready for analysis or machine learning models.
Let's apply TF-IDF to the SMS Spam Collection text data:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to DataFrame
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Convert the 'message' column to TF-IDF vectors
X_tfidf = vectorizer.fit_transform(df['message'])

print(X_tfidf.shape)
```
The output of the above code will be:
```text
(5572, 8713)
```
This output shows that the TF-IDF vectorization transformed the messages into a matrix with 5572 rows, each corresponding to a message, and 8713 columns, each representing a unique word. This transformation effectively converts the textual data into a numerical format suitable for machine learning algorithms.
Congratulations on completing the lesson on Basic TF-IDF Vectorization! You have taken a step further into the world of NLP by learning about TF-IDF vectorization, a popular method for transforming text into a numerical representation. With this knowledge, you can now prepare your textual data for machine learning algorithms. Remember, the more you practice, the stronger your understanding will be. So dive into the exercises and keep learning!