Welcome back! As we move ahead in our Natural Language Processing journey, today's lesson is about a fundamental component of NLP preprocessing: lemmatization. We will get hands-on with the spaCy library to implement lemmatization on our text data.
By the end of the lesson, you should be able to explain and implement lemmatization in your data preprocessing pipeline for NLP tasks.
Lemmatization, in the context of Natural Language Processing, is the process of reducing a given word to its base or dictionary form, known as its lemma.
Let's take an example: suppose we have an inflected form of a verb, like `flying`. The base form of `flying` is `fly`, so if we perform lemmatization on `flying`, we get `fly`. Along similar lines, `better` would be reduced to `good`, `mice` would become `mouse`, and so on.
So, why lemmatization? Well, while dealing with natural language, we frequently encounter different forms of the same word. For a machine, `better`, `good`, and `best` are different words, even though they essentially express the same thing. When we perform tasks like text classification, these different forms are treated as different features, thus increasing the dimensionality of our dataset. By lemmatizing, we can reduce these variations to their root form, thereby reducing the number of features and making our model more efficient.
spaCy offers a convenient and efficient way to perform lemmatization on text. When spaCy processes any text, it performs lemmatization by default and keeps the lemma (or root form) of each word as an attribute of the word. This attribute can be accessed by simply calling `token.lemma_`, where `token` is the word we're dealing with.
Now, let's move on to the practical implementation, using the provided task as an example to perform lemmatization on a sentence.
```python
import spacy

nlp = spacy.load('en_core_web_sm')
sentence = "The striped bats are hanging on their feet and ate best fishes"
doc = nlp(sentence)

for token in doc:
    print(token.text, token.lemma_)
```
In the above code, we initially load the English language model using `nlp = spacy.load('en_core_web_sm')`. We then use this model to process our sentence and convert it to a `doc`, which is essentially a collection of tokens (or words).

Finally, we iterate over each token in the `doc` and print the token and its corresponding lemma. The lemma of a token can be accessed using the `lemma_` attribute of the token.
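As a side note, the trailing underscore follows spaCy's general naming convention: attributes like `lemma` (without the underscore) return an integer hash, since spaCy interns strings for efficiency, while `lemma_` returns the human-readable text. A minimal sketch:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("The striped bats are hanging on their feet")

token = doc[2]       # the token "bats"
print(token.lemma)   # integer hash ID of the lemma
print(token.lemma_)  # the lemma as a string: "bat"
```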
The output of the above code will be:
```
The the
striped stripe
bats bat
are be
hanging hang
on on
their their
feet foot
and and
ate eat
best good
fishes fish
```
This output demonstrates how each word from our sentence is processed and reduced to its lemma form. Notice how "bats" is converted to "bat", and "ate" to "eat", showcasing the effectiveness of lemmatization in normalizing text.
So how does this help in real-world Natural Language Processing tasks? Lemmatization reduces the various inflected forms of a word to a single form. This can significantly reduce the number of unique words in our text (which, in the case of text data, means reducing the number of features) without losing significant meaning.
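We can observe this effect directly by comparing the number of distinct surface forms in a text to the number of distinct lemmas. Here is a minimal sketch; the example text is illustrative, and the exact counts depend on the model's tagging:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
text = ("The striped bats are hanging on their feet and ate best fishes. "
        "A bat hangs by its foot and eats the best fish.")
doc = nlp(text)

# Distinct surface forms vs. distinct lemmas (alphabetic tokens only)
surface_forms = {token.text.lower() for token in doc if token.is_alpha}
lemmas = {token.lemma_.lower() for token in doc if token.is_alpha}

print(len(surface_forms), len(lemmas))  # the lemma vocabulary is smaller
```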
Text classification, sentiment analysis, and topic modeling are just a few NLP tasks that can benefit significantly from the dimensionality reduction lemmatization offers. By making the dataset more manageable, lemmatization helps us build machine learning models that are more computationally efficient and more accurate.
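To see how this plugs into such a workflow, here is a sketch that uses spaCy lemmas as the tokens for a bag-of-words representation. It assumes scikit-learn is installed, and `lemma_tokenizer` is a hypothetical helper name, not part of either library:

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load('en_core_web_sm')

def lemma_tokenizer(text):
    # Hypothetical helper: return lemmas instead of raw tokens,
    # so inflected forms of a word share a single feature
    return [token.lemma_ for token in nlp(text) if token.is_alpha]

corpus = ["The bats are flying", "A bat flew yesterday"]

# token_pattern=None silences the warning about the unused default pattern
vectorizer = CountVectorizer(tokenizer=lemma_tokenizer, token_pattern=None)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # e.g. a single 'bat' feature
```

With raw tokens, "bats" and "bat" would be two separate columns in the feature matrix; with lemmas, they collapse into one.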
Well done on reaching this point! Today you learned about lemmatization, its importance in NLP data preprocessing, and how to utilize spaCy to perform lemmatization. This knowledge is an integral part of any NLP pipeline and will assist you greatly in future tasks.
Up next, we'll be practicing using spaCy's lemmatization functionality on actual datasets and assessing its effects on our text data. This will reinforce your understanding and further boost your spaCy mastery! See you in the next lesson.