Hello and welcome to the next exciting part of our journey with Natural Language Processing! In today's lesson, we focus on one of the vital components of NLP: Entity Recognition, and we are going to see it in action using Python and spaCy. Our goal for today's lesson is to grasp the core concepts behind Entity Recognition, understand why it is important, and implement it in Python using spaCy.
So, what exactly is Entity Recognition? Entity Recognition, or Named Entity Recognition (NER), is an information extraction task that involves identifying named entities (like persons, places, and organizations) in a text and classifying them into pre-defined categories. It is essentially the process by which an algorithm can read a string of text and say, "Ah, this part of the text refers to a place, and this part refers to a person!"
Let's consider an example to understand this better. Given the sentence "Apple Inc. is planning to open a new office in San Francisco.", Named Entity Recognition helps us identify "Apple Inc." as an organization and "San Francisco" as a geographical entity.
Named Entity Recognition plays a crucial role in various NLP applications such as information retrieval (search engines), machine translation, question answering systems, and more. It helps algorithms better understand the context of sentences and extract important attributes from the text.
With a theoretical understanding of Entity Recognition, let's now delve into its practical implementation using Python and the spaCy library. As mentioned above, spaCy has a built-in Named Entity Recognition system that can recognize a wide variety of named and numerical entities. This comes as part of spaCy's statistical models, and not all language models support it. However, the model we are using, en_core_web_sm, does support Named Entity Recognition.
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several steps, known as the processing pipeline. The pipeline used by the en_core_web_sm model includes a tagger, a parser, and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.
Upon calling nlp with our text, the model's pipeline is applied and a processed Doc object is returned. Having gone through the pipeline, the Doc object now holds all the information about the entities that have been recognized.
Now that we understand how spaCy's entity recognizer works, let's go ahead and run it on a real-world dataset. For this lesson, we will use the built-in Reuters corpus from the Natural Language Toolkit (NLTK) library. Specifically, we will extract entities from articles in the 'crude' category.

To start, we import the necessary libraries and load the English model using spacy.load("en_core_web_sm"). Next, we fetch an article from the 'crude' category using reuters.raw(fileids=reuters.fileids(categories='crude')[0]). The raw text of the first article in this category is then processed through our pipeline by calling nlp(text).
```python
# Import necessary libraries
from nltk.corpus import reuters
import spacy

# If the Reuters corpus is not yet available, download it once with:
# import nltk; nltk.download('reuters')

# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")

# Define the text for extraction
text = reuters.raw(fileids=reuters.fileids(categories='crude')[0])

# Process the text
doc = nlp(text)

# Print the entity, starting and ending index, and label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```
The Doc object holds a collection of Token objects, which also carry their respective predicted entities. Here, we iterate over each ent in doc.ents and print the text of the entity, its starting and ending index in the document, and its label.
The output of the above code will look like this:
```
JAPAN 0 5 GPE
The Ministry of International Trade 52 87 ORG
MITI 104 108 ORG
August 170 176 DATE
Japanese 209 217 NORP
MITI 266 270 ORG
the year 2000 340 353 DATE
550 357 360 CARDINAL
600 386 389 CARDINAL
Japanese 476 484 NORP
MITI 594 598 ORG
the
Agency of Natural Resources and Energy 711 755 ORG
MITI 793 797 ORG
Japan 945 950 GPE
the fiscal year ended March 31 973 1003 DATE
an estimated 27 1015 1030 CARDINAL
a kilowatt/hour 1040 1055 TIME
23 1080 1082 CARDINAL
21 1117 1119 CARDINAL
```
This output shows various entities extracted from the Reuters article, including geopolitical entities (GPE), organizations (ORG), nationalities (NORP), dates, and cardinal numbers. It illustrates the powerful capability of spaCy in identifying different types of entities in text, which is fundamental for many NLP tasks.
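If a label such as NORP is unfamiliar, spaCy can describe it for you. This lookup does not require a downloaded model, only the spacy package itself.

```python
import spacy

# spacy.explain maps an entity label to a human-readable description
for label in ["GPE", "ORG", "NORP", "CARDINAL"]:
    print(label, "->", spacy.explain(label))
```

For instance, `spacy.explain("NORP")` describes the label as covering nationalities and religious or political groups.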
This entity recognition code helps us understand how the spaCy library processes text and how we can use it to identify entities in practically any kind of textual data. This knowledge will be crucial when we move forward to the next lesson on Entity Linking.
Congratulations! You have learned the importance of Entity Recognition in NLP and implemented it efficiently using the spaCy library in Python.
You have seen how we can process text and identify named entities, such as organizations, persons, and geographical locations, among others. To further strengthen your understanding, we encourage you to experiment with a variety of texts and categories within the Reuters dataset, or other text data of your interest.
In the next lesson, we will build on this foundation by studying custom NLP pipeline components and their practical implementation. Stay tuned!