Welcome to our lesson on Named Entity Recognition! Today, we'll be diving deep into the world of NLP and discovering how we can identify informative chunks of text, namely "Named Entities". The goal of this lesson is to learn about Part of Speech (POS) tagging and Named Entity Recognition (NER). By the end, you'll be able to gather specific types of data from text and get a few steps closer to mastering text classification.
Imagine we have a piece of text and we want to get some quick insights. What are the main subjects? Are there any specific locations or organizations being talked about? This is where Named Entity Recognition (NER) comes in handy.
In natural language processing (NLP), NER is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as names of persons, organizations, locations, expressions of times, quantities, monetary values, and percentages.
For instance, consider the sentence: "Apple Inc. is planning to open a new store in San Francisco." Using NER, we could identify that "Apple Inc." is an organization and "San Francisco" is a location. Such information can be incredibly valuable for numerous NLP tasks.
Every word in a sentence has a particular role. Some words are nouns, some are verbs, some are adjectives, and so on. Tagging these parts of speech, or POS tagging, can be a critical component of many NLP tasks. It helps answer questions such as: what are the main subjects in a sentence, what actions are being taken, and what's the context of these actions?
Let's start with an example sentence: "Apple Inc. is planning to open a new store in San Francisco." We are going to use NLTK's `pos_tag` function to tag the part of speech for each word in this sentence.
```python
from nltk import pos_tag, word_tokenize

# Requires the NLTK data packages 'punkt' (tokenizer) and
# 'averaged_perceptron_tagger' (tagger), installed via nltk.download()

example_sentence = "Apple Inc. is planning to open a new store in San Francisco."
tokens = word_tokenize(example_sentence)
pos_tags = pos_tag(tokens)
print(f'The first 5 POS tags are: {pos_tags[:5]}')
```
The output of the above code will be:
```text
The first 5 POS tags are: [('Apple', 'NNP'), ('Inc.', 'NNP'), ('is', 'VBZ'), ('planning', 'VBG'), ('to', 'TO')]
```
Here, every word from our sentence gets tagged with a corresponding part of speech. This is the first step towards performing Named Entity Recognition.
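POS tags become useful as soon as you filter on them. As a quick illustration, here is a minimal sketch that keeps only the nouns from a tagged sentence; the tag list is hardcoded from the output above so the snippet stands alone, but in practice you would use the `pos_tags` variable directly.

```python
# POS tags as produced by pos_tag() for our example sentence (hardcoded here)
pos_tags = [('Apple', 'NNP'), ('Inc.', 'NNP'), ('is', 'VBZ'), ('planning', 'VBG'),
            ('to', 'TO'), ('open', 'VB'), ('a', 'DT'), ('new', 'JJ'),
            ('store', 'NN'), ('in', 'IN'), ('San', 'NNP'), ('Francisco', 'NNP'), ('.', '.')]

# Keep words whose tag starts with 'NN' (noun tags: NN, NNS, NNP, NNPS)
nouns = [word for word, tag in pos_tags if tag.startswith('NN')]
print(nouns)  # ['Apple', 'Inc.', 'store', 'San', 'Francisco']
```

Filtering like this is a common lightweight way to pull out candidate subjects and objects before doing any heavier analysis.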
Now, what about Named Entity Recognition? Well, Named Entity Recognition (or NER) can be considered a step beyond regular POS tagging. It groups together one or more words that signify a named entity such as "San Francisco" or "Apple Inc." into a single category, i.e., location or organization in this case.
We can use NLTK's `ne_chunk` function to perform NER on our POS-tagged sentence, like so:
```python
from nltk import ne_chunk

# Requires the NLTK data packages 'maxent_ne_chunker' and 'words',
# installed via nltk.download()

named_entities = ne_chunk(pos_tags)
print(f'The named entities in our example sentence are:\n{named_entities}')
```
The output of the above code will be:
```text
The named entities in our example sentence are:
(S
  (PERSON Apple/NNP)
  (ORGANIZATION Inc./NNP)
  is/VBZ
  planning/VBG
  to/TO
  open/VB
  a/DT
  new/JJ
  store/NN
  in/IN
  (GPE San/NNP Francisco/NNP)
  ./.)
```
Let's break down this output. The result is an NLTK tree whose root `S` spans the whole sentence. Nested subtrees mark the recognized entities: "San Francisco" is correctly grouped as a GPE (Geo-Political Entity), while "Apple" and "Inc." have been split into separate PERSON and ORGANIZATION chunks instead of being recognized as the single organization "Apple Inc." Words outside any subtree are ordinary tagged tokens.
While Named Entity Recognition offers richer insights than simple POS tagging, it might not always be perfectly accurate due to the ambiguity and context-dependent nature of language. Despite this, it's a powerful tool in any NLP practitioner's arsenal.
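To use this output programmatically, you can walk the tree and collect each entity chunk as a (label, text) pair. Below is a minimal sketch, assuming NLTK is installed; the tree is reconstructed by hand here so the snippet runs on its own, but in practice you would pass the result of `ne_chunk()` directly.

```python
from nltk.tree import Tree

# The ne_chunk output from above, reconstructed by hand for a standalone example
named_entities = Tree('S', [
    Tree('PERSON', [('Apple', 'NNP')]),
    Tree('ORGANIZATION', [('Inc.', 'NNP')]),
    ('is', 'VBZ'), ('planning', 'VBG'), ('to', 'TO'), ('open', 'VB'),
    ('a', 'DT'), ('new', 'JJ'), ('store', 'NN'), ('in', 'IN'),
    Tree('GPE', [('San', 'NNP'), ('Francisco', 'NNP')]),
    ('.', '.'),
])

# Subtrees are entity chunks; plain tuples are ordinary tagged words
entities = [(subtree.label(), ' '.join(word for word, tag in subtree))
            for subtree in named_entities
            if isinstance(subtree, Tree)]
print(entities)  # [('PERSON', 'Apple'), ('ORGANIZATION', 'Inc.'), ('GPE', 'San Francisco')]
```

This flat list of entities is usually what downstream tasks want, rather than the raw tree.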
Examining these NLP techniques in action on larger, more complex datasets allows us to understand the power of Natural Language Processing better. To this end, let's use POS tagging and Named Entity Recognition on a real-world dataset - the 20 Newsgroups dataset.
```python
from sklearn.datasets import fetch_20newsgroups
from nltk import pos_tag, ne_chunk, word_tokenize

# Loading the data with metadata removed
newsgroups_data = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Selecting the first document
first_doc = newsgroups_data.data[0]

# Trimming the document's text down to the first 67 characters
first_doc = first_doc[:67]

# Tokenizing the text
tokens_first_doc = word_tokenize(first_doc)

# Applying POS tagging
pos_tags_first_doc = pos_tag(tokens_first_doc)

# Applying Named Entity Recognition
named_entities = ne_chunk(pos_tags_first_doc)

print(f'The first chunk of named entities in the first document are:\n{named_entities}')
```
Here's the output you can expect:
```text
The first chunk of named entities in the first document are:
(S
  I/PRP
  was/VBD
  wondering/VBG
  if/IN
  anyone/NN
  out/IN
  there/RB
  could/MD
  enlighten/VB
  me/PRP
  on/IN
  this/DT
  car/NN)
```
As you can see, even when we're working with a slimmed-down text input, both POS tagging and NER deliver valuable insights. We've applied these techniques to just a portion of a complex, real-world dataset, demonstrating how NLP can uncover important information from vast amounts of textual data. This highlights the critical role NLP plays in fields ranging from data analysis to AI and machine learning.
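At corpus scale, you would typically run this pipeline over every document and aggregate the results, for example by counting how often each entity type appears. Here is a minimal sketch of that aggregation step; the per-document entity lists are illustrative placeholders standing in for real `ne_chunk` output, not actual results from the 20 Newsgroups data.

```python
from collections import Counter

# Illustrative placeholder (label, text) pairs standing in for
# per-document NER results extracted from ne_chunk trees
doc_entities = [
    [('GPE', 'San Francisco'), ('ORGANIZATION', 'Apple Inc.')],
    [('PERSON', 'Alan Turing'), ('GPE', 'London')],
    [('ORGANIZATION', 'NASA'), ('GPE', 'Houston')],
]

# Count how often each entity type appears across the corpus
label_counts = Counter(label for entities in doc_entities for label, text in entities)
print(label_counts.most_common())  # [('GPE', 3), ('ORGANIZATION', 2), ('PERSON', 1)]
```

Simple tallies like this can quickly reveal whether a corpus is dominated by places, people, or organizations.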
In this lesson, we have covered Part of Speech (POS) tagging, Named Entity Recognition (NER), and even applied these techniques to a real-world dataset! These concepts are fundamental to text preprocessing in Natural Language Processing (NLP). Having a grasp over these will allow you to approach more advanced topics in NLP with ease.
In the upcoming tasks you'll practice these techniques, reinforcing your understanding and improving your Natural Language Processing skills. Practice is key to mastering them, so enjoy the hands-on session, and keep learning!