Welcome to our lesson on Named Entity Recognition! Today, we'll be diving deep into the world of NLP and discovering how we can identify informative chunks of text, namely "Named Entities". The goal of this lesson is to learn about Part of Speech (POS) tagging and Named Entity Recognition (NER). By the end, you'll be able to gather specific types of data from text and get a few steps closer to mastering text classification.
Imagine we have a piece of text and we want to get some quick insights. What are the main subjects? Are there any specific locations or organizations being talked about? This is where Named Entity Recognition (NER) comes in handy.
In natural language processing (NLP), NER is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as names of persons, organizations, locations, expressions of times, quantities, monetary values, and percentages.
For instance, consider the sentence: "Apple Inc. is planning to open a new store in San Francisco." Using NER, we could identify that "Apple Inc." is an organization and "San Francisco" is a location. Such information can be incredibly valuable for numerous NLP tasks.
Every word in a sentence has a particular role. Some words are nouns, some are verbs, some are adjectives, and so on. Tagging these parts of speech, or POS tagging, can be a critical component of many NLP tasks. It helps answer questions such as: what are the main subjects in a sentence, what actions are being taken, and what's the context of these actions?
Let's start with an example sentence: "Apple Inc. is planning to open a new store in San Francisco." We are going to use NLTK's `pos_tag` function to tag the part of speech for each word in this sentence.
```python
from nltk import pos_tag, word_tokenize

# Requires the NLTK data packages 'punkt' (tokenizer) and
# 'averaged_perceptron_tagger' (tagger), installed via nltk.download()

example_sentence = "Apple Inc. is planning to open a new store in San Francisco."
tokens = word_tokenize(example_sentence)
pos_tags = pos_tag(tokens)
print(f'The first 5 POS tags are: {pos_tags[:5]}')
```
The output of the above code will be:
```text
The first 5 POS tags are: [('Apple', 'NNP'), ('Inc.', 'NNP'), ('is', 'VBZ'), ('planning', 'VBG'), ('to', 'TO')]
```
Here, every word from our sentence gets tagged with a corresponding part of speech. This is the first step towards performing Named Entity Recognition.
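POS tags become useful as soon as you filter on them. As a quick illustration, here is a minimal sketch that keeps only the nouns from a tagged sentence; the tag list is hardcoded from the output above so the snippet stands alone, but in practice you would use the `pos_tags` variable directly.

```python
# POS tags as produced by pos_tag() for our example sentence (hardcoded here)
pos_tags = [('Apple', 'NNP'), ('Inc.', 'NNP'), ('is', 'VBZ'), ('planning', 'VBG'),
            ('to', 'TO'), ('open', 'VB'), ('a', 'DT'), ('new', 'JJ'),
            ('store', 'NN'), ('in', 'IN'), ('San', 'NNP'), ('Francisco', 'NNP'), ('.', '.')]

# Keep words whose tag starts with 'NN' (noun tags: NN, NNS, NNP, NNPS)
nouns = [word for word, tag in pos_tags if tag.startswith('NN')]
print(nouns)  # ['Apple', 'Inc.', 'store', 'San', 'Francisco']
```

Filtering like this is a common lightweight way to pull out candidate subjects and objects before doing any heavier analysis.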
Now, what about Named Entity Recognition? Well, Named Entity Recognition (or NER) can be considered a step beyond regular POS tagging. It groups together one or more words that signify a named entity such as "San Francisco" or "Apple Inc." into a single category, i.e., location or organization in this case.
We can use NLTK's `ne_chunk` function to perform NER on our POS-tagged sentence, like so:
```python
from nltk import ne_chunk

# Requires the NLTK data packages 'maxent_ne_chunker' and 'words',
# installed via nltk.download()

named_entities = ne_chunk(pos_tags)
print(f'The named entities in our example sentence are:\n{named_entities}')
```
The output of the above code will be:
```text
The named entities in our example sentence are:
(S
  (PERSON Apple/NNP)
  (ORGANIZATION Inc./NNP)
  is/VBZ
  planning/VBG
  to/TO
  open/VB
  a/DT
  new/JJ
  store/NN
  in/IN
  (GPE San/NNP Francisco/NNP)
  ./.)
```
Let's break down this output. The result is an NLTK tree whose root `S` spans the whole sentence. Nested subtrees mark the recognized entities: "San Francisco" is correctly grouped as a GPE (Geo-Political Entity), while "Apple" and "Inc." have been split into separate PERSON and ORGANIZATION chunks instead of being recognized as the single organization "Apple Inc." Words outside any subtree are ordinary tagged tokens.
While Named Entity Recognition offers richer insights than simple POS tagging, it might not always be perfectly accurate due to the ambiguity and context-dependent nature of language. Despite this, it's a powerful tool in any NLP practitioner's arsenal.
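To use this output programmatically, you can walk the tree and collect each entity chunk as a (label, text) pair. Below is a minimal sketch, assuming NLTK is installed; the tree is reconstructed by hand here so the snippet runs on its own, but in practice you would pass the result of `ne_chunk()` directly.

```python
from nltk.tree import Tree

# The ne_chunk output from above, reconstructed by hand for a standalone example
named_entities = Tree('S', [
    Tree('PERSON', [('Apple', 'NNP')]),
    Tree('ORGANIZATION', [('Inc.', 'NNP')]),
    ('is', 'VBZ'), ('planning', 'VBG'), ('to', 'TO'), ('open', 'VB'),
    ('a', 'DT'), ('new', 'JJ'), ('store', 'NN'), ('in', 'IN'),
    Tree('GPE', [('San', 'NNP'), ('Francisco', 'NNP')]),
    ('.', '.'),
])

# Subtrees are entity chunks; plain tuples are ordinary tagged words
entities = [(subtree.label(), ' '.join(word for word, tag in subtree))
            for subtree in named_entities
            if isinstance(subtree, Tree)]
print(entities)  # [('PERSON', 'Apple'), ('ORGANIZATION', 'Inc.'), ('GPE', 'San Francisco')]
```

This flat list of entities is usually what downstream tasks want, rather than the raw tree.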
Examining these NLP techniques in action on larger, more complex datasets allows us to understand the power of Natural Language Processing better. To this end, let's use POS tagging and Named Entity Recognition on a real-world dataset - the 20 Newsgroups dataset.
```python
from sklearn.datasets import fetch_20newsgroups
from nltk import pos_tag, ne_chunk, word_tokenize

# Loading the data with metadata removed
newsgroups_data = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Selecting the first document
first_doc = newsgroups_data.data[0]

# Trimming the document's text down to the first 67 characters
first_doc = first_doc[:67]

# Tokenizing the text
tokens_first_doc = word_tokenize(first_doc)

# Applying POS tagging
pos_tags_first_doc = pos_tag(tokens_first_doc)

# Applying Named Entity Recognition
named_entities = ne_chunk(pos_tags_first_doc)

print(f'The first chunk of named entities in the first document are:\n{named_entities}')
```
Here's the output you can expect:
```text
The first chunk of named entities in the first document are:
(S
  I/PRP
  was/VBD
  wondering/VBG
  if/IN
  anyone/NN
  out/IN
  there/RB
  could/MD
  enlighten/VB
  me/PRP
  on/IN
  this/DT
  car/NN)
```
As you can see, even when we're working with a slimmed-down text input, both POS tagging and NER deliver valuable insights. We've applied these techniques to just a portion of a complex, real-world dataset, demonstrating how NLP can uncover important information from vast amounts of textual data. This highlights the critical role NLP plays in fields ranging from data analysis to AI and machine learning.
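At corpus scale, you would typically run this pipeline over every document and aggregate the results, for example by counting how often each entity type appears. Here is a minimal sketch of that aggregation step; the per-document entity lists are illustrative placeholders standing in for real `ne_chunk` output, not actual results from the 20 Newsgroups data.

```python
from collections import Counter

# Illustrative placeholder (label, text) pairs standing in for
# per-document NER results extracted from ne_chunk trees
doc_entities = [
    [('GPE', 'San Francisco'), ('ORGANIZATION', 'Apple Inc.')],
    [('PERSON', 'Alan Turing'), ('GPE', 'London')],
    [('ORGANIZATION', 'NASA'), ('GPE', 'Houston')],
]

# Count how often each entity type appears across the corpus
label_counts = Counter(label for entities in doc_entities for label, text in entities)
print(label_counts.most_common())  # [('GPE', 3), ('ORGANIZATION', 2), ('PERSON', 1)]
```

Simple tallies like this can quickly reveal whether a corpus is dominated by places, people, or organizations.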
In this lesson, we have covered Part of Speech (POS) tagging, Named Entity Recognition (NER), and even applied these techniques to a real-world dataset! These concepts are fundamental to text preprocessing in Natural Language Processing (NLP). Having a grasp over these will allow you to approach more advanced topics in NLP with ease.
In the upcoming tasks you'll practice these techniques, reinforcing your understanding and improving your Natural Language Processing skills. Practice is key to mastering them, so enjoy the hands-on session, and keep learning!