Welcome to this lesson on expanding the Natural Language Processing (NLP) pipeline with custom components using the spaCy library. Today, we're going to focus on adding extensions in two ways: using a getter that computes results on demand, or using a pipeline component that precomputes them during processing. You'll learn when to use each method and practice creating meaningful custom components.
Extensions in spaCy are an efficient and flexible system for adding extra functionality to the built-in `Doc`, `Token`, and `Span` objects, as well as some other classes such as `Language` and `Vocab`. They can be used to add more information to a `Token`, for example, the length of the sentence where the token is found.
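Since the same mechanism works on `Doc` and `Span` as well, here is a minimal sketch of a `Doc`-level getter. The attribute name `num_sentences` is our own illustration, not part of this lesson's required code:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

# Hypothetical Doc-level getter: the number of sentences in the document
Doc.set_extension("num_sentences", getter=lambda doc: len(list(doc.sents)))

doc = nlp("Extensions are flexible. They work on Doc, Token, and Span objects.")
print(doc._.num_sentences)  # 2
```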
Getter-based extensions are recommended when the attribute computation is straightforward, efficient, and depends only on the individual `Token` instance. Such extensions are computed dynamically at the time of access, ensuring up-to-date, context-specific information without upfront computational overhead.
Let's look at an example of a getter-based extension using the Reuters text corpus:
```python
import spacy
from spacy.tokens import Token
from nltk.corpus import reuters  # requires nltk.download('reuters') on first use

nlp = spacy.load("en_core_web_sm")

# Adding an extension with a getter
Token.set_extension("sentence_len", getter=lambda token: len(token.sent))

texts = reuters.raw(categories=['crude', 'coffee', 'gold'])[0:5000]
doc = nlp(texts)
```
Here, we created a simple getter-based extension that computes the length of the sentence in which each token appears. To access the `sentence_len` extension for each token, use the following approach:
```python
# Accessing sentence length information
for token in doc:
    print(f"{token.text}: {token._.sentence_len}")

# Example output
# JAPAN: 51
# TO: 51
# REVISE: 51
# LONG: 51
```
Now, let's add a more linguistically meaningful extension that computes a simple linguistic feature. Consider phonetic similarity between words. For the sake of simplicity, we'll create a phonetic key consisting of the word's first two consonants or, if it has fewer than two, its first two characters. Remember, real phonetic comparison (for example, algorithms such as Soundex or Metaphone) would be much more elaborate and language-dependent.
```python
def get_phonetic_key(token):
    # First two non-vowel characters, or the first two characters
    # of the word if it has fewer than two consonants
    non_vowels = [ch for ch in token.text.lower() if ch not in 'aeiou']
    return ''.join(non_vowels[:2]) if len(non_vowels) > 1 else token.text.lower()[:2]

Token.set_extension('phonetic_key', getter=get_phonetic_key)

texts = reuters.raw(categories=['crude', 'coffee', 'gold'])[0:5000]
doc = nlp(texts)
```
After creating the phonetic key extension, you can access it for each token as follows:
```python
# Accessing the phonetic_key extension
for token in doc:
    print(f"{token.text}: {token._.phonetic_key}")

# Example output
# JAPAN: jp
# TO: to
# REVISE: rv
# LONG: ln
```
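To see the key put to use for its stated purpose, here is a minimal sketch (our own addition, assuming the `doc` and `phonetic_key` extension from the snippets above) that groups distinct words sharing the same key:

```python
from collections import defaultdict

# Group distinct words by phonetic key to find words that
# "sound alike" under our simplified scheme
groups = defaultdict(set)
for token in doc:
    if token.is_alpha:
        groups[token._.phonetic_key].add(token.text.lower())

# Print keys shared by more than one distinct word
for key, words in groups.items():
    if len(words) > 1:
        print(f"{key}: {sorted(words)}")
```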
This segment demonstrates how to use getter-based extensions effectively and retrieve the additional data they provide.
In spaCy, a pipeline component is a function that is handed a `Doc` object, performs a specific operation on it, and then returns the modified `Doc`. Pipeline components are useful for adding more complex extensions, especially when the computation is intricate or depends on other tokens. Each component is executed automatically as part of the spaCy pipeline when we run `nlp` on a text: the components are invoked in sequence, so the modifications or enhancements implemented in each one are computed and applied to the `Doc` object as it passes through. By the time processing is complete, all custom operations defined in the pipeline components have been performed, integrating seamlessly with spaCy's built-in processing.
Consider this example where we use a pipeline component to calculate the sentence length for each `Token`:
```python
import spacy
from spacy.language import Language
from spacy.tokens import Token
from nltk.corpus import reuters

nlp = spacy.load("en_core_web_sm")

# default=None makes the attribute writable; add force=True if
# 'sentence_len' was already registered earlier in the session
Token.set_extension('sentence_len', default=None)

@Language.component("sentence_len_component")
def sentence_len_component(doc):
    for sent in doc.sents:
        sent_len = len(sent)
        for token in sent:
            token._.set('sentence_len', sent_len)
    return doc

nlp.add_pipe("sentence_len_component", last=True)

texts = reuters.raw(categories=['crude', 'coffee', 'gold'])[0:5000]
doc = nlp(texts)
```
The `sentence_len_component` calculates the length of each sentence in the `Doc` and associates this length with every `Token` within that sentence. This is achieved through `token._.set('sentence_len', sent_len)`, which dynamically assigns the computed sentence length to each `Token`.
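As an aside, `token._.set(...)` is equivalent to assigning through the underscore namespace directly; a minimal sketch of the alternative form:

```python
# Equivalent to token._.set('sentence_len', sent_len): custom
# attributes registered with a default are writable via direct
# assignment on the underscore namespace
for sent in doc.sents:
    for token in sent:
        token._.sentence_len = len(sent)
```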
The `Language.component` decorator registers our function as a custom pipeline component. Defining the component is only the first step; the `add_pipe` method integrates it into the spaCy NLP pipeline. The first argument of `add_pipe` names the component to incorporate; for our scenario, that's `"sentence_len_component"`. The `last=True` parameter ensures this component is appended at the end of the pipeline. Positioning it last matters because our component iterates over `doc.sents`, which requires the sentence boundaries set by earlier components such as the parser; running last guarantees that all preceding analyses and modifications have been completed on the `Doc` before our component reads it.
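You can confirm the component's position by inspecting `nlp.pipe_names`, which lists the components in execution order. The built-in names shown below are typical for `en_core_web_sm` 3.x and may vary by model version:

```python
# Inspect the pipeline order after adding our component
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler',
#       'lemmatizer', 'ner', 'sentence_len_component']
```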
With this arrangement, each `Token` is now enriched with data about the length of its sentence, providing a deeper level of linguistic insight for our analyses. Such enhancements to the spaCy NLP pipeline facilitate the creation of more sophisticated and tailored natural language processing applications.
To retrieve the additional information assigned by our `sentence_len_component` to each token, you can use:
```python
# Accessing sentence length information from the pipeline component
for token in doc:
    print(f"{token.text}: {token._.sentence_len}")

# Example output
# JAPAN: 51
# TO: 51
# REVISE: 51
# LONG: 51
```
This retrieves the dynamically assigned sentence length for each token, showcasing how to interact with data generated by pipeline components.
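Because the component runs for every document the pipeline processes, the attribute is also populated during batch processing. Here is a minimal sketch, assuming the pipeline configured above (the sample sentences are our own):

```python
# The component also runs during batch processing with nlp.pipe
sample_texts = ["Gold prices rose sharply.", "Coffee exports fell."]
for d in nlp.pipe(sample_texts):
    first = d[0]
    print(f"{first.text}: {first._.sentence_len}")
```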
In this lesson, you've learned two different ways to add custom linguistic features to the spaCy NLP pipeline: using a getter or a pipeline component. In each case, you also saw how to access the custom extension immediately after introducing it, reinforcing the concepts through immediate application.
In the upcoming practice exercises, you will have the chance to reinforce these concepts by applying them to real-life scenarios. Through practice, you'll gain a deeper understanding of how these components can enhance your NLP tasks. Unlock the potential of custom pipeline components in spaCy and revolutionize your linguistic analysis!