Welcome to this lesson on expanding the Natural Language Processing (NLP) pipeline with custom components using the spaCy library. Today, we're going to focus on adding extensions in two ways: using a getter that computes results on demand, or using a pipeline component that precomputes them during processing. You'll learn when to use each method and practice creating meaningful custom components.
Extensions in spaCy are an efficient and flexible system for adding extra functionality to the built-in `Doc`, `Token`, and `Span` objects, as well as some other classes such as `Language` and `Vocab`. They can be used to add more information to a `Token`, for example, the length of the sentence where the token is found.
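Since the same mechanism works on `Doc` and `Span` as well, here is a minimal sketch of a `Doc`-level getter. The attribute name `num_sentences` is our own illustration, not part of this lesson's required code:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

# Hypothetical Doc-level getter: the number of sentences in the document
Doc.set_extension("num_sentences", getter=lambda doc: len(list(doc.sents)))

doc = nlp("Extensions are flexible. They work on Doc, Token, and Span objects.")
print(doc._.num_sentences)  # 2
```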
Getter-based extensions are recommended when the attribute computation is straightforward, efficient, and depends only on the individual `Token` instance. Such extensions are computed dynamically at the time of access, ensuring up-to-date, context-specific information without upfront computational overhead.
Let's look at an example of a getter-based extension using the Reuters text corpus:
```python
import spacy
from spacy.tokens import Token
from nltk.corpus import reuters  # requires nltk.download('reuters') on first use

nlp = spacy.load("en_core_web_sm")

# Adding an extension with a getter
Token.set_extension("sentence_len", getter=lambda token: len(token.sent))

texts = reuters.raw(categories=['crude', 'coffee', 'gold'])[0:5000]
doc = nlp(texts)
```
Here, we created a simple getter-based extension that computes the length of the sentence in which each token appears. To access the `sentence_len` extension for each token, use the following approach:
```python
# Accessing sentence length information
for token in doc:
    print(f"{token.text}: {token._.sentence_len}")

# Example output
# JAPAN: 51
# TO: 51
# REVISE: 51
# LONG: 51
```
Now, let's add a more linguistically meaningful extension that computes a simple linguistic feature. Consider phonetic similarity between words. For the sake of simplicity, we'll create a phonetic key consisting of the word's first two consonants or, if it has fewer than two, its first two characters. Remember, real phonetic comparison (for example, algorithms such as Soundex or Metaphone) would be much more elaborate and language-dependent.
```python
def get_phonetic_key(token):
    # First two non-vowel characters, or the first two characters
    # of the word if it has fewer than two consonants
    non_vowels = [ch for ch in token.text.lower() if ch not in 'aeiou']
    return ''.join(non_vowels[:2]) if len(non_vowels) > 1 else token.text.lower()[:2]

Token.set_extension('phonetic_key', getter=get_phonetic_key)

texts = reuters.raw(categories=['crude', 'coffee', 'gold'])[0:5000]
doc = nlp(texts)
```
After creating the phonetic key extension, you can access it for each token as follows:
```python
# Accessing the phonetic_key extension
for token in doc:
    print(f"{token.text}: {token._.phonetic_key}")

# Example output
# JAPAN: jp
# TO: to
# REVISE: rv
# LONG: ln
```
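To see the key put to use for its stated purpose, here is a minimal sketch (our own addition, assuming the `doc` and `phonetic_key` extension from the snippets above) that groups distinct words sharing the same key:

```python
from collections import defaultdict

# Group distinct words by phonetic key to find words that
# "sound alike" under our simplified scheme
groups = defaultdict(set)
for token in doc:
    if token.is_alpha:
        groups[token._.phonetic_key].add(token.text.lower())

# Print keys shared by more than one distinct word
for key, words in groups.items():
    if len(words) > 1:
        print(f"{key}: {sorted(words)}")
```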
This segment demonstrates how to use getter-based extensions effectively and retrieve the additional data they provide.
In spaCy, a pipeline component is a function that is handed a `Doc` object, performs a specific operation on it, and then returns the modified `Doc`. Pipeline components are useful for adding more complex extensions, especially when the computation is intricate or depends on other tokens. Each component is executed automatically as part of the spaCy pipeline when we run `nlp` on a text: the components are invoked in sequence, so the modifications or enhancements implemented in each one are computed and applied to the `Doc` object as it passes through. By the time processing is complete, all custom operations defined in the pipeline components have been performed, integrating seamlessly with spaCy's built-in processing.
Consider this example where we use a pipeline component to calculate the sentence length for each `Token`:
```python
import spacy
from spacy.language import Language
from spacy.tokens import Token
from nltk.corpus import reuters

nlp = spacy.load("en_core_web_sm")

# default=None makes the attribute writable; add force=True if
# 'sentence_len' was already registered earlier in the session
Token.set_extension('sentence_len', default=None)

@Language.component("sentence_len_component")
def sentence_len_component(doc):
    for sent in doc.sents:
        sent_len = len(sent)
        for token in sent:
            token._.set('sentence_len', sent_len)
    return doc

nlp.add_pipe("sentence_len_component", last=True)

texts = reuters.raw(categories=['crude', 'coffee', 'gold'])[0:5000]
doc = nlp(texts)
```
The `sentence_len_component` calculates the length of each sentence in the `Doc` and associates this length with every `Token` within that sentence. This is achieved through `token._.set('sentence_len', sent_len)`, which dynamically assigns the computed sentence length to each `Token`.
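As an aside, `token._.set(...)` is equivalent to assigning through the underscore namespace directly; a minimal sketch of the alternative form:

```python
# Equivalent to token._.set('sentence_len', sent_len): custom
# attributes registered with a default are writable via direct
# assignment on the underscore namespace
for sent in doc.sents:
    for token in sent:
        token._.sentence_len = len(sent)
```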
The `Language.component` decorator registers our function as a custom pipeline component. Defining the component is only the first step; the `add_pipe` method integrates it into the spaCy NLP pipeline. The first argument of `add_pipe` names the component to incorporate; for our scenario, that's `"sentence_len_component"`. The `last=True` parameter ensures this component is appended at the end of the pipeline. Positioning it last matters because our component iterates over `doc.sents`, which requires the sentence boundaries set by earlier components such as the parser; running last guarantees that all preceding analyses and modifications have been completed on the `Doc` before our component reads it.
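You can confirm the component's position by inspecting `nlp.pipe_names`, which lists the components in execution order. The built-in names shown below are typical for `en_core_web_sm` 3.x and may vary by model version:

```python
# Inspect the pipeline order after adding our component
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler',
#       'lemmatizer', 'ner', 'sentence_len_component']
```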
With this arrangement, each `Token` is now enriched with data about the length of its sentence, providing a deeper level of linguistic insight for our analyses. Such enhancements to the spaCy NLP pipeline facilitate the creation of more sophisticated and tailored natural language processing applications.
To retrieve the additional information assigned by our `sentence_len_component` to each token, you can use:
```python
# Accessing sentence length information from the pipeline component
for token in doc:
    print(f"{token.text}: {token._.sentence_len}")

# Example output
# JAPAN: 51
# TO: 51
# REVISE: 51
# LONG: 51
```
This retrieves the dynamically assigned sentence length for each token, showcasing how to interact with data generated by pipeline components.
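Because the component runs for every document the pipeline processes, the attribute is also populated during batch processing. Here is a minimal sketch, assuming the pipeline configured above (the sample sentences are our own):

```python
# The component also runs during batch processing with nlp.pipe
sample_texts = ["Gold prices rose sharply.", "Coffee exports fell."]
for d in nlp.pipe(sample_texts):
    first = d[0]
    print(f"{first.text}: {first._.sentence_len}")
```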
In this lesson, you've learned two different ways to add custom linguistic features to the spaCy NLP pipeline: using a getter or a pipeline component. In each case, you also saw how to access the custom extension immediately after introducing it, reinforcing the concepts through immediate application.
In the upcoming practice exercises, you will have the chance to reinforce these concepts by applying them to real-life scenarios. Through practice, you'll gain a deeper understanding of how these components can enhance your NLP tasks. Unlock the potential of custom pipeline components in spaCy and revolutionize your linguistic analysis!