Welcome! In this lesson, we are going to get hands-on with the concept of Semantic Similarity in Natural Language Processing (NLP).
In NLP, semantic similarity is the task of determining how similar two pieces of text are in meaning. This is useful in numerous applications: a search engine can understand that a query for "canine" should also surface results about "dog", and more complex tasks such as automatic text summarization rely on it as well. Semantic similarity is usually expressed as a number, where values close to 1 indicate high similarity and values close to 0 indicate low similarity.
Before we dive into the code, it's important to understand a fundamental concept: Word Vectors.
A word vector is a numeric representation of a word that captures its relationship to other words. Each word is mapped to a vector in a pre-defined, finite-dimensional vector space, where each dimension corresponds to a specific feature. Words that share common contexts in the corpus are positioned close to one another in that space.
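To make this concrete, here is a small sketch using made-up three-dimensional vectors; the words and values below are purely illustrative, not taken from any real model (real spaCy vectors have hundreds of dimensions). Similarity between word vectors is typically measured with cosine similarity:

```python
import math

# Toy 3-dimensional "word vectors" -- illustrative values only,
# not the vectors from any actual pre-trained model.
vectors = {
    "dog":    [0.9, 0.1, 0.3],
    "canine": [0.8, 0.2, 0.3],
    "banana": [0.1, 0.9, 0.7],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(vectors["dog"], vectors["canine"]))  # close to 1
print(cosine_similarity(vectors["dog"], vectors["banana"]))  # much lower
```

Because "dog" and "canine" point in nearly the same direction in this toy space, their cosine similarity is close to 1, while "dog" and "banana" score much lower.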
In this lesson, we are using the en_core_web_md model, a medium-sized English model that includes word vectors. It is pre-trained and ready to use with spaCy.
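If you are following along locally, the model and the corpus used below each need a one-time download (this assumes spaCy and NLTK are already installed):

```shell
# Download the medium English model with word vectors (one-time setup)
python -m spacy download en_core_web_md

# Download the Reuters corpus for NLTK (used later in this lesson)
python -c "import nltk; nltk.download('reuters')"
```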
Now that we have a grasp of the underlying concepts, let's take a look at our example code.
First, we import the necessary libraries:
```python
import spacy
from nltk.corpus import reuters
```
We then load the pre-trained model using spaCy:
```python
nlp = spacy.load("en_core_web_md")
```
We then take the first document from the reuters corpus, create a spaCy document object from its raw text, and get a list of all sentences in that document:
```python
doc_text = reuters.raw(reuters.fileids()[0])
doc = nlp(doc_text)
sentences = list(doc.sents)
```
Next, we calculate and print the semantic similarity between the sixth and fourteenth sentence, and between the second and thirteenth sentence of the document.
```python
print('Sixth sentence:')
print(sentences[5])
print('Fourteenth sentence')
print(sentences[13])
similarity = sentences[5].similarity(sentences[13])
print('Similarity score:', similarity, '\n')

print('Second sentence:')
print(sentences[1])
print('Thirteenth sentence')
print(sentences[12])
similarity = sentences[1].similarity(sentences[12])
print('Similarity score:', similarity, '\n')
```
Let's delve into the output produced by our code to better comprehend the semantic similarity analysis in action.
Our code initially displays the specific sentences we're comparing, giving us context for the similarity scores:
```text
Sixth sentence:
"We wouldn't be able to do business," said a spokesman for
  leading Japanese electronics firm Matsushita Electric
  Industrial Co Ltd <MC.T>.

Fourteenth sentence
Last year South Korea had a trade surplus of 7.1 billion
  dlrs with the U.S., Up from 4.9 billion dlrs in 1985.

Similarity score: 0.4780203700065613
```
Here we see a relatively low similarity score of about 0.478. It suggests that these sentences, while both related to business, discuss quite different topics: one reports a direct statement from a corporate spokesperson, while the other cites trade statistics between countries.
```text
Second sentence:
They told Reuter correspondents in Asian capitals a U.S.
  move against Japan might boost protectionist sentiment in the
  U.S. and lead to curbs on American imports of their products.

Thirteenth sentence
A senior official of South Korea's trade promotion
  association said the trade dispute between the U.S. And Japan
  might also lead to pressure on South Korea, whose chief exports
  are similar to those of Japan.

Similarity score: 0.9431586861610413
```
In contrast, the similarity score of 0.9432 between the second and thirteenth sentences indicates a high level of semantic similarity. This score underscores the shared themes in the narratives regarding the U.S. and Japan trade dynamics, specifically highlighting potential consequences on exports and international relations.
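Under the hood, spaCy's similarity for spans is essentially cosine similarity between the averaged word vectors of each span's tokens. A minimal pure-Python sketch of that idea, using a tiny made-up vocabulary of two-dimensional vectors (the words and values are illustrative only, not real model vectors):

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Made-up 2-dimensional vectors for a toy vocabulary (illustrative only).
vocab = {
    "trade":   [0.9, 0.2],
    "surplus": [0.8, 0.3],
    "dispute": [0.7, 0.4],
    "banana":  [0.1, 0.9],
    "yellow":  [0.2, 0.8],
}

def sentence_similarity(words_a, words_b):
    """Cosine similarity between the averaged vectors of two 'sentences'."""
    return cosine(mean_vector([vocab[w] for w in words_a]),
                  mean_vector([vocab[w] for w in words_b]))

print(sentence_similarity(["trade", "surplus"], ["trade", "dispute"]))  # high
print(sentence_similarity(["trade", "surplus"], ["banana", "yellow"]))  # lower
```

This averaging step is also why very long spans can score deceptively high: averaging many vectors washes out the contribution of individual words.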
These examples show how scores between 0 and 1 quantify the semantic similarity of sentence pairs, letting us assess how related two pieces of text are in meaning. Comparisons like these reveal nuanced relationships between textual elements and underpin many NLP applications, providing a solid foundation for further exploration of semantic analysis.
Great work! You have now learned and practiced how to work with semantics in NLP, particularly how to calculate semantic similarity. This is a critical skill for many advanced NLP tasks and will serve as a foundation for more complex ones, such as entity recognition and linking, which we will explore in future lessons.
In the next exercises, you will have the opportunity to apply the knowledge you have gained today and improve your semantic similarity estimation skills. As always, continue to experiment, as hands-on practice is the best teacher. See you in the next lesson!