Welcome! In this lesson, we are going to get hands-on with the concept of Semantic Similarity in Natural Language Processing (NLP).
In NLP, semantic similarity is the task of determining how similar two pieces of text are in meaning. This is useful in numerous applications: a search engine can understand that a query for "canine" should also surface results about "dog", and more complex tasks such as automatic text summarization rely on it as well. Semantic similarity is usually expressed as a number, where values close to 1 indicate high similarity and values close to 0 indicate low similarity.
Before we dive into the code, it's important to understand a fundamental concept: Word Vectors.
A word vector is a numeric representation of a word that captures its relationship to other words. Each word is mapped to a vector in a pre-defined, finite-dimensional vector space, where each dimension corresponds to a specific feature. Words that share common contexts in the corpus are positioned close to one another in that space.
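To make this concrete, here is a small sketch using made-up three-dimensional vectors; the words and values below are purely illustrative, not taken from any real model (real spaCy vectors have hundreds of dimensions). Similarity between word vectors is typically measured with cosine similarity:

```python
import math

# Toy 3-dimensional "word vectors" -- illustrative values only,
# not the vectors from any actual pre-trained model.
vectors = {
    "dog":    [0.9, 0.1, 0.3],
    "canine": [0.8, 0.2, 0.3],
    "banana": [0.1, 0.9, 0.7],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(vectors["dog"], vectors["canine"]))  # close to 1
print(cosine_similarity(vectors["dog"], vectors["banana"]))  # much lower
```

Because "dog" and "canine" point in nearly the same direction in this toy space, their cosine similarity is close to 1, while "dog" and "banana" score much lower.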
In this lesson, we are using the en_core_web_md model, a medium-sized English model that includes word vectors. It is pre-trained and ready to use with spaCy.
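If you are following along locally, the model and the corpus used below each need a one-time download (this assumes spaCy and NLTK are already installed):

```shell
# Download the medium English model with word vectors (one-time setup)
python -m spacy download en_core_web_md

# Download the Reuters corpus for NLTK (used later in this lesson)
python -c "import nltk; nltk.download('reuters')"
```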
Now that we have a grasp of the underlying concepts, let's take a look at our example code.
First, we import the necessary libraries:
```python
import spacy
from nltk.corpus import reuters
```
We then load the pre-trained model using spaCy:
```python
nlp = spacy.load("en_core_web_md")
```
We then take the first document from the reuters corpus, create a spaCy document object from its raw text, and get a list of all sentences in that document:
```python
doc_text = reuters.raw(reuters.fileids()[0])
doc = nlp(doc_text)
sentences = list(doc.sents)
```
Next, we calculate and print the semantic similarity between the sixth and fourteenth sentence, and between the second and thirteenth sentence of the document.
```python
print('Sixth sentence:')
print(sentences[5])
print('Fourteenth sentence')
print(sentences[13])
similarity = sentences[5].similarity(sentences[13])
print('Similarity score:', similarity, '\n')

print('Second sentence:')
print(sentences[1])
print('Thirteenth sentence')
print(sentences[12])
similarity = sentences[1].similarity(sentences[12])
print('Similarity score:', similarity, '\n')
```
Let's delve into the output produced by our code to better comprehend the semantic similarity analysis in action.
Our code initially displays the specific sentences we're comparing, giving us context for the similarity scores:
```text
Sixth sentence:
"We wouldn't be able to do business," said a spokesman for
  leading Japanese electronics firm Matsushita Electric
  Industrial Co Ltd <MC.T>.

Fourteenth sentence
Last year South Korea had a trade surplus of 7.1 billion
  dlrs with the U.S., Up from 4.9 billion dlrs in 1985.

Similarity score: 0.4780203700065613
```
Here we see a relatively low similarity score of about 0.478. It suggests that these sentences, while both related to business, discuss quite different topics: one reports a direct statement from a corporate spokesperson, while the other cites trade statistics between countries.
```text
Second sentence:
They told Reuter correspondents in Asian capitals a U.S.
  move against Japan might boost protectionist sentiment in the
  U.S. and lead to curbs on American imports of their products.

Thirteenth sentence
A senior official of South Korea's trade promotion
  association said the trade dispute between the U.S. And Japan
  might also lead to pressure on South Korea, whose chief exports
  are similar to those of Japan.

Similarity score: 0.9431586861610413
```
In contrast, the similarity score of 0.9432 between the second and thirteenth sentences indicates a high level of semantic similarity. This score underscores the shared themes in the narratives regarding the U.S. and Japan trade dynamics, specifically highlighting potential consequences on exports and international relations.
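Under the hood, spaCy's similarity for spans is essentially cosine similarity between the averaged word vectors of each span's tokens. A minimal pure-Python sketch of that idea, using a tiny made-up vocabulary of two-dimensional vectors (the words and values are illustrative only, not real model vectors):

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Made-up 2-dimensional vectors for a toy vocabulary (illustrative only).
vocab = {
    "trade":   [0.9, 0.2],
    "surplus": [0.8, 0.3],
    "dispute": [0.7, 0.4],
    "banana":  [0.1, 0.9],
    "yellow":  [0.2, 0.8],
}

def sentence_similarity(words_a, words_b):
    """Cosine similarity between the averaged vectors of two 'sentences'."""
    return cosine(mean_vector([vocab[w] for w in words_a]),
                  mean_vector([vocab[w] for w in words_b]))

print(sentence_similarity(["trade", "surplus"], ["trade", "dispute"]))  # high
print(sentence_similarity(["trade", "surplus"], ["banana", "yellow"]))  # lower
```

This averaging step is also why very long spans can score deceptively high: averaging many vectors washes out the contribution of individual words.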
These examples show how scores between 0 and 1 quantify the semantic similarity of sentence pairs, letting us assess how related two pieces of text are in meaning. Comparisons like these reveal nuanced relationships between textual elements and underpin many NLP applications, providing a solid foundation for further exploration of semantic analysis.
Great work! You have now learned and practiced how to work with semantics in NLP, particularly how to calculate semantic similarity. This is a critical skill for many advanced NLP tasks and will serve as a foundation for more complex ones, such as entity recognition and linking, which we will explore in future lessons.
In the next exercises, you will have the opportunity to apply the knowledge you have gained today and improve your semantic similarity estimation skills. As always, continue to experiment, as hands-on practice is the best teacher. See you in the next lesson!