Hello, welcome to the next step in your linguistic journey! In today's lesson, we'll expand upon your foundational linguistic knowledge and step into the world of syntactic dependencies and token shapes. This knowledge will equip you to delve even deeper into the fascinating realm of Natural Language Processing (NLP).
The first stop in our journey is syntactic dependencies. So, what are they? Simply put, syntactic dependencies are the grammatical relationships between words in a sentence. This could be a subject-verb relationship, an adjective-noun relationship, or other types of grammatical relations. Why are they important in NLP? They help us understand how words relate to each other and how they come together to convey meaning in a sentence.
In Python, with the help of spaCy, we can extract these dependencies easily. Let's take a look at how to do this with a sample text from the Reuters corpus.
```python
import spacy
from nltk.corpus import reuters  # note: requires nltk.download('reuters') once

# Load the English NLP model
nlp = spacy.load('en_core_web_sm')

# Take a sample text from the Reuters corpus
sample_text = reuters.raw(reuters.fileids()[0])

# Pass the text to the nlp object
doc = nlp(sample_text)

# Print each token, its dependency label, and its head
print('\nSyntactic Dependencies:\n')
for token in doc:
    print(f"{token.text:<10s} {token.dep_:<10s} {token.head.text:<10s}")
```
In each line of output, the first column is the token, the second is its syntactic dependency label, and the third is the token's head. The head of a token is the word that governs it grammatically. This simple code gives us deep insight into the grammatical structure of the text!
Alright, let's take a concrete look at the potential output our syntactic dependencies code could produce.
```text
ASIAN      compound   EXPORTERS
EXPORTERS  nsubj      FEAR
FEAR       ccomp      said
DAMAGE     nsubj      raised
FROM       prep       DAMAGE
U.S.-JAPAN compound   friction
```
Even at first glance, we can already start to see patterns and relationships emerge from this data. However, to truly gain insights, we must understand what these output values mean:
- `ASIAN`: Here, "ASIAN" has a `compound` dependency type. A compound relationship is formed when two nouns come together to form a new noun, such as "ASIAN EXPORTERS".
- `EXPORTERS`: The nominal subject (`nsubj`) of the verb "FEAR" is "EXPORTERS". The nominal subject is typically the "doer" of the action and corresponds to "who" or "what" in the sentence.
- `FEAR`: The `ccomp` label stands for clausal complement. These complements are subclauses that provide additional information but usually can't stand alone as separate sentences.
- `DAMAGE`: It is the nominal subject (`nsubj`) of the verb "raised".
- `FROM`: Labeled `prep`, which stands for preposition, "FROM" attaches to its head "DAMAGE" and introduces the prepositional phrase that follows.
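To make the head relation concrete, here is a minimal pure-Python sketch (no spaCy needed) that inverts the (token, dependency, head) triples from the output above into a head-to-dependents mapping, the same relation spaCy exposes through `token.children`:

```python
from collections import defaultdict

# (token, dependency label, head) triples, as printed by the loop above
triples = [
    ("ASIAN", "compound", "EXPORTERS"),
    ("EXPORTERS", "nsubj", "FEAR"),
    ("FEAR", "ccomp", "said"),
    ("DAMAGE", "nsubj", "raised"),
    ("FROM", "prep", "DAMAGE"),
]

# Invert the relation: map each head to the tokens it governs
children = defaultdict(list)
for token, dep, head in triples:
    children[head].append((token, dep))

print(children["EXPORTERS"])  # → [('ASIAN', 'compound')]
print(children["FEAR"])       # → [('EXPORTERS', 'nsubj')]
```

Reading the mapping top-down from the root recovers the whole dependency tree of the sentence.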
Besides these, there are different types of dependencies that you might encounter as well:
- `relcl`: Stands for relative clause modifier. Relative clauses use words like "who" or "which" to provide more detail about a noun.
- `dobj`: Denotes the direct object, the noun or noun phrase receiving the action in the sentence.
- `ROOT`: The main verb of the sentence, to which all other words are connected either directly or indirectly.
- `nsubjpass`: The nominal subject of a passive sentence; in such sentences, the subject receives the action of the verb.
- `pobj`: Stands for object of a preposition, usually the noun coming after the preposition in the sentence.
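To see several of these labels working together, here is a small sketch that walks a hand-written (token, dep, head) list for a toy sentence, finds the `ROOT`, and reads off its subject and direct object. The parse triples are invented for illustration, not produced by spaCy:

```python
# Hand-written (token, dep, head) triples for the toy sentence
# "The cat chased the mouse." -- invented for illustration.
parse = [
    ("The", "det", "cat"),
    ("cat", "nsubj", "chased"),
    ("chased", "ROOT", "chased"),  # spaCy points the root's head at itself
    ("the", "det", "mouse"),
    ("mouse", "dobj", "chased"),
]

# The ROOT is the main verb; the subject and object hang off it directly
root = next(tok for tok, dep, _ in parse if dep == "ROOT")
subj = next(tok for tok, dep, head in parse if dep == "nsubj" and head == root)
obj = next(tok for tok, dep, head in parse if dep == "dobj" and head == root)

print(subj, root, obj)  # → cat chased mouse
```

This subject-verb-object extraction pattern is a common first step in relation extraction pipelines.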
Finally, remember that understanding these dependencies is vital if you want to dive deeply into the grammatical structure and meaning of a sentence. Now that we've dissected syntactic dependencies output, let's move on to our next interesting segment - the exploration of token shapes.
The next concept we'll explore is token shapes. A token shape is a type of transformation applied to the string representation of a token to provide a description of its orthographic structure — in other words, its shape focuses on the form of characters rather than their actual content.
Here's how the transformation works:
- Alphabetic characters are replaced by `x` or `X`: lowercase characters become `x` and uppercase characters become `X`.
- Numeric characters are replaced by `d`.
- Sequences of the same character are truncated after length 4.
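The rules above can be sketched as a small pure-Python function. This is a simplified approximation of what spaCy computes for `token.shape_`, assuming plain ASCII input:

```python
def word_shape(text):
    """Approximate a token's shape: x/X for letters, d for digits,
    punctuation kept as-is, runs of the same shape character capped at 4."""
    shape = []
    last_char = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation passes through unchanged
        if shape_char == last_char:
            run += 1
        else:
            run = 0
            last_char = shape_char
        if run < 4:  # drop repeats beyond a run of four
            shape.append(shape_char)
    return "".join(shape)

print(word_shape("Python"))  # → Xxxxx
print(word_shape("12"))      # → dd
print(word_shape("'s"))      # → 'x
```

Note how "Python" keeps its uppercase `X` but the run of five lowercase letters is capped at four.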
For example, a word like "Python" has an initial uppercase letter followed by five lowercase letters; the lowercase run is truncated to four, so it gets transformed to "Xxxxx".
Let's see how to get these token shapes using our example text:
```python
print('\nToken Shapes:\n')
for token in doc[25:]:  # look at a slice of tokens starting at position 25
    print(f"{token.text:<10s} {token.shape_:<10s}")
```
When put to work, token shapes can provide valuable insights. You may realize, for instance, that uppercase words are typically proper nouns, and digits represent numerical values, among other patterns.
Looking at the output produced by our code:
```text
seven      xxxx
and        xxx
12         dd
pct        xxx
of         xx
China      Xxxxx
's         'x
```
Here's how to interpret these shapes:
- `seven`: The shape `xxxx` conveys that "seven" is composed of lowercase letters, hinting at its alphabetic nature without indicating specific letters, which helps in analyzing text patterns while abstracting away the details. Note that the shape was truncated to 4 characters.
- `and`: With a shape of `xxx`, this indicates that "and" consists of three lowercase letters. This distinct shape aids in recognizing small, commonly used words in analyses.
- `12`: Represented as `dd`, it clearly illustrates that "12" is a numeric token consisting of two digits. This differentiation is vital for tasks that require numeric value processing or identification.
- `pct`: The token "pct" is shown with a shape `xxx`, indicating three lowercase letters.
- `of`: Its shape `xx` succinctly reflects that "of" is a short, two-letter word, all in lowercase. Recognizing such functional tokens is crucial for understanding the grammatical structure of sentences.
- `China`: The shape `Xxxxx` signals that "China" starts with an uppercase letter followed by lowercase letters, a characteristic feature of proper nouns. This insight is fundamental for tasks like Named Entity Recognition, as it distinguishes proper nouns from other text elements.
- `'s`: With a shape of `'x`, this combination suggests a punctuation mark followed by a lowercase letter, a common feature in possessive constructions or contractions. Identifying these constructions is essential for parsing and understanding sentence structures.
With this understanding of token shapes, you can now integrate this intelligence into your NLP tasks, yielding even more insightful results!
Now that we've extracted syntactic dependencies and token shapes from our text, let's take a moment to reflect on the insights that these features offer. First, the syntactic dependencies give us a good understanding of the grammatical structure of the text. This can be extremely helpful when we're trying to parse sentences and understand the relationships between words.
On the other hand, token shapes allow us to observe patterns in the structure of words. This can be especially useful in tasks such as spam detection, where certain patterns of words or characters might be more common.
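As a toy sketch of that idea, one might flag tokens whose shape is all uppercase or mixes digits with letters, two patterns sometimes over-represented in spam. The tokens, shapes, and rules below are invented for illustration; the shapes are written out as `token.shape_` would produce them:

```python
def looks_suspicious(shape):
    """Flag all-uppercase shapes, or shapes mixing digits with letters."""
    all_caps = set(shape) == {"X"} and len(shape) > 1
    mixed = "d" in shape and ("x" in shape or "X" in shape)
    return all_caps or mixed

# Hypothetical token -> shape pairs (shapes as spaCy's token.shape_ yields)
shapes = {"FREE": "XXXX", "v1agra": "xdxxxx", "China": "Xxxxx", "of": "xx"}
flagged = [tok for tok, sh in shapes.items() if looks_suspicious(sh)]
print(flagged)  # → ['FREE', 'v1agra']
```

In a real system, shape patterns like these would be one feature among many fed to a classifier, not a rule applied on their own.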
On the whole, understanding these linguistic features provides us with a deeper understanding of text, equipping us to perform more nuanced analyses.
Congratulations on completing this detailed journey into syntactic dependencies and token shapes! You've not only learned what these concepts are, but have also extracted them from a text using Python and spaCy. Remember, linguistics is at the heart of Natural Language Processing, and understanding these features will stand you in good stead for more advanced tasks in this field.
In the upcoming practice exercises, you'll have the opportunity to apply these concepts to various texts. This practice will solidify your understanding and prepare you for the next lesson, where we'll explore the intricacies of semantics in NLP. Happy learning!