Hello, welcome to the next step in your linguistic journey! In today's lesson, we'll expand upon your foundational linguistic knowledge and step into the world of syntactic dependencies and token shapes. This knowledge will equip you to delve even deeper into the fascinating realm of Natural Language Processing (NLP).
The first stop in our journey is syntactic dependencies. So, what are they? Simply put, syntactic dependencies are the grammatical relationships between words in a sentence. This could be a subject-verb relationship, an adjective-noun relationship, or other types of grammatical relations. Why are they important in NLP? They help us understand how words relate to each other and how they come together to convey meaning in a sentence.
In Python, with the help of spaCy, we can extract these dependencies easily. Let's take a look at how to do this with a sample text from the Reuters corpus.
```python
import spacy
from nltk.corpus import reuters  # note: requires nltk.download('reuters') once

# Load the English NLP model
nlp = spacy.load('en_core_web_sm')

# Take a sample text from the Reuters corpus
sample_text = reuters.raw(reuters.fileids()[0])

# Pass the text to the nlp object
doc = nlp(sample_text)

# Print each token, its dependency label, and its head
print('\nSyntactic Dependencies:\n')
for token in doc:
    print(f"{token.text:<10s} {token.dep_:<10s} {token.head.text:<10s}")
```
In each line of output, the first column is the token, the second is its syntactic dependency label, and the third is the token's head. The head of a token is the word that governs it grammatically. This simple code gives us deep insight into the grammatical structure of the text!
Alright, let's take a concrete look at the potential output our syntactic dependencies code could produce.
```text
ASIAN      compound   EXPORTERS
EXPORTERS  nsubj      FEAR
FEAR       ccomp      said
DAMAGE     nsubj      raised
FROM       prep       DAMAGE
U.S.-JAPAN compound   friction
```
Even at first glance, we can already start to see patterns and relationships emerge from this data. However, to truly gain insights, we must understand what these output values mean:
- `ASIAN`: Here, "ASIAN" has a `compound` dependency type. A compound relationship is formed when two nouns come together to form a new noun, such as "ASIAN EXPORTERS".
- `EXPORTERS`: The nominal subject (`nsubj`) of the verb "FEAR" is "EXPORTERS". The nominal subject is typically the "doer" of the action and corresponds to "who" or "what" in the sentence.
- `FEAR`: The `ccomp` label stands for clausal complement. These complements are subclauses that provide additional information but usually can't stand alone as separate sentences.
- `DAMAGE`: It is the nominal subject (`nsubj`) of the verb "raised".
- `FROM`: Labeled `prep`, which stands for preposition, "FROM" attaches to its head "DAMAGE" and introduces the prepositional phrase that follows.
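To make the head relation concrete, here is a minimal pure-Python sketch (no spaCy needed) that inverts the (token, dependency, head) triples from the output above into a head-to-dependents mapping, the same relation spaCy exposes through `token.children`:

```python
from collections import defaultdict

# (token, dependency label, head) triples, as printed by the loop above
triples = [
    ("ASIAN", "compound", "EXPORTERS"),
    ("EXPORTERS", "nsubj", "FEAR"),
    ("FEAR", "ccomp", "said"),
    ("DAMAGE", "nsubj", "raised"),
    ("FROM", "prep", "DAMAGE"),
]

# Invert the relation: map each head to the tokens it governs
children = defaultdict(list)
for token, dep, head in triples:
    children[head].append((token, dep))

print(children["EXPORTERS"])  # → [('ASIAN', 'compound')]
print(children["FEAR"])       # → [('EXPORTERS', 'nsubj')]
```

Reading the mapping top-down from the root recovers the whole dependency tree of the sentence.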
Besides these, there are different types of dependencies that you might encounter as well:
- `relcl`: Stands for relative clause modifier. Relative clauses use words like "who" or "which" to provide more detail about a noun.
- `dobj`: Denotes the direct object, the noun or noun phrase receiving the action in the sentence.
- `ROOT`: The main verb of the sentence, to which all other words are connected either directly or indirectly.
- `nsubjpass`: The nominal subject of a passive sentence; in such sentences, the subject receives the action of the verb.
- `pobj`: Stands for object of a preposition, usually the noun coming after the preposition in the sentence.
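To see several of these labels working together, here is a small sketch that walks a hand-written (token, dep, head) list for a toy sentence, finds the `ROOT`, and reads off its subject and direct object. The parse triples are invented for illustration, not produced by spaCy:

```python
# Hand-written (token, dep, head) triples for the toy sentence
# "The cat chased the mouse." -- invented for illustration.
parse = [
    ("The", "det", "cat"),
    ("cat", "nsubj", "chased"),
    ("chased", "ROOT", "chased"),  # spaCy points the root's head at itself
    ("the", "det", "mouse"),
    ("mouse", "dobj", "chased"),
]

# The ROOT is the main verb; the subject and object hang off it directly
root = next(tok for tok, dep, _ in parse if dep == "ROOT")
subj = next(tok for tok, dep, head in parse if dep == "nsubj" and head == root)
obj = next(tok for tok, dep, head in parse if dep == "dobj" and head == root)

print(subj, root, obj)  # → cat chased mouse
```

This subject-verb-object extraction pattern is a common first step in relation extraction pipelines.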
Finally, remember that understanding these dependencies is vital if you want to dive deeply into the grammatical structure and meaning of a sentence. Now that we've dissected syntactic dependencies output, let's move on to our next interesting segment - the exploration of token shapes.
The next concept we'll explore is token shapes. A token shape is a type of transformation applied to the string representation of a token to provide a description of its orthographic structure — in other words, its shape focuses on the form of characters rather than their actual content.
Here's how the transformation works:
- Alphabetic characters are replaced by `x` or `X`: lowercase characters become `x` and uppercase characters become `X`.
- Numeric characters are replaced by `d`.
- Sequences of the same character are truncated after length 4.
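The rules above can be sketched as a small pure-Python function. This is a simplified approximation of what spaCy computes for `token.shape_`, assuming plain ASCII input:

```python
def word_shape(text):
    """Approximate a token's shape: x/X for letters, d for digits,
    punctuation kept as-is, runs of the same shape character capped at 4."""
    shape = []
    last_char = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation passes through unchanged
        if shape_char == last_char:
            run += 1
        else:
            run = 0
            last_char = shape_char
        if run < 4:  # drop repeats beyond a run of four
            shape.append(shape_char)
    return "".join(shape)

print(word_shape("Python"))  # → Xxxxx
print(word_shape("12"))      # → dd
print(word_shape("'s"))      # → 'x
```

Note how "Python" keeps its uppercase `X` but the run of five lowercase letters is capped at four.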
For example, a word like "Python" has an initial uppercase letter followed by five lowercase letters; the lowercase run is truncated to four, so it gets transformed to "Xxxxx".
Let's see how to get these token shapes using our example text:
```python
print('\nToken Shapes:\n')
for token in doc[25:]:  # look at a slice of tokens starting at position 25
    print(f"{token.text:<10s} {token.shape_:<10s}")
```
When put to work, token shapes can provide valuable insights. You may realize, for instance, that uppercase words are typically proper nouns, and digits represent numerical values, among other patterns.
Looking at the output produced by our code:
```text
seven      xxxx
and        xxx
12         dd
pct        xxx
of         xx
China      Xxxxx
's         'x
```
Here's how to interpret these shapes:
- `seven`: The shape `xxxx` conveys that "seven" is composed of lowercase letters, hinting at its alphabetic nature without indicating specific letters, which helps in analyzing text patterns while abstracting away the details. Note that the shape was truncated to 4 characters.
- `and`: With a shape of `xxx`, this indicates that "and" consists of three lowercase letters. This distinct shape aids in recognizing small, commonly used words in analyses.
- `12`: Represented as `dd`, it clearly illustrates that "12" is a numeric token consisting of two digits. This differentiation is vital for tasks that require numeric value processing or identification.
- `pct`: The token "pct" is shown with a shape `xxx`, indicating three lowercase letters.
- `of`: Its shape `xx` succinctly reflects that "of" is a short, two-letter word, all in lowercase. Recognizing such functional tokens is crucial for understanding the grammatical structure of sentences.
- `China`: The shape `Xxxxx` signals that "China" starts with an uppercase letter followed by lowercase letters, a characteristic feature of proper nouns. This insight is fundamental for tasks like Named Entity Recognition, as it distinguishes proper nouns from other text elements.
- `'s`: With a shape of `'x`, this combination suggests a punctuation mark followed by a lowercase letter, a common feature in possessive constructions or contractions. Identifying these constructions is essential for parsing and understanding sentence structures.
With this understanding of token shapes, you can now integrate this intelligence into your NLP tasks, yielding even more insightful results!
Now that we've extracted syntactic dependencies and token shapes from our text, let's take a moment to reflect on the insights that these features offer. First, the syntactic dependencies give us a good understanding of the grammatical structure of the text. This can be extremely helpful when we're trying to parse sentences and understand the relationships between words.
On the other hand, token shapes allow us to observe patterns in the structure of words. This can be especially useful in tasks such as spam detection, where certain patterns of words or characters might be more common.
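As a toy sketch of that idea, one might flag tokens whose shape is all uppercase or mixes digits with letters, two patterns sometimes over-represented in spam. The tokens, shapes, and rules below are invented for illustration; the shapes are written out as `token.shape_` would produce them:

```python
def looks_suspicious(shape):
    """Flag all-uppercase shapes, or shapes mixing digits with letters."""
    all_caps = set(shape) == {"X"} and len(shape) > 1
    mixed = "d" in shape and ("x" in shape or "X" in shape)
    return all_caps or mixed

# Hypothetical token -> shape pairs (shapes as spaCy's token.shape_ yields)
shapes = {"FREE": "XXXX", "v1agra": "xdxxxx", "China": "Xxxxx", "of": "xx"}
flagged = [tok for tok, sh in shapes.items() if looks_suspicious(sh)]
print(flagged)  # → ['FREE', 'v1agra']
```

In a real system, shape patterns like these would be one feature among many fed to a classifier, not a rule applied on their own.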
On the whole, understanding these linguistic features provides us with a deeper understanding of text, equipping us to perform more nuanced analyses.
Congratulations on completing this detailed journey into syntactic dependencies and token shapes! You've not only learned what these concepts are, but have also extracted them from a text using Python and spaCy. Remember, linguistics is at the heart of Natural Language Processing, and understanding these features will stand you in good stead for more advanced tasks in this field.
In the upcoming practice exercises, you'll have the opportunity to apply these concepts to various texts. This practice will solidify your understanding and prepare you for the next lesson, where we'll explore the intricacies of semantics in NLP. Happy learning!