2 - Unit - 1 - Find Structures of Words

Finding the Structure of Words
• Human language is a complicated thing.
• We use it to express our thoughts, and through language, we
receive information and infer (understand) its meaning.
• Linguistic expressions may appear unorganized, though they
actually have an underlying organization or structure.
• Trying to understand a language all at once is not a viable
approach.
• Linguists have developed whole disciplines that look at language
from different perspectives and at different levels of detail.
• For Example :
• Morphology, for instance, studies the variable forms and
functions of words.
• Syntax is concerned with the arrangement of words into
phrases, clauses, and sentences.
• Phonology is the field of linguistics that describes the rules and
constraints governing how words are structured and formed based
on their pronunciation.
• The conventions for writing constitute the orthography of a
language.
• Etymology and lexicology cover especially the evolution of words
and explain the semantic, morphological, and other links among
them.
• Words are perhaps the most intuitive units of language, yet they
are in general tricky to define.
• Knowing how to work with them allows, in particular, the
development of syntactic and semantic abstractions.
• Hence, finding the structure of words involves two steps:
• First we need to explore how to identify words of distinct types in
human languages, and
• Second, how the internal structure of words can be modelled in
connection with the grammatical properties and lexical concepts.
• The discovery of word structure is called morphological parsing.

Words and Their Components
• Words are defined in most languages as the smallest linguistic units
that can form a complete utterance by themselves.
• The minimal parts of words that deliver meaning to them are called
morphemes.
• Words in English are delimited mostly by whitespace and punctuation
(full stop, comma, and brackets).
• For Example - Will you read the newspaper? Will you read it? I won’t
read it.
• If we look at this example with insights from etymology and syntax,
we notice two words that deserve attention: newspaper and won’t.
• The word newspaper has an interesting derivational structure.

• In writing, newspaper and the associated concept are distinguished
from the isolated news and paper.
• Generally, linguists prefer to analyze won’t as two syntactic words,
or tokens, each of which has its independent role and can be
reverted to its normalized form.
• The structure of won’t could be parsed as will followed by not.
• In English, this kind of tokenization and normalization may apply
to just a limited set of cases.
• But in other languages, these phenomena have to be treated in a
less trivial manner.
• In Arabic or Hebrew, certain tokens are concatenated in writing
with the preceding or the following ones, possibly changing their
forms as well.
• The lexical or syntactic units blur into one compact string of letters
and no longer appear as distinct words.
• Tokens behaving in this way can be found in various languages and
are often called clitics.
• In the writing systems of Chinese, Japanese, and Thai, whitespace is
not used to separate words.
• Tokenization, also known as word segmentation, is the fundamental
step of morphological analysis and a prerequisite for most language
processing applications.
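• The tokenization and normalization of won’t described above can be
sketched in a few lines of Python. This is only a minimal illustration:
the IRREGULAR table, the regular expression, and the function names are
assumptions made for this example, and the n't rule covers only the
simple English cases.

```python
import re

# Hypothetical exception table: contractions whose stem changes in writing.
IRREGULAR = {
    "won't": ["will", "not"],
    "can't": ["can", "not"],
    "shan't": ["shall", "not"],
}

def tokenize(text):
    """Split text into word and punctuation tokens, keeping apostrophes inside words."""
    # Treat the typographic apostrophe like the plain one.
    text = text.replace("\u2019", "'")
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]", text)

def normalize(token):
    """Expand a contraction into its syntactic words; leave other tokens unchanged."""
    lower = token.lower()
    if lower in IRREGULAR:
        return list(IRREGULAR[lower])
    if lower.endswith("n't") and len(lower) > 3:
        # Regular case, e.g. didn't -> did + not.
        return [lower[:-3], "not"]
    return [token]

def syntactic_words(text):
    """Tokenize the text and normalize every token."""
    return [word for token in tokenize(text) for word in normalize(token)]

print(syntactic_words("Will you read it? I won\u2019t read it."))
```

• Keeping the irregular forms in a lookup table and the regular n't
pattern in a rule mirrors the split between lexical exceptions and
general rules that recurs throughout morphology.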
1. Lexemes
• By the term word, we often denote not just the one linguistic
form in the given context but also the concept behind the form
and the set of alternative forms that can express it.
• Such sets are called lexemes or lexical items and they constitute
the lexicon of a language.
• Lexemes can be divided by their behavior into the lexical categories
of verbs, nouns, adjectives, conjunctions, articles, or other parts of
speech.
• The citation form of a lexeme, by which it is commonly identified, is
also called its lemma.

• When we transform a lexeme into another one that is
morphologically related we say we derive the lexeme:
• for instance, the nouns receiver and reception are derived from the
verb to receive.
• For Example - Did you see him? I didn’t see him. I didn’t see anyone.
• The example presents the problem of tokenization of didn’t and the
investigation of the internal structure of anyone.
• A paraphrase of I didn’t see anyone is I saw no one.
• In the paraphrase, the lexeme to see is inflected into the form saw
to reflect its grammatical function of expressing positive past tense.
• Likewise, him is the oblique case form of he.
• In the paraphrase, no one can be perceived as the minimal word
synonymous with nobody.
2. Morphemes
• Morphology is the study of the structure of words, i.e., the way words
are built up from smaller, minimal units of meaning, which are
termed morphemes.
• For Example
• Played = play + ed
• Cats = cat + s
• The word played has two morphemes: play ‘word’ and ed ‘past tense
marker’.
• The word cats has two morphemes: cat ‘word’ and s ‘plural marker’.
• There are two broad types of morphemes :
i. Stem (Root) ii. Affixes
• Root is the main meaning bearing morpheme of the word.
• Example : play, cat, friend etc.
• Affixes add ‘additional’ meanings of different kinds.
• Example : -ed, -s, un-, -ly etc.
• Two main types of affixes:
i. Prefixes precede the stem: un-, in- etc.
ii. Suffixes follow the stem: -ed, -s, -ly, etc.
• Affixes are called bound morphemes as they cannot occur on their own
and must combine with a root/stem.
• There are two basic processes of word formation :
i. Inflection ii. Derivation
• Inflection is a process where affixes are added to a root/stem to perform
some grammatical function while the category of the word remains the
same.
• Example
Lemma    Singular    Plural
cat      cat         cats
knife    knife       knives
• Derivation is the process through which new words are formed by
adding an affix to an existing word. Example – inter+national =
international.
• Unlike inflection, derivation often leads to change in the category.
• The simplest morphological process concatenates morphs one by
one.
• For Example –
• In the word dis-agree-ment-s, agree is a free lexical morpheme
and the other elements are bound grammatical morphemes
contributing some partial meaning to the whole word.
• The alternative forms of a morpheme are termed allomorphs.
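• Concatenative word structure such as dis-agree-ment-s can be
recovered by stripping known affixes until a known stem remains. The
sketch below assumes tiny, hypothetical PREFIXES, SUFFIXES, and STEMS
inventories built from the examples in these notes; it ignores
allomorphs and spelling changes (it would not segment knives, for
instance).

```python
# Hypothetical toy inventories; a real analyzer would use a full lexicon.
PREFIXES = ["dis", "un", "in"]
SUFFIXES = ["ment", "ed", "ly", "s"]
STEMS = {"agree", "play", "cat", "friend"}

def segment(word):
    """Return the morphs of `word` as a list, or None if no analysis is found."""
    if word in STEMS:
        return [word]
    # Try to peel a prefix off the front, then analyze the remainder.
    for prefix in PREFIXES:
        if word.startswith(prefix):
            rest = segment(word[len(prefix):])
            if rest is not None:
                return [prefix + "-"] + rest
    # Otherwise try to peel a suffix off the end.
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            rest = segment(word[:-len(suffix)])
            if rest is not None:
                return rest + ["-" + suffix]
    return None

for w in ["disagreements", "played", "cats", "unfriendly"]:
    print(w, segment(w))
```

• The recursion terminates because every call works on a strictly
shorter string; a word with no analysis in the toy inventories yields
None.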

3. Typology
• Morphological typology is a linguistic classification system that
categorizes languages based on the ways they use morphemes, which
are the smallest units of meaning in language.
• It can consider various criteria, and during the history of linguistics,
different classifications have been proposed.
• Based on quantitative relations between words, their morphemes, and
their features, the primary morphological typological categories are :
A. Isolating or Analytical languages
• In isolating languages, words are typically composed of one or more free
morphemes, and there is little or no use of bound morphemes like
prefixes or suffixes to indicate grammatical relationships.
• Each word often carries a single, specific meaning.
• Example - typical isolating members are Chinese, Vietnamese, and
Thai.
B. Synthetic languages
• Synthetic languages can combine more morphemes in one word
and are further divided into agglutinative and fusional languages.
i. Agglutinative languages
• Agglutinative languages use a high number of bound morphemes,
each of which typically carries a single grammatical meaning.
• Example – Japanese is an agglutinative language where
suffixes are added to roots to convey aspects of tense, mood, and
case.
ii. Fusional Languages
• Fusional languages use bound morphemes, but these morphemes often
carry multiple grammatical meanings or features simultaneously.
• The morphemes are fused together, making it more challenging to
separate individual meanings.
• Example - Arabic, Latin, Sanskrit, and German are fusional languages.
• In addition to the word formation processes mentioned above,
languages can also be classified as concatenative or nonlinear.
• Concatenative languages link morphs and morphemes one after
another, while nonlinear languages allow structural components to
merge non-sequentially, for example by applying tonal morphemes or
changing the consonantal or vocalic templates of words.
Issues and Challenges
• Words and their components, which encompass morphemes, phonemes,
and other linguistic elements, raise issues and challenges that are
important considerations in the field of linguistics and natural
language processing.
• Here are some key issues and challenges:
1. Irregularity: word forms are not described by a prototypical
linguistic model.
2. Ambiguity: word forms can be understood in multiple ways out of the
context of their discourse.
3. Productivity: is the inventory of words in a language finite, or is it
unlimited?
1. Irregularity
• Irregularity is the phenomenon where certain words or word forms do
not follow regular patterns or rules in terms of their morphology or
syntax.
• By irregularity, we mean the existence of such forms and structures
that are not described appropriately by a prototypical linguistic
model.
• Some irregularities can be understood by redesigning the model and
improving its rules, but other lexically dependent irregularities
often cannot be generalized.
• It is a challenge for algorithms that follow particular patterns.
• Which word forms are irregular?
• There are many kinds of irregular word forms, such as:
i. Irregular verbs or nouns – which do not follow the standard pattern
of inflection.
For Example – irregular verbs such as see → saw and catch → caught,
and irregular nouns such as goose → geese.
ii. Exceptional Inflection – arises mainly with comparative and
superlative adjectives. For example
Adjective    Comparative    Superlative
Big          Bigger         Biggest
Dark         Darker         Darkest
Good         Better         Best
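• One common way to cope with irregularity is to consult an exception
lexicon first and fall back to general rules otherwise. The sketch
below is a hypothetical illustration covering only the three
adjectives in the table; the consonant-doubling condition is a
simplified assumption, not a complete account of English spelling.

```python
# Exception lexicon: lexically dependent forms that no rule predicts.
IRREGULAR_ADJECTIVES = {"good": ("better", "best")}

VOWELS = "aeiou"

def compare(adjective):
    """Return (comparative, superlative) for a short English adjective."""
    if adjective in IRREGULAR_ADJECTIVES:
        return IRREGULAR_ADJECTIVES[adjective]
    stem = adjective
    # Simplified spelling rule: double a final consonant that follows
    # a single vowel (big -> bigg-), but not after two consonants (dark).
    if (len(stem) >= 3 and stem[-1] not in VOWELS + "wxy"
            and stem[-2] in VOWELS and stem[-3] not in VOWELS):
        stem += stem[-1]
    return stem + "er", stem + "est"

for adj in ["big", "dark", "good"]:
    print(adj, compare(adj))
```

• Big and dark are handled by rule, while good can only be handled by
listing it, which is exactly the kind of lexically dependent
irregularity that cannot be generalized.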
2. Ambiguity
• Word forms that can be understood in multiple ways out of context
are ambiguous.
• Word forms that look the same but have distinct functions or
meanings are called homonyms.
• Ambiguity arises in morphological processing and language processing.
• Four kinds of ambiguity are :
i. Word sense ambiguity – A word has different meanings depending on
the context in which it is used.
• For Example
Bank may refer to a financial institution or a river bank.
Bat may refer to a flying mammal or a cricket bat.
ii. Part-of-speech ambiguity – The part of speech of a word changes
with the context.
For Example - I run (verb)
He went for a run (noun)
iii. Structural ambiguity – A sentence has multiple valid syntactic
structures, each giving a different reading.
For example - “I saw the man with the telescope.”
Reading 1 - I used the telescope to see the man.
Reading 2 - I saw the man who had the telescope.
iv. Referential ambiguity – A person or thing is referred to by a
pronoun like he/she/it/they/this, and the referent has to be recovered
from the context.
For Example - John went to the store, and he bought some groceries.
I found a book at the library, and it was really interesting.
3. Productivity
• Productivity refers to the ability to generate new words or word forms
using productive rules.
• For Example – According to Wikipedia, googol means 1 followed by 100
zeroes.
• From googol, the previously unknown word google was coined, and by
applying productive rules, new words such as googling, googlish, and
googleology are generated.
Morphological Models
• There are many possible approaches to designing and
implementing morphological models.
• Over time, computational linguistics has witnessed the
development of a number of formalisms and frameworks.
• The most prominent types of computational approaches to
morphology are:
1. Dictionary Lookup
2. Finite-State Morphology
3. Unification-Based Morphology
4. Functional Morphology
1. Dictionary Lookup
• Dictionary Lookup, also known as Lexicon-Based Morphological
Analysis, relies on pre-built dictionaries or lexicons to
analyze and process words.
• A dictionary is understood as a data structure that directly
enables obtaining some precomputed results, in our case word
analyses.
• The data structure can be optimized for efficient lookup, and
the results can be shared.
• Lookup operations are relatively simple and are usually quick.
• Dictionaries can be implemented, for instance, as lists, binary
search trees, hash tables, and so on.
• Dictionary Lookup as a morphological model works in the
following steps:
i. Dictionary Creation: A comprehensive dictionary or lexicon is
compiled for the target language.
ii. Word Analysis: When a word is encountered in text, the
Dictionary Lookup model first attempts to find an exact match
for the word in the dictionary.
iii. Lemma Retrieval: The model then retrieves the lemma (base
form) of the word from the dictionary.
iv. Handling Ambiguity: Dictionary Lookup models may need to handle
cases of word ambiguity, where a single word form can have multiple
possible lemmas or meanings.
• Advantages of Dictionary Lookup are :
a. Accuracy
b. Transparency
• Limitations of Dictionary Lookup are :
a. Limited Coverage
b. Ambiguity Handling
c. Resource-Intensive
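• The four steps of Dictionary Lookup can be sketched with a hash
table (a Python dict) that maps each word form to all of its
precomputed analyses. The ENTRIES list, the tag strings, and the
function names are assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical toy lexicon: (word form, lemma, tags).
ENTRIES = [
    ("cats", "cat", "+N +PL"),
    ("cat", "cat", "+N +SG"),
    ("saw", "see", "+V +PAST"),
    ("saw", "saw", "+N +SG"),
    ("geese", "goose", "+N +PL"),
]

def build_dictionary(entries):
    """Dictionary creation: index every analysis by its word form."""
    table = defaultdict(list)
    for form, lemma, tags in entries:
        table[form].append((lemma, tags))
    return dict(table)

def lookup(dictionary, word):
    """Word analysis and lemma retrieval: exact match on the lowercased form.

    An ambiguous form returns several analyses; an unknown form returns
    an empty list, which is the limited-coverage problem in practice.
    """
    return dictionary.get(word.lower(), [])

DICTIONARY = build_dictionary(ENTRIES)
print(lookup(DICTIONARY, "Saw"))
print(lookup(DICTIONARY, "dogs"))
```

• Returning a list rather than a single lemma leaves disambiguation
to a later step that has access to context.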
2. Finite-State Morphology
• By finite-state morphological models, we mean those in which the
specifications written by human programmers are directly compiled
into finite-state transducers.
• The two most popular tools supporting this approach are XFST (Xerox
Finite-State Tool) and LexTools.
• Finite-state transducers (FSTs) are computational devices extending
the power of finite-state automata.
• An FST consists of a finite set of nodes connected by directed edges
labeled with pairs of input and output symbols.
• In such a network or graph, nodes are also called states, while edges
are called arcs.
• Traversing the network from the set of initial states to the set
of final states along the arcs is equivalent to reading the
sequences of encountered input symbols and writing the
sequences of corresponding output symbols.
• The set of possible sequences accepted by the transducer
defines the input language;
• the set of possible sequences emitted by the transducer defines
the output language.
• The following table lists input words and their corresponding
morphological parses.
Input      Morphologically parsed output
Cats       cat +N +PL
Cat        cat +N +SG
Cities     city +N +PL
Geese      goose +N +PL
Goose      (goose +N +SG) or (goose +V)
Gooses     goose +V +3SG
Merging    merge +V +PRES-PART
Caught     (catch +V +PAST-PART) or (catch +V +PAST)
• Figure : A schematic finite state transducer for English number
inflection Tnoun. The symbols above each arc represent elements of the
morphological parse in the lexical tape.
• In finite-state computational morphology, it is common to refer to
the input word forms as surface strings and to the output
descriptions as lexical strings.
• In English, a finite-state transducer could analyze the surface string
children into the lexical string child [+plural].
• Conversely, it could generate the surface string women from the
lexical string woman [+plural].
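• The mapping from surface strings to lexical strings can be made
concrete with a very small transducer. The sketch below is an
assumption-laden illustration: the FST class, its methods, and the
hand-built paths are hypothetical, each word gets its own chain of arcs
(no state sharing or minimization), and only the cat and goose rows of
the parse table are covered. Tools such as XFST compile far richer
descriptions.

```python
from collections import defaultdict

class FST:
    """A tiny nondeterministic finite-state transducer."""

    def __init__(self):
        self.arcs = defaultdict(list)   # state -> [(input, output, next_state)]
        self.final = set()
        self.next_id = 1                # state 0 is the initial state

    def add_path(self, pairs):
        """Add a chain of arcs from the initial state; its last state is final."""
        state = 0
        for inp, out in pairs:
            target = self.next_id
            self.next_id += 1
            self.arcs[state].append((inp, out, target))
            state = target
        self.final.add(state)

    def transduce(self, surface):
        """Return every lexical string related to the surface string."""
        results = []
        stack = [(0, 0, "")]            # (state, input position, output so far)
        while stack:
            state, pos, out = stack.pop()
            if state in self.final and pos == len(surface):
                results.append(out)
            for inp, o, target in self.arcs[state]:
                if inp == "":           # epsilon arc: consume no input
                    stack.append((target, pos, out + o))
                elif surface.startswith(inp, pos):
                    stack.append((target, pos + len(inp), out + o))
        return sorted(results)

def same(word):
    """Arcs that copy each letter of `word` unchanged."""
    return [(ch, ch) for ch in word]

T = FST()
T.add_path(same("cat") + [("", " +N +SG")])
T.add_path(same("cat") + [("s", " +N +PL")])
T.add_path(same("goose") + [("", " +N +SG")])
T.add_path(same("goose") + [("", " +V")])
# Irregular plural: surface e e is rewritten to lexical o o.
T.add_path([("g", "g"), ("e", "o"), ("e", "o"), ("s", "s"), ("e", "e"),
            ("", " +N +PL")])

for word in ["cats", "goose", "geese"]:
    print(word, T.transduce(word))
```

• Because the traversal explores every matching arc, an ambiguous
surface string such as goose yields more than one lexical string.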
• Relations on languages can also be viewed as functions.
• Let us have a relation R, and let us denote by [Σ] the set of all
sequences over some set of symbols Σ.
• The domain and the range of R are subsets of [Σ].
• We can then consider R as a function mapping an input string
into a set of output strings, formally denoted by this type
signature, where [Σ] equals String.
R :: [Σ] → {[Σ]}
R :: String → {String}
• A theoretical limitation of finite-state models of morphology is
the problem of capturing reduplication of words or their
elements (e.g., to express plurality) found in several human
languages.
3. Unification-based morphology
• Unification-based morphology is a computational approach to
morphological analysis and generation that uses unification to
combine information about the morphemes and features of a
word.
• Unification is a logical operation that merges two or more sets
of constraints into a single consistent set.
• This approach focuses on the relationships between morphemes
(the smallest units of meaning in a language) and how they
combine to create word forms.
• The key components and concepts of Unification-Based
Morphology are :
1. Feature Structures:
• Unification-Based Morphology represents linguistic information
using feature structures, which are data structures that consist
of attribute-value pairs.
• These feature structures encode information about morphemes,
their grammatical properties, and their relationships within
words.
2. Morphemes and Morphological Rules:
• Morphemes are the building blocks of words, representing units of
meaning.
• Unification-Based Morphology defines morphological rules that
specify how morphemes can combine and interact.
• These rules are expressed using feature structures.
• For Example :
• In English, the word "cats" consists of two morphemes: "cat"
and "-s" (indicating plural).
• These morphemes can be represented as feature structures:
• Morpheme "cat": {base: "cat", pos: "noun", number: "singular"}
• Morpheme "-s": {base: "", pos: "plural marker", number: "plural"}

3. Lexical Entries:
• Each word in a language is associated with a lexical entry, which
includes information about the word's morphemes, their features, and
how they are combined.
• Lexical entries are represented as feature structures.
• For Example :
Lexical Entry for "cats":
Word: "cats"
Morphemes: {"cat", "-s"}
Feature Structures:
{ { base: "cat", pos: "noun", number: "singular"},
{ base: "", pos: "plural marker", number: "plural"}}
4. Morphological Analysis:
• Unification-Based Morphology performs morphological analysis by
unifying feature structures representing morphemes to generate the
complete word form.
• This process involves unification, which combines feature structures
while preserving shared features and resolving conflicts.
• For Example :
• To analyze the word "cats", the feature structures for its
morphemes are unified to generate the complete word form:
• Unified Feature Structure for "cats":
{base: "cat", pos: "noun", number: "plural"}
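• Unification over flat feature structures can be sketched with
Python dicts. This is a minimal illustration under stated assumptions:
the unify function is hypothetical, the structures are not nested, and
it follows the strict textbook reading in which two different values
for the same feature make unification fail. For that reason the stem
below leaves number unspecified and the suffix is modelled as selecting
nouns, which differs slightly from the representation shown in the
notes.

```python
def unify(a, b):
    """Unify two flat feature structures (dicts); return None on conflict."""
    result = dict(a)
    for feature, value in b.items():
        if feature in result and result[feature] != value:
            return None          # conflicting atomic values: unification fails
        result[feature] = value
    return result

# Assumption: the stem is underspecified for number.
stem = {"base": "cat", "pos": "noun"}
# Assumption: the plural suffix attaches to nouns and contributes number.
plural_suffix = {"pos": "noun", "number": "plural"}

print(unify(stem, plural_suffix))

# A stem already marked singular clashes with the plural suffix.
singular_stem = {"base": "cat", "pos": "noun", "number": "singular"}
print(unify(singular_stem, plural_suffix))
```

• A system that wants the suffix to override the stem's number, as in
the "cats" example above, would need a priority rule on top of plain
unification.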
5. Morphological Generation:
• Morphological generation involves starting with feature structures
representing the desired grammatical and semantic properties of a word
and generating the word form by applying morphological rules in
reverse.
• For Example:
• We start with the desired features and generate a word form:
• Desired Feature Structure: {base: "dog", pos: "noun", number: "plural"}
• Morphological Rules:
Apply "-s" to indicate plural.
• Generated Word Form: "dogs"
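• Generation from a feature structure can be sketched as a function
from features to a word form. The generate function below is a
hypothetical illustration for regular English nouns only; the -es and
y → ies spelling rules go beyond the single "-s" rule stated above and
are included as assumptions, and irregular nouns such as goose would
need an exception lexicon.

```python
def generate(features):
    """Produce a word form from a flat feature structure (regular nouns only)."""
    base = features["base"]
    if features.get("pos") == "noun" and features.get("number") == "plural":
        # Assumed spelling rules for the plural suffix.
        if base.endswith(("s", "x", "z", "ch", "sh")):
            return base + "es"
        if base.endswith("y") and len(base) > 1 and base[-2] not in "aeiou":
            return base[:-1] + "ies"
        return base + "s"
    # Singular or unspecified number: the form equals the base.
    return base

print(generate({"base": "dog", "pos": "noun", "number": "plural"}))
print(generate({"base": "city", "pos": "noun", "number": "plural"}))
```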

6. Ambiguity Handling
• Unification-Based Morphology provides a framework for handling
morphological ambiguity by representing multiple potential
feature structures and allowing for disambiguation based on
context or additional linguistic constraints.
• Consider the word "saw", which could be a past tense verb or a
noun (e.g., a tool).
• Ambiguity Representation:
• Past Tense Verb: {base: "see", pos: "verb", tense: "past"}
• Noun (Tool): {base: "saw", pos: "noun"}
4. Functional Morphology
• Functional morphology defines its models using principles of
functional programming and type theory.
• It treats morphological operations and processes as pure
mathematical functions and organizes the linguistic as well as
abstract elements of a model into distinct types of values
and type classes.
• Though functional morphology is not limited to modelling particular
types of morphologies in human languages, it is especially
useful for fusional morphologies.
• Linguistic notions like paradigms, rules, exceptions,
grammatical categories, parameters, lexemes, morphemes, and
morphs can be represented intuitively and clearly.
• Functional morphology implementations are intended to be
reused as programming libraries capable of handling the
complete morphology of a language.
• We can describe inflection I, derivation D, and lookup L as
functions of these generic types:
• I :: lexeme → parameter → form
• D :: lexeme → parameter → lexeme
• L :: content → lexeme
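• The three signatures can be illustrated with pure functions over a
small typed lexicon. This is a hypothetical sketch written in Python
for brevity (the implementations named in these notes use OCaml and
Haskell); the Lexeme type, the parameter values "plural" and "agent",
and the toy LEXICON are assumptions made for the example.

```python
from typing import NamedTuple

class Lexeme(NamedTuple):
    lemma: str
    category: str

# Hypothetical toy lexicon keyed by content (here simply the lemma string).
LEXICON = {
    "cat": Lexeme("cat", "noun"),
    "play": Lexeme("play", "verb"),
}

def inflect(lexeme: Lexeme, parameter: str) -> str:
    """I :: lexeme -> parameter -> form"""
    if lexeme.category == "noun" and parameter == "plural":
        return lexeme.lemma + "s"
    return lexeme.lemma

def derive(lexeme: Lexeme, parameter: str) -> Lexeme:
    """D :: lexeme -> parameter -> lexeme"""
    if lexeme.category == "verb" and parameter == "agent":
        return Lexeme(lexeme.lemma + "er", "noun")
    return lexeme

def lookup(content: str) -> Lexeme:
    """L :: content -> lexeme (raises KeyError for unknown content)"""
    return LEXICON[content]

# Functions compose: look up play, derive the agent noun, inflect for plural.
print(inflect(derive(lookup("play"), "agent"), "plural"))
```

• Because none of the functions has side effects, the same definitions
can serve analysis tools, generators, and tests without modification.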
• Many functional morphology implementations are embedded in
a general-purpose programming language.
• This gives programmers more freedom with advanced
programming techniques and allows them to develop
full-featured, real-world applications for their models.
• For instance
• The Zen toolkit for Sanskrit morphology is written in OCaml.
• It influenced the functional morphology framework in Haskell,
with which morphologies of Latin, Swedish, Spanish, Urdu, and
other languages have been implemented.
• Morphological grammars in Grammatical Framework can be
extended with descriptions of the syntax and semantics of a
language.
• Grammatical Framework itself supports multilinguality, and
models of more than a dozen languages are available as
open-source software.
Assignment Questions
1. Define the following terminologies
a. Morphology
b. Phonology
c. Orthography
d. Etymology and lexicology
e. Morphological Parsing
2. Give the steps to find the structure of the words given a sentence.
3. What is the significance of words and their components?
4. Explain the key components of a word with examples.
5. What are the issues and challenges in finding the structure of words? Discuss.
6. Explain the dictionary lookup as a morphological model.
7. What is a finite state morphology ? Traverse the given words using FST
i. cities ii. Cats iii. Gooses iv. Caught v. runs vi. Feet
8. What is unification-based morphology ? Give its key components with
examples.
9. Define functional morphology. What is the significance of using functional
morphology?
