2 - Unit - 1 - Find Structures of Words

Uploaded by

Kola Siri

Finding the Structure of Words

• Human language is a complicated thing.

• We use it to express our thoughts, and through language we receive information and infer (understand) its meaning.

• Linguistic expressions may appear unorganized, though they actually have an underlying organization or structure.

• Trying to understand a language all at once is not a viable approach.

• Linguists have developed whole disciplines that look at language from different perspectives and at different levels of detail.

• For example:

• The point of morphology, for instance, is to study the variable forms and functions of words.

• Syntax is concerned with the arrangement of words into phrases, clauses, and sentences.

• Phonology is the field of linguistics that describes the rules and constraints governing how words are structured and formed based on their pronunciation.

• The conventions for writing constitute the orthography of a language.

• Etymology and lexicology cover, in particular, the evolution of words and explain the semantic, morphological, and other links among them.

• Words are perhaps the most intuitive units of language, yet they are in general tricky to define.

• Knowing how to work with words allows, in particular, the development of syntactic and semantic abstractions.

• Hence, finding the structure of words involves two steps:

• First, we need to explore how to identify words of distinct types in human languages, and

• Second, how the internal structure of words can be modelled in connection with their grammatical properties and lexical concepts.

• The discovery of word structure is called morphological parsing.


Words and Their Components
• Words are defined in most languages as the smallest linguistic units that can form a complete utterance by themselves.

• The minimal parts of words that deliver meaning to them are called morphemes.

• Words in English are delimited by whitespace and punctuation (full stop, comma, brackets, and so on).

• For example: Will you read the newspaper? Will you read it? I won’t read it.

• If we confront this assumption with insights from etymology and syntax, we notice two words that need attention here: newspaper and won’t.

• The word newspaper has an interesting derivational structure.


• In writing, newspaper and the associated concept are distinguished from the isolated words news and paper.

• Generally, linguists prefer to analyze won’t as two syntactic words, or tokens, each of which has its independent role and can be reverted to its normalized form.

• The structure of won’t could be parsed as will followed by not.

• In English, this kind of tokenization and normalization may apply to just a limited set of cases.

• But in other languages, these phenomena have to be treated in a less trivial manner.
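
• The tokenization and normalization of won’t into will followed by not can be sketched in Python. This is a minimal illustration, not a complete tokenizer: the CONTRACTIONS table and the tokenize function are hypothetical names introduced here, and the table covers only a few cases.

```python
import re

# Hypothetical, deliberately tiny table of English contractions
# (an assumption for illustration; real systems need a far larger list).
CONTRACTIONS = {
    "won't": ["will", "not"],
    "didn't": ["did", "not"],
    "can't": ["can", "not"],
}

def tokenize(text):
    """Split text into tokens and expand known contractions."""
    # Normalize the typographic apostrophe so table lookups are uniform.
    text = text.replace("\u2019", "'")
    # A word may contain one internal apostrophe; any other
    # non-space, non-letter character becomes a token of its own.
    raw_tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]", text)
    tokens = []
    for tok in raw_tokens:
        tokens.extend(CONTRACTIONS.get(tok.lower(), [tok]))
    return tokens

print(tokenize("I won't read it."))
# ['I', 'will', 'not', 'read', 'it', '.']
```

• A lookup table is enough here because, as noted above, English needs this treatment only for a limited set of cases.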

• In Arabic or Hebrew, certain tokens are concatenated in writing with the preceding or the following ones, possibly changing their forms.

• The lexical or syntactic units blur into one compact string of letters and no longer appear as distinct words.

• Tokens behaving in this way can be found in various languages and are often called clitics.

• In the writing systems of Chinese, Japanese, and Thai, whitespace is not used to separate words.

• Tokenization, also known as word segmentation, is the fundamental step of morphological analysis and a prerequisite for most language processing applications.
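
• Word segmentation without whitespace can be illustrated with greedy forward maximum matching, one simple baseline among many. The sketch below uses an unspaced English string as a stand-in for a script without word boundaries; the segment function and the toy LEXICON are assumptions made for this example.

```python
def segment(text, lexicon):
    """Greedy forward maximum matching over an unspaced string."""
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No lexicon entry starts here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical toy lexicon (an assumption for illustration).
LEXICON = {"will", "you", "read", "the", "news", "paper", "newspaper"}

print(segment("willyoureadthenewspaper", LEXICON))
# ['will', 'you', 'read', 'the', 'newspaper']
```

• Because the longest match wins, newspaper is preferred over news followed by paper. A greedy strategy can still choose a wrong segmentation when a long match blocks the correct shorter one.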

1. Lexemes

• By the term word, we often denote not just the one linguistic form in the given context but also the concept behind the form and the set of alternative forms that can express it.

• Such sets are called lexemes or lexical items, and they constitute the lexicon of a language.

• Lexemes can be divided by their behavior into the lexical categories of verbs, nouns, adjectives, conjunctions, articles, or other parts of speech.

• The citation form of a lexeme, by which it is commonly identified, is also called its lemma.


• When we transform a lexeme into another one that is morphologically related, we say we derive the lexeme: for instance, the nouns receiver and reception are derived from the verb to receive.

• For example: Did you see him? I didn’t see him. I didn’t see anyone.

• The example presents the problem of tokenization of didn’t and the investigation of the internal structure of anyone.

• A paraphrase of I didn’t see anyone is I saw no one.

• In the paraphrase, the lexeme to see is inflected into the form saw to reflect its grammatical function of expressing positive past tense.

• Likewise, him is the oblique case form of he.


• In the paraphrase, no one can be perceived as the minimal word synonymous with nobody.

2. Morphemes

• Morphology is the study of the structure of words, i.e., the way words are built up from smaller, minimal units of meaning, which are termed morphemes.

• For example:

• played = play + -ed

• cats = cat + -s

• The word played has two morphemes: play (the stem) and -ed (the past tense marker).

• The word cats has two morphemes: cat (the stem) and -s (the plural marker).

• There are two broad types of morphemes :

i. Stem (Root) ii. Affixes

• Root is the main meaning bearing morpheme of the word.

• Example : play, cat, friend etc.

• Affixes add ‘additional’ meanings of different kinds.

• Example : -ed, -s, un-, -ly etc.

• Two main types of affixes:

i. Prefixes precede the stem: un-, in- etc.

ii. Suffixes follow the stem: -ed, -s, -ly, etc.

• Affixes are called bound morphemes, as they cannot occur on their own and must combine with a root or stem.

• There are two basic processes of word formation:

i. Inflection ii. Derivation

• Inflection is a process in which affixes are added to a root or stem to perform some grammatical function, while the category of the word remains the same.

• Example:

Lemma    Singular    Plural
cat      cat         cats
knife    knife       knives

• The process through which new words are formed by adding an affix to an existing word is called derivation. Example: inter- + national = international.

• Unlike inflection, derivation often leads to a change in the category.

• The simplest morphological process concatenates morphs one by one.

• For example, in the word dis-agree-ment-s, agree is a free lexical morpheme and the other elements are bound grammatical morphemes contributing some partial meaning to the whole word.

• The alternative forms of a morpheme are termed allomorphs.
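
• The concatenative analysis of a word such as dis-agree-ment-s can be sketched by peeling known prefixes and suffixes off a known stem. The inventories PREFIXES, SUFFIXES, and STEMS and the function segment_morphs are hypothetical and deliberately tiny; the sketch ignores allomorphs and spelling changes.

```python
# Hypothetical toy inventories (assumptions for illustration only).
PREFIXES = ["dis", "un", "in"]
SUFFIXES = ["ment", "ed", "ly", "s"]
STEMS = {"agree", "play", "cat", "friend"}

def segment_morphs(word):
    """Return (prefixes, stem, suffixes), or None if no analysis is found."""
    if word in STEMS:
        return [], word, []
    # Try to strip one prefix and analyze the remainder recursively.
    for prefix in PREFIXES:
        if word.startswith(prefix):
            rest = segment_morphs(word[len(prefix):])
            if rest is not None:
                pre, stem, suf = rest
                return [prefix] + pre, stem, suf
    # Otherwise try to strip one suffix and analyze the remainder.
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            rest = segment_morphs(word[:-len(suffix)])
            if rest is not None:
                pre, stem, suf = rest
                return pre, stem, suf + [suffix]
    return None

print(segment_morphs("disagreements"))
# (['dis'], 'agree', ['ment', 's'])
print(segment_morphs("knives"))
# None
```

• The form knives is not analyzed because its stem appears as the allomorph knive-, which plain concatenation of listed morphs does not cover.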


3. Typology

• Morphological typology is a linguistic classification system that categorizes languages based on the ways they use morphemes, which are the smallest units of meaning in language.

• It can consider various criteria, and over the history of linguistics, different classifications have been proposed.

• Based on quantitative relations between words, their morphemes, and their features, the primary morphological typological categories are:

A. Isolating or analytic languages

• In isolating languages, words are typically composed of one or more free morphemes, and there is little or no use of bound morphemes like prefixes or suffixes to indicate grammatical relationships.

• Each word often carries a single, specific meaning.

• Example: typical isolating languages are Chinese, Vietnamese, and Thai.

B. Synthetic languages

• Synthetic languages can combine more morphemes in one word and are further divided into agglutinative and fusional languages.

i. Agglutinative languages

• Agglutinative languages use a high number of bound morphemes, each of which typically carries a single grammatical meaning.

• Example: Japanese is an agglutinative language in which suffixes are added to roots to convey aspects of tense, mood, and case.

ii. Fusional languages

• Fusional languages use bound morphemes, but these morphemes often carry multiple grammatical meanings or features simultaneously.

• The morphemes are fused together, making it more challenging to separate individual meanings.

• Example: Arabic, Latin, Sanskrit, and German are fusional languages.

• In addition to the word formation processes mentioned above, we can also distinguish languages that use concatenative forms from those that use nonlinear forms.

• Concatenative languages link morphs and morphemes one after another, whereas nonlinear languages allow structural components to merge non-sequentially, for example to apply tonal morphemes or to change the consonantal templates of words.
Issues and Challenges
• Issues and challenges related to words and their components, which encompass morphemes, phonemes, and other linguistic elements, are important considerations in linguistics and natural language processing.

• Here are some key issues and challenges:

1. Irregularity: word forms are not described by a prototypical linguistic model.

2. Ambiguity: word forms can be understood in multiple ways out of the context of their discourse.

3. Productivity: is the inventory of words in a language finite, or is it unlimited?

1. Irregularity

• Irregularity is the phenomenon in which certain words or word forms do not follow regular patterns or rules in terms of their morphology or syntax.

• By irregularity, we mean the existence of forms and structures that are not described appropriately by a prototypical linguistic model.

• Some irregularities can be handled by redesigning the model and improving its rules, but other, lexically dependent irregularities often cannot be generalized.

• Irregularity is a challenge for algorithms that rely on regular patterns.

• Which word forms are irregular? Common cases include:


i. Irregular verbs or nouns, which do not follow the standard pattern of inflection.

For example: see → saw and catch → caught (verbs); goose → geese and child → children (nouns).

ii. Exceptional inflection, which arises mainly in comparative and superlative adjectives. For example:

Positive    Comparative    Superlative
big         bigger         biggest
dark        darker         darkest
good        better         best
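
• A common way to cope with such irregular forms is to consult an exception list before applying the regular rule. The sketch below does this for the adjectives in the table; the IRREGULAR table, the compare function, and the simplified spelling rules (drop a final e, double a final consonant after a single vowel) are assumptions that fit only short adjectives.

```python
import re

# Lexically dependent exceptions that no general rule predicts.
IRREGULAR = {"good": ("better", "best"), "bad": ("worse", "worst")}

def compare(adjective):
    """Return (comparative, superlative) for a short English adjective."""
    if adjective in IRREGULAR:
        return IRREGULAR[adjective]
    stem = adjective
    if adjective.endswith("e"):
        # large -> larg + er
        stem = adjective[:-1]
    elif re.search(r"[^aeiou][aeiou][^aeiouwxy]$", adjective):
        # big -> bigg + er: double the final consonant after a single vowel
        stem = adjective + adjective[-1]
    return stem + "er", stem + "est"

for word in ["big", "dark", "good"]:
    print(word, compare(word))
# big ('bigger', 'biggest')
# dark ('darker', 'darkest')
# good ('better', 'best')
```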

2. Ambiguity

• Ambiguity concerns word forms that can be understood in multiple ways out of context.

• Word forms that look the same but have distinct functions or meanings are called homonyms.

• Ambiguity arises in morphological processing and in language processing in general.

• Four kinds of ambiguity are:

i. Word sense ambiguity: a particular word has different meanings depending on the context in which it is used.

• For example:

Bank can refer to a financial institution or a river bank.

Bat can refer to a flying mammal or a cricket bat.


ii. Part-of-speech ambiguity: the part of speech of a particular word changes with the context.

For example: I run (verb)

He went for a run (noun)

iii. Structural ambiguity: a sentence has multiple valid syntactic structures, each leading to a different reading.

For example, in "I saw the man with the telescope", the phrase with the telescope can attach either to saw or to the man.

Structural ambiguity should not be confused with paraphrase, where one meaning is written in multiple ways. For example, "She walked gracefully through the garden." can be paraphrased as:

Sentence 1 - Through the garden, she walked with grace.

Sentence 2 - She strolled elegantly in the garden.


iv. Referential ambiguity: a person or thing is referred to by a pronoun such as he, she, it, they, or this, and the referent of the pronoun must be determined from the context.

For example: John went to the store, and he bought some groceries.

I found a book at the library, and it was really interesting.

3. Productivity

• Productivity refers to the ability to generate new words or word forms using productive rules.

• For example, according to Wikipedia, googol means 1 followed by 100 zeroes.

• From googol, a previously unknown word, google, was created, and by applying productive rules, new words such as googling, googlish, and googleology have been generated.
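
• Productivity can be illustrated by applying the same suffixation rules to a stem the model has never seen. The rule list RULES and the functions add_suffix and productive_forms are hypothetical, and the spelling adjustment (dropping a final e before some suffixes) is a simplification chosen to reproduce the forms quoted above.

```python
def add_suffix(stem, suffix, drop_e=True):
    """Attach a suffix, optionally dropping a final 'e' before a vowel."""
    if drop_e and stem.endswith("e") and suffix[0] in "aeiou":
        stem = stem[:-1]
    return stem + suffix

# (suffix, drop final e?) pairs: a toy set of productive rules.
RULES = [("ing", True), ("ish", True), ("ology", False)]

def productive_forms(stem):
    """Apply every rule to the stem, whether or not the stem is known."""
    return [add_suffix(stem, suffix, drop_e) for suffix, drop_e in RULES]

print(productive_forms("google"))
# ['googling', 'googlish', 'googleology']
```

• Since the rules accept any stem, the set of forms they can produce has no fixed upper bound, which is why a finite word list alone cannot cover a language.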
Morphological Models
• There are many possible approaches to designing and implementing morphological models.

• Over time, computational linguistics has witnessed the development of a number of formalisms and frameworks.

• The most prominent types of computational approaches to morphology are:

1. Dictionary Lookup

2. Finite-State Morphology

3. Unification-Based Morphology

4. Functional Morphology

1. Dictionary Lookup

• Dictionary lookup, also known as lexicon-based morphological analysis, relies on pre-built dictionaries or lexicons to analyze and process words.

• A dictionary is understood as a data structure that directly enables obtaining some precomputed results, in our case word analyses.

• The data structure can be optimized for efficient lookup, and the results can be shared.

• Lookup operations are relatively simple and are usually quick.


• Dictionaries can be implemented, for instance, as lists, binary search trees, hash tables, and so on.

• Dictionary lookup as a morphological model works in the following steps:

i. Dictionary Creation: A comprehensive dictionary or lexicon is compiled for the target language.

ii. Word Analysis: When a word is encountered in text, the dictionary lookup model first attempts to find an exact match for the word in the dictionary.

iii. Lemma Retrieval: The model then retrieves the lemma (base form) of the word from the dictionary.

iv. Handling Ambiguity: Dictionary lookup models may need to handle cases of word ambiguity, where a single word form can have multiple possible lemmas or meanings.

• Advantages of Dictionary Lookup are :

a. Accuracy

b. Transparency

• Limitations of Dictionary Lookup are :

a. Limited Coverage

b. Ambiguity Handling

c. Resource-Intensive
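
• The lookup steps described above can be sketched with a hash table that maps each surface form to its precomputed analyses. The LEXICON entries, the tag strings, and the analyze function are hypothetical examples rather than a real resource.

```python
# Hypothetical toy lexicon: surface form -> list of (lemma, tags) analyses.
LEXICON = {
    "cat": [("cat", "+N +SG")],
    "cats": [("cat", "+N +PL")],
    "geese": [("goose", "+N +PL")],
    # An ambiguous form keeps all of its analyses.
    "saw": [("see", "+V +PAST"), ("saw", "+N +SG")],
}

def analyze(word):
    """Return all stored analyses; an empty list means limited coverage."""
    return LEXICON.get(word.lower(), [])

print(analyze("Cats"))    # [('cat', '+N +PL')]
print(analyze("saw"))     # [('see', '+V +PAST'), ('saw', '+N +SG')]
print(analyze("googling"))  # []
```

• The empty result for a form missing from the table shows the limited coverage listed above, and the two analyses of saw show that the dictionary only enumerates ambiguity and leaves the choice to later processing.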

2. Finite-State Morphology

• By finite-state morphological models, we mean those in which the specifications written by human programmers are directly compiled into finite-state transducers.

• The two most popular tools supporting this approach are XFST (Xerox Finite-State Tool) and Lex Tools.

• Finite-state transducers (FSTs) are computational devices extending the power of finite-state automata.

• An FST consists of a finite set of nodes connected by directed edges labeled with pairs of input and output symbols.

• In such a network or graph, nodes are also called states, while edges are called arcs.

• Traversing the network from the set of initial states to the set of final states along the arcs is equivalent to reading the sequences of encountered input symbols and writing the sequences of corresponding output symbols.

• The set of possible sequences accepted by the transducer defines the input language.

• The set of possible sequences emitted by the transducer defines the output language.

• The following example shows input words and their corresponding morphological parses, together with an FST state diagram.

Input      Morphologically parsed output
Cats       cat +N +PL
Cat        cat +N +SG
Cities     city +N +PL
Geese      goose +N +PL
Goose      (goose +N +SG) or (goose +V)
Gooses     goose +V +3SG
Merging    merge +V +PRES-PART
Caught     (catch +V +PAST-PART) or (catch +V +PAST)

• Figure: A schematic finite-state transducer for English number inflection Tnoun. The symbols above each arc represent elements of the morphological parse in the lexical tape.
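
• A small nondeterministic transducer in the spirit of the table can be sketched in Python. The FST class, its method names, and the state names are assumptions made for this illustration and are unrelated to the XFST toolkit; the network covers only two regular nouns and the irregular pair goose and geese.

```python
from collections import defaultdict

class FST:
    """Arcs carry (input, output) pairs; "" as input is an epsilon arc."""

    def __init__(self, start, finals):
        self.start = start
        self.finals = set(finals)
        self.arcs = defaultdict(list)  # state -> [(input, output, next_state)]
        self._fresh = 0

    def add_arc(self, src, inp, out, dst):
        self.arcs[src].append((inp, out, dst))

    def add_path(self, src, pairs, dst):
        """Add a chain of arcs from src to dst, one per (input, output) pair."""
        for k, (inp, out) in enumerate(pairs):
            if k == len(pairs) - 1:
                nxt = dst
            else:
                self._fresh += 1
                nxt = f"q{self._fresh}"
            self.add_arc(src, inp, out, nxt)
            src = nxt

    def transduce(self, surface):
        """Return every lexical string emitted for the surface string."""
        results = []
        stack = [(self.start, 0, "")]
        while stack:
            state, pos, out = stack.pop()
            if pos == len(surface) and state in self.finals:
                results.append(out)
            for inp, o, nxt in self.arcs[state]:
                if inp == "":
                    stack.append((nxt, pos, out + o))
                elif surface.startswith(inp, pos):
                    stack.append((nxt, pos + len(inp), out + o))
        return sorted(results)

t = FST("start", {"end"})
for stem in ["cat", "dog"]:                      # regular nouns: identity arcs
    t.add_path("start", [(c, c) for c in stem], "reg_noun")
t.add_arc("reg_noun", "", " +N +SG", "end")      # bare stem: singular
t.add_arc("reg_noun", "s", " +N +PL", "end")     # surface s: plural tag
t.add_path("start", list(zip("goose", "goose")), "irreg_sg")
t.add_arc("irreg_sg", "", " +N +SG", "end")
t.add_path("start", list(zip("geese", "goose")), "irreg_pl")  # e:o arcs
t.add_arc("irreg_pl", "", " +N +PL", "end")

print(t.transduce("cats"))   # ['cat +N +PL']
print(t.transduce("geese"))  # ['goose +N +PL']
```

• The irregular plural is handled on the arcs themselves: reading e while writing o maps geese to goose, after which an epsilon arc appends the plural tag.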

• In finite-state computational morphology, it is common to refer to the input word forms as surface strings and to the output descriptions as lexical strings.

• In English, a finite-state transducer could analyze the surface string children into the lexical string child [+plural].

• Similarly, it could analyze the surface string women into the lexical string woman [+plural].

• Relations on languages can also be viewed as functions.

• Let us have a relation R, and let us denote by [Σ] the set of all sequences over some set of symbols Σ.

• The domain and the range of R are subsets of [Σ].


• We can then consider R as a function mapping an input string into a set of output strings, formally denoted by this type signature, where [Σ] equals String:

R :: [Σ] → {[Σ]}

R :: String → {String}

• A theoretical limitation of finite-state models of morphology is the problem of capturing reduplication of words or their elements (e.g., to express plurality), found in several human languages.

3. Unification-Based Morphology

• Unification-based morphology is a computational approach to morphological analysis and generation that uses unification to combine information about the morphemes and features of a word.

• Unification is a logical operation that merges two or more sets of constraints into a single consistent set.

• This approach focuses on the relationships between morphemes (the smallest units of meaning in a language) and how they combine to create word forms.

• The key components and concepts of unification-based morphology are:

1. Feature Structures:

• Unification-based morphology represents linguistic information using feature structures, which are data structures that consist of attribute-value pairs.

• These feature structures encode information about morphemes, their grammatical properties, and their relationships within words.

2. Morphemes and Morphological Rules: Morphemes are the building blocks of words, representing units of meaning.

• Unification-based morphology defines morphological rules that specify how morphemes can combine and interact.

• These rules are expressed using feature structures.

• For example:

• In English, the word "cats" consists of two morphemes: "cat" and "-s" (indicating plural).

• These morphemes can be represented as feature structures:

• Morpheme "cat": {base: "cat", pos: "noun", number: "singular"}

• Morpheme "-s": {base: "", pos: "plural marker", number: "plural"}


3. Lexical Entries:

• Each word in a language is associated with a lexical entry, which includes information about the word's morphemes, their features, and how they are combined.

• Lexical entries are represented as feature structures.

• For example:

Lexical Entry for "cats":

Word: "cats"

Morphemes: {"cat", "-s"}

Feature Structures:
{ {base: "cat", pos: "noun", number: "singular"},
  {base: "", pos: "plural marker", number: "plural"} }

4. Morphological Analysis:

• Unification-based morphology performs morphological analysis by unifying the feature structures that represent morphemes to obtain the description of the complete word form.

• This process involves unification, which combines feature structures while preserving shared features. In strict unification, conflicting values for the same feature make the operation fail, so in the example that follows the affix should be read as supplying the number value of the whole word.

• For example:

To analyze the word "cats", the feature structures for its morphemes are unified to describe the complete word form:

• Unified Feature Structure for "cats":

{base: "cat", pos: "noun", number: "plural"}
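
• Unification of feature structures can be sketched with nested Python dictionaries. The unify function below is a minimal illustration under two stated assumptions: feature structures have no shared (reentrant) substructures, and the stem is given here without a number value and with the affix marked as a noun, because strict unification fails when two structures assign different values to the same feature.

```python
def unify(fs1, fs2):
    """Unify two feature structures (dicts); return None on a conflict."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature not in result:
            # Only one side specifies the feature: copy it over.
            result[feature] = value
        elif isinstance(result[feature], dict) and isinstance(value, dict):
            # Both sides hold nested structures: unify them recursively.
            sub = unify(result[feature], value)
            if sub is None:
                return None
            result[feature] = sub
        elif result[feature] != value:
            # Two different atomic values: unification fails.
            return None
    return result

# Hypothetical underspecified entries (assumptions for this sketch).
stem = {"base": "cat", "pos": "noun"}           # number left unspecified
plural = {"pos": "noun", "number": "plural"}    # contribution of "-s"

print(unify(stem, plural))
# {'base': 'cat', 'pos': 'noun', 'number': 'plural'}
print(unify({"number": "singular"}, {"number": "plural"}))
# None
```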

5. Morphological Generation:

• Morphological generation starts with feature structures representing the desired grammatical and semantic properties of a word and generates the word form by applying morphological rules in reverse.

• For example, we can start with the desired features and generate a word form:

• Desired Feature Structure: {base: "dog", pos: "noun", number: "plural"}

• Morphological Rule: apply "-s" to indicate plural.

• Generated Word Form: "dogs"


6. Ambiguity Handling

• Unification-based morphology provides a framework for handling morphological ambiguity by representing multiple potential feature structures and allowing for disambiguation based on context or additional linguistic constraints.

• Consider the word "saw", which could be a past tense verb or a noun (e.g., a tool).

• Ambiguity Representation:
• Past Tense Verb: {base: "see", pos: "verb", tense: "past"}
• Noun (Tool): {base: "saw", pos: "noun"}

4. Functional Morphology

• Functional morphology defines its models using principles of functional programming and type theory.

• It treats morphological operations and processes as pure mathematical functions and organizes the linguistic as well as abstract elements of a model into distinct types of values and type classes.

• Though functional morphology is not limited to modelling particular types of morphologies in human languages, it is especially useful for fusional morphologies.

• Linguistic notions like paradigms, rules, exceptions, grammatical categories, parameters, lexemes, morphemes, and morphs can be represented intuitively and clearly.

• Functional morphology implementations are intended to be reused as programming libraries capable of handling the complete morphology of a language.

• We can describe inflection I, derivation D, and lookup L as functions of these generic types:

• I :: lexeme → parameter → form

• D :: lexeme → parameter → lexeme

• L :: content → lexeme
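
• The three signatures can be mirrored as pure functions in Python. This sketch is not the Haskell functional morphology framework or the Zen toolkit: the Lexeme type, the noun paradigm, the EXCEPTIONS table, and the toy LEXICON are assumptions made only to show the shape of I, D, and L.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lexeme:
    lemma: str
    pos: str

def regular_noun(lemma, parameter):
    """A paradigm expressed as a function from parameter to form."""
    if parameter == "plural":
        if lemma.endswith(("s", "x", "ch", "sh")):
            return lemma + "es"
        return lemma + "s"
    return lemma

# Exceptions override the paradigm for individual lexemes.
EXCEPTIONS = {("goose", "plural"): "geese", ("child", "plural"): "children"}

def inflect(lexeme, parameter):
    """I :: lexeme -> parameter -> form (nouns only in this sketch)."""
    regular = regular_noun(lexeme.lemma, parameter)
    return EXCEPTIONS.get((lexeme.lemma, parameter), regular)

def derive(lexeme, parameter):
    """D :: lexeme -> parameter -> lexeme (only verb -> agent noun here)."""
    if parameter == "agent" and lexeme.pos == "verb":
        stem = lexeme.lemma[:-1] if lexeme.lemma.endswith("e") else lexeme.lemma
        return Lexeme(stem + "er", "noun")
    raise ValueError(f"no derivation {parameter!r} for {lexeme}")

LEXICON = [Lexeme("cat", "noun"), Lexeme("goose", "noun"), Lexeme("receive", "verb")]

def lookup(content):
    """L :: content -> lexeme."""
    for lexeme in LEXICON:
        if lexeme.lemma == content:
            return lexeme
    raise KeyError(content)

print(inflect(lookup("cat"), "plural"))     # cats
print(inflect(lookup("goose"), "plural"))   # geese
print(derive(lookup("receive"), "agent"))   # Lexeme(lemma='receiver', pos='noun')
```

• Because each function depends only on its arguments, paradigms and exceptions can be combined and reused as ordinary library code.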

• Many functional morphology implementations are embedded in a general-purpose programming language.

• This gives programmers more freedom with advanced programming techniques and allows them to develop full-featured, real-world applications for their models.

• For instance:

• The Zen toolkit for Sanskrit morphology is written in OCaml.

• It influenced the functional morphology framework in Haskell, with which morphologies of Latin, Swedish, Spanish, Urdu, and other languages have been implemented.

• Morphological grammars in Grammatical Framework can be extended with descriptions of the syntax and semantics of a language.

• Grammatical Framework itself supports multilinguality, and models of more than a dozen languages are available in it as open-source software.
Assignment Questions
1. Define the following terminologies
a. Morphology
b. Phonology
c. Orthography
d. Etymology and lexicology
e. Morphological Parsing
2. Give the steps to find the structure of the words in a given sentence.
3. What is the significance of words and their components?
4. Explain the key components of a word with examples.
5. What are the issues and challenges in finding the structure of words? Discuss.
6. Explain dictionary lookup as a morphological model.
7. What is finite-state morphology? Traverse the given words using an FST:
i. cities ii. cats iii. gooses iv. caught v. runs vi. feet
8. What is unification-based morphology? Give its key components with examples.
9. Define functional morphology. What is the significance of using functional morphology?
