NLP3 - Lecture 3

Finite State Transducers

Transducers
• Recognisers either accept or reject a
word.
• Although this is useful, networks can
actually return more substantial
information.
• This is achieved by providing networks
with the ability to write as well as to read.

Basic Transducer
• Each transition of a transducer is labelled with a
pair of symbols rather than with a single symbol.
• Analysis proceeds as before, except that input
symbols are matched against the lower-side
symbols on transitions.
• If analysis succeeds, return the string of upper-
side symbols on the path to the final state.
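The analysis procedure above can be sketched in Python (a toy illustration, not the lecture's code; the transition format and the CASA/CASE example transducer are assumptions):

```python
# Minimal sketch of transducer analysis. Transitions are
# (state, lower, upper, next_state) tuples; analysis matches the input word
# against lower-side symbols and, on success, returns the concatenated
# upper-side symbols along the accepting path.

def analyze(transitions, finals, word, state=0):
    """Return the upper-side string for `word`, or None if it is rejected."""
    if not word:
        return "" if state in finals else None
    for (src, lower, upper, dst) in transitions:
        if src == state and lower == word[0]:
            rest = analyze(transitions, finals, word[1:], dst)
            if rest is not None:
                return upper + rest
    return None

# Toy transducer pairing surface (lower) "case" with lexical (upper) "casa",
# as in the CASE/CASA example later in the lecture.
T = [(0, "c", "c", 1), (1, "a", "a", 2), (2, "s", "s", 3), (3, "e", "a", 4)]
print(analyze(T, {4}, "case"))  # casa
print(analyze(T, {4}, "cast"))  # None
```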
Finite State Transducers
• The simple story
– Add another tape
– Add extra symbols to the transitions
– On one tape we read "cats", on the other we
write "cat +N +PL"
Confusing Terminology
• Lower side = surface side.
• Upper side = "deep" side.
• Analysis proceeds from lower to upper.
• Synthesis (generation) proceeds from
upper to lower.
Finite-State Transducers (FST)
• The symbols of the FST are complex: they're
really pairs of symbols, one for each of two
"tapes" or levels.
• Recognizer: decides if a given pair of
representations fits together "OK"
• Generator: generates pairs of representations that
fit together
• Translator: takes a representation on one level
and produces the appropriate representation on
the other level
Finite state transducers
• can be inverted, or
• composed, and you get another FST.
Four-Fold View of FSTs
• As a recognizer
• As a generator
• As a translator
• As a set relater –
a machine that computes a relation between
a set of input strings and a set of output strings
Formally, a finite transducer T is a 6-tuple (Q, Σ, Δ, E, I, F) such
that:
• Q is a finite set, the set of states;
• Σ is a finite set, called the input labels;
• Δ is a finite set, called the output labels;
• I is a subset of Q, the set of initial states;
• F is a subset of Q, the set of final states; and
• E ⊆ Q × (Σ ∪ {ε}) × (Δ ∪ {ε}) × Q (where ε is the empty string) is the transition relation.
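The 6-tuple can be written out directly as data. A toy instantiation (every set below is invented for illustration, with ε modeled as the empty string ""):

```python
# Toy instantiation of the 6-tuple T = (Q, Sigma, Delta, E, I, F).
Q     = {0, 1, 2}                          # states
Sigma = {"a", "b"}                         # input labels
Delta = {"x", "y"}                         # output labels
I     = {0}                                # initial states
F     = {2}                                # final states
# E is a subset of Q x (Sigma + eps) x (Delta + eps) x Q, the transition relation
E     = {(0, "a", "x", 1), (1, "b", "y", 2), (1, "", "y", 2)}

T = (Q, Sigma, Delta, E, I, F)

# well-formedness checks straight from the definition
assert I <= Q and F <= Q
assert all(q in Q and p in Q and a in Sigma | {""} and b in Delta | {""}
           for (q, a, b, p) in E)
```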
Finite State Transducers – Black Box View

w2 ∈ L2 (upper string)
       ↕
Finite State Transducer T
       ↕
w1 ∈ L1 (lower string)

• We say T transduces the lower string w1 into the
upper string w2 in the upward direction (lookup).
• We say T transduces the upper string w2 into the
lower string w1 in the downward direction (lookdown).
• A given string may map to zero or more strings.
Lexical Transducers
• In common parlance, a transducer is a
device which converts one form of energy
into another, e.g. a microphone converts
from sound to electrical signals.
• Lexical transducers convert one
string of symbols into another.
Example
• A lexical transducer is a specialized finite-state
automaton that maps lexical forms and
morphological specifications to corresponding
inflected forms, and vice versa.
• For example, a lexical transducer for English
might relate the words:
dining – dine+PresPart
swam – swim+Past
Lexical Transducer Example

lexical string:  C A S A
surface string:  C A S E

• Input: CASE
• Output: CASA
Morphological Analysis

lexical string:  C O N T A R E +V +1P +SG
surface string:  C O N T O ε ε  ε   ε   ε

• Input: CONTO
• Output: CONTARE +V +1P +SG
Nominal Inflection FST
Remarks
• ε stands for "epsilon". During analysis, epsilon
transitions are taken freely without consuming any
input.
• Note also single symbols with multi-character
print names (e.g. +SG).
• The order of these symbols, and the choice of
infinitive as baseform, is determined by linguists.
Synthesis
• Transducers are reversible. This means
that they can be used to perform the
inverse transduction.
• The process of synthesis is the inverse of
analysis.
The Process of Synthesis
• Start at the start state and at the beginning
of the input string.
• Match the input symbols against the
upper-side symbols of the arcs,
consuming symbols until a final state is
reached.
• If successful, return the string of lower-
side symbols (else nothing).
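The synthesis procedure can be sketched the same way as analysis, with the matching side swapped (a toy illustration, not the lecture's code; the (state, lower, upper, next_state) transition format is an assumption):

```python
# Sketch of synthesis as the inverse of analysis: match the input against
# the upper-side symbols of the transitions and collect the lower-side
# symbols along an accepting path.

def synthesize(transitions, finals, lexical, state=0):
    """Return the lower-side string for `lexical`, or None if rejected."""
    if not lexical:
        return "" if state in finals else None
    for (src, lower, upper, dst) in transitions:
        if src == state and upper == lexical[0]:
            rest = synthesize(transitions, finals, lexical[1:], dst)
            if rest is not None:
                return lower + rest
    return None

# Toy CASA/CASE transducer run in the downward direction.
T = [(0, "c", "c", 1), (1, "a", "a", 2), (2, "s", "s", 3), (3, "e", "a", 4)]
print(synthesize(T, {4}, "casa"))  # case
```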
Morphological Synthesis

lexical string:  C O N T A R E +V +1P +SG
surface string:  C O N T O ε ε  ε   ε   ε

• Input: CONTARE +V +1P +SG
• Output: CONTO
• N.B. ε symbols are ignored on output
Analysis and Synthesis
• Upper Side Language (Lexical Strings).
• Lower Side Language (Surface Strings).
• Transducer maps between the two.
• However large the lexical transducer may
become, analysis and synthesis are
performed by the same language-
independent matching techniques.

FSTs
Transitions
c:c a:a t:t +N:ε +PL:s
• c:c means read a c on one tape and write a c on the
other
• +N:ε means read a +N symbol on one tape and write
nothing on the other
• +PL:s means read +PL and write an s
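These pair-labelled transitions can be illustrated with a small sketch (toy code, not the lecture's implementation; note that symbols such as +N are single multi-character symbols, so the tapes are lists rather than raw strings):

```python
# The c:c a:a t:t +N:eps +PL:s arcs as (upper, lower) pairs, with the
# empty string "" standing in for epsilon.
arcs = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", ""), ("+PL", "s")]

def generate(arcs, lexical_tape):
    """Read the lexical tape, write the surface string (eps writes nothing)."""
    # this toy machine is a single linear path, so the lexical tape must
    # match the upper-side labels exactly
    assert [up for (up, low) in arcs] == lexical_tape
    surface = []
    for (up, low) in arcs:
        if low:                      # epsilon contributes nothing to output
            surface.append(low)
    return "".join(surface)

print(generate(arcs, ["c", "a", "t", "+N", "+PL"]))  # cats
```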
English Plural

surface   lexical
cat       cat+N+Sg
cats      cat+N+Pl
foxes     fox+N+Pl
mice      mouse+N+Pl
sheep     sheep+N+Pl
sheep     sheep+N+Sg
Morphological Analyser
To build a morphological analyser we need:
• lexicon: the list of stems and affixes, together
with basic information about them
• morphotactics: the model of morpheme
ordering (e.g. the English plural morpheme
follows the noun rather than a verb)
• orthographic rules: these spelling rules are
used to model the changes that occur in a word,
usually when two morphemes combine (e.g.,
fly+s = flies)
Lexicon & Morphotactics
• Typically the list of word parts (lexicon) and
the models of ordering can be combined
together into an FSA which will recognise
all the valid word forms.
• For this to be possible the word parts must
first be classified into sublexicons.
• The FSA defines the morphotactics
(ordering constraints).
Sublexicons
To classify the list of word parts:

reg-noun   irreg-pl-noun   irreg-sg-noun   plural
cat        mice            mouse           -s
fox        sheep           sheep
           geese           goose
FSA Expresses Morphotactics
(ordering model)
Intermediate Form to Surface
• The reason we need to have an
intermediate form is that funny things
happen at morpheme boundaries, e.g.
cat^s → cats
fox^s → foxes
fly^s → flies
• The rules which describe these changes
are called orthographic rules or "spelling
rules".
More English Spelling Rules
• consonant doubling: beg / begging
• y replacement: try/tries
• k insertion: panic/panicked
• e deletion: make/making
• e insertion: watch/watches
• Each rule can be stated in more detail ...
Spelling Rules
• Chomsky & Halle (1968) invented a
special notation for spelling rules.
• A very similar notation is embodied in the
"conditional replacement" rules of xfst.
E -> F || L _ R
which means replace E with F when it
appears between left context L and right
context R
A Particular Spelling Rule
This rule does e-insertion

^ -> e || x _ s#
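As a rough illustration, the effect of this rule on an intermediate string can be imitated with a regular-expression replacement (a sketch only, assuming # corresponds to end-of-string and that remaining boundaries are deleted by a separate default rule):

```python
import re

# e-insertion, ^ -> e || x _ s#: rewrite the morpheme boundary ^ as e when
# it sits between x and a word-final s.
def e_insertion(intermediate):
    return re.sub(r"(?<=x)\^(?=s$)", "e", intermediate)

# Default cleanup: any surviving boundary symbol is simply deleted.
def drop_boundary(s):
    return s.replace("^", "")

print(drop_boundary(e_insertion("fox^s")))  # foxes
print(drop_boundary(e_insertion("cat^s")))  # cats
```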
Typical Uses
• Typically, we’ll read from one tape using
the first symbol on the machine transitions
(just as in a simple FSA).
• And we’ll write to the second tape using
the other symbols on the transitions.
Composing Transducers – Example

English Numeral to Number Transducer:
  One thousand two hundred seventy three → 1273
Number to Turkish Numeral Transducer:
  1273 → Bin iki yüz yetmiş üç
Composed (English Numeral to Turkish Numeral Transducer):
  One thousand two hundred seventy three → Bin iki yüz yetmiş üç
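The idea of composition can be illustrated with the transducers modeled as plain lookup tables (the entries below are toy stand-ins, not full numeral grammars):

```python
# Toy stand-ins for the two numeral transducers.
english_to_number = {"one thousand two hundred seventy three": 1273}
number_to_turkish = {1273: "bin iki yüz yetmiş üç"}

def compose(f, g):
    """Relation composition: keep x -> g(f(x)) wherever both maps apply."""
    return {x: g[y] for x, y in f.items() if y in g}

english_to_turkish = compose(english_to_number, number_to_turkish)
print(english_to_turkish["one thousand two hundred seventy three"])
# bin iki yüz yetmiş üç
```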
Multi-Tape Machines
• To deal with this we can simply add more
tapes and use the output of one tape
machine as the input to the next.
• So to handle irregular spelling changes
we'll add intermediate tapes with
intermediate symbols.
Multi-Level Tape Machines
• We use one machine to transduce between the
lexical and the intermediate level, and another to
handle the spelling changes to the surface tape.
Stage 1:
Lexical  Intermediate Levels
• Example:
– g o o s e +N +PL (lexical)
– g e e s e # (intermediate)
• Example:
– g o o s e +N +SG (lexical)
– g o o s e # (intermediate)
• Example:
– m o u s e +N +PL (lexical)
– m i ε c e # (intermediate)
• Example:
– s h e e p +N +PL (lexical)
– s h e e p # (intermediate)
Morphological Analysis
• Morphological analysis can be seen
as a finite state transduction.
• A finite state transducer T maps the surface string
happiest ∈ English_Words to the lexical string
happy+Adj+Sup.
Morphological Analysis as FS
Transduction
• First approximation
• Need to describe
– Lexicon (of free and bound morphemes)
– Spelling change rules in a finite state
framework.
The Lexicon as a Finite State
Transducer
• Assume words have the form prefix+root+suffix where the prefix
and the suffix are optional.
So:
Prefix = [P1 | P2 | … | Pk]
Root = [R1 | R2 | … | Rm]
Suffix = [S1 | S2 | … | Sn]
Lexicon = (Prefix) Root (Suffix)
(R) = [R | ε], that is, R is optional.
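Treating each optional part as a set containing ε (the empty string), the (Prefix) Root (Suffix) template can be sketched as a cross product (toy word lists, invented for illustration):

```python
import itertools

# Optionality (R) = [R | eps] is modeled by including "" in each set.
prefixes = {"", "un+", "dis+"}
roots    = {"tie", "embark"}
suffixes = {"", "+ing", "+ed"}

# Lexicon = (Prefix) Root (Suffix): concatenate every combination.
lexicon = {p + r + s for p, r, s in itertools.product(prefixes, roots, suffixes)}

print("un+tie+ed" in lexicon)   # True
print("tie" in lexicon)         # True
```

As the following slides note, this first approximation overgenerates (e.g. it also accepts un+embark).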


The Lexicon as a Finite State
Transducer
• Prefix = [[u n +] | [d i s +] | [i n +]]
Root = [[t i e] | [e m b a r k] | [h a p p y] | [d e c e n t] | [f a s t e n]]
Suffix = [[+ s] | [+ i n g] | [+ e r] | [+ e d]]

tie, embark, happy, un+tie, dis+embark+ing,
in+decent, un+happy √
un+embark, in+happy+ing …. ✗
The Lexicon as a Finite State
Transducer
• Lexicon =
[ ([ u n +]) [ [t i e] | [ f a s t e n]] ([[+e d] | [ + i n g] | [+ s]]) ]
|
[ ([d i s +]) [ e m b a r k ] ([[+e d] | [ + i n g] | [+ s]])]
|
[ ([u n +]) [ h a p p y] ([+ e r])]
|
[ (i n +) [ d e c e n t] ]

Note that some patterns are now emerging: tie, fasten, embark are verbs, but
differ in prefixes; happy and decent are adjectives, but behave differently.
The Lexicon
• The lexicon structure can be refined to a
point so that all and only valid forms are
accepted and others rejected.
• This is very painful to do manually for any
(natural) language.
Describing Lexicons
• Currently available systems for morphology
provide a simple scheme for describing finite
state lexicons.
– Xerox Finite State Tools
– PC-KIMMO
• Roots and affixes are grouped and linked to
each other as required by the morphotactics.
• A compiler (lexc) converts this description to a
finite state transducer.
Lexicon as a FS Transducer

[Figure: a lexicon transducer whose paths pair lexical strings with surface
strings, sharing arcs where morphemes overlap, e.g.
  h a p p y +Adj +Sup : h a p p y + e s t
  s a v e +Verb +Past : s a v e + e d
  t a b l e +Noun +Pl : t a b l e + s
  s a v e +Verb +Pres +3sg : s a v e + s ]

A typical lexicon will be represented with 10^5 to 10^6 states.
Lexicon as a FS Transducer
(Nondeterminism)

[Figure: the same lexicon transducer; because paths share arcs, the machine
is nondeterministic.]
The Lexicon Transducer
• Note that the lexicon transducer solves
part of the problem.
– It maps from a sequence of morphemes to
root and features.
– Where do we get the sequence of
morphemes?
Morphological Analyzer Structure

happiest
  → Morphographemic Transducer (????????)
  → happy+est
  → Lexicon Transducer
  → happy+Adj+Sup
Sneak Preview (of things to come)

[Figure: composing the Morphographemic Transducer (happiest → happy+est)
with the Lexicon Transducer (happy+est → happy+Adj+Sup) yields a single
Morphological Analyzer/Generator mapping happiest directly to
happy+Adj+Sup.]
The Morphographemic
Transducer
• The morphographemic transducer
generates
– all possible ways the input word can be
segmented and “unmangled”
– As sanctioned by the alternation rules of the
language
• Graphemic conventions
• Morphophonological processes (reflected in the
orthography)
The Morphographemic
Transducer
• The morphographemic transducer thinks:
– There may be a morpheme boundary
between i and e, so let me mark that with a +.
– There is an i+e situation now, and
– there is a rule that says: change the i to a
y in this context.
– So let me output happy+est.

happiest → (Morphographemic Transducer) → happy+est
The Morphographemic
Transducer
• However, the morphographemic
transducer is oblivious to the lexicon:
– it does not really know about words and
morphemes,
– but rather about what happens when you
combine them.

happiest → (Morphographemic Transducer) → happy+est, h+ap+py+e+st, happiest, …

Only some of these will actually be sanctioned by the lexicon.
What kind of changes does the MG
Transducer handle?
• Insertions
– brag+ed → bragged
• Deletions
– (T) koy+nHn → koyun (of the bay)
– (T) alın+Hm+yA → alnıma (to my forehead)
• Changes
– happy+est → happiest
– (T) tarak+sH → tarağı (his comb)
– (G) Mann+er → Männer
