Learning Abstract Phonological from Auditory Phonetic
Categories: An Integrated Model for the Acquisition of
Language-Specific Sound Categories
Paul Boersma*, Paola Escudero† and Rachel Hayes‡
* University of Amsterdam, Amsterdam, The Netherlands
† Utrecht University, Utrecht, The Netherlands
‡ University of Arizona, Tucson, USA
ABSTRACT

We introduce a two-stage model for the perceptual acquisition of speech sound categories within the framework of Stochastic Optimality Theory and the Gradual Learning Algorithm [1]. During the first stage, learning of language-specific sound categories by infants is driven by distributional evidence in the linguistic input. This auditory-driven learning leads to a warping of the baby’s perceptual space, to discrimination curves, to the perceptual magnet effect, and ultimately to the creation of phonetic categories. In the transition to the second stage, these phonetic categories turn into simple abstract phonological categories. During the second stage, when the lexicon is in place, lexically-driven learning will develop more abstract representations and optimize multi-dimensional perception. The results of our simulations compare well with findings from the infant literature and from cross-language studies.

1. INTRODUCTION

Infants have a remarkable capacity to calculate the statistical distributions of auditory phonetic information in their linguistic input. It has been argued that their knowledge of these distributions ultimately leads to the creation of phonetic categories at 6-8 months of age [2]. This auditory-driven learning has been modelled in the domains of cognitive science and psychology with neural networks whose outcomes automatically reflect the statistical distributions of the language input [3]. Once the lexicon is in place, however, more abstract levels of representation come into being: features combine into segments (as witnessed by the development of the weighting of auditory cues [4]), and allophones combine into phonemes [5]. For cue weighting, this lexicon-driven learning has been modelled in the domain of linguistics by Stochastic Optimality Theory and the Gradual Learning Algorithm [6]. The present work proposes an underlying mechanism common to both kinds of learning, and explicitly models the transition between the two. The model employs a gradual perceptual learning device that is fed by two types of evidence: (i) acoustic events in the linguistic input, which give birth to ‘phonetic’ categories; and (ii) lexical representations, which lead to the development of ‘phonological’ categories.

2. LEARNING PHONETIC CATEGORIES

2.1 Speech perception in Optimality Theory

We model early speech perception learning according to the proposal of Functional Phonology [7], where three families of competing constraints determine the mapping from auditory inputs to phonetic categories. A family of PERCEIVE constraints militates against not perceiving auditory inputs at all. Thus, PERCEIVE (F1: [700 Hz]) requires the listener to treat an auditory input with an F1 value of 700 Hz as a member of some category. The particular category that it is assigned to is determined by two types of constraints: *CATEG(ORIZE) and *WARP. The *CATEG family punishes perceptual categories with particular acoustic values; e.g., *CATEG (F1: /700 Hz/) militates against perceiving an incoming F1 into the ‘category’ /700 Hz/. The *WARP family requires every acoustic input to be perceived as a member of the most similar available category. Thus, *WARP (F1: 40 Hz) says that an acoustic input with an F1 of 680 Hz should not be perceived as any F1 ‘category’ that is 40 Hz off (or more), i.e. as /640 Hz/ or /720 Hz/ or anything even farther away.

2.2 The initial state

We propose that in the initial state of the infant, all *CATEG constraints are ranked high, and all PERCEIVE constraints are ranked low. This means that it is worse for the child to perceive an incoming F1 as something than not to perceive the incoming F1 at all:

  [340 Hz]   | *CATEG (/320/) | *CATEG (/340/) | PERCEIVE ([340]) | *WARP (20)
  -----------+----------------+----------------+------------------+-----------
  /320 Hz/   |       *!       |                |                  |     *
  /340 Hz/   |                |       *!       |                  |
  ☞ /–/      |                |                |        *         |

This tableau (which for reasons of space contains only a very small subset of all the constraints and candidates) shows that an incoming [340 Hz] will be perceived as the null candidate /–/, which violates the constraint PERCEIVE (F1: [340 Hz]), because the competing candidates /320 Hz/ and /340 Hz/ violate higher-ranked constraints. *WARP (F1: 20 Hz) is violated if [340 Hz] is perceived as /320 Hz/; its ranking does not contribute to determining the winner here.
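The evaluation in such a tableau can be made concrete in a few lines of code. The sketch below is a hypothetical encoding, not the paper’s actual implementation: each constraint is a function that counts a candidate’s violations, constraints are listed from highest-ranked to lowest, and the winner is the candidate whose violation profile is best when read from the top of the ranking down. It reproduces the initial-state tableau for an incoming [340 Hz], where the null percept (written "/-/" here) wins.

```python
# Sketch of Optimality-Theoretic evaluation for the initial-state tableau.
# Constraint inventory and rankings are illustrative, not the paper's grammar.

def evaluate(candidates, ranked_constraints):
    """Return the most harmonic candidate: compare violation counts
    constraint by constraint, highest-ranked constraint first."""
    def profile(cand):
        return tuple(c(cand) for c in ranked_constraints)
    return min(candidates, key=profile)

# Candidates for the auditory input [340 Hz]: two 'categories' plus the null percept.
candidates = ["/320 Hz/", "/340 Hz/", "/-/"]

# Initial state: *CATEG constraints outrank PERCEIVE; *WARP is at the bottom.
ranked_constraints = [
    lambda cand: 1 if cand == "/320 Hz/" else 0,  # *CATEG (F1: /320 Hz/)
    lambda cand: 1 if cand == "/340 Hz/" else 0,  # *CATEG (F1: /340 Hz/)
    lambda cand: 1 if cand == "/-/" else 0,       # PERCEIVE (F1: [340 Hz])
    lambda cand: 1 if cand == "/320 Hz/" else 0,  # *WARP (F1: 20 Hz)
]

print(evaluate(candidates, ranked_constraints))  # the null percept /-/ wins
```

Because the comparison is lexicographic over the ranked constraints, reranking the same constraint functions changes the winner, which is exactly the degree of freedom the learning algorithm below exploits.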
2.3 The learning mechanism

The learner will not be satisfied with always perceiving the null category. We propose that when she hears an F1 of [340 Hz], her blind innate distributional learning device will tell her that she should have perceived this as the ‘identical’ value /340 Hz/. This value should therefore be included in the tableau’s top left cell, which contains all the information given to the learner. The child will now consider one of her candidates ‘correct’, as shown with a check mark in the next tableau:

  [340 Hz] /340 Hz/ | *CATEG (/320/) | *CATEG (/340/) | PERCEIVE ([340]) | *WARP (20)
  ------------------+----------------+----------------+------------------+-----------
  /320 Hz/          |       *!       |                |                  |     *
  √ /340 Hz/        |                |      *!→       |                  |
  ☞ /–/             |                |                |        ←*        |

Now that the learner knows that she has made an error, she can learn: she will lower the ranking of the constraints violated in the form that she considers correct (this is shown by the rightward arrow in the tableau) and raise the ranking of the constraints violated in her own winning form (the leftward arrow). This procedure is the Gradual Learning Algorithm [1]. In Stochastic Optimality Theory, constraints are ranked along a continuous scale, and the algorithm typically reranks them in small steps, achieving accuracy and robustness. After many incoming [340 Hz] values, *CATEG (F1: /340 Hz/) will ultimately fall below PERCEIVE (F1: [340 Hz]), and the infant will perceive this input ‘correctly’:

  [340 Hz] /340 Hz/ | *CATEG (/320/) | PERCEIVE ([340]) | *CATEG (/340/) | *WARP (20)
  ------------------+----------------+------------------+----------------+-----------
  /320 Hz/          |       *!       |                  |                |     *
  √ ☞ /340 Hz/      |                |                  |       *        |
  /–/               |                |        *!        |                |

Learning will now stop, since the learner now considers the output of her grammar correct...

But the child does not only hear [340 Hz] values. Suppose that she also hears some F1 values of [320 Hz], but less often than [340 Hz]. The constraints *CATEG (F1: /320 Hz/) and PERCEIVE (F1: [320 Hz]) will move, but less so than the constraints for 340 Hz. A possible ranking is shown in the following tableau:

  [320 Hz] /320/ | *WARP (60) | PERCEIVE ([340]) | PERCEIVE ([320]) | *CATEG (/320/) | *CATEG (/340/) | *WARP (20)
  ---------------+------------+------------------+------------------+----------------+----------------+-----------
  √ /320/        |            |                  |                  |      *!→       |                |
  ☞ /340/        |            |                  |                  |                |       ←*       |     ←*
  /–/            |            |                  |        *!        |                |                |

We see how the auditory input [320 Hz] is perceived into the ‘category’ /340 Hz/. This happens because the *WARP constraint against perceiving an input into a category that is off by 20 Hz is ranked very low (20 Hz is below the just noticeable difference for formants [8]). But incoming F1 values that are more distant from 340 Hz will not be perceived as /340 Hz/, as the next tableau shows:

  [280 Hz] /280/ | *WARP (60) | PERCEIVE ([280]) | *CATEG (/280/) | *CATEG (/320/) | *CATEG (/340/) | *WARP (40)
  ---------------+------------+------------------+----------------+----------------+----------------+-----------
  √ /280/        |            |                  |      *!→       |                |                |
  ☞ /320/        |            |                  |                |       ←*       |                |     ←*
  /340/          |     *!     |                  |                |                |       *        |     *
  /–/            |            |        *!        |                |                |                |

Thus, if [280 Hz] is even less common in the input than [320 Hz], an incoming [280] will be perceived as /320/. The listener has established a compromise: the high ranking of *WARP (60) tells her that /340 Hz/ is too far off, and the relatively high ranking of *CATEG (/280 Hz/) tells her that /280 Hz/ is too uncommon a ‘category’. We observe that as a result of distributional skewings in her language environment, the infant will warp her perceptual space in favour of the commonest F1 values. A situation in which some F1 values are more common than others is likely to occur in practice, namely as the result of a finite number of vowel height categories in the speakers of the ambient language. Suppose a language has vowels with average produced heights of 340 and 480 Hz. If we shelve the problems of between-speaker variation, the environment will have an F1 distribution with peaks around [340] and [480] Hz. The model just described predicts that the infant will learn to map incoming F1 values as shown in Figure 1.

[Figure 1: Warping of the perceptual space. Each incoming F1 value between 260 and 580 Hz is mapped to a perceived value nearer one of the distributional peaks at 340 and 480 Hz.]

2.4 Simulation of distributional learning

The previous two tableaux show that in the later stages a learning step no longer simply involves one *CATEG lowering and one PERCEIVE raising, but involves an intricate movement of *CATEG constraints. To check if Figure 1 actually results, we ran a computer simulation (in Praat) on a 20-Hz discretization of the F1 continuum, giving 122 constraints. Their initial rankings were:

  *WARP (F1: 800 Hz): ranked at a height of 800
  *WARP (F1: 780 Hz): ranked at a height of 780
  ...
  *WARP (F1: 60 Hz): ranked at a height of 60
  *CATEG (F1: /200 Hz/), *CATEG (/220 Hz/), ..., *CATEG (/1000 Hz/): all ranked at a height of 0
  PERCEIVE (F1: [200 Hz]), PERCEIVE ([220 Hz]), ..., PERCEIVE (F1: [1000 Hz]): all ranked at −1000
  *WARP (F1: 40 Hz): ranked at −10⁹
  *WARP (F1: 20 Hz): ranked at −10⁹
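This learning procedure can be sketched in a few dozen lines of Python. The following is an illustrative re-implementation, not the paper’s Praat code: it assumes a shorter continuum (200-500 Hz), only two vowels (at 280 and 420 Hz), far fewer tokens, and a smaller initial gap between *CATEG and PERCEIVE (−100 instead of −1000) so that categories emerge quickly; all of those departures are simplifications made for brevity, and the candidate set is pruned to values within 100 Hz of the input for speed.

```python
import random

random.seed(1)

STEP = 20
CATS = list(range(200, 501, STEP))             # scaled-down F1 continuum
categ = {c: 0.0 for c in CATS}                 # *CATEG (F1: /c/), initially high
perc = {c: -100.0 for c in CATS}               # PERCEIVE (F1: [c]), initially low
warp = {d: (float(d) if d >= 60 else -1e9)     # *WARP (F1: d); sub-JND distances at the bottom
        for d in range(STEP, 301, STEP)}

def evaluate(x, noise):
    """Stochastic OT evaluation: add Gaussian noise to every ranking, then
    pick the candidate whose highest-ranked violation is the lowest.
    None stands for the null percept /-/."""
    n_categ = {c: r + random.gauss(0, noise) for c, r in categ.items()}
    n_warp = {d: r + random.gauss(0, noise) for d, r in warp.items()}
    n_perc = perc[x] + random.gauss(0, noise)

    def worst(cand):
        if cand is None:
            return n_perc                      # /-/ violates only PERCEIVE (F1: [x])
        marks = [n_categ[cand]]                # *CATEG (F1: /cand/)
        # a candidate D Hz away violates *WARP (d) for every d <= D:
        marks += [n_warp[d] for d in range(STEP, abs(cand - x) + 1, STEP)]
        return max(marks)

    candidates = [None] + [c for c in CATS if abs(c - x) <= 100]
    return min(candidates, key=worst)

def learn(x, plasticity, noise):
    """One GLA step; the 'correct' form is the identical value /x/."""
    winner = evaluate(x, noise)
    if winner == x:
        return
    categ[x] -= plasticity                     # demote constraints violated by the correct form
    if winner is None:                         # promote constraints violated by the winner
        perc[x] += plasticity
    else:
        categ[winner] += plasticity
        for d in range(STEP, abs(winner - x) + 1, STEP):
            warp[d] += plasticity

N = 20000
plasticity = 1.0
decay = (0.1 / plasticity) ** (1.0 / N)        # plasticity drops gradually during learning
for _ in range(N):
    f1 = random.gauss(random.choice([280, 420]), 30)
    x = min(max(round(f1 / STEP) * STEP, 200), 500)
    learn(x, plasticity, noise=2.0)
    plasticity *= decay

print(evaluate(280, noise=0.0), evaluate(420, noise=0.0))
```

In runs of this sketch the distributional peaks come to be perceived as (values at or next to) themselves, while less common neighbouring values are attracted towards them: the warping of Figure 1 on a smaller scale.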
We fed the learner with 400,000 F1 values drawn from an environment with four vowels with average F1 values of 280, 420, 560, and 700 Hz and standard deviations of 30 Hz. The plasticity (the size of the step by which rankings can change on each input) was taken to drop gradually from 1.0 to 0.001, and the evaluation noise (the power of the noise added to the ranking of each constraint at evaluation time) was taken constant at 2.0. After learning finished, we ran 1000 tokens of each of the 41 possible F1 values through the resulting grammar. Figure 2 shows how the listener perceived each F1 value. We see that as a result of the noisy evaluation, every incoming F1 value can be perceived as several different F1 values. Nevertheless, there is a clear warping of the perceptual space: perceived values cluster around 280, 420, 560, and 700 Hz. Inputs higher than 800 Hz are still perceived as /–/ because of their rarity.

[Figure 2: Noisy warping of the perceptual space. Perceived F1 (Hz) plotted against incoming F1 (Hz), both axes from 200 to 1000 Hz.]

2.5 The discrimination task

The warping of the auditory F1 space leads to a change in perceptual distances. Around the distributional peak of 420 Hz, for instance, distances shrink by one third, since [390] is on average perceived as /400/, and [450] as /440/. Around the distributional valley of 490 Hz, the reverse happens: [460] is perceived on average as /445/, [520] as /535/. Figure 3 shows the perceived distance for an acoustic difference of 60 Hz centred around every acoustic F1 value. Thus, the perceptual space is warped in such a way that differences near ambient category centres are less well perceived than differences near ambient category boundaries: we observe discrimination effects without the infant having any discrete categories yet. This perceptual magnet effect occurs in real infants [9] and has been modelled with simulated neural networks [10]. We have been able to model the same effect within a linguistic framework.

[Figure 3: Perceived distances in the warped perception of F1, for an acoustic distance of 60 Hz. Perceived F1 difference (Hz, 0 to 100) plotted against centre F1 (Hz, 230 to 970).]

2.6 Conversion to discrete categories

The output of the ‘grammar’ is commensurate with its input, i.e., the input and output are expressed in the same units, namely Hertz. This special situation allows us to feed the output back to the input, giving a self-enhancing circuit that is capable of warping the frequencies quite far away from their original values. In the example of Figure 1, an input of 580 Hz maps to 540 Hz in the first cycle; in the second cycle, this 540 Hz maps to 500 Hz, which maps to 480 Hz in the third. After this, it will not change any further (480 Hz maps to 480 Hz). Thus, all inputs between 410 and 580 Hz will ultimately map to 480 Hz, and all inputs between 260 and 410 Hz will map to 340 Hz. The number of possible outputs has now become finite, and we can call the values /340 Hz/ and /480 Hz/ discrete phonetic categories.

3. LEARNING PHONOLOGICAL CATEGORIES

Now that discrete categories exist, the child can give them arbitrary labels, severing the connection to the actual continuous F1 values. To stay in line with traditional phonological terminology, we label /340 Hz/ as /high/ and /480 Hz/ as /mid/.

3.1 Lexically-driven optimization of perception

Once categories are discrete labels (feature values), the child can store lexical entries economically as structures consisting only of these labels and their discrete temporal and hierarchical relations. The *WARP constraints have to be translated to accommodate the new categories: if the category centres for /high/ and /mid/ are 340 and 480 Hz, *WARP (F1: 80 Hz) will be split into “[260 Hz] is not /high/”, “[420 Hz] is not /high/”, “[400 Hz] is not /mid/”, and “[560 Hz] is not /mid/”, all initially ranked equally high. The universally lower-ranked *WARP (F1: 60 Hz) splits into four lower-ranked constraints, among which “[420 Hz] is not /mid/”. These initial rankings cause the listener to have reasonably good initial categorization performance:

  [420 Hz]  | PERCEIVE ([420]) | [420 Hz] is not /high/ | [420 Hz] is not /mid/
  ----------+------------------+------------------------+----------------------
  /high/    |                  |           *!           |
  ☞ /mid/   |                  |                        |          *

But the lexicon can now act as a supervisor for achieving a more accurate perception. If it tells the listener that she should have perceived this particular token as /high/ rather than as /mid/, perhaps because the semantic context forces a recognition of sheep rather than ship, the listener will take appropriate action by making sure that she will be more likely to perceive the next [420 Hz] as /high/ (in the tableau, the lexical recognition is part of the input, i.e. the facts known to the child, and is therefore written between pipes in the top left cell):

  [420 Hz] |high| | PERCEIVE ([420]) | [420 Hz] is not /high/ | [420 Hz] is not /mid/
  ----------------+------------------+------------------------+----------------------
  √ /high/        |                  |          *!→           |
  ☞ /mid/         |                  |                        |          ←*
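This supervised reranking step can be sketched in code as well. The fragment below is hypothetical, not the paper’s implementation: the two rankings (100 and 96), the plasticity, and the assumed 70% rate of underlyingly /high/ tokens are all illustrative. It shows how repeated lexically-driven updates of the two cue constraints reshape the perception of an ambiguous [420 Hz].

```python
import random

random.seed(0)

# Hypothetical rankings for the two cue constraints discussed above;
# "is not /mid/" starts lower, so the listener initially hears [420 Hz] as /mid/.
rank = {"[420 Hz] is not /high/": 100.0, "[420 Hz] is not /mid/": 96.0}
NOISE, PLASTICITY = 2.0, 0.05

def perceive_420():
    """Stochastic evaluation: the category whose 'is not' constraint happens
    to be ranked lower at evaluation time is the perceived one."""
    noisy = {c: r + random.gauss(0.0, NOISE) for c, r in rank.items()}
    if noisy["[420 Hz] is not /high/"] < noisy["[420 Hz] is not /mid/"]:
        return "/high/"
    return "/mid/"

def lexical_update(intended):
    """GLA step driven by the lexicon: on a mismatch, demote the constraint
    violated by the lexically correct form and promote the constraint
    violated by the learner's own winner."""
    heard = perceive_420()
    if heard != intended:
        rank["[420 Hz] is not " + intended] -= PLASTICITY
        rank["[420 Hz] is not " + heard] += PLASTICITY

# Suppose 70% of the ambient [420 Hz] tokens are underlyingly /high/:
for _ in range(5000):
    lexical_update("/high/" if random.random() < 0.7 else "/mid/")

# The learner ends up probability matching: she now perceives [420 Hz]
# as /high/ at roughly the rate at which it is /high/ in her environment.
p_high = sum(perceive_420() == "/high/" for _ in range(10000)) / 10000
print(round(p_high, 2))
```

Because both constraints move on every mismatch, the ranking difference settles where errors in the two directions cancel out, which is exactly the probability-matching equilibrium described in the next paragraph of the text.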
In the case of noisy evaluation the learner will ultimately become a probability-matching listener, i.e., her probability of perceiving [420 Hz] as /high/ or /mid/ will mimic the distribution of underlyingly /high/ and /mid/ tokens realized as [420 Hz] in her environment [7].

3.2 High-level perceptual integration

There will be more discretized continua than just F1. For instance, acoustic vowel duration may have been divided into categories arbitrarily labelled /short/ and /long/, with constraints such as “[91 ms] is not /short/”. Around 9 months of age, infants start to integrate multiple categories into higher-level abstractions: 9-month-old but not 6-month-old infants use both sequential and rhythmic information to recognize two-syllable words in a larger speech stream [11], and multi-dimensional categorization occurs in the development of visual categories by infants from 9 months on as well [12]. Thus, in some varieties of English, the vowel of the lexical entry sheep will be stored with the feature values /high, long/, the vowel of ship with /mid, short/, and the child will learn to use both F1 and duration in perceiving this /i/–/I/ contrast. This perception can be modelled with initially high-ranked constraints against feature co-occurrence, i.e. */high, long/, */high, short/, */mid, long/, and */mid, short/. The tableau at the bottom of this page, which was the end result of our computer simulation of learning with plausibly distributed sheep–ship tokens, shows how a relatively long token of /I/, despite a preference for perceiving /long/ rather than /short/, will nevertheless be perceived correctly as /mid, short/.

3.3 Low-level perceptual integration

Once categories have arbitrary labels, the child can consider the relations of each category with all auditory continua, not just with one. Thus, there is nothing against including perverse-sounding constraints like “[an F1 of 430 Hz] is not /short/” and “[a duration of 91 ms] is not /mid/”, initially low-ranked. Such a procedure is needed for e.g. the integration of [vowel duration], [burst strength], and [closure duration] in the perception of the English word-final obstruent voicing contrast [6].

4. CONCLUSION

By expressing the insights of cognitive-psychological speech perception research with the decision mechanism of the linguistic framework of Stochastic Optimality Theory, our model provides an explicit explanation of the auditory-driven and lexicon-driven mechanisms that underlie the acquisition of language-specific sound categorization.

REFERENCES

[1] Paul Boersma and Bruce Hayes, “Empirical tests of the Gradual Learning Algorithm.” Linguistic Inquiry 32, 45–86, 2001.

[2] Jessica Maye, Janet F. Werker and LouAnn Gerken, “Infant sensitivity to distributional information can affect phonetic discrimination.” Cognition 82, B101–B111, 2002.

[3] Kay Behnke, The acquisition of phonetic categories in young infants: a self-organising artificial neural network approach. Doctoral thesis, Universiteit Twente [Max Planck Institute Series in Psycholinguistics 5], 1998.

[4] Susan Nittrouer, “Discriminability and perceptual weighting of some acoustic cues to speech perception by 3-year-olds.” JSHR 39, 278–297, 1996.

[5] Judith E. Pegg and Janet F. Werker, “Adult and infant perception of two English phones.” JASA 102, 3742–3753, 1997.

[6] Paola Escudero and Paul Boersma, “Modelling the perceptual development of phonological contrasts with Optimality Theory and the Gradual Learning Algorithm.” Proceedings of the 25th Penn Linguistics Colloquium, to appear.

[7] Paul Boersma, Functional Phonology. Doctoral thesis, University of Amsterdam. The Hague: Holland Academic Graphics, 1998.

[8] Diane Kewley-Port, “Thresholds for formant-frequency discrimination of vowels in consonantal context.” JASA 97, 3139–3146, 1995.

[9] Patricia K. Kuhl, “Human adults and human infants show a ‘perceptual magnet effect’ for the prototypes of speech categories, monkeys do not.” Perception and Psychophysics 50, 93–107, 1991.

[10] Frank H. Guenther and Marin N. Gjaja, “The perceptual magnet effect as an emergent property of neural map formation.” JASA 100, 1111–1121, 1996.

[11] James L. Morgan and Jenny R. Saffran, “Emerging integration of sequential and suprasegmental information in preverbal speech segmentation.” Child Development 66, 911–936, 1995.

[12] Barbara A. Younger, “Parsing objects into categories: Infants’ perception and use of correlated attributes.” In D. H. Rakison and L. Oakes (eds.), Early concept and category development, ch. 4. Oxford University Press, 2003.
Tableau referred to in section 3.2: a relatively long [500 Hz, 104 ms] token, lexically recognized as |mid, short|, is perceived correctly as /mid, short/.

  [500 Hz, 104 ms] |mid, short| | */high, short/ | */mid, long/ | [500 Hz] is not /high/ | [104 ms] is not /short/ | [104 ms] is not /long/ | */high, long/ | */mid, short/ | [500 Hz] is not /mid/
  -----------------------------+----------------+--------------+------------------------+-------------------------+------------------------+---------------+---------------+----------------------
  /high, long/                 |                |              |           *!           |                         |            *           |       *       |               |
  /high, short/                |       *!       |              |           *            |            *            |                        |               |               |
  /mid, long/                  |                |      *!      |                        |                         |            *           |               |               |          *
  √ ☞ /mid, short/             |                |              |                        |            *            |                        |               |       *       |          *
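The evaluation in this final tableau can be checked mechanically. In the sketch below the constraint ranking is taken directly from the tableau’s column order (leftmost is highest); the code is an illustrative re-implementation of ordinary OT evaluation, not the simulation that produced the ranking.

```python
# Two-dimensional (F1 + duration) categorization of [500 Hz, 104 ms],
# following the ranking shown in the final tableau (highest-ranked first).

RANKING = [
    "*/high, short/",
    "*/mid, long/",
    "[500 Hz] is not /high/",
    "[104 ms] is not /short/",
    "[104 ms] is not /long/",
    "*/high, long/",
    "*/mid, short/",
    "[500 Hz] is not /mid/",
]

# Which constraints each candidate percept violates for this auditory event:
VIOLATIONS = {
    "/high, long/":  {"*/high, long/", "[500 Hz] is not /high/", "[104 ms] is not /long/"},
    "/high, short/": {"*/high, short/", "[500 Hz] is not /high/", "[104 ms] is not /short/"},
    "/mid, long/":   {"*/mid, long/", "[500 Hz] is not /mid/", "[104 ms] is not /long/"},
    "/mid, short/":  {"*/mid, short/", "[500 Hz] is not /mid/", "[104 ms] is not /short/"},
}

def winner():
    """Standard OT evaluation: compare candidates constraint by constraint
    from the top of the ranking down; the least-offending candidate wins."""
    def profile(cand):
        return tuple(1 if c in VIOLATIONS[cand] else 0 for c in RANKING)
    return min(VIOLATIONS, key=profile)

print(winner())  # /mid, short/
```

The long token still comes out as /mid, short/ because the candidate /mid, long/ stumbles on the high-ranked co-occurrence constraint */mid, long/ before its duration cue can help it, matching the pointing finger in the tableau.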