Acoustics of Speech
Julia Hirschberg
CS 4706
5/22/2019 1
Claim: How things are said can be critical to
understanding
• I.e., varying phrasing, prominence, pitch range,
speaking rate, pitch contour, voice
quality… conveys meaning
• What is our evidence? How do we prove it?
– Observation
– Hypotheses
– Experimentation (perception, production)
– Speech analysis (independent variables)
– Correlation with dependent variable
• What does our data look like?
• What tools do we have for analysis?
What is sound?
• Pressure fluctuations in the air caused by a
musical instrument, a car horn, a voice
– Cause eardrum to move
– Auditory system translates into neural
impulses
– Brain interprets as sound
• Can we tell one sound from another?
• Can we distinguish one particular sound in
‘noise’?
– From a speech-centric point of view, when
sound is not produced by the human voice,
we may term it noise
– Ratio of speech-generated sound to other
simultaneous sound: signal-to-noise ratio
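A minimal sketch of the ratio above, assuming we already have separate sample sequences for the speech and for the other sound, and assuming the ratio is reported in decibels (a common convention, not stated on the slide). The function name `snr_db` and the toy sample values are illustrative only.

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB from two sample sequences."""
    p_signal = sum(s * s for s in signal) / len(signal)  # mean power of speech
    p_noise = sum(n * n for n in noise) / len(noise)     # mean power of noise
    return 10 * math.log10(p_signal / p_noise)

speech = [0.5, -0.5, 0.5, -0.5]          # toy "speech" samples
background = [0.05, -0.05, 0.05, -0.05]  # toy "noise" samples, 10x smaller

print(round(snr_db(speech, background), 1))  # 20.0
```

A tenfold difference in amplitude is a hundredfold difference in power, which is why the result is 20 dB.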
How ‘Loud’ are Common Sounds?
Event                Pressure (µPa)   dB
Absolute threshold   20               0
Whisper              200              20
Quiet office         2K               40
Conversation         20K              60
Bus                  200K             80
Subway               2M               100
Thunder              20M              120
*DAMAGE*             200M             140
Some Sounds are Periodic
• Simple Periodic Waves (sine waves) defined by
– Frequency: how often does pattern repeat per
time unit
• Cycle: one repetition
• Period: duration of cycle
• Frequency = # cycles per time unit, e.g.
– Frequency in Hz = 1 / period in sec
– E.g. a 400 Hz tone has a period of 1/400 = .0025 sec
(400 cycles complete in 1 sec)
– Amplitude: peak deviation of pressure from
normal atmospheric pressure
– Phase: timing of waveform relative to a
reference point
• Complex periodic waves
• Cyclic but composed of two or more sine waves
• Fundamental frequency (F0): rate at which largest
pattern repeats (also GCD of component freqs)
• Components not always easily identifiable: power
spectrum graphs amplitude vs. frequency
• Any complex waveform can be analyzed into a set
of sine waves with their own frequencies,
amplitudes, and phases (Fourier’s theorem)
– E.g. some speech sounds (mostly vowels)
cat.wav
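To make Fourier analysis concrete, here is a small sketch that builds a complex periodic wave from two sine waves and recovers its components with a naive discrete Fourier transform. The sampling rate, duration, component frequencies (200 and 300 Hz), and the 0.1 amplitude threshold are all assumptions chosen so the arithmetic comes out cleanly; real speech needs windowing and a proper FFT.

```python
import cmath
import math
from functools import reduce

def dft_magnitudes(x):
    """Naive DFT; returns the amplitude found in each frequency bin."""
    n = len(x)
    mags = []
    for k in range(n // 2):
        s = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(2 * abs(s) / n)  # scaled so a sine of amplitude A reads A
    return mags

rate = 1000   # samples per second (assumed)
n = 100       # 0.1 sec of signal, so bins are rate / n = 10 Hz apart
wave = [1.0 * math.sin(2 * math.pi * 200 * t / rate)
        + 0.5 * math.sin(2 * math.pi * 300 * t / rate)
        for t in range(n)]

mags = dft_magnitudes(wave)
components = [k * rate // n for k, m in enumerate(mags) if m > 0.1]
print(components)                    # [200, 300]
print(reduce(math.gcd, components))  # 100, the F0 of the combined wave
```

The combined pattern repeats 100 times per second even though neither component is at 100 Hz, which is the GCD relation mentioned above.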
Some Sounds are Aperiodic
• Waveforms with random or non-repeating
patterns
– Random aperiodic waveforms: white noise
• Flat spectrum: equal amplitude for all frequency
components
– Transients: sudden bursts of pressure (clicks,
pops, door slams)
• Waveform shows a single impulse (click.wav)
• Fourier analysis shows a flat spectrum
• Some speech sounds, e.g. many consonants
(e.g. cat.wav)
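The flat spectrum of a transient can be checked directly. This sketch uses a synthetic single-sample impulse as a stand-in for a click (it does not read click.wav), and a naive DFT written out by hand; the window length of 64 samples is an arbitrary assumption.

```python
import cmath
import math

def magnitude_spectrum(x):
    """Magnitude of each DFT bin from 0 up to the Nyquist bin."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

click = [0.0] * 64
click[10] = 1.0   # one impulse, everything else silent

spectrum = magnitude_spectrum(click)
print(round(min(spectrum), 6), round(max(spectrum), 6))  # 1.0 1.0
```

Every frequency bin has the same magnitude, so the spectrum is flat.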
Speech Production
• Voiced and voiceless sounds
• Vocal fold vibration filtered by the Vocal tract
produces complex periodic waveform
– Cycles per sec of lowest frequency
component of signal = fundamental frequency
(F0)
– Fourier analysis yields power spectrum with
component frequencies and amplitudes
• F0 is first (lowest frequency) peak
• Harmonics are integer multiples of F0; resonances
of the vocal tract shape their amplitudes
Vocal fold vibration
[UCLA Phonetics Lab demo]
Places of articulation
[Diagram labels, front to back: labial, dental, alveolar,
post-alveolar/palatal, velar, uvular, pharyngeal,
laryngeal/glottal]
http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html
How do we capture speech for analysis?
• Recording conditions
– A quiet office, a sound booth, an anechoic
chamber
• Microphones
• Analog devices (e.g. tape recorders) store and
analyze continuous air pressure variations
(speech) as a continuous signal
• Digital devices (e.g. computers, DAT) first
convert continuous signals into discrete signals
(A-to-D conversion)
• File format:
– .wav, .aiff, .ds, .au, .sph,…
– Conversion programs, e.g. sox
• Storage
– Function of how much information we store
about speech in digitization
• Higher quality, closer to original
• More space (1000s of hours of speech take up a
lot of space)
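A rough sketch of the storage arithmetic, assuming uncompressed audio with no file header. The function name `storage_bytes` and the two example configurations are illustrative assumptions, not settings prescribed by the course.

```python
def storage_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Size in bytes of uncompressed audio, ignoring any file header."""
    return sample_rate_hz * bits_per_sample * channels * seconds // 8

one_hour = 3600
telephone = storage_bytes(8000, 8, 1, one_hour)   # 8 kHz, 8-bit, mono
cd = storage_bytes(44100, 16, 2, one_hour)        # 44.1 kHz, 16-bit, stereo

print(telephone / 1e6, "MB per hour")   # 28.8 MB per hour
print(cd / 1e6, "MB per hour")          # 635.04 MB per hour
```

The same hour of speech costs about 22 times more space at the higher-quality setting, so 1000s of hours add up quickly.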
Sampling
• Sampling rate: how often do we need to
sample?
– At least 2 samples per cycle to capture
periodicity of a waveform component at a
given frequency
• 100 Hz waveform needs 200 samples per sec
• Nyquist frequency: highest-frequency component
captured with a given sampling rate (half the
sampling rate)
Sampling/storage tradeoff
• Human hearing: ~20 kHz top frequency
– Do we really need to store 40K samples per
second of speech?
• Telephone speech: 300-4000 Hz (8 kHz sampling)
– But some speech sounds (e.g. fricatives /f/, /s/
and the bursts of stops /p/, /t/, /d/) have energy
above 4 kHz!
– Peter/teeter/Dieter
• 44.1 kHz (CD quality audio) vs. 16-22 kHz (usually good
enough to study pitch, amplitude, duration, …)
Sampling Errors
• Aliasing:
– Signal’s frequency higher than half the
sampling rate
– Solutions:
• Increase the sampling rate
• Filter out frequencies above half the sampling rate
(anti-aliasing filter)
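The following sketch illustrates aliasing with an assumed 8000 samples/sec rate: a 5000 Hz sine lies above the 4000 Hz Nyquist frequency and, once sampled, cannot be told apart from a 3000 Hz sine. The helper name `alias_frequency` is hypothetical.

```python
import math

def alias_frequency(f_hz, sample_rate_hz):
    """Apparent frequency of a sine at f_hz after sampling."""
    f = f_hz % sample_rate_hz
    return min(f, sample_rate_hz - f)   # fold back below the Nyquist frequency

rate = 8000
print(alias_frequency(3000, rate))   # 3000: below Nyquist, unchanged
print(alias_frequency(5000, rate))   # 3000: above Nyquist, aliased

# The sampled values match a 3000 Hz sine exactly, apart from a sign flip.
s5 = [math.sin(2 * math.pi * 5000 * n / rate) for n in range(16)]
s3 = [math.sin(2 * math.pi * 3000 * n / rate) for n in range(16)]
print(all(abs(a + b) < 1e-9 for a, b in zip(s5, s3)))   # True
```

Once the samples are stored there is no way to recover the original 5000 Hz, which is why the filtering has to happen before sampling.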
Quantization
• Measuring the amplitude at sampling points:
what resolution to choose?
– Integer representation
– 8, 12 or 16 bits per sample
• Noise due to quantization steps is reduced by
higher resolution, but that requires more storage
– How many different amplitude levels do we
need to distinguish?
– Choice depends on data and application (44.1 kHz
16-bit stereo requires ~10 MB of storage per minute)
– But clipping occurs when input volume is
greater than range representable in digitized
waveform
• Increase the resolution
• Decrease the amplitude
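A small sketch of quantization and clipping, assuming input amplitudes are scaled so that the range -1.0 to 1.0 fills the available integer range. The function name `quantize` and the test amplitudes are assumptions for illustration.

```python
def quantize(x, bits):
    """Map an amplitude to a signed integer of the given bit width."""
    levels = 2 ** (bits - 1)                   # e.g. 32768 for 16 bits
    q = round(x * levels)
    return max(-levels, min(levels - 1, q))    # clip to representable range

for bits in (8, 12, 16):
    print(bits, "bits:", 2 ** bits, "amplitude levels")

print(quantize(0.5, 8))          # 64
print(quantize(0.5, 16))         # 16384
print(quantize(1.7, 16))         # 32767: too loud, so the peak is clipped
print(quantize(1.7 * 0.5, 16))   # 27853: halve the amplitude and it fits
```

With 8 bits there are only 256 levels, so nearby amplitudes collapse onto the same value; with 16 bits there are 65536.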
What can we do if our data is ‘noisy’?
• Acoustic filters block out certain frequencies of
sounds
– Low-pass filter blocks high frequency
components of a waveform
– High-pass filter blocks low frequencies
– Reject band (what to block) vs. pass band
(what to let through)
• But if the frequencies of two sounds overlap,
filtering cannot pull them apart… source separation
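As one very crude example of a low-pass filter, a moving average lets slow variation through and cancels fast variation. This is only a sketch under assumed values (8000 samples/sec, a 50 Hz and a 2000 Hz sine, a 4-sample window); practical filters are designed with much sharper pass and reject bands.

```python
import math

def moving_average_lowpass(x, width):
    """Replace each sample by the mean of the last `width` samples."""
    return [sum(x[i - width + 1:i + 1]) / width
            for i in range(width - 1, len(x))]

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

rate = 8000
low = [math.sin(2 * math.pi * 50 * n / rate) for n in range(1600)]
high = [math.sin(2 * math.pi * 2000 * n / rate) for n in range(1600)]

# A 4-sample window spans exactly one 2000 Hz cycle at this rate,
# so that component averages out while the 50 Hz one barely changes.
print(round(rms(moving_average_lowpass(low, 4)), 2))
print(round(rms(moving_average_lowpass(high, 4)), 2))
```

Subtracting the low-pass output from the (aligned) input would give a correspondingly crude high-pass filter.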
How can we capture pitch contours, pitch
range?
• What is the pitch contour of this utterance? Is
the pitch range of X greater than that of Y?
• Pitch tracking: estimate F0 over time as a function
of vocal fold vibration
• A periodic waveform is correlated with itself
– One period looks much like another (cat.wav)
– Find the period by finding the ‘lag’ (offset)
between two windows on the signal for which
the correlation of the windows is highest
– Lag duration (T) is 1 period of waveform
– Inverse is F0 (1/T)
• Errors to watch for:
– Halving: shortest lag calculated is too long
(underestimate pitch)
– Doubling: shortest lag too short (overestimate
pitch)
– Microprosody errors (e.g. /v/)
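The autocorrelation idea can be sketched in a few lines. This is a simplified illustration for a single window, not the algorithm of any particular pitch tracker; the search range (`f0_min`, `f0_max`), the synthetic 200 Hz test wave, and the function name `estimate_f0` are all assumptions.

```python
import math

def estimate_f0(x, rate, f0_min=75, f0_max=500):
    """Return (F0 estimate, correlation score) for one window of samples."""
    min_lag = int(rate / f0_max)   # shortest period we will consider
    max_lag = int(rate / f0_min)   # longest period we will consider
    energy = sum(v * v for v in x)
    best_lag, best_score = None, -1.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the window with a copy of itself shifted by `lag`.
        score = sum(x[i] * x[i + lag] for i in range(len(x) - lag)) / energy
        if score > best_score:
            best_lag, best_score = lag, score
    return rate / best_lag, best_score

rate = 8000
wave = [math.sin(2 * math.pi * 200 * n / rate)
        + 0.5 * math.sin(2 * math.pi * 400 * n / rate)
        for n in range(800)]

print(estimate_f0(wave, rate)[0])              # 200.0
print(estimate_f0(wave, rate, f0_max=150)[0])  # 100.0, a halving error
```

In the second call the true period (40 samples) is excluded from the search, so the best remaining lag is two periods long and the estimate drops to half the true F0.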
Sample Analysis File: Pitch Track Header
• version 1
• type_code 4
• frequency 12000.000000
• samples 160768
• start_time 0.000000
• end_time 13.397333
• bandwidth 6000.000000
• dimensions 1
• maximum 9660.000000
• minimum -17384.000000
• time Sat Nov 2 15:55:50 1991
• operation record: padding xxxxxxxxxxxx
Sample Analysis File: Pitch Track Data
(Columns: F0, Pvoicing, Energy, A/C Score)
• 147.896 1 2154.07 0.902643
• 140.894 1 1544.93 0.967008
• 138.05 1 1080.55 0.92588
• 130.399 1 745.262 0.595265
• 0 0 567.153 0.504029
• 0 0 638.037 0.222939
• 0 0 670.936 0.370024
• 0 0 790.751 0.357141
• 141.215 1 1281.1 0.904345
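Data in this shape is easy to summarize. The sketch below copies the nine rows shown above and assumes the four columns are F0, a voicing flag, energy, and autocorrelation score, in that order, as the slide heading suggests.

```python
rows = """\
147.896 1 2154.07 0.902643
140.894 1 1544.93 0.967008
138.05 1 1080.55 0.92588
130.399 1 745.262 0.595265
0 0 567.153 0.504029
0 0 638.037 0.222939
0 0 670.936 0.370024
0 0 790.751 0.357141
141.215 1 1281.1 0.904345
"""

frames = [tuple(float(v) for v in line.split()) for line in rows.splitlines()]

# Unvoiced frames carry F0 = 0, so drop them before computing statistics.
voiced_f0 = [f0 for f0, voiced, energy, ac in frames if voiced == 1]

print(len(frames), len(voiced_f0))                 # 9 5
print(round(sum(voiced_f0) / len(voiced_f0), 2))   # 139.69 (mean F0)
print(min(voiced_f0), max(voiced_f0))              # 130.399 147.896
```

Including the unvoiced zeros would drag the mean down and make the pitch range look far wider than it is.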
Pitch Perception
• But do pitch trackers capture what humans perceive?
• Auditory system’s perception of pitch is non-linear
– Sounds at lower frequencies with same difference in
absolute frequency sound more different than those at
higher frequencies (male vs. female speech)
– Bark scale (Zwicker) and other models of perceived
difference
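One commonly cited closed-form approximation of the Bark scale (attributed to Zwicker and Terhardt) makes the non-linearity visible. The choice of this particular formula and the example frequencies are assumptions; other approximations exist.

```python
import math

def hz_to_bark(f_hz):
    """Approximate Bark value for a frequency in Hz."""
    return (13 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500) ** 2))

low_step = hz_to_bark(200) - hz_to_bark(100)      # 100 Hz apart, low range
high_step = hz_to_bark(5100) - hz_to_bark(5000)   # 100 Hz apart, high range

print(round(low_step, 2), round(high_step, 2))
print(low_step > high_step)   # True
```

The same 100 Hz difference covers several times more of the perceptual scale at low frequencies than at high ones.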
How do we capture loudness/intensity?
• Is one utterance louder than another?
• Energy closely correlated experimentally with
perceived loudness
• For each window, square the amplitude values
of the samples, take their mean, and take the
root of that mean (RMS energy)
– What size window?
– Longer windows produce smoother amplitude
traces but miss sudden acoustic events
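The windowed RMS computation described above can be sketched as follows. The toy signal (silence, a 10-sample burst, silence) and the two window sizes are assumptions chosen to show the window-size tradeoff; non-overlapping windows are used for simplicity.

```python
import math

def rms_energy(samples, window, step):
    """RMS of each window: square, take the mean, take the root."""
    out = []
    for start in range(0, len(samples) - window + 1, step):
        frame = samples[start:start + window]
        out.append(math.sqrt(sum(v * v for v in frame) / window))
    return out

signal = [0.0] * 40 + [1.0, -1.0] * 5 + [0.0] * 50   # 100 samples

print([round(e, 2) for e in rms_energy(signal, 10, 10)])
# [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print([round(e, 2) for e in rms_energy(signal, 50, 50)])
# [0.45, 0.0]
```

The short window pinpoints the burst; the long window smears it into a much lower value spread over half the signal.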
Perception of Loudness
• But the relation is non-linear: sones or decibels (dB)
– Differences between soft sounds are more salient than
differences between loud sounds
– Intensity is proportional to the square of amplitude,
so the intensity of a sound with pressure x relative to a
reference sound with pressure r = x²/r²
– bel: base-10 log of that ratio
– decibel: one tenth of a bel
– dB = 10 log10(x²/r²) = 20 log10(x/r)
– Reference: absolute threshold (20 µPa, lowest audible
pressure fluctuation of a 1000 Hz tone), or the typical
threshold level for a tone at a given frequency
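The decibel formula can be checked against a few rows of the "How 'Loud' are Common Sounds?" table. This sketch assumes the pressures there are in micropascals, measured against the 20 µPa absolute threshold; the function name `pressure_to_db` is illustrative.

```python
import math

REFERENCE_UPA = 20.0   # absolute threshold, in micropascals (assumed unit)

def pressure_to_db(pressure_upa, reference_upa=REFERENCE_UPA):
    """dB = 10 * log10(x^2 / r^2) for pressure x and reference r."""
    return 10 * math.log10(pressure_upa ** 2 / reference_upa ** 2)

for name, p in [("Absolute threshold", 20),
                ("Whisper", 200),
                ("Conversation", 20_000),
                ("Thunder", 20_000_000)]:
    print(name, round(pressure_to_db(p)), "dB")
# Absolute threshold 0 dB
# Whisper 20 dB
# Conversation 60 dB
# Thunder 120 dB
```

Each tenfold increase in pressure adds 20 dB, because the pressure ratio is squared before the logarithm is taken.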
How do we capture….
• For utterances X and Y
• Pitch contour: Same or different?
• Pitch range: Is X larger than Y?
• Duration: Is utterance X longer than utterance
Y?
• Speaking rate: Is the speaker of X speaking
faster than the speaker of Y?
• Voice quality….
Next Class
• Tools for the Masses: Read the Praat tutorial
• Download Praat from the course syllabus page
and play with a speech file (e.g.
http://www.cs.columbia.edu/~julia/cs4706/cc_001_sadness_1669.04_August-second-.wav
or record your own)