
Acoustics of Speech: Julia Hirschberg CS 4706

This document discusses acoustics of speech and speech analysis. It covers several key topics: 1) How acoustic properties like phrasing, prominence, pitch range convey meaning in speech. Experimental evidence and tools for speech analysis are discussed. 2) Fundamental concepts in acoustics including the nature of sound, periodic and aperiodic waves, speech production mechanisms, and places of articulation. 3) Digital speech analysis including sampling, file formats, pitch tracking, and challenges in analyzing noisy speech. Pitch perception in humans is also briefly covered.


Acoustics of Speech

Julia Hirschberg
CS 4706

5/22/2019 1
Claim: How things are said can be critical to understanding
• I.e., varying phrasing, prominence, pitch range, speaking rate, pitch contour, voice quality… conveys meaning
• What is our evidence? How do we prove it?
– Observation
– Hypotheses
– Experimentation (perception, production)
– Speech analysis (independent variables)
– Correlation with dependent variable
• What does our data look like?
• What tools do we have for analysis?

What is sound?

• Pressure fluctuations in the air caused by a musical instrument, a car horn, a voice
– Cause eardrum to move
– Auditory system translates into neural impulses
– Brain interprets as sound
• Can we tell one sound from another?
• Can we distinguish one particular sound in ‘noise’?
– From a speech-centric point of view, when sound is not produced by the human voice, we may term it noise
– Ratio of speech-generated sound to other simultaneous sound: signal-to-noise ratio
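
A minimal sketch of how a signal-to-noise ratio might be computed, assuming the speech and the competing sound are available as separate lists of samples (rarely the case with real recordings). The names `mean_power` and `snr_db`, the test tones, and the choice to report the power ratio in decibels are illustrative assumptions, not part of the slides.

```python
import math

def mean_power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(speech, noise):
    """Signal-to-noise ratio in dB: 10 * log10(speech power / noise power)."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))

# Hypothetical data: a 200 Hz tone stands in for speech and a 50 Hz hum with
# one tenth of its amplitude stands in for noise (1 second at 8000 samples/sec).
rate = 8000
speech = [1.0 * math.sin(2 * math.pi * 200 * n / rate) for n in range(rate)]
noise = [0.1 * math.sin(2 * math.pi * 50 * n / rate) for n in range(rate)]

print(f"SNR: {snr_db(speech, noise):.1f} dB")
```

A tenfold amplitude ratio is a hundredfold power ratio, which is why the two tones differ by 20 dB.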

How ‘Loud’ are Common Sounds?

Event               Pressure (µPa)   dB
Absolute threshold  20               0
Whisper             200              20
Quiet office        2K               40
Conversation        20K              60
Bus                 200K             80
Subway              2M               100
Thunder             20M              120
*DAMAGE*            200M             140

Some Sounds are Periodic

• Simple periodic waves (sine waves) defined by
– Frequency: how often the pattern repeats per time unit
• Cycle: one repetition
• Period: duration of one cycle
• Frequency = number of cycles per time unit, e.g.
– Frequency in Hz = 1 / period in seconds
– E.g. 400 Hz pitch = 1/.0025 (1 cycle has a period of .0025 sec; 400 cycles complete in 1 sec)
– Amplitude: peak deviation of pressure from normal atmospheric pressure
– Phase: timing of waveform relative to a reference point
• Complex periodic waves
• Cyclic but composed of two or more sine waves
• Fundamental frequency (F0): rate at which the largest pattern repeats (also the GCD of the component frequencies)
• Components not always easily identifiable: power spectrum graphs amplitude vs. frequency
• Any complex waveform can be analyzed into a set of sine waves with their own frequencies, amplitudes, and phases (Fourier’s theorem)
– E.g. some speech sounds (mostly vowels): cat.wav
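
The relations above can be checked numerically. The sketch below assumes integer component frequencies in Hz; the function names and the example components (200, 300, 500 Hz) are hypothetical, chosen only to show that the summed wave repeats at the GCD of its components.

```python
import math
from functools import reduce

def frequency_hz(period_sec):
    """Frequency is the inverse of the period."""
    return 1.0 / period_sec

def fundamental_hz(component_freqs):
    """F0 of a complex periodic wave as the GCD of its integer component frequencies."""
    return reduce(math.gcd, component_freqs)

def complex_wave(t, component_freqs):
    """Pressure at time t (seconds) for a sum of unit-amplitude sine components."""
    return sum(math.sin(2 * math.pi * f * t) for f in component_freqs)

components = [200, 300, 500]          # hypothetical component frequencies in Hz
f0 = fundamental_hz(components)       # largest pattern repeats at this rate
period = 1.0 / f0

# The complex wave has the same value one F0 period later, at any time t.
for t in (0.0, 0.0013, 0.0042):
    difference = complex_wave(t + period, components) - complex_wave(t, components)
    print(f"t={t}: difference after one period = {abs(difference):.2e}")
```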

Some Sounds are Aperiodic
• Waveforms with random or non-repeating patterns
– Random aperiodic waveforms: white noise
• Flat spectrum: equal amplitude for all frequency components
– Transients: sudden bursts of pressure (clicks, pops, door slams)
• Waveform shows a single impulse (click.wav)
• Fourier analysis shows a flat spectrum
• Some speech sounds, e.g. many consonants (e.g. cat.wav)
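
To illustrate the flat spectrum of a transient, here is a small sketch that uses a direct (slow) discrete Fourier transform on a synthetic one-sample click and, for contrast, on a pure tone. The helper `dft_magnitudes` and both test signals are assumptions made for the example; real analysis tools use an FFT. White noise is flat only on average, so it is left out of this deterministic demo.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 up to the Nyquist bin (direct O(n^2) sum)."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        magnitudes.append(abs(total))
    return magnitudes

# Synthetic click: a single non-zero sample in a 64-sample frame.
click = [0.0] * 64
click[10] = 1.0
click_spectrum = dft_magnitudes(click)

# Periodic contrast: a sine completing exactly 8 cycles in the frame.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
tone_spectrum = dft_magnitudes(tone)

print("click: min %.3f, max %.3f" % (min(click_spectrum), max(click_spectrum)))
print("tone:  strongest bin =", tone_spectrum.index(max(tone_spectrum)))
```

The click's energy is spread evenly over every bin, while the tone's energy sits in a single bin.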

Speech Production

• Voiced and voiceless sounds
• Vocal fold vibration filtered by the vocal tract produces a complex periodic waveform
– Cycles per sec of lowest frequency component of signal = fundamental frequency (F0)
– Fourier analysis yields power spectrum with component frequencies and amplitudes
• F0 is first (lowest frequency) peak
• Harmonics are integer multiples of F0; resonances of the vocal tract (formants) shape their amplitudes

Vocal fold vibration

[UCLA Phonetics Lab demo]

Places of articulation

[Figure: vocal tract diagram labeling the places of articulation: labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal]

http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html

How do we capture speech for analysis?

• Recording conditions
– A quiet office, a sound booth, an anechoic chamber
• Microphones
• Analog devices (e.g. tape recorders) store and analyze continuous air pressure variations (speech) as a continuous signal
• Digital devices (e.g. computers, DAT) first convert continuous signals into discrete signals (A-to-D conversion)
• File format:
– .wav, .aiff, .ds, .au, .sph,…
– Conversion programs, e.g. sox
• Storage
– Function of how much information we store about speech in digitization
• Higher quality, closer to original
• More space (1000s of hours of speech take up a lot of space)
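
A rough sketch of the storage arithmetic for uncompressed audio, assuming plain PCM with no header or compression. The function name `storage_bytes` and the three example configurations are illustrative choices.

```python
def storage_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Uncompressed PCM size: samples/sec * bytes/sample * channels * duration."""
    return sample_rate_hz * bits_per_sample * channels * seconds / 8

cd_minute = storage_bytes(44100, 16, 2, 60)            # CD-quality stereo, 1 minute
phone_minute = storage_bytes(8000, 8, 1, 60)           # 8-bit telephone-band mono, 1 minute
corpus = storage_bytes(16000, 16, 1, 1000 * 3600)      # 1000 hours of 16 kHz 16-bit mono

print(f"CD-quality minute:  {cd_minute / 1e6:.1f} MB")
print(f"Telephone minute:   {phone_minute / 1e6:.2f} MB")
print(f"1000-hour corpus:   {corpus / 1e9:.1f} GB")
```

One minute of CD-quality stereo comes to about 10.6 MB, more than twenty times the telephone-band figure.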

Sampling

• Sampling rate: how often do we need to sample?
– At least 2 samples per cycle to capture periodicity of a waveform component at a given frequency
• 100 Hz waveform needs 200 samples per sec
• Nyquist frequency: highest-frequency component captured with a given sampling rate (half the sampling rate)

Sampling/storage tradeoff

• Human hearing: ~20K Hz top frequency
– Do we really need to store 40K samples per second of speech?
• Telephone speech: 300-4K Hz (8K sampling)
– But some speech sounds (e.g. the fricatives /f/, /s/ and the bursts of the stops /p/, /t/, /d/) have energy above 4K!
– Peter/teeter/Dieter
• 44.1K (CD quality audio) vs. 16-22K (usually good enough to study pitch, amplitude, duration, …)

Sampling Errors

• Aliasing:
– Signal’s frequency is higher than half the sampling rate, so it is recorded as a spurious lower frequency
– Solutions:
• Increase the sampling rate
• Filter out frequencies above half the sampling rate (anti-aliasing filter)
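
A small sketch of aliasing under ideal sampling, with no anti-aliasing filter. The helper names and the 7000 Hz / 8000 Hz example are assumptions made for illustration: a 7000 Hz sine sampled at 8000 samples per second yields the same sample values as a sign-flipped 1000 Hz sine, so the two cannot be told apart after digitization.

```python
import math

def nyquist_hz(sample_rate_hz):
    """Highest frequency that a given sampling rate can represent."""
    return sample_rate_hz / 2.0

def alias_hz(freq_hz, sample_rate_hz):
    """Apparent frequency of a sine after sampling, folded into 0..Nyquist."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

def sample_sine(freq_hz, sample_rate_hz, count):
    """First `count` samples of a unit-amplitude sine."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz) for n in range(count)]

rate = 8000
too_high = sample_sine(7000, rate, 32)   # above the 4000 Hz Nyquist frequency
apparent = sample_sine(1000, rate, 32)   # the frequency it is mistaken for

print("Nyquist:", nyquist_hz(rate), "Hz; 7000 Hz aliases to", alias_hz(7000, rate), "Hz")
print("max |7000 Hz sample + 1000 Hz sample|:",
      max(abs(a + b) for a, b in zip(too_high, apparent)))
```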

Quantization

• Measuring the amplitude at sampling points: what resolution to choose?
– Integer representation
– 8, 12 or 16 bits per sample
• Noise due to quantization steps is reduced by higher resolution, but this requires more storage
– How many different amplitude levels do we need to distinguish?
– Choice depends on data and application (44.1K 16-bit stereo requires ~10 MB storage per minute)
– But clipping occurs when input volume is greater than range representable in digitized waveform
• Increase the resolution
• Decrease the amplitude
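
A minimal sketch of quantization and clipping, assuming input amplitudes normalized to the range -1.0 to 1.0 and signed integer samples. The `quantize` and `dequantize` helpers are hypothetical; real A-to-D converters and file formats differ in detail.

```python
def quantize(x, bits):
    """Map an amplitude in [-1.0, 1.0) to a signed integer, clipping out-of-range input."""
    levels = 2 ** (bits - 1)
    value = int(round(x * levels))
    return max(-levels, min(levels - 1, value))

def dequantize(q, bits):
    """Map a signed integer sample back to the normalized amplitude scale."""
    return q / 2 ** (bits - 1)

amplitude = 0.3
for bits in (8, 12, 16):
    restored = dequantize(quantize(amplitude, bits), bits)
    print(f"{bits:2d} bits: {2 ** bits:5d} levels, rounding error {abs(restored - amplitude):.6f}")

# Input louder than the representable range is clipped to the largest level.
print("clipped:", quantize(2.0, 16), quantize(-2.0, 16))
```

Each extra bit doubles the number of amplitude levels, so the rounding error shrinks while the storage per sample grows.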

What can we do if our data is ‘noisy’?

• Acoustic filters block out certain frequencies of sounds
– Low-pass filter blocks high frequency components of a waveform
– High-pass filter blocks low frequencies
– Reject band (what to block) vs. pass band (what to let through)
• But if frequencies of two sounds overlap… source separation
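
As a rough illustration of low-pass filtering, the sketch below uses a moving average, which is only a crude stand-in for a properly designed filter. The window width, the test tones, and the function names are all assumptions made for the example.

```python
import math

def moving_average(samples, width):
    """Crude low-pass filter: each output is the mean of `width` consecutive inputs."""
    return [sum(samples[i:i + width]) / width for i in range(len(samples) - width + 1)]

def rms(samples):
    """Root-mean-square amplitude."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

rate = 8000
low = [math.sin(2 * math.pi * 100 * n / rate) for n in range(rate)]    # 100 Hz tone
high = [math.sin(2 * math.pi * 3000 * n / rate) for n in range(rate)]  # 3000 Hz tone

# Fraction of each tone's amplitude that survives a 5-sample moving average.
low_gain = rms(moving_average(low, 5)) / rms(low)
high_gain = rms(moving_average(high, 5)) / rms(high)

print(f"100 Hz passes with gain {low_gain:.2f}; 3000 Hz with gain {high_gain:.2f}")
```

The low tone passes almost unchanged while the high tone is strongly attenuated. Subtracting the smoothed signal from the original would give a correspondingly crude high-pass filter.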

How can we capture pitch contours, pitch range?
• What is the pitch contour of this utterance? Is the pitch range of X greater than that of Y?
• Pitch tracking: estimate F0 over time as a function of vocal fold vibration
• A periodic waveform is correlated with itself
– One period looks much like another (cat.wav)
– Find the period by finding the ‘lag’ (offset) between two windows on the signal for which the correlation of the windows is highest
– Lag duration (T) is 1 period of waveform
– Inverse is F0 (1/T)
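
The lag search described above can be sketched as follows for a single frame. This is a simplified illustration, not the algorithm of any particular pitch tracker: the 75-500 Hz search range, the normalized correlation score, and the synthetic test frame are all assumptions.

```python
import math

def autocorrelation_f0(samples, sample_rate_hz, min_hz=75.0, max_hz=500.0):
    """Estimate F0 of one frame from the lag with the highest normalized correlation."""
    min_lag = int(sample_rate_hz / max_hz)   # shortest period considered
    max_lag = int(sample_rate_hz / min_hz)   # longest period considered
    window = len(samples) - max_lag          # length of the two compared windows
    first = samples[:window]
    best_lag, best_score = None, -1.0
    for lag in range(min_lag, max_lag + 1):
        second = samples[lag:lag + window]
        num = sum(x * y for x, y in zip(first, second))
        den = math.sqrt(sum(x * x for x in first) * sum(y * y for y in second))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate_hz / best_lag, best_score

# Synthetic "voiced" frame: 125 Hz fundamental plus two harmonics, 40 ms at 8 kHz.
rate = 8000
frame = [math.sin(2 * math.pi * 125 * n / rate)
         + 0.5 * math.sin(2 * math.pi * 250 * n / rate)
         + 0.25 * math.sin(2 * math.pi * 375 * n / rate)
         for n in range(320)]

f0, score = autocorrelation_f0(frame, rate)
print(f"estimated F0: {f0:.1f} Hz (correlation {score:.3f})")
```

Restricting the lag range to plausible pitch periods is one simple guard against picking a multiple or a fraction of the true period.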
• Errors to watch for:
– Halving: lag chosen is too long, typically two periods (pitch underestimated)
– Doubling: lag chosen is too short, typically half a period (pitch overestimated)
– Microprosody errors (e.g. /v/)

Sample Analysis File: Pitch Track Header

• version 1
• type_code 4
• frequency 12000.000000
• samples 160768
• start_time 0.000000
• end_time 13.397333
• bandwidth 6000.000000
• dimensions 1
• maximum 9660.000000
• minimum -17384.000000
• time Sat Nov 2 15:55:50 1991
• operation record: padding xxxxxxxxxxxx

Sample Analysis File: Pitch Track Data

F0        Pvoicing  Energy    A/C Score
147.896   1         2154.07   0.902643
140.894   1         1544.93   0.967008
138.05    1         1080.55   0.92588
130.399   1         745.262   0.595265
0         0         567.153   0.504029
0         0         638.037   0.222939
0         0         670.936   0.370024
0         0         790.751   0.357141
141.215   1         1281.1    0.904345
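
A sketch of how rows like these might be read to summarize pitch, assuming four whitespace-separated fields per frame in the order given by the column header (F0, probability of voicing, energy, autocorrelation score). The parser and its field names are hypothetical; the numbers are the nine frames shown above.

```python
PITCH_TRACK = """\
147.896 1 2154.07 0.902643
140.894 1 1544.93 0.967008
138.05 1 1080.55 0.92588
130.399 1 745.262 0.595265
0 0 567.153 0.504029
0 0 638.037 0.222939
0 0 670.936 0.370024
0 0 790.751 0.357141
141.215 1 1281.1 0.904345
"""

def parse_pitch_track(text):
    """Parse 'F0 Pvoicing Energy A/C-score' rows into one dict per frame."""
    frames = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        f0, voicing, energy, ac_score = map(float, fields)
        frames.append({"f0": f0, "voiced": voicing >= 0.5,
                       "energy": energy, "ac_score": ac_score})
    return frames

frames = parse_pitch_track(PITCH_TRACK)
# Unvoiced frames carry F0 = 0 and must be left out of pitch statistics.
voiced_f0 = [frame["f0"] for frame in frames if frame["voiced"]]
mean_f0 = sum(voiced_f0) / len(voiced_f0)
pitch_range = (min(voiced_f0), max(voiced_f0))

print(f"{len(voiced_f0)} of {len(frames)} frames voiced")
print(f"mean F0 {mean_f0:.1f} Hz, range {pitch_range[0]}-{pitch_range[1]} Hz")
```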

Pitch Perception

• But do pitch trackers capture what humans perceive?
• Auditory system’s perception of pitch is non-linear
– Sounds at lower frequencies with the same difference in absolute frequency sound more different than those at higher frequencies (male vs. female speech)
– Bark scale (Zwicker) and other models of perceived difference
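
One way to see the non-linearity is to convert frequencies to Bark. The slides do not give a formula; the sketch below assumes the commonly cited Zwicker and Terhardt approximation, which is only one of several published conversions, and the comparison frequencies are illustrative.

```python
import math

def hz_to_bark(freq_hz):
    """Bark value from frequency, using the Zwicker & Terhardt approximation (an assumption)."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)

# The same 100 Hz difference, measured low and high in the frequency range.
low_step = hz_to_bark(200) - hz_to_bark(100)
high_step = hz_to_bark(4100) - hz_to_bark(4000)

print(f"100 -> 200 Hz:   {low_step:.2f} Bark")
print(f"4000 -> 4100 Hz: {high_step:.2f} Bark")
```

Under this approximation the low-frequency step spans several times more of the perceptual scale than the high-frequency step.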

How do we capture loudness/intensity?

• Is one utterance louder than another?
• Energy closely correlated experimentally with perceived loudness
• For each window, square the amplitude values of the samples, take their mean, and take the root of that mean (RMS energy)
– What size window?
– Longer windows produce smoother amplitude traces but miss sudden acoustic events
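
The RMS computation and the window-size tradeoff can be sketched as below, assuming consecutive non-overlapping windows (analysis tools often use overlapping ones). The function name and the toy signal, silence containing a 4-sample burst, are illustrative.

```python
import math

def rms_energy(samples, window_size):
    """RMS amplitude of each consecutive, non-overlapping window."""
    values = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        values.append(math.sqrt(sum(s * s for s in window) / window_size))
    return values

# Toy signal: 16 silent samples, a 4-sample burst at full amplitude, 12 silent samples.
signal = [0.0] * 16 + [1.0] * 4 + [0.0] * 12

short_trace = rms_energy(signal, 4)    # the burst fills one whole window
long_trace = rms_energy(signal, 16)    # the burst is averaged with silence

print("window of 4: ", short_trace)
print("window of 16:", long_trace)
```

With the short window the burst shows at its full amplitude in one frame; with the long window it is smeared to half that value.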

Perception of Loudness

• But the relation is non-linear: sones or decibels (dB)
– Differences in soft sounds more salient than in loud sounds
– Intensity proportional to square of amplitude, so intensity of sound with pressure x vs. reference sound with pressure r = $x^2/r^2$
– bel: base 10 log of ratio
– decibel: one tenth of a bel
– dB = $10 \log_{10}(x^2/r^2)$
– Absolute threshold (20 µPa, lowest audible pressure fluctuation of 1000 Hz tone), typical threshold level for tone at frequency
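
A short sketch of the decibel computation relative to the 20 µPa reference, using the identity $10 \log_{10}(x^2/r^2) = 20 \log_{10}(x/r)$. The function name `db_spl` and the sample pressures are illustrative.

```python
import math

REFERENCE_PA = 20e-6  # reference pressure r: 20 micropascals

def db_spl(pressure_pa, reference_pa=REFERENCE_PA):
    """Sound pressure level in dB relative to the reference pressure."""
    return 20.0 * math.log10(pressure_pa / reference_pa)

def db_from_intensity_ratio(pressure_pa, reference_pa=REFERENCE_PA):
    """Same quantity written as 10 * log10 of the squared-pressure (intensity) ratio."""
    return 10.0 * math.log10(pressure_pa ** 2 / reference_pa ** 2)

for label, pressure in [("threshold", 20e-6), ("whisper", 200e-6),
                        ("conversation", 20e-3), ("thunder", 20.0)]:
    print(f"{label:12s} {pressure:10.6f} Pa -> {db_spl(pressure):6.1f} dB")
```

Every tenfold increase in pressure adds 20 dB, because intensity grows with the square of pressure.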

How do we capture…

• For utterances X and Y
• Pitch contour: Same or different?
• Pitch range: Is X’s larger than Y’s?
• Duration: Is utterance X longer than utterance Y?
• Speaking rate: Is the speaker of X speaking faster than the speaker of Y?
• Voice quality…

Next Class

• Tools for the Masses: Read the Praat tutorial
• Download Praat from the course syllabus page and play with a speech file (e.g. http://www.cs.columbia.edu/~julia/cs4706/cc_001_sadness_1669.04_August-second-.wav or record your own)
