Chapter 1 Introduction

Uploaded by

fmlomat
Copyright
© © All Rights Reserved
Speech Processing

Course Code : CS300


Course Overview

Course Specification

Course Plan
Chapter 1
Introduction
Simple Periodic Waves (sine waves)
• Characterized by:
  – period: T
  – amplitude: A
  – phase: φ
• Fundamental frequency in cycles per second, or Hz
  – F0 = 1/T
[Figure: sine wave with one cycle marked; x-axis Time (s), 0 to 0.02; y-axis amplitude, –0.99 to 0.99]
Simple periodic waves
• Computing the frequency of a wave:
  – 5 cycles in 0.5 seconds = 10 cycles/second = 10 Hz
• Amplitude:
  – 1
• Equation:
  – Y = A sin(2πft)
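The equation above can be sketched in a few lines of Python. This is only an illustration: the function names, the 1000 Hz sampling rate, and the optional phase argument are assumptions made for the example, not part of the slides.

```python
import math

def sine_wave(amplitude, freq_hz, duration_s, sample_rate_hz, phase=0.0):
    """Return samples of y = A * sin(2*pi*f*t + phase), with t = n / sample_rate."""
    n_samples = int(round(duration_s * sample_rate_hz))
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz + phase)
        for n in range(n_samples)
    ]

def frequency_from_cycles(n_cycles, duration_s):
    """Frequency in Hz = number of cycles / duration in seconds."""
    return n_cycles / duration_s

# The slide's example: 5 cycles in 0.5 seconds, amplitude 1.
freq = frequency_from_cycles(5, 0.5)
period = 1.0 / freq  # T = 1 / F0
wave = sine_wave(amplitude=1.0, freq_hz=freq, duration_s=0.5, sample_rate_hz=1000)
print(freq, period, len(wave))
```

Here the frequency comes out as 10 Hz and the period as 0.1 s, matching the slide's arithmetic.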
Speech sound waves

A little piece from the waveform of the vowel [iy]


Y axis:
• Amplitude = amount of air pressure at that time point
• Positive is compression
• Zero is normal air pressure
• Negative is rarefaction
Digitizing Speech
[Figure: analog-to-digital converter]
Digitizing Speech

Analog-to-digital conversion, or A/D conversion.

Three steps
• Sampling
• Quantization
• Coding

[Figure: Mic → Sampler → Quantizer → Encoder]
Sampling
• Measuring the amplitude of the signal at time t
• The sampling rate needs to provide at least two samples for each cycle
  – Roughly speaking, one for the positive and one for the negative half of each cycle.
  – More than two samples per cycle is OK
  – Fewer than two samples per cycle will cause frequencies to be missed
  – So the maximum frequency that can be measured is one that is half the sampling rate.
  – The maximum frequency for a given sampling rate is called the Nyquist frequency
Sampling
Original signal in red:

If we measure only at the green dots, we will see a lower-frequency wave and miss the correct higher-frequency one!
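The effect in the figure (aliasing) can be checked numerically. The sketch below is an illustration under assumed values: an 8,000 Hz sampling rate and a 5,000 Hz tone are chosen for the example, and the helper names are hypothetical.

```python
import math

def nyquist_frequency(sample_rate_hz):
    """Highest frequency that can be measured at this sampling rate."""
    return sample_rate_hz / 2.0

def apparent_frequency(freq_hz, sample_rate_hz):
    """Frequency a pure tone appears to have after sampling (folded into 0..Nyquist)."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

def sample_cosine(freq_hz, sample_rate_hz, n_samples):
    """Sample cos(2*pi*f*t) at t = n / sample_rate."""
    return [
        math.cos(2.0 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]

fs = 8000
high = sample_cosine(5000, fs, 16)  # above the Nyquist frequency of 4000 Hz
low = sample_cosine(3000, fs, 16)   # the lower frequency it is mistaken for
# The two sampled sequences are indistinguishable (up to rounding error).
same_samples = all(abs(a - b) < 1e-9 for a, b in zip(high, low))
print(nyquist_frequency(fs), apparent_frequency(5000, fs), same_samples)
```

A tone below the Nyquist frequency is returned unchanged by `apparent_frequency`, which is the case where sampling loses nothing.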
Sampling
In practice, then, we use the following sample rates.
• 16,000 Hz (samples/sec): microphone (“wideband”)
• 8,000 Hz (samples/sec): telephone
Why?
• Need at least 2 samples per cycle
• Max measurable frequency is half the sampling rate
• Human speech < 10,000 Hz, so need at most 20 kHz
• Telephone speech is filtered at 4 kHz, so 8 kHz is enough
Sampling Theorem:
Sampling frequency ≥ 2 × maximum frequency of the signal

fs ≥ 2fm
where fs is the sampling frequency
and fm is the maximum frequency of the signal to be sampled.
Quantization
Definition:
“Representing the real value of each amplitude as an integer”

8-bit (-128 to 127) or 16-bit (-32768 to 32767)
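A minimal sketch of this step, assuming the real-valued amplitudes have already been normalized to the range [-1.0, 1.0] (that normalization and the function name are assumptions for the example; the slides only give the integer ranges):

```python
def quantize(sample, n_bits=16):
    """Map a real amplitude in [-1.0, 1.0] to a signed integer of n_bits."""
    max_int = 2 ** (n_bits - 1) - 1   # 32767 for 16 bits, 127 for 8 bits
    min_int = -(2 ** (n_bits - 1))    # -32768 for 16 bits, -128 for 8 bits
    value = int(round(sample * max_int))
    # Clip anything that falls outside the representable range.
    return max(min_int, min(max_int, value))

print(quantize(0.0), quantize(1.0), quantize(1.0, n_bits=8), quantize(-2.0))
```

Out-of-range input is clipped rather than wrapped around, which is one common design choice; other schemes (for example the 8-bit log compression listed under Formats) map amplitudes differently.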


Formats:
• 16-bit PCM (Pulse Code Modulation)
• 8-bit log compression
Headers:
• Raw (no header)
• Microsoft: filename.wav
• Sun: filename.au
[Figure: WAV format, with a 40-byte header]
Fundamental frequency

Waveform of the vowel [iy] (10 repetitions in 0.03875 s)

Frequency: repetitions per second of a wave
• The vowel above has 10 repetitions in 0.03875 s
• So the frequency is 10/0.03875 ≈ 258 Hz
• This is the rate at which the vocal folds vibrate, hence voicing
• Each peak corresponds to an opening of the vocal folds
• The frequency of the complex wave is called the fundamental frequency of the wave, or F0
Amplitude
• We need a way to talk about the amplitude of a region of a signal (a frame) over time.
• We can’t just average all the values. Why not? Because the positive and negative values cancel, so the average ≈ zero
• So we often talk about the Root Mean Square (RMS) amplitude
$$A_{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x[i]^2}$$
“The square Root of the Mean of the Squares of the
samples”
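As a quick check of why the plain average fails and RMS does not, here is a small sketch. The test frame (one full cycle of a unit-amplitude sine, 100 samples) and the function names are assumptions for the example.

```python
import math

def mean_amplitude(frame):
    """Plain average of the samples."""
    return sum(frame) / len(frame)

def rms_amplitude(frame):
    """Square root of the mean of the squares of the samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

# One full cycle of a sine wave with amplitude 1.
frame = [math.sin(2.0 * math.pi * n / 100) for n in range(100)]
frame_mean = mean_amplitude(frame)
frame_rms = rms_amplitude(frame)
print(frame_mean, frame_rms)
```

The mean is zero up to rounding error, while the RMS of a unit sine is about 0.707 (that is, 1/√2).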
Power and Intensity
Power: related to square of amplitude

$$\text{Power} = \frac{1}{N}\sum_{i=1}^{N} x[i]^2$$
N i1

Intensity in air: power normalized to the auditory threshold, given in dB.

P0 is the auditory threshold pressure = 2 × 10^-5 Pa

$$\text{Intensity} = 10 \log_{10}\left(\frac{\text{Power}}{P_0}\right) = 10 \log_{10}\left(\frac{1}{N P_0}\sum_{i=1}^{N} x[i]^2\right)$$
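The two formulas on this slide translate directly into code. This sketch follows the slide's definitions as written; the function names and the constant-valued test frames are assumptions for the example.

```python
import math

P0 = 2e-5  # auditory threshold pressure from the slide, in Pa

def power(frame):
    """Mean of the squared samples."""
    return sum(x * x for x in frame) / len(frame)

def intensity_db(frame, p0=P0):
    """10 * log10(power / P0), following the slide's definition.

    A frame of all zeros has zero power and raises ValueError in log10.
    """
    return 10.0 * math.log10(power(frame) / p0)

quiet = [1.0, -1.0, 1.0, -1.0]
loud = [2.0, -2.0, 2.0, -2.0]   # same frame at twice the amplitude
gain_db = intensity_db(loud) - intensity_db(quiet)
print(power(quiet), power(loud), gain_db)
```

Doubling the amplitude quadruples the power, which adds 10 log10(4) ≈ 6 dB regardless of the reference value.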
Plot of Intensity
Pitch and Loudness
• Pitch is the mental sensation, or perceptual correlate, of F0.
• The relationship between pitch and F0 is not linear; human pitch perception is most accurate between 100 Hz and 1000 Hz.
  – Linear in this range
  – Logarithmic above 1000 Hz
• The mel scale is one model of this F0-pitch mapping.
• A mel is a unit of pitch defined so that pairs of sounds which are perceptually equidistant in pitch are separated by an equal number of mels.

Frequency in mels = 1127 ln(1 + f/700)
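The mel formula can be tried out directly. In this sketch `hz_to_mel` is the slide's formula; `mel_to_hz` is its algebraic inverse, added here for illustration and not taken from the slides.

```python
import math

def hz_to_mel(freq_hz):
    """Frequency in mels = 1127 * ln(1 + f / 700)."""
    return 1127.0 * math.log(1.0 + freq_hz / 700.0)

def mel_to_hz(mels):
    """Inverse of hz_to_mel, obtained by solving the formula for f."""
    return 700.0 * (math.exp(mels / 1127.0) - 1.0)

for f in (100.0, 1000.0, 4000.0, 8000.0):
    print(f, round(hz_to_mel(f), 1))
```

With these constants 1000 Hz maps to roughly 1000 mels, and equal steps in Hz above that point correspond to progressively smaller steps in mels.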


Pitch track


Pitch

RETONE: manipulate pitch contour.

Record some speech and listen to what happens when you adjust its pitch contour.
She just had a baby

• Note that the vowels all have regular amplitude peaks
• Stop consonant
  – Closure followed by release
  – Notice the silence followed by slight bursts of emphasis: very clear for the [b] of “baby”
• Fricative: noisy, e.g. the [sh] of “she” at the beginning
Fricative
Waves have different frequencies
[Figure: two sine waves, 100 Hz (top) and 1000 Hz (bottom); x-axis Time (s), 0 to 0.02; y-axis amplitude, –0.99 to 0.99]
Complex waves: Adding a 100 Hz and a 1000 Hz wave together
[Figure: sum of the two waves; x-axis Time (s), 0 to 0.05; y-axis amplitude, –0.9654 to 0.99]

The Discrete Fourier Transform (DFT)

$$X(e^{j\omega}) = \mathcal{F}\{x(n)\} = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n}$$
Notes:
• X(e^{jω}) is a complex-valued continuous function of ω

• ω = 2πf [rad/sample]

• f is the digital frequency, measured in cycles per sample [C/S]


The Discrete Fourier Transform (DFT)
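Spectrum Analysis (Cont.)

$$X(e^{j\omega}) = \mathcal{F}\{x(n)\} = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n}$$

$$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n} = \sum_{n=-\infty}^{\infty} x(n)\left[\cos(\omega n) - j \sin(\omega n)\right]$$

$$= \sum_{n=-\infty}^{\infty} x(n)\cos(\omega n) - j \sum_{n=-\infty}^{\infty} x(n)\sin(\omega n)$$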
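The cosine and sine sums above can be evaluated directly for a finite signal. The sketch below is a naive illustration, not an efficient FFT: it assumes a finite frame of N samples and evaluates the sum only at the N frequencies ω = 2πk/N. The 4,000 Hz sampling rate, the 400-sample frame, and the function name are assumptions chosen so that 100 Hz and 1000 Hz fall exactly on analysis frequencies.

```python
import math

def dft(x):
    """Naive DFT: X[k] = sum_n x[n]*cos(w*n) - j * sum_n x[n]*sin(w*n), w = 2*pi*k/N."""
    n_samples = len(x)
    spectrum = []
    for k in range(n_samples):
        w = 2.0 * math.pi * k / n_samples
        real = sum(x[n] * math.cos(w * n) for n in range(n_samples))
        imag = -sum(x[n] * math.sin(w * n) for n in range(n_samples))
        spectrum.append(complex(real, imag))
    return spectrum

fs = 4000   # sampling rate in Hz (assumed for this example)
N = 400     # frame length, giving a resolution of fs / N = 10 Hz per bin
x = [
    math.sin(2.0 * math.pi * 100 * n / fs) + math.sin(2.0 * math.pi * 1000 * n / fs)
    for n in range(N)
]
X = dft(x)

# Keep only bins below the Nyquist frequency and pick the two strongest.
strongest = sorted(range(N // 2), key=lambda k: abs(X[k]), reverse=True)[:2]
peak_freqs = sorted(k * fs / N for k in strongest)
print(peak_freqs)
```

The two strongest components come out at 100 Hz and 1000 Hz, the frequencies of the two sine waves that were added together.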

ESynth - Mark Huckvale - University College London (speechandhearing.net)
Spectrum

Frequency components (100 and 1000 Hz) on the x-axis
[Figure: spectrum with peaks at 100 Hz and 1000 Hz; x-axis Frequency in Hz; y-axis Amplitude]

Fourier analysis:
any wave can be represented as the (infinite) sum of sine waves of different frequencies (amplitude, phase)

Spectrum of one instant in an actual sound wave: many components across the frequency range
[Figure: spectrum; x-axis Frequency (Hz), 0 to 5000; y-axis 0 to 40]
Part of [ae] waveform from “had”

• Note the complex wave repeating nine times in the figure
• Plus a smaller wave which repeats 4 times for every large pattern
• The large wave has a frequency of 250 Hz (9 times in 0.036 seconds)
• The small wave is roughly 4 times this, or roughly 1000 Hz
• Two tiny waves sit on top of each peak of the 1000 Hz wave
Back to spectrum
The spectrum represents these frequency components, computed by the Fourier transform, an algorithm which separates out each frequency component of a wave.

The x-axis shows frequency; the y-axis shows magnitude (in decibels, a log measure of amplitude).
Peaks at 930 Hz, 1860 Hz, and 3020 Hz.
Spectrogram: spectrum + time dimension
Note that the grey level represents the amplitude or energy.


Seeing formants: the spectrogram
[Figure: spectrogram with the third formant (F3), second formant (F2), and first formant (F1) labeled]

Formants
Vowels are largely distinguished by 2 characteristic pitches (F1 and F2).
One of them (the higher of the two) goes downward throughout the series iy ih eh ae aa ao ou u.
The other goes up for the first four vowels and then down for the next four.
These are called the “formants” of the vowels; the lower is the 1st formant, the higher is the 2nd formant.
Different vowels have different formants
• Vocal tract as "amplifier"; amplifies different frequencies
• Formants are the result of different shapes of the vocal tract.
• Any body of air will vibrate in a way that depends on its size and shape.
• Air in the vocal tract is set in vibration by the action of the vocal cords.
• Every time the vocal cords open and close, a pulse of air from the lungs acts like a sharp tap on the air in the vocal tract,
• setting the resonating cavities into vibration so that they produce a number of different frequencies.

Again: why is a speech sound wave composed of these peaks?


Articulatory facts:
1. The vocal cord vibrations create harmonics
2. The mouth is an amplifier
3. Depending on shape of mouth, some harmonics are
amplified more than others
How Formants are produced
• Q: Why do vowels have different pitches if the vocal cords vibrate at the same rate?

• A: This is a confusion of the frequencies of the SOURCE and the frequencies of the FILTER!

[Diagram: Source (vocal cords; fundamental frequency F0) → Filter (vocal tract; formants F1, F2, F3) → Speech]
Source-filter model of speech production
[Diagram: Input (glottal spectrum, the source) → Filter (vocal tract frequency response function) → Output]

Glottal: relating to the vocal cords and the opening between them

Source and filter are independent, so:

• Different vowels can have the same pitch:
when the source (F0) is the same while the filter response (cavity shape) differs.
• The same vowel can have different pitch:
e.g., different speakers.
Deriving schwa: how shape of mouth (filter function)
creates peaks!

Basic facts about sound waves:


f = c/λ
c = speed of sound (approx. 35,000 cm/sec)
A sound with λ = 10 meters has a low frequency: f = 35 Hz (35,000/1000)
A sound with λ = 2 centimeters has a high frequency: f = 17,500 Hz (35,000/2)
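The two worked values above can be reproduced with a one-line function. The names below are hypothetical; the speed of sound is the slide's approximate figure, and wavelengths are given in centimeters to match it.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # approximate value used on the slide

def frequency_hz(wavelength_cm, c=SPEED_OF_SOUND_CM_PER_S):
    """f = c / wavelength, with both in centimeter units."""
    return c / wavelength_cm

low = frequency_hz(1000.0)  # wavelength of 10 meters = 1000 cm
high = frequency_hz(2.0)    # wavelength of 2 cm
print(low, high)
```

The only trap is the unit conversion: 10 meters must be entered as 1000 cm because c is in cm/sec.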
Resonances of the vocal tract
• The human vocal tract as an open tube
[Figure: tube of length 17.5 cm with a closed end and an open end]
• Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube.
• Model the vocal tract as a cylindrical tube open at one end
• Standing waves form in tubes
• Waves will resonate if their wavelength corresponds to the dimensions of the tube
• Constraint: the pressure differential should be maximal at the (closed) glottal end and minimal at the (open) lip end.
• The next slide shows which wavelengths can fit into a tube with this constraint
[Figure: standing waves in the tube; maximum energy at the closed end, minimum energy at the open end]
Computing the 3 formants of schwa
Let the length of the tube be L

F1 = c/λ1 = c/(4L) = 35,000/(4 × 17.5) = 500 Hz
F2 = c/λ2 = c/(4L/3) = 3c/(4L) = 3 × 35,000/(4 × 17.5) = 1500 Hz
F3 = c/λ3 = c/(4L/5) = 5c/(4L) = 5 × 35,000/(4 × 17.5) = 2500 Hz

So we expect a neutral vowel to have 3 resonances at 500, 1500, and 2500 Hz

These vowel resonances are called Formants
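The pattern in the three formulas is that the k-th resonance of a tube closed at one end is (2k − 1)·c/(4L). A short sketch under the slide's values (c = 35,000 cm/sec, L = 17.5 cm); the function name and the second tube length in the usage line are assumptions for illustration:

```python
def tube_resonances(length_cm, n_resonances=3, c=35000.0):
    """Resonances of a tube closed at one end: F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * c / (4.0 * length_cm) for k in range(1, n_resonances + 1)]

schwa_formants = tube_resonances(17.5)
shorter_tract = tube_resonances(14.0)  # a hypothetical shorter vocal tract
print(schwa_formants, shorter_tract)
```

For the 17.5 cm tube this gives 500, 1500, and 2500 Hz; a shorter tube shifts every resonance upward by the same factor.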


Vowel [i] sung at successively higher pitch.
[Figure: seven panels, numbered 1 to 7]
Vocal Tract Simulation
Time  Total time
ms    Segment Duration
JW    Jaw Position
TP    Tongue Position
TS    Tongue Shape
TA    Tongue Expansion
LA    Lip Aperture
LP    Lip Protrusion
LH    Larynx Height
GA    Glottal Aperture
FX    Fundamental Frequency
NS    Velo-pharyngeal port opening
Vocal Tract Simulation

VTDEMO: vocal tract synthesizer


How to read spectrograms

bab: closure of the lips lowers all formants, so there is a rapid increase in all formants at the beginning of "bab"
dad: the first formant increases, but F2 and F3 fall slightly
gag: F2 and F3 come together; this is a characteristic of velars.
Formant transitions take longer in velars than in alveolars or labials
She came back and started again

1. lots of high-freq energy
3. closure for k
4. burst of aspiration for k
5. ey vowel; faint 1100 Hz formant is nasalization
6. bilabial nasal
7. short b closure, voicing barely visible
8. ae; note upward transitions after bilabial stop at beginning
9. note F2 and F3 coming together for "k"
Phonetic Resources
Phonetic dictionaries
CMU dict
CELEX
Phonetically transcribed corpora
TIMIT
Switchboard
TIMIT
Read speech corpus, time aligned

Switchboard
Spontaneous speech corpus
Telephone conversations between strangers
“They’re kind of in between right now” Time alignments
Summary
Acoustic Phonetics
Waves, sound waves, and spectra
Speech waveforms
F0, pitch, intensity
Spectra
Spectrograms
Formants
Reading spectrograms
Deriving schwa: why are formants where they are
PRAAT
Resources: dictionaries and phonetically-labeled corpora.
Examples

pad

bad

spat
Useful Textbooks
Useful Textbooks (Cont.)
Software Resources
• Snack Speech Toolkit
– http://speech.kth.se/snack/
• OGI Speech Toolkit
• University of Colorado SONIC recognizer
– http://cslr.colorado.edu
• Cambridge Hidden Markov Model Toolkit (HTK)
• CMU Sphinx-II Speech Recognizer
• NIST Speech Recognition Scoring Utilities
• SRI Language Model Toolkit
• CMU / Cambridge Language Model Toolkit
Literature Resources
Conference Proceedings
• International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
• International Conference on Spoken Language Processing (ICSLP)
• Eurospeech
Journal Publications
• Speech Communication
• IEEE Transactions on Speech and Audio Processing
Useful Website

Internet Institute for Speech and Hearing
