Speech Processing Basics

Speech signals are produced by the vocal cords and vocal tract and can be analyzed using various techniques. Speech signals are non-stationary but can be divided into segments with common acoustic properties. There are two major classes of speech sounds: vowels which are voiced and consonants which can be voiced, unvoiced, or plosive. The vocal tract acts as a filter that shapes the excitation from the vocal cords. Formants are resonant frequencies of the vocal tract that change based on its configuration. Spectrograms provide a visual representation of the time-varying spectral characteristics of speech.

Speech Signal Basics

Nimrod Peleg
Updated: Feb. 2010

Course Objectives
To get familiar with:
Speech coding in general
Speech coding for communication (military, cellular)
Means:
Introducing the basics of speech processing
Presenting an overview of speech production and
hearing systems
Focusing on speech coding: LPC codecs:
Basic principles
Discussing related standards
Listening to and reviewing different codecs.


What is Speech?

Meaning: Text, Identity, Punctuation, Emotion
prosody : rhythm, pitch, intensity

Statistics: sampled speech is a string of numbers

Distribution (Gamma, Laplace) Quantization

Information Theory: statistical redundancy
ZIP, ARJ, compress, Shorten, Entropy coding
...1001100010100110010...

Perceptual redundancy
Temporal masking, frequency masking (mp3, Dolby)

Speech production Model

Physical system: Linear Prediction

The Speech Signal

Created at the vocal cords, travels through the
vocal tract, and is produced at the speaker's mouth
Gets to the listener's ear as a pressure wave
Non-stationary, but can be divided into sound
segments which have some common acoustic
properties for a short time interval
Two major classes: vowels and consonants

Speech Production
A sound source excites a (vocal tract) filter
Voiced: Periodic source, created by vocal cords
UnVoiced: Aperiodic and noisy source

The pitch is the fundamental frequency of
the vocal cords' vibration (also called F0),
followed by 4-5 formants (F1 - F5) at higher
frequencies.

Spectral look of "Ooooohhhh"

[Figure: vocal excitation with the pitch value marked, and the spectral envelope (in purple: quantized envelope)]

Schematic Diagram of Vocal Mechanism

The Vocal Cords

[Human voice, Wikipedia]

Glottal Volume and Mouth Sound

Pipelines Model

Typical Voiced Sound

1 sec, 10,000 samples, 8 bits per sample, voiced ("Ahhhhh")
A quasi-periodic signal

Typical Pitch
Speech:

male ~ 85-155 Hz;

female ~ 165-255 Hz;

Note the overlap !

Singers vocal range:
from bass to soprano: 80 Hz-1100 Hz

Waves: Sine/Cosine

The graphs of the sine and cosine functions are
sinusoids of different phases

Frequency Analysis
The first four Fourier
series approximations
for a square wave
Analysis performed
by Fourier Transform
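
As a hedged illustration of the square-wave approximations mentioned above, the sketch below sums the first few odd harmonics of the Fourier series of a unit-amplitude square wave. The function name `square_wave_partial_sum` and the parameter `f0` are illustrative choices, not taken from the slides.

```python
import math

def square_wave_partial_sum(t, n_terms, f0=1.0):
    """Fourier-series approximation of a unit square wave using n_terms odd harmonics."""
    total = 0.0
    for k in range(n_terms):
        n = 2 * k + 1  # odd harmonic numbers: 1, 3, 5, ...
        total += math.sin(2 * math.pi * n * f0 * t) / n
    return 4.0 / math.pi * total

# The first four approximations, evaluated in the middle of the positive half-cycle
for n_terms in range(1, 5):
    print(n_terms, square_wave_partial_sum(0.25, n_terms))
```

With more terms the value at the middle of the half-cycle approaches 1, the amplitude of the ideal square wave.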

Power Spectrum: Voiced Speech

Vowel Production
In vowel production, air is forced from the lungs by
contraction of the muscles around the lung cavity
Air flows through the vocal cords, which are two
masses of flesh, causing periodic vibration of the cords
whose rate gives the pitch of the sound.
The resulting periodic puffs of air act as an excitation
input, or source, to the vocal tract.

Typical Vowels

Average Formant Locations

From: Rabiner & Schafer, Digital Processing of Speech Signals

The Vocal Tract

The vocal tract is the cavity between the vocal
cords and the lips, and acts as a resonator that
spectrally shapes the periodic input, much like
the cavity of a musical wind instrument.
A simple model of a steady-state vowel regards
the vocal tract as a linear time-invariant (LTI)
filter with a periodic impulse-like input.

Speech Production Model

Pitch interval control -> Pulse Generator (Glottis) -> Voiced
Random Noise -> UnVoiced
Voiced / UnVoiced switch -> Time-Varying Filter (Vocal Tract) -> Speech
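
The block diagram above can be sketched in a few lines, under simplifying assumptions: the filter is held fixed (a steady vowel) rather than time varying, each formant is modeled by a two-pole resonator, and the formant frequencies and bandwidths passed in are illustrative values, not taken from the slides. All function names are hypothetical.

```python
import math
import random

def excitation(n_samples, fs, voiced, pitch_hz=100.0, seed=0):
    """Impulse train (voiced) or white Gaussian noise (unvoiced)."""
    if voiced:
        period = int(round(fs / pitch_hz))  # pitch period in samples
        return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def resonator(x, fs, formant_hz, bandwidth_hz):
    """Two-pole all-pole filter with one resonance (one formant)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)   # pole radius from bandwidth
    theta = 2.0 * math.pi * formant_hz / fs      # pole angle from center frequency
    a1, a2 = -2.0 * r * math.cos(theta), r * r
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = xn - a1 * y1 - a2 * y2
        y.append(yn)
        y2, y1 = y1, yn
    return y

def synthesize(n_samples, fs, voiced, formants, pitch_hz=100.0):
    """Pass the excitation through a cascade of formant resonators."""
    s = excitation(n_samples, fs, voiced, pitch_hz)
    for formant_hz, bandwidth_hz in formants:
        s = resonator(s, fs, formant_hz, bandwidth_hz)
    return s

# Assumed, illustrative formant (Hz) and bandwidth (Hz) pairs for an "ah"-like vowel
vowel = synthesize(4000, 8000, True, [(730.0, 60.0), (1090.0, 90.0), (2440.0, 120.0)])
```

Switching `voiced` to `False` keeps the same filter but replaces the pulse train with noise, which is the voiced/unvoiced switch in the diagram.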

Phonemes
The basic sounds of a language (e.g. "a" in the word
"father") are called phonemes.
A typical speech utterance consists of a string of
vowel and consonant phonemes whose temporal
and spectral characteristics change with time.
In addition, the time-varying source and system
can also nonlinearly interact in a complex way:
our simple model is correct for a steady vowel,
but the sounds of speech are not always well
represented by linear time-invariant systems !

Speech Sounds Categories

Periodic (sonorants, voiced)
Noisy (fricatives, unvoiced)
Impulsive (plosives)

Example:
In the word "shop", the "sh", "o", and "p" are
generated from a noisy, periodic, and
impulsive source, respectively.

All you need to know about phonemes:

http://cslu.cse.ogi.edu/tutordemos/SpectrogramReading/spectrogram_reading.html

The Pitch
Pitch period:
The time duration of one glottal cycle.
Pitch (fundamental frequency):
The reciprocal of the pitch period.

Typical Pitch & Intensity Variations

Harmonics

The frequencies $\omega_k = \frac{2\pi}{P} k$
are referred to as the
harmonics of the glottal waveform
($\frac{2\pi}{P}$ is the fundamental frequency, pitch)
As the pitch period $P$ decreases, the spacing
between the harmonics increases:

Pitch Detection
The pitch period and voiced/unvoiced (V/UV) decisions are
essential to many speech coders
Many methods for the calculation:
Autocorrelation function
Pitch Calculation utility

SPDemo demonstration
NOW

Autocorrelation Pitch Detector

Calculate the autocorrelation function of the signal
within the expected range of pitch periods
For a speech signal sampled at 8 kHz, the range in samples
varies between 20-130 (~2.5-16 ms)

Mathematical definition:

$r(m) = \sum_{i} x(i)\, x(i+m)$

where $x(i)$ is the speech signal and $m$ is the lag in samples

The location of the first maximum is the pitch period

No maximum: no pitch period
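
A minimal sketch of the detector described above, assuming a single analysis frame held in a Python list. It differs from the slide in one labeled respect: instead of locating the first maximum, it takes the largest autocorrelation value within the 20-130 sample lag range and declares the frame unvoiced when that peak is small relative to the frame energy. The `threshold` value of 0.3 and the function name are assumptions.

```python
import math

def autocorr_pitch(frame, fs, min_lag=20, max_lag=130, threshold=0.3):
    """Return the pitch in Hz, or None when no clear periodicity is found."""
    r0 = sum(x * x for x in frame)  # autocorrelation at lag 0 (frame energy)
    if r0 == 0.0:
        return None  # silent frame: no pitch
    best_lag, best_val = None, 0.0
    for m in range(min_lag, max_lag + 1):
        r = sum(frame[i] * frame[i + m] for i in range(len(frame) - m))
        if r > best_val:
            best_lag, best_val = m, r
    # Assumed voicing rule: the peak must be a sizable fraction of the energy
    if best_lag is None or best_val / r0 < threshold:
        return None
    return fs / best_lag

# Synthetic "voiced" frame: 125 Hz fundamental plus its second harmonic, 8 kHz sampling
fs = 8000
frame = [math.sin(2 * math.pi * 125 * n / fs) + 0.5 * math.sin(2 * math.pi * 250 * n / fs)
         for n in range(512)]
print(autocorr_pitch(frame, fs))
```

The frame must be longer than `max_lag` samples; 512 samples (64 ms at 8 kHz) covers several pitch periods across the assumed lag range.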

The Vocal Tract and Formants

The relation between a glottal airflow velocity
input and vocal tract airflow velocity output
can be approximated by a linear filter with
resonances called formants (like the resonances
of organ pipes and wind instruments).
Formants change with different vocal tract
configurations corresponding to different
resonant cavities and thus different phonemes.

The Vocal Tract and Formants (cont'd)

Generally, the frequencies of the formants
decrease as the vocal tract length increases.
Therefore, a male speaker tends to have lower
formants than a female, and a female has lower
formants than a child.
Under a vocal tract linearity and time-invariance
(LTI) assumption, and when the sound source
occurs at the glottis, the speech waveform (i.e., the
airflow velocity at the vocal tract output) can be expressed as
the convolution of the glottal flow input and vocal
tract impulse response.
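
The inverse relation between vocal tract length and formant frequencies noted above can be illustrated with a standard simplification that is not in the slides: a uniform lossless tube, closed at the glottis and open at the lips, resonates at odd multiples of the quarter-wavelength frequency, F_n = (2n - 1) c / (4 L). The speed of sound (350 m/s) and the tract lengths below are assumed, illustrative values.

```python
def tube_formants(length_m, n_formants=3, c=350.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, n_formants + 1)]

# Assumed tract lengths: a longer tract gives lower formants
for label, length_m in [("longer tract, 17.5 cm", 0.175), ("shorter tract, 14 cm", 0.14)]:
    print(label, [round(f) for f in tube_formants(length_m)])
```

For the 17.5 cm tube the resonances fall near 500, 1500 and 2500 Hz; real vowels deviate from these because the tract is not uniform.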

Glottal source harmonics, Vocal tract formant and Spectral envelope

Categorization of Speech Sounds

Speech sounds are studied and classified from
the following perspectives:
1) The nature of the source: periodic, noisy, or
impulsive, and combinations of the three.
More optional classes:
2) The shape of the vocal tract.
3) The time-domain waveform, which gives the
pressure change with time at the lips output.
4) The time-varying spectral characteristics revealed
through the spectrogram.

From: Rabiner & Schafer, Digital Processing of Speech Signals

Waveforms Examples
Voiced
(a,e,u,o,i)

Un-Voiced
(s,f,sh)

Plosive
(p,k,t)

Most Common Manners of Articulation

Plosive, or oral stop, where there is complete occlusion (blockage) of both the oral and nasal cavities
of the vocal tract, and therefore no air flow. Examples include English /p t k/ (voiceless) and /b d g/
(voiced).
Nasal stop, where there is complete occlusion of the oral cavity, and the air passes instead through
the nose. The shape and position of the tongue determine the resonant cavity that gives different nasal
stops their characteristic sounds. Examples include English /m, n/.
Fricative, sometimes called spirant, where there is continuous frication (turbulent and noisy airflow)
at the place of articulation. Examples include English /f, s/ (voiceless), /v, z/ (voiced), etc.
Sibilants are a type of fricative where the airflow is guided by a groove in the tongue toward the
teeth, creating a high-pitched and very distinctive sound. These are by far the most common
fricatives. English sibilants include /s/ and /z/.
Affricate, which begins like a plosive, but this releases into a fricative rather than having a separate
release of its own. The English letters "ch" and "j" represent affricates.
Trill, in which the articulator (usually the tip of the tongue) is held in place, and the airstream causes
it to vibrate. The double "r" of Spanish "perro" is a trill.
Approximant, where there is very little obstruction. Examples include English /w/ and /r/. Lateral
approximants, usually shortened to lateral, are a type of approximant pronounced with the side of
the tongue. English /l/ is a lateral.
And some more.

Clear pitch
period

No pitch

No formants

Formant
structure

Example of speech waveform (male) of the word "problems".

Spectrograms
The time-varying spectral characteristics of the speech
signal can be graphically displayed through the use of a
two-dimensional pattern.
Vertical axis: frequency, Horizontal axis: time
The pseudo-color of the pattern is proportional to
signal energy (red: high energy)
The resonance frequencies of the vocal tract show up
as energy bands
Voiced intervals are characterized by a striated appearance
(periodicity of the signal)
Unvoiced intervals are more solidly filled in

[Figure: time domain view and spectrogram view of the utterance "A lathe is a big tool" (detail: "is a big"), with voiced and unvoiced regions marked]

Analysis Window
Wide-Band Spectrogram: a narrow
analysis window (narrower than the pitch
period); the narrow vertical lines match
successive pitch periods
Narrow-Band Spectrogram: a wide
analysis window (includes several pitch
periods); the narrow horizontal lines are
pitch harmonics
The yellow bands describe how the formants
change in time (previous slide)
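
The effect of the analysis window length can be sketched with a plain short-time DFT, assuming a Hamming window and a synthetic pulse train standing in for voiced speech. The window lengths (4 ms and 32 ms at 8 kHz), the hop sizes and the function names are illustrative assumptions.

```python
import cmath
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def stft_magnitude(x, win_len, hop, n_fft):
    """Short-time DFT magnitudes: one row per frame, bins 0..n_fft/2 per row."""
    w = hamming(win_len)
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        seg = [x[start + i] * w[i] for i in range(win_len)]
        row = []
        for k in range(n_fft // 2 + 1):
            acc = sum(seg[i] * cmath.exp(-2j * math.pi * k * i / n_fft)
                      for i in range(win_len))
            row.append(abs(acc))
        frames.append(row)
    return frames

fs = 8000
period = 64  # pitch period in samples (125 Hz)
x = [1.0 if n % period == 0 else 0.0 for n in range(1024)]

# Window shorter than the pitch period: wide-band analysis
wide = stft_magnitude(x, win_len=32, hop=16, n_fft=256)
# Window spanning four pitch periods: narrow-band analysis
narrow = stft_magnitude(x, win_len=256, hop=64, n_fft=256)
```

In `wide`, frames that fall between pulses carry no energy, which produces the vertical striations in time. In `narrow`, each row peaks at multiples of 125 Hz (every fourth bin here), which produces the horizontal harmonic lines.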

Spectrograms with Different Analysis Windows

[Figure: spectrograms of the utterance "Should we chase"]

A Typical Speech Processing

Measurements of the acoustic Waveform

Waveform and spectral representations

Speech production models

Analysis / Synthesis

Applications: Coding, Modifications, Enhancement, Recognition

Modifications
The goal is to alter the speech signal to have some
desired property: time-scale, pitch, and spectral
changes.
Applications: fitting radio and TV commercials into an
allocated time slot, synchronization of audio and video
presentations, etc.
In addition, speeding up speech has use in message
playback, voice mail, and reading machines and books
for the blind, while slowing down speech has
application to learning a foreign language.

Modification Demo

Pitch and vocal tract length change Sinewave-based modification

Some Wideband Audio Examples

TSM:

Original (Depeche Mode: Martyr)

Mono, 15 Sec
Fast 50%
Slow 200%

Automatic Transcription:

Original

Polyphonic Wav-to-MIDI

Speech Enhancement
The goal: to improve the quality of degraded speech.
One approach is to pre-process the (analog) speech
waveform before it is degraded.
Another is post-processing: enhancement after the
signal is degraded:
Increasing the transmission power, e.g.: automatic gain
control (AGC) in a noisy environment.
Reduction of additive noise in digital telephony, and vehicle
communication and aircraft communication.
Reduction of interfering backgrounds and speakers for the
hearing-impaired.

Enhancement demo:
Noise reduction: adaptive Wiener filter
with adaptivity based on spectral change

Speech Synthesis Demo

Voder
1939 New York World's Fair: first speech
synthesizer
H. Dudley, R.R. Reisz, and S.S.A. Watkins,
"A synthetic speaker," Journal of the Franklin
Institute, vol. 227, pp. , 1939.

http://davidszondy.com/future/robot/voder.htm

And Some Modern TTS Machines

AT&T

http://public.research.att.com/~ttsweb/tts/demo.html

Oddcast

http://vhost.oddcast.com/vhost_minisite/demos/tts/tts_example.html

A modern book reader: Kindle

And some Criticism

by Amazon

The Human Auditory System

The Human Hearing System

Hearing System
The ear performs frequency analysis of the received
signal, and allows the listener to discriminate small
differences in time and frequency found in the sound

Hearing range: ~16 Hz - 18 kHz

Most sensitive to the range: 2 kHz - 4 kHz
Dynamic range (quietest to loudest) is about 96 dB
Normal human voice range is about 500 Hz to 2 kHz

How does it work ?

1. The Outer Ear: Catch the Wave

Called the pinna or auricle

After waves enter the outer ear, they travel
through the ear canal and make their way
toward the middle ear.
The outer ear canal's other job is to protect the
ear by making earwax. That special wax contains
chemicals that fight off infections that could hurt
the skin inside the ear canal. It also collects dirt
to help keep the ear canal clean.

The Eardrum
The eardrum is a piece of thin skin stretched tight.
It is attached to the first ossicle, a small bone called the
malleus (hammer), which is attached to another tiny
bone called the incus (anvil), which is attached to the
smallest bone in the body, called the stapes (stirrup).
When sound waves travel into the ear and reach the
eardrum, they cause the eardrum to vibrate. These
sound vibrations are carried to the three tiny bones of
the middle ear.

The Middle Ear

The middle ear's main job is to take the sound
waves, turn them into vibrations, and deliver them
to the inner ear.
It also helps the eardrum handle the pressure:
The middle ear is connected to the back of the nose by a
narrow tube called the eustachian tube. Together the
eustachian tube and the middle ear keep the air pressure
equal on both sides of the eardrum.
Keeping the air pressure equal is important so the eardrum
can work properly and not get injured.

The Inner Ear

The vibrations in the inner ear go into the cochlea - a
small, curled tube in the inner ear.
The cochlea is filled with liquid and lined with cells that
have thousands of tiny microscopic hairs (15,000 to
20,000) on their surface.
When the sound vibrations hit the liquid in the cochlea,
the liquid begins to vibrate. Different kinds of sounds
will make different patterns of vibrations.
The vibrations cause the sensory hairs in the cochlea to
move: sound vibrations are transformed into nerve
signals and delivered to the brain via the hearing nerve
(also called the eighth nerve).

Outer, Middle and Inner Ear

(From: Intro. to Digital Audio Coding and Standards, Bosi & Goldberg)

Inside The Cochlea

Frequency Sensitivity along the Basilar Membrane
American Physical Society, 1940

Hearing: Thresholds
Hearing Threshold: Minimum intensity at
which sounds can be perceived
The ear is less sensitive to the pitch and first
formant (F1) than to the higher formants, in
the sense of intelligibility

Hearing: Masking
Masking: One sound is obscured in the
presence of another: presence of one raises
the threshold for another one
Lower frequencies usually mask higher
frequencies, with largest effect near the
harmonics of the masker
A wider band signal masks narrower band
signal

How sensitive is human hearing?

Put a person in a quiet room. Raise level of 1
kHz tone until just barely audible.

Vary the frequency and plot.

Frequency Masking
Play a 1 kHz tone (masking tone) at a fixed level (60 dB).
Play a test tone at a different frequency (e.g., 1.1 kHz),
and raise its level until it is just distinguishable.

Vary the frequency of the test tone and plot the
threshold at which it becomes audible:

Frequency Masking (cont'd)

Repeat for various frequencies of masking tones:

Critical Bands
Perceptually uniform measure of frequency,
non-proportional to width of masking curve
About 100 Hz for masking frequencies < 500 Hz,
growing larger and larger above 500 Hz
The width is called the size of the critical band

Barks
Introduce a new unit for frequency called the
Bark (after Barkhausen)
1 Bark = width of one critical band
For frequency < 500 Hz: $z \approx f/100$
For frequency > 500 Hz:
$z \approx 9 + 4\log_2(f/1000)$
where $f$ is the frequency in Hz and $z$ is its value in Bark
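
The two-branch approximation above can be written as a small conversion function. This is a sketch that assumes the logarithm is base 2, which is the choice that makes the two branches agree (5 Bark) at 500 Hz; the function name `hz_to_bark` is hypothetical.

```python
import math

def hz_to_bark(f_hz):
    """Approximate Bark value of a frequency in Hz, using the two-branch rule."""
    if f_hz < 500.0:
        return f_hz / 100.0
    # Assumption: base-2 logarithm, so that both branches give 5 Bark at 500 Hz
    return 9.0 + 4.0 * math.log2(f_hz / 1000.0)

for f in (100, 500, 1000, 4000):
    print(f, "Hz ->", hz_to_bark(f), "Bark")
```

Equal steps in Bark correspond to roughly constant steps in Hz below 500 Hz and to constant frequency ratios above it.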

Frequency partitioning into Barks

Masking Thresholds on critical band scale:

Masking effect

Masked and non-masked tones

Temporal masking
If we hear a loud sound, then it stops, it
takes a little while until we can hear a soft
tone nearby
Question: how to quantify?

Temporal masking (cont'd)

Experiment: Play a 1 kHz masking tone at 60 dB,
plus a test tone at 1.1 kHz at 40 dB:
the test tone can't be heard (it's masked).
Stop the masking tone, then stop the test tone after
a short delay.
Adjust the delay time to the shortest time at which
the test tone can be heard (e.g., 5 ms).

Temporal masking (cont'd)

Repeat with different levels of the test tone and plot:

Temporal masking (cont'd)

Try other frequencies for the test tone (masking tone
duration constant). Total effect of masking:

Masking effect in time

Voiced and Unvoiced Noise Thresholds

Hearing Model
Filter Bank
H1

Sound

Speech Quality
Includes:
Intelligibility (Phone)
Naturalness
Speaker identity
Perceptual convenience (?)
And more.

How to quantify the degradation ?

SNR, MSE etc. are not perceptual measures
Subjective criteria (listening tests: MOS)
are expensive and time consuming
PESQ: a widely used objective standard
What about the Phone Line ?

Band limited: ~300 Hz - 3.3 kHz
Distortions, Echo, Noise etc.

Original

Band limited
(Phone-line)
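
The band limiting of a phone line can be imitated with an idealized band-pass filter. The sketch below assumes a brick-wall 300 Hz - 3.3 kHz response applied by zeroing DFT bins of a short block; a real line has a gradual roll-off, and the naive O(N^2) DFT is only suitable for short examples. The function name is hypothetical.

```python
import cmath
import math

def ideal_bandpass(x, fs, lo_hz=300.0, hi_hz=3300.0):
    """Zero all DFT bins outside [lo_hz, hi_hz] and transform back."""
    n = len(x)
    # Forward DFT (naive, for illustration only)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        f = min(k, n - k) * fs / n  # bin frequency, folding negative frequencies
        if not (lo_hz <= f <= hi_hz):
            spectrum[k] = 0.0
    # Inverse DFT; the result is real because the mask is symmetric
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]
```

Applied to a block containing a 100 Hz tone and a 1 kHz tone, the filter removes the former and keeps the latter, which is why pitch below 300 Hz is not transmitted directly over such a line.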

(A very good) Reference book:

Introduction to Digital Speech Processing
(Foundations and Trends in Signal Processing)
by Lawrence R. Rabiner and Ronald W. Schafer

First edition, 1978

Second edition, 2007
