Convolutional Neural Networks for Phoneme Recognition

Cornelius Glackin1, Julie Wall2, Gérard Chollet1, Nazim Dugan1 and Nigel Cannings1
1Intelligent Voice Ltd., London, U.K.
2School of Architecture, Computing and Engineering, University of East London, U.K.

Keywords: Phoneme Recognition, Convolutional Neural Network, TIMIT.

Abstract: This paper presents a novel application of convolutional neural networks to phoneme recognition. The
phonetic transcription of the TIMIT speech corpus is used to label spectrogram segments for training the
convolutional neural network. A fixed-size window slides over the spectrogram of the TIMIT utterances
and the resulting spectrogram patches are assigned to the appropriate phone class by parsing TIMIT’s phone
transcription. The convolutional neural network is the standard GoogLeNet implementation trained with
mini-batch stochastic gradient descent. After training, phonetic rescoring is performed in the usual way
to map the TIMIT phone set to the smaller standard set. Benchmark results are presented for comparison to
other state-of-the-art approaches. Finally, conclusions and future directions with regard to extending the
approach are discussed.

1 INTRODUCTION

Traditionally, Automatic Speech Recognition (ASR) involves multiple successive layers of feature extraction to compress the amount of information processed from the raw audio, so that training the ASR does not take an unreasonably long time. However, in recent years, with increases in computational speed, the adoption of parallel computation with General Purpose Graphics Processing Units (GPGPUs), and advances in neural networks (the so-called Deep Learning trend), many researchers are replacing traditional ASR algorithms with data-driven approaches that simply take the audio data in its frequency form (e.g. a spectrogram) and process it with a Deep Neural Network (DNN) or, more appropriately since speech is temporal, a Recurrent Neural Network (RNN) that can be trained quickly with GPUs. The RNN then converts the spectrogram directly to phonetic symbols and in some cases directly to text (Hannun et al., 2014).

Convolutional Neural Networks (CNNs) present an interesting alternative to the use of DNNs and RNNs for ASR. In this paper, we will demonstrate how the CNN, which is known for state-of-the-art performance on image processing tasks, can be adapted to learn the Acoustic Model (AM) component of an ASR system. The AM is responsible for extracting acoustic features from speech and classifying them into symbol classes. Specifically, in the CNN Acoustic Model (CNN-AM) presented in this paper, we use spectrograms as input and phonemes as output classes for training. We will use the phonetic transcription of the TIMIT corpus as the ‘ground truth’ for training, validating and testing the CNN-AM.

2 CNN-BASED ACOUSTIC MODELLING

A CNN is usually employed for the classification of static images, see for example (Krizhevsky, Sutskever and Hinton, 2012). CNNs are inspired by receptive fields in the mammalian brain, which are formed by neurons in the V1 processing centres of the visual cortex; they are also present in the cochlear nucleus of the auditory processing areas (Shamma, 2001). The receptive field of a sensory neuron transforms the firing of that neuron depending on its spatial input (Paulin, 1998). Usually there is an inhibitory region surrounding a receptive field which suppresses any stimulus falling outside the bounds of the receptive field. In this way, receptive fields behave like feature extractors.


Figure 1: Preparation of the images for GoogLeNet training. A sliding window moves over the 16 kHz STFT-based spectrogram. The sliding window is shown in greyscale; the resulting 256x256 pixel spectrogram patches are placed into phoneme classes according to the TIMIT transcription for training, validation and testing.

Inspired by the work of Hubel and Wiesel (Hubel and Wiesel, 1962), Fukushima developed the Neocognitron network (Fukushima, 1980), in which images are dissected by image processing operations for the automated extraction of features. These image processing operations were later formalised by Yann LeCun as convolutions; it was LeCun who coined the term CNN. The most notable example was LeNet-5 (LeCun et al., 1990), which was used to learn the MNIST handwritten character data set. LeNet-5 was the first network to use convolutions and subsampling or pooling layers.

One of the main strengths of the CNN is that, since Ciresan’s seminal GPU implementation (Ciresan et al., 2011) in 2011, they are typically trained in parallel on a GPU, and are now arguably the most common type of DNN being trained. One subtlety to note is that the larger the pooling area, the more information is condensed, which leads to slim networks that fit more easily into GPU memory (as they are more linear). However, if the pooling area is too large, too much information is thrown away and predictive performance decreases. The state of the art in CNNs is arguably GoogLeNet (Szegedy et al., 2015), the architecture that won the ImageNet competition (ILSVRC, 2011).

The main contribution of GoogLeNet is that it uses inception modules. Convolutions of different sizes are used within the module, which gives the network the ability to cope with different types of features. There are 1x1, 3x3, and 5x5 pixel convolutions; the kernel sizes are typically odd numbers so that the kernel can be centred on top of the image pixel in question. In the inception module there are also 1x1 convolutions which reduce the dimension of the feature vector, ensuring that the number of parameters to be optimised remains manageable. In fact, this reduced number of parameters is probably the principal contribution of the GoogLeNet CNN: it contains 4 million parameters, whereas its forerunner AlexNet (Krizhevsky, Sutskever and Hinton, 2012) has 60 million parameters to be optimised. The pooling layer also reduces the number of parameters, but its primary function is to make the network invariant to feature translation. The concatenation layer constructs a feature vector for processing by the next layer.
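As an illustrative sketch only (not the GoogLeNet configuration used in this work), the structure of a single inception module as described above can be expressed in PyTorch as follows; the channel counts are placeholder assumptions:

    import torch
    import torch.nn as nn

    class InceptionModule(nn.Module):
        """Simplified inception module: parallel 1x1, 3x3 and 5x5 convolutions plus a
        pooling branch, with 1x1 'bottleneck' convolutions to keep the number of
        parameters manageable (channel counts are illustrative, not GoogLeNet's)."""
        def __init__(self, in_ch):
            super().__init__()
            self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)              # 1x1 branch
            self.branch3 = nn.Sequential(
                nn.Conv2d(in_ch, 32, kernel_size=1),                        # 1x1 reduction
                nn.Conv2d(32, 64, kernel_size=3, padding=1))                # 3x3 convolution
            self.branch5 = nn.Sequential(
                nn.Conv2d(in_ch, 16, kernel_size=1),                        # 1x1 reduction
                nn.Conv2d(16, 32, kernel_size=5, padding=2))                # 5x5 convolution
            self.branch_pool = nn.Sequential(
                nn.MaxPool2d(kernel_size=3, stride=1, padding=1),           # pooling branch
                nn.Conv2d(in_ch, 32, kernel_size=1))

        def forward(self, x):
            # Concatenation layer: stack the branch outputs along the channel dimension.
            return torch.cat([self.branch1(x), self.branch3(x),
                              self.branch5(x), self.branch_pool(x)], dim=1)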

191
ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods

Figure 2: Distribution of phonemes within the TIMIT transcription.

3 PHONEME RECOGNITION WITH TIMIT

We used spectrograms to train a CNN to perform speech recognition. For this, we decided to use the TIMIT corpus to train the acoustic model (CNN), as it has accurate phoneme transcription (Garofolo et al., 1993). The TIMIT speech corpus was designed in 1993 as a speech data resource for acoustic-phonetic studies and has been used extensively for the development and evaluation of ASR systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance. TIMIT was designed to further acoustic-phonetic knowledge and ASR systems. It was commissioned by DARPA and worked on by many sites, including Texas Instruments (TI) and the Massachusetts Institute of Technology (MIT), hence the corpus’ name. TIMIT is the most accurately transcribed speech corpus in existence, as it contains not only transcriptions of the text but also accurate timing of phones. This is impressive given that the average English speaker utters 14-15 phones a second. Figure 1 shows a spectrogram and illustrates the accuracy of the word and phone transcription for one of TIMIT’s core training set utterances.

Spectrogram images were generated from the TIMIT corpus and placed in classes according to TIMIT’s phone transcription. Spectrograms were produced every 160 samples, which for 16 kHz audio corresponds to 10 ms, the standard resolution for capturing the acoustic features the audio contains. The phone ground truth is parsed and each spectrogram is labelled with the phone within whose interval its centre falls. Alternatively, one could have used the centre of the ground-truth interval and calculated the Euclidean distance between the centre of the phone interval and the window centre, but it was decided that this would make assumptions about where the phone is centred within the interval. It would also have required an additional computationally expensive step in the labelling of the spectrogram windows.

Figure 1 also illustrates the preparation of the training, validation and testing data. It shows how the phonetic transcription is used to label the 256x256 greyscale spectrogram patches as the sliding window passes over each of the TIMIT utterances. The labelled greyscale patches are sorted into the directory belonging to each of the 61 phoneme classes for each of the training, validation and testing sets. For the TIMIT corpus we use the standard core training setup. We use wide-band, Short-Term Fourier Transform (STFT) spectrograms, since we want to align the acoustic data with phonetic symbols with timing that is as accurate as possible. The FFT component of the spectrogram generation uses NVIDIA’s cuFFT library for speed. Figure 2 shows the distribution of the phones generated according to the TIMIT phone transcription in the training set. For readability, please note that the bars correspond to the alphabetically ordered phones in the key below.
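The patch preparation described above can be sketched as follows. This is a minimal illustration only, assuming SciPy’s STFT in place of the cuFFT implementation used in this work, plain 16 kHz WAV input, and TIMIT’s .PHN format of ‘start_sample end_sample phone’ lines; resizing of each patch to exactly 256x256 pixels is omitted:

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft

    def load_phone_intervals(phn_path):
        """Parse a TIMIT .PHN file: each line is 'start_sample end_sample phone'."""
        intervals = []
        with open(phn_path) as f:
            for line in f:
                start, end, phone = line.split()
                intervals.append((int(start), int(end), phone))
        return intervals

    def label_patches(wav_path, phn_path, hop_samples=160, patch_width=256):
        """Slide a window over the STFT spectrogram in 160-sample (10 ms) steps and
        label each patch with the phone whose interval contains the patch centre."""
        rate, signal = wavfile.read(wav_path)             # TIMIT audio is 16-bit, 16 kHz
        _, _, spec = stft(signal, fs=rate, nperseg=512, noverlap=512 - hop_samples)
        spec = np.log1p(np.abs(spec))                     # log-magnitude spectrogram
        intervals = load_phone_intervals(phn_path)

        patches = []
        for start_col in range(spec.shape[1] - patch_width):
            centre_sample = (start_col + patch_width // 2) * hop_samples
            for s, e, phone in intervals:
                if s <= centre_sample < e:
                    patches.append((phone, spec[:, start_col:start_col + patch_width]))
                    break
        return patches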


Figure 3: TIMIT stochastic gradient descent training.

As can be seen from Figure 2, the largest class is ‘s’ and the second largest is ‘h#’ (silence); the latter occurs at the beginning and end of each TIMIT utterance. The distribution is unbalanced, which of course makes phoneme recognition by neural network architectures challenging. The training data is the standard TIMIT core set, and the standard test set sub-directories DR1-4 and DR5-8 were used for validation and testing respectively. This partitioning resulted in 1,417,588 spectrogram patches in the training set, as well as 222,789 and 294,101 spectrograms in the validation and testing sets respectively.

3.1 GoogLeNet Training and Inferencing

The GoogLeNet implementation was trained with Stochastic Gradient Descent (SGD). Before the Deep Learning boom, gradient descent was usually performed using the full set of training samples (full batch) to determine the next update of the parameters. The problem with this approach is that it is not parallelizable, and hence cannot be implemented efficiently on a GPU. SGD does away with this approach by computing the gradient of the parameters on a single or a few (mini-batch) training samples. For large datasets, such as this one, SGD performs qualitatively as well as batch methods but outperforms them in computational time.

A stepped learning rate was used with a mini-batch size of 256 samples; Figure 3 shows the training accuracy. The network outputs the phone class prediction at three different points in the network architecture (loss1, loss2, and loss3). The NVIDIA DIGITS implementation employed also reports the top-1 and top-5 predictions for each of those loss (accuracy) outputs. loss3 (the last network output) reports the highest accuracy, which is 71.65% for top-1 classification of the 61 phones. For top-5, the accuracy is reported as 96.27%, which means that the correct phone was listed in the top five output classifications of the network. This is interesting because, as mentioned earlier, each spectrogram window contains 4 to 5 phones on average, and preliminary tests confirmed that in the majority of cases the other phones were indeed being correctly identified.

The network is trained using the training data (1.4 million spectrograms), and the validation set (approximately 223 thousand spectrograms) is used to check training progress; the separate testing set (approximately 294 thousand images) is then used to test the system. After each iteration of training, in which the training data has been used to learn the network weights, the validation data is used to check that the accuracy of the latest iteration of the trained system is still improving. The validation data is kept separate from the training data and is only used to monitor the progress of training, and to stop training if overfitting occurs. The iteration with the highest validation accuracy is used as the final system.
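The training set-up was run through NVIDIA DIGITS; purely as an illustrative equivalent (not the code used in this work), a mini-batch SGD loop with a stepped learning rate and GoogLeNet’s three outputs could be sketched in PyTorch as below. The learning-rate schedule, loss weights, epoch count and data pipeline are assumptions, and the greyscale patches are assumed to be replicated to three channels to match the stock model:

    import torch
    from torch import nn, optim
    from torch.utils.data import DataLoader
    from torchvision.models import googlenet

    def train(train_set, epochs=30, device="cuda"):
        # 61 TIMIT phone classes; aux_logits enables the loss1/loss2 side outputs.
        model = googlenet(num_classes=61, aux_logits=True, init_weights=True).to(device)
        loader = DataLoader(train_set, batch_size=256, shuffle=True)
        criterion = nn.CrossEntropyLoss()
        optimiser = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
        # Stepped learning rate: divide by 10 every 10 epochs (values are illustrative).
        scheduler = optim.lr_scheduler.StepLR(optimiser, step_size=10, gamma=0.1)

        for epoch in range(epochs):
            model.train()
            for patches, phones in loader:       # patches: (256, 3, 256, 256) tensors
                patches, phones = patches.to(device), phones.to(device)
                out = model(patches)             # (logits, aux2, aux1) in training mode
                loss = (criterion(out.logits, phones)
                        + 0.3 * criterion(out.aux_logits2, phones)
                        + 0.3 * criterion(out.aux_logits1, phones))
                optimiser.zero_grad()
                loss.backward()
                optimiser.step()
            scheduler.step()
            # Validation accuracy would be computed here to select the best epoch.
        return model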


Figure 4: All network outputs for a test utterance.

As can be seen in Figure 3, the highest validation accuracy occurs at epoch (iteration) 20, and this is taken as the final version of the trained system. We then use the trained system to perform inferencing over the test set; Figure 4 shows an example of the predictions the system makes for a single sample of this previously unseen test data. The output of the inferencing process contains many duplicates of phones due to the small increments of the sliding window position.

3.2 Post-processing and Rescoring

Hence, an additional post-processing script was written to remove the duplicates. It is the convention in the literature, when reporting results for the TIMIT corpus, to re-score the results on a smaller set of phones (Lopes and Perdigao, 2011). The phoneticians that scored TIMIT used 61 phone symbols, many of which are not conventionally used by other speech recognition systems. For example, there are phone symbols called closures, e.g. pcl, kcl, tcl, bcl, dcl, and gcl, which simply refer to the closing of the mouth before the release of closure that results in the p, k, t, b, d, or g phones being uttered respectively. Most acoustic models map these to the silence symbol ‘h#’. Post-processing code was written to automatically remap the output of model inferencing to the new phone set. The results for the test set were then generated with the new remapping, and the accuracy increased from 71.65% (shown in Figure 3) to 77.44% after rescoring.
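A minimal sketch of this post-processing step is given below. Only the closure-to-silence merges named above are spelled out; the remaining entries of the standard reduced phone set mapping (Lopes and Perdigao, 2011) would be added in the same way:

    from itertools import groupby

    # Partial remapping of the 61 TIMIT phones to the smaller scoring set.
    # Closures are folded into silence as described above; the other merges
    # from the standard reduced set would be added here.
    PHONE_MAP = {"pcl": "h#", "kcl": "h#", "tcl": "h#",
                 "bcl": "h#", "dcl": "h#", "gcl": "h#"}

    def rescore(window_phones):
        """Remap each per-window prediction and collapse consecutive duplicates
        caused by the small sliding-window increments."""
        remapped = [PHONE_MAP.get(p, p) for p in window_phones]
        return [phone for phone, _ in groupby(remapped)]

    # Example: ["s", "s", "pcl", "pcl", "p"] -> ["s", "h#", "p"]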
Whilst not quite in excess of the 82.3% result reported by Alex Graves (Graves, Mohamed and Hinton, 2013) with bidirectional LSTMs, or the DNN with stochastic depth (Chen, 2016), which achieved a competitive accuracy of 80.9%, our result is still comparable. Zhang et al. (Zhang, 2016) describe an RNN-CNN hybrid based on MFCC features. This novel approach uses conventional MFCC feature extraction with an RNN layer before a deep CNN structure; the hybrid system achieved an impressive 82.67% accuracy. It is not surprising to us that the current state of the art is a form of CNN (Tóth, 2015), with an 83.5% test accuracy. Notably, a team from Microsoft recently presented a fusion system that achieved state-of-the-art accuracy on the Switchboard corpus (Xiong et al., 2017). Each of the three ensemble members in the fusion system used some form of CNN architecture, particularly at the feature extraction part of the networks. It is becoming clear that CNNs are demonstrating superiority over RNNs for acoustic modelling.

Each spectrogram window typically contains 4 or 5 phones per 256 ms window, since the average speaker utters 15 phones per second. The pooling layers in the CNN-AM provide flexibility in where the feature in question (phones in this case) can be within the 256x256 image. This is useful for handling different orientations and scales of images in image classification, and is also particularly useful for phoneme recognition, where small errors in the training transcription are likely to exist.

During inferencing (testing), the CNN-AM makes probabilistic predictions over all the phone classes for each of the 294,101 test spectrograms. This capability is provided by the use of softmax nodes at three successive output stages of the network (loss1 to loss3). We carried out some simple graphical analysis of the output confidences of all the phones, employing colour coding of the outputs for easier readability of the results. This graphical analysis is presented in Figure 4, and as can be seen from the loss3 (accuracy) output, the network makes crisp classifications of usually only a single phone at a time. Given that this is unseen data, and that the comparison with the ground truth is good, we are confident that this network is an effective way to train an acoustic model.
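As a minimal sketch of how such per-patch confidences could be collected (assuming the illustrative PyTorch model from Section 3.1, and showing only the final output, whereas the analysis above inspects all three loss outputs):

    import torch

    @torch.no_grad()
    def phone_confidences(model, patches, device="cuda"):
        """Run inference over test patches and return, for each patch, the softmax
        confidence over the 61 phone classes (final output only)."""
        model.eval()
        confidences = []
        for patch in patches:                     # patch: tensor of shape (3, 256, 256)
            logits = model(patch.unsqueeze(0).to(device))
            confidences.append(torch.softmax(logits, dim=1).squeeze(0).cpu())
        return torch.stack(confidences)           # shape: (num_patches, 61)

    # The resulting matrix can be colour-coded (e.g. with matplotlib's imshow)
    # to produce a plot in the style of Figure 4.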


4 CONCLUSIONS

We have presented a novel application of CNNs to phoneme recognition. We have shown how the TIMIT speech corpus can be used to label spectrogram patches for CNN-AM training. The results, whilst not surpassing the current state of the art, are encouraging, and the usability and transparency of the output processing show that CNNs are a very viable way to do speech recognition. We have also done some initial experiments with NTIMIT, which contains noise from various telephone networks and, as it is telephone speech, has a narrower frequency range [0, 3.3 kHz]. Typically, we have found that NTIMIT results are around 10% lower than for TIMIT. However, in our preliminary tests we are within 1% of the TIMIT network’s performance, which suggests that the CNN approach is much more noise robust.

In the near future, we plan to develop strategies to acquire large volumes of phonetic transcriptions for training a more robust CNN-AM. We are also in the process of training a sequence-to-sequence language model to transform the phonetic output to text.

REFERENCES

Chen, D., Zhang, W., Xu, X., Xing, X., 2016. Deep networks with stochastic depth for acoustic modelling. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1-4.

Ciresan, D.C., Meier, U., Masci, J., Gambardella, L., Schmidhuber, J., 2011. Flexible, high performance convolutional neural networks for image classification. In Int Joint Conf Artificial Intelligence (IJCAI), vol. 22, no. 1, pp. 1237-1242.

Fukushima, K., 1980. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. In Biol Cybern, vol. 36, no. 4, pp. 193-202.

Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., Dahlgren, N., Zue, V., 1993. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web Download, Philadelphia: Linguistic Data Consortium.

Graves, A., Mohamed, A., Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In IEEE Int Conf Acoust Speech Signal Process (ICASSP), pp. 6645-6649.

Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R. et al., 2014. Deep speech: Scaling up end-to-end speech recognition. In arXiv preprint arXiv:1412.5567.

Hubel, D.H., Wiesel, T.N., 1962. Receptive fields, binocular interaction and functional architecture in cat's visual cortex. In J Physiol (London), vol. 160, pp. 106-154.

ImageNet Large Scale Visual Recognition Challenge (ILSVRC), 2011. http://image-net.org/challenges/LSVRC/2011/index.

Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In Adv Neural Inf Process Syst (NIPS), pp. 1097-1105.

LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D., 1990. Handwritten digit recognition with a back-propagation network. In Adv Neural Inf Process Syst (NIPS), pp. 396-404.

Lopes, C., Perdigao, F., 2011. Phone recognition on the TIMIT database. In Speech Technologies/Book 1, pp. 285-302.

NVIDIA DIGITS Interactive Deep Learning GPU Training System. https://developer.nvidia.com/digits.

Paulin, M.G., 1998. A method for analysing neural computation using receptive fields in state space. In Neural Networks, vol. 11, no. 7, pp. 1219-1228.

Shamma, S., 2001. On the role of space and time in auditory processing. In Trends in Cognitive Sciences, vol. 5, no. 8, pp. 340-348.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions. In IEEE Conf Computer Vision Pattern Recognition (CVPR), pp. 1-9.

Tóth, L., 2015. Phone recognition with hierarchical convolutional deep maxout networks. In EURASIP Journal on Audio, Speech, and Music Processing, vol. 1, pp. 1-13.

Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D., Zweig, G., 2017. The Microsoft 2016 conversational speech recognition system. In IEEE Int Conf Acoustics, Speech and Signal Processing (ICASSP), pp. 5255-5259.

Zhang, Z., Sun, Z., Liu, J., Chen, J., Huo, Z., Zhang, X., 2016. Deep Recurrent Convolutional Neural Network: Improving Performance for Speech Recognition. In arXiv preprint arXiv:1611.07174.
