Convolutional Neural Networks for Phoneme Recognition

Cornelius Glackin1, Julie Wall2, Gérard Chollet1, Nazim Dugan1 and Nigel Cannings1
1Intelligent Voice Ltd., London, U.K.
2School of Architecture, Computing and Engineering, University of East London, U.K.

Keywords: Phoneme Recognition, Convolutional Neural Network, TIMIT.

Abstract: This paper presents a novel application of convolutional neural networks to phoneme recognition. The
phonetic transcription of the TIMIT speech corpus is used to label spectrogram segments for training the
convolutional neural network. A fixed-size window slides over the spectrogram of the TIMIT utterances
and the resulting spectrogram patches are assigned to the appropriate phone class by parsing TIMIT’s phone
transcription. The convolutional neural network is the standard GoogLeNet implementation trained with
mini-batch stochastic gradient descent. After training, phonetic rescoring is performed in the usual way
to map the TIMIT phone set to the smaller standard set. Benchmark results are presented for comparison to
other state-of-the-art approaches. Finally, conclusions and future directions with regard to extending the
approach are discussed.

1 INTRODUCTION

Traditionally, Automatic Speech Recognition (ASR) involves multiple successive layers of feature extraction to compress the amount of information processed from the raw audio, so that training the ASR does not take an unreasonably long time. However, in recent years, with increases in computational speed, the adoption of parallel computation with General Purpose Graphics Processing Units (GPGPUs), and advances in neural networks (the so-called Deep Learning trend), many researchers are replacing traditional ASR algorithms with data-driven approaches that simply take the audio data in its frequency form (e.g. a spectrogram) and process it with a Deep Neural Network (DNN) or, more appropriately since speech is temporal, a Recurrent Neural Network (RNN) that can be trained quickly with GPUs. The RNN then converts the spectrogram directly to phonetic symbols and in some cases directly to text (Hannun et al., 2014).

Convolutional Neural Networks (CNNs) present an interesting alternative to the use of DNNs and RNNs for ASR. In this paper, we will demonstrate how the CNN, which is known for state-of-the-art performance on image processing tasks, can be adapted to learn the Acoustic Model (AM) component of an ASR system. The AM is responsible for extracting acoustic features from speech and classifying them into symbol classes. Specifically, in the CNN Acoustic Model (CNN-AM) presented in this paper, we use spectrograms as input and phonemes as output classes for training. We will use the phonetic transcription of the TIMIT corpus as the ‘ground truth’ for training, validating and testing the CNN-AM.

2 CNN-BASED ACOUSTIC MODELLING

A CNN is usually employed for the classification of static images, see for example (Krizhevsky, Sutskever and Hinton, 2012). CNNs are inspired by receptive fields in the mammalian brain, which are formed by neurons in the V1 processing centres of the visual cortex; they are also present in the cochlear nucleus of the auditory processing areas (Shamma, 2001). The receptive field of a sensory neuron transforms the firing of that neuron depending on its spatial input (Paulin, 1998). Usually there is an inhibitory region surrounding a receptive field which suppresses any stimulus falling outside the bounds of the receptive field. In this way, receptive fields behave like feature extractors.


Figure 1: Preparation of the images for GoogLeNet training. A sliding window moves over the 16 kHz STFT-based spectrogram. The sliding window is shown in greyscale; the resulting 256x256 pixel spectrogram patches are placed into phoneme classes according to the TIMIT transcription for training, validation and testing.

Inspired by the work of Hubel and Wiesel (Hubel and Wiesel, 1962), Fukushima developed the Neocognitron network (Fukushima, 1980), in which images are dissected by image processing operations for the automated extraction of features. These image processing operations were later formalised by Yann LeCun as convolutions; it was LeCun who coined the term CNN. The most notable example was LeNet-5 (LeCun et al., 1990), which was used to learn the MNIST handwritten character data set. LeNet-5 was the first network to use convolutions and subsampling or pooling layers.

One of the main strengths of the CNN is that, since Ciresan’s seminal GPU implementation (Ciresan et al., 2011) in 2011, they are typically trained in parallel on a GPU, and are now arguably the most common type of DNN being trained. One subtlety to note is that the larger the pooling area, the more information is condensed, which leads to slim networks that fit more easily into GPU memory (as they are more linear). However, if the pooling area is too large, too much information is thrown away and predictive performance decreases. The state of the art in CNNs is arguably GoogLeNet (Szegedy et al., 2015), the architecture that won the ImageNet competition (ILSVRC, 2011).

The main contribution of GoogLeNet is that it uses inception modules. Convolutions of different sizes are used within the module, which gives the network the ability to cope with different types of features. There are 1x1, 3x3, and 5x5 pixel convolutions; the kernel sizes are typically odd numbers so that the kernel can be centred on top of the image pixel in question. In the inception module there are also 1x1 convolutions which reduce the dimension of the feature vector, ensuring that the number of parameters to be optimised remains manageable. In fact, this reduced number of parameters is probably the principal contribution of the GoogLeNet CNN: it contains 4 million parameters, whereas its forerunner AlexNet (Krizhevsky, Sutskever and Hinton, 2012) has 60 million parameters to be optimised. The pooling layer also reduces the number of parameters, but its primary function is to make the network invariant to feature translation. The concatenation layer constructs a feature vector for processing by the next layer.
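As an illustrative sketch only (not the GoogLeNet configuration used in this work), the structure of a single inception module as described above can be expressed in PyTorch as follows; the channel counts are placeholder assumptions:

    import torch
    import torch.nn as nn

    class InceptionModule(nn.Module):
        """Simplified inception module: parallel 1x1, 3x3 and 5x5 convolutions plus a
        pooling branch, with 1x1 'bottleneck' convolutions to keep the number of
        parameters manageable (channel counts are illustrative, not GoogLeNet's)."""
        def __init__(self, in_ch):
            super().__init__()
            self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)              # 1x1 branch
            self.branch3 = nn.Sequential(
                nn.Conv2d(in_ch, 32, kernel_size=1),                        # 1x1 reduction
                nn.Conv2d(32, 64, kernel_size=3, padding=1))                # 3x3 convolution
            self.branch5 = nn.Sequential(
                nn.Conv2d(in_ch, 16, kernel_size=1),                        # 1x1 reduction
                nn.Conv2d(16, 32, kernel_size=5, padding=2))                # 5x5 convolution
            self.branch_pool = nn.Sequential(
                nn.MaxPool2d(kernel_size=3, stride=1, padding=1),           # pooling branch
                nn.Conv2d(in_ch, 32, kernel_size=1))

        def forward(self, x):
            # Concatenation layer: stack the branch outputs along the channel dimension.
            return torch.cat([self.branch1(x), self.branch3(x),
                              self.branch5(x), self.branch_pool(x)], dim=1)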

191
ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods

Figure 2: Distribution of phonemes within the TIMIT transcription.

3 PHONEME RECOGNITION WITH TIMIT

We used spectrograms to train a CNN to perform speech recognition. For this, we decided to use the TIMIT corpus to train the acoustic model (CNN), as it has accurate phoneme transcription (Garofolo et al., 1993). The TIMIT speech corpus was designed in 1993 as a speech data resource for acoustic-phonetic studies and has been used extensively for the development and evaluation of ASR systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance. TIMIT was designed to further acoustic-phonetic knowledge and ASR systems. It was commissioned by DARPA and worked on by many sites, including Texas Instruments (TI) and the Massachusetts Institute of Technology (MIT), hence the corpus’ name. TIMIT is the most accurately transcribed speech corpus in existence, as it contains not only transcriptions of the text but also accurate timing of phones. This is impressive given that the average English speaker utters 14-15 phones a second. Figure 1 shows a spectrogram and illustrates the accuracy of the word and phone transcription for one of TIMIT’s core training set utterances.

Spectrogram images were generated from the TIMIT corpus and placed in classes according to TIMIT’s phone transcription. Spectrograms were produced every 160 samples, which for 16 kHz audio corresponds to 10 ms, the standard resolution for capturing the acoustic features the audio contains. The phone ground truth is parsed and each spectrogram is labelled with the phone within whose interval its centre falls. Alternatively, one could have used the centre of the ground-truth interval and calculated the Euclidean distance between the centre of the phone interval and the window centre, but it was decided that this would make assumptions about where the phone is centred within the interval. It would also have required an additional computationally expensive step in the labelling of the spectrogram windows.

Figure 1 also illustrates the preparation of the training, validation and testing data. It shows how the phonetic transcription is used to label the 256x256 greyscale spectrogram patches as the sliding window passes over each of the TIMIT utterances. The labelled greyscale patches are sorted into the directory belonging to each of the 61 phoneme classes for each of the training, validation and testing sets. For the TIMIT corpus we use the standard core training setup. We use wide-band, Short-Term Fourier Transform (STFT) spectrograms, since we want to align the acoustic data with phonetic symbols with timing that is as accurate as possible. The FFT component of the spectrogram generation uses NVIDIA’s cuFFT library for speed. Figure 2 shows the distribution of the phones generated according to the TIMIT phone transcription in the training set. For readability, please note that the bars correspond to the alphabetically ordered phones in the key below.
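The patch preparation described above can be sketched as follows. This is a minimal illustration only, assuming SciPy’s STFT in place of the cuFFT implementation used in this work, plain 16 kHz WAV input, and TIMIT’s .PHN format of ‘start_sample end_sample phone’ lines; resizing of each patch to exactly 256x256 pixels is omitted:

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft

    def load_phone_intervals(phn_path):
        """Parse a TIMIT .PHN file: each line is 'start_sample end_sample phone'."""
        intervals = []
        with open(phn_path) as f:
            for line in f:
                start, end, phone = line.split()
                intervals.append((int(start), int(end), phone))
        return intervals

    def label_patches(wav_path, phn_path, hop_samples=160, patch_width=256):
        """Slide a window over the STFT spectrogram in 160-sample (10 ms) steps and
        label each patch with the phone whose interval contains the patch centre."""
        rate, signal = wavfile.read(wav_path)             # TIMIT audio is 16-bit, 16 kHz
        _, _, spec = stft(signal, fs=rate, nperseg=512, noverlap=512 - hop_samples)
        spec = np.log1p(np.abs(spec))                     # log-magnitude spectrogram
        intervals = load_phone_intervals(phn_path)

        patches = []
        for start_col in range(spec.shape[1] - patch_width):
            centre_sample = (start_col + patch_width // 2) * hop_samples
            for s, e, phone in intervals:
                if s <= centre_sample < e:
                    patches.append((phone, spec[:, start_col:start_col + patch_width]))
                    break
        return patches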


Figure 3: TIMIT stochastic gradient descent training.

As can be seen from Figure 2, the largest class is ‘s’ and the second largest is ‘h#’ (silence); the latter occurs at the beginning and end of each TIMIT utterance. The distribution is unbalanced, which of course makes phoneme recognition by neural network architectures challenging. The training data is the standard TIMIT core set, and the standard test set sub-directories DR1-4 and DR5-8 were used for validation and testing respectively. This partitioning resulted in 1,417,588 spectrogram patches in the training set, as well as 222,789 and 294,101 spectrograms in the validation and testing sets respectively.

3.1 GoogLeNet Training and Inferencing

The GoogLeNet implementation was trained with Stochastic Gradient Descent (SGD). Before the Deep Learning boom, gradient descent was usually performed using the full set of training samples (full batch) to determine the next update of the parameters. The problem with this approach is that it is not parallelizable, and hence cannot be implemented efficiently on a GPU. SGD does away with this approach by computing the gradient of the parameters on a single or a few (mini-batch) training samples. For large datasets, such as this one, SGD performs qualitatively as well as batch methods but outperforms them in computational time.

A stepped learning rate was used with a mini-batch size of 256 samples; Figure 3 shows the training accuracy. The network outputs the phone class prediction at three different points in the network architecture (loss1, loss2, and loss3). The NVIDIA DIGITS implementation employed also reports the top-1 and top-5 predictions for each of those loss (accuracy) outputs. loss3 (the last network output) reports the highest accuracy, which is 71.65% for top-1 classification of the 61 phones. For top-5, the accuracy is reported as 96.27%, which means that the correct phone was listed in the top five output classifications of the network. This is interesting because, as mentioned earlier, each spectrogram window contains 4 to 5 phones on average, and preliminary tests confirmed that in the majority of cases the other phones were indeed being correctly identified.

The network is trained using the training data (1.4 million spectrograms), and the validation set (approximately 223 thousand spectrograms) is used to check training progress; the separate testing set (approximately 294 thousand images) is then used to test the system. After each iteration of training, in which the training data has been used to learn the network weights, the validation data is used to check that the accuracy of the latest iteration of the trained system is still improving. The validation data is kept separate from the training data and is only used to monitor the progress of training, and to stop training if overfitting occurs. The iteration with the highest validation accuracy is used as the final system.
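The training set-up was run through NVIDIA DIGITS; purely as an illustrative equivalent (not the code used in this work), a mini-batch SGD loop with a stepped learning rate and GoogLeNet’s three outputs could be sketched in PyTorch as below. The learning-rate schedule, loss weights, epoch count and data pipeline are assumptions, and the greyscale patches are assumed to be replicated to three channels to match the stock model:

    import torch
    from torch import nn, optim
    from torch.utils.data import DataLoader
    from torchvision.models import googlenet

    def train(train_set, epochs=30, device="cuda"):
        # 61 TIMIT phone classes; aux_logits enables the loss1/loss2 side outputs.
        model = googlenet(num_classes=61, aux_logits=True, init_weights=True).to(device)
        loader = DataLoader(train_set, batch_size=256, shuffle=True)
        criterion = nn.CrossEntropyLoss()
        optimiser = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
        # Stepped learning rate: divide by 10 every 10 epochs (values are illustrative).
        scheduler = optim.lr_scheduler.StepLR(optimiser, step_size=10, gamma=0.1)

        for epoch in range(epochs):
            model.train()
            for patches, phones in loader:       # patches: (256, 3, 256, 256) tensors
                patches, phones = patches.to(device), phones.to(device)
                out = model(patches)             # (logits, aux2, aux1) in training mode
                loss = (criterion(out.logits, phones)
                        + 0.3 * criterion(out.aux_logits2, phones)
                        + 0.3 * criterion(out.aux_logits1, phones))
                optimiser.zero_grad()
                loss.backward()
                optimiser.step()
            scheduler.step()
            # Validation accuracy would be computed here to select the best epoch.
        return model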


Figure 4: All network outputs for a test utterance.

As can be seen in Figure 3, the highest validation accuracy occurs at epoch (iteration) 20, and this is taken as the final version of the trained system. We then use the trained system to perform inferencing over the test set; Figure 4 shows an example of the predictions the system makes for a single sample of this previously unseen test data. The output of the inferencing process contains many duplicates of phones due to the small increments of the sliding window position.

3.2 Post-processing and Rescoring

Hence, an additional post-processing script was written to remove the duplicates. It is the convention in the literature, when reporting results for the TIMIT corpus, to re-score the results on a smaller set of phones (Lopes and Perdigao, 2011). The phoneticians that scored TIMIT used 61 phone symbols, many of which are not conventionally used by other speech recognition systems. For example, there are phone symbols called closures, e.g. pcl, kcl, tcl, bcl, dcl, and gcl, which simply refer to the closing of the mouth before the release of closure that results in the p, k, t, b, d, or g phones being uttered respectively. Most acoustic models map these to the silence symbol ‘h#’. Post-processing code was written to automatically remap the output of model inferencing to the new phone set. The results for the test set were then generated with the new remapping, and the accuracy increased from 71.65% (shown in Figure 3) to 77.44% after rescoring.
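A minimal sketch of this post-processing step is given below. Only the closure-to-silence merges named above are spelled out; the remaining entries of the standard reduced phone set mapping (Lopes and Perdigao, 2011) would be added in the same way:

    from itertools import groupby

    # Partial remapping of the 61 TIMIT phones to the smaller scoring set.
    # Closures are folded into silence as described above; the other merges
    # from the standard reduced set would be added here.
    PHONE_MAP = {"pcl": "h#", "kcl": "h#", "tcl": "h#",
                 "bcl": "h#", "dcl": "h#", "gcl": "h#"}

    def rescore(window_phones):
        """Remap each per-window prediction and collapse consecutive duplicates
        caused by the small sliding-window increments."""
        remapped = [PHONE_MAP.get(p, p) for p in window_phones]
        return [phone for phone, _ in groupby(remapped)]

    # Example: ["s", "s", "pcl", "pcl", "p"] -> ["s", "h#", "p"]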
Whilst not quite in excess of the 82.3% result reported by Alex Graves (Graves, Mohamed and Hinton, 2013) with bidirectional LSTMs, or the DNN with stochastic depth (Chen, 2016), which achieved a competitive accuracy of 80.9%, our result is still comparable. Zhang et al. (Zhang, 2016) describe an RNN-CNN hybrid based on MFCC features. This novel approach uses conventional MFCC feature extraction with an RNN layer before a deep CNN structure; the hybrid system achieved an impressive 82.67% accuracy. It is not surprising to us that the current state of the art is a form of CNN (Tóth, 2015), with an 83.5% test accuracy. Notably, a team from Microsoft recently presented a fusion system that achieved state-of-the-art accuracy on the Switchboard corpus (Xiong et al., 2017). Each of the three ensemble members in the fusion system used some form of CNN architecture, particularly at the feature extraction part of the networks. It is becoming clear that CNNs are demonstrating superiority over RNNs for acoustic modelling.

Each spectrogram window typically contains 4 or 5 phones per 256 ms window, since the average speaker utters 15 phones per second. The pooling layers in the CNN-AM provide flexibility in where the feature in question (phones in this case) can be within the 256x256 image. This is useful for handling different orientations and scales of images in image classification, and is also particularly useful for phoneme recognition, where small errors in the training transcription are likely to exist.

During inferencing (testing), the CNN-AM makes probabilistic predictions over all the phone classes for each of the 294,101 test spectrograms. This capability is provided by the use of softmax nodes at three successive output stages of the network (loss1 to loss3). We carried out some simple graphical analysis of the output confidences of all the phones, employing colour coding of the outputs for easier readability of the results. This graphical analysis is presented in Figure 4, and as can be seen from the loss3 (accuracy) output, the network makes crisp classifications of usually only a single phone at a time. Given that this is unseen data, and that the comparison with the ground truth is good, we are confident that this network is an effective way to train an acoustic model.
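As a minimal sketch of how such per-patch confidences could be collected (assuming the illustrative PyTorch model from Section 3.1, and showing only the final output, whereas the analysis above inspects all three loss outputs):

    import torch

    @torch.no_grad()
    def phone_confidences(model, patches, device="cuda"):
        """Run inference over test patches and return, for each patch, the softmax
        confidence over the 61 phone classes (final output only)."""
        model.eval()
        confidences = []
        for patch in patches:                     # patch: tensor of shape (3, 256, 256)
            logits = model(patch.unsqueeze(0).to(device))
            confidences.append(torch.softmax(logits, dim=1).squeeze(0).cpu())
        return torch.stack(confidences)           # shape: (num_patches, 61)

    # The resulting matrix can be colour-coded (e.g. with matplotlib's imshow)
    # to produce a plot in the style of Figure 4.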


4 CONCLUSIONS

We have presented a novel application of CNNs to phoneme recognition. We have shown how the TIMIT speech corpus can be used to label spectrogram patches for CNN-AM training. The results, whilst not surpassing the current state of the art, are encouraging, and the usability and transparency of the output processing show that CNNs are a very viable way to do speech recognition. We have also done some initial experiments with NTIMIT, which contains noise from various telephone networks and, as it is telephone speech, has a narrower frequency range [0, 3.3 kHz]. Typically, we have found that NTIMIT results are around 10% lower than for TIMIT. However, in our preliminary tests we are within 1% of the TIMIT network’s performance, which suggests that the CNN approach is much more noise robust.

In the near future, we plan to develop strategies to acquire large volumes of phonetic transcriptions for training a more robust CNN-AM. We are also in the process of training a sequence-to-sequence language model to transform the phonetic output to text.

REFERENCES

Chen, D., Zhang, W., Xu, X., Xing, X., 2016. Deep networks with stochastic depth for acoustic modelling. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1-4.

Ciresan, D.C., Meier, U., Masci, J., Gambardella, L., Schmidhuber, J., 2011. Flexible, high performance convolutional neural networks for image classification. In Int Joint Conf Artificial Intelligence (IJCAI), vol. 22, no. 1, pp. 1237-1242.

Fukushima, K., 1980. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. In Biol Cybern, vol. 36, no. 4, pp. 193-202.

Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., Dahlgren, N., Zue, V., 1993. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web Download, Philadelphia: Linguistic Data Consortium.

Graves, A., Mohamed, A., Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In IEEE Int Conf Acoust Speech Signal Process (ICASSP), pp. 6645-6649.

Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R. et al., 2014. Deep speech: Scaling up end-to-end speech recognition. In arXiv preprint arXiv:1412.5567.

Hubel, D.H., Wiesel, T.N., 1962. Receptive fields, binocular interaction and functional architecture in cat's visual cortex. In J Physiol (London), vol. 160, pp. 106-154.

ImageNet Large Scale Visual Recognition Challenge (ILSVRC), 2011. http://image-net.org/challenges/LSVRC/2011/index.

Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In Adv Neural Inf Process Syst (NIPS), pp. 1097-1105.

LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D., 1990. Handwritten digit recognition with a back-propagation network. In Adv Neural Inf Process Syst (NIPS), pp. 396-404.

Lopes, C., Perdigao, F., 2011. Phone recognition on the TIMIT database. In Speech Technologies/Book 1, pp. 285-302.

NVIDIA DIGITS Interactive Deep Learning GPU Training System. https://developer.nvidia.com/digits.

Paulin, M.G., 1998. A method for analysing neural computation using receptive fields in state space. In Neural Networks, vol. 11, no. 7, pp. 1219-1228.

Shamma, S., 2001. On the role of space and time in auditory processing. In Trends in Cognitive Sciences, vol. 5, no. 8, pp. 340-348.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions. In IEEE Conf Computer Vision Pattern Recognition (CVPR), pp. 1-9.

Tóth, L., 2015. Phone recognition with hierarchical convolutional deep maxout networks. In EURASIP Journal on Audio, Speech, and Music Processing, vol. 1, pp. 1-13.

Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D., Zweig, G., 2017. The Microsoft 2016 conversational speech recognition system. In IEEE Int Conf Acoustics, Speech and Signal Processing (ICASSP), pp. 5255-5259.

Zhang, Z., Sun, Z., Liu, J., Chen, J., Huo, Z., Zhang, X., 2016. Deep Recurrent Convolutional Neural Network: Improving Performance for Speech Recognition. In arXiv preprint arXiv:1611.07174.
