Pyramid With Super Resolution For In-The-Wild Facial Expression Recognition
Digital Object Identifier 10.1109/ACCESS.2020.3010018
ABSTRACT Facial Expression Recognition (FER) is a challenging task that improves natural
human-computer interaction. This paper focuses on automatic FER on a single in-the-wild (ITW) image.
ITW images suffer from real problems of pose, direction, and input resolution. In this study, we propose a
pyramid with super-resolution (PSR) network architecture to solve the ITW FER task. We also introduce
a prior distribution label smoothing (PDLS) loss function that applies the additional prior knowledge of the
confusion about each expression in the FER task. Experiments on the three most popular ITW FER datasets
showed that our approach outperforms all the state-of-the-art methods.
and late fusion to combine the results from the VGG and the ResNet models. Zeng et al. extracted image histograms of oriented gradients and passed them through deep sparse autoencoders to classify them [17]. Tozadore et al. grouped emotions into several groups to help the CNN classify with better accuracy [18].

Despite these successes on in-the-lab datasets, the rise of in-the-wild (ITW) datasets in recent years has raised new challenges for researchers. While in-the-lab datasets were collected under controlled conditions, so the data were clean, accurate, and uniform, ITW datasets are noisy, inaccurate, and variable. We outline the following two observations about ITW datasets for the FER task.

Observation 1: The image size of ITW datasets varies. While the size of in-the-lab dataset images is controlled and nearly constant, ITW dataset images have various sizes, from very small to large. Figure 1 shows the image size distribution of the RAF-DB [11], [12] (Fig. 1a) and AffectNet [13] (Fig. 1b) datasets. These two selected datasets are the most popular ITW datasets for the FER task. Because of the differences in width and height, the average of the two is taken as the size of an image. In both datasets, small images occur more frequently, and this frequency decreases with increasing size. The mean and variance of the image size in the RAF-DB are 193 and 144, which is fairly large. The AffectNet dataset has larger image sizes, ranging from 130 pixels to more than 2000 pixels. In the graph, we round all images larger than 2000 pixels to the fixed value of 1000 pixels. Similar to the RAF-DB dataset, the number of images decreases as the image size increases. The third most popular ITW dataset for the FER task is the FER+ dataset [19], extended from FER2013 [20]. It also faces the different-image-size problem; unfortunately, the original image size information was omitted when the authors of the dataset published it. Most studies in this field do not consider the image-size problem. They simply resize all images to the same size, e.g., 128 × 128 or 224 × 224. The first reason is the DL framework itself: in batch mode, each batch must have the same input shape, and implementing different input sizes at the same time takes more effort and is complicated and computationally inefficient. While the CNN architecture has been successful for many image classification tasks, this practice rests on the assumption that, despite the resizing of the images, the network can learn to distinguish the classes by itself. Nearest-neighbor, bilinear, and bicubic interpolation are popular techniques to scale image sizes.

FIGURE 1. The image size distribution of the RAF-DB [11], [12] and AffectNet [13] datasets.
Observation 2: CNNs are usually sensitive to the input image size. While the CNN has been very successful for many tasks related to image classification and segmentation, this architecture suffers from several weaknesses. One of them is the sensitivity to the size of the input image. Zooming is one of the data augmentation techniques that attempts to address this problem. The selected zooming scale in most experiments ranges from 0.9 to 1.2, because values outside this range degrade and damage the network. With global pooling, CNN networks can support different input sizes, and the size-incremental technique has been used to train networks more quickly and help them converge more easily. Despite the improvement offered by this process, the network remains sensitive to the input size: a network trained with one input size works poorly on the same images at a different scale. Figure 2 shows the training and validation loss for VGG16 when training on the RAF-DB and the FER+ at different scales: 50 × 50, 100 × 100, 150 × 150, and back to 50 × 50 again for the RAF-DB, and 48 × 48, 96 × 96, 192 × 192, and again 48 × 48 for the FER+, for every 20 epochs in sequence. We transfer the weights from ImageNet [21] and then freeze the whole CNN architecture except the fully connected layers. The frozen steps were trained for 20 epochs at the smallest input image size. At the points where the image size changes (epochs 41, 61, 81), the losses of both the training and validation sets increase significantly. At epoch 81, although the input size returns to the 48 × 48 size used to train the network before, the loss value still increases because of the characteristics of convolution. The convolution layer uses a kernel (of size 3 × 3, 5 × 5, or similar) to scan the "pixels" of the previous layer. Then, even though the image is the same but at a different scale, the next convolution
The localization network outputs θ, a matrix of size 2 × 3, which is a representation of an affine transform in a 2D image. The grid generator then accepts θ and makes a grid, and finally, the sampler uses this grid to generate the output image. The output image is produced from the input image with rotation, scaling, and other transform operators. The input and output of this block are images with the same size and the same number of channels.

Different from in-the-lab images, ITW images vary greatly in head pose and direction. We add the STN block to help the network learn to align the face and make it easier to recognize.

Our implementation details follow the previously published paper [29]. Table 1 shows the details of the internal layers of this block. For the convolution layers, the parameters are the input channels, output channels, kernel size, and stride. Only the kernel size and stride are needed for the maxpool2d layer. For the linear layer, only two parameters are needed: the number of input nodes and the number of output nodes. After the localization, the feature map is flattened and passed through the fully connected part. Our algorithm calculates the size of the feature map dynamically based on the input size, so the block is adaptive to different sizes of the input images.

TABLE 1. The details of the STN block.
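The exact layer configuration from Table 1 is not reproduced in this extraction, so the following PyTorch sketch only illustrates the kind of size-adaptive STN block described above. The layer widths (loc_channels, fc_dim) and the adaptive pooling used to keep the flattened feature size fixed are our assumptions, not the paper's values; only the 2 × 3 affine output and the same-size input/output behavior follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STNBlock(nn.Module):
    """Size-adaptive spatial transformer (sketch; layer sizes are illustrative)."""

    def __init__(self, in_channels=3, loc_channels=32, fc_dim=64):
        super().__init__()
        # Localization network: a few conv/pool layers, then a fixed-size pooled
        # map so the fully connected part works for any input resolution.
        self.localization = nn.Sequential(
            nn.Conv2d(in_channels, loc_channels, kernel_size=5, stride=1, padding=2),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(loc_channels, loc_channels, kernel_size=3, stride=1, padding=1),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(4),          # stand-in for the dynamic size computation
        )
        self.fc_loc = nn.Sequential(
            nn.Linear(loc_channels * 4 * 4, fc_dim),
            nn.ReLU(inplace=True),
            nn.Linear(fc_dim, 6),             # theta: 2 x 3 affine parameters
        )
        # Start from the identity transform so early training leaves the image unchanged.
        self.fc_loc[-1].weight.data.zero_()
        self.fc_loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        feat = self.localization(x)
        theta = self.fc_loc(feat.flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        # Output has the same size and number of channels as the input.
        return F.grid_sample(x, grid, align_corners=False)
```

Initializing the last layer to the identity transform is a common design choice for STNs, so that the alignment is learned gradually rather than distorting the face from the first iterations.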
B. THE SCALING BLOCK
The scaling block is the leading block in our architecture. The main idea of this block is to view the input image at different scales, from small to large. To that end, super-resolution is used to upscale the image. As in many CNNs, to ensure the efficiency of memory and computing, the input images are kept at the same size, and to use the best information from the input images, they are passed to the network at the largest attainable size. The input size may be limited by the computational budget and depends on each dataset. When images of the same size are passed in, as in the first observation, many of them are in low resolution and have been up-scaled by some traditional algorithm. Our approach, instead, down-scales them and then up-scales them again using the SR technique. This block is meant to view the overall context in the low-resolution images, along with the high-resolution image, to consider the original features.

In the scaling block, the network branches into three or more sub-networks. All sub-networks work with the same input image but at a different scale. The last branch receives the original input images, which have the highest resolution for the network. Due to the computational limit, most studies in the field of image classification use input images from 100 up to a maximum of 312 pixels; for larger input sizes, the higher resolution does not improve the performance. For batch mode, all images are resized to the central size before being passed through the network. Larger images are then down-scaled, and smaller images need to be up-scaled. We call the original input size W × H. This scaling of the input uses a traditional algorithm such as nearest-neighbor, bilinear, or bicubic interpolation. While down-scaling an image is safe, up-scaling from a small image to a larger size with a traditional algorithm is complicated and inaccurate. Our approach aims to overcome this issue. The first branch is applied to the lowest-resolution image, which is down-scaled from the original input by a simple operator, implemented with mean pooling. We declare the values step and kstep for the step scale between two neighboring branches. By the limit of DL, step is set to 2. A larger kstep can be used, but due to the computational limitation, we restrict kstep to only 1 or 2. The size of the image for the first branch is

W/2^{kstep} × H/2^{kstep}
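To make the branch sizes concrete, here is a minimal sketch of how the inputs of the scaling block could be prepared: the first branch is a mean-pooled, low-resolution copy, the SR branches (described next) upscale it back by factors of 2, and the last branch keeps the original image. The function name is hypothetical, and a bicubic upsample stands in for the EDSR model used in the paper.

```python
import torch
import torch.nn.functional as F

def build_branch_inputs(x, kstep=1, sr_model=None):
    """Prepare the multi-scale inputs of the scaling block (sketch).

    x: batch of shape (N, C, H, W) at the original resolution W x H.
    kstep: number of halving steps (the paper restricts kstep to 1 or 2).
    sr_model: an x2 super-resolution network (EDSR in the paper); bicubic
              upsampling is used below as a stand-in when none is given.
    """
    # First branch: lowest resolution, W/2^kstep x H/2^kstep, via mean pooling.
    low = F.avg_pool2d(x, kernel_size=2 ** kstep)
    branches = [low]

    # kstep SR branches: upscale the low-resolution image by 2, 4, ..., so the
    # last SR output has the same size as the original input (Eq. 1).
    current = low
    for _ in range(kstep):
        if sr_model is not None:
            current = sr_model(current)
        else:
            current = F.interpolate(current, scale_factor=2, mode="bicubic",
                                    align_corners=False)
        branches.append(current)

    # Last branch: the original, highest-resolution image itself.
    branches.append(x)
    return branches

# With an original input of 100 x 100 and kstep = 1, this yields a three-branch
# setup (50x50, 100x100 from SR, 100x100 original), consistent with the
# 100-pixel, three-branch configuration reported in the experiments.
imgs = torch.randn(2, 3, 100, 100)
print([tuple(b.shape[-2:]) for b in build_branch_inputs(imgs, kstep=1)])
```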
Between the first and the last branch, there are kstep SR branches, each of which is an SR block with a scale factor of 2, 4, 8, . . . applied to the lowest-resolution image from the first branch. The size of the i-th SR branch is given by equation 1.

W/2^{kstep−i} × H/2^{kstep−i}   (1)

In case k = 1, there is only one SR branch in the scaling block, and its output size is the same as the original input size. In case k = 2, there are two SR branches, with sizes [W/2, H/2] and [W, H]. Our setup always ensures that the last SR part has the same size as the original input. For the SR task, we use the EDSR architecture introduced by Lim et al. [27].

By learning how to resample the image, we assume that this block can add useful information for this particular task, and thereby increase the accuracy of the prediction model.
C. LOW AND HIGH-LEVEL FEATURE EXTRACTOR
Typically, low and high-level feature extractors are combined in a base network architecture. We choose VGG16 [22] as the base network because this network is still used as the base of many recent networks for the FER task [31]–[33]. From the base network, VGG16 [22], we separate two parts for two levels of input. The low-level feature extractor receives the images as input and generates the feature map corresponding to the data. This block works at a low level of features, e.g., edges, corners, and so on. The high-level feature extractor receives the feature map from the low-level part and produces more in-depth, high-level features for the input.

While the input is passed through both extractors in this order, we separate them into two parts so that one can be shared across branches. As in the second observation, we know that CNNs are very sensitive to the input size, and here, each branch has a different input size. The low-level features of each branch are quite different and cannot be shared, because sharing the low-level layers damages the network. The high-level feature block is in a different situation: at this level, high-level features need to be learned and are less dependent on the size of the input, so the weights of this block can be shared across branches. The shared weights also act in a similar way to multi-task learning, where the combination helps each task obtain better results.

The position of the cutting point is denoted pos; it is the position of the convolution layer in the base network where we separate the two parts. A lower pos value means that all branches share the weights of most of the internal layers, while the highest value of pos separates all branches. From the second observation, we assume that a low pos value degrades the network. Since the base network is VGG16, which has 12 convolution layers, the cutting position pos should be in 0 − 12, which is the index of the corresponding convolution layer. We analyze the effect of the cutting point (the pos value) in the experiments.
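As one way to realize the cutting point pos, the sketch below splits torchvision's VGG16 convolutional stack at the pos-th convolution layer: each branch receives its own copy of the low-level part, while a single high-level part is shared across branches. This is an illustrative sketch under those assumptions, not the authors' implementation.

```python
import copy
import torch.nn as nn
from torchvision.models import vgg16

def split_vgg16_at(pos, num_branches=3):
    """Split the VGG16 feature stack at the pos-th convolution layer (sketch).

    Returns one low-level extractor per branch (not shared) and a single
    high-level extractor whose weights are shared across all branches.
    pos is assumed to lie between 0 and the number of convolution layers.
    """
    features = vgg16().features   # load pretrained weights here if desired

    conv_seen = 0
    cut_index = 0
    for i, layer in enumerate(features):
        if isinstance(layer, nn.Conv2d):
            conv_seen += 1
        if conv_seen == pos:
            cut_index = i + 1     # cut right after the pos-th convolution
            break

    low_level_template = features[:cut_index]
    high_level_shared = features[cut_index:]

    # Each branch owns an independent copy of the low-level layers ...
    low_level_branches = nn.ModuleList(
        [copy.deepcopy(low_level_template) for _ in range(num_branches)])
    # ... while the high-level layers form one shared module.
    return low_level_branches, high_level_shared
```

With pos = 0 everything is shared, while a large pos leaves almost the entire stack branch-specific, which matches the trade-off discussed above.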
D. FULLY CONNECTED BLOCK AND CONCATENATION BLOCK
The fully connected block includes two fully connected layers (Linear, FC) and several additional layers. The output feature from the high-level block passes through this block to obtain the vector representing the score for each label. Depending on the experiment, we use either seven or eight emotions, and the output vector size is set to seven or eight, respectively. We also use BatchNorm1d for the last feature map, and two dropout layers with p values of 0.25 and 0.5 for the first and second FC layers, respectively. The ReLU activation function is applied after the first FC layer. Similar to the high-level feature extractor block, the fully connected block is also shared among branches.

All branches are fused with the weighted late fusion strategy. The weight of each branch is determined according to its contribution to the final score of the whole network.
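The exact weighting scheme is not specified in this excerpt; the sketch below shows one simple realization of weighted late fusion, in which fixed per-branch weights (e.g., (1, 2, 1) for a three-branch model) scale the per-branch score vectors before they are summed. The normalization and the example weights are assumptions for illustration only.

```python
import torch

def weighted_late_fusion(branch_scores, weights):
    """Weighted late fusion of per-branch score vectors (sketch).

    branch_scores: list of tensors of shape (N, num_classes), one per branch.
    weights: per-branch weights, e.g. (1, 2, 1) for three branches; the exact
             values reflect each branch's contribution and are a configuration choice.
    """
    w = torch.tensor(weights, dtype=branch_scores[0].dtype,
                     device=branch_scores[0].device)
    w = w / w.sum()                                   # normalize the contributions
    stacked = torch.stack(branch_scores, dim=0)       # (B, N, num_classes)
    return (w.view(-1, 1, 1) * stacked).sum(dim=0)    # fused scores, (N, num_classes)
```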
III. THE PRIOR DISTRIBUTION LABEL SMOOTHING (PDLS) LOSS FUNCTION
FER for basic emotions is a classification problem, where each input image is classified into one of seven or eight classes. Softmax cross-entropy is the most popular loss function for classification tasks. The cross-entropy (CE) loss function is given in equation 2.

CE = − Σ_{c∈C} t_c · log(σ(z_c))   (2)

where:
• CE: cross entropy
• C: set of classes (labels)
• t_c: the distribution value of label c in the ground truth, where Σ_{c∈C} t_c = 1
• σ(z_c): softmax function for z_c
• z_c: raw score for class c from the model

In the real world, it is difficult to get the ground-truth distribution of the labels for each sample; therefore, the all-in-one assumption is used in most cases. In the ideal case, a sample belongs to one and only one class; therefore, the one-hot vector is widely used for labeling in classification tasks, so that equation 2 becomes the simple case of −log(σ(z_k)), where t_c = 0 for all c ∈ C except the correct label k (t_k = 1). Then, all terms except the one for label k are omitted.
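Equation 2 with an arbitrary target distribution t can be written directly as below; with a one-hot target it reduces to the usual −log(σ(z_k)), which the small check at the end confirms against PyTorch's built-in cross-entropy. This is a sketch for clarity, not the paper's code.

```python
import torch
import torch.nn.functional as F

def soft_target_cross_entropy(logits, target_dist):
    """Eq. (2): CE = -sum_c t_c * log(softmax(z)_c), averaged over the batch.

    logits:      raw scores z of shape (N, C) from the model.
    target_dist: ground-truth distributions t of shape (N, C); each row sums to 1.
    """
    log_probs = F.log_softmax(logits, dim=1)
    return -(target_dist * log_probs).sum(dim=1).mean()

# One-hot targets recover the usual -log(softmax(z)_k):
logits = torch.randn(4, 8)
labels = torch.tensor([0, 3, 5, 7])
one_hot = F.one_hot(labels, num_classes=8).float()
assert torch.allclose(soft_target_cross_entropy(logits, one_hot),
                      F.cross_entropy(logits, labels), atol=1e-6)
```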
The Label Smoothing (LS) loss function has been introduced in other studies [34], [35], and [36]. The formula for LS is given as equation 3. The main idea is the contribution of all incorrect labels. The parameter α is set to around 0.9, meaning that the contribution of the other labels is very small; e.g., for the FER task with |C| = 8, the weight for each of them is 0.1/8 ≈ 0.0125 and the weight for the correct label is 0.9125. Although the weight of the incorrect labels is small, LS has been used successfully in many classification tasks. The advantage of LS over CE with one-hot labels is that all label scores predicted by the model are activated. The backpropagation process can then learn not only how to increase the score for the correct label but also how to decrease the scores of the incorrect ones.

LS = −α · log(σ(z_k)) − (1 − α)/|C| · Σ_{c∈C} log(σ(z_c))   (3)

where:
• |C|: size of the label set
• α: parameter controlling the weight of each part

In the LS loss function, all labels except the correct one are treated equally, i.e., they have a small and identical role. LS can be used extensively in many tasks when there is no information about the distribution. However, in many tasks like FER, for a particular correct label, the confusion with the other classes is not uniform. The FER task has two advantages: the number of labels is small, just seven or eight, and, more importantly, we know that for a particular label, the confusion with some specific classes is higher than with others. For example, the correct label fear is much more likely to be confused with surprise than with disgust. Another example is the disgust expression, which can more easily be mistaken for neutral or sadness than for anger or fear. If we have this prior knowledge, the smoothing part should not be a uniform distribution. So, we propose an extended version of LS with additional prior knowledge of the label confusion, called PDLS. The PDLS loss function is composed of two parts, the one-hot part and the prior distribution, as shown in equation 4.

PDLS = − Σ_{c∈C} (t_c · α + d_kc · (1 − α)) · log(σ(z_c))   (4)

where:
• α is a parameter to control the weights of the one-hot and distribution parts.
• d_kc is the prior distribution for the correct label k and the confusion label c.

All notations in equation 4 are similar to those in equations 2 and 3. The d_kc value is the new operand in this formula, and it replaces the uniform distribution 1/|C| in the LS loss function. The d matrix has the following properties:

size = |C| × |C|
Σ_{c∈C} d_kc = 1, ∀k ∈ C
argmax(d_k1, d_k2, . . . , d_k|C|) = k, ∀k ∈ C

The most important part is how to calculate d_kc. Following Barsoum et al. [19], when correcting the labels of the FER2013 dataset [20], the authors of FER+ also provided the label distribution information for every sample. In FER+, each sample was labeled by ten people, who had to classify each image into the eight basic classes plus two additional classes, unknown and non-face. While the correct label distribution for each sample is difficult to obtain, we assume that the method used to build FER+ gives a good approximation of the ground-truth distribution. For each sample s ∈ S, where S is the FER+ dataset, we have the approximate distribution ad_s. Since unknown and non-face images are omitted, we only use the information for the eight basic emotions, denoted by E. Then ad_s is a vector in R^8, where 8 is the size |E|, and Σ ad_s = 1. Equation 5 calculates the average distribution for each ground-truth emotion k. In this calculation, we use only the training set of FER+.

d_k = (Σ_{s∈S_k} ad_s) / |S_k|   (5)

where:
• d_k: the average distribution for label k, d_k ∈ R^8
• |S_k|: the size of the subset S_k, S_k ⊂ S, whose ground-truth emotion is k, with ∪_{k∈E} S_k = S.

The final prior distribution d_k for the FER task is provided in table 2. Each row of the table is d_k, where k is one of the eight emotion labels. The columns are the confusion labels, again the eight emotion labels. For example, d_neutral,sadness = 0.114 means that when an image is neutral, there is an 11.4% chance of confusing it with sadness. The number on the main diagonal, which represents the distribution for the emotion itself, is always higher than 0.5. The happiness emotion is very clear and easy to detect: d_happiness,happiness = 0.918, whereas fear and disgust are difficult to detect and easy to confuse.
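Putting equations 4 and 5 together, the sketch below first estimates the prior matrix d by averaging the per-sample label distributions of the FER+ training set for each ground-truth class, and then mixes the one-hot target with the row d_k inside the cross-entropy. The tensor layout of the vote distributions and the function names are our assumptions; only the formulas follow the text.

```python
import torch
import torch.nn.functional as F

def estimate_prior_matrix(label_dists, labels, num_classes=8):
    """Eq. (5): d_k is the average of the approximate distributions ad_s over
    the training samples whose ground-truth emotion is k.

    label_dists: (S, C) tensor of per-sample label distributions (rows sum to 1),
                 e.g. normalized FER+ vote counts over the eight basic emotions.
    labels:      (S,) tensor of ground-truth class indices.
    """
    d = torch.zeros(num_classes, num_classes)
    for k in range(num_classes):
        d[k] = label_dists[labels == k].mean(dim=0)
    return d  # row k is d_k, the expected confusion profile of class k

def pdls_loss(logits, labels, d, alpha=0.9):
    """Eq. (4): PDLS = -sum_c (alpha * t_c + (1 - alpha) * d_kc) * log(softmax(z)_c)."""
    one_hot = F.one_hot(labels, num_classes=logits.size(1)).float()
    target = alpha * one_hot + (1.0 - alpha) * d[labels]   # (N, C) mixed targets
    log_probs = F.log_softmax(logits, dim=1)
    return -(target * log_probs).sum(dim=1).mean()
```

Setting d to the uniform matrix 1/|C| recovers ordinary label smoothing, which makes the relationship between equations 3 and 4 explicit.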
IV. DATASETS
There are three popular ITW datasets for the FER task, namely the FER+ [19], RAF-DB [11], [12], and AffectNet [13] datasets. In this study, the experiments are conducted on all of them. The eight discrete emotions for classification are neutral, happiness, surprise, sadness, anger, disgust, fear, and contempt. Some previous datasets and studies used only seven of them, excluding contempt because it is difficult and rare in the real world. The details of each dataset are given below.

FER+ dataset. The FER+ dataset [19] is the first ITW dataset among them. The original version is the
FER2013 dataset [20] by Goodfellow et al., released for the ICML 2013 Workshop on Challenges in Representation Learning. But as the labeling accuracy of the FER2013 dataset is not reliable, Barsoum et al. reassigned the labels [19]. Ten people manually assigned the basic emotion for each image in the FER2013 dataset. A subset of the original images was excluded if they were classified as unknown or non-face. The final emotion label was assigned based on the voting of the ten people. The number of people voting for each emotion for each image was also given, which was then used to calculate the approximate distribution of the emotions over that image.

The dataset includes all the images, each of which contains one person's aligned face. The dataset images were collected from the Internet by querying many related expression keywords. There are many kinds of faces in the real-world environment, and their pose and rotation make them more challenging to recognize. The images were aligned and centered, and they were scaled slightly differently. All images are low-resolution and in grayscale with a size of 48 × 48 pixels. The corresponding label for each image is also given. The eight basic emotions are used in this dataset.

TABLE 3. Number of images in training/testing/validation subsets of the FER+, RAF-DB, and AffectNet datasets.

Table 3 and figure 4 show the distribution of the train, test, and validation sets of the FER+ dataset. The number of neutral images is the highest: 9,030 in the train set and 1,102 in the test set. The disgust emotion has the lowest number of images: only 107 in train and 15 in test. The contempt emotion has a similar number of images to disgust: only 115 in train and 13 in test. Disgust, contempt, and fear have few images compared with the other five emotions. This is normal in natural communication, where people are usually in a neutral or happy state and only rarely experience disgust, contempt, or fear. Figure 4 shows that the distributions of emotions in the training, testing, and validation sets of the FER+ are similar.

RAF-DB dataset. Shan Li, Weihong Deng, and Jun-Ping Du provided the Real-world Affective Faces Database (RAF-DB) for emotion recognition [11], [12]. The dataset contains about 30,000 images downloaded from the Internet. About 40 trained annotators carefully labeled the images. The dataset has two parts: the single-label subset (basic emotions) and the two-label subset (compound emotions). We used the single-label subset with seven classes of basic emotions. This subset has 12,271 samples in the training set and 3,068 in the test set. The number of samples for each emotion is given in table 3. Notably, the RAF-DB dataset does not include the contempt expression. Figure 1 shows that image sizes in the RAF-DB vary from tiny to large, which makes it difficult for the DL model to deal with.

AffectNet dataset. AffectNet [13] is the largest dataset for the FER task. The dataset contains more than one million images queried from the Internet by using related expression keywords. About 450,000 images were manually annotated by trained persons. It also includes train, validation, and test sets. The test set has not yet been published, so most previous studies used the validation set as the test set [13], [37]–[40]. Because the contempt emotion is rare in the natural world, some studies [40] used only seven emotions, while other studies [13], [38], [39] analyzed all eight emotions. Another study used both eight and seven expressions [37]. Therefore, to compare our results with the previous studies, we performed experiments with both eight classes and seven classes.
Table 3 shows the number of samples for each emotion class in each subset (train, validation, and test) of the FER+, RAF-DB, and AffectNet datasets. The names they use for the labels are a little different but can be mapped to the eight basic emotions, as in the emotion column. The FER+ has three separate subsets for training, validation, and testing, while the two others have only two subsets. The AffectNet dataset has not published its testing subset, so, as in most studies on this dataset, the validation set is taken as the testing subset, and the validation subset used during the training process is randomly selected from the training subset. Similarly for the RAF-DB, the training subset is randomly split to obtain the training and validation subsets. Only AffectNet exhibits a balanced validation set (used as the test set), while the FER+ and RAF-DB are highly unbalanced. Both the FER+ and AffectNet datasets have eight emotion labels, while the RAF-DB has only seven emotion classes, without the contempt expression.

Figure 5 gives some sample images for each class from the three datasets. In this figure, each column presents one emotion expression. The images in the first two rows (figure 5a) are from the FER+ dataset, figure 5b is from RAF-DB, and the rest (figure 5c) are from AffectNet. The last column of RAF-DB is empty because the RAF-DB dataset has seven emotions without the contempt expression.

FIGURE 4. The FER+ data distribution of train/test/valid.

FIGURE 5. Sample images from the (a) FER+, (b) RAF-DB and (c) AffectNet datasets.

We use the Adam optimization algorithm [43] with an adaptive learning rate using the One Cycle Policy suggested by Smith [44]. The learning rates were set to 1e-3 for some later layers of the network and 1e-4 for the STN block. The lower learning rate for the STN, with its transformation, aims to keep this block changing little.

The validation set is used to optimize the hyper-parameters, and then we collect the best models. The hyper-parameters for all our experiments include the learning rate and the number of epochs at which the network gets the best result. Those models are then used to evaluate the test set. We apply Test Time Augmentation at the test step: eight randomly rotated and zoomed images are generated from each image and then passed through the model to get the raw scores for prediction. The final raw score is the average of their outputs.
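A minimal sketch of this test-time augmentation step: eight randomly rotated and zoomed copies of a test image are scored, and the raw outputs are averaged. The rotation and zoom ranges are illustrative assumptions; the text does not state the exact values.

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def tta_predict(model, image, n_aug=8, max_rotation=10.0, zoom_range=(0.9, 1.1)):
    """Average the raw model scores over randomly rotated/zoomed copies.

    image: a single tensor of shape (C, H, W); returns a (num_classes,) score vector.
    """
    scores = []
    for _ in range(n_aug):
        angle = (torch.rand(1).item() * 2 - 1) * max_rotation
        scale = zoom_range[0] + torch.rand(1).item() * (zoom_range[1] - zoom_range[0])
        augmented = TF.affine(image, angle=angle, translate=[0, 0],
                              scale=scale, shear=[0.0, 0.0])
        scores.append(model(augmented.unsqueeze(0)).squeeze(0))
    return torch.stack(scores).mean(dim=0)
```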
For basic emotion recognition, several metrics are used to evaluate the results. The first and most widely used metric is accuracy, or weighted accuracy (WA), which is the number of correct answers divided by the total number of test samples. But when the number of samples per class is highly unbalanced, WA may reflect the performance poorly, particularly for the FER task, because emotions in the real world are usually unbalanced: some emotions such as neutral, happy, or sad are more common than disgust, fear, or contempt. In this case, unweighted accuracy (UA) should be considered for the additional evaluation of the system. The UA metric is an unbiased version of WA: it is calculated as the average of the per-class accuracies. For comparison with other studies, both WA and UA are adopted in the experiments.

All experiments were run on Ubuntu 18.04 with 32 GB of RAM and a GeForce RTX 2080 Ti GPU with 11 GB of GPU RAM.
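For reference, both metrics can be computed directly from the predicted and true labels, as in this small sketch (WA is the overall accuracy; UA is the mean of the per-class accuracies).

```python
import numpy as np

def weighted_and_unweighted_accuracy(y_true, y_pred, num_classes=8):
    """WA: correct predictions / total samples.  UA: mean of per-class accuracies."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    wa = float((y_true == y_pred).mean())
    per_class = [float((y_pred[y_true == c] == c).mean())
                 for c in range(num_classes) if np.any(y_true == c)]
    ua = float(np.mean(per_class))
    return wa, ua

# Example: three classes, unbalanced test set.
wa, ua = weighted_and_unweighted_accuracy([0, 0, 0, 1, 2], [0, 0, 1, 1, 0], num_classes=3)
print(round(wa, 3), round(ua, 3))  # 0.6 0.556 -- WA favors the frequent class
```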
FIGURE 6. The confusion matrix on the test set for the RAF-DB, FER+ and AffectNet datasets.
TABLE 4. RAF-DB accuracy comparison (%).

TABLE 5. FER+ accuracy comparison (%).
89.75%. Compared to the best previous result in the literature, by Albanie et al. [48], our approach improves by 0.65%. The average accuracy of our proposed architecture is 69.54% and the F1 score (macro) is 74.88%. The low accuracy on disgust and fear makes the F1 score and average accuracy far lower than the overall average. Future work should consider focusing on increasing the number of samples of disgust and fear to improve the accuracy for these two expressions.
FIGURE 7. Cumulative accuracy by size on the test set of RAF-DB dataset with the VGG16 (base-line) and the PSR architecture.
Figure 6c shows the confusion matrix on the test set for the PSR architecture: happiness has the highest accuracy of 96%, followed by neutral, surprise, and anger. All four expressions had an accuracy above 90%. The lowest accuracy was for contempt, at 23%. Due to the lack of contempt images, the model could not learn to distinguish it from neutral, anger, or sadness. Some emotions have a high likelihood of wrong classification: fear is predicted as surprise in 37% of cases, disgust is classified as anger in 33%, and sadness is classified as neutral in 22%. These high levels of confusion are typical in the real world because, even for humans, it can be difficult to distinguish these pairs of emotions.

TABLE 6. AffectNet accuracy comparison (%).

3) AffectNet DATASET
We compared both eight and seven classes on the AffectNet dataset. Table 6 shows the results in classification accuracy (WA). In the classification of eight emotions, our model achieved an accuracy of 60.68%, outperforming the current state of the art of 59.58% achieved by Georgescu et al. [37]. In the seven-emotion task, our model achieved an accuracy of 63.77%, a slight improvement over the current highest result of 63.31% [37]. Figures 6b and 6d present the confusion matrices for AffectNet in the seven-class and eight-class tasks, respectively. The happy expression has the highest detection rate in both cases, followed by the fear emotion. Surprise, anger, and disgust have similar performance in both cases. In the eight-expression task, contempt has the lowest performance, at just 49%.

Figure 7 shows the cumulative accuracy according to the size of the original image for the base-line network and the PSR architecture. The PSR was run with the three branches [1, 2, 1] and the cutting point at the sixth convolution layer, with the original input size of 100 pixels. The image size ranged from 23 pixels to about 1200 pixels. Because the large images were resized to a fixed size of 100 pixels, we consider only those images smaller than 100 pixels to see how our approach is affected. We omitted the first twenty points because they are too unstable for calculating the accuracy. The figure shows that initially, with tiny image sizes of less than 40 pixels, both the base-line and the PSR are unstable. But after 40 pixels, the PSR architecture improves and works better than the base-line network. The PSR maintains this trend to the end of the dataset because, in our approach, we added the super-resolution module that doubles the size of a small image in one of the three branches, and another branch for the half size 100/2 = 50 improved the recognition accuracy.

Figure 8 shows the accuracy discrepancy by size between the PSR and VGG16 on the test set of the RAF-DB dataset. The blue points are raw values, and the yellow ones are the smoothed version. The accuracy discrepancy represents the speed of the improvement of the PSR over the baseline network. It is clear that the improvement had the highest speed when the
FIGURE 9. The accuracy by original image size on each branch of the PSR
without the STN block on the RAF-DB test-set.
FIGURE 8. The discrepancy of accuracy by size on the test set of the
RAF-DB dataset between PSR and baseline.
the step of the scale-up from the lowest resolution. The pyramid architecture views the input at several scales, but the step is an integer larger than one, and 2 is the starting value. However, the double scale is still a tremendous jump. While a scale of 1.2 is a good value for most augmentation techniques, and 1/1.2 (≈ 0.83) in the reverse case, we suggest that the scale step should be 1.2² = 1.44, or approximately 1.5. For a traditional algorithm, a decimal scaling value is possible, but it cannot be used with the DL approaches. The second weakness is the baseline network architecture. Although several network architectures more reliable than VGG16, such as ResNet [49] and SENet [50], have been reported, we chose VGG16 as the base network. Although our approach is general and we can apply many kinds of CNN, a re-implementation is needed for each base network. Our approach is not a simple module, so extra effort must be taken to implement it case by case. Exploring other baseline architectures is left for future work.

VI. CONCLUSION
In this study, we addressed the varying-image-size problem in the FER task for ITW datasets, where the original input image size varies. Although CNNs can work on images with a small rotation and scale change, they fail when the scale change is enormous. The main contribution of this study is the development of a pyramid network architecture with several branches, each of which works on one level of input scale. The proposed network is based on the VGG16 model, but it can be extended to other baseline network architectures. In the PSR architecture, the SR method is applied for up-scaling the low-resolution input. Experiments on three ITW FER datasets show that our proposed method outperforms all the current state-of-the-art methods.
REFERENCES
[1] A. Mehrabian, Nonverbal Communication. New Brunswick, NJ, USA: Aldine Transaction, 1972.
[2] P. Ekman, "Are there basic emotions?" Psychol. Rev., vol. 99, no. 3, pp. 550–553, 1992.
[3] P. Ekman, "Basic emotions," in Handbook of Cognition and Emotion. New York, NY, USA: Wiley, 1999, pp. 45–60.
[4] J. A. Russell, "A circumplex model of affect," J. Personality Social Psychol., vol. 39, no. 6, pp. 1161–1178, 1980.
[5] P. Ekman and W. Friesen, Facial Action Coding System, vol. 1. Mountain View, CA, USA: Consulting Psychologists Press, 1978.
[6] C. Shan, S. Gong, and P. W. McOwan, "Facial expression recognition based on local binary patterns: A comprehensive study," Image Vis. Comput., vol. 27, no. 6, pp. 803–816, May 2009.
[7] L. Ma and K. Khorasani, "Facial expression recognition using constructive feedforward neural networks," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 3, pp. 1588–1595, Jun. 2004.
[8] J. J. Lien, T. Kanade, J. F. Cohn, and C.-C. Li, "Automated facial expression recognition based on FACS action units," in Proc. 3rd IEEE Int. Conf. Autom. Face Gesture Recognit., 1998, pp. 390–395.
[9] T. Zhang, W. Zheng, Z. Cui, Y. Zong, J. Yan, and K. Yan, "A deep neural network-driven feature learning method for multi-view facial expression recognition," IEEE Trans. Multimedia, vol. 18, no. 12, pp. 2528–2536, Dec. 2016.
[10] P. S. Aleksic and A. K. Katsaggelos, "Automatic facial expression recognition using facial animation parameters and multistream HMMs," IEEE Trans. Inf. Forensics Security, vol. 1, no. 1, pp. 3–11, Mar. 2006.
[11] S. Li and W. Deng, "Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition," IEEE Trans. Image Process., vol. 28, no. 1, pp. 356–370, Jan. 2019.
[12] S. Li, W. Deng, and J. P. Du, "Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild," in Proc. 30th IEEE Conf. Comput. Vis. Pattern Recognit., Oct. 2017, pp. 2584–2593.
[13] A. Mollahosseini, B. Hasani, and M. H. Mahoor, "AffectNet: A database for facial expression, valence, and arousal computing in the wild," IEEE Trans. Affect. Comput., vol. 10, no. 1, pp. 18–31, Jan. 2019.
[14] P. Liu, S. Han, Z. Meng, and Y. Tong, "Facial expression recognition via a boosted deep belief network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1805–1812.
[15] K. Liu, M. Zhang, and Z. Pan, "Facial expression recognition with CNN ensemble," in Proc. Int. Conf. Cyberworlds (CW), Sep. 2016, pp. 163–166.
[16] C. Huang, "Combining convolutional neural networks for emotion recognition," in Proc. IEEE MIT Undergraduate Res. Technol. Conf. (URTC), Nov. 2017, pp. 1–4.
[17] N. Zeng, H. Zhang, B. Song, W. Liu, Y. Li, and A. M. Dobaie, "Facial expression recognition via learning deep sparse autoencoders," Neurocomputing, vol. 273, pp. 643–649, Jan. 2018.
[18] D. C. Tozadore, C. M. Ranieri, G. V. Nardari, R. A. F. Romero, and V. C. Guizilini, "Effects of emotion grouping for recognition in human-robot interactions," in Proc. 7th Brazilian Conf. Intell. Syst. (BRACIS), Oct. 2018, pp. 438–443.
[19] E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang, "Training deep networks for facial expression recognition with crowd-sourced label distribution," in Proc. 18th ACM Int. Conf. Multimodal Interact., 2016, pp. 279–283.
[20] I. J. Goodfellow, "Challenges in representation learning: A report on three machine learning contests," Neural Netw., vol. 64, pp. 59–63, Apr. 2015.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[22] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[23] C. Dong, "Learning a deep convolutional network for image super-resolution," in Computer Vision, vol. 8692, D. Fleet, Ed. Cham, Switzerland: Springer, 2014, pp. 184–199.
[24] J. Kim, J. K. Lee, and K. M. Lee, "Accurate image super-resolution using very deep convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1646–1654.
[25] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1874–1883.
[26] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4681–4690.
[27] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, "Enhanced deep residual networks for single image super-resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 1132–1140.
[28] Y. Hu, X. Gao, J. Li, Y. Huang, and H. Wang, "Single image super-resolution via cascaded multi-scale cross network," 2018, arXiv:1802.08808. [Online]. Available: http://arxiv.org/abs/1802.08808
[29] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2017–2025.
[30] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, "Deformable convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 764–773.
[31] M. Hu, H. Wang, X. Wang, J. Yang, and R. Wang, "Video facial emotion recognition based on local enhanced motion history image and CNN-CTSLSTM networks," J. Vis. Commun. Image Represent., vol. 59, pp. 176–185, Feb. 2019.
[32] S. Li, W. Zheng, Y. Zong, C. Lu, C. Tang, X. Jiang, J. Liu, and W. Xia, "Bi-modality fusion for emotion recognition in the wild," in Proc. Int. Conf. Multimodal Interact., Oct. 2019, pp. 589–594.
[33] A. Sepas-Moghaddam, A. Etemad, F. Pereira, and P. L. Correia, "Facial emotion recognition using light field images with deep attention-based bidirectional LSTM," in Proc. ICASSP - IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 3367–3371.
[34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[35] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8697–8710.
[36] R. Müller, S. Kornblith, and G. Hinton, "When does label smoothing help?" in Proc. NIPS, 2019, pp. 4694–4703.
[37] M.-I. Georgescu, R. T. Ionescu, and M. Popescu, "Local learning with deep and handcrafted features for facial expression recognition," IEEE Access, vol. 7, pp. 64827–64836, 2018.
[38] J. Zeng, S. Shan, and X. Chen, "Facial expression recognition with inconsistently annotated datasets," in Proc. Eur. Conf. Comput. Vis., vol. 11217, 2018, pp. 227–243.
[39] K. Wang, X. Peng, J. Yang, D. Meng, and Y. Qiao, "Region attention networks for pose and occlusion robust facial expression recognition," IEEE Trans. Image Process., vol. 29, pp. 4057–4069, 2020.
[40] W. Hua, F. Dai, L. Huang, J. Xiong, and G. Gui, "HERO: Human emotions recognition for realizing intelligent Internet of Things," IEEE Access, vol. 7, pp. 24321–24332, 2019.
[41] J. Howard. (2018). Fastai. [Online]. Available: https://github.com/fastai/fastai
[42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in Proc. NIPS, 2017, pp. 1–4.
[43] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980
[44] L. N. Smith, "Cyclical learning rates for training neural networks," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2017, pp. 464–472.
[45] J. Cai, Z. Meng, A. S. Khan, Z. Li, J. O'Reilly, and Y. Tong, "Probabilistic attribute tree in convolutional neural networks for facial expression recognition," 2018, arXiv:1812.07067. [Online]. Available: http://arxiv.org/abs/1812.07067
[46] Y. Fan, J. C. Lam, and V. O. Li, "Multi-region ensemble convolutional neural network for facial expression recognition," in Artificial Neural Networks and Machine Learning (Lecture Notes in Computer Science), vol. 11139. Berlin, Germany: Springer, 2018, pp. 84–94.
[47] F. Lin, R. Hong, W. Zhou, and H. Li, "Facial expression recognition with data augmentation and compact feature learning," in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 1957–1961.
[48] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman, "Emotion recognition in speech using cross-modal transfer in the wild," in Proc. ACM Multimedia Conf., 2018, pp. 292–301.
[49] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[50] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-excitation networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 8, pp. 2011–2023, Aug. 2020.

THANH-HUNG VO received the B.Eng. degree from the Ho Chi Minh City University of Technology, in 2010, and the M.Eng. degree from Vietnam National University, Vietnam, in 2013, all in computer science. He is currently pursuing the Ph.D. degree with the Pattern Recognition Laboratory, School of Electronics and Computer Engineering, Chonnam National University, South Korea. Since 2011, he has been working as a Lecturer with the Ho Chi Minh City University of Technology. His research interests include natural language processing, speech, and computer vision applying machine learning, and deep learning techniques.

GUEE-SANG LEE (Member, IEEE) received the B.S. degree in electrical engineering and the M.S. degree in computer engineering from Seoul National University, South Korea, in 1980 and 1982, respectively, and the Ph.D. degree in computer science from Pennsylvania State University, in 1991. He is currently a Professor with the Department of Electronics and Computer Engineering, Chonnam National University, South Korea. His primary research interests include image processing, computer vision, and video technology.

HYUNG-JEONG YANG (Member, IEEE) received the B.S., M.S., and Ph.D. degrees from Chonbuk National University, South Korea. She is currently a Professor with the Department of Electronics and Computer Engineering, Chonnam National University, Gwangju, South Korea. Her main research interests include multimedia data mining, medical data analysis, social network service data mining, and video data understanding.

SOO-HYUNG KIM (Member, IEEE) received the B.S. degree in computer engineering from Seoul National University, in 1986, and the M.S. and Ph.D. degrees in computer science from the Korea Advanced Institute of Science and Technology, in 1988 and 1993, respectively. Since 1997, he has been a Professor with the School of Electronics and Computer Engineering, Chonnam National University, South Korea. His research interests include pattern recognition, document image processing, medical image processing, and deep learning applications.