
Received July 1, 2020, accepted July 9, 2020, date of publication July 17, 2020, date of current version July 29, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3010018

Pyramid With Super Resolution for In-The-Wild Facial Expression Recognition

THANH-HUNG VO, GUEE-SANG LEE (Member, IEEE), HYUNG-JEONG YANG (Member, IEEE), AND SOO-HYUNG KIM (Member, IEEE)
Department of Electronics and Computer Engineering, Chonnam National University, Gwangju 61186, South Korea
Corresponding author: Soo-Hyung Kim ([email protected])
This work was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Education through the Basic Science Research Program under Grant NRF-2018R1D1A3A03000947 and Grant NRF-2020R1A4A1019191.
The associate editor coordinating the review of this manuscript and approving it for publication was Waleed Alsabhan.

ABSTRACT Facial Expression Recognition (FER) is a challenging task that improves natural human-computer interaction. This paper focuses on automatic FER on a single in-the-wild (ITW) image. ITW images suffer from real-world problems of pose, direction, and input resolution. In this study, we propose a pyramid with super-resolution (PSR) network architecture to solve the ITW FER task. We also introduce a prior distribution label smoothing (PDLS) loss function that exploits additional prior knowledge of the confusion between expressions in the FER task. Experiments on the three most popular ITW FER datasets show that our approach outperforms all the state-of-the-art methods.

INDEX TERMS Emotion recognition, image resolution, human computer interaction.

I. INTRODUCTION
Non-verbal communication plays an essential role in person-to-person communication. These non-verbal signals can add clues, additional information, and meaning to spoken (verbal) communication. Some studies estimate that around 60% to 80% of communication is non-verbal [1]. These signals include facial expressions, eye contact, voice tone and pitch, gestures, and physical distance, of which facial expression is the most popular input for analysis. The facial expression recognition (FER) task aims to recognize the emotion from the facial image.
In psychology and computer vision, emotion can be divided into two kinds of models: discrete and dimensional continuous [2]–[4]. The dimensional continuous model focuses on arousal and valence, whose values range from -1.0 to 1.0, whereas the discrete emotion theory classifies a few core emotions such as happy, sad, angry, neutral, surprise, disgust, fear and contempt. In our study, we attempted discrete emotion recognition.
Ekman and Friesen developed the Facial Action Coding System (FACS) to analyze human facial movements [5]. However, this scheme needs trained humans and is extremely time consuming. The recent advances of successful machine learning in computer vision could help simplify and automate those processes. The scope of our study is automatic facial expression recognition, where the emotional expression follows the discrete model.
Many studies use traditional image processing and machine learning for the FER task. Shan et al. used local statistical features, termed Local Binary Patterns, for person-independent facial expression recognition [6]. Ma and Khorasani used a one-hidden-layer feedforward neural network on a two-dimensional discrete cosine transform [7]. Lien et al. combined facial feature point tracking, dense flow tracking, and gradient component detection to detect FACS action units and infer the emotion [8]. In [9], Zhang et al. extracted scale-invariant feature transform features and used a deep neural network (DNN) as the classifier. Aleksic and Katsaggelos used hidden Markov models for automatic FER [10].
Recently, deep learning (DL) has significantly affected many fields, such as image, voice, and natural language processing. In the Boosted Deep Belief Network [14] introduced by Liu et al., multiple deep belief networks learned feature representations from patches of the image, and some of them were selected for boosting. In [15], Liu et al. ensembled three convolutional neural network (CNN) subnets and concatenated the outputs to predict the final results. Huang [16] used a custom residual block of the ResNet architecture

and late fusion to combine the results from the VGG and the ResNet models. Zeng et al. extracted image histograms of oriented gradients and passed them through deep sparse autoencoders to classify them [17]. Tozadore et al. grouped emotions into several groups to help the CNN classify with better accuracy [18].
Despite these successes on in-the-lab datasets, the rise of in-the-wild (ITW) datasets in recent years has raised new challenges for researchers. Whereas in-the-lab datasets are collected under controlled conditions and the data are clean, accurate, and uniform, ITW datasets are noisy, inaccurate, and highly variable. We outline the following two observations about ITW datasets for the FER task.
Observation 1: The image sizes of ITW datasets vary. While the size of in-the-lab dataset images is controlled and nearly constant, ITW dataset images have various sizes, from very small to large. Figure 1 shows the image size distribution of the RAF-DB [11], [12] (Fig. 1a) and the AffectNet [13] dataset (Fig. 1b). These two selected datasets are the most popular ITW datasets for the FER task. Because of the differences in width and height, the average of the two is considered as the size of the image. In both datasets, small images occur more frequently, and this frequency decreases with increasing size. The mean and variance of the image size in the RAF-DB are 193 and 144, which is rather large. The AffectNet dataset has larger image sizes, ranging from 130 pixels to more than 2000 pixels. In the graph, we round all images larger than 2000 pixels to the fixed value of 1000 pixels. Similar to the RAF-DB dataset, the number of images decreases as the image size increases. The third most popular ITW dataset for the FER task is the FER+ dataset [19], extended from FER2013 [20]. It also faces the different-image-size problem; unfortunately, the original image size information was omitted when the authors of the FER+ data published it. Most of the studies in this field do not consider the image-size problem. They simply resize all images to the same size, e.g. 128 × 128 or 224 × 224. The first reason is the DL framework itself, because in batch mode, each batch must have the same input shape. Implementing different input sizes at the same time takes more effort and is complicated and computationally inefficient. While the CNN architecture has been successful for many image classification tasks, it is based on the assumption that, despite the resizing of the images, the network can learn to distinguish them by itself. Nearest-neighbor interpolation, bilinear, and bicubic algorithms are popular techniques for scaling image sizes.

FIGURE 1. The image size distribution of the RAF-DB [11], [12] and AffectNet [13] datasets.
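As an illustration of how the statistics behind Figure 1 can be gathered, the short sketch below computes the per-image size used above (the average of width and height) for a folder of images and clips very large values; the folder path, file extension, and thresholds are placeholders for illustration, not part of the released datasets.

```python
# Sketch: image-size statistic (average of width and height per image),
# assuming a plain folder of JPEG files; the path below is hypothetical.
from pathlib import Path
from PIL import Image
import numpy as np

def size_statistics(image_dir, clip_at=2000, round_to=1000):
    sizes = []
    for path in Path(image_dir).glob("*.jpg"):
        with Image.open(path) as img:
            w, h = img.size
        s = (w + h) / 2.0                               # average of width and height
        sizes.append(round_to if s > clip_at else s)    # round very large images down
    sizes = np.array(sizes)
    return sizes.mean(), sizes.std(), sizes

mean_size, std_size, sizes = size_statistics("RAF-DB/train")  # hypothetical path
print(f"mean={mean_size:.1f}, std={std_size:.1f}")
```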
Observation 2: CNNs are usually sensitive to the input image size. While the CNN has been very successful for many tasks related to image classification and segmentation, this architecture suffers from several weaknesses. One of them is the sensitivity to the size of the input image. Zooming is one of the data augmentation techniques that attempts to address this problem. The zooming scale selected in most experiments ranges from 0.9 to 1.2, because values outside this range degrade and damage the network. With global pooling, CNN networks can support different input sizes, and the size-incremental technique has been used to train networks more quickly and make them converge more easily. Despite the improvement offered by this process, the network remains sensitive to the input size. Therefore, a network trained with one input size works poorly on the same images at a different scale. Figure 2 shows the training and validation loss for VGG16 when training on the RAF-DB and the FER+ at different scales: 50 × 50, 100 × 100, 150 × 150 and back to 50 × 50 again for RAF-DB, and 48 × 48, 96 × 96, 192 × 192 and again 48 × 48 for the FER+, changing every 20 epochs in sequence. We transfer weights from ImageNet [21], and then we freeze the whole CNN architecture except the fully connected layers. The frozen phase was trained for 20 epochs at the smallest input image size. At the points where the image size changes (epochs 41, 61, 81), the loss on both the training and validation sets increases significantly. At epoch 81, although the input size returns to the 48 × 48 size that was used to train the network before, the loss value still increases because of the characteristics of convolution. The convolution layer uses a kernel (of size 3×3, 5×5, or similar) to scan the ''pixels'' of the previous layer. Thus, even though the image is the same but at a different scale, the next convolution


layers learn very different features; therefore, increasing the kernel size does not help here.

FIGURE 2. The loss value for the training and validation during the training process as the input size changed for the RAF-DB and the FER+ (VGG16 architecture [22]).

While the super-resolution (SR) step is currently part of the input pre-processing, it could be a part of the DL architecture. SR approaches may be better than older algorithms such as nearest-neighbor interpolation, bilinear, and bicubic for solving the small-image-size problem. The SR task makes a larger image from a low-resolution image while trying to fill in the lost pixels and avoid blurring. From a low-resolution image, e.g. of size W × H, the SR task produces a larger image kW × kH, where k ≥ 2, with the aim of making the new image as clear as possible. While down-scaling an image from high resolution to low resolution is an easy task, the reverse direction is not: the pixels that are missing in the low-resolution image need to be recovered. Some recent research has focused on this problem. Dong et al. introduced the Super-Resolution Convolutional Neural Network (SRCNN), a deep CNN model that works on low-resolution and high-resolution feature maps and finally generates a high-resolution image [23]. The SRCNN is lightweight and outperforms bicubic interpolation. Very Deep Super Resolution (VDSR) has a structure similar to the SRCNN but is deeper [24]. In [25], Shi et al. built the efficient sub-pixel convolutional neural network (ESPCN), which outperforms the SRCNN. ESPCN improves on SRCNN by processing the feature maps at low resolution and upsampling to the final image. Ledig et al. used residual blocks to build SRResNet in [26]. Lim et al. proposed the enhanced deep super-resolution network (EDSR) [27]. The EDSR is a modified version of the SRResNet that removes all batch normalization layers to reduce computation by 40% while improving the efficiency. They also designed a multi-scale network from the base block, with good results. Hu et al. published a Cascaded Multi-Scale Cross network that includes a sequence of cascaded sub-networks [28]. In recent years, the networks for SR have deepened and the accuracy has improved further. While the SRCNN is lightweight but has low accuracy, the EDSR needs more computation but generates better results.
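To make the sub-pixel idea concrete, the following is a minimal PyTorch sketch of a ×2 up-scaling step built from a convolution followed by pixel shuffling, in the spirit of ESPCN/EDSR [25], [27]; it is an illustration only, not the exact EDSR implementation used later in this paper, and the channel count is an assumed value.

```python
import torch
import torch.nn as nn

class UpscaleX2(nn.Module):
    """Minimal x2 sub-pixel upsampler: the convolution expands the channels,
    and PixelShuffle rearranges them into twice the spatial resolution."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, x):
        return self.shuffle(self.conv(x))

x = torch.randn(1, 64, 25, 25)      # low-resolution feature map
print(UpscaleX2()(x).shape)         # torch.Size([1, 64, 50, 50])
```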
Our study has two main contributions. Firstly, we propose a Pyramid with Super-Resolution (PSR) network architecture to deal with the different-image-size problem in the ITW FER task. Our approach views an image at several scales and uses SR for up-scaling. With the many small images in real-world FER datasets, the SR improves the network performance. Viewing the image at many scales also helps the network learn not only from a small local view but also from the global view of the input. Secondly, we discuss the loss function and apply to the FER task a loss in which the distribution of confusion between labels is known and can be used.
The rest of this paper is organized as follows. We explain our proposed methods in section II and introduce the prior distribution label smoothing (PDLS) loss function in section III. Dataset information is presented in section IV. Section V describes the experimental results and discussion. Finally, we conclude our study in section VI.

II. PYRAMID WITH SUPER-RESOLUTION (PSR) NETWORK
We deal with the varying image-size problem by using a pyramid architecture, which we term the PSR network. Figure 3 shows the overall PSR network architecture. There are six blocks in our approach: the spatial transformer network (STN), scaling, low-level feature extractor, high-level feature extractor, fully connected, and final concatenation blocks. The STN simulates an affine transformation of a 2D image and is used for face alignment. The scaling block is the main block, the fundamental idea of our approach; the details of this block are explained in the next subsection. After the scaling block, there are several internal outputs, each of which is one image of the original input, but at a different scale, and hence of a different size. The low- and high-level feature extractors are two usual parts of most CNNs. The fully connected block includes several fully connected layers and dropout layers. Finally, we combine all branch outputs with a late fusion technique.

FIGURE 3. Overall network architecture.

A. THE SPATIAL TRANSFORMER NETWORK (STN) BLOCK
The STN was introduced by Jaderberg et al. [29] and Dai et al. [30]. The main idea of the STN is to align the input by learning the transformation. This block is comprised of three parts: a localization net, a grid generator, and a sampler [29]. The localization net has several convolution layers and, finally, a fully connected layer that outputs θ, where θ is a matrix


of size 2 × 3, a representation of an affine transform of a 2D image. The grid generator then accepts θ and makes a grid, and finally, the sampler uses this grid to generate the output image. The output image is obtained from the input image via rotation, scaling, and translation operators. The input and output of this block are images of the same size and with the same number of channels.
Different from in-the-lab images, ITW images vary greatly in head pose and direction. We add the STN block to help the network learn to align the face and make it easier to recognize.
Our implementation details follow the previously published paper [29]. Table 1 shows the details of the internal layers of this block. For the convolution layers, the parameters are the input channels, output channels, kernel size, and stride. The kernel size and stride are needed for the maxpool2d layer. For the linear layer, only two parameters are needed: the number of input nodes and the number of output nodes. After the localization, the feature map is flattened and passed through the fully connected part. Our algorithm calculates the size of the feature map dynamically based on the input size, so the block is adaptive to different input image sizes.

TABLE 1. The details of the STN block.
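A compact PyTorch sketch of such an STN block is shown below. The localization layers here are simplified placeholders rather than the exact configuration of Table 1 (an adaptive pooling layer stands in for the dynamic feature-map sizing), and θ is initialized to the identity transform, as described later in the experimental setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Sketch of a spatial transformer: a small localization net regresses a
    2x3 affine matrix theta, applied with affine_grid / grid_sample."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(in_ch, 8, 7), nn.MaxPool2d(2), nn.ReLU(True),
            nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(True),
            nn.AdaptiveAvgPool2d(4),          # makes the head independent of input size
        )
        self.fc = nn.Sequential(nn.Linear(10 * 4 * 4, 32), nn.ReLU(True),
                                nn.Linear(32, 6))
        # start from the identity transform
        self.fc[-1].weight.data.zero_()
        self.fc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.fc(torch.flatten(self.loc(x), 1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)   # same size and channels as input
```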
B. THE SCALING BLOCK
The scaling block is the leading block in our architecture. The main idea of this block is to view the input image at different scales, from small to large. To that end, super-resolution is used to upscale the image size. As in many CNNs, to ensure efficiency of memory and computation, the input images are kept at the same size. And to use the best information from the input images, they are passed to the network at the largest attainable size. The input size may be limited by the computational budget and depends on each dataset. When passing same-size images, as noted in the first observation, many of them are in low resolution and have been up-scaled with some traditional algorithm. In contrast, our approach down-scales them and then up-scales them again using the SR technique. This block views the overall context in the low-resolution images, along with the high-resolution image, to consider the original features.
In the scaling block, the network branches into three or more sub-networks. All sub-networks work with the same input image but at a different scale. The last branch receives the original input images, which have the highest resolution for the network. Due to the computational limit, most studies in the field of image classification use input images from 100 up to a maximum of 312 pixels; for larger input sizes, the higher resolution does not improve the performance. For batch mode, all images are resized to a central size before being passed through the network: larger images are down-scaled, and smaller images need to be up-scaled. We call the original input size W × H. This input scaling is done with a traditional algorithm, such as nearest-neighbor interpolation, bilinear, or bicubic. While down-scaling an image is safe, up-scaling small images to a larger size with a traditional algorithm is complicated and inaccurate. Our approach aims to overcome this issue. The first branch is applied to the lowest-resolution image, which is down-scaled from the original input by a simple operator, implemented with mean pooling. We declare the values step and kstep, where step is the scale value between two neighboring branches. By the limits of DL, step is set to 2. A large kstep can be used, but due to the computational limitation, we restrict kstep to only 1 or 2. The size of the image for the first branch is

\frac{W}{2^{kstep}} \times \frac{H}{2^{kstep}}
Between the first and the last branch, there are kstep SR branches, each of which is an SR block with scale 2, 4, 8, . . . applied to the lowest-resolution image from the first branch. The size of the i-th SR branch is given by equation 1.

\frac{W}{2^{kstep-i}} \times \frac{H}{2^{kstep-i}}   (1)

In the case kstep = 1, there is only one SR branch in the scaling block, and its output size is the same as the original input size. In the case kstep = 2, there are two SR branches, which have the sizes [W/2, H/2] and [W, H]. Our setup always ensures that the last SR branch has the same size as the original input. For the SR task, we use the EDSR architecture introduced by Lim et al. [27].
By learning how to resample the image size, we assume that this block can add useful information for this particular task, and thereby increase the accuracy of the prediction model.
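The sketch below illustrates, under the notation above, how the branch input sizes follow from equation 1 and how the first branch can be obtained by simple mean pooling; the helper names are ours, not from any released code.

```python
import torch.nn.functional as F

def branch_sizes(W, H, kstep):
    """Lowest-resolution branch plus kstep SR branches (scales 2, 4, ... as in Eq. 1)."""
    sizes = [(W // 2 ** kstep, H // 2 ** kstep)]               # first (smallest) branch
    sizes += [(W // 2 ** (kstep - i), H // 2 ** (kstep - i))   # i-th SR branch
              for i in range(1, kstep + 1)]
    return sizes

print(branch_sizes(100, 100, kstep=2))   # [(25, 25), (50, 50), (100, 100)]

def lowest_resolution(x, kstep):
    """Down-scale the original input (N, C, H, W) by mean pooling, as for the first branch."""
    return F.avg_pool2d(x, kernel_size=2 ** kstep)
```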

C. LOW AND HIGH-LEVEL FEATURE EXTRACTOR
Typically, the low- and high-level feature extractors are combined in a base network architecture. We choose VGG16 [22] as the base network because this network is still used as the base of many recent networks for the FER task [31]–[33]. We separate the base network, VGG16 [22], into two parts for two levels of the input. The low-level feature extractor receives the images as input and generates the corresponding feature map. This block works at a low level of features, e.g., edges, corners, and so on. The high-level feature extractor receives the feature map from the low-level part and produces deeper, high-level features for the input.
While the input is passed through both extractors in this order, we separate them into two parts so they can be shared across branches. As in the second observation, we know that CNNs are very sensitive to the input size, and here, each branch has a different input size. The low-level features for each branch are quite different and cannot be shared, because sharing the low-level layers damages the network. The high-level feature block is in a different situation: at this level, a high-level feature needs to be learned and is less dependent on the size of the input. Therefore, the weights of this block can be shared across branches. The shared weights also act in a similar way to multi-task learning, where the combination helps each task obtain better results.
The position of the cutting point is denoted pos, which is the index of the convolution layer in the base network at which we separate the two parts. A lower pos value means that all branches share the weights in most of the internal layers, while the highest value of pos separates all branches. From the second observation, we assume that a low pos value degrades the network. Since the base network is VGG16, which has 12 convolution layers, the cutting position pos should be in 0–12, the position of the corresponding convolution layer. We analyze the effect of the cutting point (the pos value) in the experiments.

D. FULLY CONNECTED BLOCK AND CONCATENATION BLOCK
The fully connected block includes two fully connected layers (Linear, FC) and several additional layers. The output feature from the high-level block passes through this block to obtain the vector representing the score for each label. Depending on the experiment, we use either seven or eight emotions, and the output vector size is set to seven or eight accordingly. We also use BatchNorm1d on the last feature map, and two dropout layers with p values of 0.25 and 0.5 for the first and the second FC layers, respectively. The ReLU activation function is applied after the first FC layer. Similar to the high-level feature extractor block, the fully connected block is also shared among branches.
All branches are fused with a weighted late fusion strategy. The weight of each branch is determined according to its contribution to the final score of the whole network.
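As a sketch of the weighted late fusion, the snippet below combines the per-branch score vectors with fixed branch weights; the weight values shown are placeholders standing in for the contribution-based weights mentioned above.

```python
import torch

def late_fusion(branch_scores, weights):
    """Weighted late fusion of raw score vectors, one tensor (batch, num_classes)
    per branch; returns the weighted average of the branch scores."""
    fused = sum(w * s for w, s in zip(weights, branch_scores))
    return fused / sum(weights)

scores = [torch.randn(4, 8) for _ in range(3)]          # three branches, eight emotions
fused = late_fusion(scores, weights=[1.0, 2.0, 1.0])    # hypothetical branch weights
pred = fused.argmax(dim=1)
```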
III. THE PRIOR DISTRIBUTION LABEL SMOOTHING (PDLS) LOSS FUNCTION
FER for basic emotions is a classification problem in which each input image is classified into one of seven or eight classes. Softmax cross-entropy is the most popular loss function for classification tasks. The cross entropy (CE) loss function is given in equation 2.

CE = -\sum_{c \in C} t_c \cdot \log(\sigma(z_c))   (2)

where:
• CE: cross entropy
• C: set of classes (labels)
• t_c: the distribution value of the label c in the ground truth, where \sum_{c \in C} t_c = 1
• σ(z_c): softmax function applied to z_c
• z_c: raw score for class c from the model

In the real world, it is difficult to obtain the ground-truth distribution over the labels for each sample; therefore, the all-in-one assumption is used in most cases. In the ideal case, the sample belongs to one and only one class; therefore, the one-hot vector is widely used for labeling in classification tasks, so that equation 2 reduces to the simple case of −log(σ(z_k)), where t_c = 0 for all c ∈ C except the correct label k (t_k = 1). Then, all terms except that of the label k are omitted.


The Label Smoothing (LS) loss function has been introduced in other studies [34], [35], and [36]. The formula for LS is given as equation 3. The main idea here is the contribution of all the incorrect labels. The parameter α is set around 0.9, meaning that the contribution of the other labels is very small; e.g., for the FER task with |C| = 8, the weight for each of them is 0.1/8 ≈ 0.0125 and for the correct label it is 0.9125. Although the weight of the incorrect labels is small, LS has been used successfully in many classification tasks. The advantage of LS over CE with one-hot labels is that all label scores predicted by the model are activated. The backpropagation process can then learn not only how to increase the score for the correct label but also how to decrease the scores of the incorrect ones.

LS = -\log(\sigma(z_k)) \cdot \alpha + \sum_{c \in C} -\log(\sigma(z_c)) \cdot \frac{1 - \alpha}{|C|}   (3)

where:
• |C|: size of the label set
• α: parameter controlling the weight of each part

In the LS loss function, all labels except the correct one are treated equally, i.e., they have a small and equal role. LS can be used in many tasks when there is no information about the distribution. However, in many tasks like FER, for a particular correct label, the confusion with the other classes is not uniform. The FER task has two advantages: the number of labels is small, just seven or eight, and, more importantly, we know that for a particular label, the confusion with some specific classes is higher than with others. For example, the correct label fear is much more likely to be confused with surprise than with disgust. Another example is a disgust face, which can more easily be mistaken for neutral or sadness than for anger or fear. If we have this prior knowledge, the smoothing part should not be a uniform distribution. So, we propose an extended version of LS with additional prior knowledge of the label confusion, called PDLS. The PDLS loss function is composed of two parts, the one-hot part and the prior distribution, as shown in equation 4.

PDLS = -\sum_{c \in C} (t_c \cdot \alpha + d_{kc} \cdot (1 - \alpha)) \cdot \log(\sigma(z_c))   (4)

where:
• α is a parameter controlling the weight of the one-hot part and the distribution.
• d_{kc} is the prior distribution for the correct label k and the confusion label c.

All notations in equation 4 are similar to those in equations 2 and 3. The d_{kc} value is the new operand in this formula, and it replaces the uniform distribution 1/|C| of the LS loss function. The d matrix has the following properties:

size = |C| × |C|
\sum_{c \in C} d_{kc} = 1, ∀k ∈ C
argmax(d_{k1}, d_{k2}, ..., d_{k|C|}) = k, ∀k ∈ C
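A minimal PyTorch sketch of the PDLS loss of equation 4 is given below; it assumes the prior matrix d (one row per correct label, rows summing to 1) has already been computed, and it reduces to LS when d is uniform and to one-hot CE when α = 1.

```python
import torch
import torch.nn.functional as F

def pdls_loss(logits, target, d, alpha=0.9):
    """Prior Distribution Label Smoothing (Eq. 4).
    logits: (batch, |C|) raw scores z_c; target: (batch,) correct labels k;
    d: (|C|, |C|) prior confusion matrix, row k = d_k."""
    log_probs = F.log_softmax(logits, dim=1)                  # log(softmax(z_c))
    one_hot = F.one_hot(target, num_classes=logits.size(1)).float()
    weights = alpha * one_hot + (1.0 - alpha) * d[target]     # t_c*alpha + d_kc*(1-alpha)
    return -(weights * log_probs).sum(dim=1).mean()

# usage with hypothetical shapes: eight emotions, a batch of four samples
logits = torch.randn(4, 8)
target = torch.tensor([0, 2, 5, 7])
d = torch.full((8, 8), 1.0 / 8)                               # uniform prior, i.e. plain LS
loss = pdls_loss(logits, target, d)
```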
The most important part is how to calculate d_{kc}. Following Barsoum et al. [19], when correcting the labels of the FER2013 dataset [20], the authors of FER+ also provided the label distribution information for every sample. In FER+, each sample was labeled by ten people, who classified each image into the eight basic classes plus two additional classes, unknown and non-face. While the correct label distribution for each sample is difficult to obtain, we assume that the method used to build FER+ is a good approximation of the ground-truth distribution. For each sample s ∈ S, where S is the FER+ dataset, we have the approximate distribution ad_s. Since unknown and non-face images are omitted, we only use the information for the eight basic emotions, denoted by E. Then ad_s is a vector in R^8, where 8 is the size |E|, and \sum ad_s = 1. Equation 5 calculates the average distribution for each ground-truth emotion k. In this calculation, we use only the training set of FER+.

d_k = \frac{\sum_{s \in S_k} ad_s}{|S_k|}   (5)

where:
• d_k: the average distribution for the label k, d_k ∈ R^8
• |S_k|: the size of the subset S_k, S_k ⊂ S, in which the ground-truth emotion is k, and ∪_{k∈E} S_k = S.

TABLE 2. The prior distribution of the emotions on the FER task.

The final prior distribution d_{kc} for the FER task is provided in table 2. Each row in the table is d_k, where k is one of the eight emotion labels. The columns are the confusion labels, which are also the eight emotion labels. For example, d_{neutral,sadness} = 0.114 means that when an image is neutral, there is an 11.4% chance of confusing it with sadness. The number on the main diagonal is always higher than 0.5 and represents the distribution for the label's own emotion. The happiness emotion is very clear and easy to detect: d_{happiness,happiness} = 0.918, whereas fear and disgust are difficult to detect and easy to confuse.
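The sketch below shows one way to build the d matrix from per-image vote counts of the kind distributed with FER+ (equation 5); the array names, toy values, and input layout are assumptions for illustration, not the released FER+ file format.

```python
import numpy as np

def prior_distribution(vote_counts, labels, num_classes=8):
    """Eq. 5: average the per-sample vote distributions ad_s over each
    ground-truth class k (training split only).
    vote_counts: (N, num_classes) votes per basic emotion (unknown/non-face removed);
    labels: (N,) ground-truth label per sample."""
    ad = vote_counts / vote_counts.sum(axis=1, keepdims=True)   # normalize to ad_s
    d = np.zeros((num_classes, num_classes))
    for k in range(num_classes):
        mask = labels == k
        if mask.any():
            d[k] = ad[mask].mean(axis=0)     # d_k = mean of ad_s over S_k
        else:
            d[k, k] = 1.0                    # fall back to one-hot if class k is absent
    return d

# hypothetical toy input: 3 samples, 8 emotions, 10 annotators each
votes = np.array([[7, 1, 0, 2, 0, 0, 0, 0],
                  [0, 9, 1, 0, 0, 0, 0, 0],
                  [6, 0, 0, 3, 1, 0, 0, 0]])
labels = votes.argmax(axis=1)
print(prior_distribution(votes, labels)[0])   # row d_0, sums to 1
```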

IV. DATASETS
There are three popular ITW datasets for the FER task: the FER+ [19], RAF-DB [11], [12] and AffectNet [13] datasets. In this study, the experiments are conducted on all of them. The eight discrete emotions for the classification are neutral, happiness, surprise, sadness, anger, disgust, fear and contempt. Some previous datasets and studies used only seven of them, excluding contempt because it is difficult and rare in the real world. The details for each dataset are given below.
FER+ dataset. The FER+ dataset [19] is the first ITW dataset among them. The original version is the FER2013 dataset [20] by Goodfellow et al., released for the ICML 2013 Workshop on Challenges in Representation Learning.
But as the labeling accuracy of the FER2013 dataset is not reliable, Barsoum et al. reassigned the labels [19]. Ten people manually assigned the basic emotions for each image in the FER2013 dataset. A subset of the original images was excluded if classified as unknown or non-face. The final emotion label was assigned based on the votes of the ten people. The number of people voting for each emotion for each image was provided, which was then used to calculate the approximate distribution of the emotions over that image.
The dataset includes all the images, each of which contains one aligned face. The dataset images were collected from the Internet by querying many related expression keywords. There are many kinds of faces in the real-world environment, and their pose and rotation make them more challenging to recognize. The images were aligned and centered, and they were scaled slightly differently. All images are low-resolution and in grayscale, with a size of 48 × 48 pixels. The corresponding label for each image is also given. The eight basic emotions are used in this dataset.
Table 3 and figure 4 show the distribution of train, test and validation on the FER+ dataset. The number of neutral images is the highest: 9,030 in the train set and 1,102 in the test set. The disgust emotion has the lowest number of images: only 107 in train and 15 in test. The contempt emotion has a similar number of images to disgust: only 115 in train and 13 in test. Disgust, contempt and fear have few images compared with the other five emotions. This is normal in natural communication, where people are usually in a neutral or happy state and only rarely experience disgust, contempt or fear. Figure 4 shows that the distributions of emotions in training, testing, and validation on the FER+ are similar.

FIGURE 4. The FER+ data distribution of train/test/valid.

RAF-DB dataset. Shan Li, Weihong Deng, and Jun-Ping Du provided the Real-world Affective Faces Database (RAF-DB) for emotion recognition [11], [12]. The dataset contains about 30,000 images downloaded from the Internet. About 40 trained annotators carefully labeled the images. The dataset has two parts: the single-label subset (basic emotions) and the two-tab subset (compound emotions). We used the single-label subset with seven classes of basic emotions. This subset has 12,271 samples in the training set and 3,068 in the test set. The number of samples for each emotion is given in table 3. Notably, the RAF-DB dataset does not include the contempt expression. Figure 1 shows that image sizes in the RAF-DB vary from tiny to large, which makes it difficult for the DL model to deal with.
AffectNet dataset. The AffectNet [13] is the largest dataset for the FER task. The dataset contains more than one million images queried from the Internet by using related expression keywords. About 450,000 images were manually annotated by trained persons. It also includes train, validation, and test sets. The test set has not yet been published, so most previous studies used the validation set as the test set [13], [37]–[40]. Because the contempt emotion is rare in the natural world, some studies [40] used only seven emotions, while other studies [13], [38], [39] analyzed all eight emotions. Another study used both eight and seven expressions [37]. Therefore, to compare our results with the previous studies, we performed experiments with both eight classes and seven classes.

TABLE 3. Number of images in training/testing/validation subsets of the FER+, RAF-DB, and AffectNet datasets.

Table 3 shows the number of samples for each emotion class in each subset (train, validation, and test) of the FER+, RAF-DB, and AffectNet datasets. The names they use for the labels are slightly different but can be mapped to the eight basic emotions as in the emotion column. The FER+ has three separate subsets for training, validation, and testing, while the two others have only two subsets. The AffectNet dataset has not published the testing subset, so as in most of the studies on this dataset, the validation set is taken as the testing subset, and the validation subset used during training is randomly selected from the training subset. Similarly, for the RAF-DB, the training subset is randomly split to obtain the training and validation subsets. Only AffectNet exhibits a balanced validation (i.e., testing) set,


while the FER+ and RAF-DB are highly unbalanced. Both the FER+ and AffectNet datasets have eight emotion labels, and the RAF-DB has only seven emotion classes, without the contempt expression.
Figure 5 gives some sample images for each class from the three datasets. In this figure, each column presents one emotion expression. The images in the first two rows (figure 5a) are from the FER+ dataset, figure 5b is from RAF-DB, and the rest (figure 5c) are from AffectNet. The last column of RAF-DB is empty because the RAF-DB dataset has seven emotions, without the contempt expression.

FIGURE 5. Sample images from the (a) FER+, (b) RAF-DB and (c) AffectNet datasets.

V. EXPERIMENTS AND RESULTS
This section reports our experiments and results. Subsection V-A gives the experimental setup. Results are shown in subsection V-B. Finally, subsection V-C presents a discussion of our approach and its limitations.

A. EXPERIMENTAL SETUP
For all experiments, Fastai [41] and PyTorch [42] were used. These toolboxes make DL experiments easier, with many built-in classes, functions, and pre-trained models to reuse.
In DL, the network initialization has a significant impact on the training process. Commonly, weights are initialized randomly. Having a good initialization strategy helps the networks learn better and more smoothly. In our case, we carefully initialize the network weights. The STN block is set to the identity transformation. The SR layers are initialized from the previously published pre-trained model [27]. The base network, VGG16, was trained with different-scale input images; the model weights were then saved and reloaded into our architecture. The careful initialization step has several advantages: it makes the network easier to train, gives quicker convergence, and makes the network more stable, leading to less variance.
We use the Adam optimization algorithm [43] with an adaptive learning rate following the One Cycle Policy suggested by Smith [44]. The learning rates were set to 1e-3 for the later layers of the network, and 1e-4 for the STN block. The lower learning rate for the STN, which performs the transformation, aims to keep this block nearly unchanged.
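In plain PyTorch, the different learning rates for the STN block and the rest of the network can be expressed with parameter groups, and a one-cycle schedule is available as OneCycleLR, as in the sketch below; the toy module and its attribute names are placeholders, not the exact structure of our code.

```python
import torch
import torch.nn as nn

# Toy stand-in with the two parts scheduled differently; attribute names are ours.
class ToyPSR(nn.Module):
    def __init__(self):
        super().__init__()
        self.stn = nn.Linear(4, 4)     # placeholder for the STN block
        self.head = nn.Linear(4, 8)    # placeholder for the later layers

model = ToyPSR()
optimizer = torch.optim.Adam([
    {"params": model.stn.parameters(),  "lr": 1e-4},   # lower LR keeps the STN nearly unchanged
    {"params": model.head.parameters(), "lr": 1e-3},
])
# One-cycle schedule over (epochs x batches per epoch) steps, in the spirit of [44].
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=[1e-4, 1e-3], total_steps=20 * 100)
```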

The validation set is used to optimize the hyper-parameters, after which we collect the best models. The hyper-parameters for all our experiments include the learning rate and the number of epochs at which the network gets the best result. Those models are then used to evaluate the test set. We apply Test Time Augmentation in the test step: eight randomly rotated and zoomed images are generated from each test image and passed through the model to obtain raw prediction scores. The final raw score is the average of their outputs.
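A sketch of this test-time augmentation step is shown below, using torchvision's random rotation and resized-crop transforms (assumed to accept tensor inputs) as stand-ins for the exact augmentations; the rotation and zoom ranges are illustrative values.

```python
import torch
import torchvision.transforms as T

def tta_predict(model, image, n_aug=8, size=100):
    """Average raw scores over n_aug randomly rotated/zoomed copies of one image.
    image: float tensor (C, H, W); returns the averaged score vector."""
    augment = T.Compose([
        T.RandomRotation(10),                              # small random rotation
        T.RandomResizedCrop(size, scale=(0.8, 1.0)),       # small random zoom/crop
    ])
    model.eval()
    with torch.no_grad():
        scores = torch.stack([model(augment(image).unsqueeze(0))[0]
                              for _ in range(n_aug)])
    return scores.mean(dim=0)
```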
For basic emotion recognition, several metrics are used to evaluate the results. The first and most widely used metric is accuracy, or weighted accuracy (WA), which is the number of correct answers divided by the total number of test samples. But when the number of samples for each class is highly unbalanced, WA may perform poorly, particularly for the FER task, because the emotions in the real world are usually unbalanced. Some emotions such as neutral, happy, or sad are more common than disgust, fear, or contempt. In this case, unweighted accuracy (UA) should be considered as an additional evaluation of the system. The UA metric is an unbiased version of WA: it is calculated as the average of the per-class accuracies. For comparison with other studies, both WA and UA are adopted in the experiments.
All experiments were run on Ubuntu 18.04 with 32 GB of RAM and a GeForce RTX 2080 Ti GPU with 11 GB of GPU RAM.
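For clarity, the two metrics can be computed as in the short sketch below (labels and predictions as integer arrays); the toy values are illustrative only.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """WA: overall fraction of correct predictions."""
    return np.mean(y_true == y_pred)

def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class accuracies, insensitive to class imbalance."""
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return np.mean(per_class)

y_true = np.array([0, 0, 0, 1, 2])
y_pred = np.array([0, 0, 1, 1, 2])
print(weighted_accuracy(y_true, y_pred), unweighted_accuracy(y_true, y_pred))  # 0.8, ~0.89
```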
B. EXPERIMENTAL RESULTS
We report the experimental results for the RAF-DB, FER+, and AffectNet datasets.

1) RAF-DB DATASET
Table 4 gives the results for the RAF-DB dataset. In previous studies, the methods in [38], [39], [45] report results with the WA metric, and others [46], [47] report the UA metric. We report and compare with previous findings in both WA and UA metrics. Our approach produces significantly better results than the recent studies on both metrics. For WA, we get 88.98%, an improvement of more than 2% in absolute terms, or 2.4% relatively, compared to Wang et al. [39]. In the UA metric, our approach is 4.05% better in absolute terms compared to [46], or 5.28% relatively.

TABLE 4. RAF-DB accuracy comparison (%).

Figure 6 shows the confusion matrix for the RAF-DB. The model gives very good accuracy for happiness and neutral, but the results for disgust and fear are only 54% and 59%, respectively. Disgust images were predicted as neutral in 17% of cases, and fear was predicted as surprise in 16% of cases.

FIGURE 6. The confusion matrix on the test set for the RAF-DB, FER+ and AffectNet datasets.

2) FER+ DATASET
Table 5 shows the experimental results on the FER+ test set. The highest accuracy is from the PSR model, which achieved


89.75%. Compared to the best previous result in the literature, by Albanie et al. [48], our approach improves by 0.65%.

TABLE 5. FER+ accuracy comparison (%).

The average accuracy for our proposed architecture is 69.54%, and the F1 score (macro) is 74.88%. The low accuracy on disgust and fear makes the F1 score and average accuracy far lower than the overall accuracy. Future work should consider increasing the number of samples of disgust and fear to improve the accuracy for these two expressions.


Figure 6c shows the confusion matrix on the test set for the PSR architecture: happiness has the highest accuracy at 96%, followed by neutral, surprise and anger. All four expressions have accuracy above 90%. The lowest accuracy is for contempt, at 23%. Due to the lack of contempt images, the model could not learn to distinguish it from neutral, anger, or sadness. Some emotions have a high likelihood of wrong classification: fear is predicted as surprise in 37% of cases, disgust is classified as anger in 33%, and sadness is classified as neutral in 22%. These high levels of confusion are typical in the real world because, even for humans, it can be difficult to distinguish these pairs of emotions.

3) AffectNet DATASET
We compared both eight and seven classes on the AffectNet dataset. Table 6 shows the results in classification accuracy (WA). In the classification of eight emotions, our model achieved an accuracy of 60.68%, outperforming the current state of the art of 59.58% achieved by Georgescu et al. [37]. In the seven-emotion task, our model achieved an accuracy of 63.77%, slightly improving on the current highest result of 63.31% [37]. Figure 6b and figure 6d present the confusion matrices for AffectNet in the seven-class and eight-class tasks, respectively. The happy expression has the highest detection rate in both cases, followed by the fear emotion. Surprise, anger, and disgust have similar performance in both cases. In the eight-expression task, contempt has the lowest performance, at just 49%.

TABLE 6. AffectNet accuracy comparison (%).

FIGURE 7. Cumulative accuracy by size on the test set of RAF-DB dataset with the VGG16 (base-line) and the PSR architecture.

Figure 7 shows the cumulative accuracy according to the size of the original image for the baseline network and the PSR architecture. The PSR was run with the three branches [1, 2, 1] and the cutting point at the sixth convolution layer, with an original input size of 100 pixels. The image sizes ranged from 23 pixels to about 1,200 pixels. Because the large images were resized to a fixed size of 100 pixels, we consider only those images smaller than 100 pixels to see how our approach is affected. We omitted the first twenty points because they are too unstable for calculating the accuracy. The figure shows that initially, with tiny image sizes of less than 40 pixels, both the baseline and PSR are unstable. But after 40 pixels, the PSR architecture improves and works better than the baseline network. The PSR maintains this trend to the end of the dataset because, in our approach, we added the super-resolution module that doubles the size of the small image in one of the three branches, and another branch at half size, 100/2 = 50, which improved the recognition accuracy.
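The cumulative accuracy curve of Figure 7 can be reproduced with a computation like the sketch below, which sorts the test samples by original image size and accumulates the accuracy up to each size; the function and variable names are ours, and the interpretation of "cumulative" (accuracy over all images no larger than a given size) is our reading of the figure.

```python
import numpy as np

def cumulative_accuracy_by_size(sizes, correct):
    """For each original image size s (ascending), the accuracy over all test
    samples whose size is <= s. sizes: (N,), correct: (N,) booleans."""
    order = np.argsort(sizes)
    sizes, correct = np.asarray(sizes)[order], np.asarray(correct)[order]
    cum_acc = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    return sizes, cum_acc

# toy example: accuracy after each of the five smallest test images
sizes, cum_acc = cumulative_accuracy_by_size([30, 25, 48, 90, 60],
                                             [True, False, True, True, True])
print(list(zip(sizes, cum_acc)))
```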
Figure 8 shows the accuracy discrepancy by size between PSR and VGG16 on the test set of the RAF-DB dataset. The blue points are raw values, and the yellow ones are the smoothed version. The accuracy discrepancy represents the speed of the improvement of the PSR over the baseline network. It is clear that the improvement had the highest speed when the


FIGURE 8. The discrepancy of accuracy by size on the test set of the RAF-DB dataset between PSR and baseline.

original image size ranged from 40 pixels to about 55 pixels; it slows down when the size is between 55 and 75 pixels, and becomes lower for 75–85 pixels. After 85 pixels, the improvement continues but at a slow speed. Notably, in the experiments for RAF-DB, the original input size is 100 pixels, so 50 pixels is half of the input size.

4) THE EFFECTIVENESS OF EACH BLOCK

TABLE 7. Analysis of the effectiveness of each block on the RAF-DB (%).

Table 7 shows a comparison between some variations of the PSR on the RAF-DB dataset. The second row presents the result of the PSR without the STN block, which means that there is only the pyramid structure on top of the baseline network with three branches (kstep = 2). It is clear that on both the WA and UA metrics, this network architecture gets better results than VGG16. The improvements are significant for both metrics: 2.73% for WA and 3.30% for UA. This implies that our pyramid with SR plays an important role. When adding the STN block to make the full PSR architecture, we get a further small improvement, about 0.36% in the WA and 0.81% in the UA metric. We analyzed the effectiveness of the super-resolution reconstruction module by breaking the PSR without the STN block into three separate branches, to see the contribution of each to the final fusion. Figure 9 shows the accuracy of each separate branch and also their fusion in the PSR architecture. As expected, the small-size branch gets the lowest accuracy, and the fusion gets the best accuracy by combining all three branches. The SR branch and the original-input-size branch use the same input scale: one is the SR output from the half size and the other is the original input. Although they use the same scale, the SR branch performs better than the original-size branch. The discrepancy between the SR and original-input-size branches is large for small images, and it decreases as the size increases. The results clearly reconfirm that the SR branch helps the network improve performance when the original image size is small.

FIGURE 9. The accuracy by original image size on each branch of the PSR without the STN block on the RAF-DB test-set.

FIGURE 10. The boxplots of performance with different cutting points (accuracy).

Figure 10 shows the performance on the RAF-DB dataset by the cutting point pos of the convolution layers of VGG16. The network exhibits the lowest performance at the point pos = 0, indicating that all the convolution layers are shared. The accuracy increases as the pos value increases, but this improvement ceases after the particular cutting point pos = 5. After the fifth-layer cutting point, the accuracy remains stable around a particular value. This result supports the second observation, i.e., the CNN is sensitive to the input size. Sharing some early convolution layers causes the network to crash. On the other hand, the deeper layers can be shared, because the earlier convolution layers learn the low-level features, while the later convolution layers work on more abstract, high-level features.
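As a sketch of how the cutting point can be realized, the snippet below splits the convolutional part of a torchvision VGG16 at the pos-th convolution layer: everything up to the cut is duplicated per branch (low-level), and everything after it is a single shared module (high-level). This is our illustrative reading of the design, not the released implementation, and the exact indexing of pos is an assumption.

```python
import copy
import torch.nn as nn
from torchvision.models import vgg16

def split_vgg16_at(pos, num_branches=3):
    """Per-branch low-level extractors (up to the pos-th conv layer) and
    one shared high-level extractor (the remaining layers)."""
    features = list(vgg16().features)
    conv_idx = [i for i, m in enumerate(features) if isinstance(m, nn.Conv2d)]
    cut = conv_idx[pos] if pos < len(conv_idx) else len(features)
    low_levels = [copy.deepcopy(nn.Sequential(*features[:cut]))   # separate weights per branch
                  for _ in range(num_branches)]
    high_level = nn.Sequential(*features[cut:])                   # weights shared across branches
    return low_levels, high_level

low_levels, high_level = split_vgg16_at(pos=5)   # e.g., the cutting point selected in Figure 10
```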
5) THE SENSITIVITY OF THE NETWORK TO DIFFERENT INPUT IMAGE SIZES
Figure 11 shows a comparison between PSR and VGG16 in terms of sensitivity when changing the input image


size on the RAF-DB dataset. The training process is similar to that in figure 2a, with the first 20 frozen epochs omitted. The changing points are at epochs 20, 40, and 60. The graph shows that the PSR is less sensitive than the baseline. After the changing points, the loss of the PSR architecture increases only slightly, whereas VGG16 shows a large increase in loss values. The results confirm that our approach is robust for the ITW FER task, where the original image size varies, even though CNNs are usually sensitive to the input image size.

FIGURE 11. Visualization of training loss of the PSR and baseline during the training process of the RAF-DB when changing the original input size in the sequence 50, 100, 150, and back to 50 pixels again.

6) THE COMPARISON OF THE THREE DIFFERENT LOSS FUNCTIONS

TABLE 8. The loss function comparison (accuracy %).

Table 8 compares the three loss functions, CE, LS and PDLS, on the RAF-DB dataset. For each loss function, we conducted an experiment on the baseline architecture, VGG16, and on our proposed network architecture. For both the VGG16 and the PSR network architecture, the CE loss function gets the lowest accuracy. For the baseline network, LS is slightly better than PDLS, by 0.12%. For the PSR architecture, however, PDLS is slightly better than LS, with a margin of 0.42%.
Figure 12 shows some sample images from the RAF-DB dataset for which PSR predicts the correct emotions while the baseline network gives incorrect ones. All three images are in low resolution, with sizes between 45 and 56 pixels.

FIGURE 12. Sample images in low-resolution in the RAF-DB dataset where PSR recognizes better than the baseline network.

C. DISCUSSION
The experiments have demonstrated the significant improvement of our approach on the FER task on all three datasets. Compared to the base network, VGG16, our pyramid architecture with the additional SR block and late fusion greatly improves the performance. On the RAF-DB dataset, our accuracy is better by about 2% in the WA metric and 4.05% in the UA metric, compared to the state-of-the-art results. The most substantial improvement in accuracy has been obtained on the RAF-DB dataset. On the AffectNet dataset, PSR improves the accuracy by 1.01% and 0.46% compared to the best previous study, for the eight- and seven-class tasks respectively. Although the given input is small (48 × 48) in the FER+ dataset, our PSR model generates better results. Among the three datasets, the RAF-DB exhibits the most improvement because the RAF-DB has many image sizes, from 23 to 100 pixels. The AffectNet dataset shows less improvement. For the FER+ dataset, the dataset includes resized and cropped versions of the images; using the original versions, if they were available, PSR would give better results. Notably, the different accuracy gaps over the second-best algorithm in tables 4, 5, and 6 might be due to each of these tables having a different set of algorithms. Overall, the pyramid with SR gives a significant improvement for the FER task on ITW datasets. The SR branch helps the network performance on low-resolution images, and combining it with the other branches then makes the whole network better. The STN block also provides some improvement.
As in the second observation, DL networks are sensitive to the input image size, and the low-level block in each branch is very different. The result shown in figure 10 supports our assumption. When the pos value is decreased, indicating that more layers are shared, including some low-level convolution layers, the network is degraded. When the pos value is increased, indicating that the low-level features are less shared, the network exhibits better results. Due to the trade-off between performance and computing cost in real practice, the results in figure 10 are useful for selecting the cutting point.
The experimental results in table 8 reconfirm that the LS loss function is better than CE, as in many previous studies [34]–[36]. Both LS and PDLS perform better than CE, and in the case of the PSR architecture, PDLS shows a significant boost. The PDLS loss function gives a slight improvement over the original LS function in the FER task, but it varies case by case and depends on the network architecture. In the case of VGG16, the experiments show that PDLS is nearly equal to LS, which suggests that future improvements are still needed. The results from the PSR model suggest that either LS or PDLS is a good choice of loss function for the FER task, instead of CE.
Despite the significant improvements presented in our study, some limitations warrant further research. The first is


the step of the scale-up from the lowest resolution. The pyramid architecture views the input at several scales, but the step is an integer larger than one, and 2 is the starting value. However, the double scale is still a tremendous value. While a scale of 1.2 is a good choice for most augmentation techniques, and 1/1.2 (≈ 0.83) in the reverse case, we suggest that the scale step should be 1.2² = 1.44, or approximately 1.5. For the traditional algorithms, a decimal scaling value is possible, but it cannot be used for the DL approaches. The second weakness is the baseline network architecture. Although several network architectures that are more reliable than VGG16, such as ResNet [49] and SENet [50], have been reported, we chose VGG16 as the base network. Although our approach is general and can be applied to many kinds of CNN, a re-implementation is needed for each base network. Our approach is not a simple module, so extra effort must be taken to implement it case by case. The investigation of other base architectures is left for future work.

VI. CONCLUSION
In this study, we addressed the varying-image-size problem in the FER task for ITW datasets, where the original input image size varies. Although CNNs can work on images with a small rotation or scale change, they fail when the scale change is enormous. The main contribution of this study is the development of a pyramid network architecture with several branches, each of which works on one level of input scale. The proposed network is based on the VGG16 model, but it can be extended to other baseline network architectures. In the PSR architecture, the SR method is applied for up-scaling the low-resolution input. Experiments on three ITW FER datasets show that our proposed method outperforms all the current state-of-the-art methods.

REFERENCES
[1] A. Mehrabian, Nonverbal Communication. New Brunswick, NJ, USA: Aldine Transaction, 1972.
[2] P. Ekman, ''Are there basic emotions?'' Psychol. Rev., vol. 99, no. 3, pp. 550–553, 1992.
[3] P. Ekman, ''Basic emotions,'' in Handbook of Cognition and Emotion. New York, NY, USA: Wiley, 1999, pp. 45–60.
[4] J. A. Russell, ''A circumplex model of affect,'' J. Personality Social Psychol., vol. 39, no. 6, pp. 1161–1178, 1980.
[5] P. Ekman and W. Friesen, Facial Action Coding System, vol. 1. Mountain View, CA, USA: Consulting Psychologists Press, 1978.
[6] C. Shan, S. Gong, and P. W. McOwan, ''Facial expression recognition based on local binary patterns: A comprehensive study,'' Image Vis. Comput., vol. 27, no. 6, pp. 803–816, May 2009.
[7] L. Ma and K. Khorasani, ''Facial expression recognition using constructive feedforward neural networks,'' IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 3, pp. 1588–1595, Jun. 2004.
[8] J. J. Lien, T. Kanade, J. F. Cohn, and C.-C. Li, ''Automated facial expression recognition based on FACS action units,'' in Proc. 3rd IEEE Int. Conf. Autom. Face Gesture Recognit., 1998, pp. 390–395.
[9] T. Zhang, W. Zheng, Z. Cui, Y. Zong, J. Yan, and K. Yan, ''A deep neural network-driven feature learning method for multi-view facial expression recognition,'' IEEE Trans. Multimedia, vol. 18, no. 12, pp. 2528–2536, Dec. 2016.
[10] P. S. Aleksic and A. K. Katsaggelos, ''Automatic facial expression recognition using facial animation parameters and multistream HMMs,'' IEEE Trans. Inf. Forensics Security, vol. 1, no. 1, pp. 3–11, Mar. 2006.
[11] S. Li and W. Deng, ''Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition,'' IEEE Trans. Image Process., vol. 28, no. 1, pp. 356–370, Jan. 2019.
[12] S. Li, W. Deng, and J. P. Du, ''Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,'' in Proc. 30th IEEE Conf. Comput. Vis. Pattern Recognit., Oct. 2017, pp. 2584–2593.
[13] A. Mollahosseini, B. Hasani, and M. H. Mahoor, ''AffectNet: A database for facial expression, valence, and arousal computing in the wild,'' IEEE Trans. Affect. Comput., vol. 10, no. 1, pp. 18–31, Jan. 2019.
[14] P. Liu, S. Han, Z. Meng, and Y. Tong, ''Facial expression recognition via a boosted deep belief network,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1805–1812.
[15] K. Liu, M. Zhang, and Z. Pan, ''Facial expression recognition with CNN ensemble,'' in Proc. Int. Conf. Cyberworlds (CW), Sep. 2016, pp. 163–166.
[16] C. Huang, ''Combining convolutional neural networks for emotion recognition,'' in Proc. IEEE MIT Undergraduate Res. Technol. Conf. (URTC), Nov. 2017, pp. 1–4.
[17] N. Zeng, H. Zhang, B. Song, W. Liu, Y. Li, and A. M. Dobaie, ''Facial expression recognition via learning deep sparse autoencoders,'' Neurocomputing, vol. 273, pp. 643–649, Jan. 2018.
[18] D. C. Tozadore, C. M. Ranieri, G. V. Nardari, R. A. F. Romero, and V. C. Guizilini, ''Effects of emotion grouping for recognition in human-robot interactions,'' in Proc. 7th Brazilian Conf. Intell. Syst. (BRACIS), Oct. 2018, pp. 438–443.
[19] E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang, ''Training deep networks for facial expression recognition with crowd-sourced label distribution,'' in Proc. 18th ACM Int. Conf. Multimodal Interact., 2016, pp. 279–283.
[20] I. J. Goodfellow, ''Challenges in representation learning: A report on three machine learning contests,'' Neural Netw., vol. 64, pp. 59–63, Apr. 2015.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ''ImageNet: A large-scale hierarchical image database,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[22] K. Simonyan and A. Zisserman, ''Very deep convolutional networks for large-scale image recognition,'' 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[23] C. Dong, ''Learning a deep convolutional network for image super-resolution,'' in Computer Vision, vol. 8692, D. Fleet, Ed. Cham, Switzerland: Springer, 2014, pp. 184–199.
[24] J. Kim, J. K. Lee, and K. M. Lee, ''Accurate image super-resolution using very deep convolutional networks,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1646–1654.
[25] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, ''Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1874–1883.
[26] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, ''Photo-realistic single image super-resolution using a generative adversarial network,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4681–4690.
[27] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, ''Enhanced deep residual networks for single image super-resolution,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 1132–1140.
[28] Y. Hu, X. Gao, J. Li, Y. Huang, and H. Wang, ''Single image super-resolution via cascaded multi-scale cross network,'' 2018, arXiv:1802.08808. [Online]. Available: http://arxiv.org/abs/1802.08808
[29] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, ''Spatial transformer networks,'' in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2017–2025.
[30] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, ''Deformable convolutional networks,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 764–773.
[31] M. Hu, H. Wang, X. Wang, J. Yang, and R. Wang, ''Video facial emotion recognition based on local enhanced motion history image and CNN-CTSLSTM networks,'' J. Vis. Commun. Image Represent., vol. 59, pp. 176–185, Feb. 2019.
[32] S. Li, W. Zheng, Y. Zong, C. Lu, C. Tang, X. Jiang, J. Liu, and W. Xia, ''Bi-modality fusion for emotion recognition in the wild,'' in Proc. Int. Conf. Multimodal Interact., Oct. 2019, pp. 589–594.
[33] A. Sepas-Moghaddam, A. Etemad, F. Pereira, and P. L. Correia, ''Facial emotion recognition using light field images with deep attention-based bidirectional LSTM,'' in Proc. ICASSP - IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 3367–3371.


[34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, ''Rethinking the inception architecture for computer vision,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[35] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, ''Learning transferable architectures for scalable image recognition,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8697–8710.
[36] R. Müller, S. Kornblith, and G. Hinton, ''When does label smoothing help?'' in Proc. NIPS, 2019, pp. 4694–4703.
[37] M.-I. Georgescu, R. T. Ionescu, and M. Popescu, ''Local learning with deep and handcrafted features for facial expression recognition,'' IEEE Access, vol. 7, pp. 64827–64836, 2018.
[38] J. Zeng, S. Shan, and X. Chen, ''Facial expression recognition with inconsistently annotated datasets,'' in Proc. Eur. Conf. Comput. Vis., vol. 11217, 2018, pp. 227–243.
[39] K. Wang, X. Peng, J. Yang, D. Meng, and Y. Qiao, ''Region attention networks for pose and occlusion robust facial expression recognition,'' IEEE Trans. Image Process., vol. 29, pp. 4057–4069, 2020.
[40] W. Hua, F. Dai, L. Huang, J. Xiong, and G. Gui, ''HERO: Human emotions recognition for realizing intelligent Internet of Things,'' IEEE Access, vol. 7, pp. 24321–24332, 2019.
[41] J. Howard. (2018). Fastai. [Online]. Available: https://github.com/fastai/fastai
[42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, ''Automatic differentiation in PyTorch,'' in Proc. NIPS, 2017, pp. 1–4.
[43] D. P. Kingma and J. Ba, ''Adam: A method for stochastic optimization,'' 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980
[44] L. N. Smith, ''Cyclical learning rates for training neural networks,'' in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2017, pp. 464–472.
[45] J. Cai, Z. Meng, A. S. Khan, Z. Li, J. O'Reilly, and Y. Tong, ''Probabilistic attribute tree in convolutional neural networks for facial expression recognition,'' 2018, arXiv:1812.07067. [Online]. Available: http://arxiv.org/abs/1812.07067
[46] Y. Fan, J. C. Lam, and V. O. Li, ''Multi-region ensemble convolutional neural network for facial expression recognition,'' in Artificial Neural Networks and Machine Learning (Lecture Notes in Computer Science), vol. 11139. Berlin, Germany: Springer, 2018, pp. 84–94.
[47] F. Lin, R. Hong, W. Zhou, and H. Li, ''Facial expression recognition with data augmentation and compact feature learning,'' in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 1957–1961.
[48] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman, ''Emotion recognition in speech using cross-modal transfer in the wild,'' in Proc. ACM Multimedia Conf., 2018, pp. 292–301.
[49] K. He, X. Zhang, S. Ren, and J. Sun, ''Deep residual learning for image recognition,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[50] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, ''Squeeze-and-excitation networks,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 8, pp. 2011–2023, Aug. 2020.

THANH-HUNG VO received the B.Eng. degree from the Ho Chi Minh City University of Technology, in 2010, and the M.Eng. degree from Vietnam National University, Vietnam, in 2013, all in computer science. He is currently pursuing the Ph.D. degree with the Pattern Recognition Laboratory, School of Electronics and Computer Engineering, Chonnam National University, South Korea. Since 2011, he has been working as a Lecturer with the Ho Chi Minh City University of Technology. His research interests include natural language processing, speech, and computer vision applying machine learning and deep learning techniques.

GUEE-SANG LEE (Member, IEEE) received the B.S. degree in electrical engineering and the M.S. degree in computer engineering from Seoul National University, South Korea, in 1980 and 1982, respectively, and the Ph.D. degree in computer science from Pennsylvania State University, in 1991. He is currently a Professor with the Department of Electronics and Computer Engineering, Chonnam National University, South Korea. His primary research interests include image processing, computer vision, and video technology.

HYUNG-JEONG YANG (Member, IEEE) received the B.S., M.S., and Ph.D. degrees from Chonbuk National University, South Korea. She is currently a Professor with the Department of Electronics and Computer Engineering, Chonnam National University, Gwangju, South Korea. Her main research interests include multimedia data mining, medical data analysis, social network service data mining, and video data understanding.

SOO-HYUNG KIM (Member, IEEE) received the B.S. degree in computer engineering from Seoul National University, in 1986, and the M.S. and Ph.D. degrees in computer science from the Korea Advanced Institute of Science and Technology, in 1988 and 1993, respectively. Since 1997, he has been a Professor with the School of Electronics and Computer Engineering, Chonnam National University, South Korea. His research interests include pattern recognition, document image processing, medical image processing, and deep learning applications.