Pyramid With Super Resolution For In-The-Wild Facial Expression Recognition
Digital Object Identifier 10.1109/ACCESS.2020.3010018
ABSTRACT Facial Expression Recognition (FER) is a challenging task that improves natural
human-computer interaction. This paper focuses on automatic FER on a single in-the-wild (ITW) image.
ITW images suffer from real problems of pose, direction, and input resolution. In this study, we propose a
pyramid with super-resolution (PSR) network architecture to solve the ITW FER task. We also introduce
a prior distribution label smoothing (PDLS) loss function that applies the additional prior knowledge of the
confusion about each expression in the FER task. Experiments on the three most popular ITW FER datasets
showed that our approach outperforms all the state-of-the-art methods.
and late fusion to combine the results from the VGG and the ResNet models. Zeng et al. extracted image histograms of oriented gradients and passed them through deep sparse autoencoders to classify them [17]. Tozadore et al. grouped emotions into several groups to help the CNN classify with better accuracy [18].

Despite these successes on in-the-lab datasets, the rise of in-the-wild (ITW) datasets in recent years has raised new challenges for researchers. While in-the-lab datasets were collected under controlled conditions, so the data were clean, accurate, and uniform, ITW datasets are noisy, inaccurate, and variable. We outline the following two observations about ITW datasets for the FER task.

Observation 1: The image size of ITW datasets varies. While the size of in-the-lab dataset images is controlled and nearly constant, ITW dataset images have various sizes, from very small to large. Figure 1 shows the image size distribution of the RAF-DB [11], [12] (Fig. 1a) and AffectNet [13] (Fig. 1b) datasets. These two selected datasets are the most popular ITW datasets for the FER task. Because of the differences in width and height, the average of the two is taken as the size of an image. In both datasets, small images occur more frequently, and this frequency decreases with increasing size. The mean and variance of the image size in the RAF-DB are 193 and 144, which is fairly large. The AffectNet dataset has larger image sizes, ranging from 130 pixels to more than 2000 pixels. In the graph, we round all images larger than 2000 pixels to the fixed value of 1000 pixels. Similar to the RAF-DB dataset, the number of images decreases as the image size increases. The third most popular ITW dataset for the FER task is the FER+ dataset [19], extended from FER2013 [20]. It also faces the different-image-size problem; unfortunately, the original image size information was omitted when the authors of the dataset published it. Most studies in this field do not consider the image-size problem. They simply resize all images to the same size, e.g., 128 × 128 or 224 × 224. The first reason is the DL framework itself: in batch mode, each batch must have the same input shape, and implementing different input sizes at the same time takes more effort and is complicated and computationally inefficient. While the CNN architecture has been successful for many image classification tasks, this practice rests on the assumption that, despite the resizing of the images, the network can learn to distinguish the classes by itself. Nearest-neighbor, bilinear, and bicubic interpolation are popular techniques to scale image sizes.

FIGURE 1. The image size distribution of the RAF-DB [11], [12] and AffectNet [13] datasets.
Observation 2: CNNs are usually sensitive to the input image size. While the CNN has been very successful for many tasks related to image classification and segmentation, this architecture suffers from several weaknesses. One of them is the sensitivity to the size of the input image. Zooming is one of the data augmentation techniques that attempts to address this problem. The selected zooming scale in most experiments ranges from 0.9 to 1.2, because values outside this range degrade and damage the network. With global pooling, CNN networks can support different input sizes, and the size-incremental technique has been used to train networks more quickly and help them converge more easily. Despite the improvement offered by this process, the network remains sensitive to the input size: a network trained with one input size works poorly on the same images at a different scale. Figure 2 shows the training and validation loss for VGG16 when training on the RAF-DB and the FER+ at different scales: 50 × 50, 100 × 100, 150 × 150, and back to 50 × 50 again for the RAF-DB, and 48 × 48, 96 × 96, 192 × 192, and again 48 × 48 for the FER+, for every 20 epochs in sequence. We transfer the weights from ImageNet [21] and then freeze the whole CNN architecture except the fully connected layers. The frozen steps were trained for 20 epochs at the smallest input image size. At the points where the image size changes (epochs 41, 61, 81), the losses of both the training and validation sets increase significantly. At epoch 81, although the input size returns to the 48 × 48 size used to train the network before, the loss value still increases because of the characteristics of convolution. The convolution layer uses a kernel (of size 3 × 3, 5 × 5, or similar) to scan the "pixels" of the previous layer. Then, even though the image is the same but at a different scale, the next convolution
The localization network outputs θ, a matrix of size 2 × 3, which is a representation of an affine transform in a 2D image. The grid generator then accepts θ and makes a grid, and finally, the sampler uses this grid to generate the output image. The output image is produced from the input image with rotation, scaling, and other transform operators. The input and output of this block are images with the same size and the same number of channels.

Different from in-the-lab images, ITW images vary greatly in head pose and direction. We add the STN block to help the network learn to align the face and make it easier to recognize.

Our implementation details follow the previously published paper [29]. Table 1 shows the details of the internal layers of this block. For the convolution layers, the parameters are the input channels, output channels, kernel size, and stride. Only the kernel size and stride are needed for the maxpool2d layer. For the linear layer, only two parameters are needed: the number of input nodes and the number of output nodes. After the localization, the feature map is flattened and passed through the fully connected part. Our algorithm calculates the size of the feature map dynamically based on the input size, so the block is adaptive to different sizes of the input images.

TABLE 1. The details of the STN block.
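The exact layer configuration from Table 1 is not reproduced in this extraction, so the following PyTorch sketch only illustrates the kind of size-adaptive STN block described above. The layer widths (loc_channels, fc_dim) and the adaptive pooling used to keep the flattened feature size fixed are our assumptions, not the paper's values; only the 2 × 3 affine output and the same-size input/output behavior follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STNBlock(nn.Module):
    """Size-adaptive spatial transformer (sketch; layer sizes are illustrative)."""

    def __init__(self, in_channels=3, loc_channels=32, fc_dim=64):
        super().__init__()
        # Localization network: a few conv/pool layers, then a fixed-size pooled
        # map so the fully connected part works for any input resolution.
        self.localization = nn.Sequential(
            nn.Conv2d(in_channels, loc_channels, kernel_size=5, stride=1, padding=2),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(loc_channels, loc_channels, kernel_size=3, stride=1, padding=1),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(4),          # stand-in for the dynamic size computation
        )
        self.fc_loc = nn.Sequential(
            nn.Linear(loc_channels * 4 * 4, fc_dim),
            nn.ReLU(inplace=True),
            nn.Linear(fc_dim, 6),             # theta: 2 x 3 affine parameters
        )
        # Start from the identity transform so early training leaves the image unchanged.
        self.fc_loc[-1].weight.data.zero_()
        self.fc_loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        feat = self.localization(x)
        theta = self.fc_loc(feat.flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        # Output has the same size and number of channels as the input.
        return F.grid_sample(x, grid, align_corners=False)
```

Initializing the last layer to the identity transform is a common design choice for STNs, so that the alignment is learned gradually rather than distorting the face from the first iterations.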
B. THE SCALING BLOCK
The scaling block is the leading block in our architecture. The main idea of this block is to view the input image at different scales, from small to large. To that end, super-resolution is used to upscale the image. As in many CNNs, to ensure the efficiency of memory and computing, the input images are kept at the same size, and to use the best information from the input images, they are passed to the network at the largest attainable size. The input size may be limited by the computational budget and depends on each dataset. When images of the same size are passed in, as in the first observation, many of them are in low resolution and have been up-scaled by some traditional algorithm. Our approach, instead, down-scales them and then up-scales them again using the SR technique. This block is meant to view the overall context in the low-resolution images, along with the high-resolution image, to consider the original features.

In the scaling block, the network branches into three or more sub-networks. All sub-networks work with the same input image but at a different scale. The last branch receives the original input images, which have the highest resolution for the network. Due to the computational limit, most studies in the field of image classification use input images from 100 up to a maximum of 312 pixels; for larger input sizes, the higher resolution does not improve the performance. For batch mode, all images are resized to the central size before being passed through the network. Larger images are then down-scaled, and smaller images need to be up-scaled. We call the original input size W × H. This scaling of the input uses a traditional algorithm such as nearest-neighbor, bilinear, or bicubic interpolation. While down-scaling an image is safe, up-scaling from a small image to a larger size with a traditional algorithm is complicated and inaccurate. Our approach aims to overcome this issue. The first branch is applied to the lowest-resolution image, which is down-scaled from the original input by a simple operator, implemented with mean pooling. We declare the values step and kstep for the step scale between two neighboring branches. By the limit of DL, step is set to 2. A larger kstep can be used, but due to the computational limitation, we restrict kstep to only 1 or 2. The size of the image for the first branch is

W/2^{kstep} × H/2^{kstep}
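To make the branch sizes concrete, here is a minimal sketch of how the inputs of the scaling block could be prepared: the first branch is a mean-pooled, low-resolution copy, the SR branches (described next) upscale it back by factors of 2, and the last branch keeps the original image. The function name is hypothetical, and a bicubic upsample stands in for the EDSR model used in the paper.

```python
import torch
import torch.nn.functional as F

def build_branch_inputs(x, kstep=1, sr_model=None):
    """Prepare the multi-scale inputs of the scaling block (sketch).

    x: batch of shape (N, C, H, W) at the original resolution W x H.
    kstep: number of halving steps (the paper restricts kstep to 1 or 2).
    sr_model: an x2 super-resolution network (EDSR in the paper); bicubic
              upsampling is used below as a stand-in when none is given.
    """
    # First branch: lowest resolution, W/2^kstep x H/2^kstep, via mean pooling.
    low = F.avg_pool2d(x, kernel_size=2 ** kstep)
    branches = [low]

    # kstep SR branches: upscale the low-resolution image by 2, 4, ..., so the
    # last SR output has the same size as the original input (Eq. 1).
    current = low
    for _ in range(kstep):
        if sr_model is not None:
            current = sr_model(current)
        else:
            current = F.interpolate(current, scale_factor=2, mode="bicubic",
                                    align_corners=False)
        branches.append(current)

    # Last branch: the original, highest-resolution image itself.
    branches.append(x)
    return branches

# With an original input of 100 x 100 and kstep = 1, this yields a three-branch
# setup (50x50, 100x100 from SR, 100x100 original), consistent with the
# 100-pixel, three-branch configuration reported in the experiments.
imgs = torch.randn(2, 3, 100, 100)
print([tuple(b.shape[-2:]) for b in build_branch_inputs(imgs, kstep=1)])
```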
Between the first and the last branch, there are kstep SR branches, each of which is an SR block with a scale factor of 2, 4, 8, . . . applied to the lowest-resolution image from the first branch. The size of the i-th SR branch is given by equation 1.

W/2^{kstep−i} × H/2^{kstep−i}   (1)

In case k = 1, there is only one SR branch in the scaling block, and its output size is the same as the original input size. In case k = 2, there are two SR branches, with sizes [W/2, H/2] and [W, H]. Our setup always ensures that the last SR part has the same size as the original input. For the SR task, we use the EDSR architecture introduced by Lim et al. [27].

By learning how to resample the image, we assume that this block can add useful information for this particular task, and thereby increase the accuracy of the prediction model.
C. LOW AND HIGH-LEVEL FEATURE EXTRACTOR
Typically, low and high-level feature extractors are combined in a base network architecture. We choose VGG16 [22] as the base network because this network is still used as the base of many recent networks for the FER task [31]–[33]. From the base network, VGG16 [22], we separate two parts for two levels of input. The low-level feature extractor receives the images as input and generates the feature map corresponding to the data. This block works at a low level of features, e.g., edges, corners, and so on. The high-level feature extractor receives the feature map from the low-level part and produces more in-depth, high-level features for the input.

While the input is passed through both extractors in this order, we separate them into two parts so that one can be shared across branches. As in the second observation, we know that CNNs are very sensitive to the input size, and here, each branch has a different input size. The low-level features of each branch are quite different and cannot be shared, because sharing the low-level layers damages the network. The high-level feature block is in a different situation: at this level, high-level features need to be learned and are less dependent on the size of the input, so the weights of this block can be shared across branches. The shared weights also act in a similar way to multi-task learning, where the combination helps each task obtain better results.

The position of the cutting point is denoted pos; it is the position of the convolution layer in the base network where we separate the two parts. A lower pos value means that all branches share the weights of most of the internal layers, while the highest value of pos separates all branches. From the second observation, we assume that a low pos value degrades the network. Since the base network is VGG16, which has 12 convolution layers, the cutting position pos should be in 0 − 12, which is the index of the corresponding convolution layer. We analyze the effect of the cutting point (the pos value) in the experiments.
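As one way to realize the cutting point pos, the sketch below splits torchvision's VGG16 convolutional stack at the pos-th convolution layer: each branch receives its own copy of the low-level part, while a single high-level part is shared across branches. This is an illustrative sketch under those assumptions, not the authors' implementation.

```python
import copy
import torch.nn as nn
from torchvision.models import vgg16

def split_vgg16_at(pos, num_branches=3):
    """Split the VGG16 feature stack at the pos-th convolution layer (sketch).

    Returns one low-level extractor per branch (not shared) and a single
    high-level extractor whose weights are shared across all branches.
    pos is assumed to lie between 0 and the number of convolution layers.
    """
    features = vgg16().features   # load pretrained weights here if desired

    conv_seen = 0
    cut_index = 0
    for i, layer in enumerate(features):
        if isinstance(layer, nn.Conv2d):
            conv_seen += 1
        if conv_seen == pos:
            cut_index = i + 1     # cut right after the pos-th convolution
            break

    low_level_template = features[:cut_index]
    high_level_shared = features[cut_index:]

    # Each branch owns an independent copy of the low-level layers ...
    low_level_branches = nn.ModuleList(
        [copy.deepcopy(low_level_template) for _ in range(num_branches)])
    # ... while the high-level layers form one shared module.
    return low_level_branches, high_level_shared
```

With pos = 0 everything is shared, while a large pos leaves almost the entire stack branch-specific, which matches the trade-off discussed above.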
D. FULLY CONNECTED BLOCK AND CONCATENATION BLOCK
The fully connected block includes two fully connected layers (Linear, FC) and several additional layers. The output feature from the high-level block passes through this block to obtain the vector representing the score for each label. Depending on the experiment, we use either seven or eight emotions, and the output vector size is set to seven or eight, respectively. We also use BatchNorm1d for the last feature map, and two dropout layers with p values of 0.25 and 0.5 for the first and second FC layers, respectively. The ReLU activation function is applied after the first FC layer. Similar to the high-level feature extractor block, the fully connected block is also shared among branches.

All branches are fused with the weighted late fusion strategy. The weight of each branch is determined according to its contribution to the final score of the whole network.
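The exact weighting scheme is not specified in this excerpt; the sketch below shows one simple realization of weighted late fusion, in which fixed per-branch weights (e.g., (1, 2, 1) for a three-branch model) scale the per-branch score vectors before they are summed. The normalization and the example weights are assumptions for illustration only.

```python
import torch

def weighted_late_fusion(branch_scores, weights):
    """Weighted late fusion of per-branch score vectors (sketch).

    branch_scores: list of tensors of shape (N, num_classes), one per branch.
    weights: per-branch weights, e.g. (1, 2, 1) for three branches; the exact
             values reflect each branch's contribution and are a configuration choice.
    """
    w = torch.tensor(weights, dtype=branch_scores[0].dtype,
                     device=branch_scores[0].device)
    w = w / w.sum()                                   # normalize the contributions
    stacked = torch.stack(branch_scores, dim=0)       # (B, N, num_classes)
    return (w.view(-1, 1, 1) * stacked).sum(dim=0)    # fused scores, (N, num_classes)
```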
III. THE PRIOR DISTRIBUTION LABEL SMOOTHING (PDLS) LOSS FUNCTION
FER for basic emotions is a classification problem, where each input image is classified into one of seven or eight classes. Softmax cross-entropy is the most popular loss function for classification tasks. The cross-entropy (CE) loss function is given in equation 2.

CE = − Σ_{c∈C} t_c · log(σ(z_c))   (2)

where:
• CE: cross entropy
• C: set of classes (labels)
• t_c: the distribution value of label c in the ground truth, where Σ_{c∈C} t_c = 1
• σ(z_c): softmax function for z_c
• z_c: raw score for class c from the model

In the real world, it is difficult to get the ground-truth distribution of the labels for each sample; therefore, the all-in-one assumption is used in most cases. In the ideal case, a sample belongs to one and only one class; therefore, the one-hot vector is widely used for labeling in classification tasks, so that equation 2 becomes the simple case of −log(σ(z_k)), where t_c = 0 for all c ∈ C except the correct label k (t_k = 1). Then, all terms except the one for label k are omitted.
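Equation 2 with an arbitrary target distribution t can be written directly as below; with a one-hot target it reduces to the usual −log(σ(z_k)), which the small check at the end confirms against PyTorch's built-in cross-entropy. This is a sketch for clarity, not the paper's code.

```python
import torch
import torch.nn.functional as F

def soft_target_cross_entropy(logits, target_dist):
    """Eq. (2): CE = -sum_c t_c * log(softmax(z)_c), averaged over the batch.

    logits:      raw scores z of shape (N, C) from the model.
    target_dist: ground-truth distributions t of shape (N, C); each row sums to 1.
    """
    log_probs = F.log_softmax(logits, dim=1)
    return -(target_dist * log_probs).sum(dim=1).mean()

# One-hot targets recover the usual -log(softmax(z)_k):
logits = torch.randn(4, 8)
labels = torch.tensor([0, 3, 5, 7])
one_hot = F.one_hot(labels, num_classes=8).float()
assert torch.allclose(soft_target_cross_entropy(logits, one_hot),
                      F.cross_entropy(logits, labels), atol=1e-6)
```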
The Label Smoothing (LS) loss function has been introduced in other studies [34], [35], and [36]. The formula for LS is given as equation 3. The main idea is the contribution of all incorrect labels. The parameter α is set to around 0.9, meaning that the contribution of the other labels is very small; e.g., for the FER task with |C| = 8, the weight for each of them is 0.1/8 ≈ 0.0125 and the weight for the correct label is 0.9125. Although the weight of the incorrect labels is small, LS has been used successfully in many classification tasks. The advantage of LS over CE with one-hot labels is that all label scores predicted by the model are activated. The backpropagation process can then learn not only how to increase the score for the correct label but also how to decrease the scores of the incorrect ones.

LS = −α · log(σ(z_k)) − (1 − α)/|C| · Σ_{c∈C} log(σ(z_c))   (3)

where:
• |C|: size of the label set
• α: parameter controlling the weight of each part

In the LS loss function, all labels except the correct one are treated equally, i.e., they have a small and identical role. LS can be used extensively in many tasks when there is no information about the distribution. However, in many tasks like FER, for a particular correct label, the confusion with the other classes is not uniform. The FER task has two advantages: the number of labels is small, just seven or eight, and, more importantly, we know that for a particular label, the confusion with some specific classes is higher than with others. For example, the correct label fear is much more likely to be confused with surprise than with disgust. Another example is the disgust expression, which can more easily be mistaken for neutral or sadness than for anger or fear. If we have this prior knowledge, the smoothing part should not be a uniform distribution. So, we propose an extended version of LS with additional prior knowledge of the label confusion, called PDLS. The PDLS loss function is composed of two parts, the one-hot part and the prior distribution, as shown in equation 4.

PDLS = − Σ_{c∈C} (t_c · α + d_kc · (1 − α)) · log(σ(z_c))   (4)

where:
• α is a parameter to control the weights of the one-hot and distribution parts.
• d_kc is the prior distribution for the correct label k and the confusion label c.

All notations in equation 4 are similar to those in equations 2 and 3. The d_kc value is the new operand in this formula, and it replaces the uniform distribution 1/|C| in the LS loss function. The d matrix has the following properties:

size = |C| × |C|
Σ_{c∈C} d_kc = 1, ∀k ∈ C
argmax(d_k1, d_k2, . . . , d_k|C|) = k, ∀k ∈ C

The most important part is how to calculate d_kc. Following Barsoum et al. [19], when correcting the labels of the FER2013 dataset [20], the authors of FER+ also provided the label distribution information for every sample. In FER+, each sample was labeled by ten people, who had to classify each image into the eight basic classes plus two additional classes, unknown and non-face. While the correct label distribution for each sample is difficult to obtain, we assume that the method used to build FER+ gives a good approximation of the ground-truth distribution. For each sample s ∈ S, where S is the FER+ dataset, we have the approximate distribution ad_s. Since unknown and non-face images are omitted, we only use the information for the eight basic emotions, denoted by E. Then ad_s is a vector in R^8, where 8 is the size |E|, and Σ ad_s = 1. Equation 5 calculates the average distribution for each ground-truth emotion k. In this calculation, we use only the training set of FER+.

d_k = (Σ_{s∈S_k} ad_s) / |S_k|   (5)

where:
• d_k: the average distribution for label k, d_k ∈ R^8
• |S_k|: the size of the subset S_k, S_k ⊂ S, whose ground-truth emotion is k, with ∪_{k∈E} S_k = S.

The final prior distribution d_k for the FER task is provided in table 2. Each row of the table is d_k, where k is one of the eight emotion labels. The columns are the confusion labels, again the eight emotion labels. For example, d_neutral,sadness = 0.114 means that when an image is neutral, there is an 11.4% chance of confusing it with sadness. The number on the main diagonal, which represents the distribution for the emotion itself, is always higher than 0.5. The happiness emotion is very clear and easy to detect: d_happiness,happiness = 0.918, whereas fear and disgust are difficult to detect and easy to confuse.
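Putting equations 4 and 5 together, the sketch below first estimates the prior matrix d by averaging the per-sample label distributions of the FER+ training set for each ground-truth class, and then mixes the one-hot target with the row d_k inside the cross-entropy. The tensor layout of the vote distributions and the function names are our assumptions; only the formulas follow the text.

```python
import torch
import torch.nn.functional as F

def estimate_prior_matrix(label_dists, labels, num_classes=8):
    """Eq. (5): d_k is the average of the approximate distributions ad_s over
    the training samples whose ground-truth emotion is k.

    label_dists: (S, C) tensor of per-sample label distributions (rows sum to 1),
                 e.g. normalized FER+ vote counts over the eight basic emotions.
    labels:      (S,) tensor of ground-truth class indices.
    """
    d = torch.zeros(num_classes, num_classes)
    for k in range(num_classes):
        d[k] = label_dists[labels == k].mean(dim=0)
    return d  # row k is d_k, the expected confusion profile of class k

def pdls_loss(logits, labels, d, alpha=0.9):
    """Eq. (4): PDLS = -sum_c (alpha * t_c + (1 - alpha) * d_kc) * log(softmax(z)_c)."""
    one_hot = F.one_hot(labels, num_classes=logits.size(1)).float()
    target = alpha * one_hot + (1.0 - alpha) * d[labels]   # (N, C) mixed targets
    log_probs = F.log_softmax(logits, dim=1)
    return -(target * log_probs).sum(dim=1).mean()
```

Setting d to the uniform matrix 1/|C| recovers ordinary label smoothing, which makes the relationship between equations 3 and 4 explicit.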
IV. DATASETS
There are three popular ITW datasets for the FER task, namely the FER+ [19], RAF-DB [11], [12], and AffectNet [13] datasets. In this study, the experiments are conducted on all of them. The eight discrete emotions for classification are neutral, happiness, surprise, sadness, anger, disgust, fear, and contempt. Some previous datasets and studies used only seven of them, excluding contempt because it is difficult and rare in the real world. The details of each dataset are given below.

FER+ dataset. The FER+ dataset [19] is the first ITW dataset among them. The original version is the
FER2013 dataset [20] by Goodfellow et al., released for the ICML 2013 Workshop on Challenges in Representation Learning. But as the labeling accuracy of the FER2013 dataset is not reliable, Barsoum et al. reassigned the labels [19]. Ten people manually assigned the basic emotion for each image in the FER2013 dataset. A subset of the original images was excluded if they were classified as unknown or non-face. The final emotion label was assigned based on the voting of the ten people. The number of people voting for each emotion for each image was also given, which was then used to calculate the approximate distribution of the emotions over that image.

The dataset includes all the images, each of which contains one person's aligned face. The dataset images were collected from the Internet by querying many related expression keywords. There are many kinds of faces in the real-world environment, and their pose and rotation make them more challenging to recognize. The images were aligned and centered, and they were scaled slightly differently. All images are low-resolution and in grayscale with a size of 48 × 48 pixels. The corresponding label for each image is also given. The eight basic emotions are used in this dataset.

TABLE 3. Number of images in training/testing/validation subsets of the FER+, RAF-DB, and AffectNet datasets.

Table 3 and figure 4 show the distribution of the train, test, and validation sets of the FER+ dataset. The number of neutral images is the highest: 9,030 in the train set and 1,102 in the test set. The disgust emotion has the lowest number of images: only 107 in train and 15 in test. The contempt emotion has a similar number of images to disgust: only 115 in train and 13 in test. Disgust, contempt, and fear have few images compared with the other five emotions. This is normal in natural communication, where people are usually in a neutral or happy state and only rarely experience disgust, contempt, or fear. Figure 4 shows that the distributions of emotions in the training, testing, and validation sets of the FER+ are similar.

RAF-DB dataset. Shan Li, Weihong Deng, and Jun-Ping Du provided the Real-world Affective Faces Database (RAF-DB) for emotion recognition [11], [12]. The dataset contains about 30,000 images downloaded from the Internet. About 40 trained annotators carefully labeled the images. The dataset has two parts: the single-label subset (basic emotions) and the two-label subset (compound emotions). We used the single-label subset with seven classes of basic emotions. This subset has 12,271 samples in the training set and 3,068 in the test set. The number of samples for each emotion is given in table 3. Notably, the RAF-DB dataset does not include the contempt expression. Figure 1 shows that image sizes in the RAF-DB vary from tiny to large, which makes it difficult for the DL model to deal with.

AffectNet dataset. AffectNet [13] is the largest dataset for the FER task. The dataset contains more than one million images queried from the Internet by using related expression keywords. About 450,000 images were manually annotated by trained persons. It also includes train, validation, and test sets. The test set has not yet been published, so most previous studies used the validation set as the test set [13], [37]–[40]. Because the contempt emotion is rare in the natural world, some studies [40] used only seven emotions, while other studies [13], [38], [39] analyzed all eight emotions. Another study used both eight and seven expressions [37]. Therefore, to compare our results with the previous studies, we performed experiments with both eight classes and seven classes.
Table 3 shows the number of samples for each emotion class in each subset (train, validation, and test) of the FER+, RAF-DB, and AffectNet datasets. The names they use for the labels are a little different but can be mapped to the eight basic emotions, as in the emotion column. The FER+ has three separate subsets for training, validation, and testing, while the two others have only two subsets. The AffectNet dataset has not published its testing subset, so, as in most studies on this dataset, the validation set is taken as the testing subset, and the validation subset used during the training process is randomly selected from the training subset. Similarly for the RAF-DB, the training subset is randomly split to obtain the training and validation subsets. Only AffectNet exhibits a balanced validation set (used as the test set), while the FER+ and RAF-DB are highly unbalanced. Both the FER+ and AffectNet datasets have eight emotion labels, while the RAF-DB has only seven emotion classes, without the contempt expression.

Figure 5 gives some sample images for each class from the three datasets. In this figure, each column presents one emotion expression. The images in the first two rows (figure 5a) are from the FER+ dataset, figure 5b is from RAF-DB, and the rest (figure 5c) are from AffectNet. The last column of RAF-DB is empty because the RAF-DB dataset has seven emotions without the contempt expression.

FIGURE 4. The FER+ data distribution of train/test/valid.

FIGURE 5. Sample images from the (a) FER+, (b) RAF-DB and (c) AffectNet datasets.

We use the Adam optimization algorithm [43] with an adaptive learning rate using the One Cycle Policy suggested by Smith [44]. The learning rates were set to 1e-3 for some later layers of the network and 1e-4 for the STN block. The lower learning rate for the STN, with its transformation, aims to keep this block changing little.

The validation set is used to optimize the hyper-parameters, and then we collect the best models. The hyper-parameters for all our experiments include the learning rate and the number of epochs at which the network gets the best result. Those models are then used to evaluate the test set. We apply Test Time Augmentation at the test step: eight randomly rotated and zoomed images are generated from each image and then passed through the model to get the raw scores for prediction. The final raw score is the average of their outputs.
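A minimal sketch of this test-time augmentation step: eight randomly rotated and zoomed copies of a test image are scored, and the raw outputs are averaged. The rotation and zoom ranges are illustrative assumptions; the text does not state the exact values.

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def tta_predict(model, image, n_aug=8, max_rotation=10.0, zoom_range=(0.9, 1.1)):
    """Average the raw model scores over randomly rotated/zoomed copies.

    image: a single tensor of shape (C, H, W); returns a (num_classes,) score vector.
    """
    scores = []
    for _ in range(n_aug):
        angle = (torch.rand(1).item() * 2 - 1) * max_rotation
        scale = zoom_range[0] + torch.rand(1).item() * (zoom_range[1] - zoom_range[0])
        augmented = TF.affine(image, angle=angle, translate=[0, 0],
                              scale=scale, shear=[0.0, 0.0])
        scores.append(model(augmented.unsqueeze(0)).squeeze(0))
    return torch.stack(scores).mean(dim=0)
```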
For basic emotion recognition, several metrics are used to evaluate the results. The first and most widely used metric is accuracy, or weighted accuracy (WA), which is the number of correct answers divided by the total number of test samples. But when the number of samples per class is highly unbalanced, WA may reflect the performance poorly, particularly for the FER task, because emotions in the real world are usually unbalanced: some emotions such as neutral, happy, or sad are more common than disgust, fear, or contempt. In this case, unweighted accuracy (UA) should be considered for the additional evaluation of the system. The UA metric is an unbiased version of WA: it is calculated as the average of the per-class accuracies. For comparison with other studies, both WA and UA are adopted in the experiments.

All experiments were run on Ubuntu 18.04 with 32 GB of RAM and a GeForce RTX 2080 Ti GPU with 11 GB of GPU RAM.
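For reference, both metrics can be computed directly from the predicted and true labels, as in this small sketch (WA is the overall accuracy; UA is the mean of the per-class accuracies).

```python
import numpy as np

def weighted_and_unweighted_accuracy(y_true, y_pred, num_classes=8):
    """WA: correct predictions / total samples.  UA: mean of per-class accuracies."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    wa = float((y_true == y_pred).mean())
    per_class = [float((y_pred[y_true == c] == c).mean())
                 for c in range(num_classes) if np.any(y_true == c)]
    ua = float(np.mean(per_class))
    return wa, ua

# Example: three classes, unbalanced test set.
wa, ua = weighted_and_unweighted_accuracy([0, 0, 0, 1, 2], [0, 0, 1, 1, 0], num_classes=3)
print(round(wa, 3), round(ua, 3))  # 0.6 0.556 -- WA favors the frequent class
```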
FIGURE 6. The confusion matrix on the test set for the RAF-DB, FER+ and AffectNet datasets.
TABLE 4. RAF-DB accuracy comparison (%).

TABLE 5. FER+ accuracy comparison (%).
89.75%. Compared to the best previous result in the literature, by Albanie et al. [48], our approach improves by 0.65%. The average accuracy of our proposed architecture is 69.54% and the F1 score (macro) is 74.88%. The low accuracy on disgust and fear makes the F1 score and average accuracy far lower than the overall average. Future work should consider focusing on increasing the number of samples of disgust and fear to improve the accuracy for these two expressions.
FIGURE 7. Cumulative accuracy by size on the test set of RAF-DB dataset with the VGG16 (base-line) and the PSR architecture.
Figure 6c shows the confusion matrix on the test set for the PSR architecture: happiness has the highest accuracy of 96%, followed by neutral, surprise, and anger. All four expressions had an accuracy above 90%. The lowest accuracy was for contempt, at 23%. Due to the lack of contempt images, the model could not learn to distinguish it from neutral, anger, or sadness. Some emotions have a high likelihood of wrong classification: fear is predicted as surprise in 37% of cases, disgust is classified as anger in 33%, and sadness is classified as neutral in 22%. These high levels of confusion are typical in the real world because, even for humans, it can be difficult to distinguish these pairs of emotions.

TABLE 6. AffectNet accuracy comparison (%).

3) AffectNet DATASET
We compared both eight and seven classes on the AffectNet dataset. Table 6 shows the results in classification accuracy (WA). In the classification of eight emotions, our model achieved an accuracy of 60.68%, outperforming the current state of the art of 59.58% achieved by Georgescu et al. [37]. In the seven-emotion task, our model achieved an accuracy of 63.77%, a slight improvement over the current highest result of 63.31% [37]. Figures 6b and 6d present the confusion matrices for AffectNet in the seven-class and eight-class tasks, respectively. The happy expression has the highest detection rate in both cases, followed by the fear emotion. Surprise, anger, and disgust have similar performance in both cases. In the eight-expression task, contempt has the lowest performance, at just 49%.

Figure 7 shows the cumulative accuracy according to the size of the original image for the base-line network and the PSR architecture. The PSR was run with the three branches [1, 2, 1] and the cutting point at the sixth convolution layer, with the original input size of 100 pixels. The image size ranged from 23 pixels to about 1200 pixels. Because the large images were resized to a fixed size of 100 pixels, we consider only those images smaller than 100 pixels to see how our approach is affected. We omitted the first twenty points because they are too unstable for calculating the accuracy. The figure shows that initially, with tiny image sizes of less than 40 pixels, both the base-line and the PSR are unstable. But after 40 pixels, the PSR architecture improves and works better than the base-line network. The PSR maintains this trend to the end of the dataset because, in our approach, we added the super-resolution module that doubles the size of a small image in one of the three branches, and another branch for the half size 100/2 = 50 improved the recognition accuracy.

Figure 8 shows the accuracy discrepancy by size between the PSR and VGG16 on the test set of the RAF-DB dataset. The blue points are raw values, and the yellow ones are the smoothed version. The accuracy discrepancy represents the speed of the improvement of the PSR over the baseline network. It is clear that the improvement had the highest speed when the
FIGURE 9. The accuracy by original image size on each branch of the PSR
without the STN block on the RAF-DB test-set.
FIGURE 8. The discrepancy of accuracy by size on the test set of the
RAF-DB dataset between PSR and baseline.
the step of the scale-up from the lowest resolution. The pyramid architecture views the input at several scales, but the step is an integer larger than one, and 2 is the starting value. However, the double scale is still a tremendous jump. While a scale of 1.2 is a good value for most augmentation techniques, and 1/1.2 (≈ 0.83) in the reverse case, we suggest that the scale step should be 1.2² = 1.44, or approximately 1.5. For a traditional algorithm, a decimal scaling value is possible, but it cannot be used with the DL approaches. The second weakness is the baseline network architecture. Although several network architectures more reliable than VGG16, such as ResNet [49] and SENet [50], have been reported, we chose VGG16 as the base network. Although our approach is general and we can apply many kinds of CNN, a re-implementation is needed for each base network. Our approach is not a simple module, so extra effort must be taken to implement it case by case. Exploring other baseline architectures is left for future work.

VI. CONCLUSION
In this study, we addressed the varying-image-size problem in the FER task for ITW datasets, where the original input image size varies. Although CNNs can work on images with a small rotation and scale change, they fail when the scale change is enormous. The main contribution of this study is the development of a pyramid network architecture with several branches, each of which works on one level of input scale. The proposed network is based on the VGG16 model, but it can be extended to other baseline network architectures. In the PSR architecture, the SR method is applied for up-scaling the low-resolution input. Experiments on three ITW FER datasets show that our proposed method outperforms all the current state-of-the-art methods.
REFERENCES
[1] A. Mehrabian, Nonverbal Communication. New Brunswick, NJ, USA: Aldine Transaction, 1972.
[2] P. Ekman, "Are there basic emotions?" Psychol. Rev., vol. 99, no. 3, pp. 550–553, 1992.
[3] P. Ekman, "Basic emotions," in Handbook of Cognition and Emotion. New York, NY, USA: Wiley, 1999, pp. 45–60.
[4] J. A. Russell, "A circumplex model of affect," J. Personality Social Psychol., vol. 39, no. 6, pp. 1161–1178, 1980.
[5] P. Ekman and W. Friesen, Facial Action Coding System, vol. 1. Mountain View, CA, USA: Consulting Psychologists Press, 1978.
[6] C. Shan, S. Gong, and P. W. McOwan, "Facial expression recognition based on local binary patterns: A comprehensive study," Image Vis. Comput., vol. 27, no. 6, pp. 803–816, May 2009.
[7] L. Ma and K. Khorasani, "Facial expression recognition using constructive feedforward neural networks," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 3, pp. 1588–1595, Jun. 2004.
[8] J. J. Lien, T. Kanade, J. F. Cohn, and C.-C. Li, "Automated facial expression recognition based on FACS action units," in Proc. 3rd IEEE Int. Conf. Autom. Face Gesture Recognit., 1998, pp. 390–395.
[9] T. Zhang, W. Zheng, Z. Cui, Y. Zong, J. Yan, and K. Yan, "A deep neural network-driven feature learning method for multi-view facial expression recognition," IEEE Trans. Multimedia, vol. 18, no. 12, pp. 2528–2536, Dec. 2016.
[10] P. S. Aleksic and A. K. Katsaggelos, "Automatic facial expression recognition using facial animation parameters and multistream HMMs," IEEE Trans. Inf. Forensics Security, vol. 1, no. 1, pp. 3–11, Mar. 2006.
[11] S. Li and W. Deng, "Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition," IEEE Trans. Image Process., vol. 28, no. 1, pp. 356–370, Jan. 2019.
[12] S. Li, W. Deng, and J. P. Du, "Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild," in Proc. 30th IEEE Conf. Comput. Vis. Pattern Recognit., Oct. 2017, pp. 2584–2593.
[13] A. Mollahosseini, B. Hasani, and M. H. Mahoor, "AffectNet: A database for facial expression, valence, and arousal computing in the wild," IEEE Trans. Affect. Comput., vol. 10, no. 1, pp. 18–31, Jan. 2019.
[14] P. Liu, S. Han, Z. Meng, and Y. Tong, "Facial expression recognition via a boosted deep belief network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1805–1812.
[15] K. Liu, M. Zhang, and Z. Pan, "Facial expression recognition with CNN ensemble," in Proc. Int. Conf. Cyberworlds (CW), Sep. 2016, pp. 163–166.
[16] C. Huang, "Combining convolutional neural networks for emotion recognition," in Proc. IEEE MIT Undergraduate Res. Technol. Conf. (URTC), Nov. 2017, pp. 1–4.
[17] N. Zeng, H. Zhang, B. Song, W. Liu, Y. Li, and A. M. Dobaie, "Facial expression recognition via learning deep sparse autoencoders," Neurocomputing, vol. 273, pp. 643–649, Jan. 2018.
[18] D. C. Tozadore, C. M. Ranieri, G. V. Nardari, R. A. F. Romero, and V. C. Guizilini, "Effects of emotion grouping for recognition in human-robot interactions," in Proc. 7th Brazilian Conf. Intell. Syst. (BRACIS), Oct. 2018, pp. 438–443.
[19] E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang, "Training deep networks for facial expression recognition with crowd-sourced label distribution," in Proc. 18th ACM Int. Conf. Multimodal Interact., 2016, pp. 279–283.
[20] I. J. Goodfellow, "Challenges in representation learning: A report on three machine learning contests," Neural Netw., vol. 64, pp. 59–63, Apr. 2015.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[22] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[23] C. Dong, "Learning a deep convolutional network for image super-resolution," in Computer Vision, vol. 8692, D. Fleet, Ed. Cham, Switzerland: Springer, 2014, pp. 184–199.
[24] J. Kim, J. K. Lee, and K. M. Lee, "Accurate image super-resolution using very deep convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1646–1654.
[25] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1874–1883.
[26] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4681–4690.
[27] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, "Enhanced deep residual networks for single image super-resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 1132–1140.
[28] Y. Hu, X. Gao, J. Li, Y. Huang, and H. Wang, "Single image super-resolution via cascaded multi-scale cross network," 2018, arXiv:1802.08808. [Online]. Available: http://arxiv.org/abs/1802.08808
[29] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2017–2025.
[30] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, "Deformable convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 764–773.
[31] M. Hu, H. Wang, X. Wang, J. Yang, and R. Wang, "Video facial emotion recognition based on local enhanced motion history image and CNN-CTSLSTM networks," J. Vis. Commun. Image Represent., vol. 59, pp. 176–185, Feb. 2019.
[32] S. Li, W. Zheng, Y. Zong, C. Lu, C. Tang, X. Jiang, J. Liu, and W. Xia, "Bi-modality fusion for emotion recognition in the wild," in Proc. Int. Conf. Multimodal Interact., Oct. 2019, pp. 589–594.
[33] A. Sepas-Moghaddam, A. Etemad, F. Pereira, and P. L. Correia, "Facial emotion recognition using light field images with deep attention-based bidirectional LSTM," in Proc. ICASSP - IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 3367–3371.
[34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[35] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8697–8710.
[36] R. Müller, S. Kornblith, and G. Hinton, "When does label smoothing help?" in Proc. NIPS, 2019, pp. 4694–4703.
[37] M.-I. Georgescu, R. T. Ionescu, and M. Popescu, "Local learning with deep and handcrafted features for facial expression recognition," IEEE Access, vol. 7, pp. 64827–64836, 2018.
[38] J. Zeng, S. Shan, and X. Chen, "Facial expression recognition with inconsistently annotated datasets," in Proc. Eur. Conf. Comput. Vis., vol. 11217, 2018, pp. 227–243.
[39] K. Wang, X. Peng, J. Yang, D. Meng, and Y. Qiao, "Region attention networks for pose and occlusion robust facial expression recognition," IEEE Trans. Image Process., vol. 29, pp. 4057–4069, 2020.
[40] W. Hua, F. Dai, L. Huang, J. Xiong, and G. Gui, "HERO: Human emotions recognition for realizing intelligent Internet of Things," IEEE Access, vol. 7, pp. 24321–24332, 2019.
[41] J. Howard. (2018). Fastai. [Online]. Available: https://github.com/fastai/fastai
[42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in Proc. NIPS, 2017, pp. 1–4.
[43] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980
[44] L. N. Smith, "Cyclical learning rates for training neural networks," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2017, pp. 464–472.
[45] J. Cai, Z. Meng, A. S. Khan, Z. Li, J. O'Reilly, and Y. Tong, "Probabilistic attribute tree in convolutional neural networks for facial expression recognition," 2018, arXiv:1812.07067. [Online]. Available: http://arxiv.org/abs/1812.07067
[46] Y. Fan, J. C. Lam, and V. O. Li, "Multi-region ensemble convolutional neural network for facial expression recognition," in Artificial Neural Networks and Machine Learning (Lecture Notes in Computer Science), vol. 11139. Berlin, Germany: Springer, 2018, pp. 84–94.
[47] F. Lin, R. Hong, W. Zhou, and H. Li, "Facial expression recognition with data augmentation and compact feature learning," in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 1957–1961.
[48] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman, "Emotion recognition in speech using cross-modal transfer in the wild," in Proc. ACM Multimedia Conf., 2018, pp. 292–301.
[49] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[50] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-excitation networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 8, pp. 2011–2023, Aug. 2020.

THANH-HUNG VO received the B.Eng. degree from the Ho Chi Minh City University of Technology, in 2010, and the M.Eng. degree from Vietnam National University, Vietnam, in 2013, all in computer science. He is currently pursuing the Ph.D. degree with the Pattern Recognition Laboratory, School of Electronics and Computer Engineering, Chonnam National University, South Korea. Since 2011, he has been working as a Lecturer with the Ho Chi Minh City University of Technology. His research interests include natural language processing, speech, and computer vision applying machine learning, and deep learning techniques.

GUEE-SANG LEE (Member, IEEE) received the B.S. degree in electrical engineering and the M.S. degree in computer engineering from Seoul National University, South Korea, in 1980 and 1982, respectively, and the Ph.D. degree in computer science from Pennsylvania State University, in 1991. He is currently a Professor with the Department of Electronics and Computer Engineering, Chonnam National University, South Korea. His primary research interests include image processing, computer vision, and video technology.

HYUNG-JEONG YANG (Member, IEEE) received the B.S., M.S., and Ph.D. degrees from Chonbuk National University, South Korea. She is currently a Professor with the Department of Electronics and Computer Engineering, Chonnam National University, Gwangju, South Korea. Her main research interests include multimedia data mining, medical data analysis, social network service data mining, and video data understanding.

SOO-HYUNG KIM (Member, IEEE) received the B.S. degree in computer engineering from Seoul National University, in 1986, and the M.S. and Ph.D. degrees in computer science from the Korea Advanced Institute of Science and Technology, in 1988 and 1993, respectively. Since 1997, he has been a Professor with the School of Electronics and Computer Engineering, Chonnam National University, South Korea. His research interests include pattern recognition, document image processing, medical image processing, and deep learning applications.