
A Survey on Learning-Based Approaches for Modeling and Classification of Human–Machine Dialog Systems

Fuwei Cui, Qian Cui, and Yongduan Song, Fellow, IEEE

Abstract— With the rapid development from traditional machine learning (ML) to deep learning (DL) and reinforcement learning (RL), dialog systems equipped with learning mechanisms have become the most effective solution to human–machine interaction problems. The purpose of this article is to provide a comprehensive survey of learning-based human–machine dialog systems, with a focus on the various dialog models. More specifically, we first introduce the fundamental process of establishing a dialog model. Second, we examine the features and classification of dialog models, expound some representative models, and compare the advantages and disadvantages of the different dialog models. Third, we survey the data sets and evaluation metrics commonly used for dialog models. Furthermore, the evaluation metrics of these dialog models are analyzed in detail. Finally, we briefly analyze the existing issues and point out potential future directions for human–machine dialog systems.

Index Terms— Artificial intelligence (AI), deep learning (DL), dialog model, machine learning (ML), reinforcement learning (RL), sequence to sequence (Seq2Seq) model.
Fig. 1. Composition of dialog system.

I. INTRODUCTION

One of the original research purposes of the human–machine dialog system is to pass the Turing test, and human beings have been studying dialog systems for half a century. Early dialog systems were based on artificial rules, such as Eliza (1966) [1], Parry (1975) [2], which passed the Turing test, and Alice (2009) [3], which has won the Loebner Prize three times. Although rule-based dialog systems have achieved good results, establishing the rules is laborious and their transferability is poor. Most importantly, a large number of rules eventually makes the software system either too costly or seldom practical.

With the development of speech recognition [4]–[6], speech synthesis [7], [8], natural language processing [9], [10], and information retrieval (IR) [11], [12], and especially deep learning (DL) [13]–[15] and reinforcement learning (RL) [54], data-driven models that use DL or RL have been proposed, such as IR models, generation models, RL models, and hybrid models. Many human–machine dialog products have since emerged, such as Cortana and Microsoft Xiaobing in 2014, Baidu Duer and Ali Xiaomi in 2015, Apple Siri and Google Assistant in 2016, and Tencent Dingdang in 2017. Although existing dialog systems can communicate with human beings on some occasions, they are not yet intelligent enough, calling for much stronger artificial intelligence (AI). Therefore, it is crucial to carry out extensive research on dialog systems, which makes it necessary to have a general grasp of the current research situation in this area.

The dialog system is generally composed of speech recognition, natural language processing, and speech synthesis modules (early template-based dialog models had no speech recognition or speech synthesis module), as conceptually shown in Fig. 1. The speech recognition module converts the human speech signal into a text signal for the natural language understanding module. The natural language understanding module feeds the transformed text signal into the dialog model, recognizes the human intention, and generates the corresponding reply. Finally, the speech synthesis module converts the text returned by the natural language understanding module into a speech signal and outputs it.

The core of the dialog system is the dialog model, and building the dialog model is the main task of this module. In this article, we review the current research status of dialog models, including their construction, classification, data sets, evaluation metrics, analysis of the evaluation metrics, challenges, and possible future research directions, and attempt to sketch a comprehensive and clear outline for the study of dialog models, in order to provide a useful reference for related research in this field. For easy reference, terminologies and abbreviations that appear more than once in the article are listed in Table I.

TABLE I
TERMINOLOGIES AND ABBREVIATIONS

II. PROCESS OF ESTABLISHING A DIALOG MODEL

There are two typical methods for building a dialog model, i.e., nondata-driven and data-driven. The general process of building a dialog model is illustrated in Fig. 2.

Fig. 2. Flow chart of building a dialog model.

To build a dialog model with the nondata-driven method, we should first be familiar with the business scenarios in which the dialog model will be applied, then extract the corresponding rules through business analysis, and finally integrate all the rules to build a template (a template is also called a conversation model).

To build a data-driven dialog model, we first need to prepare a corpus, which is the basis of the dialog model; note that the quality of the corpus directly affects the training of the model. There are two ways to develop a corpus: one is to use an open corpus from the Internet directly (Section IV-A lists some corpora commonly used for dialog model training), and the other is to crawl data from the Internet. Second, we need to preprocess the data; the main operations include removing stop words, word segmentation (not needed for English), and so on. Next, the processed data are fed into the dialog model for training, and different dialog models are selected according to different business scenarios. Finally, the trained dialog model is used for prediction: it receives the user's input and generates the response.

Having introduced the process of building the dialog model in this section, we are ready to discuss the core of the dialog model and, in the sequel, analyze and compare the basic principles, research status, advantages and disadvantages, existing problems, and possible future research directions of the existing dialog models.

III. CLASSIFICATION OF DIALOG MODELS

As mentioned above, the dialog system has become an increasingly active research area in natural language processing. According to the implementation technology, we can classify most existing models into those based on an artificial template and those based on IR, generation, RL, or a combination of them. The classification of dialog models is shown in Fig. 3 and is discussed and analyzed in what follows.

A. Dialog Model Based on Nondata-Driven Methods

In the early stages of dialog system development, computer and storage technology were limited and corresponding data sets were lacking. Dialog systems in this period could only be constructed by the nondata-driven method, which generally uses traditional machine learning (ML) algorithms, such as pattern matching algorithms.
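Before turning to the specific matching algorithms, the template idea itself can be sketched as a few hand-written patterns mapped to canned replies. The rules, replies, and normalization step below are invented purely for illustration of the general approach; they are not the implementation of Eliza, Parry, or Alice.

```python
import random
import re

# A toy rule base: each rule is a regular expression paired with canned replies.
RULES = [
    (re.compile(r"\b(hola|hello|hi|hey)\b", re.I), ["Hello!", "Hi there!", "Hey!"]),
    (re.compile(r"\bhow are you\b", re.I),         ["I am fine, thank you.", "Doing well!"]),
]
DEFAULT_REPLIES = ["Sorry, I do not understand."]

def template_reply(user_input: str) -> str:
    """Return a canned reply for the first matching rule, else a default reply."""
    normalized = user_input.replace("I'm", "I am")   # toy question normalization
    for pattern, replies in RULES:
        if pattern.search(normalized):
            return random.choice(replies)
    return random.choice(DEFAULT_REPLIES)

print(template_reply("hey, how's it going?"))  # -> one of the preset greeting replies
```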

Fig. 3. Classification of dialog models.

The commonly used single-pattern matching algorithms are the Boyer–Moore (BM) [81] algorithm and the KMP algorithm (by Knuth, Morris, and Pratt) [82], and the commonly used multipattern matching algorithms are the Wu–Manber (WM) [83] algorithm and the Aho–Corasick (AC) [84] algorithm.

The dialog model based on the nondata-driven method is mainly the template-based dialog model, which is realized by setting rules manually. When the user's question matches a preset rule, the corresponding response is triggered. For example, when a user greets a dialog robot with "hola, hello, hi, hey," the robot randomly selects one of several preset greetings as the response. The main workflow of the template-based dialog model includes receiving user input, question normalization, question query reasoning, and template processing. Receiving user input mainly obtains the text signal; question normalization replaces strings that need to be replaced in the question, such as replacing "I'm" with "I am"; question query reasoning matches the normalized question against the rules in the rule base to obtain the best matching result; and template processing handles the special tags in the matching result to produce the final response, which is then returned to the user.

Early dialog models were mainly template-based, such as Eliza mentioned earlier, whose template was written as a dialog script composed of keywords and corresponding transformation rules. When there is user input, Eliza first checks the keywords in the input statement and selects the highest-ranked keyword, then finds the corresponding transformation rule and generates the response statement through the rule. Eliza was used in the field of psychological counseling, acting like a counselor, and it demonstrated the possibility of template-based communication between human and machine.

Artificial linguistic internet computer entity (Alice), a more recent template-based dialog system, is written in the AI markup language (AIML) [16], an XML language for creating robot dialog rules. The language adopts the "stimulus–response" theory and is implemented in Java. More specifically, there are some basic processes: after receiving the user's input, Alice first extracts keywords from the input statement, replaces them, and removes noise; it then matches keywords through rules and locates the position of the question in the template; finally, it obtains the response through the template. Although Alice can understand context and expand its knowledge easily, it has limitations in dealing with synonyms.

In a word, the template-based model has the advantages of high controllability and reliability. However, it needs manually established rules, which is time-consuming and costly; it adapts poorly to changes in user wording; and it requires designers to know the real business scenarios very well. When facing new business scenarios, it needs to be redesigned, so its portability is poor. In particular, with an increasing amount of data, the rules in the knowledge base may conflict and cause the system to collapse.

B. Dialog Model Based on Data-Driven

The data-driven dialog model needs a dialog data set, also called a corpus, whose data are processed to train the model. Traditional ML, DL, or RL algorithms are often used for training, learning information from the data set with specific algorithms or neural networks.

The traditional ML algorithms, as shown in Fig. 4, are mainly text matching algorithms, which match the similarity between two texts by extracting keywords or semantic information.

Fig. 4. Traditional ML algorithms used in dialog models (SVD is singular value decomposition).

The DL algorithms, as shown in Fig. 5, transform the initial "low-level" feature representation into a "high-level" feature representation by constructing a multilayer artificial neural network that extracts and filters the input information layer by layer. This resembles the way human neurons transmit information through neural networks and reflects the human ability of abstract learning. The commonly used artificial neural networks are the convolutional neural network (CNN), the recurrent neural network (RNN), and the deep belief network (DBN).

Fig. 5. DL algorithms used in dialog models (deep neural networks contain deep CNN, RNN, DBN, and other deep neural networks).

Fig. 6. RL algorithms used in dialog models.

The RL algorithm, as shown in Fig. 6, imitates the learning behavior of humans or animals and originates from the law of effect in behavioral psychology. By interacting with the environment through trial and error, it learns how to reach the best state and action in order to obtain the greatest reward. The theoretical basis of RL is the Markov decision process (MDP) [46], and the key elements of RL are action, state, policy, and reward.
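The MDP elements just listed (state, action, policy, reward) can be made concrete with a minimal tabular Q-learning sketch on a toy chain environment. The environment, rewards, and hyperparameters below are invented for illustration only and have nothing to do with any particular dialog system.

```python
import random

# Toy MDP: states 0..3 on a line; action 0 moves left, action 1 moves right.
# Reaching state 3 gives reward 1 and ends the episode.
N_STATES, ACTIONS, GOAL = 4, (0, 1), 3
Q = [[0.0, 0.0] for _ in range(N_STATES)]            # Q-table: Q[state][action]
alpha, gamma, epsilon = 0.5, 0.9, 0.3                 # learning rate, discount, exploration rate

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

for _ in range(500):                                   # trial-and-error episodes
    s = random.randrange(N_STATES - 1)                 # random non-goal start state
    for _ in range(20):                                # cap the episode length
        a = random.choice(ACTIONS) if random.random() < epsilon else max(ACTIONS, key=lambda x: Q[s][x])
        s2, r, done = step(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])   # Q-learning (Bellman) update
        s = s2
        if done:
            break

print([max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)])  # greedy policy, mostly 1 ("right")
```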


1) Model Based on IR: The IR-based model is widely used in industrial production. More specifically, the principle of the model is: first, extract the keywords or a semantic representation of the question; then match its similarity against the questions in the corpus; and finally, select the response corresponding to the most similar question as the final output by means of a ranking algorithm. Therefore, the core problem of the IR-based model can be abstracted as a text-matching problem. The commonly used text matching methods include: 1) text matching based on keywords, where the commonly used keyword weightings are term frequency–inverse document frequency (TF-IDF) and best matching (BM25); keyword-based matching, however, cannot use the semantic information of the text; 2) text matching based on shallow semantics, such as latent semantic analysis (LSA) or latent semantic indexing (LSI), which can solve the problem of synonymy at the semantic level but cannot solve the problem of polysemy and ignores word order; and 3) text semantic matching based on DL, which mainly includes representation-based matching and interaction-based matching. Next, we mainly introduce text semantic matching based on DL. A summary of the IR-based dialog models is given in Table II, and a summary of the evaluation metrics they use is shown in Fig. 7.

The representation-based matching method first generates a representation of the text and then calculates the matching degree. Huang et al. [17] propose a deep structured semantic model (DSSM), which maps high-dimensional sparse text features to low-dimensional dense features by using deep neural networks, i.e., replacing the traditional bag-of-words model with word hashing, so as to achieve dimensionality reduction. Although the model extracts a sentence-granularity semantic vector well, the sentence-granularity representation is rather coarse, and the temporal relationship between words is not considered. Wan et al. [18] propose a multiview long short-term memory (MV-LSTM) model, which first extracts positional representations of the sentences with a bidirectional LSTM. Then, the degree of matching between the two sentences is calculated with an interaction tensor and stored in a matching matrix; next, the top k interactions of the matching matrix are chosen by k-max pooling; finally, the matching score is computed by a multilayer perceptron. The model can describe sentence information at fine granularity and extract temporal information of the sentences. Furthermore, the parameterized similarity function in (1) is used to calculate similarity, which captures more diverse interaction information between two positional sentence representations than the cosine or bilinear similarity functions, because the result of s(·) is an interaction tensor that can represent more information than the interaction matrices produced by the cosine or bilinear functions; it is therefore more in line with the diversity of language

s(\mu, \nu) = f\left(\mu^{T} M^{[1:c]} \nu + W_{\mu\nu} \begin{bmatrix} \mu \\ \nu \end{bmatrix} + b\right)   (1)

where s(·) is the similarity function, f(z) = max(0, z), \mu and \nu are the two vectors, M^{i} (i \in [1:c]) is one slice of the tensor parameter, and W_{\mu\nu} and b are the parameters of the linear part.

The interaction-based matching method calculates matching features directly and then extracts deep matching information on the basis of these matching features. Pang et al. [19] propose a match pyramid (MP) model, which constructs a matching matrix in three ways and regards the matching matrix as a picture; a CNN is then used to convolve and pool the matching matrix to extract features. Specifically, the three ways are the indicator function, the dot product, and cosine similarity. By considering the relationship between the words of the two sentences in various ways, the MP model can extract information of different granularity and improve the matching ability of the model. Wu et al. [20] propose a sequential matching network (SMN) model. First, matching matrices are constructed at word and sentence granularity and important matching information is extracted through convolution and pooling; then, noise is filtered through a gated recurrent unit (GRU) layer; finally, a matching score is obtained through a hidden layer. The model thus takes full account of the dialog relations and important context information, extracts features at word and sentence granularity, and achieves good results on multiturn dialog. Zhou et al. [21] propose a deep attention matching (DAM) model, inspired by the Transformer [22]. First, self-attention and cross-attention are used to extract matching information between the context and the response, from the word level to the sentence level. Then, the matching information is aggregated into a 3-D matching image, and after convolution and pooling, matching information is further extracted; finally, the matching score is calculated by a single-layer perceptron. In particular, this model breaks away from the earlier RNN and CNN structures and achieves the best results on multiturn dialog.
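As a minimal illustration of the interaction-based idea shared by models such as MP and SMN, the sketch below builds a word-by-word cosine matching matrix from word vectors. The tiny vocabulary and random vectors are hypothetical stand-ins for real pretrained embeddings, and the pooling step is a crude simplification of the CNN used in the actual models.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical word embeddings (in practice, word2vec/GloVe vectors would be used).
vocab = {w: rng.normal(size=50) for w in "how do i reset my password account".split()}

def embed(sentence):
    return np.stack([vocab[w] for w in sentence.split() if w in vocab])

def matching_matrix(query, candidate):
    """M[i, j] = cosine similarity between word i of the query and word j of the candidate."""
    q, c = embed(query), embed(candidate)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return q @ c.T

M = matching_matrix("reset my password", "how do i reset my account")
# A CNN would normally convolve and pool M; here k-max pooling over the flattened
# matrix stands in as a very rough aggregate matching score.
score = np.sort(M.ravel())[-3:].mean()
print(M.shape, round(float(score), 3))
```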


TABLE II
SUMMARY OF THE IR-BASED DIALOG MODELS

Fig. 7. Summary of the use of evaluation metrics in IR-based dialog models. The abscissa is the evaluation metric; the ordinate is the number of times each metric appears in the surveyed articles; the legend on the right gives the reference numbers.

In a word, text semantic matching based on DL, which outperforms the former two methods, can not only extract semantic information at different granularity but also take the temporal characteristics of sentences into account.

A few years ago, models applying generative adversarial networks (GANs) to IR-based dialog emerged [24]. Wang et al. [23] propose an information retrieval GAN (IRGAN) model, inspired by GAN, which unifies the generative retrieval model and the discriminative retrieval model, overcoming their major shortcomings, and optimizes the two models iteratively with a minimax algorithm. It is worth mentioning that this model provides a new research idea for the development of retrieval models and surpasses strong baseline models in web search, recommendation, and question answering applications.

The main reasons for the wide utilization of the IR-based model are that, on the one hand, it does not need a manually built template, which saves time and effort, and, on the other hand, its responses are fluent and logical, with no grammatical errors. However, the model cannot deal with questions that do not exist in the corpus and is hard to apply to open-domain dialogs.

2) Model Based on Generation: The generation-based model normally adopts the sequence-to-sequence (Seq2Seq) structure [25], which generally includes an encoder and a decoder: the encoder encodes the input question and extracts its semantic information, and the decoder uses the extracted semantic information to generate the reply. The encoder and decoder are generally composed of RNNs, typically LSTM [26] or GRU [27] units. A summary of the generation-based dialog models is given in Table III, and a summary of the evaluation metrics they use is shown in Fig. 8.

The generation-based model is usually trained on a short-text or question–answer corpus, starting from the basic Seq2Seq model. Shang et al. [28] propose a neural responding machine (NRM) for short-text conversation, which is based on the neural machine translation (NMT) [29] model with a classical attention mechanism. At the decoder end of the Seq2Seq model, a global strategy, a local strategy, and a hybrid strategy are proposed, and the improved model with the three strategies is compared with previous IR [30] and statistical machine translation (SMT) [31] approaches on Chinese Weibo data.
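As a minimal sketch of the encoder–decoder structure described above, the following PyTorch snippet encodes a token-id "question" with a GRU and greedily decodes a fixed-length reply. The vocabulary size, dimensions, and special-token id are hypothetical, and training (teacher forcing, cross-entropy loss) is omitted; this is not the NRM or any specific published model.

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID, BOS = 1000, 64, 128, 1   # hypothetical vocabulary size and special-token id

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)   # encodes the question
        self.decoder = nn.GRU(EMB, HID, batch_first=True)   # generates the reply
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, src, max_len=10):
        _, h = self.encoder(self.emb(src))                   # h: semantic summary of the input
        tok = torch.full((src.size(0), 1), BOS, dtype=torch.long)
        replies = []
        for _ in range(max_len):                              # greedy decoding, one token at a time
            dec_out, h = self.decoder(self.emb(tok), h)
            tok = self.out(dec_out[:, -1]).argmax(-1, keepdim=True)
            replies.append(tok)
        return torch.cat(replies, dim=1)

model = Seq2Seq()
question = torch.randint(2, VOCAB, (1, 6))                    # a toy token-id "question"
print(model(question).shape)                                  # -> torch.Size([1, 10])
```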


TABLE III
SUMMARY OF THE GENERATION-BASED DIALOG MODELS

Fig. 8. Summary of the use of evaluation metrics in generation-based dialog models.

The evaluation results show that the results of the IR-based model are similar to those of the model with the global strategy, and that the model with the hybrid strategy performs best. However, the model only handles single-turn dialog and does not incorporate intention or emotional information.

Because the responses generated by a dialog system based on the basic Seq2Seq model are too simple, contain too little information, and tend to be generic, some scholars have begun to explore the diversity and richness of replies. As is well known, the traditional Seq2Seq model uses the beam search (BS) algorithm when choosing the response. Although this algorithm reduces the space and time occupied by the search, the differences between the sentences output by BS are very small, which cannot reflect the diversity of language. Li et al. [32] introduce penalty factors into the BS of the traditional Seq2Seq model to influence the ranking results; at the same time, RL is used to automatically adjust the diversity rate for different inputs, which makes the outputs more diverse. Vijayakumar et al. [33] propose a diverse BS algorithm, which first divides the beam width into groups and then guarantees the difference between groups by adding a diversity penalty, so that the generated replies remain diverse. Shao et al. [34] add a self-attention unit to ensure the length and coherence of the conversation, use a stochastic BS algorithm to search candidate conversations in the solution space, and then rerank them to obtain the final results. These three articles achieve good results on the diversity of replies, but replies not only need to be diverse; they also need to be meaningful and to avoid universal responses.
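As a minimal sketch of how a diversity penalty can be folded into beam search scoring, the following pure-Python example expands beams over a toy next-token distribution and penalizes candidates whose last token was already chosen by a higher-ranked beam in the same step. The scoring function, toy distribution, and penalty weight are hypothetical simplifications, not the exact methods of [32]–[34].

```python
import math

# Toy "model": next-token log-probabilities that ignore history (hypothetical numbers).
NEXT_LOGP = {"i": math.log(0.4), "do": math.log(0.3), "not": math.log(0.2), "know": math.log(0.1)}

def diverse_beam_search(beam_width=3, steps=3, diversity_weight=0.5):
    beams = [([], 0.0)]                                    # (token sequence, cumulative log-prob)
    for _ in range(steps):
        # Expand every beam with every possible next token.
        candidates = [(seq + [tok], score + logp)
                      for seq, score in beams
                      for tok, logp in NEXT_LOGP.items()]
        # Greedy diverse selection: pick beams one by one, penalizing candidates whose
        # last token was already produced by a previously selected beam at this step.
        selected, used = [], set()
        while len(selected) < beam_width and candidates:
            best = max(candidates,
                       key=lambda c: c[1] - (diversity_weight if c[0][-1] in used else 0.0))
            candidates.remove(best)
            penalty = diversity_weight if best[0][-1] in used else 0.0
            selected.append((best[0], best[1] - penalty))
            used.add(best[0][-1])
        beams = selected
    return beams

for seq, score in diverse_beam_search():
    print(" ".join(seq), round(score, 3))
```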


Mou et al. [35] propose a sequence to backward and forward sequences (Seq2BF) model, which consists of three parts. The first part predicts a keyword, using pointwise mutual information (PMI) to predict a noun as the keyword; the second part is a backward Seq2Seq model, which generates the first half of the reply backward from the keyword; and the last part is a forward Seq2Seq model, which reverses the first half produced by the second part and takes it as input to generate the second half of the reply. The model can therefore explicitly generate replies containing the keyword, and the reply content carries more accurate information. Xing et al. [36] propose a topic-aware sequence-to-sequence (TA-Seq2Seq) model, an improvement on the topic augmented joint attention-based Seq2Seq (TAJA-Seq2Seq) [37] model, which combines topic attention and information attention to influence the decoding output; the topics are captured with the Twitter LDA model. Because prior topic information is used when decoding the first output token, the quality of the first token is higher, and since the quality of the first token affects the decoding quality of the whole sentence, the quality of the generated response is higher. Moreover, the Twitter LDA model is trained on data other than the training set, which increases the diversity and informativeness of the responses.

Fig. 9. Emotional chatting machine (ECM) model.

Human-to-human communication is emotional, so human-to-machine communication should also be emotional. Zhou et al. [38] propose an emotional chatting machine (ECM) model based on memory networks, which is the pioneering work on introducing emotional factors into a DL-based generative dialog system. The model generates responses with context information and emotional information through emotion embedding, internal memory, and external memory, as shown in Fig. 9. More specifically, the emotion embedding part places the emotion category embedding vectors at the same level as the word embeddings in the decoder and embeds them statically in the Seq2Seq model; the internal memory part memorizes the emotional state dynamically; and the external memory part introduces an external emotional memory mechanism so that the model can select words from an external emotional dictionary or a nonemotional dictionary. Asghar et al. [39] make three improvements on the traditional Seq2Seq model to introduce emotion into the dialog model: (1) adding emotional information to the word embeddings; (2) designing a loss function with emotional factors; and (3) designing emotionally diverse decoding, mainly by improving BS to affectively diverse beam search (ADBS). This work builds on the ECM model and the Affect-LM model [40]: it not only improves Affect-LM, which considers only the language model, but also explores emotional factors in decoding and improves ECM's method of specifying emotions through emotional embedding, making it more realistic.

Most of the above articles concern single-turn dialogs, whereas normal dialogs often span multiple turns; therefore, the study of multiturn dialog is also a research hotspot of dialog systems. Serban et al. [41] extend the hierarchical recurrent encoder–decoder (HRED) [42] model and apply it to multiturn dialog. The model adds a context hidden layer between the encoding layer and the decoding layer of the traditional Seq2Seq model, which stores and transmits context information for use when generating a new turn of dialog. However, the responses of this model are relatively monotonous. Therefore, Serban et al. [43] improve the HRED model and propose a latent variable hierarchical recurrent encoder–decoder (VHRED) model, which introduces a Gaussian random variable into the context hidden layer to increase the diversity of responses. In addition, Yao et al. [44] propose an attention with intention (AWI) model inspired by the theory of discourse structure [45], which includes linguistic structure, intention structure, and attentional state. The AWI model also adopts the idea of layering; unlike the HRED and VHRED models, it adds an attention mechanism so that the model pays more attention to important information, thus improving the effect of the model. Recently, latent variable models based on variational autoencoders (VAEs) [47] have shown better performance for dialog generation. Some researchers use VAEs with hierarchical RNNs for multiturn dialog generation and obtain better performance, but VAEs suffer from the degeneration problem. To solve it, Park et al. [48] propose a variational hierarchical conversation RNN (VHCR) model, which alleviates the degeneration problem to some extent by exploiting an utterance-drop regularization. Shen et al. [49] make improvements on VHCR and propose a conversational semantic relationship RNN (CSRR) model, which can generate queries and responses that are consistent in topic but different in content.

The advantage of the generation-based model is that it does not need manual labeling or feature extraction and can generate reasonable responses to questions that do not exist in the corpus, because most generation models use neural networks, which have a certain learning ability. However, this learning ability still needs to be improved, so the fluency and logic of the generated replies are not always good; the generation model also makes grammatical errors easily and is difficult to train.

3) Model Based on RL: RL is widely used in robots [50], [51], games [52], [53], and network security [54]. In a dialog system, the action refers to generating the dialog, the state refers to the human–machine conversation so far, the strategy refers to deciding what kind of dialog response to produce according to the current state, and the reward refers to the evaluation of the dialog outcome.
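To illustrate how a scalar reward on a sampled reply can drive a policy-gradient (REINFORCE-style) update of a generator, the sketch below treats a tiny categorical "response policy" as the model and a hand-written reward function as the evaluator. Both are hypothetical toys standing in for a full Seq2Seq policy and for the learned or hand-designed rewards used in work such as [55].

```python
import torch

# Toy "policy": a categorical distribution over three canned replies.
REPLIES = ["i don't know", "here is the document you asked for", "please say that again"]
logits = torch.zeros(3, requires_grad=True)                  # policy parameters
optimizer = torch.optim.SGD([logits], lr=0.5)

def reward(reply: str) -> float:
    # Hypothetical reward: discourage dull, generic responses (cf. the returns designed in [55]).
    return 0.1 if reply == "i don't know" else 1.0

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                                    # sample a reply (the RL "action")
    loss = -dist.log_prob(action) * reward(REPLIES[action])   # REINFORCE: -log prob * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))   # probability mass shifts away from the dull reply
```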


RL is generally applied to dialog state tracking and dialog strategy selection, which belong to the natural language processing module of a dialog system. By combining RL with a traditional dialog model, the dialog effect of the model can be improved. Li et al. [55] use deep RL to address multiturn dialog: by combining deep RL with the traditional Seq2Seq model and designing three kinds of rewards, they tackle the problems of dull, repetitive, and ungrammatical responses, avoiding the tendency of the Seq2Seq model to generate generic and safe responses and making the generated responses richer and more diverse.

Since training RL is a process of finding the optimum through trial and error, the convergence of the model can be slow, so improving the learning efficiency of RL is very important. Lipton et al. [56] propose a Bayes-by-backprop Q-network (BBQN) model, based on the deep Q-network (DQN) [57], to address convergence speed. The uncertainty of the Q value is handled by combining a Bayesian neural network, and the BBQN model significantly improves the exploration speed of deep Q-learning agents in the dialog system. In the same vein, Zhao and Eskenazi [58] use an end-to-end model and deep RL to address language understanding and dialog state strategy selection in a task-oriented dialog system, and propose a hybrid algorithm combining RL and supervised learning to speed up learning.

The advantage of the RL model is that it can remove the dependence among the multiple modules of a pipeline dialog model, works well for multiturn dialog, and generates rich and logical responses. However, the exploration efficiency of RL still needs to be improved.

4) Model Based on Hybrid: The hybrid model integrates several of the template-based, IR-based, generation-based, and RL-based models, together with external knowledge, so as to exert their respective advantages and improve the overall effect of the model.

Qiu et al. [59] integrate the IR-based model and the generation-based model into one dialog model, which includes three parts: an IR-based model, a generation-based model, and a rerank model. First, the IR-based model retrieves a set of candidate question–answer pairs, and the question Q and each candidate answer R form a pair. Then, confidence scores are calculated by the rerank model, and the score of the highest-scoring question–answer pair is O. Finally, O is compared with a threshold T: if O > T, the final answer is R; otherwise, the final answer is the reply R' generated by the generation-based model. This model can give full play to the fluency and logic of the IR-based model while using the generation-based model to deal with questions that have not appeared in the database.

Vougiouklis et al. [60] add external knowledge to dialog generation and combine it with the generation-based model. Their dialog model includes two parts. One part is sentence modeling, which first pretrains a CNN on a Wikipedia data set so that the CNN can extract external knowledge; after training, this part extracts local information corresponding to the input sequence. The other part is sequence modeling, an RNN model whose last hidden state is augmented with the local information extracted by the first part. In particular, this article is the first attempt to construct an end-to-end learning system that automatically generates context-aware responses from aligned data of two different sources. Compared with a pure RNN-based sequence model, the model achieves a 55% improvement in perplexity. The work in [61] is also a dialog model combining external knowledge, but it differs from [60] in how the external knowledge is combined: Vougiouklis et al. [60] combine the historical information and external knowledge at the decoder end, whereas Ghazvininejad et al. [61] combine them at the encoder end, with the external knowledge retrieved from a knowledge base through keyword-based retrieval. When dealing with entities that do not appear in the training data, the model can still give appropriate responses based on the external knowledge, so the model can be enriched by external knowledge without retraining. In the same year, Madotto et al. [62] propose a memory-to-sequence (Mem2Seq) model, which consists of an end-to-end memory network (MemNN) [63] encoder and a memory decoder. More specifically, the MemNN represents the dialog history as vectors, and the memory decoder generates replies by reading, writing, and copying memory. The memory content is made up of dialog history information and knowledge base (KB) information: if the expected word is in the KB information, it is selected from the KB; otherwise, it is selected from the dialog history.

The advantage of the hybrid model is that it can give full play to the advantages of multiple models and make the whole model perform well in fluency, logicality, controllability, and the handling of unseen questions. However, the design of such a model is complex, requiring designers to be familiar with all of the above models, and the training and prediction time of the model increases.

IV. COMMON DATA SETS AND EVALUATION METRICS

A. Data Sets

Data sets are the basic condition of the dialog system, and high-quality data sets make the trained model more effective. By consulting a large number of documents, we summarize some data sets commonly used for single-turn or multiturn dialogs in English or Chinese, as shown in Table IV, and briefly introduce the composition of each data set.

The following is a more detailed description of each data set. The Cornell Movie-Dialogs Corpus [64] is extracted from 617 movies and contains 220 579 conversational exchanges between 10 292 pairs of movie characters, involving 9035 characters. The corpus also contains title information, character information, the actual text of each dialog, the structure of the dialog, and the original source of each dialog.


TABLE IV
SUMMARY OF THE COMMON DATA SETS

The Ubuntu Dialogue Corpus [65] is extracted from Ubuntu chat logs. The data set consists of about 930 000 dialogs and 7 100 000 utterances and is split into a training set, a validation set, and a test set. More specifically, the training set contains 1 000 000 examples, 50% positive (label 1) and 50% negative (label 0), each composed of a context, an utterance, and a label; the validation set contains 19 560 examples and the test set 18 920 examples, each composed of a context, a ground-truth utterance, and nine distractor responses.

The Douban Conversation Corpus is the dialog data collected by Wu et al. [20] from Douban. The training set consists of 1 M dialogs, the validation set of 50 K dialogs, and the test set of 10 K dialogs. In addition, the minimum number of turns per dialog is 3, and the average number of turns is about 6.

The Short-Text Conversation data set [66], [67] consists of short texts extracted from posts, and the comments under those posts, on Sina Weibo. There are 4.8 M post–response pairs in the training set and 422 posts in the test set, each of which has about 30 responses.

The Aligned Reddit and Wikipedia data set [60] is composed of conversational sequences from Reddit and aligned sentences from Wikipedia. It contains a total of 15 K comment sequences and 75 K aligned Wikipedia sentences, and it is mainly used as an external knowledge base for dialog systems.

The Papaya Conversational data set consists of two parts: core data and peripheral data. More specifically, the core data are manufactured to maintain a consistent personality of the chat robot, which can be trained as a polite, patient, and humorous persona by using the core data, and users can modify the persona information according to their own needs; the peripheral data are a collection of online resources, including scenario dialogs designed for training robots, the Cornell movie dialogs, and 170 000 cleaned pairs of Reddit data. In particular, the data set website provides a program for automatically generating Reddit data, which can generate millions of additional pairs.

B. Evaluation Metrics

The evaluation metrics of dialog models are divided into objective and subjective metrics. The objective metrics mainly include the evaluation metrics of the IR-based model and of the generation-based model, and the subjective metrics mainly include human evaluation. The classification of evaluation metrics is shown in Fig. 10.

1) Evaluation Metrics of the IR Model: The evaluation metrics commonly used for retrieval models are recall@k (R@k), mean average precision (MAP), and mean reciprocal rank (MRR).

R@k selects the k most likely responses to a given question and checks whether the correct response is among them. Usually, R@1 (k equal to 1) is used, because when building a data set there is usually only one correct answer for each question in the test set.

MAP [68] is a commonly used evaluation metric in object detection and text classification. Precision is the proportion of positive items among the returned results, and AP is the average of the precision maxima as the recall varies between 0 and 1. Because the AP obtained above corresponds to a single query, the average of the AP over all queries is the MAP, which can be calculated by the following formula:

\mathrm{MAP} = \frac{\sum_{q=1}^{|Q|} \mathrm{AP}(q)}{|Q|}   (2)

where Q is the collection of sample queries and |Q| is the number of queries.
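As a small worked example of R@k and of MAP as defined in (2), the following sketch scores toy ranked candidate lists against binary relevance labels; the rankings and labels are invented purely for illustration.

```python
def recall_at_k(ranked_labels, k):
    """1 if a correct response appears in the top k of one ranked candidate list."""
    return 1.0 if any(ranked_labels[:k]) else 0.0

def average_precision(ranked_labels):
    """AP for one query: mean of precision@i over the positions i of the relevant items."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(hits, 1)

# Three toy queries; each list gives the relevance (1 = correct) of the ranked candidates.
queries = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]]

r_at_1 = sum(recall_at_k(q, 1) for q in queries) / len(queries)
mean_ap = sum(average_precision(q) for q in queries) / len(queries)   # MAP as in (2)
print(r_at_1, round(mean_ap, 3))   # -> 0.333..., and (0.5 + 1.0 + 0.25) / 3 ≈ 0.583
```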


Fig. 10. Classification of evaluation metrics.

MRR [69] evaluates the performance of a retrieval model by the rank of the correct result in the retrieved list. In the calculation, the reciprocal rank of the correct answer of a query is taken as its accuracy, and the accuracy is then averaged over all queries. The MRR is defined as

\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}   (3)

where \mathrm{rank}_i is the position of the first relevant result in the ranking for query i.

2) Evaluation Metrics of the Generation Model: The evaluation metrics of the generation model are divided into word-overlap metrics and word-vector metrics. More specifically, word-overlap metrics reflect the quality of the generated responses by counting the occurrences of certain phrases in the sentences, whereas word-vector metrics calculate the similarity between words at the semantic level to express the similarity between sentences.

a) Word overlap metrics: Bilingual evaluation understudy (BLEU) [70] was originally designed to evaluate the quality of machine translation results and is now also used for dialog systems. BLEU is computed as

P_n(r, \hat{r}) = \frac{\sum_k \min\big(h(k, r_i), h(k, \hat{r}_i)\big)}{\sum_k h(k, r_i)}   (4)

\mathrm{BLEU\text{-}N} = b(r, \hat{r}) \exp\Big(\sum_{n=1}^{N} \beta_n \log P_n(r, \hat{r})\Big)   (5)

where h(k, r_i) and h(k, \hat{r}_i) are the numbers of times phrase k appears in the true and generated responses, respectively, and P_n(r, \hat{r}) reflects the accuracy of the n-gram phrases over the whole data set. Since n generally ranges from 1 to 4, the P_n for different values of n are combined by a weighted sum with weight coefficients \beta_n. To avoid overly short responses, a penalty factor b(r, \hat{r}) is added to the metric.

Recall-oriented understudy for gisting evaluation (ROUGE) [71] is an automatic summary evaluation method, which evaluates by calculating the n-gram recall between a candidate output and a set of reference outputs. ROUGE can be regarded as a recall-oriented counterpart of BLEU: ROUGE concentrates on recall, while BLEU concentrates on precision. Both can reflect word order, but the words counted by ROUGE may be discontinuous, whereas the words counted by BLEU must be continuous. ROUGE is calculated as

\mathrm{ROUGE} = \frac{\sum_{S \in \{\mathrm{Ref\,Summaries}\}} \sum_{g_n \in S} \mathrm{Count}_{\mathrm{match}}(g_n)}{\sum_{S \in \{\mathrm{Ref\,Summaries}\}} \sum_{g_n \in S} \mathrm{Count}(g_n)}   (6)

where Ref is the abbreviation of reference, g is the abbreviation of gram, n is the length of the n-gram phrases, and \mathrm{Count}_{\mathrm{match}}(g_n) is the maximum number of n-gram phrases appearing in both the candidate output and the reference output set.

METEOR [72] improves on BLEU by adding an alignment relationship between the generated response and the true response so that the score correlates better with human judgment. METEOR can be described as

\mathrm{METEOR} = (1 - \mathrm{Pen})\, F_{\mathrm{mean}}   (7)

where

\mathrm{Pen} = \gamma\, (\mathrm{frag})^{\beta}   (8)

F_{\mathrm{mean}} = \frac{P_m R_m}{\alpha P_m + (1 - \alpha) R_m}   (9)

where P_m and R_m are the precision and recall, respectively, frag is the fragmentation fraction, \alpha, \beta, and \gamma are weight coefficients, Pen is the penalty term, and F_{\mathrm{mean}} is the weighted harmonic mean that combines precision and recall.

b) Word vector metrics: Greedy matching (GM) expresses the similarity of two sentences through the similarity of the word embedding vectors in the two sentences. GM is defined as

G(r, \hat{r}) = \frac{\sum_{\omega \in r} \max_{\hat{\omega} \in \hat{r}} \cos\_\mathrm{sim}(e_\omega, e_{\hat{\omega}})}{|r|}   (10)

\mathrm{GM}(r, \hat{r}) = \frac{G(r, \hat{r}) + G(\hat{r}, r)}{2}   (11)

where r and \hat{r} are the true response and the generated response, respectively, \omega and \hat{\omega} are words in r and \hat{r}, and e_\omega and e_{\hat{\omega}} are the embedding vectors corresponding to \omega and \hat{\omega}. First, for each word in the true response, its maximum cosine similarity against the words of the generated response is computed and the mean value is calculated; then the same calculation is done for each word in the generated response; finally, the matching value is obtained by averaging the two directions.
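A small sketch of greedy matching as defined in (10) and (11), using numpy, is given below; the two example "sentences" and their random embedding vectors are hypothetical stand-ins for real word embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical word embeddings; in practice these would come from word2vec/GloVe.
emb = {w: rng.normal(size=25) for w in ["good", "morning", "to", "you", "hello", "there"]}

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def g(reference, hypothesis):
    """Directional greedy match G(r, r_hat) of (10): mean over r of the best match in r_hat."""
    return np.mean([max(cos_sim(emb[w], emb[v]) for v in hypothesis) for w in reference])

def greedy_matching(reference, hypothesis):
    """Symmetric GM of (11)."""
    return 0.5 * (g(reference, hypothesis) + g(hypothesis, reference))

true_reply = ["good", "morning", "to", "you"]
generated  = ["hello", "there"]
print(round(greedy_matching(true_reply, generated), 3))
```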


Embedding average (EA) calculates a sentence vector from the word vectors of the words in the sentence and has been applied to dialog systems and text similarity tasks. It is obtained by the following formulas:

\bar{e}_r = \frac{\sum_{\omega \in r} e_\omega}{\big\lVert \sum_{\omega' \in r} e_{\omega'} \big\rVert}   (12)

\mathrm{EA} = \cos(\bar{e}_r, \bar{e}_{\hat{r}})   (13)

where \bar{e}_r denotes the normalized mean of the word vectors of all the words in a sentence, and the cosine similarity between the mean word vectors of the two sentences is taken as the metric of sentence similarity.

Vector extrema (VE) [73] calculates the similarity between two sentences at the sentence-vector level. In this method, the extreme value of each dimension over the words of a sentence is selected to form the sentence vector:

e_{rd} = \begin{cases} \max_{\omega \in r} e_{\omega d}, & \text{if } \max_{\omega \in r} e_{\omega d} > \left|\min_{\omega' \in r} e_{\omega' d}\right| \\ \min_{\omega \in r} e_{\omega d}, & \text{otherwise} \end{cases}   (14)

\mathrm{VE} = \cos(e_{rd}, e_{\hat{r}d})   (15)

where d indexes the dimensions of the word vector and e_{\omega d} is the d-th dimension of the embedding vector of \omega.

Perplexity (PPL) is an information-theoretic metric that measures how well a probability model predicts a sample. It is calculated as

\mathrm{PPL} = b^{-\frac{1}{N} \sum_{i=1}^{N} \log_b p(x_i)}   (16)

where p(x_i) is the probability of the occurrence of word x_i, N is the total number of words, and b is usually 2 or e. The smaller the PPL value, the better the prediction effect of the language model.

3) Human Evaluation Metrics: The existing objective evaluation metrics reflect the relationship between questions and responses to a certain extent, but no metric solves the evaluation problem of dialog systems very well. Therefore, many researchers and enterprises prefer to use human evaluation to assess the results of a dialog system. Human evaluation metrics generally cover grammar, contextual relevance, diversity, and other aspects: the higher the score, the higher the quality of the response, and vice versa.

The advantage of human evaluation is that it fully reflects the real feelings of human beings and steers the design of dialog systems toward human daily conversation habits. However, human evaluation requires recruiting volunteers and designing questionnaires, which is time-consuming and laborious.

V. ANALYSIS OF THE EVALUATION METRICS

A. Analysis

The evaluation metrics of the IR model, the generation model, and human evaluation are shown in Figs. 7 and 8. If we knew the specific metric values of each model, we could compare the models directly; instead, we only count the frequency of each metric in the corresponding literature. There are two reasons for this: on the one hand, the data sets used in these articles are generally different, so metric values on different data sets are not comparable; on the other hand, there is no unified set of evaluation metrics adopted by all articles, so the metrics reported in each article differ widely. Therefore, it is not feasible to compare different models by the values of the evaluation metrics.

Next, we analyze the evaluation metrics for each type of model as follows.

1) Model Based on IR: As shown in Fig. 7, we can first see that MRR is widely used in articles on IR-based models, from which we can infer that MRR suits a variety of data and tasks in the IR-based setting. Second, we can see that [23] uses the most metrics, because that article addresses three tasks, whereas the other articles address only one or two. We can also find that [20], [21], and [29] use the same metrics; the reason is that these three articles use the same data set, as shown in Table II, and face the same task. Hence the choice of evaluation metric is related to the task, and different tasks require different data sets. Finally, no human evaluation metric appears in Fig. 7; the IR-based model does not need it, because its responses are usually logical and fluent.

2) Model Based on Generation: As shown in Fig. 8, PPL occurs most frequently, which suggests it is applicable to more data sets and tasks. Many human evaluation metrics are used in these articles: because the responses of generation-based models are not always fluent and logical, there are many problems to solve, as the style column of Table III shows (besides other problems not shown). It is a challenge to solve these problems, and it is even more challenging to design a metric that evaluates the models addressing them in a unified way, which leads to the sparsity of metrics seen in Fig. 8.

In summary, the distribution of evaluation metrics for these two kinds of models is sparse, and no unified metric evaluates both of them well. Designing evaluation metrics for the IR-based model is relatively easy; however, designing evaluation metrics for the generation-based model is more complex, and most of its metrics are evaluated manually.

B. Concluding Remarks

From the above analysis, some conclusions can be drawn as follows.

1) The evaluation metrics of both the IR-based model and the generation-based model can reflect the quality of the

response to a certain extent. However, the metrics are so numerous and varied that a unified evaluation of dialog models cannot be formed.
2) It is one-sided to evaluate the dialog system by borrowing the evaluation metrics of other natural language processing tasks, which may mislead the training of the dialog model. Therefore, it is necessary to design automatic evaluation metrics that are highly correlated with human evaluation.
3) Through comparison, it is found that human evaluation metrics are used more often for generation-based dialog models.
4) The data set is also an important factor affecting the evaluation results, and the many data sets of differing quality easily interfere with the evaluation of dialog models. Different tasks therefore require a standard data set to facilitate performance comparison between models.

VI. EXISTING ISSUES AND FUTURE RESEARCH DIRECTIONS

Although dialog models have made considerable progress in recent years, several typical issues still remain to be better addressed due to the complexity of natural language processing. In particular, we list the following typical issues and outline the potential research directions toward solving them.

A. Existing Typical Issues

1) On High Quality Large Data Sets: The size and quality of the data set determine the response quality of the dialog model: the larger the data set and the higher its quality, the more useful information it contains and the better the responses of the dialog system will be. However, most existing data sets are of low quality and contain little useful information.
2) On Unified Automatic Evaluation Metrics: The existing objective evaluation metrics have no clear correlation with human evaluation [74] and cannot meet the needs of dialog model evaluation, while human evaluation is time-consuming. Therefore, it is necessary to design automatic evaluation metrics that are highly correlated with human evaluation.
3) On the Generalization Ability of the Model: When a trained dialog model faces new business scenarios and new data sets, its generalization ability may be poor. In particular, with the continuous updating and enlargement of data sets, continuously retraining the dialog model is costly in time and computation. How to integrate multidomain knowledge for training and using dialog systems, or how to easily transfer the knowledge learned by a dialog model to new business scenarios, is a problem that needs to be solved and is key to improving the generalization ability of the model.
4) On Better Personification: Everyone has a unique personality, which is a combination of emotions, values, character, and other factors. Existing dialog robots can generate personalized responses with emotions, but they still give people a sense of inauthenticity, which is related to the lack of better personification.
5) On Reasoning Ability: The dialog-system products currently on the market give people the impression of not being smart. The main reason is that the existing dialog systems cannot reason like human beings, which is the most critical factor restricting the development of dialog systems toward a higher level of intelligence.

B. Future Research Directions

In the face of the above problems, and combined with the development trend and actual needs of human–machine interaction systems, the following aspects of research will be the potential directions in this field in the future.

1) Construction of High Quality Large Data Sets: The existing models are all data-driven, and the quality and scale of the data are very important to the models. Therefore, how to build a large data set, especially a high quality one, is an important research direction.
2) Establishment of a Dialog Evaluation System: The evaluation metrics designed in the future should be able to evaluate the reasonableness of a response at the granularity of characters, words, and sentences. Designers need to fully consider the influence of these different granularity factors and design evaluation metrics more relevant to human judgment, which can improve the effect of the dialog system. Kannan and Vinyals [75] try to use GAN to evaluate the dialog system, and Lowe et al. [76] design an automatic dialog evaluation model (ADEM) on the basis of a hierarchical RNN. These two works provide useful references for the design of new evaluation metrics.
3) Improving the Generalization Ability of the Model: There are two directions worth studying to improve the generalization ability of the dialog model. One is to dynamically integrate the dialog system with a knowledge graph, so that the dialog system can continuously learn new knowledge while communicating with human beings, store the learned knowledge in the knowledge graph, and call on that knowledge at any time; adding knowledge from the knowledge graph as prior knowledge to the dialog model can improve its responses. The other is to integrate multidomain knowledge, which can reduce the errors caused by task switching and improve the generalization ability of the model, and to use transfer learning, which saves time and labor costs and is more in line with the human learning process.
4) Design of Anthropomorphic Models: The anthropomorphic design of the dialog system is also a direction worth studying. Researchers need to take into account the various factors that make up human personality and integrate them into the dialog system, so that the dialog robot can highly personify the human personality and

give people a more real feeling. We also need to maintain the consistency of personality. Some of the existing methods control this through the training data sets and rules; we can also learn from ideas in control theory and regulate the output of the neural network by designing appropriate controllers [77].

5) Research on Inductive Reasoning Theory: The reasoning ability of the dialog model has always been a challenging research problem. In recent years, graph neural networks [78]–[80] have shown that end-to-end learning can be combined with inductive reasoning, which is expected to alleviate the problem that DL cannot reason (a minimal message-passing sketch in the spirit of [79] is given after this list). Applying inductive reasoning theory in the field of dialog systems, so that the dialog system can also reason, is the key to achieving strong AI.
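As a minimal editorial illustration of the knowledge-graph direction in item 3) above, the sketch below retrieves triples related to the entities mentioned in a user utterance and prepends them to the input of a generation model, in the spirit of knowledge-grounded models such as [61] and [62]. The toy triple store, the entity matcher, and the generate callback are simplified placeholders of our own and are not components of any surveyed system.

# Minimal sketch: injecting knowledge-graph triples as prior knowledge
# into the input of a seq2seq-style generator. All names are illustrative.
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# A toy knowledge graph; a real system would query a large external graph
# and write newly learned facts back into it during conversations.
KNOWLEDGE_GRAPH: List[Triple] = [
    ("beijing", "capital_of", "china"),
    ("beijing", "has_attraction", "the forbidden city"),
    ("the forbidden city", "located_in", "beijing"),
]

def retrieve_triples(utterance: str, graph: List[Triple], top_k: int = 3) -> List[Triple]:
    """Return triples whose head or tail entity appears in the utterance."""
    text = utterance.lower()
    hits = [t for t in graph if t[0] in text or t[2] in text]
    return hits[:top_k]

def build_model_input(utterance: str, graph: List[Triple]) -> str:
    """Serialize the retrieved triples and prepend them to the dialog input."""
    facts = retrieve_triples(utterance, graph)
    fact_text = " ; ".join(f"{h} {r} {t}" for h, r, t in facts)
    return f"knowledge: {fact_text} | user: {utterance}"

def respond(utterance: str, generate: Callable[[str], str]) -> str:
    """Condition an arbitrary generator (e.g., a trained Seq2Seq model) on the facts."""
    return generate(build_model_input(utterance, KNOWLEDGE_GRAPH))

# Example with a stub generator that simply echoes its grounded input.
print(respond("what can I visit in beijing?", generate=lambda x: f"[model sees] {x}"))

New facts extracted from conversations can be written back into the same store, which corresponds to the dynamic integration described in item 3); the generator itself can be any of the Seq2Seq models surveyed earlier.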
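To make the graph neural network direction in item 5) concrete, the following sketch implements one round of the basic message passing used by graph convolutional models in the family of [78] and [79]: every node aggregates its neighbors' features through a normalized adjacency matrix and a learned weight matrix. It is a generic illustration written with NumPy under assumed toy dimensions, not code from the cited works.

# Minimal sketch of one graph-convolution (message-passing) layer,
# H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W), following the form used in [79].
import numpy as np

def gcn_layer(adjacency: np.ndarray, features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """One propagation step: aggregate neighbor features, then transform them."""
    n = adjacency.shape[0]
    a_hat = adjacency + np.eye(n)                 # add self-loops
    degree = a_hat.sum(axis=1)                    # node degrees
    d_inv_sqrt = np.diag(1.0 / np.sqrt(degree))   # D^{-1/2}
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt      # symmetric normalization
    return np.maximum(a_norm @ features @ weights, 0.0)  # ReLU activation

# Toy graph with 4 nodes (e.g., entities mentioned in a dialog) and 3 features each.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 1],
                      [0, 1, 0, 0],
                      [0, 1, 0, 0]], dtype=float)
features = np.random.randn(4, 3)
weights = np.random.randn(3, 8)   # project 3 input features to 8 hidden units

hidden = gcn_layer(adjacency, features, weights)
print(hidden.shape)  # (4, 8): every node now carries information from its neighbors

Stacking several such layers lets information propagate over multihop relations, which is the property that makes these models a candidate for the inductive reasoning discussed in item 5).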
VII. CONCLUSION

The dialog model is crucial to building dialog systems. In this article, a comprehensive survey of learning-based dialog models is presented. In particular, the current development status and construction process of human–machine dialog systems are reviewed, and the typical dialog models and methods, as well as their corresponding advantages and disadvantages, are compared and analyzed. The challenging issues of high quality large data sets, unified automatic evaluation metrics, the generalization ability of the model, better personification, and reasoning ability, together with the potential research directions, are also briefly discussed.

All-round research on learning-based dialog models is not only of great theoretical significance but also of important social value. On the one hand, research in this field will help us understand the nature of human dialog and promote the development of dialog systems and their related technologies. On the other hand, it will bring bright prospects for the collaborative development of linguistics and AI.
[25] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning
R EFERENCES with neural networks,” 2014, arXiv:1409.3215. [Online]. Available:
[Link]
[1] J. Weizenbaum, “ELIZA—A computer program for the study of natural
[26] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
language communication between man and machine,” Commun. ACM,
Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
vol. 9, no. 1, pp. 36–45, 1983.
[2] K. M. Colby, Artificial Paranoia; a Computer Simulation of Paranoid [27] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of
Processes. New York, NY, USA: Pergamon, 1975. gated recurrent neural networks on sequence modeling,” in Proc. Deep
[3] S. R. Wallace, “The anatomy of A.L.I.C.E.,” in Parsing the Turing Test. Learn. Represent. Learn. Workshop (NIPS), Montreal, QC, Canada,
2014, pp. 1–9. [Online]. Available: [Link]
Berlin, Germany: Springer, 2009, pp. 181–210.
[4] G. Hinton et al., “Deep neural networks for acoustic modeling in speech [28] L. Shang, Z. Lu, and H. Li, “Neural responding machine for short-text
recognition: The shared views of four research groups,” IEEE Signal conversation,” in Proc. 53rd Annu. Meeting Assoc. Comput. Linguistics
Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012. 7th Int. Joint Conf. Natural Lang. Process., Beijing, China, 2015,
[5] L. Deng et al., “Recent advances in deep learning for speech research pp. 1577–1586.
at microsoft,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., [29] Z. Zhang et al., “Modeling multi-turn conversation with deep utterance
Vancouver, BC, Canada, May 2013, pp. 8604–8608. aggregation,” in Proc. 27th Int. Conf. Comput. Linguistics, Santa Fe,
[6] W. Xiong et al., “Toward human parity in conversational speech recogni- NM, USA, 2018, pp. 3740–3752.
tion,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 12, [30] Z. Ji, Z. Lu, and H. Li, “An information retrieval approach to
pp. 2410–2423, Dec. 2017. short text conversation,” 2014, arXiv:1408.6988. [Online]. Available:
[7] Y. Qian, Y. Fan, W. Hu, and F. K. Soong, “On the training aspects [Link]
of deep neural network (DNN) for parametric TTS synthesis,” in Proc. [31] A. Ritter, C. Cherry, and W. B. Dolan, “Data-driven response gener-
IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Florence, ation in social media,” in Proc. EMNLP, Edinburgh, Scotland, 2011,
Italy, May 2014, pp. 3829–3833. pp. 27–31.
[8] A. van den Oord et al., “WaveNet: A generative model for [32] J. Li, W. Monroe, and D. Jurafsky, “A simple, fast diverse decoding
raw audio,” 2016, arXiv:1609.03499. [Online]. Available: [Link] algorithm for neural generation,” 2016, arXiv:1611.08562. [Online].
org/abs/1609.03499 Available: [Link]
[9] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural proba- [33] A. K Vijayakumar et al., “Diverse beam search: Decoding diverse solu-
bilistic language model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155, tions from neural sequence models,” 2016, arXiv:1610.02424. [Online].
Feb. 2003. Available: [Link]

[34] Y. Shao, S. Gouws, D. Britz, A. Goldie, B. Strope, and R. Kurzweil, “Generating high-quality and informative conversation responses with sequence-to-sequence models,” in Proc. Conf. Empirical Methods Natural Lang. Process., Copenhagen, Denmark, 2017, pp. 2200–2209.
[35] L. Mou et al., “Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation,” in Proc. COLING, Osaka, Japan, 2016, pp. 11–17.
[36] C. Xing et al., “Topic aware neural response generation,” in Proc. AAAI, San Francisco, CA, USA, 2017, pp. 3351–3357.
[37] C. Xing et al., “Topic aware neural response generation,” 2016, arXiv:1606.08340. [Online]. Available: [Link]
[38] H. Zhou et al., “Emotional chatting machine: Emotional conversation generation with internal and external memory,” in Proc. AAAI, New Orleans, LA, USA, 2018, pp. 730–738.
[39] N. Asghar, P. Poupart, J. Hoey, X. Jiang, and L. Mou, “Affective neural response generation,” 2017, arXiv:1709.03968. [Online]. Available: [Link]
[40] S. Ghosh, M. Chollet, E. Laksana, L.-P. Morency, and S. Scherer, “Affect-LM: A neural language model for customizable affective text generation,” in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, Vancouver, BC, Canada, 2017, pp. 634–642.
[41] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau, “Building end-to-end dialogue systems using generative hierarchical neural network models,” in Proc. 30th AAAI Conf. Artif. Intell., Phoenix, AZ, USA, 2016, pp. 3776–3783.
[42] A. Sordoni, Y. Bengio, H. Vahabi, C. Lioma, J. G. Simonsen, and J.-Y. Nie, “A hierarchical recurrent encoder-decoder for generative context-aware query suggestion,” in Proc. 24th ACM Int. Conf. Inf. Knowl. Manage. (CIKM), Melbourne, VIC, Australia, 2015, pp. 553–562.
[43] I. V. Serban et al., “A hierarchical latent variable encoder-decoder model for generating dialogues,” in Proc. AAAI, San Francisco, CA, USA, 2017, pp. 3295–3301.
[44] K. Yao, G. Zweig, and B. Peng, “Attention with intention for a neural network conversation model,” 2015, arXiv:1510.08565. [Online]. Available: [Link]
[45] B. J. Grosz, “Attention, intentions, and the structure of discourse,” Comput. Linguistics, vol. 12, pp. 175–204, Jul. 1986.
[46] E. Levin, R. Pieraccini, and W. Eckert, “Using Markov decision process for learning dialogue strategies,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Seattle, WA, USA, May 1998, pp. 201–204.
[47] D. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. 2nd Int. Conf. Learn. Represent., Banff, AB, Canada, 2014, pp. 1–14.
[48] Y. Park, J. Cho, and G. Kim, “A hierarchical latent structure for variational conversation modeling,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., Melbourne, VIC, Australia, 2018, pp. 1792–1801.
[49] L. Shen, Y. Feng, and H. Zhan, “Modeling semantic relationship in multi-turn conversations with hierarchical latent variables,” in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, Florence, Italy, 2019, pp. 5497–5502.
[50] Z. Yang, K. Merrick, L. Jin, and H. A. Abbass, “Hierarchical deep reinforcement learning for continuous action control,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 11, pp. 5174–5184, Nov. 2018.
[51] B. Gokce and H. L. Akin, “Implementation of reinforcement learning by transfering sub-goal policies in robot navigation,” in Proc. 21st Signal Process. Commun. Appl. Conf. (SIU), Haspolat, Turkey, Apr. 2013, pp. 1–4.
[52] A. Jeerige, D. Bein, and A. Verma, “Comparison of deep reinforcement learning approaches for intelligent game playing,” in Proc. IEEE 9th Annu. Comput. Commun. Workshop Conf. (CCWC), Las Vegas, NV, USA, Jan. 2019, pp. 0366–0371.
[53] M. Xu, H. Shi, and Y. Wang, “Play games using reinforcement learning and artificial neural networks with experience replay,” in Proc. IEEE/ACIS 17th Int. Conf. Comput. Inf. Sci. (ICIS), Singapore, Jun. 2018, pp. 855–859.
[54] Z. Ni and S. Paul, “A multistage game in smart grid security: A reinforcement learning solution,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 9, pp. 2684–2695, Sep. 2019.
[55] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky, “Deep reinforcement learning for dialogue generation,” 2016, arXiv:1606.01541. [Online]. Available: [Link]
[56] Z. C. Lipton, X. Li, J. Gao, L. Li, F. Ahmed, and L. Deng, “BBQ-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems,” 2016, arXiv:1608.05081. [Online]. Available: [Link]
[57] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[58] T. Zhao and M. Eskenazi, “Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning,” in Proc. 17th Annu. Meeting Special Interest Group Discourse Dialogue, Los Angeles, CA, USA, 2016, pp. 1–10.
[59] M. Qiu et al., “AliMe chat: A sequence to sequence and rerank based chatbot engine,” in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, Vancouver, BC, Canada, 2017, pp. 498–503.
[60] P. Vougiouklis, J. Hare, and E. Simperl, “A neural network approach for knowledge-driven response generation,” in Proc. COLING, Osaka, Japan, 2016, pp. 3370–3380.
[61] M. Ghazvininejad et al., “A knowledge-grounded neural conversation model,” in Proc. 32nd AAAI Conf. Artif. Intell., New Orleans, LA, USA, 2018, pp. 5110–5117.
[62] A. Madotto, C.-S. Wu, and P. Fung, “Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems,” in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, Melbourne, VIC, Australia, 2018, pp. 1468–1478. [Online]. Available: [Link]
[63] S. Sukhbaatar, J. Weston, and R. Fergus, “End-to-end memory networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Montreal, QC, Canada, 2015, pp. 2440–2448.
[64] C. Danescu-Niculescu-Mizil, M. Gamon, and S. Dumais, “Mark my words!: Linguistic style accommodation in social media,” in Proc. 20th Int. Conf. World Wide Web (WWW), Hyderabad, India, 2011, pp. 745–754.
[65] R. Lowe, N. Pow, I. Serban, and J. Pineau, “The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems,” 2015, arXiv:1506.08909. [Online]. Available: [Link]
[66] H. Wang, Z. D. Lu, H. Li, and E. H. Chen, “A dataset for research on short-text conversation,” in Proc. Empirical Methods Natural Lang. Process. (EMNLP), Seattle, WA, USA, 2013, pp. 935–945.
[67] M. Wang, Z. Lu, H. Li, and Q. Liu, “Syntax-based deep matching of short texts,” in Proc. 24th Int. Joint Conf. Artif. Intell., Buenos Aires, Argentina, 2015, pp. 1354–1361.
[68] R. Baeza-Yates, Modern Information Retrieval. Beijing, China: Machinery Industry Press, 2004.
[69] E. Voorhees, “The TREC-8 question answering track report,” in Proc. 8th Text Retr. Conf. (TREC), 1999, pp. 79–82.
[70] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics (ACL), Philadelphia, PA, USA, 2001, pp. 311–318.
[71] C.-Y. Lin and E. Hovy, “Manual and automatic evaluation of summaries,” in Proc. ACL Workshop Text Summarization Branches (WAS), Barcelona, Spain, 2004, pp. 74–81.
[72] A. Lavie and A. Agarwal, “Meteor: An automatic metric for MT evaluation with high levels of correlation with human judgments,” in Proc. 2nd Workshop Stat. Mach. Transl. (StatMT), Prague, Czech Republic, 2007, pp. 228–231.
[73] G. Forgues, J. Pineau, J.-M. Larchevêque, and R. Tremblay, “Bootstrapping dialog systems with word embeddings,” in Proc. Mod. Mach. Learn. Natural Lang. Process. Workshop (NIPS), 2014, pp. 1–5.
[74] C.-W. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, and J. Pineau, “How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation,” in Proc. Conf. Empirical Methods Natural Lang. Process., Austin, TX, USA, 2016, pp. 2122–2132.
[75] A. Kannan and O. Vinyals, “Adversarial evaluation of dialogue models,” 2017, arXiv:1701.08198. [Online]. Available: [Link]
[76] R. Lowe, M. Noseworthy, I. V. Serban, N. Angelard-Gontier, Y. Bengio, and J. Pineau, “Towards an automatic Turing test: Learning to evaluate dialogue responses,” in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, Vancouver, BC, Canada, 2017, pp. 1116–1126.
[77] Y.-D. Song, X. Huang, and Z.-J. Jia, “Dealing with the issues crucially related to the functionality and reliability of NN-associated control for nonlinear uncertain systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 11, pp. 2614–2625, Nov. 2017.
[78] F. Scarselli, M. Gori, A. Chung Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 61–80, Jan. 2009.
[79] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proc. 5th Int. Conf. Learn. Represent. (ICLR), Toulon, France, 2016, pp. 1–14. [Online]. Available: [Link]
[80] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in Proc. 6th Int. Conf. Learn. Represent. (ICLR), Vancouver, BC, Canada, 2017, pp. 1–12. [Online]. Available: [Link]
[81] R. S. Boyer and J. S. Moore, “A fast string searching algorithm,” Commun. ACM, vol. 20, no. 10, pp. 762–772, Oct. 1977.
[82] D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt, “Fast pattern matching in strings,” SIAM J. Comput., vol. 6, no. 2, pp. 323–350, Jun. 1977.
[83] S. Wu and U. Manber, “Fast text searching: Allowing errors,” Commun. ACM, vol. 35, no. 10, pp. 83–91, Oct. 1992.
[84] A. V. Aho and M. J. Corasick, “Efficient string matching: An aid to bibliographic search,” Commun. ACM, vol. 18, no. 6, pp. 333–340, Jun. 1975.
Fuwei Cui received the M.S. degree from the North China University of Technology, Beijing, China, in 2017. He is currently pursuing the Ph.D. degree in control science and engineering with Beijing Jiaotong University, Beijing, China.
His specific areas of research interest mainly focus on deep learning, natural language processing, and human–computer interaction systems.

Qian Cui received the B.S. degree from the Wuhan Institute of Technology, Wuhan, China, in 2017. She is currently pursuing the Ph.D. degree with the School of Automation, Chongqing University, Chongqing, China.
Her research interests include robust adaptive control, neural networks, and learning systems.

Yongduan Song (Fellow, IEEE) received the Ph.D. degree in electrical and computer engineering from Tennessee Technological University, Cookeville, TN, USA, in 1992.
He held a tenured Full Professor position with North Carolina A&T State University, Greensboro, NC, USA, from 1993 to 2008, and was a Langley Distinguished Professor with the National Institute of Aerospace, Hampton, VA, USA, from 2005 to 2008. He is currently the Dean of the School of Automation, Chongqing University, Chongqing, China. He was one of the six Langley Distinguished Professors with the National Institute of Aerospace (NIA), Hampton, VA, USA, and the Founding Director of Cooperative Systems with NIA. His current research interests include intelligent systems, guidance, navigation, and control, and bio-inspired adaptive and cooperative systems.
Prof. Song was a recipient of several competitive research awards from the National Science Foundation, the National Aeronautics and Space Administration, the U.S. Air Force Office, the U.S. Army Research Office, and the U.S. Naval Research Office. He is an IEEE Fellow and has served or is currently serving as an Associate Editor for several prestigious international journals, including the IEEE TRANSACTIONS ON AUTOMATIC CONTROL, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, and the IEEE TRANSACTIONS ON DEVELOPMENTAL AND COGNITIVE SYSTEMS.