Question answering based on ensemble classifier for university enrolment advising

Huu-Thanh Duong (1), Vinh Truong Hoang (2)
Faculty of Information Technology
Ho Chi Minh City Open University, Vietnam
(1) [email protected], (2) [email protected]

Abstract—Enrolment Advising (EA) for Vietnamese universities takes place at many sites in Vietnam every year, from high schools to universities. Many sessions are organized, both online and offline, and they cost a great deal of money and human resources for inviting experts and printing leaflets, banners, posters, etc. In this context, we try to reduce the cost of EA with an application of machine learning: we propose to develop an automatic university enrolment advising system using an ensemble classifier.

Keywords—machine learning, question answering, KNN, SVM, Naïve Bayes, term weight, full-text search, ensemble classifier.

I. INTRODUCTION

Every year, university enrolment is an important event that attracts many students and their parents. Enrolment advising becomes a burden on society: many activities take place to help students understand the exam regulations, how to take the exam, which career to choose, etc., and these activities are costly to organize. Moreover, many of the students' worries are repeated from year to year, and the answers are based on the experience of experts and on the enrolment regulations of the Ministry of Education and Training. An automatic university enrolment advising system built on the experts' answers is therefore needed to address this problem.

We observe that our problem is a question answering problem: the system receives a user's question in natural language and responds with a succinct answer. A question answering system usually has three phases: question processing, passage retrieval, and answer selection. In this paper, we focus on answer selection, i.e. selecting plausible answers from the dataset with a machine learning approach.

We first collected question/answer pairs about college enrolment (CE) from reputable websites, where each question is a student's concern about CE and each answer is the response given by human experts over the last five years. A dataset is created in which every question/answer pair is tokenized into Vietnamese words and indexed, so that later searches can reduce the search space. Next, the answer to a natural-language question from a user (a student or a parent) is constructed by voting among several machine learning algorithms.

Our main contribution is to review well-known classifiers and apply them to build a question answering system for enrolment advising, which has many practical applications in Vietnamese society. To our knowledge, no such system has been provided in Vietnam before; it will help to reduce cost and human resources, and help students and their parents easily find answers to their yearly worries about Vietnamese university enrolment.

The remainder of this paper is organized as follows. Section II presents the related works. Section III presents the proposed approach and section IV presents the experimental results. Finally, the conclusions and future works are presented in section V.

II. RELATED WORKS

Several teams have studied Vietnamese question answering in recent years. Le-Hong et al. [1] built a factoid question answering system for Vietnamese; they incorporated statistical models and ontology-based methods to obtain a high-quality mapping between natural language and the entities being processed, and their system can answer a wide range of general knowledge questions. The authors of [14] built a system to answer questions about power outage and water cut schedules in Vietnamese. The authors of [15] extracted a collection of users' queries from community web services together with the corresponding collections of feedbacks and comments. When their system receives a new question, it computes the similarity between that question and the set of extracted questions to obtain candidate answers; it then applies supervised learning to estimate a classification score for each candidate answer, and the final result is the answer with the highest score. This idea is rather close to our problem.

Our first task is to build a dataset in the enrolment field. We combine several libraries such as python scrapy, selenium, and regular expressions to extract data from reputable enrolment advising websites such as tuoitre.vn, thanhnien.vn, dantricom.vn, thituyensinh.vn, etc. We also face many challenges when processing Vietnamese: it differs in many respects from English, which is supported by large research communities and has seen many achievements in question answering in particular and in machine learning in general. The first obvious difference is the word boundary: unlike English, Vietnamese words are not delimited by spaces. A Vietnamese word can be a single word, a reduplicated word, or a compound word, and there are many ambiguous cases; this is the word segmentation problem. In practice, the Vietnamese natural language processing community provides several tools and libraries with rather high accuracy, such as pyvi (https://pypi.org/project/pyvi/) in Python. This library is good enough (with an F1 score of 98%) and easy to use for solving the word segmentation problem in Vietnamese.
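For illustration, the following is a minimal sketch of Vietnamese word segmentation with the pyvi library mentioned above; the example question is illustrative and not taken from the dataset.

# Minimal sketch: Vietnamese word segmentation with pyvi (pip install pyvi).
# The example question below is only illustrative.
from pyvi import ViTokenizer

question = "Cho em hỏi quy định của Bộ giáo dục và đào tạo năm nay"
tokenized = ViTokenizer.tokenize(question)
print(tokenized)
# Compound words are joined with "_", e.g. "quy_định", "giáo_dục", "đào_tạo".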



The tf×idf weight is used to characterize a document because it is easy to implement and represents word weights well. Here tf is the frequency of a word in a document, df is the number of documents containing the word, and idf is the inverse of df. If a word appears many times in a document (high tf), it may be a keyword of that document; but a word that also appears in many other documents ("là", "thì", "và", etc.) usually carries little meaning, so multiplying its tf by idf decreases the weight of such a word. The experimental results of [3] show that tf×idf obtains good scores for text classification, a problem rather close to ours.

Furthermore, a list of Vietnamese stopwords ("thì", "là", "và", "nhé", etc.) is built manually, and the system removes all of these stopwords from a user's question.
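A small sketch of these two pre-processing steps is shown below: stopwords are filtered from the tokenized question, and each remaining term gets a tf×idf weight of the common form tf * log(N / (1 + df)), the same variant that appears later in equation (9). The stopword list and toy documents are illustrative only.

# Sketch of stopword removal and tf x idf weighting (illustrative stopword
# list and toy documents; the weight follows tf * log(N / (1 + df))).
import math
from collections import Counter

stopwords = {"thì", "là", "và", "nhé", "cho", "em"}           # manually built list
docs = [
    "quy_định xét tuyển ngành sư_phạm",
    "học_phí ngành công_nghệ thông_tin",
    "quy_định miễn thi ngoại_ngữ",
]
tokenized_docs = [[w for w in d.split() if w not in stopwords] for d in docs]
N = len(tokenized_docs)
df = Counter(w for d in tokenized_docs for w in set(d))        # document frequency

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)
    return {w: tf[w] * math.log(N / (1 + df[w])) for w in tf}

print(tfidf(tokenized_docs[0]))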
III. PROPOSED APPROACH

A. BACKGROUND

In this section, a machine learning approach is used to give a solution for the Vietnamese Enrolment Advising system (ViEA).

In order to reduce the search space and the computational cost, we use full-text search (FTS) to find the questions in the dataset that are related to the user's question. The answers of these questions are considered candidate answers, and the final answer is decided by voting among machine learning algorithms.

FTS is a technique for searching a collection of documents. Unlike traditional methods, which rely only on a part of the original texts stored in a database, FTS examines all of the words in the documents to match a search criterion. For searching a large number of documents, FTS is divided into two tasks: indexing and searching. The indexing stage builds a list of terms from the whole document collection, and the searching stage finds the matching documents based on the index instead of the text of the original documents.
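The following is a minimal sketch of the two FTS stages described above, using a simple in-memory inverted index; a production system would typically rely on a database engine's full-text index instead.

# Minimal sketch of full-text search: an indexing stage that builds an
# inverted index, and a searching stage that looks up candidate questions.
from collections import defaultdict

def build_index(tokenized_questions):
    index = defaultdict(set)
    for doc_id, tokens in enumerate(tokenized_questions):
        for term in tokens:
            index[term].add(doc_id)
    return index

def search(index, query_tokens):
    # Return ids of questions sharing at least one term with the query.
    hits = set()
    for term in query_tokens:
        hits |= index.get(term, set())
    return hits

questions = [["quy_định", "xét", "tuyển"], ["học_phí", "ngành"], ["miễn", "thi"]]
index = build_index(questions)
print(search(index, ["quy_định", "miễn"]))   # {0, 2}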
In order to implement this idea, we choose three well-known classifiers that represent the groups with the lowest, intermediate, and highest results currently reported, respectively KNN, Naïve Bayes, and SVM. The rest of this subsection reviews these algorithms, whose predictions are combined by voting to select the answer.

KNN (K-nearest neighbours) is a well-known supervised machine learning algorithm. To assign a data point di to a certain group, KNN finds the K nearest data points of di according to a specific similarity or distance measure and assigns di to the group of its nearest data points. This approach is easy to implement and performs well; however, it has a high computational cost and is sensitive to noisy data if K is small.
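A minimal sketch of this nearest-neighbour step is shown below, with K = 1 as later fixed in Table 1; cosine similarity is our illustrative choice of measure, since the paper itself only fixes K.

# Toy 1-nearest-neighbour selection with cosine similarity (illustrative
# similarity choice) over already-vectorized candidate questions.
import numpy as np

def nearest_answer(query_vec, candidate_vecs, candidate_answers):
    q = np.asarray(query_vec, dtype=float)
    q = q / (np.linalg.norm(q) + 1e-12)
    best_id, best_sim = 0, -1.0
    for i, v in enumerate(candidate_vecs):
        v = np.asarray(v, dtype=float)
        sim = float(np.dot(q, v) / (np.linalg.norm(v) + 1e-12))   # cosine similarity
        if sim > best_sim:                                        # K = 1: keep the single best
            best_id, best_sim = i, sim
    return candidate_answers[best_id]

print(nearest_answer([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]], ["answer A", "answer B"]))
# -> "answer A"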
Naïve Bayes is also a supervised machine learning algorithm; it is based on probabilities and Bayes' theorem and makes predictions from a predefined dataset. Assume there are n classes {c1, c2, c3, …, cn} and a data point x = (x1, x2, …, xd), where xi corresponds to a word in the dictionary and d is the number of dimensions. The probability that x belongs to class ci is computed with the Naïve Bayes classifier as follows:

\[ P(c_i \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid c_i)\,P(c_i)}{P(\mathbf{x})} \qquad (1) \]

So we need to find the class ci for which P(ci|x) is maximal, i.e. the ci satisfying:

\[ c_i = \arg\max_{i=1,\dots,n} P(c_i \mid \mathbf{x}) = \arg\max_{i=1,\dots,n} \frac{P(\mathbf{x} \mid c_i)\,P(c_i)}{P(\mathbf{x})} \qquad (2) \]

where P(ci) is the probability that a data point in the dataset belongs to class ci. Naïve Bayes assumes that the dimensions of x are independent of each other, so P(x|ci) can be calculated as:

\[ P(\mathbf{x} \mid c_i) = \prod_{k=1}^{d} P(x_k \mid c_i) \qquad (3) \]

This algorithm is easy to implement and fast, so it is often applied in real-time systems. Although it requires the independence of the dimensions, which rarely holds, it still obtains good accuracy. There are usually two main variants of Naïve Bayes: Multivariate Bernoulli, used for binary vector representations of documents, and Multinomial Naïve Bayes, which is suitable for discrete features and is used for documents whose feature vector is a bag of words; in practice, tf×idf features also work well. The term P(xk|ci) in (3) is calculated as follows:

\[ P(x_k \mid c_i) = \frac{N_{ki}}{N_i} \qquad (4) \]

where Nki is the frequency of word xk in the documents belonging to class ci, and Ni is the total number of words in the documents belonging to class ci. If some word xk appears zero times in the documents of class ci, then (4) equals 0 regardless of the remaining values, which makes the result incorrect; therefore a smoothing parameter, called Laplace smoothing, is used to avoid this, and (4) is revised to:

\[ P(x_k \mid c_i) = \frac{N_{ki} + \alpha}{N_i + d\alpha} \qquad (5) \]

where α is a positive number that avoids zero values and d is the number of dimensions.
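To make equations (3)-(5) concrete, the toy sketch below trains a multinomial Naïve Bayes model on word counts with Laplace smoothing; the documents and class labels are illustrative only, and log-probabilities are used for numerical stability.

# Toy Multinomial Naive Bayes following equations (3)-(5), with Laplace
# smoothing (alpha) to avoid zero probabilities; log-space for stability.
import math
from collections import Counter

def train_nb(docs_by_class, alpha=0.01):
    vocab = sorted({w for docs in docs_by_class.values() for d in docs for w in d})
    priors, likelihoods = {}, {}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    for c, docs in docs_by_class.items():
        priors[c] = len(docs) / total_docs                       # P(ci)
        counts = Counter(w for d in docs for w in d)             # Nki per word
        n_i = sum(counts.values())                               # Ni
        likelihoods[c] = {w: (counts[w] + alpha) / (n_i + len(vocab) * alpha)
                          for w in vocab}                        # equation (5)
    return vocab, priors, likelihoods

def predict_nb(tokens, vocab, priors, likelihoods):
    scores = {}
    for c in priors:
        score = math.log(priors[c])                              # log P(ci)
        for w in tokens:
            if w in vocab:
                score += math.log(likelihoods[c][w])             # log of the product in (3)
        scores[c] = score
    return max(scores, key=scores.get)                           # argmax in (2)

data = {"tuition": [["học_phí", "ngành"]], "admission": [["xét", "tuyển", "ngành"]]}
model = train_nb(data)
print(predict_nb(["học_phí"], *model))                           # -> "tuition"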

Furthermore, SVM (Support Vector Machine) can be used for both classification and regression. This algorithm is based on PLA (the Perceptron Learning Algorithm), which assumes that there exists a line/plane/hyperplane (M) separating the dataset into two classes; the main responsibility of PLA is to find M, and there are lots of M satisfying this condition. For a dataset of points (x1, y1), (x2, y2), …, (xn, yn), xi is a d-dimensional vector and yi is the label of xi. Assume there are two labels, positive (yi = 1) and negative (yi = -1), and let M0 be the most optimal M. According to SVM, M0 must satisfy two conditions: the two nearest points of the two respective classes are at the same distance from M0, and that distance is the largest compared with any other M (in Fig. 1, M1 and M2 contain these points, and we need to find w, b giving the largest margin).

The distance from any point (xi, yi) to M0 is calculated as follows:

\[ \frac{\lvert \mathbf{w}^{T} x_i + b \rvert}{\lVert \mathbf{w} \rVert_2} = \frac{y_i (\mathbf{w}^{T} x_i + b)}{\lVert \mathbf{w} \rVert_2} \qquad (6) \]

Because (wT xi + b) and yi always have the same sign, the left side of (6) equals the right side. The purpose of SVM is to find w, b so that (6) obtains the maximum value. For the points (xi, yi) belonging to M1 and M2, assume that yi(wT xi + b) = 1. The problem then becomes:

\[ \mathbf{w}, b = \arg\max \left( \frac{1}{\lVert \mathbf{w} \rVert_2} \right), \quad \text{where } y_i (\mathbf{w}^{T} x_i + b) \ge 1 \qquad (7) \]

The Lagrange multiplier method is used to solve this problem. Afterwards, the category of a new data point is determined by the function f(x) = sign(wT x + b).

Fig. 1. Support Vector Machine

For a nearly linearly separable dataset, soft-margin SVM is used:

\[ (\mathbf{w}, b, \xi) = \arg\min_{\mathbf{w}, b, \xi} \frac{\lVert \mathbf{w} \rVert_2^2}{2} + C \sum_{n=1}^{N} \xi_n \qquad (8) \]

where C is a positive cost constant and ξ is called the slack variable, with ξi = |wT xi + b - yi| satisfying yn(wT xn + b) ≥ 1 - ξn and ξn ≥ 0, ∀n = 1, 2, …, N. This can also be solved using Lagrange multipliers.

B. THE ViEA ARCHITECTURE

The results of the three classifiers (KNN, Naïve Bayes and SVM) are voted to select the answer. Fig. 2 shows the architecture of our ViEA system.

Firstly, the dataset is built by pairing questions and answers, where the question is a student's worry and the answer is the response of the human experts; all of them are tokenized with the pyvi library. The system accepts a question in Vietnamese natural language in the enrolment field; the question is tokenized into Vietnamese words and some stopwords are removed. Next, FTS is applied to find the relevant questions in the dataset in order to reduce the search space; the output of this stage is the set of questions whose answers may be the expected answer. The system then vectorizes the user's question and the relevant questions found above. Each dimension is a term wi whose value is the tf×idf weight calculated as follows:

\[ (tf \times idf)_{w_i} = freq(w_i) \, \log\frac{N}{1 + df} \qquad (9) \]

where N is the number of documents in the dataset, df is the number of documents in the dataset containing wi, and freq(wi) is the frequency of wi in a document.

Fig. 2. ViEA System Architecture (the question is processed and tokenized, FTS retrieves candidate questions from the enrolment dataset, the resulting vectors are classified by KNN, Naïve Bayes and SVM, and majority voting performs the answer selection)

The vectorized data are then processed by the machine learning algorithms to find the question that best matches the user's question; its answer becomes ViEA's answer to the user's question. The algorithms are KNN, Naïve Bayes and SVM. KNN is used to find the nearest (most similar) question to the user's question and to select its answer. For Naïve Bayes, Multinomial Naïve Bayes is used for answer selection since this variant is suitable for tf×idf feature vectors. Then, the SVM algorithm is applied. Table 1 shows the parameters of the algorithms in the proposed system.
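As a concrete illustration of this answer-selection step, the sketch below wires up the three classifiers with the Table 1 parameters using scikit-learn. Treating each FTS candidate question as its own class, labelled by its answer, is our reading of the description above, so this is one possible realization rather than the exact implementation; note also that scikit-learn's tf-idf variant differs slightly from equation (9).

# Sketch of the answer-selection step with the Table 1 parameters
# (scikit-learn; one class per FTS candidate question is an assumption).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

candidate_questions = ["quy_định xét tuyển ngành sư_phạm", "học_phí ngành công_nghệ"]
candidate_answers = ["answer about pedagogy admission", "answer about tuition fees"]
user_question = "xét tuyển ngành sư_phạm cần học_lực gì"

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(candidate_questions)
q = vectorizer.transform([user_question])
y = list(range(len(candidate_answers)))       # one label per candidate question

models = [
    KNeighborsClassifier(n_neighbors=1),      # KNN, K = 1
    MultinomialNB(alpha=0.01),                # Multinomial Naive Bayes
    SVC(kernel="linear", C=1e5),              # SVM, linear kernel
]
predictions = [int(m.fit(X, y).predict(q)[0]) for m in models]
print([candidate_answers[p] for p in predictions])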

The final answer is the answer of the relevant question chosen by the majority vote of these algorithms; the remaining ones are secondary answers.

TABLE 1. THE PARAMETERS APPLIED TO THE ALGORITHMS

Algorithms                  Parameters
KNN                         K = 1
Multinomial Naïve Bayes     alpha = 0.01
SVM                         kernel = linear, C = 1e5

For example, the question is "Cho em hỏi quy định của Bộ giáo dục và đào tạo năm nay yêu cầu thí sinh phải có học lực giỏi lớp 12 mới được đăng ký xét tuyển các ngành sư phạm đúng không ạ?" (Does this year's regulation of the Ministry of Education and Training require a student to have a distinction-level academic rank in grade 12 in order to register for the pedagogy majors?).

It is tokenized as: "Cho em hỏi quy_định của Bộ giáo_dục và đào_tạo năm nay yêu_cầu thí_sinh phải có học_lực giỏi lớp 12 mới được đăng_ký xét tuyển các ngành sư_phạm đúng không ạ?", and stop words such as "cho", "em", "hỏi", "của", "ạ", "đúng", "không", "các", "và", "mới", "được" are removed. The question becomes "quy_định Bộ giáo_dục đào_tạo năm nay yêu_cầu thí_sinh phải có học_lực giỏi lớp 12 đăng_ký xét tuyển ngành sư_phạm".

In order to reduce the search space, the system then finds the questions relevant to the user's question in the dataset using FTS. These questions and the user's question are vectorized into representative features. Finally, KNN, Naïve Bayes and SVM are applied, with the following results:

KNN and SVM give the same answer: "với hình_thức xét tuyển dựa vào kết_quả học_tập trung_học phổ_thông, ngưỡng đảm_bảo chất_lượng đầu_vào ngành thuộc nhóm ngành đào_tạo giáo_viên trình_độ đại_học là học_sinh xếp loại học_lực lớp 12 từ giỏi trở lên. Với trình_độ cao_đẳng, trung_cấp xét tuyển học_sinh xếp loại học_lực lớp 12 từ khá trở lên." (In order to ensure the input quality threshold for teacher-training majors, a student must have a grade-12 academic rank of distinction for the formal university system, and of good for the college and intermediate systems, based on their academic results at high school.)

Naïve Bayes answers: "a) chuyển ngành: sinh_viên được phép xin chuyển sang ngành đào_tạo khác của trường nếu thỏa các điều_kiện sau: đạt số tín_chỉ tích_lũy tối thiểu theo qui_định, có điểm trung_bình tích_lũy từ 6.5 trở lên, điểm xét tuyển của sinh_viên bằng hoặc cao hơn điểm_chuẩn của ngành/chương_trình mà sinh_viên muốn chuyển đến hoặc sinh_viên thuộc diện tuyển thẳng, ưu_tiên xét tuyển. Sinh_viên không bị xử_lý kỷ_luật, được sự đồng_ý của trưởng khoa nơi chuyển đi và trưởng khoa tiếp_nhận. Chi_tiết qui_định sinh_viên liên_hệ khoa và phòng đào_tạo để được hướng_dẫn; sinh_viên các chương_trình tiên_tiến, chất_lượng cao, tài_năng liên_hệ văn_phòng các chương_trình đặc_biệt để được tư_vấn theo qui_trình và qui_định riêng. b) học song bằng: sinh_viên được phép học cùng lúc hai chương_trình đào_tạo để khi tốt_nghiệp nhận hai văn_bằng. c) học tín_chỉ tự chọn của ngành học khác: trong quá_trình học tại trường, sinh_viên có_thể đăng_ký học các môn chuyên_ngành của các ngành học khác như là môn_học tự chọn của mình, nếu sinh_viên đảm_bảo tiến_độ học_tập theo qui_định. Sinh_viên có_thể chọn các môn_học tự chọn phù_hợp với sở_thích và năng_lực của bản_thân. Các môn_học tự chọn được thể_hiện trong bảng điểm tốt_nghiệp là lợi_thế cho sinh_viên khi đi xin việc. Nếu môn_học tự chọn là môn_học lần đầu, sinh_viên không phải đóng thêm học_phí cho các môn_học này vì đã được tính vào học_phí trọn_gói của sinh_viên." In general, this answer talks about the regulations for switching majors, studying two parallel majors, and registering for elective credit courses; it is not an expected answer.

The system selects the answer by the majority vote of these algorithms. For this question, KNN and SVM vote for the same answer, so it is selected as the main answer of the system, while the answer from Naïve Bayes is the secondary answer.

IV. THE EXPERIMENTS

One thousand question/answer pairs were extracted from reputable enrolment websites to build the dataset. Due to the limited time, we prepared about one hundred additional questions and their answers for verifying the accuracy manually. The F1 score obtained is 76%; most of the wrong answers correspond to questions whose answers are not yet in the dataset. We consider this a promising result for EA systems.

To illustrate our evaluation, in addition to the example in part III, we give more examples as follows:

Question 1: "Trong tài liệu hướng dẫn ghi rõ các chứng chỉ có giá trị sử dụng đến ngày 23/06/2018. Tuy nhiên, hiện tại một số học sinh mới đang đăng ký thi, nếu chưa lấy được chứng chỉ trước ngày nộp hồ sơ thì có được đăng ký mục miễn thi ngoại ngữ hay không?" (The guideline states that the certificates are valid until June 23, 2018. However, some students are only now registering for the exam; if they have not obtained the certificates before the submission period, can they still register for the exemption from the foreign language exam?)

After ViEA pre-processes this question, the three algorithms predict the same result, which becomes the final answer of the ViEA system: "học_viên phải nộp đủ hồ_sơ trong thời_gian quy_định, và có_thể gia_hạn (linh_động không quá 1 tuần). Trường_hợp thời_gian bổ_sung vượt quá thời_gian thi_tuyển thì vẫn không được xét miễn thi anh văn." (Learners must submit a complete application within the prescribed period, which can be extended by no more than one week; if the supplementary period exceeds the examination period, the exemption from the English exam is still not granted.) Comparing this answer with the predefined answer of this question (from its question/answer pair), we mark it as a correct answer.

Question 2: "Các nguyện vọng có bình đẳng trong xét tuyển không?" (Are the aspirations treated equally in admission?). With this question, the three algorithms return three different results and none of them is similar to the predefined answer, so we mark this as a wrong result.

TABLE 2. THE EXPERIMENT OF KNN, NAIVE BAYES, SVM AND THE VOTING APPROACH

                          KNN    Naïve Bayes    SVM    Voting
No. of right answers       64         66         63        76
No. of wrong answers       36         34         37        24
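The selection rule illustrated in these examples (majority voting, with the Naïve Bayes answer preferred when all three classifiers disagree, as discussed in the next section) can be sketched as follows; the answer strings are placeholders.

# Majority voting over the three predicted answers; when all three differ,
# the Naive Bayes answer is preferred (the priority described in Section IV).
from collections import Counter

def vote(knn_answer, nb_answer, svm_answer):
    counts = Counter([knn_answer, nb_answer, svm_answer])
    answer, votes = counts.most_common(1)[0]
    if votes == 1:                 # three different answers: fall back to Naive Bayes
        return nb_answer
    return answer                  # main answer; the others remain secondary answers

print(vote("A", "B", "A"))         # -> "A" (KNN and SVM agree)
print(vote("A", "B", "C"))         # -> "B" (all differ, Naive Bayes preferred)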

In the experiment, the three classifiers (KNN, Naïve Bayes, SVM) are applied to find the appropriate answer by voting. The experimental results show that KNN and SVM usually give the same results, while the accuracy of Naïve Bayes is better than the others; therefore, when the three classifiers give three different results, the answer from Naïve Bayes is selected based on this priority. Table 2 gives the statistics (numbers of right and wrong answers) of KNN, Naïve Bayes and SVM independently and of the proposed voting approach. It shows that the voting approach is better than the individual classifiers, with the largest number of right answers.

V. CONCLUSIONS AND FUTURE WORKS

In this paper, we proposed an approach based on multiple classifiers for university enrolment advising. The dataset is constructed by pairing questions and answers. When a new question is raised, Vietnamese word segmentation is applied and stop words are removed. Then, FTS is performed to find the questions relevant to the user's question. Finally, the system applies three classifiers, KNN, Naïve Bayes and SVM, and votes to select the appropriate answer. The proposed approach can answer complex questions, as shown in the experimental results. In fact, this method can be applied in other domains; the main drawback is the need for a "big enough" dataset.

This work is now being extended to improve the answer selection step: in order to improve the accuracy of the ViEA system for subsequent queries, users can rate each answer based on their satisfaction after it is given.

REFERENCES

[1] Phuong Le-Hong, Duc-Thien Bui, "A Factoid Question Answering System for Vietnamese," Companion of The Web Conference, 2018.
[2] Toan Pham Van, Ta Minh Thanh, "Vietnamese News Classification based on BoW with Keywords Extraction and Neural Network," 21st Asia Pacific Symposium on Intelligent and Evolutionary Systems, 2017.
[3] Ahmad Mazyad, Fabien Teytaud, Cyril Fonlupt, "A Comparative Study on Term Weighting Schemes for Text Classification," The Third International Conference on Machine Learning, Optimization and Big Data, 2017.
[4] Dat Quoc Nguyen, Dai Quoc Nguyen, Bao Son Pham, "Ripple Down Rules for Question Answering," Semantic Web 8(4), pp. 511-532, 2017.
[5] Alejandro Figueroa, "Automatically generating effective search queries directly from community question-answering questions for finding related questions," Expert Systems With Applications 77, pp. 11-19, 2017.
[6] Xuelian Deng, Yuqing Li, Jian Weng, "Feature Selection for Text Classification: A Review," Springer Science+Business Media, LLC, 2018.
[7] Zhang S, Li X, Zong M, Zhu X, Wang R, "Efficient kNN classification with different numbers of nearest neighbors," IEEE Transactions on Neural Networks and Learning Systems, 2017.
[8] Tang B, Kay S, He H, "Toward optimal feature selection in naive Bayes for text categorization," IEEE Transactions on Knowledge and Data Engineering 28(9), pp. 2508-2521, 2016.
[9] Tang J, Alelyani S, Liu H, "Feature selection for classification: a review," Data Classification: Algorithms and Applications, p. 37, 2014.
[10] Wanfu Gao, Liang Hu, Ping Zhang, Jialong He, "Feature selection considering the composition of feature relevancy," Pattern Recognition Letters 112, pp. 70-74, 2018.
[11] Boris Galitsky, "Matching parse thickets for open domain question answering," Data & Knowledge Engineering 107, pp. 24-55, 2017.
[12] Thien Khai Tran, Tuoi Thi Phan, "Computing Sentiment Scores of Verb Phrases for Vietnamese," The 2016 Conference on Computational Linguistics and Speech Processing, 2016.
[13] Wanpeng Song, Liu Wenyin, Naijie Gu, Xiaojun Quan, Tianyong Hao, "Automatic categorization of questions for user-interactive question answering," Information Processing and Management, pp. 147-156, 2011.
[14] Huy Vu Nguyen, Dang Tuan Nguyen, "Application of First-Order Logic Inference in Vietnamese Question Answering System," 6th International Conference on Intelligent Systems, Modelling and Simulation, 2015.
[15] Quan Hung Tran, Nien Dinh Nguyen, Kien Duc Do, Thinh Khanh Nguyen, Dang Hai Tran, Minh Le Nguyen, Son Bao Pham, "A Community-Based Vietnamese Question Answering System," 6th International Conference on Knowledge and Systems Engineering, 2014.
[16] Jun Suzuki, Yutaka Sasaki, Eisaku Maeda, "SVM answer selection for open-domain question answering," COLING '02: Proceedings of the 19th International Conference on Computational Linguistics, Volume 1, pp. 1-7, 2002.
