Abstract—Enrolment Advising (EA) for Vietnamese universities takes place at many sites in Vietnam every year, from high schools to universities. Many sessions are organized, both online and offline, and they cost a lot of human resources and money for inviting experts and printing leaflets, banners, posters, etc. In this context, we try to reduce the cost of EA through an application of machine learning: we propose an automatic university enrolment advising system based on an ensemble classifier.

… cost and human resources, and help students and their parents easily find answers to their worries about Vietnamese university enrolment every year.

The remainder of this paper is organized as follows. Section II presents the related works. Section III presents the proposed approach and Section IV presents the experimental results. Finally, the conclusions and future works are presented in Section V.
… lots of M satisfy the above conditions. For a dataset of points (x1, y1), (x2, y2), …, (xn, yn), xi is a d-dimensional vector and yi is the label of xi. Assume there are two labels, positive (yi = 1) and negative (yi = -1), and let M0 be the optimal M. According to SVM, M0 must satisfy two conditions: the two nearest points of the two respective classes lie at the same distance from M0, and this distance is the largest compared to any other M (Fig. 1: M1 and M2 contain these points, and we need to find w, b that give the largest margin).

Fig. 1. Support Vector Machine

The distance from any point (xi, yi) to M0 is calculated as follows:

\frac{|w^{T}x_i + b|}{\|w\|_2} = \frac{y_i(w^{T}x_i + b)}{\|w\|_2}    (6)

Because (wTxi + b) and yi have the same sign, the two sides of (6) are equal. The goal of SVM is to find w and b that maximize this distance at the closest points (the margin). For the points (xi, yi) lying on M1 and M2, we can rescale w and b so that yi(wTxi + b) = 1. The problem then becomes:

(w, b) = \arg\max_{w, b} \frac{1}{\|w\|_2} \quad \text{subject to} \quad y_i(w^{T}x_i + b) \ge 1    (7)

Lagrange multipliers are used to solve this constrained problem. Afterwards, the category of a new data point is determined by the function f(x) = sign(wTx + b). For a nearly linearly separable dataset, the soft-margin SVM is used.
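The Lagrange-multiplier step is not written out in the paper; for reference, the standard hard-margin dual obtained from (7) (a textbook result, not specific to this work) is:

\max_{\alpha \ge 0} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^{T} x_j \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0

with w = \sum_i \alpha_i y_i x_i, and b recovered from any support vector (a point with \alpha_i > 0).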
The results of three classifiers (KNN, Naïve Bayes and SVM) are combined by voting to select the answer. Fig. 2 shows the architecture of our ViEA system.

Firstly, the dataset is built by pairing questions with answers, where the question is a student's worry and the answer is the response of the human experts. All of them are tokenized with the pyvi library. The system accepts a question in Vietnamese natural language in the enrolment field; it is tokenized into Vietnamese words and some stop words are removed. Next, FTS is applied to find the relevant questions in the dataset and thus reduce the searching space. The output of this stage is the set of questions whose answers may be the expected answer. The system then vectorizes the user's question and the relevant questions above. Each dimension is a term wi weighted by tf×idf as follows:

(tf \times idf)_{w_i} = freq(w_i) \cdot \log\frac{N}{1 + df}    (9)

where N is the number of documents in the dataset, df is the number of documents in the dataset containing wi, and freq(wi) is the frequency of wi in a document.
(Fig. 2 shows the pipeline components: Question, Question Processing, Tokenized question, FTS, Vectors, Majority Voting, Answer Selection, Dataset, ViEA System and Enrolment Website.)

Fig. 2. ViEA System Architecture
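A direct implementation of the weighting in (9) over tokenized questions could look like the following minimal Python sketch (our own illustration; library vectorizers such as scikit-learn's TfidfVectorizer use a slightly different smoothing):

import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus_tokens, vocabulary):
    # doc_tokens: tokens of one question; corpus_tokens: list of token lists (the dataset).
    N = len(corpus_tokens)
    freq = Counter(doc_tokens)
    vec = []
    for w in vocabulary:
        df = sum(1 for d in corpus_tokens if w in d)
        # Equation (9): freq(w) * log(N / (1 + df)); the "+ 1" guards against df = 0.
        vec.append(freq[w] * math.log(N / (1 + df)))
    return vec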
The final answer is the answer of the relevant question chosen by the majority vote of these algorithms; the remaining ones are the secondary answers.

TABLE 1. THE PARAMETERS APPLIED TO THE ALGORITHMS

Algorithm                  Parameters
KNN                        K = 1
Multinomial Naïve Bayes    alpha = 0.01
SVM                        kernel = linear, C = 1e5
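For concreteness, a minimal scikit-learn sketch of the three classifiers configured with the parameters of Table 1 is given below. It assumes X_train holds the tf-idf vectors of the dataset questions and y_train the identifiers of their predefined answers; this labelling scheme is our assumption, as the paper does not spell it out.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Parameters taken from Table 1.
classifiers = {
    "knn": KNeighborsClassifier(n_neighbors=1),
    "naive_bayes": MultinomialNB(alpha=0.01),
    "svm": SVC(kernel="linear", C=1e5),
}

def fit_all(X_train, y_train):
    for clf in classifiers.values():
        clf.fit(X_train, y_train)

def votes_for(x):
    # x: a 1-by-d tf-idf row vector; each classifier proposes one answer identifier.
    return {name: clf.predict(x)[0] for name, clf in classifiers.items()}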
For example, the question is "Cho em hỏi quy định của Bộ giáo dục và đào tạo năm nay yêu cầu thí sinh phải có học lực giỏi lớp 12 mới được đăng ký xét tuyển các ngành sư phạm đúng không ạ?" (Does this year's regulation of the Ministry of Education and Training require a student to have a distinction academic rank in grade 12 in order to register for the pedagogy majors?).

It is tokenized as "Cho em hỏi quy_định của Bộ giáo_dục và đào_tạo năm nay yêu_cầu thí_sinh phải có học_lực giỏi lớp 12 mới được đăng_ký xét tuyển các ngành sư_phạm đúng không ạ?", and stop words such as "cho", "em", "hỏi", "của", "ạ", "đúng", "không", "các", "và", "mới", "được" are removed. The question becomes "quy_định Bộ giáo_dục đào_tạo năm nay yêu_cầu thí_sinh phải có học_lực giỏi lớp 12 đăng_ký xét tuyển ngành sư_phạm".
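This pre-processing step can be reproduced with the pyvi tokenizer mentioned above; the sketch below is our own illustration, with the stop-word list taken from the example:

from pyvi import ViTokenizer

# Stop words removed in the example above.
STOP_WORDS = {"cho", "em", "hỏi", "của", "ạ", "đúng", "không", "các", "và", "mới", "được"}

def preprocess(question):
    # pyvi joins the syllables of a compound word with "_", e.g. "quy định" -> "quy_định".
    tokens = ViTokenizer.tokenize(question).split()
    kept = [t for t in tokens if t.lower() not in STOP_WORDS and t not in {"?", ".", ",", "!"}]
    return " ".join(kept)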
In order to reduce the searching space, the system then uses FTS to find the questions in the dataset that are relevant to the user's question, and these questions together with the user's question are vectorized as feature representations.
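The paper does not name a particular full-text search engine; as one possible realization, the sketch below uses SQLite's FTS5 module (our assumption) to index the tokenized question/answer pairs and retrieve candidates:

import sqlite3

conn = sqlite3.connect("viea.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS qa USING fts5(question, answer)")
# The qa table is assumed to be filled with the tokenized question/answer pairs.

def relevant_pairs(tokenized_query, limit=20):
    # An OR-match over the query terms keeps every pair sharing at least one term,
    # which shrinks the search space before vectorization and classification.
    match = " OR ".join('"{}"'.format(t) for t in tokenized_query.split())
    return conn.execute(
        "SELECT question, answer FROM qa WHERE qa MATCH ? ORDER BY rank LIMIT ?",
        (match, limit),
    ).fetchall()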
Finally, KNN, Naïve Bayes and SVM are applied to these vectors, and the results are as follows.

KNN and SVM give the same answer: "với hình_thức xét tuyển dựa vào kết_quả học_tập trung_học phổ_thông, ngưỡng đảm_bảo chất_lượng đầu_vào ngành thuộc nhóm ngành đào_tạo giáo_viên trình_độ đại_học là học_sinh xếp loại học_lực lớp 12 từ giỏi trở lên. Với trình_độ cao_đẳng, trung_cấp xét tuyển học_sinh xếp loại học_lực lớp 12 từ khá trở lên." (For admission based on high-school academic results, the input quality threshold for the teacher-training majors is a grade-12 academic rank of distinction or higher at the university level, and of good or higher at the college and intermediate levels.)

Naïve Bayes answers: "a) chuyển ngành sinh_viên hệ được phép xin chuyển sang ngành đào_tạo khác của trường nếu thỏa các điều_kiện sau: đạt số tín_chỉ tích_lũy tối thiểu theo qui_định, có điểm trung_bình_tích_lũy từ 6.5 trở lên. điểm xét tuyển của sinh_viên bằng hoặc cao hơn điểm_chuẩn của ngành/chương_trình mà sinh_viên muốn chuyển đến hoặc sinh_viên thuộc diện tuyển thẳng, ưu_tiên xét tuyển. Sinh_viên không bị xử_lý kỷ_luật, được sự đồng_ý của trưởng khoa nơi chuyển đi và trưởng khoa tiếp_nhận. Chi_tiết qui_định sinh_viên liên_hệ khoa và phòng đào_tạo để được hướng_dẫn, sinh_viên các chương_trình tiên_tiến, chất_lượng cao, tài_năng liên_hệ văn_phòng các chương_trình đặc_biệt để tư_vấn theo qui_trình và qui_định riêng. b) học song bằng sinh_viên được phép học cùng lúc hai chương_trình đào_tạo để khi tốt_nghiệp nhận hai văn_bằng. c) học tín_chỉ tự chọn của ngành học khác trong quá_trình học tại trường, sinh_viên có_thể đăng_ký học các môn chuyên_ngành của các ngành học khác như là môn_học tự chọn của mình, nếu sinh_viên đảm_bảo tiến_độ học_tập theo qui_định. sinh_viên có_thể chọn các môn_học tự chọn phù_hợp với sở_thích và năng_lực của bản_thân. các môn_học tự chọn được thể_hiện trong bảng điểm tốt_nghiệp là lợi_thế cho sinh_viên khi đi xin việc. Nếu môn_học tự chọn là môn_học lần đầu, sinh_viên không phải đóng thêm học_phí cho các môn_học này vì đã được tính vào học_phí trọn_gói của sinh_viên.". In general, this answer is about the regulations for switching between majors, studying two parallel majors, and registering for elective credit courses, so it is not the expected answer.

The system then selects the answer by majority voting. For this question, KNN and SVM vote for the same answer, so that answer is selected as the main answer of the system, and the answer from Naïve Bayes becomes the secondary answer.
IV. THE EXPERIMENTS

One thousand question/answer pairs are extracted from reputable websites about enrolment to build the dataset. Due to limited time, we have prepared about another one hundred questions and their answers for verifying the accuracy manually. The F1 score obtained is 76%; most of the wrong answers correspond to questions whose answers are not in the dataset yet. We consider this a promising result for EA systems.

To illustrate the evaluation, in addition to the example in Section III, we give more examples as follows.

Question 1: "Trong tài liệu hướng dẫn ghi rõ các chứng chỉ có giá trị sử dụng đến ngày 23/06/2018. Tuy nhiên, hiện tại một số học sinh mới đang đăng ký thi, nếu chưa lấy được chứng chỉ trước ngày nộp hồ sơ thì có được đăng ký mục miễn thi ngoại ngữ hay không?" (The guideline states that the certificates are valid until June 23, 2018. However, some students are only registering for the exam now; if they have not obtained the certificates before the submission deadline, can they still register for the exemption from the foreign language exam?)

After ViEA pre-processes this question, the three algorithms predict the same result, which becomes the final answer of the ViEA system: "học_viên phải nộp đủ hồ_sơ trong thời_gian quy_định, và có_thể gia_hạn (linh_động không quá 1 tuần). Trường_hợp thời_gian bổ_sung vượt quá thời_gian thi_tuyển thì vẫn không được xét miễn thi anh văn.". Comparing this answer with the predefined answer of this question (in its question/answer pair), we mark it as a correct answer.

Question 2: "Các nguyện vọng có bình đẳng trong xét tuyển không?" (Are the aspirations treated equally in admission?). For this question, the three algorithms return three different results, and none of them is similar to the predefined answer, so we mark this as a wrong result.

TABLE 2. THE EXPERIMENT OF KNN, NAIVE BAYES, SVM AND VOTING APPROACH

                          KNN    Naïve Bayes    SVM    Voting
No. of right answers       64         66         63        76
No. of wrong answers       36         34         37        24
In the experiment, the three classifiers (KNN, Naïve Bayes, SVM) are applied to find the appropriate answer by voting. The experimental results show that KNN and SVM usually give the same results, and that the accuracy of Naïve Bayes is better than that of the others. So, when the three classifiers give three different results, the answer from Naïve Bayes is selected as a priority.

Table 2 gives the statistics (numbers of right and wrong answers) for KNN, Naïve Bayes and SVM independently and for the proposed voting approach. It shows that the voting approach is better than the individual classifiers, with the largest number of right answers.
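A minimal sketch of this selection rule (majority vote, falling back to the Naïve Bayes answer when all three classifiers disagree):

from collections import Counter

def select_answer(knn_ans, nb_ans, svm_ans):
    majority, count = Counter([knn_ans, nb_ans, svm_ans]).most_common(1)[0]
    if count >= 2:
        return majority  # at least two classifiers agree
    return nb_ans        # three different answers: prefer Naive Bayes by priority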
V. CONCLUSIONS AND FUTURE WORKS

In this paper, we proposed an approach based on multiple classifiers for university enrolment advising. The dataset is constructed by pairing questions with answers. When a new question is raised, Vietnamese word segmentation is applied and stop words are removed. Then, FTS is performed to find the questions relevant to the user's question. Finally, the system applies three classifiers (KNN, Naïve Bayes and SVM) and votes to select the appropriate answer. The proposed approach can answer complex questions, as shown in the experimental results. In fact, this method can be applied in other domains; the main drawback is the need for a "big enough" dataset.

This work is now being extended to improve the answer selection step: in order to improve the accuracy of the ViEA system for subsequent queries, users can rate each answer based on their satisfaction after it is given.