Arabic Text Classification: The Need For Multi-Labeling Systems

This article discusses Arabic text classification using both single-label and multi-label systems. It constructs two large datasets of Arabic news articles, one with 90k single-labeled articles across 4 domains and another with 290k multi-labeled articles across 21 labels. Various shallow and deep learning classifiers are tested on the datasets. For single-label classification, SVM achieved the highest accuracy of 97.9%. For multi-label classification, a custom convolutional gated recurrent unit (CGRU) network achieved the best accuracy of 94.85%, demonstrating the need for multi-label systems for Arabic text classification.

Neural Computing and Applications (2022) 34:1135–1159

https://doi.org/10.1007/s00521-021-06390-z

ORIGINAL ARTICLE

Arabic text classification: the need for multi-labeling systems


Hozayfa El Rifai1 · Leen Al Qadi1 · Ashraf Elnagar1

Received: 27 August 2020 / Accepted: 26 July 2021 / Published online: 1 September 2021
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2021

Abstract
The process of tagging a given text or document with suitable labels is known as text categorization or classification. The
aim of this work is to automatically tag a news article based on its vocabulary features. To accomplish this objective, 2
large datasets have been constructed from various Arabic news portals. The first dataset consists of 90k single-labeled
articles from 4 domains (Business, Middle East, Technology and Sports). The second dataset has over 290k multi-tagged
articles. To examine the single-label dataset, we employed an array of ten shallow learning classifiers. Furthermore, we
added an ensemble model that adopts the majority-voting technique of all studied classifiers. The performance of the
classifiers on the first dataset ranged between 87.7% (AdaBoost) and 97.9% (SVM). Analyzing some of the misclassified
articles confirmed the need for multi-label, as opposed to single-label, categorization for better classification results. For the
second dataset, we tested both shallow learning and deep learning multi-labeling approaches. A custom accuracy metric,
designed for the multi-labeling task, has been developed for performance evaluation along with the Hamming loss metric.
Firstly, we used classifiers that were compatible with multi-labeling tasks such as Logistic Regression and XGBoost, by
wrapping each in a OneVsRest classifier. XGBoost gave the higher accuracy, scoring 84.7%, while Logistic Regression
scored 81.3%. Secondly, ten neural networks were constructed (CNN, CLSTM, LSTM, BILSTM, GRU, CGRU, BIGRU,
HANGRU, CRF-BILSTM and HANLSTM). CGRU proved to be the best multi-labeling classifier scoring an accuracy of
94.85%, higher than the rest of the classifiers.

Keywords Arabic text classification · Single-label classification · Multi-label classification · Arabic datasets ·
Shallow learning classifiers · Deep learning classifiers

✉ Ashraf Elnagar
ashraf@sharjah.ac.ae
Hozayfa El Rifai
u16103377@sharjah.ac.ae
Leen Al Qadi
u16103630@sharjah.ac.ae

1 Department of Computer Science, University of Sharjah, Sharjah, UAE

1 Introduction

Large numbers of repositories live online as a result of the heavy usage of Web 2.0 and the
Internet, which leads to a need and demand for automatic classification methods. Almost 80% of
data is textual and unstructured but considered an extremely valuable and rich source of
information. Machine learning algorithms have proved helpful in cutting the time needed to extract
insights and organize massive chunks of data.

One of the fundamental tasks in natural language processing (NLP) is text classification. It is
used to assign labels to textual data based on its context. Automating the process simplifies
classifying documents, helps in standardizing the platform, and makes searching for specific
information straightforward and possible.

Manually classifying documents by experts is not as efficient as it used to be due to their
increasing amount. This is where machine learning algorithms come into play, as an alternative to
conventional ways. They produce faster and more fruitful results. Several examples and
applications of text classification have been explored such as language identification [38],
dialect identification [19], sentiment analysis [16, 17, 22–24] and spam filtering [2, 37].

Structuring data using machine learning is becoming essential in the business field. It helps in
detecting new
patterns and trends and in identifying relationships between seemingly unrelated data. For
example, marketers can gather and review keywords used by other firms in the field. For Arabic
NLP, this is still a challenging task among others [29]. Arabic is a very complex, morphologically
derived language, and it is the mother tongue of over 300M people. Research in the field of Arabic
computational linguistics has increased impressively in the last decade, but the work still has
room to expand and branch out.

Arabic has been reported by the Internet World Stats to be the 4th most popular language online,
with over 225 million users, representing 5.2% of all Internet users as of April 2019. They also
show that it has the highest growth rate in the number of users in the last 19 years, achieving
8,917.3%.

Working with Arabic text is different from working with English text. It is more challenging for
several reasons, which include the following: (1) Forms: Arabic has three different forms
(Classical, Modern Standard Arabic (MSA), and Dialectal), (2) Vocabulary size: the Arabic language
has around 12.3 million words compared to 600,000 words in English, (3) Alphabet set: the
character set has 28 consonants and 8 vowels. Besides, in cursive script several characters have
different shape-forms. Arabic text is written from right to left, (4) Grammar: Arabic assigns
words, verbs, and pronouns to a gender. It also has singular and plural forms for both male and
female. Further, conjugated verbs in Arabic are different, (5) Vowels: a significant difference is
in the quality and length of the vowels. Arabic generally uses diphthongs and long vowels as
in-fixes, and (6) Sentence structure: Arabic has verbal and nominal sentences. A nominal sentence
does not require a verb.

In this work, we first introduce a newly built single-labeled dataset of Arabic news articles,
collected from several portals to aid our research. Several classifier models are trained to
predict the single class a given article belongs to. Moreover, a voting classifier has been
implemented, considering the best predicting classifiers, with the highest accuracy percentage.

An Arabic news article single-labeling classifier extracts the linguistic features from the text
using the TF-IDF technique. In the training phase, each article is turned into a feature vector,
and then the system identifies the most common features under each category. This way, when the
classifier encounters a new article, it will attempt to predict the relevant category based on its
feature vector.

We propose a multi-class classifier to label an Arabic news article under an appropriate class out
of 4 classes. We used a supervised approach for text classification. We tested different
vectorization methods to seek the best, and to see the effect they have on the accuracy
percentages. In addition, we investigated the possibility of using a customized list of stop words
as a replacement for the NLTK list.

Looking at the articles misclassified by the classifying system, we decided to construct a new
Arabic multi-labeled dataset for the purpose of assigning multiple labels to the articles instead.
Two approaches were tested. First, we implemented two classical classifiers made compatible with
the task of multi-labeling. Second, we built ten neural networks with unique architectures to test
the effectiveness of deep learning techniques. For the first approach, we used the TF-IDF
technique for feature extraction. Both classifiers need to be wrapped in a OneVsRest classifier,
to convert the classification problem into sub-problems. For the second approach, we used the
tokenizer provided by Keras before training.

We offer a multi-label multi-class text categorization system that is capable of assigning an
article multiple labels out of 21 labels. We evaluate and compare the two approaches using the
custom accuracy metric specific to the multi-labeling system, along with Hamming loss scores. This
work is an extension of our work [4, 5] on single-label classification.

To summarize, in this work, we propose two new large datasets for Arabic news article tagging. One
dataset is dedicated to single-label classification, while the other dataset is dedicated to
multi-label classification. Both datasets are new to the Arabic computational linguistics field
and shall serve the need for such rich datasets. Furthermore, we demonstrate the validity and
efficacy of these proposed datasets by studying the performance of several shallow as well as deep
learning models to classify Arabic text. This is a comprehensive study on the task of Arabic news
classification, which is needed to fill this research gap. In conclusion, the contributions of the
work are:

● Two large Arabic datasets, which comprise news articles spanning several topics, are properly
annotated for Arabic text classification. The datasets shall be made available for researchers in
the Arabic natural language processing field.
● A rigorous investigation of several shallow and deep learning classification models is carried
out for the Arabic text classification task to choose the best models.
● Considerable experiments are conducted to confirm the fitness of the proposed datasets and the
classification models.
● Fine-tuning of the models as well as the utilization of word embedding is performed to achieve
solid performance.

The paper is organized as follows: The literature review is presented in Section 2. Section 3
describes the datasets.
Sections 4 and 5 detail the proposed classification process. The results and discussion are
presented in Section 6. Finally, we present our conclusions in Section 7.

2 Literature review

Many surveys and papers shed light on the different approaches used for English text
categorization and discuss existing literature [1, 18, 36]. For Arabic text classification,
surveys like [7, 33] also exist. Some researchers investigated the classification task on other
languages such as Portuguese. In [28], they used an SVM classifier on an English dataset and on a
Portuguese dataset. They found that the Portuguese language required paying more attention to the
document representation, such as semantic/syntactic information and word order.

Arabic text classification research and the goal to enrich the Arabic corpus are slowly becoming a
priority in the research community. In [31], the authors believe that many of the available
datasets are not appropriate for classification, either because the classes are not defined well,
or because there are no defined classes at all, as in the 1.5 billion words Arabic Corpus [11].
The authors also introduce 'NADA', a new filtered and preprocessed corpus that combines the
already existing corpora DAA and OSAC. 'NADA' contains 13,066 documents belonging to 10 categories
in total. With regard to the number of labels, we believe that the corpus is too small.

Recent research papers focusing more on Arabic text classification (ATC) are emerging. The author
in [20] used a dataset of articles collected from aljazeera.com to compare the performance of 6
different classifiers. Under the same environmental settings, Naive Bayes was the best classifier,
regardless of feature selection methods.

Many papers experiment with feature selection methods. In [8], they studied the effect of using
uni-grams and bi-grams, experimenting with the KNN classifier. In [39], they reported that using
an SVM classifier for ATC outperforms other classifiers. An experiment on 4 classifiers using 2
feature selection techniques (information gain and chi-squared) was conducted on a BBC Arabic
dataset in [41]. Lastly, in [31], the authors present a new feature selection method for ATC, and
it outperforms five other approaches when tested using the SVM classifier.

Many authors reported results of supervised classical machine learning algorithms such as NB
[12, 14, 21, 40], SVM [3, 12, 27, 32], Decision Tree [3, 30, 42], and KNN [14, 32].

Several others preferred to work with deep learning techniques and experimented with neural
networks, as in [9, 10, 15, 25], to tackle the single-label classification problem. Authors
pursuing better results using a different approach, such as [15], have used a convolutional neural
network for Arabic text classification and achieved better results than Logistic Regression and
SVM. In [10], the authors showed that using feature reduction techniques with an ANN model
achieves higher results than a basic ANN model. An extended work to address the multi-labeling
classification task using a variety of deep learning models is studied in [26]. In this work, we
present a more comprehensive study by including classical machine learning algorithms, which
produce superb results as well.

All the above references worked on the single-labeling task of Arabic text. Nonetheless, the need
for integrating multi-labeling is becoming essential. A large share of news articles span more
than one major topic. For example, a news article that talks about COVID-19 (medical domain) and
its impact on the economy should be tagged with both labels rather than only one. Multi-labeling
would resolve the intersection of multiple domains instead of just selecting one. In fact, more
electronic news portals are tagging each news article with multiple tags (keywords). This process
is usually carried out by humans. Therefore, the need for an automated tagging system is becoming
a necessity. While the multi-labeling task is well researched for the English language (for
example, see [13, 44]), it is under-researched for the Arabic language. This work helps to bridge
this gap in the Arabic computational linguistics field. In the sequel, we describe a few studies
on this task for Arabic.

Shehab et al. [43] investigated the multi-label classification task using three machine learning
classifiers, which are Decision Trees (DT), Random Forest (RF), and KNN. The results show that DT
outperforms the other 2 classifiers. This is a limited study; there are more robust classifiers
that can outperform DT, as we show in our work. Hmeidi et al. [34] used a lexicon-based system to
classify Arabic documents. The dataset has 8,800 multi-label documents collected from BBC Arabic.
Several single-label and multi-label lexicons were produced to tackle the problem. The dataset is
relatively small to handle multi-label tagging effectively. Besides, scalability is a major
concern with lexicon-based methods. Both of these works used the Hamming loss metric as an
evaluation metric, along with precision and recall.

Al-Salemi et al. [6] proposed a new dataset gathered from the RT-news (RTANews) website for
multi-labeling of Arabic news articles. They explored 4 transformation-based algorithms: Binary
Relevance, Classifier Chains, Calibrated Ranking by Pairwise Comparison and Label Powerset. They
used 3 classifiers, namely SVM, KNN, and RF. They reported that RF and SVM produced the best
results. However, 87% of the documents in the dataset are tagged with a single label. As a result,
the dataset is
biased toward single-label rather than multi-label classification. This shortcoming would heavily
impact the performance of the proposed algorithms on another, balanced dataset.

It is clear that the accuracy and the general performance are highly dependent on the quality of
the collected data and on the feature representation method. The more redundant features we have,
the less accurate the classification is. Therefore, we introduce new rich and representative
datasets for treating the problem of both single-label and multi-label Arabic document
classification. We truly believe that the datasets would serve as benchmarks. In contrast with the
existing research works on this task, we provide a thorough examination of several shallow and
deep learning algorithms to robustly solve the automatic tagging of Arabic news articles.

3 Datasets

We visited 7 popular news portals (arabic.rt.com, youm7.com, cnbcarabia.com, beinsports.com,
arabic.cnn.com, skynewsarabic.com and tech-wd.com) to collect the articles from. We used the
Python Scrapy library to scrape the articles. The single-labeled dataset contains 89,189 Arabic
documents (approximately 32.5 million words). The articles of the dataset are categorized under
four main classes: Sports, Middle East politics, Business and Technology.

All the collected articles are written in Modern Standard Arabic (MSA), with no dialects. The
articles are grouped in one corpus.

Table 1 and Fig. 1 describe the distribution of the articles under the 4 categories in the
dataset. On average, we scraped around 22k articles for each category. We made sure to avoid bias
by constructing a balanced dataset.

Fig. 1 Some statistics on the single-label dataset categories

Table 1 Number of documents collected from 7 news portals
Websites          Classes       Articles count
Sky News Arabia   Sports        7923
CNN Arabia        Sports        3800
                  Tech          1680
                  Middle East   21,516
                  Business      3908
Bein Sports       Sports        6603
Tech-wd           Tech          23,682
Arabic RT         Business      896
Youm7             Business      14,478
CNBC Arabia       Business      4653

As for the multi-label dataset, we collected it using Python (Scrapy, Selenium and BeautifulSoup)
from ten different websites listed in Table 2. The articles in this dataset all belong to one of
the 4 classes: [Middle East, Business, Technology, Sports], in addition to hundreds of tags. It
consists of 293,363 multi-tagged articles written in MSA. Figure 2 shows the distribution of the
articles in the main categories.

Table 2 Scraped news portals for the multi-label dataset
Websites scraped
CNBC Arabia       Bein Sports
CNN Arabia        Tech-wd
Masrawy           aitnews
Youm7             Arabic RT
Al Arabiya        SkyNewsArabia

4 Single-label classification

4.1 Text features

Machine learning algorithms cannot process text directly, and to solve this, we represent the text
in numerical vectors. Words of the articles represent categorical features, and each sentence will
be represented by one vector. This process is called vectorization. Two of the most common
techniques used in text vectorization are countVectorizer and TF-IDFVectorizer. Both of these
vectorizers are used to represent textual data in vector format. While countVectorizer keeps track
of the number of tokens (i.e., features) encountered in a document, TF-IDFVectorizer
stores the weighted frequency of each token with respect to the document. TF-IDFVectorizer is
favored over countVectorizer as the latter is biased toward the most frequent tokens, as opposed
to low-frequency features that may be key features in determining the document genre. To overcome
this problem, we adopted TF-IDFVectorizer, which computes the relative frequency of each feature
in each document. This vectorizer computes the most common features which could identify the main
topics of the document. However, as it is based on the bag-of-words (BoW) concept, it does not
capture the semantics when compared to other models such as word embeddings. We conducted an
experimental comparison between the two vectorizer methods. We used a subset of our single-labeled
dataset, containing 40k articles, classified under three categories. The comparison involves using
each vectorizer as the feature selection technique, whose output is fed to the same classifier
(SVM) to determine the document genre (single-label) for all documents in the dataset. Table 3
shows that, for most classifiers (SVM included), the TF-IDF vectorizer produced accuracy scores at
least as high as those of the countVectorizer; LR and MNB are the exceptions.

Fig. 2 Some statistics on the multi-label dataset categories

Table 3 TF-IDF vectorizer versus count vectorizer: performance evaluation
Algorithms   TF-IDF vectorizer   Count vectorizer
LR           96.4                97.3
SVM          97.5                97.0
DT           92.4                91.7
MNB          91.1                96.8
XG           91.2                91.2
KN           95.0                69.9
RF           95.1                94.5

Tokens that appear very frequently have less of an impact when being represented by the term
frequency-inverse document frequency. This vectorizer is made up of two components:

● Term Frequency (TF): computes how many times a word appears in a given document, then adjusts
the frequency taking into consideration the length of the document.
● Inverse Document Frequency (IDF): computes how common or rare a word is in the entire article
set. If a word appears many times and is common, the score approaches 0; otherwise, it
approaches 1.

After that, we conducted another comparison using a custom-made list of stop-words instead of the
built-in list. In fact, we adopted the customized list as it reported better results. The general
flow of operations of the proposed system is described in Fig. 3.

4.2 Selected classifiers

Several different supervised classifiers are used for text classification, where the main purpose
is to tag an input text with the best representative label. We studied and observed the
performance of ten shallow learning models, in addition to the ensemble classifier. Next, we
describe all implemented algorithms:

● Logistic Regression (LR): this is a predictive model. It is a statistical learning technique
used for the task of classification. Even though the name of the classifier has the word
'Regression' in it, it is used to produce discrete binary outputs.
● Multinomial Naïve Bayes (MNB): this classifier estimates the probability of each class-label,
based on Bayes theorem, for some text. The result is the class-label with the highest probability
score. MNB assumes the features are independent, and as a result, all features contribute equally
to the computation of the predicted label.
● Decision Tree (DT): DT is basically a tree, where nodes represent features and leaves are the
output labels. Branches indicate decisions, and whenever a decision is answered, a new decision
will be inserted recursively until a conclusion is made. Recursion is used to partition the tree
into several decisions with possible results.
● Support Vector Machines (SVM): SVM is a very prevalent supervised classifier. It is
non-probabilistic. SVM uses hyperplanes to segregate labels. SVM supports linear and nonlinear
models. Basically, each hyperplane is expressed by the input documents (vector)
$x$ satisfying $w \cdot x - b = 0$, where $w$ is the normal vector to the hyperplane, and $b$ is
the bias.
● Random Forest (RF): RF is a supervised learning-based classifier. This ensemble model utilizes a
set of decision trees, which compute the resulting label in aggregate. Strictly, the input is the
documents $x_1, x_2, \ldots, x_n$ with their matching labels $y_1, y_2, \ldots, y_n$. Each
decision tree $f_b$ is trained using a random sample $(X_b, Y_b)$, where $b$ ranges from 1 to the
total number of trees. The forecasted label is computed by a majority vote of all used trees.
● XGBoost Classifier (XGB): XGB is another supervised classifier. This robust classifier became
popular as a result of winning several Kaggle contests. Similar to RF, it is an ensemble
classifier made of decision trees and a variant of the gradient boosting algorithm.
● Multi-layer Perceptron (MLP) Classifier: MLP is comprised of at least 3 layers of neuron nodes
(input layer, hidden layers, and output layer). As for general neural networks, each node is
connected to nodes in the subsequent layer. MLP utilizes a non-linear activation function to
produce the resulting label.
● KNeighbors Classifier (KNN): the KNN classifier determines the neighbors of an input document.
The predicted class label is collectively computed by all determined neighbors, where each one
votes for the closest label. The class-label with the maximum ballots is adopted. This is another
majority-vote classifier.
● Nearest Centroid Classifier (NC): for NC, each class-label is represented by the centroid of its
data points. Given an input document, the class-label is computed based on the training examples
whose mean (centroid) is closest to the input document.
● AdaBoost Classifier (ADB): ADB is a meta-estimator that starts by fitting a classifier on the
training set of documents. Next, ADB fits additional copies of the classifier on the training
dataset, but after adjusting the weights of misclassified documents such that succeeding
classifiers attend to problematic cases (see footnote 1).
● Ensemble/Voting Classifier (VC): VC is basically an ensemble solution. VC packages all preceding
classifiers. Majority voting is utilized for predicting the final class label.

Fig. 3 Generic system flow-diagram

1 https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

5 Multi-label classification systems

5.1 Classical classifiers

We selected 2 classical classifiers: [OneVsRestLogisticRegression, and OneVsRestXGBoost]. The
OneVsRest Classifier decomposes the multi-label problem into multiple independent binary
classification problems (one per label). Both the LogisticRegression classifier and the XGBoost
classifier were each wrapped inside a OneVsRestClassifier. We used the TF-IDF technique to
vectorize the articles, and we chose to keep the default hyperparameters for each classifier. To
encode the labels, we used MultiLabelBinarizer(), which returns the string labels assigned to each
article in a one-hot style binary indicator format (one column per label).
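
The OneVsRest wrapping described in Section 5.1 can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' code: the toy documents, the tag names and the variable names are hypothetical, English placeholder strings stand in for Arabic articles, and only the LogisticRegression variant is shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import hamming_loss
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy corpus: placeholder strings stand in for Arabic articles.
docs = [
    "stocks market oil prices",
    "football match league goal",
    "market sponsors football league",
    "phone chip software launch",
    "software startup market funding",
    "goal tracking chip football",
]
# Hypothetical tag lists; an article may carry more than one tag.
tags = [
    ["business"],
    ["sports"],
    ["business", "sports"],
    ["tech"],
    ["business", "tech"],
    ["sports", "tech"],
]

# Encode the tag lists as a binary indicator matrix (one column per label).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# TF-IDF features with default hyperparameters.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# One independent binary LogisticRegression is fitted per label.
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)

# Predictions come back in the same indicator format as Y.
Y_pred = clf.predict(X)
print(list(mlb.classes_))
print("Hamming loss on the toy training set:", hamming_loss(Y, Y_pred))
```

The XGBoost variant would presumably follow the same pattern, with an XGBoost classifier passed to OneVsRestClassifier in place of LogisticRegression. The Hamming loss printed here is computed on the training data only, so it says nothing about generalization; it is included to show how the metric is applied to indicator matrices.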

5.2 Deep learning classifiers sequential data such as text classification and time series
forecasting. This is different from RNN since they can
Artificial neural networks are computational models that remember longer sequences for long periods of time.
are designed to behave similarly to the functionality of a LSTM has three types of gates: the input, the output gates
human brain from analyzing information to deciding and the update gates. These gates determine what data to
actions. They are implemented to solve problems that are keep and what data to throw away.
impossible or hard to solve by human or mathematical
Gated Recurrent Unit (GRU), although both units tackle
standards. Traditionally, a neural network consists of
the vanishing gradient problem, GRUs are distinguished
plenty of nodes that work in parallel with each other,
from LSTMs by possessing a hidden state (memory). GRU
consisting of a layer. These layers are then connected to
utilizes 2 gates: an update gate and reset gate. The reset
make up the network. Deep neural networks are neural
gate determines the amount of past information to forget,
networks with different numbers of layers between both
while the update gate decides what information to keep and
layers, the input layer and the output layer. These addi-
what to not.
tional layers are called hidden layers. Their main objective
is to add more computation to the network to solve com- Hierarchical Attention Network (HAN) is extremely good
plicated tasks. These three main layers are connected to at predicting the class of a given document, because it
form a DNN model. These networks can be tweaked to recognizes its structure. A document is constructed by first
perform well on images, audio and textual data. encoding the words, then encoding the sentences to form a
In our work, we use each of convolution neural networks document representation. An attention mechanism is also
and recurrent neural networks models for the multi-labeled used in a HAN, to make sure that the context of a word or a
text classification. Overall, we designed ten different sentence is accounted for. Words and sentences also differ
models. These models are described below. in importance in regard to the main message of the
document.
Convolution neural networks (CNN) are unique classes of
neural network models that are designed to identify hidden
5.3 Proposed deep learning models
patterns and relationships in large data sets. The role of
CNNs is to learn different features utilizing many filters.
In our work, we developed ten deep learning models. We
These features are then taken down further to learn high-order features. CNN uses a pooling layer whose main function is to reduce the spatial dimensions by performing downsampling. Applying a max-pooling layer with filters of size 2x2 is the most common approach in pooling. Here, we perform a pooling operation that returns the maximum value on a window of size 2x2. Another type, called average-pooling, works similarly to max-pooling; however, it returns the average value of the window instead.

Recurrent neural networks (RNN) RNN models are more powerful in use cases in which context and time are critical. Backpropagation enables looping information back through the network, which empowers RNN models to process sequential data. This helps in finding correlations between dependent variables. In our work, we combine CNN and RNN to generate a new CRNN architecture. The new architecture applies a spatial dropout to the embedding layer, followed by a Conv1D layer and an RNN layer; we then add a global max pooling layer followed by a dropout layer to apply regularization. Finally, we have a dense layer followed by the output layer. Next, we describe the most common models of RNN.

Long short-term memory (LSTM) can be constructed by using a number of LSTM units to make up the network. They are applicable for different tasks that require

experimented the combination of both CNN and RNN in addition to CRF-CNN have been used to develop many models on different tasks. These new approaches have shown better results compared to the traditional machine learning models.

Each of these models is composed of an input embedding layer, multiple hidden layers, and a dense output layer. The embedding layer helps capture the semantic relationships between words, so it assigns similar words similar representations.

An embedding layer is internally used to convert a sparse one-hot encoded matrix to a dense vector space. This reduces the computational complexity and so the training time. A word2vec word embedding was first trained on our dataset and used in our experiment; however, it was producing poor results. Thus, we dropped it and kept using the Keras tokenizer. The size of the input is set to 200 words, and this was obtained by trying different sizes. We then design different types of deep neural network and use the best design in our work.

For the last dense layer, which is the output layer, we want multi-labels as an output, so we use a binary crossentropy loss function along with the sigmoid activation function. Its output dimension is equal to the number of categories available.
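The multi-label output layer described in this section (one sigmoid unit per category, trained with binary cross-entropy) can be illustrated with a small pure-Python sketch. This is not the authors' Keras code; the logits, the three-label setup, and the helper names `sigmoid` and `binary_crossentropy` are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, applied independently to each label's logit."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the labels of one sample."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical logits produced by the last dense layer for three labels.
logits = [2.0, -1.0, 0.0]
probs = [sigmoid(z) for z in logits]

# Hypothetical ground truth: the article carries labels 0 and 2.
y_true = [1, 0, 1]
loss = binary_crossentropy(y_true, probs)

# Each label is decided independently, so several can be active at once.
predicted = [int(p >= 0.5) for p in probs]
print(probs, loss, predicted)
```

Because every label has its own sigmoid, the probabilities need not sum to one, which is what allows an article to receive several tags at once.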

1142 Neural Computing and Applications (2022) 34:1135–1159

● CNN: The CNN model consists of a spatial dropout layer, followed by one CNN layer with a kernel size of 3 and 1024 filters, followed by a global max pool layer and a dropout layer.
● RNN: We used both LSTM and GRU models. The LSTM model consists of three layers, while the GRU model consists of four layers. These settings were chosen by testing different implementations of these models and picking those that gave the best accuracies. Both models are improved versions of the standard RNN, which solve the vanishing gradient problem.
● BIRNN: For an enhanced performance on sequence classification problems, both RNN models are wrapped by a Bidirectional wrapper. The data are fed to the learning algorithm once from the beginning to the end, and once in reverse. Running the input bidirectionally will result in the network understanding the context better. In the forward run, both LSTM and GRU preserve information from the past, and in the backward run, future information is preserved.
● Bi-LSTM-CRF: Adding a CRF layer to a bidirectional neural network has been proven to be very effective for sequence labeling tasks. Since the CRF model is a unidirectional model, providing the sequence in both directions will decrease the ambiguity of the words in the sequence.
● CNN + RNN: We generate a new CRNN architecture by combining CNN and RNN layers. The new architecture applies a spatial dropout to the embedding layer, followed by a Conv1D layer and an RNN layer; we then add a global max pooling layer followed by a dropout layer to apply regularization. Finally, we have a dense layer followed by the output layer.
● HAN: An additional layer, called the attention layer, is applied after the RNN models. Along with solving the long-term memory issues with RNNs, the output sequence generated will be conditional on selective items in the input sequence. The hierarchical attention network is extremely good at predicting the class of a given document, because it recognizes its structure. A document is constructed by first encoding the words, then encoding the sentences to form a document representation. An attention mechanism is also used in a HAN, to make sure that the context of a word or a sentence is accounted for. Words and sentences also differ in importance with regard to the main message of the document.

6 Experimental results and discussion

6.1 Single-label classification

The objective is to put 11 classifying models to the test and determine how successfully they classify single-label Arabic news articles. We will perform single-label classification on a subset of our single-labeled dataset. After that we will compare the performance of the same models on a recently reported Arabic dataset 'Akhbarona' [25, 26] that contains seven categories in total. In this experiment, we used an 80%/20% split for training and testing. The training set consists of 71,707 articles, while the testing set contains 17,432 articles.

We calculate the accuracy score to evaluate the performance of the classifiers. The accuracy score is the total number of correctly classified samples over the total number of samples. From the training set, we extracted about 344k features. Additionally, pre-processing of input documents is performed in order to remove all the non-Arabic characters. When dealing with textual data collected from the web, it is highly advised to use this method. We proceed further by cleaning the scraped articles and erasing all the elongation, digits, punctuation, isolated chars, Latin letters, Qur'anic symbols, and other marks that were possibly included.

We believe that applying normalization on the collected text is not a necessary step, even though the majority of research works on Arabic NLP tasks do implement normalization. There are enough samples provided in the dataset to represent the Arabic character set. It is worth noting that in some cases, the normalization step can change the semantics of some words. Normalization, which is a widely adopted practice in Arabic computational linguistics, is the process of unifying the orthography of some Arabic characters. Namely, alif forms [آ، إ، أ، ا] to [ا], hamza forms [ء، ؤ، ئ] to [ء], haa/taa marbootah [ة، ه] to [ه], and yaa/alif maqsura [ي، ى] to [ى]. The normalization step is meant to reduce the vocabulary space. However, this process may lead to losing some key features, as the meaning of some words would change after normalization. For example, the word "فأر" means "mouse", while "فار", its form after normalization, means "escaped"; similarly, "كرة", which means "football", becomes "كره" after normalization, which means "hatred". Such meaning-change could result in dropping some important features. In addition, with a large corpus such as the ones proposed in this work, there is no need for this pre-processing step. In fact, the results show that the non-normalized word representation is not seriously hampered by the lack of text normalization.
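The effect of orthographic normalization on word meaning, as discussed in Sect. 6.1, can be reproduced with a short sketch. The mapping below follows the normalization rules listed in the text; the name `NORMALIZATION_MAP` and the function `normalize` are illustrative assumptions rather than part of the authors' pipeline, which deliberately skips this step. Characters are written as Unicode escapes to avoid ambiguity.

```python
# Character mapping following the normalization rules described in the text.
NORMALIZATION_MAP = str.maketrans({
    "\u0622": "\u0627",  # alif with madda       -> bare alif
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0624": "\u0621",  # waw with hamza        -> hamza
    "\u0626": "\u0621",  # yaa with hamza        -> hamza
    "\u0629": "\u0647",  # taa marbootah         -> haa
    "\u064A": "\u0649",  # yaa                   -> alif maqsura
})

def normalize(text):
    """Apply the orthographic normalization to a string."""
    return text.translate(NORMALIZATION_MAP)

mouse = "\u0641\u0623\u0631"     # "mouse"
escaped = "\u0641\u0627\u0631"   # "escaped"
football = "\u0643\u0631\u0629"  # "football"
hatred = "\u0643\u0631\u0647"    # "hatred"

# After normalization, two distinct words collapse into the same token.
print(normalize(mouse) == escaped)    # True
print(normalize(football) == hatred)  # True
```

Each collision merges two vocabulary entries with unrelated meanings, which is the feature loss the text warns about.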
To implement the classifiers, we used Scikit-learn and
kept using the default hyper-parameters in addition to L1


penalty for some of the models. We used the testing set to test the proposed classifiers. The high accuracy scores indicate how robust the proposed system is based on the tuned hyper-parameters.

Figure 4 demonstrates the accuracy scores that were obtained by the classifiers. The average accuracy score was 94.8%. The best result, 97.9%, was achieved by the SVM classifier. On the other hand, the worst result, which is 87.7%, was produced by the AdaBoost classifier. Furthermore, close results that range from 97.5% to 97.9% were produced by four classifiers.

The two classifiers, MultinomialNB and KNeighbors, produced accuracies of 96.3% and 95.4%, respectively, while the remaining classifiers produced scores that range from 87.7% to 94.4%, which is below the average. Figure 5 displays the confusion matrices for the best and worst classifiers, which are SVM and AdaBoost, respectively.

Furthermore, Table 4 states that SVM scored the highest F1-score, 98%, similar to the majority voting classifier, while the lowest score of 88% was produced by the AdaBoost classifier. We also include ROC, Hamming score, F1-score, precision, and recall metrics. We demonstrate the robustness of the best classifier, SVM, by showing the results of prediction. Figure 6 shows an article, from the testing set, which is originally tagged as 'Technology'. The SVM model was able to classify the article under the same category, with an accuracy of 95.7%. We examined a sample of the incorrectly classified documents from the testing dataset in order to recognize the reason behind the misclassification. We realized that several documents can be argued as incorrectly classified. Figure 7 shows an article that was initially classified under 'Technology'. After reading the article, we concluded that it belongs to the 'Business' class as well, which is in line with the prediction of SVM. This is a strong indicator of the robustness of the SVM classifier. In fact, this motivates the need for multi-label text classification as single-label classification is insufficient.

To further verify the performance, we tested the classifiers on 'Akhbarona' [25], which is an unbalanced dataset that includes 46,900 articles in total. This dataset has 7 main classes, which are 'Sports', 'Politics', 'Medicine', 'Religion', 'Business', 'Technology' and 'Culture'. Pre-processing to remove elongation, punctuation, digits, single characters, Quranic symbols, Latin letters, and other marks is initially performed on the dataset. Similar to our earlier training and testing, this dataset was also split into 80% of the articles for training and 20% for testing. It was safe to expect that the accuracy scores would be lower for two reasons. First, the unbalanced dataset would cause the classifiers to be biased toward a specific category. Second, when the number of classes increases, the probability of incorrectly classifying a document increases as well.

Table 5 displays a summary of the accuracy results on the Akhbarona dataset. The best result, 94.4%, was produced by the SVM classifier. However, AdaBoost gave the worst result of 77.9%. In addition, four out of 11 classifiers report accuracy scores that range between 93.9% and 94.4%. As for the remaining 7 classifiers, only the KNeighbors classifier produced an accuracy, 90.8%, that is higher than the average. The other 6 classifiers produced results that vary from 77.9% to 88.4%. The table also shows the rest of the evaluation metrics: precision, recall, and F1-score.

Fig. 4 Performance of all classifiers on our proposed dataset


Fig. 5 The confusion matrices for highest (SVM) and lowest (AD) performers

Table 4 Evaluation metrics for all classifiers on the testing dataset

Classifier Accuracy Ham. loss F1 Score Precision Recall ROC
LR 97.50 0.025 97.57 97.58 97.57 99.84
SVM 97.92 0.021 97.98 97.99 97.98 99.87
DT 90.76 0.092 90.92 90.94 90.92 93.90
MNB 96.30 0.037 96.37 96.50 96.31 99.76
XGB 93.47 0.065 93.64 93.89 93.51 99.27
KNN 95.87 0.041 95.95 95.97 95.94 99.14
RF 94.46 0.055 94.61 94.64 94.60 99.21
NC 91.16 0.088 91.32 92.53 91.37 92.40
ADB 87.73 0.123 88.15 89.01 87.84 94.08
MLP 97.87 0.021 97.93 97.93 97.92 99.88
Ensemble 97.92 0.021 97.98 97.99 97.98 99.88

6.2 Multi-label classification

The aim, in this section, is to investigate the performance of classifying Arabic news articles with multiple labels. We experiment with classical classifiers and with deep learning. The total number of articles in the subset used in these experiments is 148,376. We chose the labels with the highest frequency, because the performance of supervised classifiers is highly dependent on the number of instances for each label. Figure 8 shows the count of the 21 labels chosen from the dataset.

For the classical classifiers approach, we split the dataset into an 80% training set consisting of 118,700 labeled articles, and a 20% testing set consisting of 29,676 articles. For the deep learning approach, we split the dataset into an 80% training set containing 118,700 articles (of which 10%, i.e., 11,870 articles, were used for validation), and a 20% testing set containing 29,676 articles. The validation set is used to tune some of the model's hyperparameters, such as layer size, hidden unit number, and regularization term. The same text pre-processing steps used on the single-labeled dataset are used on the multi-labeled dataset.

We implemented and tested the proposed classifiers on the multi-labeled dataset. We used a custom accuracy metric to evaluate the accuracy of the predictions. It is the ratio of correctly predicted tags (output as 1) over total expected tags (originally 1 in dataset). The more correct labels the model predicts, the more accurate it is. We chose a threshold of 50%, meaning we consider the labels with a probability higher than or equal to 50% to be correct.

The second evaluation metric used is hamming loss. The hamming loss metric is often used to evaluate the performance of multi-labeling classifiers. It is the fraction of wrongly predicted labels to the total number of labels. The smaller the value of hamming loss, the better results the model is achieving.

Fig. 6 A correctly tagged news-article as 'Technology'

Table 6 displays the accuracy scores achieved by two shallow-learning (SL) classifiers. The XGBoost classifier scored the highest accuracy of 84.73%, while the Logistic Regression scored the lowest accuracy of 81.34%. Both classifiers had comparable hamming loss scores (the lower, the better), but XGBoost scored the lowest, achieving 2.24%. We also include ROC, Hamming score, F1-score, precision, and recall metrics.

We train-test-validate ten deep neural (DL) networks, with different architectures, seeking the best at performing the task at hand. Figure 9 displays the resulting accuracy percentages, using the same accuracy metric described earlier. It shows that the CNN-GRU scored 94.85% and surpassed all the other classifiers, including the SL classifiers for multi-labels. Table 7 shows all metrics for each classifier including Hamming score, F1-score, precision, recall, and ROC. The CGRU achieved the best Hamming, F1, and recall scores. However, GRU reported the best precision score and LSTM reported the best ROC value. It is notable that the scores of GRU, BILSTM, and CGRU are close.

Focusing more on the CNN-GRU model, Table 8 shows the precision, recall and F1 scores of the 21 multi-labels. Table 9 displays the average scores of the CGRU classifier with respect to micro, macro, weighted, and samples averages.

Figure 10 shows how the relationships between the 21 true labels are present in the testing set. The edges in this graph represent the instances in which the two labels at the ends of each edge are both present. The width of the edge indicates the number of instances. Figure 11 shows the relationships produced by the model's predictions. The way the labels appear together is very different from how they appeared in the collected dataset. The model is classifying the articles under different main categories. It has learned efficiently enough to start identifying multiple topics in the articles.

Figure 12 displays an example of a news article classified by the model. Originally, it is tagged as 'Business'. The article discusses the wealth of 'ARAMCO' and how it


Fig. 7 A misclassified news-article as ’Business’; Originally it is tagged as ’Technology’

is beating both 'google' and 'apple'. Upon feeding the article to the CGRU model, the predicted labels are 'Technology', 'Business', 'Saudi Business', 'Google', and 'Apple'. This is an accurate and sufficient tagging of the article. The English translation of Fig. 12 is depicted as well.

Figure 13 shows another article, classified under 'Barca' on the website. The resulting tags from CGRU are 'Sports' and 'Real Madrid'. If you carefully read this article, you will easily find that the article has nothing to do with 'Barca'. On the contrary, the article is mainly talking about 'Real Madrid' and the injury of its player 'Hazard'. This example shows how the CNN model outperforms the original tagging of the article. It also confirms that such tagging models may solve the issue of recurring human tagging errors. The English translation of Fig. 13 is depicted as well.

To complete the analyses, we discuss the upper bounds of the computational cost of the implemented classifiers. We use the following notation: n is the number of training samples, f is the number of features, nt is the number of trees with depth d (DT and similar classifiers), k is the number of neighbors (KNN), nv is the number of support

Table 5 Classifiers accuracy scores on 'Akhbarona' dataset

Classifier Accuracy Precision Recall F1-score
LR 93.9 0.94 0.94 0.94
SVM 94.4 0.94 0.94 0.94
DT 83.0 0.83 0.83 0.83
MNB 88.0 0.91 0.88 0.88
XGB 88.4 0.89 0.88 0.88
KNN 90.8 0.91 0.91 0.91
RF 87.8 0.88 0.88 0.88
NC 86.2 0.89 0.86 0.87
ADB 77.9 0.80 0.78 0.78
MLP 94.1 0.94 0.94 0.94
Ensemble 94.3 0.94 0.94 0.94


Fig. 8 Count of the labels used in CGRU experiment

Table 6 Evaluation metrics of the SL classifiers for the multi-label classification task

Classifier Accuracy Ham. loss F1 Score Precision Recall ROC
OVR-LR 81.34 2.50 75.62 88.67 69.01 98.40
OVR-XGB 84.73 2.24 78.86 87.59 74.87 98.47
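The two multi-label metrics reported in Table 6, the custom accuracy (correctly predicted tags over expected tags) and the hamming loss (wrongly predicted labels over all labels), can be sketched as follows. This is one reading of the definitions given in Sect. 6.2, pooled over all articles; the authors' exact implementation (for example, whether the ratio is averaged per article) is not specified, and the function names and toy data are hypothetical.

```python
def custom_accuracy(y_true, y_prob, threshold=0.5):
    """Share of expected tags (true value 1) that the model also predicts."""
    correct = 0
    expected = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            if t == 1:
                expected += 1
                if p >= threshold:
                    correct += 1
    return correct / expected if expected else 0.0

def hamming_loss(y_true, y_prob, threshold=0.5):
    """Share of all label decisions that disagree with the ground truth."""
    wrong = 0
    total = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            total += 1
            if int(p >= threshold) != t:
                wrong += 1
    return wrong / total

# Toy data: two articles, three candidate labels each.
y_true = [[1, 0, 1], [0, 1, 0]]
y_prob = [[0.9, 0.2, 0.4], [0.6, 0.8, 0.1]]

acc = custom_accuracy(y_true, y_prob)  # 2 of the 3 expected tags are found
ham = hamming_loss(y_true, y_prob)     # 2 of the 6 label decisions are wrong
print(acc, ham)
```

Note that this accuracy ignores false positives, so it behaves like recall; the hamming loss complements it by penalizing every wrong decision, including spurious tags.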

Fig. 9 Accuracy scores for all deep neural networks

vectors (SVM), and ni is the number of neurons at layer i in a neural network with l layers that is trained for e epochs. Approximate upper bounds for training time as well as prediction time are listed in Table 10. It should be noted that training time is minor as it is carried out once and off-line. However, prediction time is more important as it is incurred for every prediction after training the classifiers.

To verify Table 10, we measured the time consumed for training the models. We ran an experiment in which n varies over [1000, 5000, 10000, 20000] and f takes the values


Table 7 Evaluation metrics for all deep learning classifiers for the multi-label classification task

Classifier Accuracy Ham. loss F1 Score Precision Recall ROC
CNN 91.34 1.61 89.13 90.99 87.51 95.08
CNNLSTM 91.34 1.61 89.13 90.99 87.51 95.08
BILSTM 94.03 1.27 90.25 92.31 88.48 97.06
BIGRU 91.34 1.61 89.13 90.99 87.51 95.08
GRU 94.28 1.21 90.55 92.90 88.70 98.04
LSTM 90.17 1.78 86.85 90.61 83.92 98.70
CRF-BILSTM 91.34 1.61 89.13 90.99 87.51 95.08
HANLSTM 92.92 1.45 90.60 91.05 90.57 83.83
HANGRU 92.96 1.43 90.66 91.16 90.40 93.52
CGRU 94.85 1.21 90.72 92.06 89.74 97.74

Table 8 Evaluation metrics of the CNN-GRU classifier per each of the 21 labels

Label Precision Recall F1-score
Business 97.81 97.95 97.88
oil 90.58 94.27 92.39
Business America 89.12 76.30 82.21
Business Egypt 93.24 88.91 91.02
Business SA 81.23 87.18 84.10
ME 99.16 98.87 99.01
Syria 94.72 92.99 93.84
Egypt 91.84 93.57 92.69
Yaman 95.91 88.47 92.04
Saudi Arabia 89.94 79.47 84.38
Iraq 94.31 89.28 91.73
Sports 99.67 99.88 99.77
Premier League 88.84 95.73 92.16
Real Madrid 90.53 90.78 90.65
Barca 89.47 90.79 90.13
Football 82.62 57.37 67.72
Tech 99.52 99.78 99.65
Android 89.49 89.45 89.47
Apple 93.11 89.44 91.24
Google 87.84 88.86 88.35
Social Media 94.19 95.31 94.75

Fig. 10 Relationships of true labels in the testing dataset

Fig. 11 Relationships of predicted labels by CNN

Table 9 Average evaluation scores of the CNN-GRU classifier

CNNGRU Precision Recall F1-score
Micro avg 94.72 93.48 94.09
Macro avg 92.06 89.74 90.72
Weighted avg 94.63 93.48 93.95
Samples avg 95.67 94.85 94.69

[100,500,1000]. The average training time in seconds is depicted in Fig. 14.

To predict the execution time during training for a CNN model, we need to consider the features that contribute to this time estimate [35]. However, such features are numerous and include layer features, convolution and pooling features, and hardware features. Such features can easily vary tremendously based on the deep learning network being implemented. Layer features may include activation function, optimizer, and batch size. Of course, a


Fig. 12 An example of a news article correctly tagged, with 5 tags, by the CGRU model

layer in the network may have far more of these features. Convolutional and pooling features include matrix size, input depth and padding, output depth, stride size, and kernel size. Hardware features are those related to GPU/CPU technology used, GPU/CPU count, memory, and clock speed.

In order to work out a prediction model of the computational time cost for training a CNN network that has multiple layers, we need to compute the time required for a forward and backward pass on a single batch for a single epoch. Then, the total time-cost estimate (T) for training the CNN network is:

$$T = e \, b \sum_{i=0}^{l} t_i$$

where l is the number of layers in the CNN model, t_i is the batch execution time estimate for layer i, b is the number of batches, and e is the number of epochs. Of course, e is a constant number that can be ignored.

7 Conclusions

We have presented an automatic general text categorization system for Arabic text that can handle both single-label as well as multi-label tagging tasks. We described a single-labeled dataset (90k Arabic documents) with their labels collected from seven different news portals. In addition, we reported a multi-labeled dataset comprising 293k Arabic articles, with their tags, collected from 10 news portals.

We examined the first dataset by implementing 12 shallow-learning classifiers for the single-label text classification task. Although the SVM model outperformed the rest, the final accuracy scores confirm the robustness of all classifiers, ranging between 87% and 97%. We also used the voting classifier, in pursuit of better accuracy, using an ensemble model. However, its resulting performance is analogous to SVM.

The second dataset was examined by implementing ten different deep learning neural networks along with two shallow learning classifiers for the multi-labeling task. A custom accuracy metric was implemented to evaluate the performance of the developed models. For the shallow learning case, the OVR-XGBoost classifier reported the higher accuracy than the OVR-Logistic Regression


Fig. 13 An example of a misclassified news article that turns to be good

Table 10 Approximate computational cost for the classifier

Classifier Training Prediction
LR O(nf) O(f)
SVM O(n^2 f + n^3) O(nv f)
DT O(n^2 f) O(df)
MNB O(nf) O(f)
XGB O(n f nt) O(f nt)
KNN O(knf) O(kf)
RF O(n log(n) f nt) O(d nt)
NC O(nf) O(f)
ADB O(n f nt) O(f nt)
MLP/NN O(e(f n1 + n1 n2 + ... + nl-1 nl)) O(f n1 + n1 n2 + ... + nl-1 nl)
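Alongside the asymptotic bounds in Table 10, the paper estimates the training time of a CNN as the number of epochs times the number of batches times the sum of the per-layer batch execution times. A minimal sketch of that estimate is given below; the function name and the per-layer timings are hypothetical and serve only to show the arithmetic.

```python
def training_time_estimate(layer_batch_times, num_batches, num_epochs):
    """Total training time: epochs * batches * sum of per-layer batch times."""
    return num_epochs * num_batches * sum(layer_batch_times)

# Hypothetical per-layer times (seconds) for one forward and backward pass
# on a single batch, e.g. embedding, Conv1D, and dense layers.
layer_batch_times = [0.002, 0.010, 0.004]

# 100 batches per epoch, 5 epochs: 5 * 100 * 0.016 s, about 8 seconds.
total_seconds = training_time_estimate(layer_batch_times, num_batches=100, num_epochs=5)
print(total_seconds)
```

In practice the per-layer times depend on the layer, convolution, and hardware features discussed in the text, so they would have to be measured or predicted for each configuration.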

classifier. Using deep learning models, the accuracy scores were considerably higher than the two shallow-learning classifiers. The highest achieved accuracy score was 94.85% reported by CNN-GRU, while the worst accuracy was 90.17% achieved by LSTM classifier. The rest of the

models performed very well ranging from 91.34% to 94.28%.

In future, we intend to test different embedding models such as BERT and ELMo besides using transformer networks as well. In addition, we are aiming at increasing the number of labels in the proposed datasets. We intend to make all datasets available to the research community.

Fig. 14 Average training time for the standard classifiers for multiple n samples and f features; Logarithmic scale

Appendix 1

Shallow Learning Classifiers

The different parameters considered for each classifier are detailed in this appendix. We list all parameters required to produce the same results. Unlisted parameters are kept with the default values (Table 11).

Table 11 Classifiers parameters settings

Classifier Parameters
LR C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', penalty='l2', solver='lbfgs', tol=0.0001, warm_start=False
SVM C=1.0, break_ties=False, cache_size=200, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear', max_iter=-1, probability=True, shrinking=True, tol=0.001
DT criterion='gini', min_samples_split=2, min_samples_leaf=1
MNB alpha=1.0, fit_prior=True
XGB loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, max_depth=3, warm_start=False, validation_fraction=0.1, tol=0.0001
KNN n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski'
RF n_estimators=100, criterion='gini', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0
NC metric='euclidean'
ADB algorithm='SAMME.R', learning_rate=1.0, n_estimators=50
MLP activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9, beta_2=0.999, epsilon=1e-08, hidden_layer_sizes=(100,), learning_rate='constant', learning_rate_init=0.001, max_iter=200, momentum=0.9, n_iter_no_change=10, power_t=0.5, solver='adam', tol=0.0001, validation_fraction=0.1, warm_start=False


Deep Learning Classifiers

● CNN
● CNNLSTM
● BILSTM
● BIGRU
● GRU
● LSTM
● CRF-BILSTM
● HANLSTM
● CGRU

Declarations

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Aggarwal CC, Zhai CX (2012) Mining text data. Springer Publishing Company, Incorporated
2. Al-Alwani A, Beseiso M (2013) Arabic spam filtering using bayesian model. Int J Comput Appl 79(7):11–14. https://doi.org/10.5120/13752-1582
3. Al-Harbi S, Almuhareb A, Al-Thubaity A, Khorsheed M, Al-Rajeh A (2008) Automatic arabic text classification. In: Proceedings of the 9th international conference on the statistical analysis of textual data, pp 12–14
4. Al Qadi L, El Rifai H, Obaid S, Elnagar A (2019) Arabic text classification of news articles using classical supervised classifiers. In: 2019 2nd International conference on new trends in computing sciences (ICTCS), pp 1–6
5. Al Qadi L, El Rifai H, Obaid S, Elnagar A (2020) A scalable shallow learning approach for tagging arabic news articles. Jordanian J Comput Inform Technol 6(3):263–280
6. Al-Salemi B, Ayob M, Kendall G, Noah SAM (2019) Multi-label arabic text categorization: A benchmark and baseline comparison of multi-label learning algorithms. Inf Process Manage 56(1):212–227. https://doi.org/10.1016/j.ipm.2018.09.008
7. Al-Sbou AMF (2018) A survey of arabic text classification models. Int J Elect Comput Eng (IJECE) 8(6):4352–4352. https://doi.org/10.11591/ijece.v8i6.pp4352-4355
8. Al-Shalabi R, Obeidat R (2008) Improving knn arabic text classification with n-grams based document indexing. In: Proceedings of the sixth international conference on informatics and systems, pp 108–112
9. Al-Tahrawi MM, Al-Khatib SN (2015) Arabic text classification using polynomial networks. J King Saud Univ Comput Inform Sci 27(4):437–449. https://doi.org/10.1016/j.jksuci.2015.02.003
10. Al-Zaghoul F, Al-Dhaheri S (2013) Arabic text classification based on features reduction using artificial neural networks. In: Proceedings of the 15th international conference on computer modelling and simulation, pp 485–490
11. Alalyani N, Larabi S (2018) Nada: New arabic dataset for text classification. Int J Adv Comput Sci Appl 9(9):5. https://doi.org/10.14569/ijacsa.2018.090928


12. Alsaleem S (2011) Automated arabic text categorization using Technologies 3(1):1–187. https://doi.org/10.2200/
svm and nb. Int Arab J eTechnol 2(2):124–128 s00277ed1v01y201008hlt010
13. Azarbonyad H, Dehghani M, Marx M, Kamps J (2021) Learning 30. Harrag F, El-Qawasmeh E, Pichappan P (2009) Improving arabic
to rank for multi-label text classification: combining different text categorization using decision trees. In: 2009 First Interna-
sources of information. Nat Lang Eng 27(1):89–111 tional Conference on Networked Digital Technologies, pp 110–
14. Bawaneh MJ, Alkoffash MS, Al Rabea AI (2008) Arabic text 115. https://doi.org/10.1109/NDT.2009.5272214
classification using k-nn and naive bayes. J Comput Sci 4(7):600– 31. Bilal Hawashin A, Mansour Aljawarneh S (2013) An efficient
605. https://doi.org/10.3844/jcssp.2008.600.605 feature selection method for arabic text classification. Int J
15. Biniz M, Boukil S, El Adnani F, Cherrat L (2018) Abd elmajid el Comput Appl 83(17):1–6. https://doi.org/10.5120/14666-2588
moutaouakkil, “arabic text classification using deep learning 32. Hmeidi I, Hawashin Bilal, El-Qawasmeh E (2008) Performance
technics. Int J Grid Distrib Comput 11(9):103–114 of knn and svm classifiers on full word arabic articles. Adv Eng
16. Boudad N, Faizi R (2017) Rachid oulad haj thami, raddouane Inform 22(1):106–111. https://doi.org/10.1016/j.aei.2007.12.001
chiheb, “sentiment analysis in arabic: a review of the literature”. 33. Hmeidi I, Al-Ayyoub M, Abdulla NA, Almodawar AA, Abooraig
Ain Shams Eng J R, Mahyoub NA (2015) Automatic arabic text categorization: A
17. Dahou A, Xiong S, Zhou J (2016) Mohamed houcine haddoud comprehensive comparative study. J Inf Sci 41(1):114–124.
and pengfei duan, “word embeddings and convolutional neural https://doi.org/10.1177/0165551514558172
network for arabic sentiment classication”. In: Proceedings of 34. Hmeidi I, Al-Ayyoub M, Mahyoub NA, Shehab MA (2016) A
COLING 2016, the 26th International Conference on Computa- lexicon based approach for classifying arabic multi-labeled text.
tional Linguistics: Technical Papers, pp 2418–2427 Int J Web Inform Syst
18. Dharmadhikari SC, Ingle M, Kulkarni P (2011) Empirical studies on machine learning based text classification algorithms. Adv Comput Int J 2(6):161–169. https://doi.org/10.5121/acij.2011.2615
19. El-Haj M, Rayson P, Aboelezz M (2018) Arabic dialect identification in the context of bivalency and code-switching. LREC 2018, Eleventh International Conference on Language Resources and Evaluation, pp 3622–3627
20. El-Halees A (2007) Arabic text classification using maximum entropy. Islam Univ J (Series of Natural Studies and Engineering) 15:167–167
21. El Kourdi M, Bensaid A, Rachidi Te (2004) Automatic arabic document categorization based on the naïve bayes algorithm. In: Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, Semitic ’04, pp 51–58
22. Elnagar A, Einea O (2016) Brad 1.0: Book reviews in arabic dataset. In: 2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA), pp 1–8. https://doi.org/10.1109/AICCSA.2016.7945800
23. Elnagar A, Khalifa YS, Einea A (2018a) Hotel Arabic-Reviews Dataset Construction for Sentiment Analysis Applications, Springer International Publishing, pp 35–52. https://doi.org/10.1007/978-3-319-67056-0_3
24. Elnagar A, Lulu L, Einea O (2018b) An annotated huge dataset for standard and colloquial arabic reviews for subjective sentiment analysis. Proc Comput Sci 142:182–189. https://doi.org/10.1016/j.procs.2018.10.474 (arabic Computational Linguistics)
25. Elnagar A, Einea O, Al-Debsi R (2019) Automatic text tagging of arabic news articles using ensemble learning models. In: Proceedings of the 3rd International Conference on Natural Language and Speech Processing, pp 59–66
26. Elnagar A, Al-Debsi R, Einea O (2020) Arabic text classification using deep learning models. Inform Process Manag 57(1):102121–102121. https://doi.org/10.1016/j.ipm.2019.102121
27. Gharib TF, Habib MB, Fayed ZT (2009) Arabic text classification using support vector machines. Int J Comput Their Appl 16(4):192–199
28. Gonçalves T, Quaresma P, Kłopotek M, Wierzchoń S, Trojanowski K (2004) The impact of nlp techniques in the multilabel text classification problem. Intelligent Information Processing and Web Mining 25
29. Habash N, Hirst G (2010) Introduction to arabic natural language processing. Synthesis Lectures on Human Language
35. Justus D, Brennan J, Bonner S, Mcgough A (2018) Predicting the computational cost of deep learning models, pp 3873–3882. https://doi.org/10.1109/BigData.2018.8622396
36. Korde S, Mahender CN (2012) Text classification and classifiers: a survey. IJAIA J 3(2):85–99. https://doi.org/10.5121/ijaia.2012.3208
37. Li Y, Nie X, Huang R (2018) Web spam classification method based on deep belief networks. Expert Syst Appl 96:261–270. https://doi.org/10.1016/j.eswa.2017.12.016
38. Malmasi S, Dras M (2015) Language identification using classifier ensembles. Proceedings of the joint workshop on language technology for closely related languages, varieties and dialects, pp 35–43
39. Mesleh AMA (2007) Chi square feature extraction based svms arabic language text categorization system. J Comput Sci 3(6):430–435. https://doi.org/10.3844/jcssp.2007.430.435, exported from https://app.dimensions.ai on 2019/02/03
40. Noaman HM, Elmougy S, Ghoneim A, Hamza T (2010) Naive bayes classifier based arabic document categorization. Proceeding of the 7th international conference on informatics and systems (INFOS2010), pp 1–5
41. Raho G, Al-Shalabi R, Kanaan G (2015) Asma’a nassar different classification algorithms based on arabic text classification: Feature selection comparative study. (IJACSA) Int J Adv Comput Sci Appl 6(2)
42. Saad M (2010) The Impact of Text Preprocessing and Term Weighting on Arabic Text Classification. Master’s thesis, Computer Engineering Dept., Islamic University of Gaza, Palestine
43. Shehab MA, Badarneh O, Al-Ayyoub M, Jararweh Y (2016) A supervised approach for multi-label classification of arabic news articles. In: 2016 7th International Conference on Computer Science and Information Technology (CSIT), pp 1–6. https://doi.org/10.1109/CSIT.2016.7549465
44. Wang T, Liu L, Liu N, Zhang H, Zhang L, Feng S (2020) A multi-label text classification method via dynamic semantic representation model and deep neural network. Appl Intell 50(8):2339–2351

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
