This article has been accepted for publication in IEEE Access.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3342755

Date of publication xxxx 00, 0000, date of current version Nov 09, 2023.
Digital Object Identifier 10.1109/ACCESS.2023.3342755

Enhancing Arabic Aspect-Based Sentiment Analysis Using End-to-End Model

Ghada M. Shafiq¹, Taher Hamza¹, Mohammed F. Alrahmawy¹ and Reem El-Deeb¹

¹Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt

Corresponding author: Ghada M. Shafiq (e-mail: [email protected]).

ABSTRACT The majority of research on Aspect-Based Sentiment Analysis (ABSA) splits the task into two subtasks: one for extracting aspects, Aspect Term Extraction (ATE), and another for identifying sentiments toward particular aspects, Aspect Sentiment Classification (ASC). Although these subtasks are closely related, they are performed independently; the Aspect Sentiment Classification task assumes that the aspect terms are pre-identified, which ignores the practical interaction required to properly perform ABSA. This study addresses these limitations using a unified End-to-End (E2E) approach, which combines the two subtasks into a single sequence labeling task using a unified tagging schema. The proposed model was evaluated by fine-tuning the Arabic version of the Bidirectional Encoder Representations from Transformers (AraBERT) model with a Conditional Random Fields (CRF) classifier for enhanced target-polarity identification. The experimental results demonstrate the efficiency of the proposed fine-tuned AraBERT-CRF model, which achieved an overall F1 score of 95.11% on the SemEval-2016 Arabic Hotel Reviews dataset. The model's predictions were then subjected to additional processing, and the results indicate the superiority of the proposed model, which achieved an F1 score of 97.78% for the ATE task and an accuracy of 98.34% for the ASC task, outperforming previous studies.

INDEX TERMS Sentiment Analysis, Aspect-Based, AraBERT, CRF, Transfer Learning.

I. INTRODUCTION
Due to the development of the internet and its platforms, users frequently share their ideas on various blogs and social media platforms. Understanding and analyzing users' ideas and opinions about a product or service is essential for businesses and owners. Sentiment Analysis (SA) is concerned with this type of information; it understands human opinions and analyzes them to obtain the required knowledge. SA can be performed at the document, sentence, or aspect levels [1]. Aspect-Based Sentiment Analysis (ABSA) is the most challenging type. It is usually divided into four subtasks: Aspect Term Extraction (ATE), the first subtask, which aims to extract the explicit aspect terms in each sentence; and Aspect Sentiment Classification (ASC), the second subtask, which seeks to identify the sentiment polarities toward the given aspects. For example, the sentence "the food is delicious, but the service is terrible" contains the aspect term "food" with a positive sentiment polarity and the aspect term "service" with a negative sentiment polarity. Aspect Category Identification and Aspect Category Sentiment Classification are the remaining subtasks; their goal is to identify a category for an aspect word from a pre-defined set of categories and determine its corresponding sentiment, respectively. In this research, we are concerned only with the ATE and ASC subtasks.
Arabic ABSA has gained some attention over the past few years. However, the research involved still needs to be improved due to the relative complexity and ambiguity of the Arabic language's morphology. Additionally, the lack of publicly available annotated datasets and tools for processing Arabic text also represents a challenge [1].
Most studies involved in Arabic ABSA evaluate the ATE and ASC subtasks independently, ignoring the relatedness and dependency of the two subtasks [2, 3, 4, 5, 6, 7]. Some studies either only extract aspects from a given sentence (ATE) [8, 9, 10] or predict sentiment polarities (ASC) assuming that aspect entities are pre-identified input features to the model, which is not the case in real-world scenarios [11, 12, 13, 14].
On the contrary, ABSA research in the English language is more evolved. Recent research directions are towards tackling both Aspect Term Extraction and Aspect Sentiment Classification through a single model using an End-to-End
VOLUME 11, 2023 1

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4

(E2E) approach, which can help overcome the limitations of previous studies. The E2E-ABSA can be carried out in one of two approaches [15]. The first approach is known as the Joint approach; it involves performing the two subtasks in parallel with two sets of labels: one for the aspect boundaries (B, I, and O) [16], denoting the Beginning, Inside, and Outside of the aspect term, respectively, for the ATE task, while the other set of labels represents the sentiment polarities (positive, negative, and neutral) for the ASC task. The outcomes of both tasks are combined to produce the final label. However, the lack of a correlation between the aspect boundaries and the corresponding sentiment polarities could cause this approach to suffer from error propagation [15].
The second approach is the unified approach, which combines the two subtasks into a single sequence labeling task. The aspect boundary labels and the sentiment polarity labels are combined to generate one set of unified labels (B-positive, I-positive, etc.). Although the unified approach preserves the dependency between the aspect boundaries and their sentiment polarities, it makes model prediction more challenging and can result in performance degradation [17]. The model must identify the aspect boundary and the sentiment polarity without being given any implicit prior information about the aspect terms. An example of an Arabic sentence that clarifies the differences among ABSA approaches is shown in TABLE 1.
Several techniques, from rule-based to traditional machine and deep learning techniques, have been used to handle Arabic ABSA. Rule-based techniques are static techniques with no learning models involved; they also rely on external resources, which are scarce in Arabic. Machine Learning (ML) techniques, on the other hand, rely on intensive feature engineering to adjust the data and select the appropriate features. Although Deep Learning (DL) techniques have overcome the intensive feature engineering limitation, they require a large dataset for models to train and produce accurate results [9].
Recently, pre-trained transformer-based language models [18] have attracted much attention due to their significant influence on various Natural Language Processing (NLP) applications, including ABSA. A large amount of unlabeled text was used to train these models to make them efficient in comprehending the input context. As a result, these models can be fine-tuned to handle a variety of tasks and deliver remarkable results without the need for large datasets [9, 19, 20]. AraBERT [21] is a pre-trained language model specifically designed to handle the complexities and ambiguity of the Arabic language and has achieved state-of-the-art performances in many Arabic NLP tasks.
Utilizing this model to evaluate our proposed model can have a great influence. The bi-directionality of BERT [22] allows it to learn the context of each word with respect to the entire sequence simultaneously, making it easier for the model to identify the aspect boundaries. Furthermore, the self-attention mechanism of BERT [18] allows for the association of opinion words with their relevant aspect terms in order to predict sentiment polarity.
The Conditional Random Fields (CRF) [23] classifier has also proven its efficiency in delivering accurate results in a variety of sequence labeling tasks [9]; it preserves the dependencies between tags/labels, ensuring the correctness of the predicted tag sequence and boosting overall performance.
Motivated by the aforementioned, the following is a summary of the main contributions of this study:
• This study tackles the subtasks of ABSA, specifically ATE and ASC, by integrating them into a single sequence labeling task using a unified E2E approach in order to overcome the previously mentioned limitations of two separate models for each subtask. To the best of our knowledge, this is the first study to apply a unified E2E-ABSA on the SemEval-2016 Arabic Hotel Reviews dataset [2].
• Preparing the dataset of Arabic Hotel Reviews [2] so that it matches the desired classification task.
• Several experiments were applied to evaluate the proposed E2E approach, utilizing a feature-based vs. fine-tuned AraBERT model along with CRF vs. softmax [24] to assess the impact of different implementations on the performance of the proposed model.
• Resolving the complexity and morphological ambiguity of the Arabic language using the AraBERT model.
• Preserving the tag/label dependencies using Conditional Random Fields.
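The tag-dependency argument for the CRF can be made concrete with a toy linear-chain decoder. In the sketch below (our own illustration, not the paper's implementation), hand-picked per-token scores slightly prefer an invalid I-NEG tag directly after B-POS, and a transition matrix that penalizes malformed tag pairs steers the Viterbi decode to a well-formed sequence instead:

```python
TAGS = ["O", "B-POS", "I-POS", "B-NEG", "I-NEG"]

def allowed(prev, cur):
    # An I- tag is only valid directly after a B- or I- tag of the same polarity.
    if cur.startswith("I-"):
        pol = cur[2:]
        return prev in ("B-" + pol, "I-" + pol)
    return True

# Transition scores: 0 for valid transitions, a large penalty for invalid ones.
TRANS = {(p, c): (0.0 if allowed(p, c) else -1e4) for p in TAGS for c in TAGS}

def viterbi(emissions):
    """emissions: list of {tag: score} dicts, one per token."""
    # First token: an I- tag cannot start a sequence.
    score = {t: (emissions[0][t] if not t.startswith("I-") else -1e4) for t in TAGS}
    back = []
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: score[p] + TRANS[(p, cur)])
            new_score[cur] = score[prev] + TRANS[(prev, cur)] + em[cur]
            ptr[cur] = prev
        back.append(ptr)
        score = new_score
    # Backtrack from the best final tag.
    path = [max(TAGS, key=score.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Per-token scores where the second token slightly prefers the invalid I-NEG.
ems = [
    {"O": 0.1, "B-POS": 2.0, "I-POS": 1.0, "B-NEG": 0.0, "I-NEG": 0.0},
    {"O": 0.1, "B-POS": 0.0, "I-POS": 1.8, "B-NEG": 0.0, "I-NEG": 2.0},
]
print(viterbi(ems))  # ['B-POS', 'I-POS']
```

A per-token argmax over the same scores would emit the malformed pair B-POS, I-NEG; in a trained CRF, the transition scores that rule this out are learned from data rather than hand-set.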
TABLE 1. Example to clarify the difference among ABSA approaches applied on the Arabic sentence "الطعام لذيذ ولكن الخدمة سيئة," i.e., the food is delicious, but the service is terrible.

Task              | Input             | Example                               | Output
ABSA              | Sentence + Aspect | الطعام لذيذ ولكن الخدمة سيئة + الطعام | Positive
ABSA              | Sentence + Aspect | الطعام لذيذ ولكن الخدمة سيئة + الخدمة | Negative
ATE               | Sentence          | الطعام لذيذ ولكن الخدمة سيئة          | O B O O B
ASC               | Sentence          | الطعام لذيذ ولكن الخدمة سيئة          | O NEG O O POS
Joint E2E-ABSA(a) | ------------      | -------------                         | O B-NEG O O B-POS
Unified E2E-ABSA  | Sentence          | الطعام لذيذ ولكن الخدمة سيئة          | O B-NEG O O B-POS

(a) Joins the output labels of both ATE and ASC tasks.
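The unified labels in TABLE 1 are a mechanical merge of the joint approach's two label sets; a minimal sketch (the function name is ours):

```python
def unify(boundary_tags, polarity_tags):
    """Merge BIO boundary tags and token-level polarity tags into unified
    E2E-ABSA tags such as B-NEG or I-POS; non-aspect tokens stay O."""
    return ["O" if b == "O" else b + "-" + p
            for b, p in zip(boundary_tags, polarity_tags)]

# The example sentence of TABLE 1: two single-word aspects.
print(unify(["O", "B", "O", "O", "B"], ["O", "NEG", "O", "O", "POS"]))
# ['O', 'B-NEG', 'O', 'O', 'B-POS']
```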


• Experimental results demonstrate that the proposed fine-tuned AraBERT-CRF model outperforms single-task methods and yields a better ABSA task representation.

The rest of the paper is structured as follows: Section II provides the related works in the ABSA field. The proposed model is presented in Section III. Section IV discusses the conducted experiments and comparisons of the achieved results with other related works. Section V presents our conclusion and future directions.

II. RELATED WORKS
This section provides an overview of the works applied to the two subtasks of Arabic ABSA, Aspect Term Extraction and Aspect Sentiment Classification, showing their advantages and limitations. However, while some studies may have covered other ABSA subtasks, our focus in this study is entirely on ATE and ASC. In addition, some of the work on English E2E-ABSA is presented due to the lack of work on Arabic E2E-ABSA.

A. ASPECT TERM EXTRACTION (ATE) and ASPECT SENTIMENT CLASSIFICATION (ASC)
The Aspect Term Extraction task (or Opinion Target Expression (OTE) Extraction) extracts the explicit target opinionated words or phrases in each text. It is usually formulated as a sequence labeling task with a BIO tagging schema [16]. Consequently, the Aspect Sentiment Classification task identifies the sentiment polarities towards the given aspects [1]. This task is usually addressed with several naming conventions of aspect term/based polarity/sentiment identification/classification; however, for simplicity, we will refer to it as ASC.
A lot of research has been conducted regarding the two subtasks. In [25], the authors provided a benchmark annotated Arabic News Posts dataset with a lexicon-based approach to evaluate their work on aspect term extraction and aspect term polarity identification. The same authors then investigated enhancing their baseline work by utilizing a set of ML classifiers, including CRF, Naïve Bayes (NB), Decision Tree (J48: WEKA¹ implementation), and K-Nearest Neighbor (IBK: WEKA implementation), along with a set of morphological and word features including Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and N-Grams. Results demonstrated that the J48 classifier outperformed other classifiers regarding the ATE task, whereas the CRF classifier achieved the best performance regarding the aspect term polarity identification task [7]. The authors in [2] created a benchmark dataset of Arabic Hotel Reviews in SemEval-2016 for the ABSA task. They applied the Support Vector Machine (SVM) classifier as a baseline model. An enhanced study is introduced in [26]; the authors experimented with applying NB, Bayes Networks, J48, IBK, and SVM (SMO: WEKA implementation) classifiers along with the same set of features utilized in [7]. The SMO classifier outperformed the baseline work and achieved the best results regarding the OTE task and the sentiment polarity identification task, respectively. Although ML-based models perform well, they rely significantly on data preprocessing and intensive feature engineering.
Additionally, Deep Learning models have made significant contributions to the ABSA task. The authors in [14] proposed INSIGHT-1 at SemEval-2016. They applied a Convolutional Neural Network (CNN) model for the ABSA task on the Arabic Hotel Reviews dataset. The authors in [4], the same authors of [26], examined the use of the Recurrent Neural Network (RNN) model to address the OTE task as well as the aspect sentiment polarity identification task. They combined the word2vec [27] word embedding with the features presented in the previous experiment [26]. The results demonstrated that the SMO classifier outperformed the RNN model regarding performance metrics; however, the RNN was faster during the execution time. Other variations of RNN were then explored in many studies. The Bidirectional Long Short-Term Memory (BiLSTM) [28] and the Bidirectional Gated Recurrent Unit (BiGRU) were the most utilized techniques in combination with the CRF classifier. In [5], the authors utilized a BiLSTM-CRF model for the ATE task, whereas in [3], a BiGRU-CRF model was utilized. In [8], a BiLSTM-attention-LSTM-CRF model is utilized for the OTE task. As a feature representation, a combination of Continuous-Bag-of-Words (CBOW) [27] and character-level embeddings generated via CNN is utilized in [3] and [8]. The fastText [29] character-level embedding is utilized in [5]. For the aspect-based sentiment polarity classification task, the authors in [3] proposed an interactive attention network model (IAN) combined with a BiGRU. In [5], they proposed the Aspect Based-LSTM-Polarity Classification (AB-LSTM-PC) model with an aspect attention-based vector. In [12], the authors used a combination of CBOW and skip-gram character-level embeddings. They applied a Stacked Bidirectional Independent LSTM (Bi-Indy-LSTM) with position-weighting and an attention mechanism combined with a GRU layer for the aspect sentiment classification task.
An improvement in performance over the previous experiments was observed after utilizing character-level embeddings and attention mechanisms. Consequently, pre-trained language models based on the transformer architecture [18] have achieved remarkable success in Arabic ABSA. The authors in [10] used a combination of AraBERT and Flair embeddings for aspect extraction. They compared attaching a BiLSTM-CRF and a BiGRU-CRF layer on top of the stacked embeddings. The results showed that fine-tuning AraBERT with a BiLSTM-CRF layer achieved better performance. In [11], the authors fine-tuned the pre-trained language model Arabic BERT [30]

¹ https://www.cs.waikato.ac.nz/ml/weka/


for the aspect sentiment polarity classification task. They used a sentence-pair classification approach where the aspect term is paired with the input sentence as an auxiliary sentence. In [13], the authors combined AraBERT and Arabic BERT and fine-tuned the generated Sequence-to-Sequence (Seq2Seq) model for the aspect term polarity task. In [6], the authors investigated the use of a Multilingual Universal Sentence Encoder (MUSE) [31] with a pooled BiGRU for the aspect extraction and aspect polarity classification tasks. The model achieved a state-of-the-art result, indicating the superiority of the pre-trained language models. TABLE 2 summarizes the work on both ATE and ASC tasks, respectively.

TABLE 2. Summary of Related Works for the Aspect Term Extraction (ATE) and Aspect Sentiment Classification (ASC) tasks.

Ref/Year  | Task | Model                              | Dataset              | Features                                                  | P(%) | R(%) | F1(%) | ACC(%)
[2]/2016  | ATE  | SVM baseline                       | Hotel Reviews        | N-Unigrams                                                | -    | -    | 30.9  | -
[2]/2016  | ASC  | SVM baseline                       | Hotel Reviews        | N-Unigrams                                                | -    | -    | -     | 76.42
[7]/2016  | ATE  | J48                                | Gaza News Posts      | POS, NER, and N-Grams                                     | 81.3 | 82.5 | 81.7  | -
[7]/2016  | ASC  | CRF                                | Gaza News Posts      | POS, NER, and N-Grams                                     | -    | -    | -     | 87.9
[14]/2016 | ASC  | CNN                                | Arabic Hotel Reviews | (Aspect + text) word embeddings randomly initialized      | -    | -    | -     | 82.7
[4]/2018  | ATE  | RNN                                | Arabic Hotel Reviews | POS, NER, N-Grams, morphological and word features + word2vec word embedding | - | - | 48 | -
[4]/2018  | ASC  | RNN                                | Arabic Hotel Reviews | POS, NER, N-Grams, morphological and word features + word2vec word embedding | - | - | -  | 87
[26]/2019 | ATE  | SVM                                | Arabic Hotel Reviews | POS, NER, N-Grams, morphological and word features        | 89.8 | 90   | 89.8  | -
[26]/2019 | ASC  | SVM                                | Arabic Hotel Reviews | POS, NER, N-Grams, morphological and word features        | -    | -    | -     | 95.4
[5]/2019  | ATE  | BiLSTM-CRF                         | Arabic Hotel Reviews | fastText char-level embedding                             | -    | -    | 69.98 | -
[5]/2019  | ASC  | AB-LSTM-PC + Soft Attention        | Arabic Hotel Reviews | fastText char-level embedding                             | -    | -    | -     | 82.6
[8]/2020  | ATE  | BiLSTM-attention-LSTM-CRF          | Arabic Hotel Reviews | CNN char-level + CBOW word-level embeddings               | -    | -    | 72.83 | -
[12]/2021 | ASC  | Bi-Indy-LSTM + recurrent attention | Arabic Hotel Reviews | (Aspect + text) word embeddings using skip-gram and CBOW  | -    | -    | -     | 87.31
[6]/2021  | ATE  | BiGRU                              | Arabic Hotel Reviews | MUSE sentence-level embeddings                            | -    | -    | 93    | 92.82
[6]/2021  | ASC  | BiGRU                              | Arabic Hotel Reviews | MUSE sentence-level embeddings                            | 90.8 | 90.5 | 90.86 | 91.40
[3]/2021  | ATE  | BiGRU-CNN-CRF                      | Arabic Hotel Reviews | AraVec [32] word-level + CNN char-level embedding         | -    | -    | 69.44 | -
[3]/2021  | ASC  | IAN-BGRU                           | Arabic Hotel Reviews | AraVec [32] word-level + CNN char-level embedding         | -    | -    | -     | 83.98
[11]/2021 | ASC  | fine-tune Arabic BERT              | HAAD [33]            | (Aspect + text) Arabic BERT word embeddings               | -    | -    | -     | 73
[11]/2021 | ASC  | fine-tune Arabic BERT              | Gaza News Posts      | (Aspect + text) Arabic BERT word embeddings               | -    | -    | -     | 85.73
[11]/2021 | ASC  | fine-tune Arabic BERT              | Arabic Hotel Reviews | (Aspect + text) Arabic BERT word embeddings               | -    | -    | -     | 89.51
[9]/2022  | ATE  | fine-tune AraBERT-BiGRU-CRF        | Gaza News Posts      | AraBERTv0.1 word embedding                                | 87.7 | 88.5 | 88.1  | -
[10]/2022 | ATE  | fine-tune AraBERT-BiLSTM-CRF       | Arabic Hotel Reviews | AraBERTv0.2 word embedding + Flair string embedding       | -    | -    | 79.9  | -
[13]/2022 | ASC  | fine-tune AraBERT                  | HAAD                 | (Aspect + text) word embeddings using (AraBERT + Arabic BERT) along with Seq2Seq dialect normalization | - | - | - | 74.85
[13]/2022 | ASC  | fine-tune AraBERT                  | Arabic Hotel Reviews | (Aspect + text) word embeddings using (AraBERT + Arabic BERT) along with Seq2Seq dialect normalization | - | - | - | 84.65

B. END-TO-END ASPECT-BASED SENTIMENT ANALYSIS (E2E-ABSA)
Despite the efficiency of the previously discussed single-task approaches, they lack the practical interaction required to fully perform the ABSA task. Additionally, the work involved in ASC relies on the aspect term as a pre-identified feature of the model in conjunction with the input sentence, which is not the case in real-world scenarios. To overcome these limitations, various studies on English ABSA have developed models that can perform these subtasks jointly, either through a hierarchical approach [34] or an End-to-End approach [17, 35, 36, 15]. The following studies correspond to the English E2E-ABSA.
In [34], the authors propose a hierarchical multi-task learning framework. The framework consists of ATE and Aspect Sentiment Detection modules with a sentiment lexicon and an attention mechanism. A BiLSTM-CRF layer is utilized for predicting the final target sentiment label. The authors also attach BERT embeddings, which eventually boost the performance. In [35], the authors utilized a BERT-SAN model where the BERT model is fine-tuned along with a neural classification layer and a Self-Attention Network (SAN) for the unified E2E-ABSA. In [36, 15], the authors applied two stacked BiLSTM layers for a unified E2E-ABSA. The GloVe [37] embeddings and target-position information are used as features. In [17], the authors propose a CasNSA model that consists of several modules: a contextual semantic representation module, a target boundary recognizer, and a sentiment polarity identifier. The model was tested on four different datasets, and the highest F1-score achieved was on the SemEval-2014 dataset.
Joining the Aspect Term Extraction and Aspect Sentiment Classification tasks together into a single task can achieve the required dependency and relatedness between aspect terms and their sentiment polarities. However, dealing with both tasks simultaneously requires a model capable of processing a large search space and converging to good results with a fast execution time.

III. THE PROPOSED METHODOLOGY
In contrast to previous studies, our proposed model neither relies on intensive feature engineering nor a pre-identified


aspect term but rather on the relationships between words,


contextualized information, and tag dependencies.
We formulated the subtasks of ABSA, Aspect Term
Extraction and Aspect Sentiment Classification tasks as an
End-to-End sequence labeling task via a unified tagging
schema. The proposed model was evaluated by utilizing the
pre-trained language model AraBERT along with different
classification techniques.
After preparing the data for the desired task, the AraBERT model is utilized to extract the required features in two approaches: first, as a feature-based model with no weights modified during the training phase; second, as a fine-tuned model. The extracted feature vectors are then processed with a fully connected neural network layer (Dense) to reduce the dimensionality and interpret the data for the final classification stage. Two classification algorithms were applied at this stage: a multi-layer perceptron with a softmax activation function and a linear-chain CRF. FIGURE 1 shows the overall architecture of the proposed model. More details regarding the architecture's components will be explained in the next subsections.

FIGURE 1. The Architecture of The Proposed Model.

A. DATASET
The Arabic Hotel Reviews dataset [2] from the SemEval-2016 workshop was utilized to evaluate the proposed model. It contains reviews written in Modern Standard Arabic as well as Dialectal Arabic. The dataset was annotated on two levels: text-level annotation and sentence-level annotation. In this research, only the sentence-level annotation is targeted.
As displayed in FIGURE 2, reviews are written in XML format, with each review containing multiple sentences, each of which contains a text with several attributes (target, category, and polarity). According to those attributes, the aspect terms and their corresponding sentiment polarities are extracted. TABLE 3 depicts the distribution of sentences and aspect terms in the dataset.

FIGURE 2. Snapshot from The Sentence-Level Annotated Arabic Hotel Reviews Dataset.

TABLE 3. Description of The Arabic Hotel Reviews Dataset.

Dataset        | Sentences (Train / Test / Total) | Aspect Terms (Train / Test / Total)
Sentence-level | 4802 / 1227 / 6029               | 96,246 / 23,856 / 120,102

B. DATA PREPROCESSING
1) TEXT CLEANING AND IOB TAGGING
The preprocessing stage went through the following procedures to prepare the dataset to be compatible with our experiment. To begin, the original XML file of the dataset is transformed to IOB file format. As presented in Algorithm 1, based on the 'target' and 'polarity' attributes, each sentence is divided into a list of words, and each word is assigned an appropriate label from the label list [B-NEG, B-NEU, B-POS, I-NEG, I-NEU, I-POS, O].
Except for O, each label consists of two parts: the target's boundary and the sentiment polarity. If the word is not included in the 'target' attribute, the label O is assigned. If the 'target' attribute consists of only one word, the label B-POS, B-NEG or B-NEU is assigned based on the 'polarity' attribute. Finally, if the 'target' attribute consists of several words, the label B- is assigned to the first word, followed by I- for the remaining words, combined with the positive, negative or neutral polarity.
The IOB file is then converted to a CSV file with some text cleaning, which includes removing punctuation, digits, and any non-Arabic letters, normalizing Hamza (أ, إ, ا to ا) and ta-marbuta (ة to ه), and normalizing letters with diacritics ("اَكَلَ" to "اكل", i.e., "eat"). Word elongation is also removed to avoid any duplication of letters ("جمييييل", i.e., "niiiice", to "جميل", "nice"). Algorithm 2 summarizes the steps involved.

Algorithm 1 XML to IOB
Input: dataset file in XML format
Output: dataset file in .iob format
1  Sentences ← []
2  For <sentence> ∈ xml file do
3    Text, [target, from, to, polarity] ← extract the text and its corresponding attributes
4    Sentences ← Sentences + {'text': Text, 'attributes': [target, from, to, polarity]}
   End
5  For sentence ∈ Sentences do
6    Dict ← {} (Create a dictionary for each sentence)
7    For attribute ∈ sentence["attributes"] do
8      Remove attribute with 'NULL' target
9      Update the dictionary with the target's position starting index as a key
       Dict[from] ← [target, from, to, polarity]
     End


10   Last_end ← 0 (pointer)
11   For key ∈ Sort(Dict) do
12     target, from_, to_, polarity ← Dict[key]
13     Extract the text that precedes the first target
       Text_with_Os ← text[last_end: from_]
14     Update the pointer to point to the remaining text
       Last_end ← to_
15     If the current target consists of only one word:
16       t ← t + target + "B-" + polarity
17     Else
18       Do the same for the first word, then change "B-" to "I-" for the remaining words.
       End
19     Concatenate the text that precedes the first target with t
       S ← text_with_Os + t
20     If the current target is the last target that appears in the sentence:
21       S ← S + Text[to_: ]
22     Replace the white spaces in S with "O" followed by a new line
23     Write the sentence S to the .iob output file.
     End
25   Return the .iob file

Algorithm 2 IOB to CSV
Input: .iob file output from Algorithm 1
Output: dataset file in .csv format
1  word_list, label_list ← [], []
2  idx_list ← [] (keeps track of words within the same sentence)
3  idx ← 0
4  For line ∈ .iob file do
5    If the line is NOT an empty line
6      word, label ← Extract the word and its label
7      word ← Remove_punctuation(word)
8      word ← Remove_diacritics(word)
9      word ← Remove_elongation(word)
10     word ← Remove_non_arabic_letters_digits(word)
11     word ← Normalization(word)
12     word_list ← word_list + word
13     label_list ← label_list + label
14     idx_list ← idx_list + idx
15   Else
16     idx ← idx + 1 (new sentence)
   End
17 csv_file ← dataframe([idx_list, word_list, label_list])
18 Join words with the same idx as one sentence in a new column
19 Join labels with the same idx as one label sequence in a new column
20 Drop the remaining columns
21 Return the .csv file

TABLE 4 shows the distribution of classes in the dataset. A sample from the dataset after preprocessing is presented in TABLE 5.

TABLE 4. Distribution of Classes in Arabic Hotel Reviews Dataset After Preprocessing.

Tag   | O     | B-POS | B-NEG | B-NEU | I-POS | I-NEG | I-NEU
Train | 79081 | 5846  | 3151  | 662   | 1130  | 629   | 85
Test  | 19370 | 1430  | 786   | 163   | 274   | 194   | 26
Total | 98451 | 7276  | 3937  | 825   | 1404  | 823   | 111

TABLE 5. Samples of The Dataset After Preprocessing.

Sentence                                                                                                  | Label Sequence
كانت الغرفه ممتازه وكذلك الموظفين (i.e., The room was excellent and so were the staff)                    | B-POS O O B-POS O
فريق العمل الودود والمتعاون على الاطلاق (i.e., Absolutely friendly and cooperative staff team)            | O O O O I-POS B-POS
موقع الاوتيل جيد والاكل جيد والحلويات مميزه (i.e., The hotel's location is good, the food is good, and the desserts are distinctive) | O B-POS O B-NEU O I-NEU B-NEU

C. DATA REPRESENTATION
1) TOKENIZATION AND ENCODING
Before feeding the input sentence to the model, it must be tokenized and encoded in a specified form. This stage makes use of BERT's WordPiece tokenizer. It divides the token into subtokens of known and unknown words; for example, the word "سيئه", i.e., "bad", could be tokenized into two subwords, سيئـ and ‎##ـه; this suffix in the Arabic language is similar to "ing", "ed", "s" and others, but with a different meaning. For instance, the word "Learning" could be tokenized into "Learn" and "##ing".
This tokenization method eliminates the out-of-vocabulary (OOV) problem, hence resolving the complexity and ambiguity of the Arabic language without the need for extensive preprocessing (stemming or lemmatization). However, this strategy may result in a mismatch between the input tokens and the labels in the dataset. As illustrated in FIGURE 3, the label sequence has only six elements, whereas the tokenized sentence has nine tokens. To deal with this problem, each subtoken beginning with '##' will be ignored, leaving only the known part of the token and its corresponding label to be fed to the model.

FIGURE 3. Example That Clarifies the Problem with The WordPiece Tokenizer.

Furthermore, BERT's tokenizer attaches special tokens to each sentence. [CLS] and [SEP] tokens are attached at the beginning and end of the sentence, respectively. It also pads the input sentences to the same length by appending the special token [PAD] at the end. Tokens are then encoded into three vectors of integer values: a vector of token ids utilizing BERT's vocabulary, a vector of mask values, and a vector of token type ids. These vectors are utilized as inputs to the BERT embedding layers to generate the initial representations.

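The ‘##’ alignment strategy described above can be sketched as follows. The tokens and tags in the example are invented purely for illustration; a real pipeline would obtain the subtokens from BERT's WordPiece tokenizer.

```python
def align_wordpiece_labels(tokens, word_labels):
    """Align a word-level label sequence with WordPiece output by
    dropping '##' continuation pieces, as described above."""
    labels = iter(word_labels)
    aligned = []
    for tok in tokens:
        if tok.startswith("##"):
            continue  # continuation piece: ignored, no label is consumed
        aligned.append((tok, next(labels)))
    return aligned
```

For instance, the tokenized sentence ["the", "room", "was", "ba", "##d"] with the four word-level tags ["O", "B-NEG", "O", "O"] yields four aligned (token, label) pairs, restoring the one-to-one correspondence required by the model.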
D. FEATURE EXTRACTION

6 VOLUME 11, 2023

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3342755

1) PRE-TRAINED ARABERT MODEL
AraBERT is a pre-trained Arabic language model based on the BERT language model. It embeds a sequence of words into a sequence of contextualized vectors with specific dimensions. The BERT model has three Embedding Layers:
• Token Embedding, which encodes the meaning of each word, utilizing the input ids vector.
• Segment Embedding, which encodes the sentence a token belongs to, utilizing the token type ids vector.
• Position Embedding, which encodes the word's position in the input sentence.
These embeddings are summed, providing context-independent word embeddings. To generate the contextualized embeddings, the self-attention mechanism of the Transformer's Encoder component is utilized [18], in which each input element is connected to every other input element, and the weightings (attention scores) between them are dynamically calculated based on that connection.
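A toy sketch of the three embedding lookups is shown below. The vocabulary size, hidden size, and ids are invented for illustration (BERT-base uses a 768-dimensional hidden size), and BERT-style implementations combine the three lookups by element-wise addition before layer normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, hidden = 100, 8, 16   # toy sizes; BERT-base uses hidden = 768

token_table    = rng.normal(size=(vocab_size, hidden))   # one row per vocabulary id
segment_table  = rng.normal(size=(2, hidden))            # sentence A / sentence B
position_table = rng.normal(size=(max_len, hidden))      # one row per position

input_ids      = np.array([2, 41, 7, 3])                 # toy token ids
token_type_ids = np.zeros(4, dtype=int)                  # a single-segment input

# one context-independent vector per token: the three lookups added element-wise
embeddings = (token_table[input_ids]
              + segment_table[token_type_ids]
              + position_table[np.arange(4)])
```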
FIGURE 4. Self-Attention Mechanism Adapted From [18]. Words in The Sequence with a 768-Dimensional Vector are Represented by w1, w2, and w3. Matrices Q_w, K_w, and V_w are of Size: Number of Words in The Sequence x 768.

As illustrated in FIGURE 4, the initial embeddings are utilized in combination with randomly initialized weight matrices, Query (Q_w), Key (K_w), and Value (V_w), to form the Q, K, and V matrices, which are used to calculate attention scores as indicated in (1) [18].
At each time step, a dot product is applied to calculate the similarity between a target word (Query word) and every other word in the sequence (Key words). A division and a softmax function are used to normalize the calculated scores; d_k denotes the dimension of the K matrix, which is the same as the embedding dimension (768 for BERT-base). The normalized scores are used to weight the V matrix, resulting in a weighted feature vector for each input token.

Attention(Q, K, V) = Softmax(QK^T / √d_k) V    (1)

FIGURE 5. Multi-Head Attention Mechanism. Each Encoder Block Contains 12 Attention Heads, Each with its own Self-Attention.
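Equation (1) can be sketched in a few lines of NumPy. The single-head, unbatched form below is illustrative only; BERT applies it per head with learned Q/K/V projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Equation (1): Softmax(Q K^T / sqrt(d_k)) V, computed row by row."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V, weights
```

Each row of the returned weight matrix holds one Query word's normalized attention scores over all Key words, and the output is the correspondingly weighted combination of the V rows.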
The AraBERT-base model comprises 12 attention heads included within each of its 12 Encoder blocks. The outputs of the attention heads are combined to generate the final contextualized embeddings, as shown in FIGURE 5.
Two experiments were conducted: first, utilizing AraBERT as a feature-based model, and second, fine-tuning its parameters within a Deep Learning model.

FIGURE 6. ReLU Activation Function is Used in a Multi-Layer Perceptron with Two Dense Hidden Layers of 128 and 64 Neurons, respectively.
2) DENSE LAYER
We implement a Multi-Layer Perceptron (MLP) that comprises two Dense hidden layers with a Rectified Linear Unit (ReLU) [38] activation function to reduce the input dimensionality and speed up the training process.
The embeddings from the final Encoder block, referred to as the last hidden state, are used as inputs to the ReLU activation function defined by (2) as follows:

D = ReLU(WH + b) = Max((WH + b), 0)    (2)

Where H is the last hidden state matrix of dimensions: sequence length x 768, W is a trainable weight matrix, and b is a bias term. FIGURE 6 illustrates the process of the MLP with ReLU Dense layers.

E. CLASSIFICATION
1) SOFTMAX
For the classification stage, we initially investigated utilizing a fully connected layer with a softmax activation function to predict a tag for each input token. Softmax [24] is a function that normalizes the output of a neural network to a probability distribution over the predicted output classes as follows:

ŷ = softmax(WD + b) = e^{d_i} / Σ_{j=0}^{C} e^{d_j}    (3)

Where ŷ denotes the matrix of predicted probabilities, d_i denotes the hidden representation of a token with respect to class i, while d_j denotes the representation with respect to all classes C.
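The dimensionality reduction (2) followed by the softmax scoring (3) can be sketched with NumPy. The shapes mirror the configuration described in the paper (sequence length 64, hidden size 768, Dense layers of 128 and 64 neurons, 7 tags), but the random weights and the H @ W orientation are illustrative assumptions, not the trained model.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)                        # equation (2)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)         # equation (3)

rng = np.random.default_rng(1)
H = rng.normal(size=(64, 768))                       # last hidden state: seq_len x 768
W1, b1 = rng.normal(size=(768, 128)) * 0.01, np.zeros(128)
W2, b2 = rng.normal(size=(128, 64)) * 0.01, np.zeros(64)
Wc, bc = rng.normal(size=(64, 7)) * 0.01, np.zeros(7)   # 7 output tags

D = relu(relu(H @ W1 + b1) @ W2 + b2)                # 768 -> 128 -> 64
y_hat = softmax(D @ Wc + bc)                         # per-token tag probabilities
```

Each row of y_hat is a probability distribution over the seven tags for one token in the sequence.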


Because the proposed E2E-ABSA is a multi-class classification task, the model is trained to minimize the categorical cross-entropy [39] between predicted and true results as follows:

L(ŷ, y) = −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{C} y_ik log(ŷ_ik)    (4)

Where y_ik indicates the i-th true label, which is a one-hot encoded vector of class k; ŷ_ik indicates the i-th predicted probability of class k, and N is the number of samples in the training dataset.

2) CONDITIONAL RANDOM FIELDS
Instead of modeling tagging decisions independently, we can model them jointly using Conditional Random Fields (CRF) [23]. The linear-chain CRF is a discriminative model for predicting the probability of a sequence of labels given a sequence of observations while taking the labels' dependencies into account. We experimented with utilizing the linear-chain CRF as a classification layer for our proposed model.
For a sequence input X = {x_1, x_2, …, x_n}, we consider the matrix P to be the emission scores output by the AraBERT hidden states after processing. Its size is n x k, where k is the number of distinct tags/labels and P_{i,j} represents the score of the tag j given the observed word i. For a sequence of labels Y = {y_1, y_2, …, y_n}, the sequence score is defined by (5) [40] as follows:

S(X, Y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}    (5)

Where A is the transition scores matrix learned during the training and A_{i,j} represents a transition score from tag i to tag j, responsible for setting constraints on the tags to ensure the tag dependencies.
After calculating the sequence score, the softmax function is applied to calculate the likelihood probability for the correct tag sequence Y over all the possible tag sequences Ŷ. It is defined by (6) [40] as follows:

P(Y|X) = e^{S(X,Y)} / Σ_{Y'∈Ŷ} e^{S(X,Y')}    (6)

Our model's parameters are trained to maximize the log-likelihood of the correct tag sequence by minimizing the negative log-likelihood defined by (7) [40]:

−log(P(Y|X)) = −[S(X, Y) − log Σ_{Y'∈Ŷ} e^{S(X,Y')}]    (7)

For prediction, the Viterbi algorithm is used to find the tag sequence with the highest score Y*:

Y* = Argmax_{Y'∈Ŷ} S(X, Y')    (8)

The operational flow within CRF is shown in FIGURE 7.

FIGURE 7. The Operations Flow Within a Linear-Chain CRF.

IV. EXPERIMENTAL RESULTS AND EVALUATION
This section presents the experimental settings and evaluation metrics, along with the results and discussion of the conducted experiments.

A. EXPERIMENTAL SETTINGS
The base version of the pre-trained language model AraBERT is utilized during experiments. The AraBERT-base model was released in four versions: AraBERTv0.1, AraBERTv1, AraBERTv0.2, and AraBERTv2. We utilized AraBERTv0.2. The model is available on the HuggingFace model page under the aubmindlab name.2
For building our functional network, we used the Keras3 API, which is built on top of the TensorFlow Python package. The model was trained for 5 epochs with a batch of 32 input samples, each padded to a maximum length of 64 tokens. Each input token is encoded into a 768-dimensional vector. The Adam optimizer [41] is utilized with a learning rate of 5e-5. The model comprises two Dense layers with 128 and 64 neurons, respectively. Other hyper-parameters are the same as those in the pre-trained AraBERTv0.2 implementation. All experiments were run on Google Colaboratory with a Tesla P100 GPU, 25 GB RAM, and 167 GB Disk Space.
AraBERTv0.2 implementation. All experiments were run on

2 https://huggingface.co/aubmindlab/bert-base-arabertv02
3 https://keras.io/api/


B. EVALUATION METRICS
All experiments were evaluated with four versions of k-fold cross-validation [42]: 3, 5, 10, and 15. The entire dataset (train and test) is shuffled and divided into k smaller sets; for each k, the model is trained using k−1 of the folds as training data, then the model is validated on the test part. This process is repeated k times with a new model and different testing folds in each case. The performance measure is then the average of the values computed in the loop to ensure the model's resistance to overfitting.
The following metrics [43, 44] will be used to evaluate our proposed model, including Precision (P), Recall (R), F1 score, Accuracy (ACC), Area Under Curve (AUC), and Area Under Precision-Recall (AUPR), which are defined by (9)-(15) as follows:

Precision = TP / (TP + FP)    (9)
Recall = TPR = TP / (TP + FN)    (10)
F1 score = 2(Precision × Recall) / (Precision + Recall)    (11)
Acc = (TP + TN) / (TP + TN + FP + FN)    (12)
FPR = FP / (FP + TN)    (13)
AUC = ∫_0^1 TPR d(FPR)    (14)
AUPR = Σ_n (Recall_n − Recall_{n−1}) Precision_n    (15)

The Precision (9) is the ratio of correctly predicted values for a class to all of its predictions, while the Recall (10), or the True Positive Rate (TPR), is the ratio of correctly predicted values for a class to the number of actual samples of that class in the dataset.
The F1 score (11) is the harmonic average of Precision and Recall and is used mainly for evaluating sequence labeling tasks [9, 13, 17, 34].
The Accuracy (12) is obtained by dividing the correctly classified labels by the total number of labels in the dataset.
The Receiver Operating Characteristic (ROC) [44] curve is summarized by the AUC (14) based on the TPR and the False Positive Rate (FPR) at different classification thresholds. The higher the AUC, the better the model's performance in distinguishing between positive and negative classes.
The Precision-Recall (PR) curve is summarized by the AUPR [44], defined by (15), as the weighted mean of precisions achieved at each threshold n, where the weights are the increase in Recall from the previous threshold n−1.
We calculate the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) for each tag independently. For example, in terms of the B-POS tag:
• TP is the number of samples predicted as B-POS whose actual label is also B-POS.
• FP is the number of samples predicted as B-POS whose actual label is something else.
• FN is the number of B-POS samples predicted as something else.
• TN is the number of samples predicted as not B-POS whose actual label is also not B-POS.
The evaluation scores are evaluated token-wise [45], then an average value is calculated as the proposed model's evaluation score (macro-average [43]). Furthermore, BERT's tokenizer generates new labels that are not defined in the dataset, created by the [CLS], [SEP], and [PAD] tokens discussed earlier. Those labels are ignored since they are irrelevant to the actual inference. Therefore, only the seven entities specified by B-NEG, B-POS, B-NEU, I-POS, I-NEG, I-NEU, and O are reported for the evaluation metrics.

C. RESULTS AND DISCUSSION
This section presents the experiments that were carried out in this study, along with an analysis of the obtained results.

1) EXPERIMENT 1: FEATURE-BASED METHOD
In this experiment, we investigated the impact of utilizing the pre-trained AraBERT model as a feature-based model while keeping its parameters fixed during the training process.
As stated in TABLE 6 and TABLE 7, the performance of the feature-based AraBERT model on our E2E-ABSA task is not particularly outstanding in any of the folds, whether using the CRF or the MLP with softmax as the classifier. The model does not appear to learn the required contextualized features.
This behavior is expected because the AraBERT model was pre-trained on two specific tasks: Next Sentence Prediction and Masked Language Modeling [21]. The representation of the model is obviously insufficient for the downstream task, and task-specific fine-tuning is required to take advantage of AraBERT's capabilities in enhancing performance. However, the best performance was achieved by the 15-fold AraBERT-softmax model with a Precision of 29.91%, Recall of 20.09%, F1 score of 21.65%, AUC of 94.57%, and AUPR of 26.85%.
Additionally, the average-AUC value seems high compared to the other model performance metrics. This is often true for highly imbalanced datasets. As illustrated in FIGURE 8 (a), the ROC curve has two lines: one for how often the model correctly identifies positive cases (TPR) and another for how often it mistakenly identifies negative cases as positive (FPR). However, the false positive rate could be pulled down due to the large number of true negatives, resulting in a high-pointed ROC curve.
In our proposed feature-based model, it is apparent that the model confuses the tags B-POS, B-NEG, B-NEU, I-POS, I-NEG, and I-NEU with the tag O, and in some instances, it predicts the tag O more often than the correct tag (the current tag in a one-vs-rest). This implies that the model has an asymmetric error distribution, and the ROC curve fails to explicitly show this performance difference.


TABLE 6. Feature-Based AraBERT-Softmax Evaluation Results Using Different K-Folds.

#Fold 3 5 10 15
Label / Metric P R F1 AUC AUPR P R F1 AUC AUPR P R F1 AUC AUPR P R F1 AUC AUPR
B-POS 0.52 0.15 0.23 0.94 0.33 0.52 0.25 0.33 0.96 0.39 0.55 0.25 0.34 0.96 0.41 0.54 0.32 0.40 0.97 0.46
B-NEG 0.35 0.02 0.04 0.93 0.16 0.36 0.07 0.11 0.95 0.20 0.45 0.06 0.11 0.95 0.21 0.45 0.11 0.17 0.96 0.27
B-NEU 0.00 0.00 0.00 0.90 0.01 0.02 0.00 0.00 0.90 0.02 0.00 0.00 0.00 0.93 0.03 0.00 0.00 0.00 0.93 0.03
O 0.88 0.96 0.92 0.99 0.97 0.89 0.96 0.93 0.99 0.97 0.89 0.97 0.93 0.99 0.97 0.90 0.97 0.93 0.99 0.98
I-POS 0.17 0.00 0.00 0.92 0.05 0.09 0.00 0.00 0.93 0.08 0.06 0.00 0.00 0.93 0.06 0.17 0.01 0.02 0.94 0.09
I-NEG 0.00 0.00 0.00 0.89 0.02 0.07 0.00 0.00 0.92 0.04 0.07 0.00 0.00 0.92 0.03 0.07 0.00 0.00 0.93 0.05
I-NEU 0.00 0.00 0.00 0.84 0.00 0.00 0.00 0.00 0.86 0.00 0.00 0.00 0.00 0.87 0.00 0.03 0.00 0.00 0.90 0.00
Macro-average 27.40 16.12 16.95 91.57 22.01 27.87 18.28 19.65 93.00 24.30 28.79 18.26 19.68 93.57 24.42 29.91 20.09 21.65 94.57 26.85

TABLE 7. Feature-Based AraBERT-CRF Evaluation Results Using Different K-Folds.

#Fold 3 5 10 15
Label / Metric P R F1 AUC AUPR P R F1 AUC AUPR P R F1 AUC AUPR P R F1 AUC AUPR
B-POS 0.52 0.23 0.31 0.95 0.34 0.50 0.24 0.32 0.95 0.35 0.53 0.24 0.33 0.96 0.39 0.51 0.27 0.35 0.96 0.39
B-NEG 0.29 0.03 0.05 0.91 0.14 0.42 0.05 0.09 0.94 0.19 0.39 0.07 0.11 0.95 0.22 0.36 0.08 0.13 0.95 0.20
B-NEU 0.00 0.00 0.00 0.89 0.02 0.00 0.00 0.00 0.88 0.01 0.00 0.00 0.00 0.91 0.02 0.02 0.00 0.00 0.90 0.02
O 0.89 0.96 0.92 0.99 0.96 0.89 0.96 0.92 0.99 0.97 0.89 0.96 0.93 0.99 0.97 0.89 0.96 0.92 0.99 0.97
I-POS 0.14 0.00 0.01 0.86 0.03 0.21 0.01 0.01 0.92 0.06 0.16 0.01 0.01 0.91 0.06 0.27 0.01 0.01 0.93 0.08
I-NEG 0.03 0.00 0.00 0.91 0.02 0.02 0.00 0.00 0.88 0.02 0.05 0.01 0.01 0.90 0.03 0.00 0.00 0.00 0.92 0.04
I-NEU 0.00 0.00 0.00 0.84 0.00 0.00 0.00 0.00 0.85 0.00 0.00 0.00 0.00 0.86 0.00 0.00 0.00 0.00 0.86 0.00
Macro-average 26.71 17.39 18.43 90.71 21.57 29.10 17.97 19.26 91.67 22.85 28.83 18.37 19.87 92.57 24.14 29.31 18.75 20.20 93.00 24.30

FIGURE 8. (a) ROC Curve and (b) PR Curve for 15-Fold Feature-Based AraBERT-Softmax Model in a One-
vs-Rest Approach.

Concurrently, as all tags contribute equally to the classification task, the PR curve is used instead; this metric computes a weighted average precision value for each tag independent of the predictions of other tags. As illustrated in FIGURE 8 (b), a model that is considered good by ROC-AUC performs poorly on the PR curve, which focuses on the positive labels (current tag) rather than the true negatives.

2) EXPERIMENT 2: FINE-TUNED METHOD
In this experiment, the AraBERT model's parameters are fine-tuned during the training process.
As demonstrated in TABLE 8 and TABLE 9, results were significantly improved when the model's parameters were adjusted for our E2E-ABSA task rather than using the model as a feature-based one only.
With CRF as a classifier, the best performance was achieved by 15-fold with 95.41% Precision, 95.23% Recall, 95.16% F1 score, 100% AUC, and 97.37% AUPR; similarly, using the MLP with softmax as a classifier, the best performance was achieved by 15-fold with 94.14% Precision, 94.52% Recall, 94.18% F1 score, 100% AUC, and 98% AUPR.
Consequently, as illustrated in FIGURE 9 (a), the data point that is close to 1 on the TPR axis is actually the optimal threshold, which means that at this threshold, the classifier is perfectly able to distinguish between the positive class (the current tag in a one-vs-rest) and the negative class (the rest of the tags). However, since AUC can appear excellent under imbalanced settings, the results could be misleading. For instance, the performance gap between AUC and the pointwise metrics (P, R, and F1 score) for the 3-fold AraBERT-Softmax model, presented in TABLE 8, is significant. The AUC is 99.85%, whereas P, R, and F1 score are 69.13%, 71.50%, and 69.23%, respectively, which means that even a perfect ROC-AUC does not mean that the predictions are well-calibrated.
FIGURE 9 (b) illustrates the AUPR in a one-vs-rest approach for the 15-fold fine-tuned AraBERT-CRF model, where the average-AUPR is 97.37%. As presented in TABLE 8 and TABLE 9, CRF outperformed softmax in all mentioned pointwise metrics; however, the average-AUPR for the 15-fold AraBERT-Softmax model is 98%. It should be noted that CRF employs the transition scores matrix to generate the prediction probabilities, whereas the PR curve and ROC curve depend only on the emission scores of each token independently. However, CRF still produces quite stable results.
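Equations (14) and (15) can be sketched directly from lists of (FPR, TPR) and (Recall, Precision) points. The trapezoidal integration below is one common way to approximate (14); both helpers are illustrative and are not the plotting code behind FIGURE 8 and FIGURE 9.

```python
import numpy as np

def auc_trapezoid(fpr, tpr):
    """Equation (14): area under the ROC curve, integrated over FPR."""
    fpr, tpr = np.asarray(fpr, float), np.asarray(tpr, float)
    order = np.argsort(fpr)
    f, t = fpr[order], tpr[order]
    return float(np.sum((f[1:] - f[:-1]) * (t[1:] + t[:-1]) / 2.0))

def aupr_stepwise(recall, precision):
    """Equation (15): sum of (R_n - R_{n-1}) * P_n over the thresholds."""
    r, p = np.asarray(recall, float), np.asarray(precision, float)
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))
```

A perfect classifier traces the ROC points (0, 0), (0, 1), (1, 1) and scores an AUC of 1, while a random one following the diagonal scores 0.5.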


TABLE 8. Fine-Tuned AraBERT-Softmax Evaluation Results Using Different K-folds.

#Fold 3 5 10 15
Label / Metric P R F1 AUC AUPR P R F1 AUC AUPR P R F1 AUC AUPR P R F1 AUC AUPR
B-POS 0.79 0.79 0.79 1.00 0.88 0.89 0.90 0.90 1.00 0.96 0.94 0.94 0.94 1.00 0.99 0.96 0.97 0.96 1.00 0.99
B-NEG 0.78 0.86 0.81 1.00 0.88 0.90 0.91 0.90 1.00 0.96 0.95 0.94 0.94 1.00 0.98 0.96 0.97 0.97 1.00 0.99
B-NEU 0.47 0.46 0.45 0.99 0.47 0.75 0.69 0.71 1.00 0.81 0.86 0.83 0.84 1.00 0.93 0.90 0.88 0.89 1.00 0.96
O 0.98 0.97 0.98 1.00 1.00 0.99 0.99 0.99 1.00 1.00 0.99 0.99 0.99 1.00 1.00 1.00 0.99 1.00 1.00 1.00
I-POS 0.69 0.74 0.71 1.00 0.77 0.83 0.84 0.83 1.00 0.90 0.91 0.91 0.91 1.00 0.96 0.93 0.96 0.94 1.00 0.97
I-NEG 0.71 0.83 0.76 1.00 0.82 0.85 0.91 0.88 1.00 0.94 0.93 0.92 0.92 1.00 0.97 0.93 0.97 0.95 1.00 0.99
I-NEU 0.42 0.35 0.33 1.00 0.37 0.64 0.61 0.61 1.00 0.74 0.86 0.80 0.81 1.00 0.90 0.91 0.88 0.89 1.00 0.96
Macro-average 69.13 71.50 69.23 99.85 74.14 83.53 83.54 83.19 1.00 90.14 91.93 90.50 90.79 1.00 96.14 94.14 94.52 94.18 1.00 98.00

TABLE 9. Fine-Tuned AraBERT-CRF Evaluation Results Using Different K-folds.

#Fold 3 5 10 15
Label / Metric P R F1 AUC AUPR P R F1 AUC AUPR P R F1 AUC AUPR P R F1 AUC AUPR
B-POS 0.84 0.87 0.85 0.99 0.91 0.90 0.92 0.91 1.00 0.96 0.95 0.95 0.95 1.00 0.97 0.96 0.97 0.97 1.00 0.99
B-NEG 0.86 0.88 0.87 1.00 0.91 0.90 0.92 0.91 1.00 0.95 0.96 0.94 0.95 1.00 0.96 0.97 0.97 0.97 1.00 0.98
B-NEU 0.69 0.63 0.66 0.99 0.74 0.79 0.71 0.74 0.99 0.82 0.89 0.87 0.88 1.00 0.94 0.92 0.90 0.91 1.00 0.95
O 0.98 0.98 0.98 1.00 1.00 0.99 0.99 0.99 1.00 1.00 0.99 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00
I-POS 0.79 0.79 0.78 0.99 0.80 0.89 0.84 0.86 0.99 0.89 0.92 0.92 0.92 1.00 0.94 0.93 0.96 0.95 1.00 0.97
I-NEG 0.80 0.84 0.82 1.00 0.86 0.90 0.88 0.89 1.00 0.92 0.95 0.92 0.93 1.00 0.95 0.96 0.95 0.96 1.00 0.97
I-NEU 0.57 0.55 0.55 1.00 0.67 0.78 0.71 0.73 1.00 0.81 0.89 0.85 0.87 1.00 0.94 0.94 0.92 0.92 1.00 0.95
Macro-average 79.03 79.24 78.78 99.57 84.14 87.97 85.20 86.13 99.85 90.71 93.60 92.16 92.75 1.00 95.88 95.41 95.23 95.16 1.00 97.37

FIGURE 9. (a) ROC Curve and (b) PR Curve for 15-Fold Fine-Tuned AraBERT-CRF Model in a One-vs-Rest
Approach.

Furthermore, we observe that maintaining boundary-sentiment consistency within the same aspect term, particularly for those with multiple words (e.g., "نظافة و جودة الطعام", i.e., "food cleanliness and quality"), is difficult for the AraBERT-Softmax model. In contrast, the AraBERT-CRF model resolves this problem by employing the transition matrix component to generate predictions based on the features from both the current and previous tags. As illustrated in FIGURE 10, when using softmax, the E2E-ABSA problem becomes a token-wise classification problem, predicting a tag for each token independently of the other tags in the sequence. However, this behavior can lead to errors in the overall prediction. For instance, in TABLE 10, the word "نظافة" ("cleanliness") is misclassified as a non-aspect, ignoring its relation to the word "الطعام" ("food"), while it represents the beginning of the aspect "نظافة الطعام" ("food cleanliness") and should have been assigned the tag B-POS. Additionally, the tag B-POS is assigned to the aspect "الطعام" ("food") while its actual tag is I-POS, ignoring its relation to the word "جودة" ("quality") and the word "نظافة" ("cleanliness"), since it represents the inside of the aspect "جودة الطعام" ("food quality") and the aspect "نظافة الطعام" ("food cleanliness").

FIGURE 10. The Difference Between CRF and Softmax Behaviour.

On the other hand, when CRF is used as a classifier, the E2E-ABSA problem turns into a sequence labeling problem. As shown in FIGURE 10, the model predicts a tag for each token while accounting for tag transition dependencies. For instance, in TABLE 11, CRF ensures that a predicted tag for a certain word is compatible with the other tags in the tag sequence, preventing errors that could occur when using softmax as the classification layer.
As demonstrated in the results above, CRF outperformed softmax by a small margin; this is likely due to the multi-head self-attention mechanism of BERT, which leads to incorporating significant information from concatenated local and global context words and learning further interactive aspect-sentiment representations, which helps the proposed model produce improved sequence representations. The difference will be more evident when utilizing context-free embedding models.
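The tag-compatibility constraints that the CRF transition matrix learns can be made explicit with a small hand-written checker. The functions below are a hypothetical illustration of the rule that an I-&lt;polarity&gt; tag may only continue an aspect of the same polarity; they are not part of the proposed model.

```python
def is_valid_transition(prev_tag, tag):
    """An I-<polarity> tag may only continue an aspect of the same
    polarity; 'O' and 'B-*' tags may follow anything."""
    if tag.startswith("I-"):
        polarity = tag[2:]
        return prev_tag in (f"B-{polarity}", f"I-{polarity}")
    return True

def is_valid_sequence(tags):
    """Check a whole tag sequence, treating 'O' as the virtual start tag."""
    return all(is_valid_transition(p, c) for p, c in zip(["O"] + tags, tags))
```

Under these rules, the softmax mistake in TABLE 10 (an I-POS aspect continuation emitted without a matching B-POS head) would be flagged as an invalid sequence, which is exactly the kind of path a trained transition matrix learns to penalize.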


TABLE 10. Example of Model Inference with Softmax as a Classification Layer.

Word        نظافة   و   جودة    الطعام   ليس   لها   مثيل
Label       B-POS   O   B-POS   I-POS    O     O     O
Prediction  O       O   B-POS   B-POS    O     O     O

TABLE 11. Example of Model Inference with CRF as a Classification Layer.

Word        نظافة   و   جودة    الطعام   ليس   لها   مثيل
Label       B-POS   O   B-POS   I-POS    O     O     O
Prediction  B-POS   O   B-POS   I-POS    O     O     O

However, the best performance was achieved by the 15-fold fine-tuned AraBERT-CRF model with 95.41% Precision, 95.23% Recall, and 95.16% F1 score. The confusion matrix of this model is presented in FIGURE 11; based on the differences between predicted and actual labels, it is demonstrated that the model can discriminate between labels effectively.

FIGURE 11. Normalized Confusion Matrix of The 15-Fold Fine-Tuned AraBERT-CRF Model Showing All Entities in The Dataset.

3) COMPARISONS WITH EXISTING STUDIES
To evaluate the proposed E2E-ABSA approach, we further processed the predicted labels to separate them into two distinct categories: aspect term labels (B, I, and O) and sentiment polarity labels (positive, negative, and neutral), to be appropriate for comparisons with the previous single-task approaches. By reformulating the predictions into two separate tasks, we can maximize the evaluation scores, which results in a better classifier for each task and ultimately enhances the ABSA task.
We utilized the Precision, Recall, and F1 score metrics to evaluate the two tasks, in addition to the Accuracy for evaluating the ASC task. As shown in TABLE 12, the 15-fold fine-tuned AraBERT-CRF model achieved the best performance, with F1 scores of 97.78% for the ATE task and 96.22% for the ASC task, respectively, and an Accuracy of 98.34% for the ASC task.

TABLE 12. Experimental Results After Splitting The Prediction of The E2E Fine-tuned AraBERT Model for The Tasks of Aspect Term Extraction (ATE) and Aspect Sentiment Classification (ASC).

Model                        Task   #Fold   P(%)    R(%)    F1(%)   ACC(%)
Fine-tuned AraBERT-Softmax   ATE    3       82.98   90.23   86.29   -
                             ATE    5       90.78   92.86   91.79   -
                             ATE    10      95.25   96.29   95.75   -
                             ATE    15      96.93   97.11   97.02   -
                             ASC    3       75.99   80.81   78.23   90.20
                             ASC    5       86.45   85.98   86.18   94.43
                             ASC    10      92.01   93.73   92.85   97.12
                             ASC    15      95.53   95.68   95.61   98.07
Fine-tuned AraBERT-CRF       ATE    3       91.01   89.66   90.32   -
                             ATE    5       93.78   92.97   93.36   -
                             ATE    10      96.89   96.10   96.49   -
                             ATE    15      97.80   97.05   97.78   -
                             ASC    3       86.69   84.24   85.39   93.64
                             ASC    5       89.77   91.03   90.38   95.59
                             ASC    10      94.63   94.70   94.66   97.69
                             ASC    15      96.25   96.19   96.22   98.34

Additionally, we observed that the ATE task consistently outperforms the ASC task and the E2E-ABSA task. This result indicates that the boundary information learned by the model enhances the evaluation scores of the overall E2E-ABSA task. Therefore, utilizing a model that can set constraints on the boundary information is crucial for improving the overall E2E-ABSA task, and the CRF model can be a straightforward and efficient solution.
Furthermore, we observed that k-fold cross-validation may have an impact on the model's performance. By employing k-fold cross-validation, all parts of the dataset can be used for training and testing, forcing the model to attend to a larger context and increasing the possibility of associating with relevant opinion words without overfitting. According to TABLE 12, the best results are obtained at k=15, which means that a small k is likely insufficient to involve the potential opinion words and does not offer an accurate evaluation of the model's performance. FIGURE 12 illustrates the train and test error of the 15-fold AraBERT-CRF model, which shows the model's resistance to overfitting. If the model overfits in a particular fold, the training error of that fold will be less than the testing error; hence, when summing/averaging the errors of all folds, a model that overfits would have a low cross-validated performance.

FIGURE 12. Train and Test Error of The 15-Fold Fine-Tuned AraBERT-CRF Model.
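The k-fold protocol described in Section IV-B can be sketched as follows; this index-splitting helper is a simplified stand-in for the cross-validation utilities actually used in the experiments.

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k-fold cross-validation:
    each fold serves once as the test split, the rest as training."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Averaging a per-fold score over the k iterations then gives the cross-validated estimate reported in the tables above.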


TABLE 13. Comparison Results of Different Studies on The Arabic Hotel


Reviews Dataset.
Task Model Acc (%) F1 (%)
BiLSTM-CRF [5] - 69.88

BiGRU-CRF [3] - 69.44


BiLSTM-Attention-
- 72.83
CRF [8]
Aspect Term BF-BiLSTM-CRF [10] - 79.9
Extraction (ATE)
MUSE-BiGRU [6] - 93
(proposed) Fine-tuned FIGURE 13. Comparisons Results With The Previous Single-Task
- 97.02 Approaches.
AraBERT-Softmax
(proposedl) Fine-
tuned AraBERT-CRF
- 97.78 Furthermore, it is observed that fine-tuning the AraBERT
SVM [26] 95.4 -
model using MLP with softmax already outperformed the
existing works without using CRF; this is likely due to
Bi-Indy-LSTM +
Recurrent Attention 87.31 - AraBERT representations encoding the associations
[12] between input tokens, which significantly enhances the
fine-tuned Arabic
89.51 -
model performance. However, utilizing a model that sets
BERT [11] restrictions about which tag should come before or after
Aspect Sentiment
fine-tuned
Classification (ASC)
AraBERTv0.1 [13]
84.65 - another helps direct the model to more accurate tag-
sentiment prediction.
MUSE-BiGRU [6] 91.40 -
(proposed) Fine-tuned V. CONCLUSION AND FUTURE WORK
98.07 -
[Table rows (proposed models): AraBERT-Softmax (proposed); Fine-tuned AraBERT-CRF (proposed) — 98.34 / –]

As a result, we compared the proposed 15-fold fine-tuned AraBERT-CRF model with several previous research works on the Arabic Hotel Reviews dataset to evaluate its quality. The proposed model outperformed the previous single-task methods. As shown in TABLE 13, compared to BiGRU-CRF [3], BiLSTM-CRF [5], and BiLSTM-Attention-CRF [8], which achieved F1 scores of 69.88%, 69.44%, and 72.83%, respectively, it achieved increases of 27.9, 28.34, and 24.95 percentage points, respectively, on the F1 score for the ATE task. Compared to BF-BiLSTM-CRF [10], the proposed model improved the ATE F1 score by 17.88 percentage points, and compared to MUSE-BiGRU [6], it achieved absolute gains of 4.78 and 6.94 percentage points on the ATE F1 score and the ASC accuracy, respectively, indicating that a unified E2E model with an appropriate design can be more effective than single-task approaches on the ABSA task.

While the works presented in TABLE 13 for the ASC task relied on pre-identified aspect information, the proposed model achieved better results without aspect term annotation: it outperformed the SVM model used in [26] by 2.94 percentage points, reaching 98.34% accuracy, and improved on the Bi-Indy-LSTM recurrent attention model in [12] by 11.03 percentage points. The works in [11, 13] fine-tuned Arabic BERT models with a single token-classification layer and achieved accuracies of 84.65% and 89.51%, respectively; the proposed model outperformed them by 13.69 and 8.83 percentage points, respectively, on the ASC task. FIGURE 13 illustrates the comparisons with the previous single-task approaches.

CONCLUSION
This study investigated the importance of tackling the subtasks of ABSA, specifically Aspect Term Extraction and Aspect Sentiment Classification, simultaneously through a single model that preserves the relationship between the two subtasks, which most related work has neglected. Unlike single-task approaches, our model creates a direct interaction between aspect terms and their sentiment polarities, eliminating the need to feed pre-identified aspect features alongside the sentence in order to retrieve the sentiment polarity. To this end, we utilized a unified tagging schema to create an End-to-End ABSA task and evaluated the proposed approach on the SemEval-2016 Arabic Hotel Reviews dataset. Several experiments were performed with the AraBERT model, and the results showed that the proposed fine-tuned AraBERT-CRF model outperformed the existing state-of-the-art models, achieving an overall F1 score of 95.11%.

The predictions were then post-processed by splitting them into ATE labels and ASC labels for a valid comparison. The results indicate that even after splitting the predicted labels, the model still surpassed the existing methods, achieving an F1 score of 97.78% for the ATE task and an accuracy of 98.34% for the ASC task.

Although the unified tagging schema solves the error-propagation problem, it suffers from a large search space and requires a model capable of handling it. For future work, we plan to explore other subtasks of ABSA, namely aspect category detection and aspect category sentiment classification. We may also employ the triplet extraction technique, which extracts the target, the opinionated word, and their corresponding sentiment polarity in a single model. Additionally, other deep learning techniques, embeddings, and datasets could be evaluated with the unified approach to assess its impact on the task of ABSA.
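To make the unified schema and the label-splitting step concrete, the sketch below shows one plausible encoding, where each token receives a single tag that jointly carries the ATE boundary (B/I/O) and the sentiment polarity (POS/NEG/NEU). The tag strings and the `split_unified` helper are illustrative assumptions, not the paper's exact label set.

```python
# Sketch of a unified E2E-ABSA tagging schema and the post-processing
# that splits joint predictions back into ATE and ASC label sequences.
# Tag strings ("B-POS", "I-NEG", ...) are hypothetical placeholders.

def split_unified(unified_tags):
    """Split unified tags into (ATE labels, ASC labels)."""
    ate, asc = [], []
    for tag in unified_tags:
        if tag == "O":                           # not part of any aspect term
            ate.append("O")
            asc.append("O")
        else:
            boundary, polarity = tag.split("-")  # "B-POS" -> ("B", "POS")
            ate.append(boundary)                 # keep B/I for aspect extraction
            asc.append(polarity)                 # keep POS/NEG/NEU for sentiment
    return ate, asc

# Token-level predictions for "the room was clean but the staff was rude"
pred = ["O", "B-POS", "O", "O", "O", "O", "B-NEG", "O", "O"]
ate_labels, asc_labels = split_unified(pred)
print(ate_labels)  # ['O', 'B', 'O', 'O', 'O', 'O', 'B', 'O', 'O']
print(asc_labels)  # ['O', 'POS', 'O', 'O', 'O', 'O', 'NEG', 'O', 'O']
```

Splitting the joint predictions in this way is what allows them to be scored against single-task ATE and ASC baselines for a fair comparison.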
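On the AraBERT-CRF versus AraBERT-Softmax distinction: a per-token softmax assigns tags independently, whereas a CRF layer scores whole tag sequences and can rule out transitions that are structurally invalid in a unified schema. The sketch below, with illustrative tag names of our own choosing, encodes one such constraint and also shows the enlarged label space the conclusion refers to.

```python
# Why a CRF head suits a unified E2E tag set better than an independent
# per-token softmax: the schema has hard transition rules that a softmax
# cannot enforce. Tag names below are illustrative, not the paper's.

POLARITIES = ("POS", "NEG", "NEU")
TAGS = ["O"] + [f"{b}-{p}" for b in ("B", "I") for p in POLARITIES]
# 7 unified tags vs. 3 for plain BIO -- the larger search space noted
# in the conclusion, which grows with every added polarity class.

def is_valid_transition(prev, curr):
    """An I-<polarity> tag may only continue an aspect of the SAME
    polarity; 'O' and any 'B-<polarity>' may follow anything."""
    if curr.startswith("I-"):
        return prev in (f"B-{curr[2:]}", f"I-{curr[2:]}")
    return True

def is_valid_sequence(tags):
    # Prepend a virtual 'O' so a sequence may not begin with an I- tag.
    return all(is_valid_transition(p, c) for p, c in zip(["O"] + tags, tags))

print(is_valid_sequence(["B-POS", "I-POS", "O"]))  # True
print(is_valid_sequence(["B-POS", "I-NEG"]))       # False: polarity switch mid-aspect
```

A CRF realizes such rules through its transition scores, learned during training or fixed to minus infinity before Viterbi decoding, which is one plausible reason a CRF head can outperform a softmax head on this tag set.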
VOLUME 11, 2023

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
ACKNOWLEDGMENT
This work was supported by the Department of Computer Science, Faculty of Computers and Information, Mansoura University, Egypt.

REFERENCES
[1] R. Obiedat, D. Al-Darras, E. Alzaghoul and O. Harfoushi, “Arabic aspect-based sentiment analysis: A systematic literature review,” IEEE Access, vol. 9, pp. 152628–152645, 2021.
[2] M. Al-Smadi, O. Qwasmeh, B. Talafha, M. Al-Ayyoub, Y. Jararweh and E. Benkhelifa, “An enhanced framework for aspect-based sentiment analysis of hotels' reviews: Arabic reviews case study,” in 2016 11th International Conference for Internet Technology and Secured Transactions (ICITST), IEEE, 2016, pp. 98–103.
[3] M. M. Abdelgwad, T. H. A. Soliman, A. I. Taloba and M. F. Farghaly, “Arabic aspect based sentiment analysis using bidirectional GRU based models,” Journal of King Saud University - Computer and Information Sciences, vol. 34, pp. 6652–6662, 2022.
[4] M. Al-Smadi, O. Qawasmeh, M. Al-Ayyoub, Y. Jararweh and B. Gupta, “Deep recurrent neural network vs. support vector machine for aspect-based sentiment analysis of Arabic hotels’ reviews,” Journal of Computational Science, vol. 27, pp. 386–393, 2018.
[5] M. Al-Smadi, B. Talafha, M. Al-Ayyoub and Y. Jararweh, “Using long short-term memory deep neural networks for aspect-based sentiment analysis of Arabic reviews,” International Journal of Machine Learning and Cybernetics, vol. 10, pp. 2163–2175, 2019.
[6] A.-S. Mohammad, M. M. Hammad, A. Sa’ad, A.-T. Saja and E. Cambria, “Gated Recurrent Unit with Multilingual Universal Sentence Encoder for Arabic aspect-based sentiment analysis,” Knowledge-Based Systems, p. 107540, 2021.
[7] A. Mohammad, M. Al-Ayyoub, H. N. Al-Sarhan and Y. Jararweh, “An aspect-based sentiment analysis approach to evaluating Arabic news affect on readers,” Journal of Universal Computer Science, vol. 22, pp. 630–649, 2016.
[8] S. Al-Dabet, S. Tedmori and M. Al-Smadi, “Extracting opinion targets using attention-based neural model,” SN Computer Science, vol. 1, pp. 1–10, 2020.
[9] R. Bensoltane and T. Zaki, “Towards Arabic aspect-based sentiment analysis: a transfer learning-based approach,” Social Network Analysis and Mining, vol. 12, pp. 1–16, 2022.
[10] A. S. Fadel, M. E. Saleh and O. A. Abulnaja, “Arabic aspect extraction based on stacked contextualized embedding with deep learning,” IEEE Access, vol. 10, pp. 30526–30535, 2022.
[11] M. Abdelgwad, “Arabic aspect based sentiment classification using BERT,” arXiv preprint arXiv:2107.13290, 2021.
[12] S. Al-Dabet, S. Tedmori and A.-S. Mohammad, “Enhancing Arabic aspect-based sentiment analysis using deep learning models,” Computer Speech & Language, vol. 69, p. 101224, 2021.
[13] M. E. Chennafi, H. Bedlaoui, A. Dahou and M. A. Al-qaness, “Arabic aspect-based sentiment classification using Seq2Seq dialect normalization and transformers,” Knowledge, vol. 2, pp. 388–401, 2022.
[14] S. Ruder, P. Ghaffari and J. G. Breslin, “INSIGHT-1 at SemEval-2016 Task 5: Deep learning for multilingual aspect-based sentiment analysis,” arXiv preprint arXiv:1609.02748, 2016.
[15] X. Li, L. Bing, P. Li and W. Lam, “A unified model for opinion target extraction and target sentiment prediction,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
[16] B. Carpenter, “Coding chunkers as taggers: IO, BIO, BMEWO, and BMEWO+,” LingPipe Blog, 2009.
[17] H. Ding, S. Huang, W. Jin, Y. Shan and H. Yu, “A novel cascade model for end-to-end aspect-based social comment sentiment analysis,” Electronics, vol. 11, p. 1810, 2022.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[19] M. F. Abdelfattah, M. W. Fakhr and M. A. Rizka, “ArSentBERT: fine-tuned bidirectional encoder representations from transformers model for Arabic sentiment classification,” Bulletin of Electrical Engineering and Informatics, vol. 12, pp. 1196–1202, 2023.
[20] R. Bensoltane and T. Zaki, “Combining BERT with TCN-BiGRU for enhancing Arabic aspect category detection,” Journal of Intelligent & Fuzzy Systems, pp. 1–14, 2023.
[21] W. Antoun, F. Baly and H. Hajj, “AraBERT: Transformer-based model for Arabic language understanding,” arXiv preprint arXiv:2003.00104, 2020.
[22] J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[23] J. Lafferty, A. McCallum and F. C. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” 2001.
[24] J. S. Bridle, “Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition,” in Neurocomputing: Algorithms, Architectures and Applications, Springer, 1990, pp. 227–236.
[25] M. Al-Ayyoub, H. Al-Sarhan, M. Al-So'ud, M. Al-Smadi and Y. Jararweh, “Framework for affective news analysis of Arabic news: 2014 Gaza attacks case study,” J. Univers. Comput. Sci., vol. 23, pp. 327–352, 2016.
[26] M. Al-Smadi, M. Al-Ayyoub, Y. Jararweh and O. Qawasmeh, “Enhancing aspect-based sentiment analysis of Arabic hotels’ reviews using morphological, syntactic and semantic features,” Information Processing & Management, vol. 56, pp. 308–319, 2019.
[27] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado and J. Dean, “Distributed representations of words and phrases and their compositionality,” Advances in Neural Information Processing Systems, vol. 26, 2013.
[28] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, pp. 1735–1780, 1997.
[29] A. Joulin, E. Grave, P. Bojanowski and T. Mikolov, “Bag of tricks for efficient text classification,” arXiv preprint arXiv:1607.01759, 2016.
[30] A. Safaya, M. Abdullatif and D. Yuret, “KUISAIL at SemEval-2020 Task 12: BERT-CNN for offensive speech identification in social media,” in Proceedings of the Fourteenth Workshop on Semantic Evaluation, 2020, pp. 2054–2059.
[31] Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Constant, G. H. Abrego, S. Yuan, C. Tar and Y.-H. Sung, “Multilingual universal sentence encoder for semantic retrieval,” arXiv preprint arXiv:1907.04307, 2019.
[32] A. B. Soliman, K. Eissa and S. R. El-Beltagy, “AraVec: A set of Arabic word embedding models for use in Arabic NLP,” Procedia Computer Science, vol. 117, pp. 256–265, 2017.
[33] M. Al-Smadi, O. Qawasmeh, B. Talafha and M. Quwaider, “Human annotated Arabic dataset of book reviews for aspect based sentiment analysis,” in 2015 3rd International Conference on Future Internet of Things and Cloud, IEEE, 2015, pp. 726–730.
[34] X. Wang, G. Xu, Z. Zhang, L. Jin and X. Sun, “End-to-end aspect-based sentiment analysis with hierarchical multi-task learning,” Neurocomputing, vol. 455, pp. 178–188, 2021.
[35] X. Li, L. Bing, W. Zhang and W. Lam, “Exploiting BERT for end-to-end aspect-based sentiment analysis,” arXiv preprint arXiv:1910.00883, 2019.
[36] B. Xu, X. Wang, B. Yang and Z. Kang, “Target embedding and position attention with LSTM for aspect based sentiment analysis,” in Proceedings of the 2020 5th International Conference on Mathematics and Artificial Intelligence, 2020, pp. 93–97.
[37] J. Pennington, R. Socher and C. D. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[38] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in ICML, 2010.
[39] J. Terven, D. M. Cordova-Esparza, A. Ramirez-Pedraza and E. A. Chavez-Urbiola, “Loss functions and metrics in deep learning: A review,” arXiv preprint arXiv:2307.02694, 2023.
[40] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami and C. Dyer, “Neural architectures for named entity recognition,” arXiv preprint arXiv:1603.01360, 2016.
[41] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[42] M. Stone, “Cross-validatory choice and assessment of statistical predictions,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 36, pp. 111–133, 1974.
[43] M. Hossin and M. N. Sulaiman, “A review on evaluation metrics for data classification evaluations,” International Journal of Data Mining & Knowledge Management Process, vol. 5, p. 1, 2015.
[44] T. Saito and M. Rehmsmeier, “The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets,” PLoS ONE, vol. 10, p. e0118432, 2015.
[45] M. Pontiki, D. Galanis, H. Papageorgiou, I. Androutsopoulos, S. Manandhar, M. Al-Smadi, M. Al-Ayyoub, Y. Zhao, B. Qin, O. De Clercq and others, “SemEval-2016 Task 5: Aspect based sentiment analysis,” in Proceedings of the Workshop on Semantic Evaluation (SemEval-2016), Association for Computational Linguistics, 2016, pp. 19–30.

GHADA M. SHAFIQ received the B.Sc. degree from the Computer Science Department, Faculty of Computer and Information Sciences, Mansoura University, Mansoura, Egypt, in 2018. Since 2018, she has been a Demonstrator with the Department of Computer Science, Faculty of Computer and Information Sciences, Mansoura University. Her research interests include natural language processing, artificial intelligence, and language models.

MOHAMMED F. ALRAHMAWY received the B.Eng. degree in Electronics Engineering from Mansoura University, Egypt, in 1997, and the M.Sc. degree in Automatic Control Engineering from Mansoura University in 2001. In 2005, he joined the real-time systems research group at the Computer Science Department, The University of York, UK, as a Ph.D. research student, receiving his Ph.D. degree in Computer Science in 2011. In 2011, he joined the Department of Computer Science, Mansoura University, as a Lecturer, and in January 2023 he became a Professor of Computer Science in the same department. His current research interests include deep learning, network and graph analytics, real-time systems and languages, NLP, cloud computing, distributed and parallel computing, image processing, computer vision, IoT, and big data. He was the recipient of the best M.Sc. thesis award from Mansoura University in 2003. His Ph.D. was fully funded by the Egyptian Ministry of Higher Education. Since January 2022, he has been the acting head of the Computer Science Department at Mansoura University.

REEM EL-DEEB was born in El-Mahalla El-Kobra, Egypt, in 1987. She received the B.S. degree in Computer Science from the Faculty of Computers and Information, Mansoura University, Egypt, in 2008, and the M.Sc. and Ph.D. degrees in Computer Science from the same faculty in 2012 and 2019, respectively. In 2009, she joined the Computer Science Department, Mansoura University, as a Teaching Assistant, becoming an Assistant Lecturer in 2012 and an Assistant Professor in 2019. Her current research interests include natural language processing, artificial intelligence applications, and machine learning for text semantic analysis and language understanding. Dr. El-Deeb was the recipient of the Scientific Publishing Grant award (2019).