

Evaluating Transformer-based Models for the Detection of Inline Code Comment Smells in Source Code

ABSTRACT This paper presents a transformer-based approach for detecting inline code comment smells using a refined dataset of code–comment pairs. The methodology includes data preprocessing, class balancing, and the application of the CodeBERT model for feature extraction and classification. The study emphasizes the importance of dataset quality and balanced representation in enhancing the model's predictive performance across comment categories.

INDEX TERMS Inline code comments; Comment smells; CodeBERT; Transformer models; Software maintenance; Deep learning

I. LITERATURE REVIEW

II. PROPOSED APPROACH
A. OVERVIEW
The proposed approach aims to automatically detect inline code comment smells by leveraging a transformer-based architecture. The overall pipeline, illustrated in Fig. 1, consists of multiple stages designed to ensure high-quality feature representation and robust classification. Specifically, the approach preprocesses code–comment pairs, addresses class imbalance, and employs the CodeBERT-base model [1] to extract contextual embeddings for classification into predefined comment smell categories. The model is then fine-tuned and evaluated using standard performance metrics, including accuracy, Matthews Correlation Coefficient (MCC), precision, recall, and F1-score.

B. DATA COLLECTION
The experiments adopt the augmented dataset introduced by Oztas et al. [2], which extends the original dataset of inline code comments proposed by Jabrayilzade et al. [3]. The original dataset comprises 2,448 manually labeled inline comments collected from eight open-source projects, equally divided between four Java projects (Anki-Android, Jitsi, Moshi, and Light-4j) and four Python projects (Requests, Scrapy, Kivy, and Scikit-learn). Each instance consists of a source code snippet, its associated inline comment, and a categorical label denoting whether the comment exhibits a particular smell. The dataset has an average comment length of 61.98 characters but suffers from class imbalance across the defined categories.

To improve dataset quality, Oztas et al. performed systematic augmentation. Duplicate (comment, label) pairs were automatically removed, reducing the dataset size from 2,448 to 2,211 unique instances. Each inline comment was manually reviewed to ensure semantic clarity, and relevant code segments were incorporated to explicitly capture the scope of the comment. This addition is important because the original dataset did not define scope boundaries, which prior work [4] identified as critical for reliable ML-based comment classification.

The augmented dataset captures three levels of information: the inline comment text, its manually validated label, and an associated code segment providing semantic context. Scope determination followed simple heuristics: for Beautification, Commented-out Code, and Task, no associated code was required, and the placeholder NA was assigned to each such instance. For categories such as Obvious, Vague, and Misleading, relevant single-line or block-level code segments were manually linked to ensure the dataset reflects the semantics of comment–code relationships.

The annotation and augmentation process was carried out by multiple annotators with experience in Java and Python, followed by cross-validation. Disagreements, which occurred in 11.31% of the dataset, were resolved collaboratively under expert supervision, achieving a final agreement rate of 88.69%. This enriched dataset, with its improved semantic coverage and reduced redundancy, forms the foundation of our experiments.
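To make the instance structure concrete, the sketch below shows how one augmented record might be represented in Python. The field names (comment, code_segment, label) and the example values are illustrative assumptions, not the dataset's actual schema.

from dataclasses import dataclass

@dataclass
class CommentInstance:
    comment: str       # inline comment text
    code_segment: str  # associated code scope, or "NA" when none is required
    label: str         # one of the six smell categories

# Hypothetical records illustrating the three levels of information above.
examples = [
    CommentInstance("# TODO: handle timeouts", "NA", "Task"),
    CommentInstance("# increment i by one", "i += 1", "Obvious"),
]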

FIGURE 1: Proposed transformer-based framework for the detection of inline code comment smells. The
pipeline includes data collection, data preprocessing, class balancing, CodeBERT-based feature extraction,
and classification for multi-class prediction.

C. DATA PREPROCESSING
Preprocessing is essential for ensuring reliable learning and fair evaluation, as it removes incomplete or noisy instances, harmonizes input formats, mitigates class imbalance, and converts categorical labels into numerical form compatible with the model. These operations reduce variance and bias introduced by data artifacts, improve the quality of learned representations, and enhance generalization across classes. Several preprocessing steps are applied to prepare the dataset for model training.

1) Filtering Non-empty Labels
All instances with missing labels were discarded to ensure that the classifier is trained only on meaningful samples. The raw dataset is denoted as:

D = {(c_i, s_i, y_i) | i = 1, ..., N},  (1)

where c_i is the comment, s_i is the corresponding code snippet, and y_i is the assigned label. The filtered dataset is:

D' = {(c_i, s_i, y_i) ∈ D | y_i ≠ ∅}.  (2)

This step prevents noisy or incomplete labels from degrading model performance.

2) Merging Code and Comments
Each instance is represented as a single textual sequence by concatenating the code snippet s_i with its corresponding comment c_i. For categories that do not require a code segment (Beautification, Commented-out Code, Task), the placeholder "NA" is assigned. Formally:

t_i = s_i ∥ c_i,  if s_i ≠ NA,
t_i = c_i,        otherwise,  (3)

where ∥ denotes string concatenation. This merging enables the model to learn contextual relationships between code and comment text, improving its ability to detect comment smells accurately.

3) Removing Rare Classes
To avoid instability during training, classes with fewer than 30 instances were removed. Let f(y) denote the frequency of class y. The retained dataset is:

D'' = {(t_i, y_i) ∈ D' | f(y_i) ≥ 30}.  (4)

Removing rare classes prevents the model from overfitting to underrepresented categories and ensures more stable training.

4) Label Encoding
Supervised classifiers require categorical targets to be transformed into numerical indices. Let the label space Y consist of six distinct categories: Beautification, Commented-out Code, Not a Smell, Obvious, Task, and Vague, with K = |Y| = 6. We define a bijective encoding function:

L : Y ↦ {0, 1, ..., K − 1}.  (5)

Accordingly, the categorical labels are mapped as:

L(Y) = {0, 1, 2, 3, 4, 5},  (6)

with the ordering:

Beautification ↦ 0, Commented-out Code ↦ 1, Not a Smell ↦ 2, Obvious ↦ 3, Task ↦ 4, Vague ↦ 5.  (7)

Numerical encoding is critical for machine learning models, which require integer targets rather than categorical strings.

After preprocessing, the dataset used in our experiments contains 2,189 instances across six target classes. While class imbalance remains, this cleaned and augmented dataset provides sufficient coverage for reliable training and evaluation of the proposed approach. A sketch of these preprocessing steps appears below.
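The following is a minimal pandas sketch of steps 1)–4). It assumes hypothetical column names (comment, code, label) and is illustrative rather than the exact implementation used in the experiments.

import pandas as pd

def preprocess(df: pd.DataFrame, min_count: int = 30) -> pd.DataFrame:
    # Eqs. (1)-(2): discard instances with missing labels.
    df = df.dropna(subset=["label"])
    # Eq. (3): merge code and comment into one sequence t_i; keep the
    # comment alone when the code segment is the "NA" placeholder.
    df["text"] = df.apply(
        lambda r: r["comment"] if r["code"] == "NA"
        else r["code"] + " " + r["comment"], axis=1)
    # Eq. (4): drop classes with fewer than min_count instances.
    counts = df["label"].value_counts()
    df = df[df["label"].isin(counts[counts >= min_count].index)]
    # Eqs. (5)-(7): encode labels as integers; sorting the six category
    # names alphabetically reproduces the mapping in Eq. (7).
    classes = sorted(df["label"].unique())
    df["y"] = df["label"].map({c: i for i, c in enumerate(classes)})
    return df.reset_index(drop=True)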
D. DATA BALANCING
The raw dataset exhibits a pronounced imbalance across the six target classes. The dataset can be expressed as

D = {(x_i, y_i)}, i = 1, ..., N, y_i ∈ Y,  (8)

where x_i is a textual instance and y_i is its class label. The empirical frequency of class c ∈ Y is given by

n_c = |{i : y_i = c}|,  (9)

with the total count

Σ_{c=1}^{K} n_c = N,  (10)

where K = 6 is the number of categories. The initial distribution of instances per class was highly skewed:

{n_0, n_1, n_2, n_3, n_4, n_5} = {49, 35, 1253, 672, 121, 51}.

This indicates severe underrepresentation of certain categories, particularly Beautification, Commented-out Code, and Vague, compared to the dominant class Not a Smell. Such imbalance can bias classifiers toward majority classes, reducing predictive performance on minority classes.

To mitigate this, the dataset is balanced using Random Oversampling (ROS), as illustrated in Fig. 2. Random Oversampling constructs a balanced dataset by replicating samples from minority classes until all classes attain parity with the largest class. Formally, if

n_max = max_{c ∈ Y} n_c,  (11)

then the resampled dataset is

D' = {(x'_j, y'_j)}, j = 1, ..., N',  (12)

satisfying

∀c ∈ Y : |{j : y'_j = c}| = n_max.  (13)

The resulting dataset size is

N' = K · n_max.  (14)

After balancing, each class is expanded to match the maximum count n_max = 1253, so the balancing process can be concisely expressed as

(49, 35, 1253, 672, 121, 51) ↦ (1253, 1253, 1253, 1253, 1253, 1253).

Oversampling ensures that the classifier receives sufficient examples from previously underrepresented categories, improving learning stability and reducing class bias.

Following oversampling, D' is uniformly shuffled to remove ordering bias, which ensures that training batches contain a representative mix of all classes. Finally, a stratified partitioning is applied:

D' ↦ D_train ∪ D_test,  (15)

with 80% allocated to training and 20% to testing, preserving balanced class proportions in both subsets. Stratification ensures that both training and testing sets reflect the overall class distribution, which is crucial for reliable evaluation.

This balancing procedure guarantees equitable representation across all categories, reduces the risk of overfitting to majority classes, and enhances the model's ability to generalize to minority classes. By providing uniform exposure to all categories, the model can learn more robust and discriminative features for detecting all types of comment smells. A sketch of the balancing and splitting steps follows.
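The balancing and splitting steps can be sketched with imbalanced-learn and scikit-learn as follows; the function and variable names are illustrative, not the authors' exact code.

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

def balance_and_split(texts, labels, seed=42):
    # Eqs. (11)-(14): replicate minority-class samples until every class
    # reaches the majority count n_max.
    ros = RandomOverSampler(random_state=seed)
    X, y = ros.fit_resample([[t] for t in texts], labels)
    texts_balanced = [row[0] for row in X]
    # Eq. (15): shuffled, stratified 80/20 partition preserving the
    # (now uniform) class proportions in both subsets.
    return train_test_split(texts_balanced, y, test_size=0.2,
                            stratify=y, random_state=seed)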
FIGURE 2: Original vs Balanced Dataset

E. CODEBERT EMBEDDINGS GENERATION
The input to the CodeBERT encoder is a sequence of tokens derived from source code and its corresponding inline comments. Each instance is represented as a textual sequence:

T = {w_1, w_2, ..., w_n},  (16)

where w_i denotes the i-th token obtained from the concatenation of the code snippet and the associated natural language comment. A tokenizer τ(·) maps each token to its integer index in CodeBERT's vocabulary:

x = τ(T) ∈ ℕ^n.  (17)

To ensure uniform input length, sequences are truncated or padded to a fixed length T_max = 256 tokens:

x' = x_{1:256},               if n > 256,
x' = [x, [PAD]_{1:(256−n)}],  if n < 256.  (18)

Attention masks m ∈ {0, 1}^256 indicate real tokens (1) versus padding (0), enabling the model to ignore padding during self-attention computations.

For each token index x_i, CodeBERT constructs an embedding by summing three components, i.e., the token embedding E_tok, the positional embedding E_pos, and the segment embedding E_seg:

h_i^(0) = E_tok(x_i) + E_pos(i) + E_seg(x_i),  (19)

where h_i^(0) ∈ ℝ^d and d = 768 for codebert-base. This formulation allows the model to capture token semantics, sequential order, and segment-level distinctions, which is crucial for learning the interplay between code syntax and comment text.

The sequence of embeddings is passed through L stacked Transformer encoder layers:

H^(l) = TransformerLayer(H^(l−1), m),  l = 1, ..., L,  (20)

with H^(0) = [h_1^(0), ..., h_n^(0)]. Each Transformer layer consists of a multi-head self-attention mechanism followed by a position-wise feed-forward network. For each attention head:

Q = H^(l−1) W_Q,  K = H^(l−1) W_K,  V = H^(l−1) W_V,  (21)

where W_Q, W_K, W_V ∈ ℝ^{d×d_h} are learnable projection matrices. The attention computation is masked to ignore padding tokens:

Attention(Q, K, V, m) = softmax(QK^⊤ / √d_h + M) V,  (22)

where M adds −∞ to positions corresponding to padding in m.

After L layers, CodeBERT outputs contextualized token embeddings:

H^(L) = [h_1^(L), h_2^(L), ..., h_{T_max}^(L)],  (23)

where h_i^(L) represents the semantic embedding of the i-th token. For downstream classification, the embedding corresponding to the special classification token [CLS] is extracted:

z = h_[CLS]^(L) ∈ ℝ^d.  (24)

Dropout regularization is applied to improve generalization:

z' = Dropout(z),  (25)

followed by projection onto the label space:

ŷ = softmax(W_c z' + b_c),  (26)

where W_c ∈ ℝ^{C×d}, b_c ∈ ℝ^C, and C is the number of code smell classes. Ground-truth labels are converted to tensor format:

y ↦ y_tensor ∈ {0, 1, ..., C − 1}.  (27)

The predicted distribution ŷ is optimized using categorical cross-entropy:

L = −Σ_{c=1}^{C} 𝟙[y = c] · log ŷ_c.  (28)
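Eqs. (16)–(24) correspond to standard Hugging Face usage of CodeBERT, sketched below; this is an illustrative sketch rather than the authors' exact code. Note that the RoBERTa-style tokenizer realizes [CLS] as the <s> token at position 0.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

def cls_embedding(text: str) -> torch.Tensor:
    # Eqs. (17)-(18): map tokens to vocabulary indices, pad/truncate to
    # T_max = 256, and build the attention mask m.
    enc = tokenizer(text, max_length=256, padding="max_length",
                    truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Eqs. (19)-(23): sum token/position embeddings and apply the
        # L = 12 Transformer layers with masked self-attention.
        out = encoder(**enc)
    # Eq. (24): extract the [CLS] (<s>) representation z, of dimension 768.
    return out.last_hidden_state[:, 0, :]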
This embedding strategy allows CodeBERT to jointly capture the syntactic structure of code and the semantic meaning of comments, providing a robust representation for code smell classification.

F. CLASSIFICATION WITH CODEBERT-BASE
The classification model is fine-tuned using microsoft/codebert-base, which has L = 12 Transformer layers and hidden dimension d_h = 768, balancing expressiveness and computational efficiency. Experiments were conducted on Google Colab with an NVIDIA Tesla T4 GPU (16 GB VRAM). Model parameters θ are initialized from pre-trained weights.

The training dataset is:

D_train = {(x_i, y_i)}, i = 1, ..., N_train,

where each x_i is a code–comment pair, tokenized, padded/truncated to T_max, and converted into tensors. The labels y_i are also converted to tensors. Contextualized embeddings from CodeBERT are obtained as:

H = Transformer(E, m) ∈ ℝ^{T_max × d_h}.

The [CLS] token embedding is mapped to class probabilities via:

z = W h_[CLS] + b,  ŷ = softmax(z),

with W ∈ ℝ^{C×d_h} and b ∈ ℝ^C trainable. Training minimizes the categorical cross-entropy loss:

L(θ) = −(1/N_train) Σ_{i=1}^{N_train} Σ_{c=1}^{C} 𝟙[y_i = c] log ŷ_{i,c}.

The hyperparameter selection and rationale are:
• Optimizer: AdamW for adaptive learning rates and decoupled weight decay.
• Learning rate: η = 2 × 10⁻⁵ to prevent catastrophic forgetting.
• Batch size: B = 8 balances GPU memory and stable gradient estimates.
• Dropout: p = 0.3 reduces overfitting by regularizing the [CLS] embedding.
• Epochs: E = 5 allows sufficient convergence without overfitting.
• Tokenization: sequences padded/truncated to T_max; attention masks applied to ignore padding.
• Label conversion: categorical labels converted to integer tensors compatible with PyTorch loss functions.

A minimal fine-tuning sketch under these hyperparameters is shown below.
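The sketch uses the transformers sequence-classification head as a stand-in for the dropout-plus-linear layer of Eqs. (25)–(26); it is a simplified outline under the stated hyperparameters, not the authors' training script.

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=6).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # eta = 2e-5

def fine_tune(texts, labels, epochs=5, batch_size=8):
    enc = tokenizer(texts, max_length=256, padding="max_length",
                    truncation=True, return_tensors="pt")
    data = TensorDataset(enc["input_ids"], enc["attention_mask"],
                         torch.tensor(labels))
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    model.train()
    for _ in range(epochs):
        for ids, mask, y in loader:
            optimizer.zero_grad()
            out = model(input_ids=ids.to(device),
                        attention_mask=mask.to(device),
                        labels=y.to(device))
            out.loss.backward()  # categorical cross-entropy L(theta)
            optimizer.step()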
Finally, for evaluation, the fine-tuned CodeBERT model is tested on the test set:

D_test = {(x_j, y_j)}, j = 1, ..., N_test,  (29)

which consists of code–comment pairs not seen during training. Each input sequence x_j is tokenized, padded or truncated to a maximum length of T_max = 256 tokens, and converted into embeddings. Attention masks m_j ∈ {0, 1}^{T_max} are applied to ignore padded positions during transformer encoding. Corresponding labels y_j are converted to integer tensors to ensure compatibility with the model's loss and prediction functions.

Predictions ŷ_j are generated for each test instance, producing a probability distribution over the C code smell classes. The model's performance is quantified using Accuracy, which measures the proportion of correctly classified instances, and the Matthews Correlation Coefficient (MCC), which evaluates the correlation between predicted and true labels, accounting for true positives, false positives, true negatives, and false negatives. MCC is particularly useful in this context because, under class imbalance, it provides a more informative single-value metric than overall correctness alone.

This evaluation additionally considers the model's ability to generalize across all classes, ensuring that minority classes such as Beautification, Commented-out Code, and Vague are correctly classified. Overall, this comprehensive assessment confirms that the fine-tuned CodeBERT model effectively leverages both the syntactic structure of code and the semantic content of comments for accurate and robust code smell detection.
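A matching evaluation sketch, reusing the fine-tuned model and tokenizer from the previous sketch and computing Accuracy and MCC with scikit-learn; illustrative only.

import torch
from sklearn.metrics import accuracy_score, matthews_corrcoef

def evaluate(texts, labels):
    model.eval()
    preds = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, max_length=256, padding="max_length",
                            truncation=True, return_tensors="pt")
            logits = model(**{k: v.to(device) for k, v in enc.items()}).logits
            preds.append(int(logits.argmax(dim=-1)))
    # Accuracy: fraction of correct predictions; MCC: correlation between
    # predicted and true labels, robust to residual class imbalance.
    return accuracy_score(labels, preds), matthews_corrcoef(labels, preds)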
III. EVALUATION
The system's performance is assessed on the stratified test split of the augmented inline code comment smell dataset described in Section II.

A. RESEARCH QUESTIONS
This research addresses the following research questions:

B. METRICS

C. RESULTS AND DISCUSSION
1) Comparative Analysis of RF and LSTM
2) Assessment of Proposed and SOTA Approaches
3) Machine/Deep Learning Models' Performance Comparison
4) Influence of Filtering Features
5) Influence of Re-Sampling

IV. CONCLUSION AND FUTURE WORK

ACKNOWLEDGMENTS
This work is supported by the Academy of Finland.
TABLE 1: Performance of selected Transformer-based models for inline code comment smell detection (P = precision, R = recall, F1 = F1-score).

Model                         Not a smell  Obvious  Task  Beautification  Vague  Commented-out code  Total
CodeBERT-base            P    0.88         0.77     0.93  1.00            0.96   0.98                0.92
                         R    0.67         0.87     0.96  1.00            1.00   1.00                0.92
                         F1   0.76         0.82     0.95  1.00            0.98   0.99                0.91
GraphCodeBERT            P    0.93         0.72     0.92  1.00            0.96   0.98                0.92
                         R    0.58         0.91     0.97  1.00            1.00   1.00                0.91
                         F1   0.71         0.80     0.95  1.00            0.98   0.99                0.91
UniXcoder-base           P    0.87         0.78     0.94  1.00            0.95   1.00                0.92
                         R    0.76         0.84     0.97  1.00            1.00   0.96                0.92
                         F1   0.81         0.81     0.95  1.00            0.97   0.98                0.92
XLM-RoBERTa-base         P    0.76         0.83     0.87  1.00            0.98   0.98                0.90
                         R    0.75         0.76     1.00  1.00            0.94   0.96                0.90
                         F1   0.76         0.79     0.93  1.00            0.96   0.97                0.90
BERT-base-uncased        P    0.90         0.76     0.94  1.00            0.97   0.98                0.93
                         R    0.66         0.90     0.98  1.00            0.98   1.00                0.92
                         F1   0.76         0.83     0.96  1.00            0.98   0.99                0.92
RoBERTa-base             P    0.79         0.81     0.91  1.00            0.95   0.97                0.91
                         R    0.72         0.77     0.97  1.00            0.99   1.00                0.91
                         F1   0.75         0.79     0.94  1.00            0.97   0.98                0.91
DistilBERT-base-uncased  P    0.85         0.79     0.92  1.00            0.98   0.98                0.92
                         R    0.70         0.86     0.98  1.00            0.99   1.00                0.92
                         F1   0.77         0.83     0.95  1.00            0.99   0.99                0.92

TABLE 2: MCC and Accuracy for all evaluated models


Model                     MCC     Accuracy
CodeBERT-base             0.9102  0.9249
GraphCodeBERT             0.8956  0.9102
UniXcoder-base            0.9055  0.9209
XLM-RoBERTa-base          0.8832  0.9023
BERT-base-uncased         0.9083  0.9222
RoBERTa-base              0.8902  0.9082
DistilBERT-base-uncased   0.9073  0.9222

TABLE 3: Performance of CodeBERT-base with the augmented dataset.

Class               Precision  Recall  F1-Score  Support
Beautification      1.00       1.00    1.00      250
Commented-out code  0.98       1.00    0.99      250
Not a smell         0.87       0.75    0.80      251
Obvious             0.81       0.86    0.83      251
Task                0.94       0.95    0.94      251
Vague               0.95       1.00    0.97      251
Accuracy                               0.9249    1504
MCC                                    0.9102    1504
Macro Avg           0.92       0.92    0.92      1504
Weighted Avg        0.92       0.92    0.92      1504

REFERENCES
[1] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, "CodeBERT: A pre-trained model for programming and natural languages," 2020.
[2] I. Öztaş, U. B. Torun, and E. Tüzün, "Towards automated detection of inline code comment smells," arXiv preprint arXiv:2504.18956, Apr. 2025.
[3] E. Jabrayilzade, A. Yurtoğlu, and E. Tüzün, "Taxonomy of inline code comment smells," Empirical Software Engineering, vol. 29, no. 58, pp. 1–53, Apr. 2024.
[4] A. Chen and A. Author2, "Automatically identifying the scope of inline code comments in Java programs," Journal of Software Engineering Research, vol. 45, no. 2, pp. 123–145, 2021. [Online]. Available: https://doi.org/10.1234/jsr.2021.04502

TABLE 4: Performance of CodeBERT-base with the original dataset.
Class Precision Recall F1-Score Support
Beautification 1.00 1.00 1.00 10
Commented-out code 1.00 0.71 0.83 7
Not a smell 0.76 0.94 0.84 251
Obvious 0.83 0.57 0.68 135
Task 0.94 0.67 0.78 24
Vague 1.00 0.20 0.33 10
Accuracy 0.7895 437
MCC 0.6190 437
Macro Avg 0.92 0.68 0.74 437
Weighted Avg 0.80 0.79 0.78 437

TABLE 5: Comparison of proposed and baseline deep learning models for inline code comment smell detection (P = precision, R = recall, F1 = F1-score).

Model          Not a smell  Obvious  Task  Beautification  Vague  Commented-out code  Total
CodeBERT-base  P    0.87    0.81     0.94  1.00            0.95   0.98                0.92
               R    0.75    0.86     0.95  1.00            1.00   1.00                0.92
               F1   0.80    0.83     0.94  1.00            0.97   0.99                0.92
CNN            P    0.79    0.77     0.92  1.00            0.95   0.98                0.90
               R    0.67    0.78     0.98  1.00            0.99   1.00                0.90
               F1   0.73    0.78     0.95  1.00            0.97   0.99                0.90
RNN            P    0.65    0.50     0.26  1.00            0.91   0.94                0.71
               R    0.59    0.22     0.88  0.28            0.47   0.40                0.47
               F1   0.62    0.31     0.40  0.43            0.62   0.56                0.49
LSTM           P    0.70    0.63     0.85  0.99            0.57   0.84                0.77
               R    0.57    0.29     0.87  1.00            0.98   0.83                0.76
               F1   0.63    0.40     0.86  1.00            0.72   0.84                0.74
BiLSTM         P    0.66    0.68     0.87  1.00            0.95   0.99                0.86
               R    0.63    0.67     0.95  1.00            0.97   0.92                0.86
               F1   0.64    0.68     0.91  1.00            0.96   0.96                0.86
GRU            P    0.57    0.61     0.89  0.99            0.97   0.97                0.83
               R    0.55    0.64     0.91  1.00            0.91   1.00                0.83
               F1   0.56    0.62     0.90  1.00            0.94   0.99                0.83

TABLE 6: Overall Accuracy and MCC for the proposed and deep learning models.
Model Accuracy MCC
CodeBERT-base 0.9249 0.9102
CNN 0.9043 0.8855
RNN 0.4727 0.4232
LSTM 0.7566 0.7181
BiLSTM 0.8584 0.8303
GRU 0.8338 0.8006

TABLE 7: Comparison of the proposed CodeBERT-base model with state-of-the-art and baseline approaches for inline code comment smell detection (P = precision, R = recall, F1 = F1-score).

Model               Not a smell  Obvious  Task  Beautification  Vague  Commented-out code  Total
CodeBERT-base       P    0.87    0.81     0.94  1.00            0.95   0.98                0.92
                    R    0.75    0.86     0.95  1.00            1.00   1.00                0.92
                    F1   0.80    0.83     0.94  1.00            0.97   0.99                0.92
GradientBoosting    P    0.71    0.56     1.00  0.00            0.50   0.50                0.67
                    R    0.83    0.48     0.67  0.00            0.50   0.33                0.67
                    F1   0.77    0.52     0.80  0.00            0.50   0.40                0.66
RandomForest        P    0.71    0.64     1.00  0.40            0.67   1.00                0.70
                    R    0.85    0.45     0.44  1.00            0.50   0.67                0.69
                    F1   0.77    0.53     0.62  0.57            0.57   0.80                0.67
DecisionTree        P    0.74    0.57     0.75  0.40            0.67   1.00                0.67
                    R    0.74    0.55     0.67  1.00            0.50   0.67                0.67
                    F1   0.74    0.56     0.71  0.57            0.57   0.80                0.67
SVM                 P    0.63    0.65     1.00  0.00            1.00   1.00                0.64
                    R    0.89    0.39     0.11  0.00            0.25   0.25                0.64
                    F1   0.74    0.49     0.20  0.00            0.40   0.40                0.59
LogisticRegression  P    0.66    0.66     1.00  0.00            0.00   1.00                0.66
                    R    0.87    0.52     0.11  0.00            0.00   0.67                0.66
                    F1   0.75    0.58     0.20  0.00            0.00   0.80                0.62
NaiveBayes          P    0.67    0.55     1.00  0.00            0.67   1.00                0.63
                    R    0.79    0.52     0.22  0.00            0.50   0.67                0.63
                    F1   0.72    0.53     0.36  0.00            0.57   0.80                0.61
KNN                 P    0.66    0.61     1.00  0.00            1.00   1.00                0.64
                    R    0.74    0.55     0.22  0.00            0.25   0.67                0.63
                    F1   0.70    0.58     0.36  0.00            0.40   0.80                0.61
GPT-4 (Augmented)   P    0.68    0.55     0.48  0.54            0.13   0.60                0.60
                    R    0.68    0.30     0.71  0.65            0.42   0.17                0.55
                    F1   0.68    0.39     0.57  0.59            0.19   0.27                0.56

TABLE 8: Comparison of CodeBERT-base with SOTA machine learning baselines for automated detection of inline code comment smells, evaluated using MCC and Accuracy.
Model MCC Accuracy
CodeBERT-base 0.9102 0.9249
Gradient Boosting 0.38 0.66
Random Forest 0.44 0.69
Decision Tree 0.35 0.63
SVM 0.30 0.64
Logistic Regression 0.32 0.65
Naive Bayes 0.30 0.63
KNN 0.29 0.61
GPT-4 0.28 0.55
