Distinguishing Human Generated Text From
ChatGPT Generated Text Using Machine Learning
Niful Islam1 , Debopom Sutradhar1 , Humaira Noor1 ,
Jarin Tasnim Raya2 , Monowara Tabassum Maisha2 , Dewan Md Farid1
1 Department of CSE, United International University (UIU), Bangladesh
2 Department of CSE, University of Asia Pacific (UAP), Bangladesh
Email: {nislam201057, dsutradhar201046}@[Link], hnoor222007@[Link],
{20101002, 20101001}@[Link], dewanfarid@[Link]
Abstract—ChatGPT is a conversational artificial intelligence that is a member of the generative pre-trained transformer (GPT) family of large language models. This text generation model was fine-tuned with both supervised learning and reinforcement learning so that it can produce text documents that appear to be written by natural intelligence. Although this generative model offers numerous advantages, it also comes with some reasonable concerns. This paper presents a machine learning-based solution that can distinguish ChatGPT-generated text from human-written text, along with a comparative analysis of a total of 11 machine learning and deep learning algorithms in the classification process. We have tested the proposed model on a Kaggle dataset consisting of 10,000 texts, out of which 5,204 were written by humans and collected from news and social media. On the corpus generated by GPT-3.5, the proposed algorithm achieves an accuracy of 77%.

Index Terms—ChatGPT, Classification, Generative AI, NLP, Tokenization

I. INTRODUCTION

The emergence of generative AI models is rapidly changing the way we communicate. These models are used extensively in content creation, art and design, healthcare, and many other fields. Although they, especially conversational AI models like ChatGPT, have the potential to revolutionize society, they also come with some possible dangers. One of the biggest concerns is that they can produce false news or spread misinformation [1], [2]. Since AI-generated texts are almost identical to human-generated texts, such models can be used to manipulate individuals or organizations in various ways. There are also legal and ethical concerns about using generative AI: since these models are trained on large datasets, some bias may remain in a particular sector, and if an individual relies on them for decision making, this may lead to discriminatory attitudes. Furthermore, students may rely too heavily on AI-generated tools, which could damage their critical thinking and communication abilities, negatively impacting their academic and professional lives [3]. Newly developed problems require new solutions; as these models are trained on old data, asking them for solutions to advanced challenges might yield misleading answers [4]. In addition, incorporating conversational AI into a system could result in low user satisfaction. Therefore, a system is required for identifying human-generated text and AI-generated text.

Natural Language Processing (NLP) is a rapidly growing field of study that works on understanding human language. NLP gives machines the ability to learn human language by turning it into numerical data [5]. With the increasing number of digital texts, the need for NLP is growing rapidly [6]. In recent years, NLP has enabled large-scale analysis and management of text data, making sentiment analysis, emotion detection, and other complicated tasks possible. Furthermore, with the help of NLP, it is possible to detect mental illness at an early stage and provide treatment [7]. Previously, the training of NLP models was slow and inefficient [8]; the introduction of the transformer architecture in 2017 revolutionized the field. Transformers allow NLP computations to be carried out in parallel rather than sequentially, which gave birth to large language models like ChatGPT. These models are so good at imitating human behaviour that they create some reasonable concerns.
Transformers are one of the most powerful tools for natural language processing [9]. A transformer can largely be divided into two parts: the encoder and the decoder. The encoder takes a text sequence as input and produces a sequence of encoded representations; the difference from other architectures is that the encoded sequence is more context-aware. When a series of encoders is stacked together, the resulting architecture is called BERT [10]. The decoder, on the other hand, is able to generate a sequence of arbitrary length, and stacking decoder blocks produces an architecture named GPT. ChatGPT is also a transformer-based architecture that has been trained on a large set of public data in a self-supervised fashion. It has more than a billion parameters, making it one of the biggest language models available, and it has been considered a major breakthrough since its release in November 2022.
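As a minimal illustration of the encoder/decoder distinction described above, the following sketch stacks encoder blocks to obtain context-aware encodings (a BERT-style stack). PyTorch is used here purely for demonstration (the paper's own experiments use scikit-learn and TensorFlow), and all sizes are arbitrary illustrative choices:

```python
# Illustrative only: stacking transformer encoder blocks (BERT-style).
# A GPT-style model would instead stack decoder blocks with causal masking.
import torch
import torch.nn as nn

d_model = 64  # embedding width; arbitrary for this sketch
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)  # 6 stacked blocks

tokens = torch.randn(1, 10, d_model)   # one sequence of 10 embedded tokens
encoded = encoder(tokens)              # same length, but context-aware
print(encoded.shape)                   # torch.Size([1, 10, 64])
```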
In this paper, we present a machine learning-based approach for detecting ChatGPT-generated text versus human-generated text. The proposed model vectorizes sentences using a TF-IDF vectorizer and then classifies them with an extremely randomized trees classifier. This article also presents a comparative analysis of different machine learning algorithms, namely Logistic Regression, Support Vector Machine, Decision Tree, K-Nearest Neighbor, Random Forest, AdaBoost, Bagging Classifier, and Gradient Boosting, together with the deep learning algorithms Multi-layer Perceptron and Long Short-Term Memory, for detecting ChatGPT-generated text, along with the impact of some data pre-processing techniques on the classification process. To summarize, the article presents the following:
• A machine learning-based model for differentiating ChatGPT-generated text from human-generated text.
• A comparative analysis of different machine learning and deep learning algorithms in the classification process.
The article is organized as follows: Section II reviews prior work on the same problem, followed by our proposed method in Section III. Section IV holds the outcomes obtained from the study. The article concludes in Section V.
II. RELATED WORK

Multiple approaches have been proposed to detect AI-generated texts. Sebastian Gehrmann et al. [11] proposed a statistical method to distinguish machine-generated from human-generated text. The paper introduces a tool named GLTR, which is built on a 6-gram character-level statistical language model trained on a large corpus of text data. The tool uses this model to calculate the probability of each character in the generated text and then highlights any character that has a low probability of occurring in the training corpus. Anton Bakhtin et al. [12] proposed an energy-based model (EBM) to discriminate machine-generated text; an EBM is also a statistical model, one that learns an energy function from the given data. The authors used a comparatively larger dataset collected from human-to-machine conversations. Eric Mitchell et al. [13] proposed a zero-shot method called DetectGPT, which detects whether text is machine-written by examining the log probabilities computed by the model of interest. Another research team conducted a study on text samples generated by GPT-2 [14]. Atsumu Harada et al. [15] gathered two datasets, one with sentences produced by humans and the other with sentences written by both humans and machines. The cosine similarity between sentence pairs was then calculated as a measure of text consistency, and based on the cosine similarity ratings, machine learning methods classified the sentences as either human-written or human-and-machine-written. Sandra Mitrović et al. [16] proposed a transformer-based model to detect ChatGPT-generated texts. To determine whether a text was produced by ChatGPT or a person, the paper's authors developed a machine learning methodology based on a combination of text-based and user-based attributes, trained on a dataset of 10,000 text samples labeled as either human- or ChatGPT-generated. Tiziano Fagni et al. [17] proposed a method to detect deepfake tweets: tweets were first generated by different language models, and different machine learning methods were then applied with TF-IDF and Bag-of-Words features. Sadasivan et al. [18] evaluated the effectiveness of several existing approaches for detecting AI-generated text, including rule-based methods, statistical methods, and machine learning-based detectors. They discovered that although these techniques can be effective in identifying some sorts of AI-generated text, they are frequently open to adversarial attempts that trick them into treating the material as human-generated; a lightweight neural network-based paraphraser was developed and applied to the AI-generated text. Kirchenbauer et al. [19] introduced a watermarking method that adds a small amount of noise to the weights of the LLM; the noise is made with the intention of encoding a distinct watermark signal that can later be decoded by a watermark detector. The GPT-2 and GPT-3 language models are used to demonstrate the utility of their watermarking technology: the watermark can be found even after fine-tuning the LLM on fresh data, and it is resistant to a variety of attacks, including gradient masking and weight perturbations. Kalpesh Krishna et al. [20] created a substantial amount of AI-generated text samples using a number of cutting-edge language models, such as GPT-3 and T5, and then assessed the efficiency of several rule-based and machine learning-based AI-text detectors on the created samples. They proposed a retrieval-based defensive method that depends on determining the text's original author: the technique maintains a database of known AI-generated text samples and their associated original sources, and compares any new text sample against this database to identify probable sources. The authors demonstrate that this retrieval-based defensive mechanism is successful in identifying material that has been paraphrased by AI, with good detection accuracy. Souradip Chakraborty et al. [21] discussed multiple possibilities for detecting AI-generated texts, including some statistical methods and several machine learning algorithms. In most of these papers, GPT-2 or versions preceding GPT-3 were used; in contrast, our paper applies conventional machine learning algorithms to a dataset generated by GPT-3.5, whose output is more human-like.

III. METHODOLOGY

The aim of this research is to differentiate human text from generative-model text using machine learning. Figure 1 gives a high-level overview of our process. The task starts with data collection; Section III-A holds the detailed description of this stage.

Fig. 1. High level overview of the process

In the data preprocessing stage, the dataset was balanced using an undersampling technique. Moreover, the class column was converted into numerical values using binary encoding (which coincides with one-hot encoding for two classes). Deleting stop words in the pre-processing stage impacted the classification performance negatively, since the choice of stop words plays a crucial role in differentiating human and AI text. Finally, since machines can only understand numbers, the sentences were vectorized using a TF-IDF vectorizer; the details of this technique are given in Section III-B. The specialty of this vectorizer over other methods is its ability to capture the importance of a word.
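As a rough sketch of this preprocessing stage, the balancing and encoding could look like the following; the column names "text" and "class" are illustrative assumptions, not the dataset's actual schema:

```python
# Sketch of the preprocessing described above: binary-encode the class column
# and balance the classes by undersampling the majority class.
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # binary encoding of the class column (hypothetical label values)
    df["label"] = (df["class"] == "chatgpt").astype(int)
    n_minority = df["label"].value_counts().min()
    # undersample every class down to the minority-class size
    balanced = (df.groupby("label", group_keys=False)
                  .apply(lambda g: g.sample(n=n_minority, random_state=42)))
    return balanced.reset_index(drop=True)
```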
After preprocessing, the data was split into two parts with a ratio of 80:20, where the majority portion was kept for training and the rest for testing. Figure 3 presents a detailed description of the process. A total of eleven models were selected for the ablation study; Section III-C gives a short description of each algorithm. The final model selected for this task was the Extremely Randomized Trees Classifier (ERTC). ERTC is an ensemble algorithm based on decision trees; however, instead of selecting the best partition point, it splits the data at random points. In the training phase, it constructs a number of decision trees based on randomly selected attributes and features [22], and at testing time it takes a majority vote for prediction. An illustration of ERTC is shown in Figure 2. This algorithm has several hyper-parameters; in our research, we found that 50 decision trees without pruning, with gini as the splitting criterion, work best for this problem.

Fig. 2. Extra Tree Classifier

Fig. 3. Detailed overview of the process
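The following is a minimal scikit-learn sketch of the pipeline just described (TF-IDF features, an 80:20 split, and 50 unpruned extra trees with the gini criterion); the CSV path and column names are illustrative assumptions rather than the authors' actual artifacts:

```python
# Minimal sketch of the proposed pipeline: TF-IDF features fed into an
# Extremely Randomized Trees classifier (50 unpruned trees, gini criterion).
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("chatgpt_vs_human.csv")     # hypothetical file name
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.20, random_state=42)

vectorizer = TfidfVectorizer()               # vectorize the sentences
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# 50 trees, no pruning (max_depth=None by default), gini splitting criterion
clf = ExtraTreesClassifier(n_estimators=50, criterion="gini")
clf.fit(X_train_vec, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test_vec)))
```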
A. Data Collection and Preprocessing

The dataset consists of 10,000 texts, among which 5,204 were generated by humans and the rest came from ChatGPT; Figure 4 shows the distribution of the dataset. The initial dataset was constructed by first collecting data from Quora and CNN news using web scraping; these texts were then given to ChatGPT for paraphrasing. However, the initial dataset had a 1:8 ratio of human to ChatGPT text, so data preprocessing was applied to make the dataset balanced.

Fig. 4. Dataset distribution
B. TF-IDF Vectorizer

Term Frequency-Inverse Document Frequency (TF-IDF) is a popular word vectorization technique. The idea behind TF-IDF is to emphasize the important words. Calculating TF-IDF takes two steps. The first step is to calculate the term frequency, that is, how often a term (a word or group of words) appears in the document. Let a term be denoted by t, a document by d, the collection of documents (corpus) by D, and the total number of documents in the corpus by N. The term frequency is then the number of occurrences of the term in the document divided by the total number of terms, as given in Equation (1):

TF(t, d) = n_{t,d} / Σ_k n_{k,d}   (1)

Inverse Document Frequency, on the other hand, measures how important a term is across the corpus. As described in Equation (2), it is calculated by taking the logarithm of the total number of documents in the corpus divided by the number of documents df(t) that contain the term:

IDF(t, D) = log( N / df(t) )   (2)

Finally, TF-IDF is calculated by multiplying the TF and IDF values obtained in the steps above:

TF-IDF(t, d, D) = TF(t, d) · IDF(t, D)   (3)
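To make the computation concrete, here is a small self-contained sketch that applies Equations (1)-(3) to a toy corpus; note that library implementations such as scikit-learn's TfidfVectorizer use a smoothed IDF variant, so their numbers differ slightly from this plain formula:

```python
# Toy TF-IDF computation following Equations (1)-(3) directly.
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",        # illustrative documents
    "the dog sat on the log",
    "cats and dogs are friends",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

def tf(term, doc):
    # Equation (1): occurrences of the term divided by total terms in the doc
    return Counter(doc)[term] / len(doc)

def idf(term):
    # Equation (2): log of N over the number of documents containing the term
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    # Equation (3): product of TF and IDF
    return tf(term, doc) * idf(term)

print(tf_idf("cat", docs[0]))   # "cat" occurs in only 1 of 3 docs -> higher weight
print(tf_idf("the", docs[0]))   # "the" occurs in 2 of 3 docs -> lower weight
```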
C. Algorithms

Logistic Regression: Logistic regression is a statistical method mainly used for binary classification. It models the relationship between the input data and the class variable in order to make predictions.
Support Vector Machine: The Support Vector Machine (SVM) is used for both classification and regression. The goal of the SVM is to find a hyperplane in an N-dimensional space that clearly separates the data points.
Decision Tree: A decision tree constructs a tree data structure based on information obtained from the data. At inference time, it traverses the tree to find the appropriate class.
K-Nearest Neighbor: The K-nearest neighbors (KNN) algorithm is based on the notion of proximity. It assigns the class label of a given data point by taking a majority vote among its K nearest instances.
Random Forest: Random Forest is a supervised machine learning algorithm that combines a number of decision trees using the concept of attribute bagging.
AdaBoost: The AdaBoost (Adaptive Boosting) algorithm is a boosting ensemble method. It is particularly useful for noisy data.
Bagging Classifier: A bagging classifier is an ensemble meta-estimator that fits base classifiers on random subsets of the original dataset and then averages or votes over the individual predictions to produce a final prediction.
Gradient Boosting: To minimize a loss function, the functional gradient algorithm known as Gradient Boosting repeatedly chooses a function that points in the direction of a weak hypothesis or a negative gradient. A powerful predictive model is created by combining several weak learners.
Multi-layer Perceptron: Artificial neurons are the main concept of the Multi-layer Perceptron (MLP). These neurons are a set of interconnected units or nodes that loosely resemble the neurons in a biological brain. Like the synapses in a biological brain, each connection can transmit a signal to neighboring neurons; an artificial neuron processes the signals sent to it and can in turn signal the neurons connected to it. The output of each neuron is computed by some non-linear function of the sum of its inputs.
Long Short-Term Memory: Long Short-Term Memory (LSTM) is a type of recurrent neural network that can remember information over long sequences.
Extremely Randomized Trees: Extremely Randomized Trees, sometimes referred to as Extra Trees, builds numerous trees over the whole dataset during training, similar to Random Forest (RF). The difference from the RF classifier is that RF splits data based on the best splitting criterion, whereas the Extra Trees classifier splits data at randomly chosen points.

IV. RESULTS

A. Experimental Setup

The experiment was carried out in a Jupyter notebook on a machine equipped with a Ryzen 5 5600G CPU, whose integrated Graphics Processing Unit (GPU) was used for the deep learning tasks. The machine also had 16 GB of RAM. Python was used as the programming language, along with four libraries: NumPy, Pandas, scikit-learn, and TensorFlow.

B. Evaluation Metrics

Evaluation metrics are used to measure the performance of the model, and different metrics provide different perspectives on the result. In this paper, we have used five metrics, namely accuracy, precision, recall, F1 score, and the Matthews correlation coefficient (MCC). Accuracy is the percentage of correctly predicted instances; as shown in Equation (4), it is calculated as the number of correctly predicted samples divided by the total number of samples.

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (4)

While precision calculates the true positive predictions over all positive predictions, recall measures the true positive predictions out of all actual positive instances. The formulas for precision and recall are given in Equations (5) and (6), respectively.

Precision = TP / (TP + FP)   (5)

Recall = TP / (TP + FN)   (6)
TABLE I
PERFORMANCE OF DIFFERENT CLASSIFIERS

Model                        Accuracy  Precision  Recall  F1-Score  MCC
Logistic Regression            0.74      0.73      0.73     0.73    0.48
Support Vector Machines        0.75      0.75      0.71     0.73    0.50
Decision Tree                  0.63      0.75      0.79     0.67    0.29
K-Nearest Neighbor             0.69      0.67      0.68     0.67    0.37
Random Forest                  0.76      0.73      0.81     0.76    0.53
AdaBoost                       0.71      0.68      0.74     0.71    0.43
Bagging Classifier             0.74      0.71      0.75     0.73    0.47
Gradient Boosting              0.71      0.66      0.78     0.72    0.42
Multi-layer Perceptron         0.72      0.73      0.72     0.72    0.43
Long Short-Term Memory         0.73      0.73      0.77     0.75    0.46
Extremely Randomized Trees     0.77      0.74      0.78     0.76    0.54
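For reference, the models compared in Table I could be instantiated roughly as follows in scikit-learn; every hyper-parameter not stated in the paper is a library default and therefore an assumption (the LSTM baseline would be built separately in TensorFlow/Keras, since it consumes token sequences rather than TF-IDF vectors):

```python
# One possible instantiation of the compared models; defaults are assumptions.
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machines": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "K-Nearest Neighbor": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Bagging Classifier": BaggingClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Multi-layer Perceptron": MLPClassifier(),
    "Extremely Randomized Trees": ExtraTreesClassifier(n_estimators=50),
}
```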
As shown in Equation (7), the F1 score is the harmonic mean of precision and recall, which provides a more balanced measurement than either precision or recall alone.

F1-score = 2 · (Precision · Recall) / (Precision + Recall)   (7)

Finally, the Matthews Correlation Coefficient (MCC) is a more robust metric that takes all four confusion-matrix counts into account, and it is more informative than all of the other metrics mentioned above [23]. The formula for calculating MCC is given in Equation (8).

MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))   (8)
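For completeness, here is a brief sketch of how these five metrics could be computed with scikit-learn on the held-out predictions; the variable names are illustrative, and binary 0/1 labels are assumed:

```python
# Computing the five reported metrics with scikit-learn.
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

def report(y_test, y_pred):
    # y_test / y_pred: held-out labels and model predictions (0 = human, 1 = AI)
    return {
        "Accuracy": accuracy_score(y_test, y_pred),     # Eq. (4)
        "Precision": precision_score(y_test, y_pred),   # Eq. (5)
        "Recall": recall_score(y_test, y_pred),         # Eq. (6)
        "F1-Score": f1_score(y_test, y_pred),           # Eq. (7)
        "MCC": matthews_corrcoef(y_test, y_pred),       # Eq. (8)
    }
```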
Fig. 6. ROC curve of all models compared
C. Experimental Results

We have tested the performance of nine machine learning classifiers along with an MLP and an LSTM model with different hyper-parameters. Table I holds the detailed performance analysis of the different models, and Figure 5 shows a diagrammatic accuracy comparison of the classifiers.

Fig. 5. Accuracy of different classifiers

From the results, it is clearly visible that the Extra Trees classifier outperforms all the other classifiers, with an accuracy of 77%. Moreover, it also has the highest MCC score, the most informative of the evaluation metrics compared here. The results also demonstrate that some well-known classifiers, such as K-Nearest Neighbor and Decision Tree, perform poorly on this dataset. The deep learning models, the artificial neural network (MLP) and the Long Short-Term Memory network, were trained for 15 epochs; although they reached high training accuracy, they performed poorly at testing, and regularization techniques may improve their performance. For further investigation of the results, Figure 6 presents the ROC curves, which show that the model is not biased toward a particular class.
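A plausible Keras version of the LSTM baseline described above (trained for the stated 15 epochs) is sketched below; the vocabulary size, sequence length, and layer widths are assumptions, as the paper does not report them:

```python
# Hypothetical LSTM baseline in TensorFlow/Keras; all sizes are assumptions.
import tensorflow as tf

vocab_size, embed_dim = 20000, 64
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),   # token embeddings
    tf.keras.layers.LSTM(64),                           # sequence encoder
    tf.keras.layers.Dense(1, activation="sigmoid"),     # human vs. ChatGPT
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train_seq, y_train, epochs=15, validation_split=0.1)
```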
V. CONCLUSION

In this research, we proposed a model that can differentiate between text generated by humans and text generated by ChatGPT. Since generative AI has become very advanced, it is difficult to distinguish human text from machine-generated text; however, in this paper, we have presented a machine learning-based approach that can effectively identify the two types of text. With continuing research in this area, we expect to see more sophisticated models for solving this problem, models that will help ensure transparency and accountability in day-to-day life.
REFERENCES

[1] P. Hacker, A. Engel, and M. Mauer, “Regulating chatgpt and other large generative ai models,” arXiv preprint arXiv:2302.02337, 2023.
[2] J. Botha and H. Pieterse, “Fake news and deepfakes: A dangerous threat for 21st century information security,” in ICCWS 2020 15th International Conference on Cyber Warfare and Security. Academic Conferences and Publishing Limited, 2020, p. 57.
[3] J. Qadir, “Engineering education in the era of chatgpt: Promise and
pitfalls of generative ai for education,” 2022.
[4] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and L. Sun, “A
comprehensive survey of ai-generated content (aigc): A history of
generative ai from gan to chatgpt,” arXiv preprint arXiv:2303.04226,
2023.
[5] D. Khurana, A. Koli, K. Khatter, and S. Singh, “Natural language
processing: State of the art, current trends and challenges,” Multimedia
tools and applications, vol. 82, no. 3, pp. 3713–3744, 2023.
[6] A. Feder, K. A. Keith, E. Manzoor, R. Pryzant, D. Sridhar, Z. Wood-
Doughty, J. Eisenstein, J. Grimmer, R. Reichart, M. E. Roberts et al.,
“Causal inference in natural language processing: Estimation, prediction,
interpretation and beyond,” Transactions of the Association for Compu-
tational Linguistics, vol. 10, pp. 1138–1158, 2022.
[7] T. Zhang, A. M. Schoene, S. Ji, and S. Ananiadou, “Natural language
processing applied to mental illness detection: a narrative review,” NPJ
digital medicine, vol. 5, no. 1, p. 46, 2022.
[8] C.-F. R. Chen, Q. Fan, and R. Panda, “Crossvit: Cross-attention multi-
scale vision transformer for image classification,” in Proceedings of the
IEEE/CVF international conference on computer vision, 2021, pp. 357–
366.
[9] A. Gillioz, J. Casas, E. Mugellini, and O. Abou Khaled, “Overview of
the transformer-based models for nlp tasks,” in 2020 15th Conference
on Computer Science and Information Systems (FedCSIS). IEEE, 2020,
pp. 179–183.
[10] S. Singh and A. Mahmood, “The nlp cookbook: modern recipes for
transformer based deep learning architectures,” IEEE Access, vol. 9, pp.
68 675–68 702, 2021.
[11] S. Gehrmann, H. Strobelt, and A. M. Rush, “Gltr: Statistical detection
and visualization of generated text,” arXiv preprint arXiv:1906.04043,
2019.
[12] A. Bakhtin, S. Gross, M. Ott, Y. Deng, M. Ranzato, and A. Szlam, “Real
or fake? learning to discriminate machine from human generated text,”
arXiv preprint arXiv:1906.03351, 2019.
[13] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, and C. Finn,
“Detectgpt: Zero-shot machine-generated text detection using probability
curvature,” arXiv preprint arXiv:2301.11305, 2023.
[14] I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss,
J. Wu, A. Radford, G. Krueger, J. W. Kim, S. Kreps et al., “Release
strategies and the social impacts of language models,” arXiv preprint
arXiv:1908.09203, 2019.
[15] A. Harada, D. Bollegala, and N. P. Chandrasiri, “Discrimination of
human-written and human and machine written sentences using text
consistency,” in 2021 International Conference on Computing, Com-
munication, and Intelligent Systems (ICCCIS). IEEE, 2021, pp. 41–47.
[16] S. Mitrović, D. Andreoletti, and O. Ayoub, “Chatgpt or human? detect
and explain. explaining decisions of machine learning model for de-
tecting short chatgpt-generated text,” arXiv preprint arXiv:2301.13852,
2023.
[17] T. Fagni, F. Falchi, M. Gambini, A. Martella, and M. Tesconi, “Tweep-
fake: About detecting deepfake tweets,” Plos one, vol. 16, no. 5, p.
e0251415, 2021.
[18] V. S. Sadasivan, A. Kumar, S. Balasubramanian, W. Wang, and
S. Feizi, “Can ai-generated text be reliably detected?” arXiv preprint
arXiv:2303.11156, 2023.
[19] J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, and T. Gold-
stein, “A watermark for large language models,” arXiv preprint
arXiv:2301.10226, 2023.
[20] K. Krishna, Y. Song, M. Karpinska, J. Wieting, and M. Iyyer, “Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense,” arXiv preprint arXiv:2303.13408, 2023.
[21] S. Chakraborty, A. S. Bedi, S. Zhu, B. An, D. Manocha, and F. Huang, “On the possibilities of ai-generated text detection,” arXiv preprint arXiv:2304.04736, 2023.
[22] M. Ntahobari, L. Kuhlmann, M. Boley, and Z. R. Hesabi, “Enhanced extra trees classifier for epileptic seizure prediction,” in 2022 5th International Conference on Signal Processing and Information Security (ICSPIS). IEEE, 2022, pp. 175–179.
[23] D. Chicco and G. Jurman, “The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation,” BMC Genomics, vol. 21, pp. 1–13, 2020.