Adaptive REST API Testing with Reinforcement Learning

Myeongsoo Kim
Georgia Institute of Technology
Atlanta, Georgia, USA
mkim754@[Link]

Saurabh Sinha
IBM Research
Yorktown Heights, New York, USA
sinhas@[Link]

Alessandro Orso
Georgia Institute of Technology
Atlanta, Georgia, USA
orso@[Link]
arXiv:2309.04583v1 [[Link]] 8 Sep 2023

Abstract—Modern web services increasingly rely on REST APIs. Effectively testing these APIs is challenging due to the vast search space to be explored, which involves selecting API operations for sequence creation, choosing parameters for each operation from a potentially large set of parameters, and sampling values from the virtually infinite parameter input space. Current testing tools lack efficient exploration mechanisms, treating all operations and parameters equally (i.e., not considering their importance or complexity) and lacking prioritization strategies. Furthermore, these tools struggle when response schemas are absent in the specification or exhibit variants. To address these limitations, we present an adaptive REST API testing technique that incorporates reinforcement learning to prioritize operations and parameters during exploration. Our approach dynamically analyzes request and response data to inform dependent parameters and adopts a sampling-based strategy for efficient processing of dynamic API feedback. We evaluated our technique on ten RESTful services, comparing it against state-of-the-art REST testing tools with respect to code coverage achieved, requests generated, operations covered, and service failures triggered. Additionally, we performed an ablation study on prioritization, dynamic feedback analysis, and sampling to assess their individual effects. Our findings demonstrate that our approach outperforms existing REST API testing tools in terms of effectiveness, efficiency, and fault-finding ability.

Index Terms—Reinforcement Learning for Testing, Automated REST API Testing

I. INTRODUCTION

The increasing adoption of modern web services has led to a growing reliance on REpresentational State Transfer (REST) APIs for communication and data exchange [1], [2]. REST APIs adhere to a set of architectural principles that enable scalable, flexible, and efficient interactions between various software components through the use of standard HTTP methods and a stateless client-server model [3]. To facilitate their discovery and use by clients, REST APIs are often documented using various specification languages [4]–[7] that let developers describe the APIs in a structured format and provide essential information, such as the available endpoints, input parameters and their schemas, response schemas, and so on. Platforms such as APIs Guru [8] and Rapid [9] host thousands of RESTful API documents, emphasizing the significance of formal API specifications in industry.

Standardized documentation formats, such as the OpenAPI specification [4], not only facilitate the development of REST APIs and their use by clients, but also provide a foundation for the development of automated testing techniques for such APIs, and numerous such techniques and tools have emerged in recent years (e.g., [10]–[17]). In spite of this, testing REST APIs continues to be a challenging task, with high code coverage remaining an elusive goal for automated tools [18]. Testing REST APIs can be challenging because of the large search space to be explored, due to the large number of operations, potential execution orders, inter-parameter dependencies, and associated input parameter value constraints [18], [19]. Current techniques often struggle when exploring this space due to the lack of effective exploration strategies for operations and their parameters. In fact, existing testing tools tend to treat all operations and parameters equally, disregarding their relative importance or complexity, which leads to suboptimal testing strategies and insufficient coverage of crucial operation and parameter combinations. Moreover, these tools rely on discovering producer-consumer relationships between response schemas and request parameters; this approach works well when the parameter and response schemas are described in detail in the specification, but falls short in the common case in which the schemas are incomplete or imprecise.

To address these limitations of the state of the art, we present adaptive REST API testing with reinforcement learning (ARAT-RL), an advanced black-box testing approach. Our technique incorporates several innovative features, such as leveraging reinforcement learning to prioritize operations and parameters for exploration, dynamically constructing key-value pairs from both response and request data, analyzing these pairs to inform dependent operations and parameters, and utilizing a sampling-based strategy for efficient processing of dynamic API feedback. The primary objectives of our approach are to increase code coverage and improve fault-detection capability.

The core novelty in ARAT-RL is an adaptive testing strategy driven by a reinforcement-learning-based prioritization algorithm for exploring the space of operations and parameters. The algorithm initially determines the importance of an operation based on the parameters it uses and the frequency of their use across operations. This targeted exploration enables efficient coverage of critical operations and parameters, thereby improving code coverage. The technique employs reinforcement learning to adjust the priority weights associated with operations and parameters based on feedback, by
decreasing importance for successful responses and increasing it for failed responses. The technique also assigns weights to parameter-value mappings based on various sources of input values (e.g., random, specified values, response values, request values, and default values). The adaptive nature of ARAT-RL gives precedence to yet-to-be-explored or previously error-prone operations, paired with suitable value mappings, which improves the overall efficiency of API exploration.

/products/{productName}/configurations/{configurationName}/features/{featureName}:
  post:
    operationId: addFeatureToConfiguration
    produces:
      - application/json
    parameters:
      - name: productName
        in: path
        required: true
        type: string
      - name: configurationName
        in: path
        required: true
        type: string
      - name: featureName
        in: path
        required: true
        type: string
    responses:
      default:
        description: successful operation
/products/{productName}/configurations/{configurationName}/features:
  get:
    operationId: getConfigurationActivedFeatures
    produces:
      - application/json
    parameters:
      - name: productName
        in: path
        required: true
        type: string
      - name: configurationName
        in: path
        required: true
        type: string
    responses:
      '200':
        description: successful operation
        schema:
          type: array
          items:
            type: string

Fig. 1. A Part of Features Service's OpenAPI Specification.

Another innovative feature of ARAT-RL is dynamic construction of key-value pairs. In contrast to existing approaches that rely heavily on resource schemas provided in the specification, our technique dynamically constructs key-value pairs by analyzing POST operations (i.e., the HTTP methods that create resources) and examining both response and request data. For instance, suppose that an operation takes book title and price as request parameters and, as response, produces a success status code along with a string message (e.g., "Successfully created"). Our technique leverages this information to create key-value pairs for book title and price, upon receiving a successful response, even if such data is not present in the response. In other words, the technique takes into account the input parameters used in the request, as they correspond to the created resource. Moreover, if the service returns incomplete resources (i.e., only a subset of the data items for a given type of resource), our technique still creates key-value pairs for the information available. This dynamic approach enables ARAT-RL to identify resources from the API responses and requests, as well as discover hidden dependencies that are not evident from the specification alone.

Finally, ARAT-RL employs a simple yet effective sampling-based approach that allows it to process dynamic API feedback efficiently and adapt its exploration based on the gathered information. By randomly sampling key-value pairs from responses, our technique reduces the overhead of processing every response for each pair, resulting in more efficient testing.

To evaluate ARAT-RL, we conducted a set of empirical studies using 10 RESTful services and compared its performance against three state-of-the-art REST API testing tools: RESTler [12], EvoMaster [10], and Morest [20]. We assessed the effectiveness of ARAT-RL in terms of coverage achieved and service failures triggered, and its efficiency in terms of valid and fault-inducing requests generated and operations covered within a given time budget. Our results show that ARAT-RL outperforms the competing tools for all the metrics considered—it achieved the highest method, branch, and line coverage rates, along with better fault-detection ability. Specifically, ARAT-RL covered 119%, 60%, and 52% more branches, lines, and methods than RESTler; 37%, 21%, and 14% more branches, lines, and methods than EvoMaster; and 24%, 12%, and 10% more branches, lines, and methods than Morest. ARAT-RL also uncovered 9.3x, 2.5x, and 2.4x more bugs than RESTler, EvoMaster, and Morest, respectively. In terms of efficiency, ARAT-RL generated 52%, 41%, and 1,222% more valid and fault-inducing requests and covered 15%, 24%, and 283% more operations than Morest, EvoMaster, and RESTler, respectively. We also conducted an ablation study to assess the individual effects of prioritization, dynamic feedback analysis, and sampling on the overall effectiveness of ARAT-RL. Our results indicate that reinforcement-learning-based prioritization contributes the most to ARAT-RL's effectiveness, followed by dynamic feedback analysis and sampling.

The main contributions of this work are:
• A novel approach for adaptive REST API testing that incorporates (1) reinforcement learning to prioritize exploration of operations and parameters, (2) dynamic analysis of request and response data to identify dependent parameters, and (3) a sampling strategy for efficient processing of dynamic API feedback.
• Empirical results demonstrating that ARAT-RL outperforms state-of-the-art REST API testing tools by generating more valid and fault-inducing requests, covering more operations, achieving higher code coverage, and triggering more service failures.
• An artifact [21] containing the tool, the benchmark services, and the empirical results.

The rest of this paper is organized as follows. Section II presents background information and a motivating example to illustrate challenges in REST API testing. Section III describes our approach. Section IV presents our empirical evaluation and results. Section V discusses related work and, finally, Section VI presents our conclusions and potential directions for future work.

II. BACKGROUND AND MOTIVATING EXAMPLE

We provide a brief introduction to REST APIs, OpenAPI specifications, and reinforcement learning; then, we illustrate the novel features of our approach using a running example.
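To make the sampling-based feedback processing outlined in the Introduction concrete, the following is a minimal Python sketch (not the tool's actual code): instead of recording every key-value pair found in a response, it records a bounded random sample. The function names (`flatten`, `sample_pairs`) and the example response are illustrative only.

```python
import random

def flatten(obj, prefix=""):
    # Recursively flatten a JSON-like structure into (key, value) pairs.
    pairs = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            pairs.extend(flatten(value, key))
    elif isinstance(obj, list):
        for item in obj:
            pairs.extend(flatten(item, prefix))
    else:
        pairs.append((prefix, obj))
    return pairs

def sample_pairs(response_json, k, rng=random):
    # Keep at most k randomly chosen pairs instead of processing all of them.
    pairs = flatten(response_json)
    return pairs if len(pairs) <= k else rng.sample(pairs, k)

# Hypothetical API response with nested data.
response = {"id": 7, "title": "REST in Practice", "price": 39.99,
            "vendor": {"vendorId": 3, "city": "Atlanta"}}
kept = sample_pairs(response, k=2)
assert len(kept) == 2
```

The point of the sketch is the cost model: `flatten` touches every field once, but downstream bookkeeping only ever sees `k` pairs per response, which bounds the per-response processing overhead.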
A. REST APIs

REST APIs are web APIs that adhere to the RESTful architectural style [3]. REST APIs facilitate communication between clients and servers by exchanging data through standardized protocols, such as HTTP [22]. Key principles of REST include statelessness, cacheability, and a uniform interface, which simplify client-server interactions and promote loose coupling [23]. Clients communicate with web services by sending HTTP requests. These requests access and/or manipulate resources managed by the service, where a resource represents data that a client may want to create, delete, update, or access. Requests are sent to an API endpoint, identified by a resource path and an HTTP method specifying the action to be performed on the resource. The most commonly used methods are POST, GET, PUT, and DELETE, for creating, reading, updating, and deleting a resource, respectively. The combination of an endpoint and an HTTP method is called an operation. Besides specifying an operation, a request can also optionally include HTTP headers containing metadata and a body with the request's payload. Upon receiving and processing a request, the web service returns a response containing headers, possibly a body, and an HTTP status code—a three-digit number indicating the request's outcome. Specifically, 2xx status codes denote successful responses, 4xx codes indicate client errors, and status code 500 suggests a server error.

B. OpenAPI Specification

The OpenAPI Specification (OAS) [4] is a widely adopted API description format for RESTful APIs, providing a standardized and human-readable way to describe the structure, functionality, and expected behavior of an API. Figure 1 illustrates an example OAS file describing a part of the Features Service API. This example shows two API operations. The first operation, a POST request, is designed to add a feature name to a product's configuration. It requires three parameters: product name, configuration name, and feature name, all of which are specified in the path. Upon successful execution, the API responds with a JSON object, signaling that the feature has been added to the configuration. The second operation, a GET request, retrieves the active features of a product's configuration. Similar to the first operation, it requires the product name and configuration name as path parameters. The API responds with a 200 status code and an array of strings representing the active features in the specified configuration.

C. Reinforcement Learning and Q-Learning

Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment [24]. The agent selects actions in various situations (states), observes the consequences (rewards), and learns to choose the best actions to maximize the cumulative reward over time. The learning process in RL is trial-and-error based, meaning the agent discovers the best actions by trying out different options and refining its strategy based on the observed rewards. The agent must also decide between exploring new actions to gather more knowledge or exploiting known actions that offer the best reward based on its current understanding. The balance between exploration and exploitation is often governed by parameters, such as ϵ in the ϵ-greedy strategy [24].

Q-learning is a widely used model-free reinforcement learning algorithm that estimates the optimal action-value function, Q(s, a) [25]. The Q-function represents the expected cumulative reward the agent can obtain by taking action a in state s and then following the optimal policy. Q-learning uses a table to store Q-values and updates them iteratively based on the agent's experiences. In the learning process, the agent takes actions, receives rewards, and updates the Q-values using the Q-learning update rule, derived from the Bellman equation [24]:

    Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]    (1)

where α is the learning rate, γ is the discount factor, s′ is the new state after taking action a, and r is the immediate reward received. The agent updates the Q-values to converge to their optimal values, which represent the expected long-term reward of taking each action in each state.

D. Motivating Example

Next, we illustrate the salient features of ARAT-RL using the Feature-Service specification (Figure 1) as an example.

RL-based adaptive exploration: For the example in Figure 1, to perform the operation addFeatureToConfiguration, we must first create a product using a separate operation and establish a configuration for it using another operation. The sequence of operations should, therefore, be: create product, create configuration, and create feature name for the product with the specified configuration name. This example emphasizes the importance of determining the operation sequence. Our technique initially assigns weights to operations and parameters based on their usage frequency in the specification. In this case, productName is the most frequently used parameter across all operations; therefore, our technique assigns higher weights to operations involving productName. Specifically, the operation for creating a product gets the highest priority. Moreover, once an operation is executed, its priority must be adjusted so that it is not explored repeatedly, creating new product instances unnecessarily. After processing a prioritized operation, our technique employs RL to adjust the weights in response to the API response received. If a successful response is obtained, negative rewards are assigned to the processed parameters, as our objective is to explore other uncovered operations. This method naturally leads to the selection of the next priority operation and parameter, facilitating efficient adjustments to the call sequence.

Inter-parameter dependencies [19] can increase the complexity of the testing process, as some parameters might have mutual exclusivity or other constraints associated with them (e.g., only one of the parameters can be specified). RL-based exploration guided by feedback received can also help with dealing with this complexity.
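As a concrete illustration of the update rule in Equation 1, the following Python snippet performs one tabular Q-learning step. This is a generic sketch of the textbook rule, not ARAT-RL's implementation; the state and action names are made up for the example.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # One tabular Q-learning step:
    # Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    best_next = max(q[next_state].values()) if q[next_state] else 0.0
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]

# Toy example: two states with two actions each, Q-values initially zero.
q = {"s0": {"a0": 0.0, "a1": 0.0}, "s1": {"a0": 0.0, "a1": 0.0}}
value = q_update(q, "s0", "a0", reward=1.0, next_state="s1")
assert abs(value - 0.1) < 1e-9  # 0 + 0.1 * (1 + 0.99 * 0 - 0) = 0.1
```

Note how a single positive reward nudges Q(s0, a0) upward by only alpha times the temporal-difference error; repeated visits are needed for the value to converge, which is why the learning rate governs how quickly feedback reshapes the priorities.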
Fig. 2. Overview of our approach.

Dynamic construction of key-value pairs: Existing REST API testing strategies (e.g., [11], [12], [20]) emphasize the importance of identifying producer-consumer relationships between response schemas and request parameters. However, current tools face limitations when operations produce unstructured output (e.g., plain text) or have incomplete response schemas in their specifications. For instance, the addFeatureToConfiguration operation lacks structured response data (e.g., JSON format). Despite this, our approach processes and generates key-value data {productName: <value>, configurationName: <value>, featureName: <value>} from the request data, as the POST HTTP method indicates that a resource is created using the provided inputs.

By analyzing and storing key-value pairs identified from request and response data, our dynamic key-value pair construction method proves especially beneficial in cases of responses with plain-text descriptions or incomplete response schemas. The technique can effectively uncover hidden dependencies not evident from the specification.

Sampling for efficient dynamic key-value pair construction: API response data can sometimes be quite large and processing every response for each key-value pair can be computationally expensive. To address this issue, we have incorporated a sampling strategy into our dynamic key-value pair construction method. This strategy efficiently processes the dynamic API feedback and adapts its exploration based on the gathered information while minimizing the overhead of processing every response.

III. OUR APPROACH

In this section, we introduce our Q-Learning-based REST API testing approach, which intelligently prioritizes and selects operations, parameters, and value-mapping sources while dynamically constructing key-value pairs. Figure 2 provides a high-level overview of our approach. Initially, the Q-Learning Initialization module sets up the necessary variables and tables for the Q-learning process. Q-Learning Updater subsequently receives these variables and tables and passes them to the Prioritization module, which is responsible for selecting operations, parameters, and value-mapping sources.

ARAT-RL then sends a request to the System Under Test (SUT) and receives a response. It also supplies the request, response, selected operation, parameters, mapped value source, and the Q-Learning parameters (α, γ, and ϵ) to Q-Learning Updater. The feedback is analyzed with the request and response, storing key-value pairs extracted from them for future use. The Updater component then adjusts the Q-values based on the outcomes, enabling the approach to adapt its decision-making process over time. ARAT-RL iterates through this procedure until the specified time limit is reached. In the rest of this section, we present the details of the algorithm.

A. Q-Learning Table Initialization

The Q-Learning Table Initialization component, shown in Algorithm 1, is responsible for setting up the initial Q-table and Q-value data structures that guide the decision-making process throughout a testing session. Crucially, this process happens without making any API calls.

Algorithm 1 Q-Learning Table Initialization
 1: procedure InitializeQLearning(operations)
 2:   Set learning rate (α) to 0.1
 3:   Set discount factor (γ) to 0.99
 4:   Set exploration rate (ϵ) to 0.1
 5:   Initialize empty dictionary q_table
 6:   Initialize empty dictionary q_value
 7:   for operation in operations do
 8:     operation_id ← operation['operationId']
 9:     q_value[operation_id] ← new dictionary
10:     for source in [S1, S2, S3, S4, S5] do
11:       q_value[operation_id][source] ← 0
12:     end for
13:     for parameter in operation['parameters'] do
14:       param_name ← parameter['name']
15:       if param_name in q_table then
16:         q_table[param_name] ← q_table[param_name] + 1
17:       else
18:         q_table[param_name] ← 1
19:       end if
20:     end for
21:     for response_data in operation.get('responses') do
22:       for key in response_data.keys() do
23:         if key in q_table then
24:           q_table[key] ← q_table[key] + 1
25:         end if
26:       end for
27:     end for
28:   end for
29:   return α, γ, ϵ, q_table, q_value
30: end procedure

The algorithm begins by setting the learning rate α to 0.1, the discount factor γ to 0.99, and the exploration rate ϵ to 0.1 (lines 2–4). These parameters control the learning and exploration process of the Q-Learning algorithm; the chosen values are the ones that are commonly recommended and used (e.g., [26]–[28]). The algorithm then initializes the Q-table and the Q-value with empty dictionaries (lines 5–6).

The algorithm iterates through each operation in the API (lines 7–28). For each operation, it extracts the operation's unique identifier (operation_id) and creates a new entry in the Q-value dictionary for the operation (lines 8–9). Next, it initializes the Q-value for each value-mapping source (S1–S5) to zero (lines 10–12).

The algorithm proceeds to iterate through each parameter in the operation (lines 13–20). It extracts the parameter's name (param_name) and, if param_name already exists in the Q-table, increments the corresponding entry by one; otherwise, it initializes the entry to one. This step builds the Q-table with counts of occurrences of each parameter.
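To make the initialization concrete, here is a small Python sketch of the frequency counting that Algorithm 1 describes. It is a simplified illustration, not the tool's code: it omits the response-key counting of lines 21–27, and the operation dictionaries below are hand-written stand-ins for the parsed OpenAPI fragments shown in Figure 1.

```python
def initialize_q(operations, sources=("S1", "S2", "S3", "S4", "S5")):
    # q_table: parameter-name usage counts across operations.
    # q_value: per-operation, per-source values, all starting at zero.
    q_table, q_value = {}, {}
    for op in operations:
        q_value[op["operationId"]] = {s: 0 for s in sources}
        for param in op.get("parameters", []):
            name = param["name"]
            q_table[name] = q_table.get(name, 0) + 1
    return q_table, q_value

ops = [
    {"operationId": "addFeatureToConfiguration",
     "parameters": [{"name": "productName"}, {"name": "configurationName"},
                    {"name": "featureName"}]},
    {"operationId": "getConfigurationActivedFeatures",
     "parameters": [{"name": "productName"}, {"name": "configurationName"}]},
]
q_table, q_value = initialize_q(ops)
assert q_table["productName"] == 2  # appears in both operations
assert q_value["addFeatureToConfiguration"]["S1"] == 0
```

The resulting counts are what makes frequently shared parameters such as productName dominate the initial averages computed during operation selection.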
Algorithm 2 Q-Learning-based Prioritization
 1: procedure SelectOperation(operations, q_table)
 2:   Initialize max_avg_q_value ← −∞
 3:   Initialize best_operation ← None
 4:   for operation in operations do
 5:     operation_id ← operation['operationId']
 6:     Initialize sum_q_value ← 0
 7:     Initialize num_params ← len(operation['parameters'])
 8:     for parameter in operation['parameters'] do
 9:       param_name ← parameter['name']
10:       sum_q_value ← sum_q_value + q_table[param_name]
11:     end for
12:     avg_q_value ← sum_q_value / num_params
13:     if avg_q_value > max_avg_q_value then
14:       max_avg_q_value ← avg_q_value
15:       best_operation ← operation
16:     end if
17:   end for
18:   return best_operation
19: end procedure

 1: procedure SelectParameters(operation, ϵ)
 2:   Set n randomly (0 ≤ n ≤ length of operation['parameters'])
 3:   Initialize empty list selected_parameters
 4:   if random.random() > ϵ then
 5:     Sort operation['parameters'] by Q-values in descending order
 6:     for i ← 0 to n − 1 do
 7:       Append operation['parameters'][i] to selected_parameters
 8:     end for
 9:   else
10:     for param in random.sample(operation['parameters'], n) do
11:       Append param to selected_parameters
12:     end for
13:   end if
14:   return selected_parameters
15: end procedure

 1: procedure SelectValueMappingSource(operation, ϵ)
    Source1: Example values in specification
    Source2: Random value generated based on parameter's type and format
    Source3: Dynamically constructed key-value pairs from request
    Source4: Dynamically constructed key-value pairs from response
    Source5: Default values (string: string, number: 1.1, integer: 1, array: [], object: {})
 2:   operation_id ← operation['operationId']
 3:   sources ← [S1, S2, S3, S4, S5]
 4:   if random.random() > ϵ then
 5:     max_q_value ← −∞
 6:     max_q_index ← −1
 7:     for s in sources do
 8:       if q_value[operation_id][s] > max_q_value then
 9:         max_q_value ← q_value[operation_id][s]
10:         max_q_index ← s
11:       end if
12:     end for
13:     return max_q_index
14:   else
15:     return random.randint(1, 5)
16:   end if
17: end procedure

Algorithm 3 Q-Learning-based API Testing
 1: procedure QLearningUpdater(response, q_table, q_value, selected_op, selected_params, α, γ)
 2:   operation_id ← selected_op['operation_id']
 3:   if response status code is 2xx (successful) then
 4:     Extract key-value pairs from request and response
 5:     reward ← −1
 6:     Update q_value negatively
 7:   else if response status code is 4xx or 500 (unsuccessful) then
 8:     reward ← 1
 9:     Update q_value positively
10:   end if
11:   for each param in selected_params do
12:     for each param_name, param_value in param.items() do
13:       old_q_value ← q_table[operation_id][param_name]
14:       max_q_value_next_state ← max(q_table[operation_id].values())
15:       q_table[operation_id][param_name] ← old_q_value + α * (reward + γ * (max_q_value_next_state − old_q_value))
16:     end for
17:   end for
18:   return q_table, q_value
19: end procedure

 1: procedure Main(API_specification)
 2:   Initialize ϵmax ← 1
 3:   Initialize ϵadapt ← 1.1
 4:   Initialize time_limit ← desired time limit in seconds
 5:   operations ← Load API specification
 6:   α, γ, ϵ, q_table, q_value ← InitializeQLearning(operations)
 7:   while time_limit not reached do
 8:     operation ← SelectOperation(operations, q_table)
 9:     parameters ← SelectParameters(operation, ϵ)
10:     source ← SelectValueMappingSource(operation, ϵ)
11:     response ← Execute API request with operation, parameters, and source
12:     q_table, q_value ← QLearningUpdater(response, q_table, q_value, operation, parameters, α, γ)
13:     ϵ ← min(ϵmax, ϵadapt ∗ ϵ)
14:   end while
15: end procedure

Next, the algorithm iterates through the response data of each operation (lines 21–27). It extracts keys from the response and checks, for each key, whether it is present in the Q-table for that operation (line 23). If a key is present, the algorithm increments the corresponding entry in the Q-table by one (lines 23–25). This step populates the Q-table with the frequency of occurrence of each response key.

Finally, the algorithm returns the learning rate α, the discount factor γ, the exploration rate ϵ, the Q-table, and the Q-value (line 29). This initial setup provides the Q-Learning algorithm with basic information about the API operations and their relationships, which is further refined during testing.

B. Q-Learning-based Prioritization

In this step, ARAT-RL prioritizes API operations and selects the best parameters and value-mapping sources based on their Q-values. We present Algorithm 2 (SelectOperation, SelectParameters, and SelectValueMappingSource) to describe the prioritization process.

The SelectOperation procedure is responsible for selecting the best API operation to exercise next. The algorithm initializes variables to store the maximum average Q-value and the best operation (lines 2–3). It then iterates through each operation, calculating the average Q-value for the operation's parameters (lines 4–17). The operation with the highest average Q-value is selected as the best operation (line 15).

The SelectParameters procedure selects a subset of parameters for the chosen API operation. This selection is guided by the exploration rate ϵ. If a random value is greater than ϵ, the algorithm selects the top n parameters sorted by their Q-values in descending order (lines 4–8); otherwise, it randomly selects n parameters from the operation's parameters (lines 9–12). Finally, the selected parameters are returned (line 14).

The SelectValueMappingSource procedure is responsible for selecting the value-mapping source for the chosen API operation. The technique leverages five sources of values.
• Source 1 (example values in specification): These values are provided in the API documentation as examples for a parameter. We consider three types of OpenAPI keywords that can specify example values: enum, example, and description [4]. The OpenAPI documentation [4] states that users can specify example values in the description field, and a recent study also shows the importance of leveraging example values from descriptions [29]. However, such examples are not provided in a structured
format but as natural-language text. To extract example values from the description field, we create a list containing each word in the text, as well as each quoted phrase.
• Source 2 (random value generated based on parameter's type, format, and constraints): This source generates random values for each parameter based on its type, format, and constraints. To generate random values, we utilize Python's built-in random library [30]. For date and date-time formats, we employ the datetime library [31] to randomly select dates and times. If the parameter has a regular expression pattern specified in the API documentation, we generate the value randomly using the rstr library [32]. When a minimum or maximum constraint is present, we pass it to the random library to ensure that the generated values adhere to the specified constraints. This approach allows ARAT-RL to explore a broader range of values compared to the example values provided in the API specification.
• Source 3 (dynamically constructed key-value pairs from request): This source extracts key-value pairs from dynamically constructed request key-value pairs. We employ Gestalt pattern matching [33] (also known as Ratcliff/Obershelp pattern recognition [34]) to identify the key most similar to the parameter name. Gestalt matching is a lightweight but effective technique that calculates the similarity between two strings by identifying the longest common subsequences and recursively comparing the remaining unmatched substrings. This technique aids in discovering producer-consumer relationships.
• Source 4 (dynamically constructed key-value pairs from response): Similar to Source 3, this source obtains key-value pairs from dynamically constructed response key-value pairs. We use the same Gestalt pattern matching approach [33] to identify the key, further assisting in the identification of producer-consumer relationships.
• Source 5 (default values): This source uses predefined default values for each data type: "string" for strings, 1.1 for numbers, 1 for integers, [] for arrays, and {} for objects. These default values can be useful for testing how the API behaves when it receives the simplest or most common forms of inputs; such default values are used by other tools as well (e.g., [12]).

Similar to the SelectParameters procedure, the selection of the value-mapping source is guided by ϵ. If a random value is greater than ϵ, the algorithm selects the mapping source

updates the Q-values based on the response status codes. Algorithm 3 (QLearningUpdater and Main) describes the API testing process and the update of Q-values using the learning rate (α) and discount factor (γ).

The QLearningUpdater procedure updates the Q-values based on the response status codes. It first extracts the operation ID from the selected operation (line 2). If the response status code indicates success (2xx), the algorithm extracts key-value pairs from the request and response, assigns a reward of −1, and updates the Q-values negatively (lines 3–6). If the response status code indicates an unsuccessful request (4xx or 500), the algorithm assigns a reward of 1 and updates the Q-values positively (lines 7–10). The Q-values are updated for each parameter in the selected parameters using the Q-learning update rule (Equation 1 in §II-C) (lines 11–16), and the updated Q-values are returned (line 18).

The Main procedure orchestrates the Q-Learning-based API testing process. It initializes the exploration rate (ϵ), its maximum value (ϵmax), its adaptation factor (ϵadapt), and the time budget for testing (lines 2–4). The API specification is loaded and the Q-Learning table is initialized (lines 5–6). The algorithm then enters a loop that continues until the time limit is reached (line 7). In each iteration, the best operation, selected parameters, and selected mapping source are determined (lines 8–10). The API operation is executed with the selected parameters and mapping source, and the response is obtained (line 11). The Q-values are then updated based on the response (line 12), and the exploration rate (ϵ) is updated (line 13).

By continuously updating the Q-values based on the response status codes and adapting the exploration rate, the Q-Learning-based API testing process aims to effectively explore the space of API operations and parameters.

IV. EVALUATION

In this section, we present the results of empirical studies conducted to assess ARAT-RL. Our evaluation aims to address the following research questions:
1) RQ1: How does ARAT-RL compare with state-of-the-art REST API testing tools in terms of code coverage?
2) RQ2: How does the efficiency of ARAT-RL, measured in terms of valid and fault-inducing requests generated and operations covered within a given time budget, compare to that of other REST API testing tools?
3) RQ3: In terms of error detection, how does ARAT-RL
with the highest Q-value for the chosen operation (lines 4–6). perform in identifying 500 responses in REST APIs
This helps the algorithm focus on the most-promising mapping compared to state-of-the-art REST API testing tools?
sources based on prior experience. Otherwise, the algorithm 4) RQ4: How do the main components of ARAT- RL—
randomly selects a mapping source from the available sources prioritization, dynamic key-value pair construction, and
(line 8). This randomness ensures that the algorithm occasion- sampling—contribute to its overall performance?
ally explores less-promising mapping sources to avoid getting
A. Experiment Setup
stuck in a suboptimal solution.
We performed our experiments using Google Cloud E2
C. Q-Learning-based API Testing machines, each equipped with 24-core CPU and 96 GB of
In this step, ARAT- RL executes the selected API operations RAM. We created a machine image containing all the services
with the selected parameters and value-mapping sources, and and tools in our benchmark. For each experiment, we deleted
Fig. 3. Branch, line, and method coverage achieved by the tools on the benchmark services.
and recreated the machines using this image to minimize tools only in our evaluation. We believe that adding white-box
potential dependency issues. Each machine hosted all the tools to the evaluation would result in an unfair comparison,
services and tools under test, but we ran one tool at a time as these tools leverage information about the code, rather than
during the experiments. We monitored CPU and memory just information in the specification, to generate test inputs.
usage throughout the testing process to ensure that the testing We identified an initial set of 10 tools based on a recent
tools were not affected by a lack of memory or CPU resources. study [18]. From this list, we chose (the black-box version
To evaluate the effectiveness and efficiency of our approach, of) EvoMaster [10] and RESTler [12]. EvoMaster employs
we compared its against three state-of-the-art tools: EvoMas- an evolutionary algorithm for automated test case generation
ter [10], RESTler [12], and Morest [20]. We selected 10 and was the best-performing tool in that study and in an-
RESTful services from a recent REST API testing study [18] other recent comparison [35]. Its strong performance makes
as our benchmark. We explain the selection process of these it an appropriate candidate for comparison. RESTler adopts
tools and services next. a grammar-based fuzzing approach to explore APIs. It is a
Testing Tools Selection: As a preliminary note, because well-established tool in the field and, in fact, the most popular
ARAT- RL is a black-box approach, we considered black-box REST API testing tool in terms of GitHub stars.
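The reward scheme and ϵ-greedy selection used in the Q-Learning-based API testing step (§III), together with the Gestalt key matching of Sources 3 and 4, can be sketched with Python's standard library (difflib implements the Ratcliff/Obershelp matching cited as [33]–[34]). All function names, the learning-rate and discount constants, and the flat single-state Q-table below are illustrative assumptions, not ARAT-RL's actual implementation:

```python
import difflib
import random

ALPHA, GAMMA = 0.1, 0.99  # assumed learning rate and discount factor


def reward_for(status_code: int) -> int:
    # 2xx (successful request) -> reward -1, deprioritizing well-exercised
    # choices; other codes (the paper considers 4xx and 500) -> reward +1,
    # so error-triggering choices are explored further.
    return -1 if 200 <= status_code < 300 else 1


def update_q(q_table: dict, key, reward: float) -> None:
    """One Q-learning step: Q(k) += alpha * (r + gamma * max Q - Q(k))."""
    old = q_table.get(key, 0.0)
    best_next = max(q_table.values(), default=0.0)
    q_table[key] = old + ALPHA * (reward + GAMMA * best_next - old)


def select_source(q_table: dict, sources: list, epsilon: float):
    """Epsilon-greedy: usually exploit the highest-Q source, sometimes explore."""
    if random.random() > epsilon:
        return max(sources, key=lambda s: q_table.get(s, 0.0))
    return random.choice(sources)


def most_similar_key(param_name: str, candidate_keys: list) -> str:
    """Gestalt (Ratcliff/Obershelp) matching, as in Sources 3 and 4."""
    return max(candidate_keys,
               key=lambda k: difflib.SequenceMatcher(None, param_name, k).ratio())
```

In ARAT-RL terms, `reward_for` would be fed the HTTP status of each executed request, `select_source` would choose among the five value-mapping sources for an operation, and `most_similar_key` would pick the response key that best matches a parameter name when constructing key-value pairs.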
Recently, two new tools have been published. Morest [20] has been shown to have superior results compared to EvoMaster. We, therefore, included Morest in our set of tools for comparison, as it could potentially outperform the other tools. The other recent tool, RestCT, was also considered for inclusion in our evaluation. However, we encountered failures while running it. We contacted the authors, who confirmed the issues and said they would work on resolving them.
RESTful Services Selection: As benchmarks for our evaluation, we selected 10 out of 20 RESTful services from a recent study [18]. We had to exclude 10 services for various reasons. Specifically, we omitted the News service, developed by the author of one of our baseline tools (EvoMaster), to avoid possible bias. Problem Controller and Spring Batch REST were excluded because they require specific domain knowledge to generate meaningful tests, so using them provides limited information about the tools. We excluded Erc20 Rest Service and Spring Boot Actuator because some APIs in these services did not provide valid responses due to external dependencies being updated without corresponding updates in the service code. Proxyprint, OCVN, and Scout API could not be included due to authentication issues that prevented them from generating meaningful responses. Finally, we excluded CatWatch and CWA Verification because of their restrictive rate limits, which slowed down the testing process and made it impossible to collect results in a reasonable amount of time.
Our final set of services consisted of Features Service, LanguageTool, NCS, REST Countries, SCS, Genome Nexus, Person Controller, User Management Microservice, Market Service, and Project Tracking System.
Result Collection: We ran each testing tool with a time budget of one hour per execution, as a previous study [18] showed that the code coverage achieved by these tools tends to plateau within this duration. To accommodate randomness, we replicated the experiments ten times and calculated the average metrics across the runs.
Data collection for code coverage and status codes was done using JaCoCo [36] and Mitmproxy [37], respectively. We focused on identifying unique instances of 500 status codes, which indicate server-side faults. The methodology was as follows:
1) Stack Trace Collection: For services that provided stack traces with 500 errors, we collected these traces, treating each unique trace as a separate fault. In the majority of cases, the errors we collected fall into this category.
2) Response Text Analysis: In the absence of stack traces, we analyzed the response text. After removing unrelated components (e.g., timestamps), we classified unique instances of response text linked to 500 status codes as individual faults.
This systematic approach allowed us to compile a comprehensive and unique tally of faults for our analysis.
B. RQ1: Effectiveness
To answer RQ1, we compared the tools in terms of branch, line, and method coverage achieved. Figure 3 presents the results of the study. The bar charts represent the coverage achieved by each tool for each RESTful service, whereas the boxplot at the bottom summarizes each tool’s performance on the three coverage metrics across all the services.
As the boxplot illustrates, ARAT-RL consistently outperforms the other tools in all three coverage metrics over the subject services. Looking at the performance breakdown by services (shown in the bar charts), ARAT-RL performed the best (or was tied as the best-performing tool) for all services in terms of branch coverage (with one tie), eight of the 10 services in terms of line coverage (with no ties), and nine of the 10 services in terms of method coverage (with four ties). In cases where ARAT-RL did not achieve the best results, it still achieved coverage rates similar to those of the best-performing tool.
ARAT-RL is especially effective when there are operation dependencies, parameter dependencies, and value-mapping dependencies. For example, the highest coverage gains occurred for Language Tool, which has a complex set of inter-parameter and value-mapping dependencies. Meanwhile, ARAT-RL struggles with semantic parameters. For instance, its effectiveness was the lowest on Market Service, although it still matched the best-performing tool on this service. The reason for this is that the service requires input data, such as address, email, name, password, and phone number, in specific formats, but ARAT-RL failed to generate valid values for these. Consequently, it was unable to create market users and then use that information for other operations in producer-consumer relationships.
On average, ARAT-RL attained 36.25% branch coverage, 58.47% line coverage, and 59.42% method coverage. In comparison, Morest, which exhibited the second-best performance, reached an average of 29.31% branch coverage, 52.27% line coverage, and 54.24% method coverage. Thus, the coverage gain of ARAT-RL over Morest is 23.69% for branch coverage, 11.87% for line coverage, and 9.55% for method coverage. EvoMaster and RESTler yield lower average coverage rates on all three metrics, with respective results of 26.45%, 48.37%, and 52.07% for EvoMaster and 16.54%, 36.58%, and 38.99% for RESTler for branch, line, and method coverage. The coverage gains of ARAT-RL compared to EvoMaster are 37.03% for branch coverage, 20.87% for line coverage, and 14.13% for method coverage; compared to RESTler, the gains are 119.17% for branch coverage, 59.83% for line coverage, and 52.42% for method coverage.
These results provide evidence that our technique can more effectively explore REST services, achieving superior code coverage compared to existing tools, and demonstrate its potential in addressing the challenges in REST API testing.
ARAT-RL consistently outperforms RESTler, EvoMaster, and Morest in terms of branch, line, and method coverage across the subject services. However, ARAT-RL can struggle with parameters that require inputs in specific formats.
C. RQ2: Efficiency
To address RQ2, we compared ARAT-RL to Morest, EvoMaster, and RESTler in terms of the number of (1) valid
and fault-inducing requests generated (as indicated by HTTP status codes 2xx and 500, respectively) and (2) operations covered within a given time budget. Although efficiency does not depend only on these metrics, because of factors such as API response time, we believe that they still represent meaningful proxies, as they indicate the extent to which the tools explore the API and identify faults.
Table I shows these metrics for the 10 subject services. For each service, the table lists the number of operations covered and the number of requests made under the categories 2xx+500 (sum of 2xx and 500 status codes), 2xx, and 500, for each of the four tools. In the last row, the table presents the average number of operations covered and requests made by each tool across all the services.
TABLE I
COMPARISON OF OPERATIONS COVERED AND VALID AND FAILURE-INDUCING REQUESTS GENERATED (2XX AND 500 STATUS CODES) BY THE TOOLS. FOR EACH TOOL, THE FOUR COLUMNS ARE: #OPERATIONS COVERED, #REQUESTS 2XX+500, 2XX, AND 500.
Service | ARAT-RL | Morest | EvoMaster | RESTler
Features Service | 18 95,479 43,460 52,019 | 18 103,475 4,920 98,555 | 18 113,136 33,271 79,865 | 17 4,671 1,820 2,851
Language Tool | 2 77,221 67,681 9,540 | 1 1,273 1,273 0 | 2 22,006 17,838 4,168 | 1 32,796 32,796 0
NCS | 6 62,618 62,618 0 | 5 18,389 18,389 0 | 2 61,282 61,282 0 | 2 140 140 0
REST Countries | 22 36,297 35,486 811 | 22 8,431 7,810 621 | 16 9,842 9,658 184 | 6 259 255 4
SCS | 11 115,328 115,328 0 | 11 110,147 110,147 0 | 10 66,313 66,313 0 | 10 5,858 5,857 1
Genome Nexus | 23 15,819 14,010 1,809 | 23 32,598 10,661 21,937 | 19 8,374 8,374 0 | 1 182 182 0
Person Controller | 12 101,083 47,737 53,346 | 11 104,226 10,036 94,190 | 12 91,316 37,544 53,772 | 1 167 102 65
User Management Microservice | 21 44,121 13,356 30,765 | 17 1,111 948 163 | 18 29,064 13,003 16,061 | 4 79 64 15
Market Service | 12 29,393 6,295 23,098 | 6 1,399 394 1,005 | 5 10,697 4,302 6,395 | 2 1,278 0 1,278
Project Tracking System | 53 23,958 21,858 2,100 | 42 14,906 12,904 2,002 | 43 15,073 13,470 1,603 | 3 72 65 7
Average | 18 60,132 42,783 17,349 | 15.6 39,595 17,748 21,847 | 14.5 42,710 26,505 16,205 | 4.7 4,550 4,128 422
In the testing time budget of one hour, ARAT-RL generated more valid and failure-inducing requests, resulting in more exploration of the testing search space. Specifically, ARAT-RL generated 60,132 valid and fault-inducing requests on average, which is 52.01% more than Morest (39,595 requests), 40.79% more than EvoMaster (42,710 requests), and 1222% more than RESTler (4,550 requests).
This difference in the number of requests can be attributed to ARAT-RL’s approach of processing only a sample of key-value pairs from the response, rather than the entire response. By focusing on sampled key-value pairs, ARAT-RL efficiently identifies potential areas of improvement, contributing to a more effective REST API testing process.
Moreover, ARAT-RL covered more operations on average (18 operations) than Morest (15.6 operations), EvoMaster (14.5 operations), and RESTler (4.7 operations). This indicates that ARAT-RL efficiently generates more requests in a given time budget, which leads to covering more API operations and contributes to a more comprehensive testing process.
Given a one-hour testing time budget, ARAT-RL generates 52.01%, 40.79%, and 1222% more valid and fault-inducing requests than Morest, EvoMaster, and RESTler, respectively. Additionally, it covers 15.38%, 24.14%, and 282.98% more operations than these tools.
D. RQ3: Fault-Detection Capability
TABLE II
TOTAL FAULTS DETECTED BY THE TOOLS OVER 10 RUNS.
Service | RESTler | EvoMaster | Morest | ARAT-RL
Features Service | 10 | 10 | 10 | 10
Language Tool | 0 | 48 | 0 | 122
NCS | 0 | 0 | 0 | 0
REST Countries | 9 | 10 | 10 | 10
SCS | 3 | 0 | 0 | 0
Genome Nexus | 0 | 0 | 5 | 10
Person Controller | 58 | 221 | 274 | 943
User Management Microservice | 10 | 10 | 8 | 10
Market Service | 10 | 10 | 10 | 10
Project Tracking System | 10 | 10 | 10 | 10
Total | 110 | 319 | 327 | 1125
Table II presents the total number of faults detected by each tool across 10 runs for the RESTful services in our benchmark. We note that this cumulative count might include multiple detections of the same fault in different runs. For clarity in the discussion, we provide the average number of faults detected per run in the text below. As indicated, ARAT-RL exhibits the best fault-detection capability, uncovering on average 113 faults over the services. In comparison, Morest and EvoMaster detected 33 and 32 faults on average, respectively, whereas RESTler found the least—11 faults on average. ARAT-RL thus uncovered 9.3x, 2.5x, and 2.4x more faults than RESTler, EvoMaster, and Morest, respectively.
Comparing this data against the data on 500 response codes from Table I, we can see that, although ARAT-RL generated 20.59% fewer 500 responses, it found 250% more faults than Morest. Compared to EvoMaster, ARAT-RL generated 7.06% more 500 responses, but also detected 240% more faults. These results suggest that, irrespective of whether ARAT-RL generates more or fewer error responses, it is better at uncovering unique faults.
ARAT-RL’s superior fault-detection capability is particularly evident in the Language Tool and Person Controller services, where it detected on average 12 and 94 faults, respectively. What makes this significant is that these services have larger sets of parameters. Our strategy performed well in efficiently prioritizing and testing various combinations of these parameters to reveal faults. For example, Language Tool’s main operation /check, which checks text grammar, has 11 parameters. Similarly, Person Controller’s main operation /api/person, which creates and modifies person instances, has eight parameters. In contrast, the other services’ operations usually have three or fewer parameters.
ARAT-RL intelligently tries various parameter combinations through the reward system in Q-learning, because it gives negative rewards to the parameters used in successful requests. This ability to explore various parameter combinations is a significant factor in revealing more bugs, especially in services with a larger number of parameters. These results indicate that ARAT-RL’s reinforcement-learning-based approach can effectively discover faults in REST services.
ARAT-RL exhibits superior fault-detection ability, uncovering 9.3x, 2.5x, and 2.4x more bugs than RESTler, EvoMaster, and Morest, respectively. This is mainly attributed to its intelligent RL-based exploration of various parameter combinations in services with a large number of parameters.
E. RQ4: Ablation Study
To address RQ4, we conducted an ablation study to assess the impact of the main novel components of ARAT-RL on its performance. We compared the performance of ARAT-RL to three variants: (1) ARAT-RL without prioritization, (2) ARAT-RL without dynamic key-value construction from feedback, and (3) ARAT-RL without sampling. Table III presents the results of this study.
TABLE III
RESULTS OF THE ABLATION STUDY. PERCENTAGES IN PARENTHESES SHOW THE FULL TECHNIQUE’S RELATIVE GAIN OVER EACH VARIANT.
Variant | Branch | Line | Method | Faults Detected
ARAT-RL | 36.25 | 58.47 | 59.42 | 112.10
ARAT-RL (no prioritization) | 28.70 (+26.3%) | 53.27 (+9.8%) | 55.51 (+7%) | 100.10 (+12%)
ARAT-RL (no feedback) | 32.69 (+10.9%) | 54.80 (+6.9%) | 56.09 (+5.9%) | 110.80 (+1.2%)
ARAT-RL (no sampling) | 34.10 (+6.3%) | 56.39 (+3.7%) | 57.20 (+3.9%) | 112.50 (-0.4%)
As illustrated in the table, the removal of any component results in reductions in branch, line, and method coverage, as well as in the number of faults detected. Eliminating the prioritization component leads to the most substantial decline in performance, with branch, line, and method coverage decreasing to 28.70%, 53.27%, and 55.51%, respectively, and the number of detected faults dropping to 100. This highlights the critical role of the prioritization mechanism in ARAT-RL’s effectiveness.
The absence of the feedback and sampling components also negatively affects performance. Without feedback, ARAT-RL’s branch, line, and method coverage decreases to 32.69%, 54.80%, and 56.09%, respectively, and the number of found faults is reduced to 110.80. Likewise, without sampling, branch, line, and method coverage drops to 34.10%, 56.39%, and 57.20%, respectively, while the number of found faults experiences a slight increase to 112.50, which may be attributed to random variation.
These findings emphasize that each component of ARAT-RL is essential for its overall effectiveness. The prioritization mechanism, feedback loop, and sampling strategy work together to optimize the tool’s performance in terms of code coverage and fault-detection capability.
Each component of ARAT-RL—prioritization, feedback, and sampling—contributes to ARAT-RL’s overall effectiveness. The prioritization mechanism, in particular, plays a significant role in enhancing ARAT-RL’s performance in terms of code coverage achieved and faults detected.
F. Threats to Validity
In this section, we address potential threats to our study’s validity and the steps taken to mitigate them. Internal validity is influenced by tool implementations and configurations. To minimize these threats, we used the latest tool versions with their default options. Some testing tools involve randomness, but we addressed this issue by running each tool 10 times and computing the average results.
Our data-collection process relies on an automated script. Despite thorough testing, there remains the possibility that the script could contain errors, potentially affecting our results. To mitigate this risk, we conducted several rounds of pilot testing on a representative subset of our services before applying the script to the full study.
External validity is affected by the selection, and the limited number, of RESTful services tested, which impacts the generalizability of our findings. We tried to ensure a fair evaluation by selecting a diverse set of 10 services, but we will try to use a larger and more diverse set in future work.
Construct validity concerns the metrics and tool comparisons used. Metrics such as branch, line, and method coverage, number of requests, and 500 status codes as faults, although commonly used, may not capture the full quality of the testing tools. Additional metrics, such as mutation scores, could provide a better understanding of tool effectiveness. We measured efficiency by considering the number of meaningful requests generated by the tools, including valid and fault-inducing ones, as well as the number of operations each tool covered within a given time budget. While these metrics give us some perspective on a tool’s performance, they do not consider all possible factors. Other aspects, such as the response times of the services, may also significantly impact overall efficiency. Including more tools, services, and metrics in future studies would allow a more comprehensive evaluation of our technique’s performance.
V. RELATED WORK
In this section, we provide an overview of related work in automated REST API testing, requirements-based test case generation, and reinforcement-learning-based test case generation.
Automated REST API testing: EvoMaster [10] is a technique that offers both white-box and black-box testing modes. It uses evolutionary algorithms for test case generation, refining tests based on its fitness function and checking for 500 status codes. Other black-box techniques include various tools with different strategies. RESTler [12] generates stateful test cases by inferring producer-consumer dependencies and targets internal server failures. RestTestGen [11] exploits data dependencies and uses oracles for status codes and schema compliance. QuickREST [13] is a property-based technique with stateful testing, checking non-500 status codes and schema compliance. Schemathesis [16] is a tool designed for detecting faults using five types of oracles to check response
compliance in OpenAPI or GraphQL web APIs via property-based testing. RESTest [14] accounts for inter-parameter dependencies, producing nominal and faulty tests with five types of oracles. Morest [20] uses a dynamically updating RESTful-service Property Graph for meaningful test case generation. RestCT [17] employs combinatorial testing for RESTful API testing, generating test cases based on Swagger specifications.
Open-source black-box tools such as Dredd [38], fuzz-lightyear [39], and Tcases [40] offer various testing capabilities. Dredd [38] tests REST APIs by comparing responses with expected results, validating status codes, headers, and body content. Using Swagger for stateful fuzzing, fuzz-lightyear [39] detects vulnerabilities in micro-service ecosystems, offers customized test sequences, and alerts on unexpected successful attacks. Tcases [40] is a model-based tool that constructs an input-space model from OpenAPI specifications, generating test cases covering valid input dimensions and checking response status codes for validation.
Requirements-based test case generation: In requirements-based testing, notable works include ucsCNL [41], which uses controlled natural language for use case specifications, and UML Collaboration Diagrams [42]. Requirements by Contracts [43] proposed a custom language for functional requirements, while SCENT-Method [44] employed a scenario-based approach with statecharts. SDL-based Test Generation [45] transformed SDL requirements into EFSMs, and RTCM [46] introduced a natural-language-based framework with templates, rules, and keywords. These approaches typically pursue the goal of generating test cases as accurately as possible according to the intended requirements. In contrast, ARAT-RL focuses on more formally specified REST APIs, which provide a structured basis for testing, and it effectively explores and adapts during the testing process to find faults.
Reinforcement-learning-based test case generation: Several recent studies have investigated the use of reinforcement learning in software testing, focusing primarily on web applications and mobile apps. The challenges in those domains often arise from hidden states, whereas in our approach, we have access to all states through the API specification but face more constraints on operations, parameters, and mapping values. Zheng et al. [47] proposed an automatic web client testing approach utilizing curiosity-driven reinforcement learning. Pan et al. [48] introduced a similar curiosity-driven approach for testing Android applications. Koroglu et al. [49] presented QBE, a Q-learning-based exploration method for Android apps. Mariani et al. [50] proposed AutoBlackTest, an automatic black-box testing approach for interactive applications. Adamo et al. [51] developed a reinforcement-learning-based technique specifically for Android GUI testing. Vuong and Takada [52] also applied reinforcement learning to automated testing of Android apps. Köroğlu and Sen [53] presented a method for generating functional tests from UI test scenarios using reinforcement learning for Android applications.
VI. CONCLUSION AND FUTURE WORK
We introduced ARAT-RL, a reinforcement-learning-based approach for the automated testing of REST APIs. To assess the effectiveness, efficiency, and fault-detection capability of our approach, we compared its performance to that of three state-of-the-art testing tools. Our results show that ARAT-RL outperformed all three tools considered in terms of branch, line, and method coverage achieved, requests generated, and faults detected. We also conducted an ablation study, which highlighted the important role played by the novel features of our technique—prioritization, dynamic key-value construction based on response data, and sampling from response data to speed up key-value construction—in enhancing ARAT-RL’s performance.
In future work, we will further improve ARAT-RL’s performance by addressing the limitations we observed in recognizing semantic constraints and generating inputs with specific formatting requirements (e.g., email addresses, physical addresses, or phone numbers). We plan to investigate ways to combine the RESTful-service Property Graph (RPG) approach of Morest with our feedback model to better handle complex operation dependencies. We will also study the impact of RL prioritization over time. Additionally, we plan to develop an advanced technique based on multi-agent reinforcement learning. This approach will rely on multiple agents that apply natural language analysis to the text contained in server responses, utilize a large language model to sequence operations more efficiently, and employ an enhanced value-generation technique using extensive parameter datasets from Rapid API [9] and APIs-guru [8]. Lastly, we aim to detect a wider variety of bugs, including bugs that result in status codes other than 500.
ACKNOWLEDGMENTS
We thank our anonymous reviewers for their invaluable feedback and the developers of the testing tools we employed in our empirical evaluation (EvoMasterBB, Morest, and RESTler) for making their tools accessible. This research was partially supported by NSF, under grant CCF-0725202, DARPA, under contract N66001-21-C-4024, DOE, under contract DE-FOA-0002460, and gifts from Facebook, Google, IBM Research, and Microsoft Research. This publication reflects the views only of the authors; the sponsoring agencies cannot be held responsible for such views or any use that may be made of the information contained in the paper.
DATA AVAILABILITY
The artifact associated with this submission, which includes code, datasets, and other relevant materials, is available in our GitHub repository [21]. The artifact has been designed to reproduce the experiments presented in the paper and to support further research in this domain. While we have archived the submission version of our artifact on Zenodo [54], we recommend visiting our GitHub repository to access the most recent version of our tool.
REFERENCES
[1] L. Richardson, M. Amundsen, and S. Ruby, RESTful Web APIs: Services for a Changing World. O’Reilly Media, Inc., 2013.
[2] S. Patni, Pro RESTful APIs. Springer, 2017. [Online]. Available: [Link]
[3] R. T. Fielding and R. N. Taylor, “Architectural styles and the design of network-based software architectures,” Ph.D. dissertation, University of California, Irvine, 2000.
[4] The Linux Foundation, “OpenAPI specification,” 2023. [Online]. Available: [Link]
[5] SmartBear Software, “Swagger,” 2022. [Online]. Available: [Link]
[6] MuleSoft, LLC, a Salesforce company, “RAML,” 2020. [Online]. Available: [Link]
[7] API Blueprint, “API Blueprint,” 2023. [Online]. Available: [Link]
[8] [Link], “APIs-guru,” 2023. [Online]. Available: [Link]
[9] R Software Inc., “RapidAPI,” 2023. [Online]. Available: [Link]
[10] A. Arcuri, “RESTful API automated test case generation with EvoMaster,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 28, no. 1, Jan. 2019. [Online]. Available: [Link]
[11] D. Corradini, A. Zampieri, M. Pasqua, E. Viglianisi, M. Dallago, and M. Ceccato, “Automated black-box testing of nominal and error scenarios in RESTful APIs,” Software Testing, Verification and Reliability, vol. 32, 2022. [Online]. Available: [Link]
[12] V. Atlidakis, P. Godefroid, and M. Polishchuk, “RESTler: Stateful REST API fuzzing,” in Proceedings of the 41st International Conference on Software Engineering, ser. ICSE ’19. Piscataway, NJ, USA: IEEE Press, 2019, pp. 748–758. [Online]. Available: [Link]
[13] S. Karlsson, A. Čaušević, and D. Sundmark, “QuickREST: Property-based test generation of OpenAPI-described RESTful APIs,” in 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST), 2020, pp. 131–141. [Online]. Available: [Link]
[14] A. Martin-Lopez, S. Segura, and A. Ruiz-Cortés, “RESTest: Automated black-box testing of RESTful web APIs,” in Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2021. New York, NY, USA: Association for Computing Machinery, 2021, pp. 682–685. [Online]. Available: [Link]
[15] S. Karlsson, A. Čaušević, and D. Sundmark, “Automatic property-based testing of GraphQL APIs,” in 2021 IEEE/ACM International Conference on Automation of Software Test (AST), 2021, pp. 1–10. [Online]. Available: [Link]
[16] Z. Hatfield-Dodds and D. Dygalo, “Deriving semantics-aware fuzzers from web API schemas,” in Proceedings of the ACM/IEEE 44th
Software Engineering, ser. ICSE ’22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 1406–1417. [Online]. Available: [Link]
[21] “Experiment infrastructure and data for ARAT-RL,” 2023. [Online]. Available: [Link]
[22] A. Rodriguez, “RESTful web services: The basics,” IBM developerWorks, vol. 33, p. 18, 2008.
[23] S. Tilkov, “A brief introduction to REST,” InfoQ, vol. 10, Dec. 2007.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book, 2018.
[25] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, 1992. [Online]. Available: [Link]
[26] “Q learning practical guide,” 2023. [Online]. Available: [Link]
[27] N. Kumar, “Reinforcement learning guide with Python example,” 2023. [Online]. Available: [Link]
[28] A. Masadeh, Z. Wang, and A. E. Kamal, “Reinforcement learning exploration algorithms for energy harvesting communications systems,” in 2018 IEEE International Conference on Communications (ICC), 2018, pp. 1–6. [Online]. Available: [Link]
[29] M. Kim, D. Corradini, S. Sinha, A. Orso, M. Pasqua, R. Tzoref-Brill, and M. Ceccato, “Enhancing REST API testing with NLP techniques,” in Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2023, 2023, pp. 1232–1243.
[30] “Python random library,” 2023. [Online]. Available: [Link]
[31] “Python datetime library,” 2023. [Online]. Available: [Link]
[32] “rstr library,” 2023. [Online]. Available: [Link]
[33] “difflib SequenceMatcher (Gestalt pattern matching),” 2023. [Online]. Available: [Link]
[34] P. E. Black, “Ratcliff/Obershelp pattern recognition,” 2021. [Online]. Available: [Link]
[35] M. Zhang and A. Arcuri, “Open problems in fuzzing RESTful APIs: A comparison of tools,” ACM Trans. Softw. Eng. Methodol., May 2023, Just Accepted. [Online]. Available: [Link]
[36] “JaCoCo,” 2023. [Online]. Available: [Link]
[37] “mitmproxy,” 2023. [Online]. Available: [Link]
[38] “Dredd,” 2023. [Online]. Available: [Link]
[39] “fuzz-lightyear,” 2023. [Online]. Available: [Link]
[40] “Tcases,” 2023. [Online]. Available: [Link]
[41] F. Barros, L. Neves, E. Hori, and D. Torres, “The ucsCNL: A controlled natural language for use case specifications,” in SEKE 2011 - Proceedings of the 23rd International Conference on Software Engineering and Knowledge Engineering, 2011, pp. 250–253.
International Conference on Software Engineering: Companion [42] M. Badri, L. Badri, and M. Naha, “A use case driven testing process:
Proceedings, ser. ICSE ’22. New York, NY, USA: Association Towards a formal approach based on uml collaboration diagrams,” in
for Computing Machinery, 2022, p. 345–346. [Online]. Available: Formal Approaches to Software Testing, A. Petrenko and A. Ulrich, Eds.
[Link] Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 223–235.
[17] H. Wu, L. Xu, X. Niu, and C. Nie, “Combinatorial testing of restful [Online]. Available: [Link] 16
apis,” in Proceedings of the 44th International Conference on Software [43] C. Nebut, F. Fleurey, Y. Le Traon, and J.-M. Jezequel, “Requirements
Engineering, ser. ICSE ’22. New York, NY, USA: Association by contracts allow automated system testing,” in 14th International
for Computing Machinery, 2022, p. 426–437. [Online]. Available: Symposium on Software Reliability Engineering, 2003. ISSRE 2003.,
[Link] 2003, pp. 85–96. [Online]. Available: [Link]
[18] M. Kim, Q. Xin, S. Sinha, and A. Orso, “Automated test generation 2003.1251033
for rest apis: No time to rest yet,” in Proceedings of the 31st [44] J. Ryser and M. Glinz, “A scenario-based approach to validating and
ACM SIGSOFT International Symposium on Software Testing and testing software systems using statecharts,” 1999. [Online]. Available:
Analysis, ser. ISSTA 2022. New York, NY, USA: Association [Link]
for Computing Machinery, 2022, p. 289–301. [Online]. Available: [45] L. Tahat, B. Vaysburg, B. Korel, and A. Bader, “Requirement-
[Link] based automated black-box test generation,” in 25th Annual
[19] A. Martin-Lopez, S. Segura, and A. Ruiz-Cortés, “A catalogue International Computer Software and Applications Conference.
of inter-parameter dependencies in restful web apis,” in Service- COMPSAC 2001, 2001, pp. 489–495. [Online]. Available:
Oriented Computing: 17th International Conference, ICSOC 2019, [Link]
Toulouse, France, October 28–31, 2019, Proceedings. Berlin, [46] T. Yue, S. Ali, and M. Zhang, “Rtcm: A natural language
Heidelberg: Springer-Verlag, 2019, p. 399–414. [Online]. Available: based, automated, and practical test case generation framework,” in
[Link] 31 Proceedings of the 2015 International Symposium on Software Testing
[20] Y. Liu, Y. Li, G. Deng, Y. Liu, R. Wan, R. Wu, D. Ji, S. Xu, and Analysis, ser. ISSTA 2015. New York, NY, USA: Association
and M. Bao, “Morest: Model-based restful api testing with execution for Computing Machinery, 2015, p. 397–408. [Online]. Available:
feedback,” in Proceedings of the 44th International Conference on [Link]
[47] Y. Zheng, Y. Liu, X. Xie, Y. Liu, L. Ma, J. Hao, and Y. Liu, "Automatic web testing using curiosity-driven reinforcement learning," in Proceedings of the 43rd International Conference on Software Engineering, ser. ICSE '21. IEEE Press, 2021, pp. 423–435. [Online]. Available: [Link]
[48] M. Pan, A. Huang, G. Wang, T. Zhang, and X. Li, "Reinforcement learning based curiosity-driven testing of android applications," in Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2020. New York, NY, USA: Association for Computing Machinery, 2020, pp. 153–164. [Online]. Available: [Link]
[49] Y. Koroglu, A. Sen, O. Muslu, Y. Mete, C. Ulker, T. Tanriverdi, and Y. Donmez, "QBE: QLearning-Based Exploration of Android Applications," in 2018 IEEE 11th International Conference on Software Testing, Verification and Validation, 2018, pp. 105–115. [Online]. Available: [Link]
[50] L. Mariani, M. Pezze, O. Riganelli, and M. Santoro, "Autoblacktest: Automatic black-box testing of interactive applications," in 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation, 2012, pp. 81–90. [Online]. Available: [Link]1109/ICST.2012.88
[51] D. Adamo, M. K. Khan, S. Koppula, and R. Bryce, "Reinforcement learning for android gui testing," in Proceedings of the 9th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation, ser. A-TEST 2018. New York, NY, USA: Association for Computing Machinery, 2018, pp. 2–8. [Online]. Available: [Link]
[52] T. A. T. Vuong and S. Takada, "A reinforcement learning based approach to automated testing of android applications," in Proceedings of the 9th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation, ser. A-TEST 2018. New York, NY, USA: Association for Computing Machinery, 2018, pp. 31–37. [Online]. Available: [Link]
[53] Y. Köroğlu and A. Sen, "Functional test generation from ui test scenarios using reinforcement learning for android applications," Software Testing, Verification and Reliability, vol. 31, Oct. 2020. [Online]. Available: [Link]
[54] M. Kim, "Replication package for the paper 'adaptive rest api testing with reinforcement learning'," 2023. [Online]. Available: [Link]