
Web API Search: Discover Web API and Its Endpoint with Natural Language Queries

Lei Liu, Mehdi Bahrami, Junhee Park, and Wei-Peng Chen

Fujitsu Laboratories of America, Inc., 1240 E Arques Avenue, Sunnyvale, CA 94085, USA
{lliu,mbahrami,jpark,wchen}@fujitsu.com

Abstract. In recent years, Web Application Programming Interfaces (APIs) have become increasingly popular with the development of the Internet industry and software engineering. Many companies provide public Web APIs for their services, and developers can greatly accelerate the development of new applications by relying on such APIs to execute complex tasks without implementing the corresponding functionalities themselves. The proliferation of Web APIs, however, also makes it challenging for developers to search for and discover the desired API and its endpoint. This is a practical and crucial problem because, according to ProgrammableWeb, there are more than 22,000 public Web APIs, each of which may have tens or hundreds of endpoints. Therefore, it is difficult and time-consuming for developers to find the desired API and its endpoint to satisfy their development needs. In this paper, we present an intelligent system for Web API search based on natural language queries, using two-step transfer learning. To train the model, we collect a significant number of sentences from crowdsourcing and utilize an ensemble deep learning model to predict the correct description sentences for an API and its endpoint. A training dataset is built by synthesizing the correct description sentences and is then used to train the two-step transfer learning model for Web API search. Extensive evaluation results show that the proposed methods and system achieve high accuracy in searching for a Web API and its endpoint.

Keywords: Web APIs · Neural networks · Deep learning

1 Introduction

A Web API is an application programming interface exposed via the Web, commonly used as representational state transfer (RESTful) services through HyperText Transfer Protocol (HTTP). As the Internet industry progresses, Web APIs become more concrete with emerging best practices and more popular for modern application development [1]. Web APIs provide an interface for easy software development by abstracting a variety of complex data and web services, which can greatly accelerate application development. Web APIs have also been widely adopted by technology companies due to their inherent flexibility. For example, Twitter offers public APIs to enable third parties to access and analyze historical tweets. Amazon provides free advertising APIs to developers as a way to promote its products. On the other hand, developers also benefit from the burgeoning API economy [2]. Developers can access various datasets and services via Web APIs and incorporate these resources into their development [3].

Fig. 1. Examples of the ProgrammableWeb page of the Dropbox API (left) and documentation for the Dropbox API (right).
Due to these advantages, Web APIs have been widely developed in recent years. According to ProgrammableWeb1, there are more than 22,000 public Web APIs available today, and this number is rapidly increasing. Moreover, an API has a number of endpoints, which specify the locations of the resources that developers need to access to carry out their functions. As the example in Fig. 1 shows, the Dropbox API has 136 endpoints, and each endpoint has its own concrete function. To invoke a given function, an HTTP request has to be sent to the corresponding endpoint using a given HTTP method, as shown in Fig. 1.
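For illustration, the following is a minimal Python sketch of such a request; the endpoint path, headers, and payload are illustrative stand-ins rather than excerpts from the Dropbox documentation.

import requests

# Send an HTTP POST request to a specific endpoint of the Dropbox API.
# The endpoint path and JSON body below are illustrative.
resp = requests.post(
    "https://api.dropboxapi.com/2/files/list_folder",  # endpoint URL (illustrative)
    headers={"Authorization": "Bearer <ACCESS_TOKEN>",
             "Content-Type": "application/json"},
    json={"path": "/Documents"},
)
print(resp.status_code, resp.json())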
The proliferation of Web APIs, however, makes it difficult for developers to search for and discover a desired API and its endpoint. As mentioned above, a developer needs to know the endpoint in order to call an API, so API level search alone is insufficient. In light of this, we focus in this paper on building a Web API search system that provides endpoint level search results based on a natural language query describing a developer's needs. With the proposed dataset collection and generation methods and the two-step transfer learning model, the API search system achieves high accuracy in finding a Web API and its endpoint to satisfy developers' requirements.

2 Related Works and Our Contributions

API search or recommendation for developers has been extensively studied in the past. However, there are several key differences from our work:
1 Available at: https://www.programmableweb.com.

Keyword-Based Search vs. Deep Learning-Based Search: Online Web API platforms such as ProgrammableWeb, Rapid API2, and API Harmony3 provide API search functions. However, these platforms use strict keyword matching to return a list of APIs given a user's query. Strict keyword search (syntactical representation) does not allow users to search semantically. Deep learning-based methods, on the other hand, enable semantic search, which means that users can search content by its meaning in addition to keywords, maximizing the chances that they will find the information they are looking for. For example, the Fitbit API documentation uses the word "glucose". If a user searches for "blood sugar", a keyword-based search engine cannot return the Fitbit API. Thanks to word embeddings, a deep learning-based search engine with semantic representation can handle scenarios where exact keyword matching does not succeed. Supporting semantic search helps developers identify appropriate APIs, particularly during the stage of application development when they do not yet have clear ideas about which specific APIs to utilize, or when they would like to look for other similar APIs.
Programming-Language API Search vs. RESTful Web API Search: In the past decade, many works have investigated search approaches for programming language APIs, for example, Java or C++ APIs. [4] proposed RACK, a programming language API recommendation system that leverages code search queries from Stack Overflow4 to recommend APIs based on a developer's query. API text descriptions have also been used for searching APIs. [5] proposed sourcerer API search to find Java API usage examples in large code repositories. [6] conducted a study using a question answering system to guide developers with unfamiliar APIs. [7] identified the problems of API search via Web search. Based on these observations, the authors presented a prototype search tool called Mica that augments standard Web search results to help programmers find the right API classes and methods given a description of the desired function as input, and to help programmers find examples when they already know which methods to use. Compared to programming language API search, Web API and endpoint search is a new domain, and the major challenge is the lack of training datasets.
Web API-Level Search vs. Endpoint Level Search: Several studies have addressed Web API search in recent years. [8] developed a language model based on collected text, such as API descriptions on the Web, to support API queries. [10] used an API's multi-dimensional descriptions to enhance the search and ranking of Web APIs. The recent works in [11,12] investigated API level recommendation with natural language queries. As multiple APIs are frequently used together for software development, many researchers have also focused on recommending API mashups, or recommending an API based on another API as the input, such as [13–17].

2 Available at: https://rapidapi.com.
3 Available at: https://apiharmony-open.mybluemix.net.
4 Available at: https://stackoverflow.com.

Fig. 2. Procedure for building the training dataset.
However, although there are many existing works on Web API search or recommendation, no study supports API endpoint level discovery with natural language input. In addition, the majority of related works have been carried out on internal APIs or a limited number of APIs, not on a large number of public Web APIs. In practice, a recommendation with only API information is insufficient for developers because, in order to actually use the API in an application design, developers need endpoint level information. A Web API may have tens or hundreds of endpoints. With only API level search results, it is still burdensome for developers to discover which endpoint should be used. To the best of our knowledge, this paper is the first work that proposes endpoint level search over a large number of Web APIs, whereas all previous works only support API level search or recommendation. The novelty and contributions of this work are threefold:

– Currently, the key bottleneck for building a machine learning-based Web API search system is the lack of publicly available training datasets. Although services like API Harmony provide structured API specifications, from which we can collect information such as API and endpoint descriptions to build a training dataset, the majority of existing Web APIs lack such structured specifications. For example, API Harmony only supports 1,179 APIs, which is just a small percentage of all Web APIs. In this paper, we propose a method to collect useful information directly from API documentation and then build a training dataset, which can support more than 9,000 Web APIs for search purposes.
– The information we collect, in particular the endpoint descriptions from API documentation, may contain a lot of noise. We propose deep learning methods to predict correct API endpoint description sentences. The evaluation results show that decent accuracy can be achieved.
– We propose a two-step transfer learning method to support endpoint level Web API search, whereas all previous works only support API level search. The evaluation results show that our proposed model achieves high search accuracy.

3 Training Dataset Generation


In this section, the method for collecting and generating the training dataset is
detailed. The entire procedure is shown in Fig. 2.

Fig. 3. Correct endpoint description sentence span from endpoint name.

Fig. 4. Ensemble LSTM+ANN model to support endpoint description prediction.

3.1 Crowdsourcing from ProgrammableWeb


The information collection starts from ProgrammableWeb, which provides a directory service for more than 22,000 Web APIs. We use web scraping to extract API data for each Web API. As an example, Fig. 1 shows the ProgrammableWeb page for the Dropbox API. From this page, we can collect the API title, API description, and API keywords, which are essential for API level search. However, as we target endpoint level search, we also need to collect descriptions for each endpoint of the API, which are not available on ProgrammableWeb. On the ProgrammableWeb page, though, we can find a URL for the documentation of the given API. In general, API documentation includes the list of endpoints and the description of each endpoint, as in the example of the Dropbox API documentation shown in Fig. 1. Therefore, by further scraping the API documentation, we can collect data regarding endpoint descriptions.
However, by checking the documentation of many Web APIs, we found that identifying the correct endpoint descriptions is a challenge. This is because there is no standard template for providers to write API documentation, and the quality of API documentation varies significantly. In good quality, well-structured API documentation (such as the Dropbox API documentation in Fig. 1), the endpoint descriptions may directly follow the endpoint name, and there is a "description" tag for readers to easily find the correct endpoint descriptions. In poor quality or poorly structured API documentation, on the other hand, the endpoint descriptions may appear at some distance from the endpoint name without any tag, and are relatively hard to find.
We take the API Harmony data as an example to evaluate whether there is any pattern in the distances between an endpoint name and its descriptions in API documentation. API Harmony is a catalog service of 1,179 Web APIs, which provides structured data for these APIs according to the OpenAPI Specification (formerly Swagger Specification). From the structured data for a given API, we can retrieve its endpoints and the descriptions of each endpoint. We can assume these descriptions are ground-truth endpoint descriptions, as they have been evaluated by users. We also collect the corresponding API documentation (i.e., a set of HTML pages) for each given API from the websites of API providers using Web crawlers. After that, we check where the endpoint description is located in the corresponding documentation and how far it is from the endpoint name. The result is depicted in Fig. 3. It can be seen that only about 30% of endpoint descriptions are the first sentence after the endpoint name. From this result, we can see that the correct endpoint descriptions can be 1 to 6 sentences before or after the endpoint name, so there is no particular pattern regarding the sentence distance between an endpoint name and its correct endpoint descriptions in API documentation.
Based on the observations from Fig. 3, we define two terms: raw endpoint descriptions and correct endpoint descriptions. Raw endpoint descriptions are the sentences surrounding endpoint names in Web API documentation. For example, according to Fig. 3, we may define raw endpoint descriptions to include 12 sentences (6 sentences before and 6 sentences after) for each endpoint name in API documentation. Among these raw endpoint descriptions, one or more sentences accurately describe the endpoint functions, and these sentences are referred to as correct endpoint descriptions. Such correct endpoint descriptions are essential in order to achieve accurate endpoint level search.
Due to the challenge of identifying the correct endpoint descriptions from raw endpoint descriptions, as well as the rapid growth of Web APIs, we need an automatic approach that can predict the correct endpoint descriptions from API documentation. To address this issue, Sects. 3.2 and 3.3 present how to collect raw endpoint descriptions from API documentation and how to predict correct endpoint descriptions from raw endpoint descriptions, respectively.

3.2 Collection of Raw API Endpoint Descriptions from API Documentation

As mentioned above, raw endpoint descriptions are the sentences surrounding each endpoint name in API documentation. An API may have a list of endpoints (E), and each endpoint is defined as E_i. For each API, we extract each of its endpoints (E_i) and the raw description for this endpoint (E_i,D) from the API documentation, for a large number of APIs. We use a regular expression method, similar to [8], to extract the list of API endpoints (E). Regarding E_i,D, different API providers use different approaches to list endpoints and explain their descriptions. For example, some use semi-structured information to list endpoints, some explain an endpoint and its description in a paragraph, and some use the endpoint as a header and explain the description below it. Our objective in information extraction is to collect all possible E_i,D for each endpoint E_i in each HTML page, where i ∈ [1, ..., l] and l represents the total number of endpoints in one HTML page. We extract information from semi-structured pages by processing HTML table tags and table headers. To this end, we define a placeholder P_i that contains both E_i and E_i,D. P_i represents a section of the HTML page which appears between two HTML headers of the same level (h1, h2, ..., h6), such that E_i is located in the section. A raw endpoint description is then the text around the API endpoint, E_i,D = [S_M, S_N], which denotes the M sentences before and the N sentences after the appearance of E_i inside P_i. Algorithm 1 details the extraction of raw endpoint descriptions for a given endpoint list. By using the proposed method and setting M = 6 and N = 6, we collected 2,822,997 web pages totaling 208.6 GB for more than 20,000 public APIs. Such huge raw data contains a lot of noise. Therefore, in Sect. 3.3, we propose a deep learning method to predict and filter the correct endpoint descriptions from the raw descriptions.

Algorithm 1. Raw Endpoint Description Extraction

procedure GetRawDesc(html, E, M, N)   ▷ extracts raw endpoint description E_i,D for all E_i ∈ E
  root = html.root
  for E_i in E do
    Tag = find tag of E_i in html
    if Tag != Null then
      while Tag != root do
        for h_i in [h1, h2, ..., h6] do
          if h_i == Tag.name then
            P_i = Tag; break
        Tag = Tag.parent
      sents = sent_token(P_i.content)
      pos = find(E_i) in sents
      raw_sents = sents[pos − M : pos + N]
  return raw_sents

3.3 Prediction of Correct Endpoint Descriptions


Deep Learning Models: We propose a deep learning method to predict the correct endpoint description sentences from the raw endpoint descriptions of each endpoint. To train the deep learning model, we generate a training dataset based on API Harmony. For each endpoint of the APIs in API Harmony, a correct description is presented, which can be considered ground-truth. For each API in API Harmony, we directly collect its documentation. After that, for each endpoint of the API, we extract the M sentences before and N sentences after the given endpoint name from the documentation, using the algorithm presented in Sect. 3.2. Next, we compare the similarity of each of these M + N sentences with the ground-truth endpoint description in API Harmony using spaCy5, which calculates the similarity score by comparing word vectors. The sentence with the highest similarity can be considered the correct endpoint description (i.e., the ground-truth selected by API Harmony). The remaining M + N − 1 sentences, which are not selected by API Harmony as endpoint descriptions, can be treated as incorrect endpoint descriptions.
If the ground-truth endpoint description in API Harmony for a given endpoint contains K sentences, where K > 1, "K-grams" of the M sentences before and N sentences after need to be generated. For example, if K = 2, which means the ground-truth endpoint description contains two sentences (GT_S1, GT_S2), we need to collect "2-gram" sentence pairs (T_i, T_j) from the API documentation, such as (before 3rd, before 2nd), (before 2nd, before 1st), (after 1st, after 2nd), (after 2nd, after 3rd), where "before 3rd" means the 3rd sentence before the endpoint name. After that, the average similarity score is computed according to the following equation:

Sim_score = (Sim(GT_S1, T_i) + Sim(GT_S1, T_j) + Sim(GT_S2, T_i) + Sim(GT_S2, T_j)) / 4    (1)

where Sim represents the similarity score between the two given inputs. Similarly, the "K-gram" with the highest similarity can be considered the correct endpoint description (i.e., the one selected by API Harmony). The remaining "K-grams" which are not selected by API Harmony can be treated as incorrect endpoint descriptions.
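The following is a minimal sketch of this selection step, assuming spaCy's en_core_web_md model for word vectors; the helper names (kgram_score, select_correct, ground_truth) are ours for illustration, not from the paper's implementation.

import spacy

nlp = spacy.load("en_core_web_md")  # medium English model ships with word vectors

def kgram_score(ground_truth, candidate):
    # Average pairwise similarity between the K ground-truth sentences
    # and a K-gram of candidate sentences, generalizing Eq. (1).
    gt_docs = [nlp(s) for s in ground_truth]
    cand_docs = [nlp(s) for s in candidate]
    scores = [g.similarity(c) for g in gt_docs for c in cand_docs]
    return sum(scores) / len(scores)

def select_correct(ground_truth, raw_sentences):
    # Slide a window of K consecutive sentences over the raw descriptions
    # and pick the K-gram with the highest average similarity score.
    k = len(ground_truth)
    kgrams = [raw_sentences[i:i + k] for i in range(len(raw_sentences) - k + 1)]
    return max(kgrams, key=lambda kg: kgram_score(ground_truth, kg))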
For each correct or incorrect endpoint description (labeled 1 or 0), we compute the following features to be used in the deep learning models:

– Endpoint Vector: vector representation of the endpoint name.
– Description Vector: vector representation of the correct or incorrect description sentence.
– HTTP Verb: HTTP method verbs (such as GET, POST, PUT, PATCH, DELETE, OPTIONS, HEAD) present in the given sentence. If no such verb appears in the sentence, mark it as NONE. These keywords are one-hot encoded.
– Cosine Similarity: cosine similarity between the Endpoint and Description Vectors.
– spaCy Similarity: the average similarity score between the endpoint and the description text, calculated by spaCy.
– HTML Section: whether the given sentence and the endpoint name are in the same HTML section, determined by checking the HTML tags. If yes, mark the sentence as "1", otherwise "0".
– Description Tag: whether there is any HTML header tag named "Description" or "description" in the given HTML section. If yes, mark the sentence as "1", otherwise "0".
– Number of Tokens: number of words, including special characters, in the description.
– Endpoint Count: number of times the endpoint name is found in the given sentence.

5 Available at: https://spacy.io/.
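The sketch below shows how these features might be assembled into a single vector per sentence; the helper names (build_features, embed, spacy_sim) and the exact feature ordering are assumptions for illustration, not the paper's code.

import numpy as np

HTTP_VERBS = ["GET", "POST", "PUT", "PATCH", "DELETE", "OPTIONS", "HEAD", "NONE"]

def one_hot_verb(sentence):
    # One-hot encode the first HTTP verb found in the sentence (else NONE).
    verb = next((v for v in HTTP_VERBS[:-1] if v in sentence.split()), "NONE")
    vec = np.zeros(len(HTTP_VERBS))
    vec[HTTP_VERBS.index(verb)] = 1.0
    return vec

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def build_features(endpoint, sentence, embed, spacy_sim, same_section, desc_tag):
    # embed() is assumed to return the averaged GloVe vector of a phrase.
    ep_vec = embed(endpoint)    # Endpoint Vector
    desc_vec = embed(sentence)  # Description Vector
    return np.concatenate([
        ep_vec, desc_vec, one_hot_verb(sentence),     # HTTP Verb (one-hot)
        [cosine(ep_vec, desc_vec),                    # Cosine Similarity
         spacy_sim,                                   # spaCy Similarity
         float(same_section), float(desc_tag),        # HTML Section, Description Tag
         len(sentence.split()),                       # Number of Tokens
         sentence.count(endpoint)],                   # Endpoint Count
    ])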

Note that the features for the description classification model were selected by observing ground-truth examples. For instance, we observed that in many cases the endpoint name and its description are similar, so we utilize spaCy similarity as one of the features. Many convincing examples can be found: in the Gmail API, the endpoint "/userId/drafts" has the description "lists the drafts in the user's mailbox"; in the Spotify API, the endpoint "/albums" has the description "get several albums"; etc. The other features are based on similar observations.

Fig. 5. Proposed deep learning models to predict correct endpoint descriptions: (a)
CNN+ANN; (b) LSTM+ANN.

In many cases, the endpoint name is a concatenation of multiple words, such as "/UserID". To compute word vectors for it, we first split such endpoint names into individual words, such as "User" and "ID", and then obtain the corresponding word embeddings. The splitting is achieved by building a word frequency list from all the words that appear in API and endpoint description sentences. We assume that words with higher frequency have a lower splitting cost. Using dynamic programming, we can then split such endpoint names so as to minimize the overall splitting cost. The GloVe [18] pre-trained word vectors with 6B tokens, a 400K vocabulary, and 300-dimensional vectors are used for word embedding. If the endpoint or description has multiple words, we obtain its vector by averaging the embeddings of all the words.
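A minimal sketch of this dynamic-programming split follows, assuming a negative log-frequency cost (consistent with higher frequency implying lower cost); word_freq is a hypothetical frequency dictionary built from the description sentences.

import math

def split_endpoint(name, word_freq, max_len=20):
    # Split a concatenated endpoint name (e.g. "/UserID") into words,
    # minimizing the total cost, where frequent words cost less.
    s = name.strip("/").lower()
    total = sum(word_freq.values())
    cost = lambda w: -math.log(word_freq[w] / total) if w in word_freq else 1e9
    # best[i] = (minimal cost to segment s[:i], split point j)
    best = [(0.0, 0)] + [(float("inf"), 0)] * len(s)
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_len), i):
            c = best[j][0] + cost(s[j:i])
            if c < best[i][0]:
                best[i] = (c, j)
    words, i = [], len(s)
    while i > 0:  # backtrack through the stored split points
        j = best[i][1]
        words.append(s[j:i])
        i = j
    return list(reversed(words))

# e.g. split_endpoint("/UserID", {"user": 50, "id": 80}) -> ["user", "id"]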
The deep learning models predict whether a given sentence is a correct endpoint description or not. Figure 5(a) presents our convolutional neural network (CNN)+ANN model, and Fig. 5(b) shows a long short-term memory (LSTM)+ANN model. Here, ANN refers to an artificial neural network with Dense, Batch Normalization, and Dropout layers. The inputs to the models are the features described above, and the output is a binary indication (0 or 1) representing whether the sentence is a correct endpoint description for the given endpoint. In the CNN+ANN model, the endpoint vectors and description vectors are sent to CNNs, and all other features are sent to an ANN. The outputs of the CNNs and the ANN are merged and then sent to another ANN with multiple Dense, Dropout, and Batch Normalization layers. The overall architecture of the LSTM+ANN model is similar to that of the CNN+ANN model. The only difference is that the endpoint vector and description vector features are sent to LSTM networks rather than CNNs, as shown in Fig. 5(a) and Fig. 5(b), respectively.
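The following Keras sketch illustrates the LSTM+ANN classifier of Fig. 5(b); the layer sizes and dropout rates are illustrative assumptions, since the paper specifies only the overall architecture (two LSTM branches for the endpoint and description vectors, an ANN branch for the remaining features, merged into a final ANN with a binary output).

from tensorflow.keras import layers, models

def build_lstm_ann(seq_len=25, emb_dim=300, n_other_features=12):
    ep_in = layers.Input(shape=(seq_len, emb_dim), name="endpoint_vectors")
    desc_in = layers.Input(shape=(seq_len, emb_dim), name="description_vectors")
    other_in = layers.Input(shape=(n_other_features,), name="other_features")

    # Sequence features go through LSTMs (CNNs in the Fig. 5(a) variant).
    ep_branch = layers.LSTM(64)(ep_in)
    desc_branch = layers.LSTM(64)(desc_in)

    # Remaining scalar features go through an ANN branch.
    ann = layers.Dense(32, activation="relu")(other_in)
    ann = layers.BatchNormalization()(ann)
    ann = layers.Dropout(0.4)(ann)

    # Merge all branches and classify.
    merged = layers.concatenate([ep_branch, desc_branch, ann])
    x = layers.Dense(64, activation="relu")(merged)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.4)(x)
    out = layers.Dense(1, activation="sigmoid", name="is_correct_description")(x)

    model = models.Model([ep_in, desc_in, other_in], out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model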

Performance Evaluation of Models: Using the above method, we extract the training dataset from API documentation, as summarized in Table 1. Note that correct and incorrect endpoint description sentences are imbalanced, because only a small percentage of sentences are correct endpoint description sentences (selected by API Harmony), whereas all the remaining sentences are incorrect description sentences. Therefore, we collect more incorrect endpoint description sentences than correct endpoint description sentences.

Table 1. Collected training dataset.

Training dataset                           # of records
Correct endpoint description sentences     5,464
Incorrect endpoint description sentences   33,757

Table 2. Testing results for the deep learning models in Fig. 5 and traditional machine learning models.

Models                  Testing accuracy
Decision Tree [19]      76.64%
Random Forest [20]      79.92%
CNN+ANN (Fig. 5(a))     90.31%
LSTM+ANN (Fig. 5(b))    98.13%

Since the training dataset of correct and incorrect endpoint description sentences is imbalanced, we first randomly select 5,588 sentences out of the 33,757 incorrect endpoint description sentences and, together with the 5,464 correct endpoint description sentences, train the deep learning models depicted in Fig. 5. We use 65%, 20%, and 15% of the dataset for training, validation, and testing, respectively. The testing results are shown in Table 2: both the CNN+ANN and LSTM+ANN models achieve more than 90% testing accuracy, and the LSTM+ANN model outperforms the CNN+ANN model. For comparison purposes, we also evaluate the performance of two traditional learning models: Decision Tree and Random Forest. A Decision Tree is a flowchart graph or diagram that helps explore all of the decision alternatives and their possible outcomes [19]. Random Forest is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class based on voting [20]. The testing results in Table 2 show that the proposed deep learning models greatly outperform traditional learning models such as Decision Tree and Random Forest.

Blind Testing and Model Improvement: In the above testing, the training and testing datasets are all retrieved from documentation related to the APIs included in API Harmony. However, API Harmony only covers a small percentage of Web APIs, and most of these APIs are from big providers, which are likely to have high-quality documentation. Since we are targeting a wider coverage of Web APIs in the recommendation system, it is essential to evaluate the model performance over a large API documentation corpus, in particular for APIs not covered by API Harmony.
To conduct this blind testing, we manually labeled 632 sentences in the documentation of APIs that are not covered by API Harmony. We compute all the features of these 632 sentences and send them as input to the trained LSTM+ANN model described above. The results are summarized in Table 3. From the results, we can see that with only one trained model, the blind testing performance is poor, as the model cannot distinguish the incorrect endpoint descriptions well. The reason is that when we trained the model, we used random under-sampling in order to balance the training dataset between correct and incorrect description sentences. However, this method may discard potentially useful information that could be important for training the model. The samples chosen by random under-sampling may be biased, and thus may not accurately represent, or provide sufficient coverage of, incorrect descriptions, causing inaccurate results. To improve the model to cover a wider range of APIs, we apply an ensemble approach, as shown in Fig. 4.

Table 3. Blind testing results of the ensemble method with multiple models.

# of models   Accuracy   Recall   Precision
1 model       31.80%     96.05%   14.57%
3 models      78.80%     69.74%   32.32%
5 models      80.70%     76.32%   35.80%
7 models      84.97%     82.99%   43.45%

Table 4. Summary of training dataset.

Dataset            Number of APIs   Number of endpoints   Number of queries
API Harmony        1,127            9,004                 232,296
Popular API List   1,603            12,659                447,904
Full API List      9,040            49,083                1,155,821

In Fig. 4, each model M_i is one trained model as described above. Here, the LSTM+ANN model is used, as it outperforms CNN+ANN. Each M_i is trained with the correct endpoint description sentences and a different set of incorrect endpoint description sentences. This is achievable because we have many more incorrect endpoint description sentences than correct ones. In this way, each M_i makes different decisions based on the learned features. The models predict independently and vote to jointly decide whether an input sentence is a correct endpoint description or not.
Table 3 shows the performance of the ensemble approach. It can be seen that the ensemble approach improves the overall performance in terms of accuracy and precision compared with a single model. Moreover, the ensemble approach with 7 models outperforms the others and is used in the rest of this paper. The only remaining issue is that some incorrect endpoint descriptions are wrongly predicted as correct, which results in more false-positive predictions and introduces some noise into the training dataset of the API search model.
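A minimal sketch of the voting scheme, assuming a 0.5 decision threshold and a simple majority rule over the independently trained models:

import numpy as np

def ensemble_predict(models, inputs):
    # Each model votes on whether the sentence is a correct description;
    # the ensemble returns the majority decision.
    votes = [int(np.ravel(m.predict(inputs, verbose=0))[0] > 0.5) for m in models]
    return int(sum(votes) > len(models) / 2)

# e.g. with 7 models trained on 7 different samples of incorrect descriptions:
# label = ensemble_predict(models_7, features_for_one_sentence)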

3.4 Synthesizing Queries for Training Dataset

In the previous steps, we collected data on API titles, API keywords, API descriptions, and correct endpoint descriptions. The API descriptions and correct endpoint descriptions may contain many sentences. Therefore, we first conduct sentence tokenization and then, for each tokenized sentence, carry out text normalization, including word stemming and lemmatization, and removing stop words, symbols, special characters, HTML tags, unnecessary spaces, and very short description sentences with only one word. After that, these processed sentences are used to build the training dataset.
We consider four major types of queries a developer may use to search for a Web API:

– Question type queries: developers may enter a question to search for a Web API; for example, a question type query might be "which API can get glucose?"
– Command type queries: instead of asking a question, developers may directly enter a command type query to search for an API, such as "get weather information."
– Keyword type queries: in many cases, developers may just input a few keywords to search for an API. One example query is "fitness, health, wearable."
– API title-based queries: in some cases, developers already have an idea of which API to use and just need to search for an endpoint of that API. One example of such a query is "post photos to Instagram." In this case, the search engine should return the endpoint of the Instagram API, rather than the endpoint of some other similar API.

We define rule-based methods to synthesize training queries based on part-of-speech (POS) tagging and dependency parsing (also known as syntactic parsing). POS tagging is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context. Dependency parsing is the task of recognizing a sentence and assigning a syntactic structure to it. Figure 6(a) shows an example sentence with its POS tagging and dependency parsing results, which can be generated by many NLP tools such as spaCy, NLTK, CoreNLP, etc. In this work, we use spaCy, and the annotations of the POS tagging6 and dependency parsing7 can be found in the spaCy documentation.
Considering that most sentences in API descriptions and endpoint descriptions are long, whereas in practice developers are unlikely to enter a very long query into a search engine, we use POS tagging and dependency parsing to synthesize simplified question-type and command-type queries. We defined several rules, and if a description sentence satisfies a rule, simplified question-type and command-type queries are generated. For example, one rule is defined as

for subject in sentence:
    if subject.dep == nsubj and subject.head.pos == VERB:
        # Simplified question-type query:
        Q_query = "Which endpoint " + VERB + dobj NOUN phrase
        # Simplified command-type query:
        C_query = VERB + dobj NOUN phrase

Such a rule is feasible because the syntactic relations form a tree in which every word has exactly one head. We can therefore iterate over the arcs in the dependency tree by iterating over the words in the sentence. If the original endpoint description sentence is "this endpoint gets a music playlist according to an artist ID," this rule generates the simplified question-type query "which endpoint get a music playlist?" and the simplified command-type query "get a music playlist". The training dataset includes the API and endpoint description sentences as well as the simplified question-type and command-type queries. If an API or endpoint description sentence does not match any of the pre-defined rules, no simplified question-type or command-type query can be generated; in this case, only the API or endpoint description sentences are included in the training dataset.
The keyword-based queries are generated from the API keywords collected from ProgrammableWeb. For example, the Spotify API has two category keywords, "music" and "data mining", on ProgrammableWeb, so the keyword query can be "music, data mining". A keyword-based query can also be generated by concatenating the noun phrases of an API or endpoint description sentence. Given the same example, "this endpoint gets a music playlist according to an artist ID," the corresponding keyword-based query is "this endpoint, a music playlist, an artist ID."
The API title-based queries are generated using the API titles collected from ProgrammableWeb. In addition, to emulate an API title-based query, we also attach the API title to the end of the short question-type and command-type queries. For the same example, the API title-based queries are "which endpoint get a music playlist with Spotify?" and "get a music playlist with Spotify."
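The following runnable spaCy sketch pulls these pieces together for one description sentence; it implements only the nsubj/VERB rule and the noun-chunk concatenation described above, and the function name synthesize_queries is ours for illustration.

import spacy

nlp = spacy.load("en_core_web_sm")

def synthesize_queries(description, api_title=None, pw_keywords=None):
    doc = nlp(description)
    queries = []
    for tok in doc:
        if tok.dep_ == "nsubj" and tok.head.pos_ == "VERB":
            verb = tok.head
            dobj_phrase = next(
                (" ".join(w.text for w in d.subtree)
                 for d in verb.children if d.dep_ == "dobj"), "")
            if dobj_phrase:
                queries.append(f"which endpoint {verb.text} {dobj_phrase}?")  # question type
                queries.append(f"{verb.text} {dobj_phrase}")                  # command type
                if api_title:
                    queries.append(f"{verb.text} {dobj_phrase} with {api_title}")  # title-based
    # keyword type: ProgrammableWeb categories, or concatenated noun chunks
    if pw_keywords:
        queries.append(", ".join(pw_keywords))
    queries.append(", ".join(ch.text for ch in doc.noun_chunks))
    return queries

# synthesize_queries("this endpoint gets a music playlist according to an artist ID",
#                    api_title="Spotify", pw_keywords=["music", "data mining"])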

6 https://spacy.io/api/annotation#pos-tagging.
7 https://spacy.io/api/annotation#dependency-parsing.

Using the proposed methods, we can build a training dataset for API/endpoint search. The dataset has three columns: the query, its corresponding API, and its corresponding endpoint. Note that we cannot judge which endpoint should be used for synthesized queries derived from the API description, API title, or ProgrammableWeb keywords. In these cases, the endpoint field in the training dataset is marked as "N/A".

4 Web API Search: Deep Learning Model and Performance Evaluation
4.1 Deep Learning Model
The goal of the API search engine is to find both the API and its endpoint based on a natural language query. In this paper, we propose a two-step transfer learning method. The proposed model was designed through a performance comparison of multiple architectures, and we selected the model with the best performance. Figures 6(b) and (c) show the first step, which predicts the API and the endpoint separately, and Fig. 6(d) shows the second step, which predicts the API and its endpoint jointly by reusing the models trained in the first step. Recurrent neural network models have been widely used for text modeling. In particular, LSTM is gaining popularity as it specifically addresses the issue of learning long-term dependencies [21]. Therefore, we implement the recommendation model based on LSTM networks. As shown in Fig. 6(b) and 6(c), the LSTM models in the first step include four layers: an input layer to instantiate a tensor and define the input shape, an embedding layer to convert tokens into vectors, a bidirectional LSTM layer to access the long-range context in both input directions, and a dense layer with a softmax activation function to linearize the output into a prediction result.

Fig. 6. (a) Ensemble LSTM+ANN model to support endpoint description prediction; (b) Bidirectional LSTM model to predict API (first step); (c) Bidirectional LSTM model to predict endpoint (first step); (d) Final model to predict both API and its endpoint (second step).
The training data used to train the models in Fig. 6(b) and Fig. 6(c) are different. The model in Fig. 6(b) is trained on queries and their corresponding APIs, while the model in Fig. 6(c) is trained on queries and their corresponding endpoints. We fine-tune the parameters of the models to keep the training process consistent across models. We train all models with a batch size of 512 examples. The maximum sentence length is 25 tokens. The GloVe [18] pre-trained word vectors with 6B tokens, a 400K vocabulary, and 300-dimensional vectors are used for word embedding. A sentence is represented by averaging the embeddings of all its words. We choose 500 hidden units for the bidirectional LSTM.
After the two models in Fig. 6(b) and Fig. 6(c) are trained and well-fitted, we reuse them in the second step to build the final model, which predicts both the API and its endpoint simultaneously. The network architecture is shown in Fig. 6(d). The last dense layer of the models in Fig. 6(b) and Fig. 6(c) is removed, and the remaining layers are reused in the final model. The parameters and weights in the corresponding layers are frozen, meaning that these layers are not trainable when we train the final model in step 2. A concatenate layer is deployed to merge the outputs of the two bidirectional LSTM layers. In turn, the output of the concatenate layer is sent to a neural network with dense and dropout layers. The final prediction is given by the last dense layer with a softmax activation function. The loss function is categorical cross-entropy. The query and API/endpoint combinations in the training dataset are used to train the final model. We set the dropout rate to 0.4 and use early stopping to avoid overfitting. The sizes of the output layers of the models in Fig. 6(b), Fig. 6(c), and Fig. 6(d) equal the number of APIs, the number of endpoints, and the number of API/endpoint pairs, respectively.
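A hedged Keras sketch of this second step is given below; layer sizes are assumptions beyond the stated 25-token inputs and 0.4 dropout rate, and it presumes each first-step model exposes its bidirectional LSTM output as the second-to-last layer.

from tensorflow.keras import layers, models

def build_final_model(api_model, endpoint_model, n_pairs):
    # Drop the last (softmax) dense layer of each first-step model and
    # keep everything up to the bidirectional LSTM output.
    api_trunk = models.Model(api_model.input, api_model.layers[-2].output)
    ep_trunk = models.Model(endpoint_model.input, endpoint_model.layers[-2].output)
    for layer in api_trunk.layers + ep_trunk.layers:
        layer.trainable = False  # frozen: not trainable in step 2

    query_in = layers.Input(shape=(25,), name="query_tokens")
    merged = layers.concatenate([api_trunk(query_in), ep_trunk(query_in)])
    x = layers.Dense(512, activation="relu")(merged)
    x = layers.Dropout(0.4)(x)
    out = layers.Dense(n_pairs, activation="softmax", name="api_endpoint_pair")(x)

    model = models.Model(query_in, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model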

4.2 Performance Evaluation

To evaluate the performance of API search, we test the model on the following datasets:

– API Harmony: As mentioned above, we consider API Harmony the ground-truth dataset for API endpoint descriptions. The testing results of the API search model on API Harmony therefore validate the overall efficiency of the proposed model when the training dataset is accurate.
– Full API list: The full API list is the comprehensive dataset we collected in Sect. 3. It covers 9,040 APIs. This number is smaller than the number of APIs in ProgrammableWeb because some of the documentation links in ProgrammableWeb cannot be accessed or the documentation lacks endpoint information.
– Popular API list: We use the full API list and extract metadata about the APIs from ProgrammableWeb and GitHub to rank the APIs in the full API list by popularity. The popularity rank is computed from the following extracted subjects, each of which contributes a number for a given API: (1) number of SDKs, (2) number of articles, (3) number of changelogs, (4) number of sample source codes, (5) number of "how to" articles, (6) number of libraries, (7) number of developers, (8) number of followers, and (9) number of GitHub projects using the API. Items (1)–(8) are collected directly from ProgrammableWeb. Item (9) is collected from GitHub projects by searching for the API's host address and base path via GitHub APIs. The numbers collected in (1)–(9) are normalized and given equal weight when ranking API popularity. Based on the final ranking, we select the top 1,000 APIs for this dataset. If the providers of the top 1,000 APIs have other APIs that are not ranked in the top 1,000, we also add those APIs to this dataset. The resulting popular API list covers 1,603 APIs, which can be considered the most popular Web APIs.
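A minimal sketch of this ranking step, assuming min-max normalization (the text states normalization and equal weights but not the exact scheme):

import numpy as np

def popularity_scores(counts):
    # counts: (n_apis, 9) array holding the nine metadata counts per API.
    counts = np.asarray(counts, dtype=float)
    lo, hi = counts.min(axis=0), counts.max(axis=0)
    normalized = (counts - lo) / np.where(hi > lo, hi - lo, 1.0)
    return normalized.mean(axis=1)  # equal weight over the nine subjects

# top 1,000 APIs: np.argsort(-popularity_scores(counts))[:1000]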

Performance Evaluation: A summary of the training datasets for API Harmony, the popular API list, and the full API list is shown in Table 4. The training dataset is split into 80% for training and 20% for testing. The testing accuracy of the API search model is shown in Table 5. In this table, the top 1 accuracy is the probability that the correct API/endpoint is ranked as the first search result. Similarly, the top 10 accuracy is the probability that the correct API/endpoint is ranked among the first 10 search results. All the APIs/endpoints in the search results are ranked by the probability score given by the softmax function. These results show that the proposed method achieves very good accuracy for endpoint level search.
We compare the performance of the proposed two-step transfer learning with models that use a traditional LSTM [21] or bi-LSTM [22] to recommend both the API and its endpoint, using the API Harmony dataset. The results are shown in Table 6, which validates that the proposed two-step transfer learning model outperforms the LSTM and bi-LSTM models in terms of endpoint search accuracy.

Table 5. API/endpoint search accuracy.

Input dataset      Top 1 accuracy   Top 10 accuracy
API Harmony        91.13%           97.42%
Popular API List   82.72%           93.81%
Full API List      78.85%           89.69%

Table 6. Comparison of the proposed two-step transfer learning model with LSTM and bi-LSTM for endpoint level search using the API Harmony dataset.

Model                        Top 1 accuracy   Top 10 accuracy
LSTM                         72.15%           80.27%
Bi-LSTM                      74.48%           82.98%
Two-step transfer learning   91.13%           97.42%

5 Conclusions

In this paper, we propose novel approaches that support an end-to-end procedure for building a Web API search system over a large number of public APIs. To the best of our knowledge, this is the first work that provides API endpoint level search with large API coverage (over 9,000 APIs) and high search accuracy. Our future work is to open the system to the public and collect users' queries and feedback. It is worth noting that the problem of Web API search is very practical for both academia and industry. Considering that state-of-the-art works only have small API coverage (e.g., 1,179 APIs in API Harmony), constructing an API search system with 9,040 APIs and 49,083 endpoints is a significant improvement for this application. As Web APIs are rapidly growing and becoming more and more important for future software engineering, we hope the proposed application and its associated methods will be beneficial for the whole community.

References

1. Richardson, L., Ruby, S.: RESTful Web Services. O'Reilly Media Inc., Reading (2008)
2. Tan, W., Fan, Y., Ghoneim, A., et al.: From the service-oriented architecture to the Web API economy. IEEE Internet Comput. 20(4), 64–68 (2016)
3. Verborgh, R., Dumontier, M.: A Web API ecosystem through feature-based reuse. IEEE Internet Comput. 22(3), 29–37 (2018)
4. Rahman, M.M., Roy, C., Lo, D.: RACK: automatic API recommendation using crowdsourced knowledge. In: IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), pp. 349–359. IEEE (2016)
5. Bajracharya, S., Ossher, J., Lopes, C.: Searching API usage examples in code repositories with sourcerer API search. In: ICSE Workshop on Search-driven Development: Users, Infrastructure, Tools and Evaluation, pp. 5–8. ACM (2010)
6. Duala-Ekoko, E., Robillard, M.: Asking and answering questions about unfamiliar APIs: an exploratory study. In: 34th International Conference on Software Engineering (ICSE), pp. 266–276. IEEE (2012)
7. Stylos, J., et al.: MICA: a web-search tool for finding API components and examples. In: Visual Languages and Human-Centric Computing, pp. 195–202. IEEE (2006)
8. Bahrami, M., et al.: API learning: applying machine learning to manage the rise of API economy. In: Proceedings of the Web Conference, pp. 151–154. ACM (2018)
9. Gu, X., et al.: Deep API learning. In: Proceedings of the International Symposium on Foundations of Software Engineering, pp. 631–642. ACM (2016)
10. Bianchini, D., De Antonellis, V., Melchiori, M.: A multi-perspective framework for web API search in enterprise mashup design. In: Salinesi, C., Norrie, M.C., Pastor, Ó. (eds.) CAiSE 2013. LNCS, vol. 7908, pp. 353–368. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38709-8_23
11. Su, Y., et al.: Building natural language interfaces to web APIs. In: Proceedings of the Conference on Information and Knowledge Management, pp. 177–186. ACM (2017)
12. Lin, C., Kalia, A., Xiao, J., et al.: NL2API: a framework for bootstrapping service recommendation using natural language queries. In: IEEE International Conference on Web Services (ICWS), pp. 235–242. IEEE (2018)
13. Torres, R., Tapia, B.: Improving web API discovery by leveraging social information. In: IEEE International Conference on Web Services, pp. 744–745. IEEE (2011)
14. Cao, B., et al.: Mashup service recommendation based on user interest and social network. In: International Conference on Web Services, pp. 99–106. IEEE (2013)
15. Li, C., et al.: A novel approach for API recommendation in mashup development. In: International Conference on Web Services, pp. 289–296. IEEE (2014)
16. Gao, W., et al.: Manifold-learning based API recommendation for mashup creation. In: International Conference on Web Services, pp. 432–439. IEEE (2015)
17. Yang, Y., Liu, P., Ding, L., et al.: ServeNet: a deep neural network for web service classification. arXiv preprint arXiv:1806.05437 (2018)
18. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. ACM (2014)
19. Apté, C., Weiss, S.: Data mining with decision trees and decision rules. Future Gener. Comput. Syst. 13(2–3), 197–210 (1997)
20. Shi, T., Horvath, S.: Unsupervised learning with random forest predictors. J. Comput. Graph. Stat. 15(1), 118–138 (2006)
21. Jozefowicz, R., et al.: An empirical exploration of recurrent network architectures. In: International Conference on Machine Learning, pp. 2342–2350 (2015)
22. Schuster, M., Paliwal, K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)
