AIOS Compiler - LLM As Interpreter For Natural Language Programming and Flow Programming of AI Agents
Shuyuan Xu∗ (Rutgers University, shuyuan.xu@rutgers.edu)
Zelong Li∗ (Rutgers University, zelong.li@rutgers.edu)
∗ Both authors contributed equally to this work. † Corresponding author.
Abstract
Since their inception, programming languages have trended toward greater readability and lower barriers for programmers. Following this trend, natural language can be a promising type of programming language that provides great flexibility and usability and helps democratize programming. However, the inherent vagueness, ambiguity, and verbosity of natural language pose significant challenges in developing an interpreter that can accurately understand the programming logic and execute instructions written in natural language. Fortunately, recent advancements in Large Language Models (LLMs) have demonstrated remarkable proficiency in interpreting complex natural language. Inspired by this, we develop a novel system for Code Representation and Execution (CoRE), which employs an LLM as the interpreter to interpret and execute natural language programs (NLPg). The proposed system unifies natural language programming, pseudo-code programming, and flow programming under the same representation for constructing language agents, while the LLM serves as the interpreter that interprets and executes the agent programs. In this paper, we begin by defining a programming syntax that structures natural language instructions logically. During execution, we incorporate external memory to minimize redundancy. Furthermore, we equip the designed interpreter with the capability to invoke external tools, compensating for the limitations of LLMs in specialized domains or when accessing real-time information. This work is open-source at https://github.com/agiresearch/CoRE, https://github.com/agiresearch/OpenAGI, and https://github.com/agiresearch/AIOS.
1 Introduction
Programming is crucial for computers as it enables them to execute specific tasks based on a predefined set of instructions, allowing us to encode logical algorithms that computers use to solve problems. Programming has evolved significantly since its inception, with new technologies and innovations driving its growth. Initially, programs were written in binary machine language, often entered via punched cards, which could be directly executed by the machine; however, machine language was hardly readable to humans. Subsequently, low-level programming languages such as assembly language used mnemonic instructions and operands to represent machine code, which enhanced readability [17]. Still, because it requires controlling memory locations and registers, assembly language retains a high entry barrier for programmers.
Figure 1: In our CoRE system, we design the CoRE language to unify natural language programming, pseudo-code programming, and flow programming in the same syntax representation, using a program for the OpenAGI [13] platform as an example. (The figure lists example queries such as "Given noisy image, how to return the object names in English step by step?")
With the design of high-level programming languages like C/C++, Java, and Python, coding has become more user-friendly and efficient. These languages offer programmers a more productive and accessible approach, leading to increased participation in programming and software development. Consequently, programming languages are becoming more integrated into everyday life.
From the history of programming languages, we can observe a clear trend toward greater usability, readability, and democratization of programming. Following this trend, natural language is a desirable choice for coding due to its accessibility, readability, and minimal training requirements for programmers. However, applying natural language programming presents challenges due to the inherent vagueness, ambiguity, and verbosity of natural language. Recently emerged Large Language Models (LLMs) offer a solution to this challenge due to their extraordinary capability in language understanding [37, 7], tool use and function calling [13, 41], and interacting with humans or environments [42, 11]. In this work, we propose a novel system for Code Representation and Execution (CoRE), which uses an LLM as the interpreter to interpret and execute instructions written in natural language, enabling agent programming in natural language.
CoRE can be used for natural language programming, pseudo-code programming, and flow programming, as these three forms of agent programs are unified in our CoRE language, as shown by the example in Figure 1. In the realm of programming, the fundamental task involves designing and developing
logically structured instructions to address specific problems. Natural language programming offers
a method where instructions are formulated in everyday language, making the code intuitive and
accessible. When we structure all natural language instructions in a logical way, it inherently mirrors
the essence of pseudo-code programming. Pseudo-code, by design, simplifies the coding process by
stripping down syntax complexities and focusing on the algorithmic logic for easy understanding.
Therefore, when the instructions are expressed in natural language, the structured instructions can be
identified as pseudo-code. Moreover, pseudo-code shares a direct relationship with flow programming,
as it essentially represents the algorithm’s logic that can seamlessly be visualized as a workflow.
Figure 2: An example showing how the CoRE system executes one step in four procedures: (1) information retrieval, (2) prompt construction, (3) execution (here, calling the RestaurantSearch tool on Denver), and (4) deciding the next step. The step shown is "Step 10:::Process:::Find restaurant list at the first destination city.:::next::Step 11", for a query that plans a 7-day, $2,400 trip from Medford through three distinct Colorado cities.
In summary, the key contributions of this work are as follows:
• We design the CoRE language, which unifies natural language programming, pseudo-code programming, and flow programming. The CoRE language logically structures natural language instructions.
• We propose the CoRE system, which utilizes a Large Language Model (LLM) as an interpreter to interpret and execute instructions step by step. During execution, the LLM follows the instructions and leverages both information retrieval and external tools to enhance its effectiveness.
• We verify the effectiveness and efficiency of our system based on public benchmark datasets.
Specifically, we employ our proposed system for agent task solving based on natural language
programs, showcasing its practical capabilities.
In the remainder of this paper, we first review related work in Section 2. In Section 3, we
present the CoRE framework and how the framework can be applied to LLM agents. We provide the
experimental results in Section 4, and conclude the work together with future directions in Section 5.
2 Related Work
2.1 Natural Language Programming
Research in natural language programming [16, 4, 45, 27, 9, 12] primarily focuses on addressing the ambiguity in translating natural language into programming language statements. Heidorn [16]
proposes to adopt heuristic NLP encoding and decoding rules to develop an automatic programming
system that can accept natural language dialogues. Vadas and Curran [45] introduce a prototype
system that can translate certain English instructions into executable Python code using a Combinatory Categorial Grammar (CCG) parser, which uses unrestricted syntax to cover a wide range of user
instruction semantics. Mihalcea et al. [34] implement a procedural natural language programming
system to convert natural language to programming language. Early natural language programming
techniques are restricted in extensibility by the need to create domain-specific languages (DSLs). To
avoid the problems of repeatedly designing new DSLs, Desai et al. [9] propose a general generative
framework for constructing a program that takes natural language input and produces the expressions
in the target DSL. Further, Ernst [12] leverages neural networks, i.e., recurrent neural networks (RNNs), to convert English specifications of file system operations into corresponding bash commands.
2.2 LLMs for Reasoning, Planning, and Code Generation
Large Language Models (LLMs) have emerged as powerful tools for problem solving, encompassing tasks in reasoning, planning, and code generation. LLM reasoning typically involves decomposing a complex task into a sequence of steps, also known as a reasoning chain [48]. Prominent approaches in LLM reasoning include Chain-of-Thought (CoT) and its derivatives [48, 25]. To further improve the reasoning ability of LLMs, several approaches have been proposed. The Self-consistency method [46] samples multiple reasoning paths and selects the most consistent outcome by voting. Additionally, classical data structures like trees and graphs are utilized to enhance reasoning efficiency and accuracy in fewer steps [51, 2]. Apart from reasoning, planning is also an important capability for problem solving. LLM planning involves generating a series of actions to achieve predefined goals [15]. Recent advancements include directly prompting LLMs for planning tasks, showing promising results [19, 44, 10]. Finite state machines have been integrated into LLMs to enhance their planning ability [28, 50]. ReAct [52] proposes to leverage external tools like search engines to enhance LLM planning. Besides, considering the powerful programming ability of LLMs, recent work proposes to generate programming code to solve problems [31, 21, 30, 5, 22, 39, 35]. Furthermore, the "self-reflection" mechanism [32, 38, 43] enables LLMs to critique their own outputs, significantly enhancing performance in tasks such as reasoning [2] and code generation [6]. In contrast to existing methods that directly use LLMs to generate solutions, the proposed CoRE system utilizes LLMs as interpreters, executing solutions designed by humans to address complex questions. This approach couples human creativity in solution design with the LLM's language understanding ability to enhance problem-solving capabilities in natural language programming contexts.
3 The CoRE Framework
In this section, we introduce how we define the natural language programming syntax and how to use an LLM as an interpreter to interpret and execute natural language programs.
3.1 Natural Language Programming Syntax
To organize natural language instructions, we define a basic structural representation for each step, which consists of four components; an example can be found in Figure 1.
Figure 4: An example showing how the CoRE system retrieves relevant information: for the same trip-planning query, the memory holds results from earlier steps such as 1. CitySearch [Colorado] and 2. FlightSearch [Medford, Denver, 2022-03-23].
• Step Name: Each step in the program is uniquely identified by a step name. This identifier is
analogous to function identifiers in traditional programming languages, which facilitates navigation
and reference within the program structure, ensuring that each operation within the program can be
distinctly addressed and accessed.
• Step Type: The step type categorizes the nature of the operation being performed in each step,
analogous to control structures in conventional programming. We define three primary step types:
– Process: Akin to a procedural statement in traditional programming, this step type executes a
specific operation and transitions to the next specified step.
– Decision: Corresponding to conditional statements (e.g., “if-else”), this step involves branching
the program flow based on evaluated conditions, leading to multiple potential paths.
– Terminal: Similar to the “end” or “return” statement, this step marks the conclusion of the
program, indicating that no further steps are to be executed.
• Step Instruction: The step instruction specifies the task to be conducted at a step. This component is integral as it provides the instruction and content for execution, paralleling the statement block in traditional programming languages. By expressing operations in natural language, NLPg lowers the barrier to programming, making it more readable for non-expert programmers.
• Step Connection: Step connections define the progression from one step to another, establishing
the flow of the program execution. In process steps, a single subsequent step is specified. In
decision steps, multiple pathways are delineated based on conditions. Terminal steps, by definition,
do not lead to any future steps, indicating the end of program execution.
For each step in the program, the above four components are separated by ":::" (as illustrated in the CoRE language in Figure 1); other special tokens could also be used to separate the components. A minimal parsing sketch is given below.
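To make the format concrete, here is a minimal Python sketch (an illustration of the syntax above, not code from the CoRE system; the Step class and parse_step function are hypothetical names of our own) that parses one line of a CoRE program into the four components, using the example step from Figure 2:

    from dataclasses import dataclass, field

    @dataclass
    class Step:
        name: str          # Step Name, e.g. "Step 10"
        step_type: str     # Step Type: "Process", "Decision", or "Terminal"
        instruction: str   # Step Instruction in natural language
        connections: dict = field(default_factory=dict)  # Step Connection(s)

    def parse_step(line: str) -> Step:
        # The four components are separated by ":::"; each connection is
        # written as "<condition>::<target step name>", e.g. "next::Step 11".
        parts = line.split(":::")
        connections = dict(part.split("::", 1) for part in parts[3:])
        return Step(parts[0], parts[1], parts[2], connections)

    step = parse_step("Step 10:::Process:::Find restaurant list at the "
                      "first destination city.:::next::Step 11")
    assert step.connections == {"next": "Step 11"}

A Terminal step simply has no connection component, so its connections dictionary stays empty.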
Programming languages build on three basic control constructs [8, 40]: sequence, selection, and iteration. All three can be easily expressed within the CoRE language, as described below (a small example program follows the list).
• Sequence: Sequence in programming is the execution of statements in a linear order, with each
statement leading to the next. In the CoRE framework, this construct is designed by setting the
“Step Connection” to point to the subsequent step. Each step operates under the Process type until
the sequence concludes.
• Selection: Selection in programming languages facilitates conditional branching, allowing the
program to execute different sequences of steps based on specific conditions. This is implemented
using the Decision step type where the “Step Connection” part explicitly outlines multiple potential
paths. Each branch is defined by a condition stated within the “Step Connection” part, guiding the
program flow to various steps depending on the conditions.
• Iteration: Iteration involves repeating a set of operations until a certain condition is met, akin to
loops in conventional programming. In the CoRE framework, we utilize a step with the Decision
type to assess whether the loop condition has been fulfilled. At the end of one loop cycle, the
“Step Connection” is configured to point back to the previous Decision step, thereby enabling the
continuation of the loop.
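To make the three constructs concrete, here is a small hypothetical CoRE program (our own example written against the syntax above; the condition labels "yes" and "no" on the Decision branches are an assumption, since the paper does not fix their exact spelling), reusing the parse_step sketch from above:

    # A loop that keeps adding cities until the itinerary covers 3 of them.
    EXAMPLE_PROGRAM = """
    Step 1:::Process:::Initialize an empty itinerary.:::next::Step 2
    Step 2:::Decision:::Check whether the itinerary covers 3 distinct cities.:::yes::Step 4:::no::Step 3
    Step 3:::Process:::Search for one more city and add it to the itinerary.:::next::Step 2
    Step 4:::Terminal:::Return the finished itinerary.
    """

    # Step 1 -> Step 2 is a sequence; Step 2 is a selection with two branches;
    # Step 3 pointing back to Step 2 closes the loop, realizing iteration.
    lines = [l.strip() for l in EXAMPLE_PROGRAM.strip().splitlines()]
    program = {s.name: s for s in map(parse_step, lines)}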
Figure 5: An example showing how the CoRE system analyzes the output from the LLM interpreter: the response "To provide you with a list of restaurant options, I'll use the tool: RestaurantSearch [Denver]. This will give us a range of dining choices in the city." is parsed into the tool name RestaurantSearch and the argument Denver.
3.2 LLM as Interpreter
In this section, we discuss how the CoRE system utilizes a Large Language Model (LLM) as an interpreter to execute programs written in the CoRE language. We demonstrate the execution of a single step within the CoRE system, as illustrated in Figure 3. More specifically, the system executes a single step in four procedures. First, the interpreter determines which information is useful for executing the current step. Then, the interpreter integrates all relevant information to construct the prompt. Based on the constructed prompt, the interpreter generates a response and may utilize tools to execute the current step. Finally, after executing the current step, the interpreter determines the next step based on the step type and the execution results. We introduce the four procedures in detail in the following; a minimal end-to-end sketch is given below.
Figure 3: The workflow of executing a single step: a prompt is created from the task, progress, observation, and instruction; the output is analyzed; a tool may be invoked (deciding the tool name and arguments, executing the tool, and saving the observation); and the next step is chosen based on the branch.
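The following Python skeleton is a minimal sketch of this loop under stated assumptions: llm is any callable from a prompt string to a response string, tools maps tool names to plain callables, and the "TOOL <name> <args>" reply convention is invented here for illustration; the actual interfaces of the CoRE system may differ.

    def execute_step(step, program, task, progress, memory, llm, tools):
        """Run one step; return the next Step, or None at a Terminal step."""
        # 1. Information retrieval: gather observations saved by earlier steps.
        observation = "\n".join(f"{k}: {v}" for k, v in memory.items())
        # 2. Construct the prompt from the four parts described below.
        prompt = (f"Task Description: {task}\n"
                  f"Current Progress: {', '.join(progress)}\n"
                  f"Observation: {observation}\n"
                  f"Current Instruction: {step.instruction}")
        # 3. Execution: the LLM responds, optionally requesting an external tool.
        response = llm(prompt)
        if response.startswith("TOOL "):
            _, tool_name, tool_args = response.split(" ", 2)
            response = tools[tool_name](tool_args)  # e.g. RestaurantSearch
        memory[step.name] = response                # save the output / observation
        progress.append(step.name)
        # 4. Decide the next step based on the step type.
        if step.step_type == "Terminal":
            return None
        if step.step_type == "Process":
            return program[step.connections["next"]]
        # Decision step: ask the interpreter which branch condition is satisfied.
        branch = llm(prompt + "\nWhich condition holds? Answer with one of: "
                     + ", ".join(step.connections)).strip()
        return program[step.connections[branch]]

A full run then simply starts from the entry step and iterates execute_step until it returns None.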
To construct the prompt, the interpreter integrates the following components (a template sketch follows the list):
• Task Description: The query that defines the entire program, acting as the primary input to guide the system's operations.
• Current Progress: Summarizes the previous steps including what has been done or decided,
helping maintain a narrative flow.
• Observation: This part may not be included in every step. When relevant information is retrieved
from the memory by the interpreter, it is incorporated here.
• Current Instruction: Specifies the action to be taken in natural language, directing the interpreter
on how to proceed in the current step.
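Under these definitions, the prompt can be assembled as in the following sketch (the field labels and their ordering are our own phrasing of the list above, not necessarily the system's exact template):

    def build_prompt(task, progress, observation, instruction):
        lines = [f"Task Description: {task}",
                 f"Current Progress: {progress}"]
        if observation:  # included only when relevant memory was retrieved
            lines.append(f"Observation: {observation}")
        lines.append(f"Current Instruction: {instruction}")
        return "\n".join(lines)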
Figure 6: An example showing how the CoRE system determines the next step in the flow: asked which branch condition is closer, the LLM interpreter answers that the destination is a state (Colorado), so the flow proceeds along the "state" branch to Step 3.
4 Experiments
Metrics / Task         |  Mixtral (open source) as LLM interpreter  |  GPT-4 (closed-source) as LLM interpreter
                       |  Zero     CoT      Few      CoRE (Ours)    |  Zero     CoT      Few      CoRE (Ours)
Task 1 (CLIP Score)    |  0.0      0.0      0.1839   0.1825         |  0.0      0.2732   0.1837   0.3030
Task 2 (BERT Score)    |  0.1092   0.1987   0.0687   0.2593         |  0.2076   0.2266   0.5277   0.5756
Task 3 (ViT Score)     |  0.1949   0.1562   0.5501   0.2437         |  0.5058   0.6736   0.6916   0.6611
Average over tasks     |  0.1206   0.1736   0.1887   0.2483         |  0.2378   0.3359   0.5391   0.5744
% of Valid Plans       |  23.08    38.46    46.15    56.92          |  53.85    60.00    83.08    92.31
Table 1: OpenAGI [13] benchmark task performances under different settings. Zero is for Zero-shot
Learning, Few is for Few-shot Learning. The boldface numbers denote the highest score under each
task type using the same LLM.
We conduct experiments on the OpenAGI benchmark [13]. The OpenAGI benchmark tasks are categorized based on their output type and ground-truth label type (Task 1, 2, and 3). Based on the task type, different metrics are employed to gauge performance: CLIP Score [18], assessing the similarity between text and image, is utilized for text-to-image tasks; BERT Score [54], evaluating text generation with BERT, is applied when both the data labels and the expected outputs are texts; and ViT Score [49] gauges the similarity between the image label and the image output.
Our framework and all baselines are implemented in PyTorch, an open-source library. We follow the implementation settings of the OpenAGI platform [13] for Zero-shot and Few-shot learning. We leverage the DSPy framework [23, 24] to apply the CoT strategy on the OpenAGI platform. We also tried the Program-of-Thought [5] and ReAct [53] strategies on the OpenAGI platform; however, the ReAct strategy requires textual observations, which is unsuitable for our OpenAGI tasks since some observations are in image format, and Program-of-Thought cannot generate executable code. Thus, we did not include them as baselines.
The experimental results on the OpenAGI benchmark are shown in Table 1. Each row stands for a type of task, each column represents a planning schema, and every four columns report the results of the same LLM interpreter. From the results, we can see that our CoRE planning schema achieves better average performance than any baseline under both Mixtral and GPT-4 as interpreters. When using Mixtral as the interpreter, CoRE outperforms Zero-shot and CoT on every type of task, and is better than Few-shot learning on Task 2 and the average score, though worse on Task 3 and slightly worse on Task 1. When using GPT-4 as the interpreter, CoT and Few-shot achieve performance similar to CoRE on Task 1 and Task 3, while on Task 2 and the average score, CoRE is still the best. It may be worth noting that the comparison between CoRE and Few-shot learning is not entirely fair to CoRE, since we do not directly provide the output format and output examples in the prompt. However, even without such examples, the CoRE planning strategy is still better than the Few-shot strategy on average. We also find that the same CoRE program may perform differently under different LLM interpreters, which means that the performance of natural language programming depends on the natural language understanding ability of the LLM interpreter.
5 Conclusions and Future Work
In this study, we introduce CoRE, a novel system for Code Representation and Execution. CoRE is designed to bridge natural language programming, pseudo-code programming, and flow programming through a unified CoRE language for constructing AI agents. CoRE leverages natural language as the programming interface, which lowers the programming barrier and advances the democratization of programming, so that even ordinary users can create their own AI agents. Our system leverages Large Language Models (LLMs) as interpreters to process and execute natural language instructions. Throughout execution, the interpreter dynamically retrieves necessary information, utilizes appropriate external tools, and navigates through instructions based on previous outputs. The experimental outcomes validate the efficacy of the CoRE system for natural language programming.
While CoRE demonstrates promising results, it currently relies on manually crafted programs, which
may introduce inefficiencies due to the inherent ambiguities of natural language. To address this,
future research could explore the development of automated systems for generating natural language
programming instructions. This automation would help standardize instruction clarity and precision,
potentially improving system performance. Additionally, a future direction is to expand CoRE’s
language support to facilitate international use and implement real-time debugging features to aid in
education and assist novice programmers, further broadening the system’s utility and accessibility.
References
[1] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David,
Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. 2022. Do as i
can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691
(2022).
[2] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas
Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. 2024.
Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of
the AAAI Conference on Artificial Intelligence, Vol. 38. 17682–17690.
[3] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie
Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark,
et al. 2022. Improving language models by retrieving from trillions of tokens. In International
conference on machine learning. PMLR, 2206–2240.
[4] Amy Bruckman and Elizabeth Edwards. 1999. Should we leverage natural-language knowledge?
An analysis of user errors in a natural-language-style programming language. In Proceedings of
the SIGCHI conference on Human Factors in Computing Systems. 207–214.
[5] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2023. Program of Thoughts
Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Trans-
actions on Machine Learning Research (2023).
[6] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language
models to self-debug. arXiv preprint arXiv:2304.05128 (2023).
[7] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li,
Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned
language models. Journal of Machine Learning Research 25, 70 (2024), 1–53.
[8] Ole-Johan Dahl, Edsger Wybe Dijkstra, and Charles Antony Richard Hoare. 1972. Structured
programming. Academic Press Ltd.
[9] Aditya Desai, Sumit Gulwani, Vineet Hingorani, Nidhi Jain, Amey Karkare, Mark Marron,
and Subhajit Roy. 2016. Program synthesis using natural language. In Proceedings of the 38th
International Conference on Software Engineering. 345–356.
[10] Yan Ding, Xiaohan Zhang, Chris Paxton, and Shiqi Zhang. 2023. Task and motion planning with
large language models for object rearrangement. In 2023 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS). IEEE, 2086–2092.
[11] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter,
Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. 2023. Palm-e: An embodied
multimodal language model. arXiv preprint arXiv:2303.03378 (2023).
[12] Michael D Ernst. 2017. Natural language is a programming language: Applying natural
language processing to software development. In 2nd Summit on Advances in Programming
Languages (SNAPL 2017). Schloss-Dagstuhl-Leibniz Zentrum für Informatik.
[13] Yingqiang Ge, Wenyue Hua, Kai Mei, Jianchao Ji, Juntao Tan, Shuyuan Xu, Zelong Li, and
Yongfeng Zhang. 2023. OpenAGI: When LLM Meets Domain Experts. In Advances in Neural
Information Processing Systems (NeurIPS) (2023).
[14] Yingqiang Ge, Yujie Ren, Wenyue Hua, Shuyuan Xu, Juntao Tan, and Yongfeng Zhang. 2023.
LLM as OS, Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem.
arXiv preprint arXiv:2312.03815 (2023).
[15] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting
Hu. 2023. Reasoning with language model is planning with world model. arXiv preprint
arXiv:2305.14992 (2023).
[16] George E Heidorn. 1976. Automatic programming through natural language dialogue: A survey.
IBM Journal of research and development 20, 4 (1976), 302–313.
[17] John L Hennessy and David A Patterson. 2011. Computer architecture: a quantitative approach.
Elsevier.
[18] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. 2021. CLIPScore:
A Reference-free Evaluation Metric for Image Captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[19] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng,
Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. 2022. Inner monologue: Embodied
reasoning through planning with language models. arXiv preprint arXiv:2207.05608 (2022).
[20] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris
Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand,
et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024).
[21] Ana Jojic, Zhen Wang, and Nebojsa Jojic. 2023. Gpt is becoming a turing machine: Here are
some ways to program it. arXiv preprint arXiv:2303.14310 (2023).
[22] Martin Josifoski, Lars Klein, Maxime Peyrard, Yifei Li, Saibo Geng, Julian Paul Schnitzler,
Yuxing Yao, Jiheng Wei, Debjit Paul, and Robert West. 2023. Flows: Building blocks of
reasoning and collaborating ai. arXiv preprint arXiv:2308.01285 (2023).
[23] Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts,
and Matei Zaharia. 2022. Demonstrate-Search-Predict: Composing Retrieval and Language
Models for Knowledge-Intensive NLP. arXiv preprint arXiv:2212.14024 (2022).
[24] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri
Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller,
Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling Declarative Language Model
Calls into Self-Improving Pipelines. arXiv preprint arXiv:2310.03714 (2023).
[25] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022.
Large language models are zero-shot reasoners. Advances in neural information processing
systems 35 (2022), 22199–22213.
[26] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman
Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-
augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information
Processing Systems 33 (2020), 9459–9474.
[27] Jiwei Li and Eduard Hovy. 2015. The NLP engine: A universal turing machine for nlp. arXiv
preprint arXiv:1503.00168 (2015).
[28] Zelong Li, Wenyue Hua, Hao Wang, He Zhu, and Yongfeng Zhang. 2024. Formal-LLM:
Integrating Formal Language and Natural Language for Controllable LLM-based Agents.
arXiv:2402.00798 (2024).
[29] Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei
Ji, Shaoguang Mao, et al. 2023. TaskMatrix.AI: Completing tasks by connecting foundation models with millions of APIs. arXiv preprint arXiv:2303.16434 (2023).
[30] Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter
Stone. 2023. Llm+ p: Empowering large language models with optimal planning proficiency.
arXiv preprint arXiv:2304.11477 (2023).
[31] Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidi-
anaki, and Chris Callison-Burch. 2023. Faithful chain-of-thought reasoning. arXiv preprint
arXiv:2301.13379 (2023).
[32] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe,
Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative
refinement with self-feedback. Advances in Neural Information Processing Systems 36 (2024).
[33] Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. 2024.
AIOS: LLM Agent Operating System. arXiv preprint arXiv:2403.16971 (2024).
[34] Rada Mihalcea, Hugo Liu, and Henry Lieberman. 2006. NLP (natural language processing)
for NLP (natural language programming). In Computational Linguistics and Intelligent Text
Processing: 7th International Conference, CICLing 2006, Mexico City, Mexico, February 19-25,
2006. Proceedings 7. Springer, 319–330.
[35] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese,
and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn
program synthesis. arXiv preprint arXiv:2203.13474 (2022).
[36] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[37] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language
models to follow instructions with human feedback. Advances in neural information processing
systems 35 (2022), 27730–27744.
[38] Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert
West, and Boi Faltings. 2023. Refiner: Reasoning feedback on intermediate representations.
arXiv preprint arXiv:2304.01904 (2023).
[39] Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek,
and Sumit Gulwani. 2022. Synchromesh: Reliable code generation from pre-trained language
models. arXiv preprint arXiv:2201.11227 (2022).
[40] Ronald E Prather. 1997. Regular expressions for program computations. The American
mathematical monthly 104, 2 (1997), 120–130.
[41] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong,
Xiangru Tang, Bill Qian, et al. 2023. Toolllm: Facilitating large language models to master
16000+ real-world apis. arXiv preprint arXiv:2307.16789 (2023).
[42] Steven I Ross, Fernando Martinez, Stephanie Houde, Michael Muller, and Justin D Weisz.
2023. The programmer’s assistant: Conversational interaction with a large language model for
software development. In Proceedings of the 28th International Conference on Intelligent User
Interfaces. 491–514.
[43] Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: an autonomous agent with
dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366 (2023).
[44] Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay,
Dieter Fox, Jesse Thomason, and Animesh Garg. 2023. Progprompt: Generating situated robot
task plans using large language models. In 2023 IEEE International Conference on Robotics
and Automation (ICRA). IEEE, 11523–11530.
[45] David Vadas and James R Curran. 2005. Programming with unrestricted natural language. In
Proceedings of the Australasian Language Technology Workshop 2005. 191–199.
[46] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha
Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in
language models. arXiv preprint arXiv:2203.11171 (2022).
[47] Yubo Wang, Xueguang Ma, and Wenhu Chen. 2023. Augmenting black-box llms with medical
textbooks for clinical question answering. arXiv preprint arXiv:2309.02233 (2023).
[48] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models.
Advances in neural information processing systems 35 (2022), 24824–24837.
[49] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi
Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. 2020. Visual Transformers: Token-
based Image Representation and Processing for Computer Vision. arXiv:2006.03677 [cs.CV]
[50] Yiran Wu, Tianwei Yue, Shaokun Zhang, Chi Wang, and Qingyun Wu. 2024. StateFlow: En-
hancing LLM Task-Solving through State-Driven Workflows. arXiv preprint arXiv:2403.11322
(2024).
[51] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik
Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models.
Advances in Neural Information Processing Systems 36 (2024).
[52] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan
Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint
arXiv:2210.03629 (2022).
[53] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan
Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International
Conference on Learning Representations (ICLR).
[54] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020.
BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations (ICLR).