Code Generation With LLMs
Josh Payne
Agenda
● Intro
● Brief history of AI for code generation
● Benchmarking code gen performance
● Applications and agents
● AI x software engineering
Intro
● 👋 I’m Josh
Brief History
● Pre-LLM era: RNNs and search
  ○ CodeNN (Iyer et al., 2016): code summarization
  ○ Aroma (Luan et al., 2019): code search (an early "copilot")
  ○ Code2Seq (Alon et al., 2019): better code summarization
● Early applications: GPT-3, Codex, GitHub Copilot
● "Oh wow, AI can actually write code now": GPT-3.5, GPT-4, OSS LLMs
● AI x software engineering: agents and integrated workflows
Still in its infancy!
Benchmarking code generation
How do we measure this?
The prompt provided to the model is shown on a black background, and a successful model-generated completion is shown on a blue background. To be successful, the completion must pass the unit tests.
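For readers without the figure, here is a tiny HumanEval-style example (an illustrative task, not one from the actual benchmark): the prompt is a function signature plus docstring, the model supplies the body, and the result is scored by executing hidden unit tests. HumanEval reports pass@k, the fraction of problems for which at least one of k sampled completions passes.

# A HumanEval-style task (illustrative, not from the real benchmark).
PROMPT = '''
def running_max(nums):
    """Return a list where element i is the max of nums[:i + 1]."""
'''

# A hypothetical model-generated completion, appended to the prompt.
COMPLETION = '''
    result, best = [], float("-inf")
    for x in nums:
        best = max(best, x)
        result.append(best)
    return result
'''

def check(candidate):
    # Hidden unit tests: the completion only "passes" if every assertion holds.
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []
    assert candidate([-2, -7]) == [-2, -2]

namespace = {}
exec(PROMPT + COMPLETION, namespace)  # execute prompt + completion as one program
check(namespace["running_max"])       # raises AssertionError if the tests fail
print("passed")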
Benchmarking code generation
Benchmark: HumanEval

Techniques
● Base models: the GPTs and Llamas of the world (e.g. Gemma-7B)
● Fine-tuning / instruct-tuning (example: synthesis; an illustrative training record follows below)
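To make the fine-tuning / instruct-tuning bullet concrete, here is one illustrative training record for code generation. The instruction/input/output schema is a common convention in open instruction-tuning datasets, not the format of any particular vendor's API.

import json

# One illustrative instruct-tuning record for code generation (assumed schema).
record = {
    "instruction": "Write a Python function that checks whether a string is a palindrome.",
    "input": "",
    "output": (
        "def is_palindrome(s: str) -> bool:\n"
        "    s = ''.join(c.lower() for c in s if c.isalnum())\n"
        "    return s == s[::-1]\n"
    ),
}

# Such datasets are typically stored as JSON Lines, one record per line.
print(json.dumps(record))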
Benchmarking code generation
Benchmark: HumanEval

Can we just scale it up?
As LLMs and datasets get larger, we increasingly need to think creatively about how to gather data in order to improve.
● GPT-4 (1.7T parameters, est.), technique: none → HumanEval 79.3
💡 With LATS, GPT-3.5 beats GPT-4, despite being roughly 10x smaller!
Benchmarking code generation
Benchmark: HumanEval
Chain-of-Thought (reasoning method)
[Diagram: a single chain, Input → S1 → S2 → S3 → Output]
Chain-of-Thought (CoT) prompts LLMs to sequentially generate reasoning steps from input to output. It was demonstrated at scale in PaLM: Scaling Language Modeling with Pathways (Chowdhery, Catasta, et al., 2022).
However, it suffers from error propagation as the chain length increases.
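A minimal sketch of CoT prompting for a coding task. call_llm is a placeholder for whatever model API you use, and the prompt wording is only one reasonable way to elicit step-by-step reasoning.

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError

# Chain-of-Thought: ask for intermediate reasoning before the final code, rather
# than asking for the code directly.
COT_PROMPT = (
    "You are writing Python.\n"
    "Task: implement merge_intervals(intervals) that merges overlapping intervals.\n"
    "Think step by step: first restate the problem and its edge cases, then outline\n"
    "the algorithm, and only then write the final function.\n"
)

# response = call_llm(COT_PROMPT)
# Each reasoning step conditions the next one, which is also why an early mistake
# propagates down the rest of the chain.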
Benchmarking code generation
Benchmark: HumanEval
Tree-of-Thoughts (reasoning method)
[Diagram: Input branches into several S1 candidates, each branching into S2 candidates, down to Output]
Tree-of-Thoughts (ToT) extends CoT by exploring multiple reasoning paths using search algorithms like BFS and DFS (Yao et al., May 2023).
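A compressed sketch of the ToT idea, here as a breadth-first beam search over partial solutions; generate_candidates and score stand in for LLM calls and are assumptions, not a real API.

from typing import Callable, List

def tree_of_thoughts_bfs(
    root: str,
    generate_candidates: Callable[[str], List[str]],  # LLM: propose next reasoning steps
    score: Callable[[str], float],                     # LLM or heuristic: rate a partial solution
    beam_width: int = 3,
    depth: int = 3,
) -> str:
    """Keep only the best `beam_width` partial solutions at each level."""
    frontier = [root]
    for _ in range(depth):
        expanded = [cand for state in frontier for cand in generate_candidates(state)]
        if not expanded:
            break
        # Unlike CoT, several alternative reasoning paths are compared at every step.
        frontier = sorted(expanded, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score)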
Benchmarking code generation
Benchmark: HumanEval
Reasoning via Planning (reasoning method)
[Diagram: a search tree from Input over candidate steps S1 and S2 leading to multiple Outputs]
Reasoning via Planning (RAP) (Hao et al., October 2023) uses Monte Carlo Tree Search for planning chains of reasoning.
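At the core of MCTS-based planners like RAP is the standard UCT selection rule (written here in generic notation rather than RAP's own): the next reasoning step a is chosen from state s by balancing estimated value against how rarely the step has been explored,

a^{*} = \arg\max_{a}\left[\, Q(s,a) + c\,\sqrt{\frac{\ln N(s)}{N(s,a)}} \,\right],

where Q(s,a) is the backed-up value of step a, N counts visits, and c trades off exploration against exploitation. The tree is grown by repeatedly selecting with this rule, expanding a new step, rolling out to a complete answer, and propagating the reward back up the path.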
Benchmarking code generation
Benchmark: HumanEval
ReAct (decision-making method)
[Diagram: Input → S1 → S2 → Output, with an Observation fed back to the model after each action]
ReAct prompts LLMs with alternating actions and observations for decision-making in interactive environments (Yao et al., March 2023).
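A minimal ReAct-style loop, sketched under the assumption that the model emits either an action for a tool or a final answer; call_llm and run_tool are placeholders, not real libraries.

def call_llm(history: str) -> str:
    """Placeholder: returns either 'Action: <tool input>' or 'Final: <answer>'."""
    raise NotImplementedError

def run_tool(action: str) -> str:
    """Placeholder for the environment, e.g. running tests or a shell command."""
    raise NotImplementedError

def react_loop(task: str, max_steps: int = 5) -> str:
    """Interleave model-chosen actions with observations from the environment."""
    history = f"Task: {task}\n"
    for _ in range(max_steps):
        step = call_llm(history)
        if step.startswith("Final:"):
            return step[len("Final:"):].strip()
        observation = run_tool(step)                          # act in the environment
        history += f"{step}\nObservation: {observation}\n"    # feed the result back
    return history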
Benchmarking code generation
Benchmark: HumanEval
Reflexion (decision-making method)
[Diagram: Input → S1 → S2 → S3 → Output, with Observation and self-Reflection fed back into the chain]
Reflexion adds self-reflection to ReAct. This improves overall performance by allowing the LLM more time to think through the problem, similar to CoT (Shinn et al., October 2023).
However, it does not consider alternative options at each step.
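A sketch of the Reflexion pattern applied to code: retry on failure, but carry a growing verbal memory of why earlier attempts failed. All three helper functions are placeholders for LLM or test-runner calls.

from typing import List

def generate_solution(task: str, reflections: List[str]) -> str:
    """Placeholder: ask the model for code, conditioned on past reflections."""
    raise NotImplementedError

def run_tests(code: str) -> str:
    """Placeholder: return '' on success, otherwise the failure trace."""
    raise NotImplementedError

def reflect(task: str, code: str, failure: str) -> str:
    """Placeholder: ask the model to explain briefly why the attempt failed."""
    raise NotImplementedError

def reflexion(task: str, max_trials: int = 4) -> str:
    reflections: List[str] = []            # verbal memory carried across trials
    code = ""
    for _ in range(max_trials):
        code = generate_solution(task, reflections)
        failure = run_tests(code)          # environmental feedback (unit tests)
        if not failure:
            return code                    # success
        reflections.append(reflect(task, code, failure))  # self-reflect, then retry
    return code                            # best effort after all trials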
Benchmarking code generation
Benchmark: HumanEval
Language Agent Tree Search (reasoning + decision-making method)
[Diagram: a search tree from Input over S1, S2, and S3 candidates to multiple Outputs, with Observation and Reflection fed back into the search]
LATS unifies the strengths of both reasoning and decision-making methods through principled search, while overcoming limitations via environmental feedback and self-reflection (Zhou et al., December 2023).
GPT-4 + LATS is the current best performer on the HumanEval benchmark, with a score of 94.4.
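A rough sketch of how LATS-style search can look on HumanEval-type tasks: candidate programs are tree nodes, values come from running the tests (environmental feedback), and failing branches contribute reflections that condition later expansions. The helpers are placeholders, not the authors' implementation, and real LATS selects nodes with UCT rather than the greedy walk used here.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    program: str
    value: float = 0.0                     # fraction of unit tests passed
    children: List["Node"] = field(default_factory=list)

def expand(node: Node, reflections: List[str]) -> List[Node]:
    """Placeholder: sample child programs, conditioned on accumulated reflections."""
    raise NotImplementedError

def evaluate(node: Node) -> float:
    """Placeholder: run the task's unit tests and return the pass rate."""
    raise NotImplementedError

def reflect_on(node: Node) -> str:
    """Placeholder: ask the model why this candidate failed."""
    raise NotImplementedError

def lats_search(root: Node, iterations: int = 8, branch: int = 3) -> Node:
    reflections: List[str] = []
    best = root
    for _ in range(iterations):
        node = root
        while node.children:                               # selection (greedy here)
            node = max(node.children, key=lambda n: n.value)
        for child in expand(node, reflections)[:branch]:   # expansion
            child.value = evaluate(child)                  # environmental feedback
            node.children.append(child)
            if child.value > best.value:
                best = child
            if child.value < 1.0:
                reflections.append(reflect_on(child))      # self-reflection on failures
    return best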
Applications and agents
AI has tackled every aspect of software engineering (category list below not exhaustive):
● Project creation
● Migrations
● IDE
● Issues & tests
● Docs
Example tools: Coffee, Cody, and many others.
Applications and agents
Deep dive: GPT-Migrate
Applications and agents
Deep dive: Coffee
AI x Software Engineering
● AI-driven development
AI x Software Engineering
Using code generation wisely
● Learning is important
● Understanding your code is important
● Maintainability and knowledge transfer are important
○ Fully LLM-written projects tend to produce
“spaghetti code”. I know first-hand!
AI x Software Engineering
Prompt engineering for code gen
AlphaCodium
(Ridnik et al., January 2024)
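The gist of AlphaCodium is "flow engineering": a multi-stage, test-driven flow rather than a single prompt. The sketch below is a heavy simplification under that assumption, not the authors' actual pipeline; llm and run_tests are placeholders.

def llm(prompt: str) -> str:
    """Placeholder for a model call."""
    raise NotImplementedError

def run_tests(code: str, tests: str) -> str:
    """Placeholder: return '' on success, otherwise the failing output."""
    raise NotImplementedError

def test_driven_flow(problem: str, public_tests: str, rounds: int = 5) -> str:
    # Stage 1: reason about the problem in natural language before writing code.
    analysis = llm(f"Summarize the problem, its inputs, outputs and edge cases:\n{problem}")
    # Stage 2: have the model propose additional tests beyond the public ones.
    extra_tests = llm(f"Write additional unit tests for this problem:\n{analysis}")
    # Stage 3: generate code, then iteratively repair it against both test sets.
    code = llm(f"Solve the problem described here:\n{analysis}")
    for _ in range(rounds):
        failure = run_tests(code, public_tests + "\n" + extra_tests)
        if not failure:
            break
        code = llm(f"The code below failed with:\n{failure}\nFix it:\n{code}")
    return code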
AI x Software Engineering
Prompt engineering for code gen
Prompt composition can become complex when you're dealing with code-writing agents performing multiple types of software engineering tasks. Some practical pointers follow.
AI x Software Engineering
AI-driven development: practical pointers
● Language preference: LLMs do better with more popular languages. They also benefit from the clarity of typed languages.
● Modularity: Keep files modular. Use headers and TDDs to help the LLM navigate and generate files.
● Interfaces: LLMs need context. Interfaces (input, output, transformation, types) give this. Use IOP in prompts (see the prompt sketch below).
● Logs-in-the-loop: When debugging (or in a background loop), LLMs can digest logs and error traces. Very helpful! (See the sketch after the prompt example.)
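One way to put the "IOP" idea (stating input, output, and transformation explicitly) into a prompt; the helper and wording are illustrative, not a standard format.

def build_iop_prompt(task: str, input_type: str, output_type: str, transformation: str) -> str:
    """Compose a prompt that states the interface explicitly, giving the LLM context."""
    return (
        f"Task: {task}\n"
        f"Input: {input_type}\n"
        f"Output: {output_type}\n"
        f"Transformation: {transformation}\n"
        "Write a single, well-typed Python function implementing exactly this interface.\n"
    )

print(build_iop_prompt(
    task="Normalize user records before they are written to the database",
    input_type="list[dict] with keys 'email' (str) and 'signup_ts' (unix seconds, int)",
    output_type="list[dict] with keys 'email' (lowercased str) and 'signup_date' (ISO-8601 str)",
    transformation="lowercase emails, convert timestamps to dates, drop records missing either field",
))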
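And a minimal logs-in-the-loop sketch: run the code, capture the traceback, and hand both the code and the logs back to the model for a fix. call_llm is a placeholder; the truncation length is an arbitrary choice to stay within context limits.

import subprocess
import sys

def call_llm(prompt: str) -> str:
    """Placeholder for a model call that returns a corrected version of the file."""
    raise NotImplementedError

def fix_with_logs(path: str, max_attempts: int = 3) -> bool:
    """Run a script; on failure, feed the code plus the error trace back to the LLM."""
    for _ in range(max_attempts):
        result = subprocess.run([sys.executable, path], capture_output=True, text=True)
        if result.returncode == 0:
            return True                       # the script ran cleanly
        with open(path) as f:
            code = f.read()
        logs = result.stderr[-4000:]          # keep only the tail of the trace
        patched = call_llm(
            f"This Python file crashed.\n--- code ---\n{code}\n--- stderr ---\n{logs}\n"
            "Return the full corrected file."
        )
        with open(path, "w") as f:
            f.write(patched)
    return False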
Thanks to:
● Michele Catasta
● Pavlo Razumovskyi
● Glavin Wiechert
● Tinah Hong
● Alex Korshuk
● John Whaley
Thank you!
Questions
josh@coframe.ai