
Code generation with LLMs

Generative AI && software engineering: analysis, learnings, practical insights

Josh Payne
Agenda
● Intro
● Brief history of AI for code generation
● Benchmarking code gen performance
● Applications and agents
● AI x software engineering
Intro
● 👋 I’m Josh

● Founder of Coframe (AI for UI optimization + code gen); previously founded two other companies (one AI-focused)

● Created GPT-Migrate (LLM-powered codebase migration) and Coffee (LLM-powered UI code gen)

● Stanford CS (AI) alum!


Brief History

CodeNN (Iyer et al., 2016): code summarization
Aroma (Luan et al., 2019): code search (early copilot)
Code2Seq (Alon et al., 2019): better code summarization

(Try it! -> https://code2seq.org/ )

Pre-LLM era: RNNs and search
Early applications: GPT-3, Codex, GitHub Copilot
“Oh wow, AI can actually write code now”: GPT-3.5, GPT-4, OSS LLMs
AI x software engineering: Agents and integrated workflows
Brief History
Still in its infancy!

Benchmarking code generation
How do we measure this?

1. Benchmark tasks   2. Competitions   3. Real-world impact

HumanEval (Chen et al., 2021) is the most widely recognized research benchmark for code generation. The same paper also introduced Codex, the first major code-specific LLM.

HumanEval consists of 164 handwritten programming problems, each with several unit tests.

(Figure: the prompt provided to the model is shown on a black background, and a successful model-generated completion on a blue background. To be successful, the completion must pass the unit tests.)
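HumanEval scores are usually reported as pass@k: the probability that at least one of k sampled completions passes all unit tests. Below is a minimal sketch of the unbiased estimator from the Codex paper (Chen et al., 2021), assuming you have already counted how many of n samples per problem passed; the benchmark-level score averages this over the 164 problems.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # n = completions sampled for a problem, c = completions that passed all tests
        if n - c < k:
            return 1.0                      # every size-k subset contains a passing sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 200 samples per problem, 37 passed -> estimate pass@1 and pass@10
    print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 10))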
How do we measure this?
1. Benchmark tasks

There have also been extensions of HumanEval and other datasets:

- MultiPL-E is a dataset for evaluating large language models for code generation that supports 18 programming languages. It translates HumanEval problems into other languages.

- HumanEval-X consists of 820 high-quality human-crafted data samples, compared with HumanEval’s 164.

- MBPP (Mostly Basic Python Problems) is a dataset of about 1,000 crowd-sourced Python programming problems.
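All of these benchmarks are execution-based: a completion only counts if it passes the problem's unit tests. A minimal, illustrative harness (not any benchmark's official one) might look like the sketch below; real harnesses run the code in a sandbox with timeouts rather than a bare exec().

    def passes_tests(prompt: str, completion: str, tests: str) -> bool:
        # Assemble the problem prompt, the model's completion, and the unit tests
        # (MBPP-style plain assert statements) into one program and run it.
        program = prompt + completion + "\n\n" + tests
        scope: dict = {}
        try:
            exec(program, scope)            # unsafe on untrusted code; illustration only
            return True
        except Exception:
            return False

    # Hypothetical usage with one problem and one sampled completion:
    ok = passes_tests("def add(a, b):\n", "    return a + b\n", "assert add(2, 3) == 5")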
How do we measure this?
1. Benchmark tasks

Some companies also create internal datasets on which to evaluate.

- Google introduced Gemini alongside a new benchmark, Natural2Code, which is a held-out internal dataset. GPT-4 (OpenAI) was slightly better on HumanEval (OpenAI), while Gemini (Google) was slightly better on Natural2Code (Google).

- Meta has internal unit-test datasets for its internal LLMs (see TestGen-LLM).
How do we measure this?
1. Benchmark tasks

Why are held-out (non-published) benchmarks valuable? Because they cannot leak into training data, so high scores are less likely to reflect memorization.
How do we measure this?
2. Competitions

AlphaCode by DeepMind (Li et al., Dec 2022) created CodeContests, a dataset of compiled competitive programming problems.

Increasingly, datasets built from real-world tasks for humans are needed as models approach human-level performance.

Other examples: the LSAT, the USMLE, AlphaGeometry (IMO problems).
How do we measure this?
3. Real-world impact

As models begin to surpass human performance, they will increasingly be measured on impact.

Example: AlphaDev (Mankowitz, Michi, et al., June 2023) discovered a faster sorting algorithm for small lists that has now been implemented in the C++ standard library.

SWE KPIs (bug rate, PRs merged, etc.) are starting to become more commonplace.
Benchmarking code generation
Benchmark: HumanEval

Performance can be improved at three levels, from the bottom up:
- Base models
- Fine-tuning / instruct-tuning
- Techniques
Benchmarking code generation
Base models

Base models are the GPTs and Llamas of the world: not fine-tuned for a particular task.

Open LLMs (weights are open; easy to do custom tuning and experimentation):
● CodeLlama (WizardCoder)
● StarCoder
● Replit-code-v1-3b
● Mixtral-8x7b

Closed LLMs (weights are closed; tuning and experimentation are limited):
● GPT-4
● Gemini Ultra
● Claude 2.1
● Grok
Benchmarking code generation
Base models

Another example of an open base model: Gemma-7B.
Benchmarking code generation
Fine-tuning / instruct-tuning

Instruct-tuned models are models that are fine-tuned with instructions: in this case, for code.

Instruct-tuning involves a prompt which contains an instruction, and a response. Including the instruction is important so that the model knows how to follow new instructions at inference time.

Examples: synthesis, fixing a bug, explaining code.
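A minimal sketch of what such training records might look like for those three example tasks; the field names and contents are illustrative, not taken from any particular dataset.

    # Hypothetical instruct-tuning records for code (field names are assumptions).
    records = [
        {   # synthesis
            "instruction": "Write a Python function that returns the n-th Fibonacci number.",
            "response": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a",
        },
        {   # fix a bug
            "instruction": "Fix the off-by-one error:\nfor i in range(len(xs) + 1):\n    print(xs[i])",
            "response": "for i in range(len(xs)):\n    print(xs[i])",
        },
        {   # explain code
            "instruction": "Explain what this does:\nsquares = [x * x for x in range(10) if x % 2 == 0]",
            "response": "It builds a list of the squares of the even numbers from 0 through 8.",
        },
    ]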


Benchmarking code generation
Fine-tuning / instruct-tuning

Instruct-tuning is clearly useful. How can we scale it up?

As LLMs and datasets get larger, we increasingly need to think creatively about how to gather data in order to improve.
One example of this is COMMITPACK: 4 terabytes of Git commits across 350 programming languages (Muennighoff et al., 2023; ICLR 2024).

Git commits naturally pair code changes with human instructions.
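A rough sketch of that idea: treat the commit message as the instruction and the diff as the response. The repo path and record format below are illustrative, not COMMITPACK's actual pipeline.

    import subprocess

    def commit_to_example(repo: str, sha: str) -> dict:
        # Commit subject line -> "instruction"
        msg = subprocess.run(["git", "-C", repo, "log", "-1", "--format=%s", sha],
                             capture_output=True, text=True).stdout.strip()
        # Diff introduced by the commit -> "response"
        diff = subprocess.run(["git", "-C", repo, "show", "--format=", sha],
                              capture_output=True, text=True).stdout
        return {"instruction": msg, "response": diff}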


Benchmarking code generation
Techniques

Technique can make all the difference. This is broadly broken down into reasoning methods and decision-making methods.

HumanEval scores:
● GPT-3.5 (175B parameters), technique: LATS — 83.8
● GPT-4 (est. 1.7T parameters), technique: none — 79.3

💡 GPT-3.5 with LATS beats GPT-4, despite being roughly 10x smaller!
Benchmarking code generation
Chain-of-Thought
Reasoning method

Chain-of-Thought (CoT) prompts LLMs to sequentially generate reasoning steps from input to output. It was introduced by Wei et al. (2022) and demonstrated at scale in PaLM: Scaling Language Modeling with Pathways (Chowdhery, Catasta et al., 2022).

However, it suffers from error propagation as the chain length increases.

(Diagram: Input → S1 → S2 → S3 → Output, a single chain of reasoning steps.)
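As a concrete illustration, a CoT-style prompt for a coding task simply asks the model to reason before it writes code. The wording is hypothetical, and `llm` is a stand-in for whatever model call you use.

    def llm(prompt: str) -> str:        # stand-in for a real model call
        return "Step 1: ...\nStep 2: ...\ndef longest_common_prefix(strs): ..."

    prompt = (
        "Task: write a Python function longest_common_prefix(strs: list[str]) -> str.\n"
        "First reason step by step about the algorithm and edge cases,\n"
        "then output the final implementation.\n"
    )
    answer = llm(prompt)                # reasoning steps followed by code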
Benchmarking code generation
Tree-of-Thoughts
Reasoning method

Tree-of-Thoughts (ToT) extends CoT by exploring multiple reasoning paths using search algorithms like BFS and DFS. (Yao et al., May 2023)

That said, it is limited by relying solely on the LLM's internal knowledge.

(Diagram: the input branches into several candidate steps at each level, forming a tree from input to output.)
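A breadth-first sketch of the idea, under the assumption that an LLM can both propose candidate next steps and score partial chains; `propose` and `score` below are trivial stand-ins, not the paper's implementation.

    def propose(task: str, chain: str, n: int) -> list[str]:
        # Stand-in: a real system would ask the LLM for n candidate next steps.
        return [f"candidate step {i} for {task}" for i in range(n)]

    def score(task: str, chain: str) -> float:
        # Stand-in: a real system would ask the LLM (or a heuristic) to rate the chain.
        return float(len(chain))

    def tree_of_thoughts(task: str, depth: int = 3, breadth: int = 3, keep: int = 2) -> str:
        frontier = [""]                                   # partial chains of thought
        for _ in range(depth):
            candidates = [chain + "\n" + step
                          for chain in frontier
                          for step in propose(task, chain, breadth)]
            candidates.sort(key=lambda c: score(task, c), reverse=True)
            frontier = candidates[:keep]                  # keep only the best few (pruned BFS)
        return frontier[0]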
Benchmarking code generation
Reasoning via Planning
Reasoning method

Reasoning via Planning (RAP) (Hao et al., October 2023) uses Monte Carlo Tree Search for planning chains of reasoning.

However, it also lacks external feedback.

(Diagram: a search tree of reasoning steps leading to multiple candidate outputs.)
Benchmarking code generation
ReAct
Decision-making method

ReAct prompts LLMs with alternating actions and observations for decision-making in interactive environments. (Yao et al., March 2023)

However, it greedily follows one trajectory and cannot adapt.

(Diagram: a single chain of steps, with an observation from the environment fed back at each step.)
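A minimal sketch of the action/observation loop, assuming a model call `llm` and an environment/tool call `run_tool`; both are stand-ins here, not ReAct's published code.

    def llm(transcript: str) -> str:                 # stand-in for a model call
        return "Thought: done.\nAction: finish[ok]"

    def run_tool(action_line: str) -> str:           # stand-in for an environment/tool call
        return "observation text"

    def react(task: str, max_steps: int = 5) -> str:
        transcript = f"Task: {task}\n"
        for _ in range(max_steps):
            step = llm(transcript)                   # model emits a thought and an action
            transcript += step + "\n"
            if "Action: finish" in step:             # the model decides it is done
                break
            transcript += "Observation: " + run_tool(step) + "\n"
        return transcript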
Benchmarking code generation
Reflexion
Decision-making method

Reflexion adds self-reflection to ReAct. This improves overall performance by allowing the LLM more time to think through the problem, similar to CoT. (Shinn et al., October 2023)

However, it does not consider alternative options at each step.

(Diagram: the ReAct loop with a reflection step feeding back into the chain.)
Benchmarking code generation
Language Agent Tree Search
Reasoning + decision-making method

LATS unifies the strengths of both reasoning and decision-making methods through principled search, while overcoming limitations via environmental feedback and self-reflection. (Zhou et al., December 2023)

GPT-4 + LATS is the current best performer on the HumanEval benchmark, with a score of 94.4.

(Diagram: a search tree of steps, with observations and reflections fed back at each level.)
Applications and agents

AI has tackled every aspect of software engineering. (Category list below not exhaustive.)

Categories: project creation, migrations, IDE, issues & tests, docs.
Example tools: Coffee by Coframe, Cody by Sourcegraph, and others.
Applications and agents
Deep dive: GPT-Migrate
Applications and agents
Deep dive: Coffee
AI x Software Engineering

● Using code generation wisely

● Prompt engineering for code gen

● AI-driven development
AI x Software Engineering
Using code generation wisely

Credit to Joshua Morony


AI x Software Engineering
Using code generation wisely
Why?

● Learning is important
● Understanding your code is important
● Maintainability and knowledge transfer are important
  ○ Fully LLM-written projects tend to produce “spaghetti code”. I know first-hand!
AI x Software Engineering
Prompt engineering for code gen

Prompt engineering is likely more important for code generation than for any other area, due to the precision required. Luckily, engineers are naturally good prompt engineers.

Principled Instructions Are All You Need (Bsharat et al., December 2023) gives 26 general prompting guidelines (see chart).

Worth adding: only include the minimum viable context; not all context windows are created equal.
AI x Software Engineering
Prompt engineering for code gen

AlphaCodium (Ridnik et al., January 2024) formalized “Flow Engineering” for software engineering workflows, which many practitioners had already been using. With GPT-4 on the CodeContests validation set, pass@5 accuracy improved from 19% with a well-crafted single prompt to 44% with AlphaCodium.
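A minimal sketch of a flow-engineering loop in that spirit (not AlphaCodium's actual pipeline): generate code, run it against tests, feed failures back into the prompt, and repeat. `generate` is a stand-in for a model call.

    from typing import Optional

    def generate(prompt: str) -> str:                 # stand-in for an LLM call
        return "def solve(x):\n    return x"

    def run_tests(code: str, tests: str) -> Optional[str]:
        scope: dict = {}
        try:
            exec(code + "\n\n" + tests, scope)        # unsafe outside a sandbox; illustration only
            return None                               # no exception -> all tests passed
        except Exception as e:
            return repr(e)

    def flow(problem: str, tests: str, max_iters: int = 4) -> str:
        code = generate(problem)
        for _ in range(max_iters):
            error = run_tests(code, tests)
            if error is None:
                break
            code = generate(f"{problem}\n\nPrevious attempt:\n{code}\n\n"
                            f"It failed with: {error}\nFix it and return the full function.")
        return code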
AI x Software Engineering
Prompt engineering for code gen

Prompt composition can become complex when you’re dealing with code-writing agents performing multiple types of software engineering tasks.

One solution is organizing prompts into a hierarchy and creating a constructor that can compose them together, along with any variables you need to pass in from your code.

The simplest way to do this is using text files in labeled directories in your /prompts/ directory. I’m sure there will be headless prompt CMSs at some point.

(Figure: prompt hierarchy in GPT-Migrate)
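A minimal sketch of such a constructor; the directory layout and file names are illustrative, not GPT-Migrate's actual structure.

    from pathlib import Path

    def compose_prompt(parts: list[str], variables: dict, root: str = "prompts") -> str:
        # Read each named section from prompts/<part>.txt and join them,
        # then substitute {placeholders} with the supplied variables.
        sections = [(Path(root) / f"{part}.txt").read_text() for part in parts]
        return "\n\n".join(sections).format(**variables)

    # Hypothetical usage:
    # prompt = compose_prompt(
    #     ["system/persona", "tasks/migrate", "guidelines/style"],
    #     {"source_lang": "Python", "target_lang": "Go"},
    # )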
AI x Software Engineering
Prompt engineering for code gen

SudoLang is a natural-language, constraint-based programming pseudolanguage with an LLM as the interpreter. What?

More simply, it combines natural language elements and simple coding conventions for better prompting.

SudoLang prompts can often be written with 20-30% fewer tokens than natural language.

The expressiveness and precision help when writing code, as well as when “programming” the LLM to serve as an application itself.
AI x Software Engineering
AI-driven development: practical pointers

● Language preference: LLMs do better with more popular languages. They also benefit from the clarity of typed languages.

● Project structure: Try to keep files small and modular. Use headers and TDDs to help the LLM navigate and generate files.

● Interface-oriented programming: LLMs need context. Interfaces (input, output, transformation, types) give this. Use IOP in prompts (see the sketch after this list).

● Logs-in-the-loop: When debugging (or in a background loop), LLMs can digest logs and error traces. Very helpful!

● Tests, tests, tests: When generating entire functions and files, test coverage is CRUCIAL. (LLMs can write these too!)

● Output structure: YAML uses as little as 50% of the tokens that JSON output does. Even with JSON mode, YAML wins.
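A small sketch combining two of the pointers above: interface-oriented prompting and YAML output. The interface, prompt wording, and `llm` call are all hypothetical; `yaml.safe_load` assumes the PyYAML package is installed.

    import yaml                                    # PyYAML, assumed available

    def llm(prompt: str) -> str:                   # stand-in for a real model call
        return "code: |\n  def parse_order(raw):\n      ...\nnotes: handles empty input"

    INTERFACE = (
        "Interface to implement:\n"
        "  parse_order(raw: str) -> Order\n"
        "  Order = {id: int, items: list[str], total: float}\n"
    )

    prompt = INTERFACE + "Write parse_order. Reply in YAML with keys `code` and `notes` only."
    reply = yaml.safe_load(llm(prompt))            # YAML is usually terser than equivalent JSON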
Acknowledgements

● Michele Catasta
● Pavlo Razumovskyi
● Glavin Wiechert
● Tinah Hong
● Alex Korshuk
● John Whaley

Thank you!
Questions

josh@coframe.ai
