
Code generation with LLMs

Generative AI && software engineering: analysis, learnings, practical insights

Josh Payne
Agenda
● Intro
● Brief history of AI for code generation
● Benchmarking code gen performance
● Applications and agents
● AI x software engineering
Intro
● 👋 I’m Josh

● Founder of Coframe (AI for UI optimization + code gen); previously founded two other companies (one AI-focused)

● Created GPT-Migrate (LLM-powered codebase migration) and Coffee (LLM-powered UI code gen)

● Stanford CS (AI) alum!


Brief History

CodeNN (Iyer et al., 2016): code summarization
Aroma (Luan et al., 2019): code search (early copilot)
Code2Seq (Alon et al., 2019): better code summarization

(Try it! -> https://code2seq.org/ )

Pre-LLM era: RNNs and search
Early applications: GPT-3, Codex, GitHub Copilot
“Oh wow, AI can actually write code now”: GPT-3.5, GPT-4, OSS LLMs
AI x software engineering: Agents and integrated workflows
Brief History
Still in its infancy!

Benchmarking code generation
How do we measure this?

1. Benchmark tasks   2. Competitions   3. Real-world impact

HumanEval (Chen et al., 2021) is the most widely recognized research benchmark for code generation. The same paper also introduced Codex, the first major code-specific LLM.

HumanEval consists of 164 handwritten programming problems, each with several unit tests.

(Figure: the prompt provided to the model is shown on a black background, and a successful model-generated completion on a blue background. To be successful, the completion must pass the unit tests.)
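HumanEval scores are usually reported as pass@k: the probability that at least one of k sampled completions passes all unit tests. Below is a minimal sketch of the unbiased estimator from the Codex paper (Chen et al., 2021), assuming you have already counted how many of n samples per problem passed; the benchmark-level score averages this over the 164 problems.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # n = completions sampled for a problem, c = completions that passed all tests
        if n - c < k:
            return 1.0                      # every size-k subset contains a passing sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 200 samples per problem, 37 passed -> estimate pass@1 and pass@10
    print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 10))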
How do we measure this?
1. Benchmark tasks

There have also been extensions of HumanEval and other datasets:

- MultiPL-E is a dataset for evaluating large language models for code generation that supports 18 programming languages. It translates HumanEval problems into other languages.

- HumanEval-X consists of 820 high-quality human-crafted data samples, compared with HumanEval’s 164.

- MBPP (Mostly Basic Python Problems) is a dataset of about 1,000 crowd-sourced Python programming problems.
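All of these benchmarks are execution-based: a completion only counts if it passes the problem's unit tests. A minimal, illustrative harness (not any benchmark's official one) might look like the sketch below; real harnesses run the code in a sandbox with timeouts rather than a bare exec().

    def passes_tests(prompt: str, completion: str, tests: str) -> bool:
        # Assemble the problem prompt, the model's completion, and the unit tests
        # (MBPP-style plain assert statements) into one program and run it.
        program = prompt + completion + "\n\n" + tests
        scope: dict = {}
        try:
            exec(program, scope)            # unsafe on untrusted code; illustration only
            return True
        except Exception:
            return False

    # Hypothetical usage with one problem and one sampled completion:
    ok = passes_tests("def add(a, b):\n", "    return a + b\n", "assert add(2, 3) == 5")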
How do we measure this?
1. Benchmark tasks

Some companies also create internal datasets on which to evaluate.

- Google introduced Gemini alongside a new benchmark, Natural2Code, which is a held-out internal dataset. GPT-4 (OpenAI) was slightly better on HumanEval (OpenAI), while Gemini (Google) was slightly better on Natural2Code (Google).

- Meta has internal unit-test datasets for its internal LLMs (see TestGen-LLM).
How do we measure this?
1. Benchmark tasks

Why are held-out (non-published) benchmarks valuable? Because they cannot leak into training data, so high scores are less likely to reflect memorization.
How do we measure this?
2. Competitions

AlphaCode by DeepMind (Li et al., Dec 2022) created CodeContests, a dataset of compiled competitive programming problems.

Increasingly, datasets built from real-world tasks for humans are needed as models approach human-level performance.

Other examples: the LSAT, the USMLE, AlphaGeometry (IMO problems).
How do we measure this?
3. Real-world impact

As models begin to surpass human performance, they will increasingly be measured on impact.

Example: AlphaDev (Mankowitz, Michi, et al., June 2023) discovered a faster sorting algorithm for small lists that has now been implemented in the C++ standard library.

SWE KPIs (bug rate, PRs merged, etc.) are starting to become more commonplace.
Benchmarking code generation
Benchmark: HumanEval

Performance can be improved at three levels, from the bottom up:
- Base models
- Fine-tuning / instruct-tuning
- Techniques
Benchmarking code generation
Base models

Base models are the GPTs and Llamas of the world: not fine-tuned for a particular task.

Open LLMs (weights are open; easy to do custom tuning and experimentation):
● CodeLlama (WizardCoder)
● StarCoder
● Replit-code-v1-3b
● Mixtral-8x7b

Closed LLMs (weights are closed; tuning and experimentation are limited):
● GPT-4
● Gemini Ultra
● Claude 2.1
● Grok
Benchmarking code generation
Base models

Another example of an open base model: Gemma-7B.
Benchmarking code generation
Fine-tuning / instruct-tuning

Instruct-tuned models are models that are fine-tuned with instructions: in this case, for code.

Instruct-tuning involves a prompt which contains an instruction, and a response. Including the instruction is important so that the model knows how to follow new instructions at inference time.

Examples: synthesis, fixing a bug, explaining code.
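A minimal sketch of what such training records might look like for those three example tasks; the field names and contents are illustrative, not taken from any particular dataset.

    # Hypothetical instruct-tuning records for code (field names are assumptions).
    records = [
        {   # synthesis
            "instruction": "Write a Python function that returns the n-th Fibonacci number.",
            "response": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a",
        },
        {   # fix a bug
            "instruction": "Fix the off-by-one error:\nfor i in range(len(xs) + 1):\n    print(xs[i])",
            "response": "for i in range(len(xs)):\n    print(xs[i])",
        },
        {   # explain code
            "instruction": "Explain what this does:\nsquares = [x * x for x in range(10) if x % 2 == 0]",
            "response": "It builds a list of the squares of the even numbers from 0 through 8.",
        },
    ]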


Benchmarking code generation
Fine-tuning / instruct-tuning

Instruct-tuning is clearly useful. How can we scale it up?

As LLMs and datasets get larger, we increasingly need to think creatively about how to gather data in order to improve.
One example of this is COMMITPACK: 4 terabytes of Git commits across 350 programming languages (Muennighoff et al., 2023; ICLR 2024).

Git commits naturally pair code changes with human instructions.
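A rough sketch of that idea: treat the commit message as the instruction and the diff as the response. The repo path and record format below are illustrative, not COMMITPACK's actual pipeline.

    import subprocess

    def commit_to_example(repo: str, sha: str) -> dict:
        # Commit subject line -> "instruction"
        msg = subprocess.run(["git", "-C", repo, "log", "-1", "--format=%s", sha],
                             capture_output=True, text=True).stdout.strip()
        # Diff introduced by the commit -> "response"
        diff = subprocess.run(["git", "-C", repo, "show", "--format=", sha],
                              capture_output=True, text=True).stdout
        return {"instruction": msg, "response": diff}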


Benchmarking code generation
Techniques

Technique can make all the difference. This is broadly broken down into reasoning methods and decision-making methods.

HumanEval scores:
● GPT-3.5 (175B parameters), technique: LATS — 83.8
● GPT-4 (est. 1.7T parameters), technique: none — 79.3

💡 GPT-3.5 with LATS beats GPT-4, despite being roughly 10x smaller!
Benchmarking code generation
Chain-of-Thought
Reasoning method

Chain-of-Thought (CoT) prompts LLMs to sequentially generate reasoning steps from input to output. It was introduced by Wei et al. (2022) and demonstrated at scale in PaLM: Scaling Language Modeling with Pathways (Chowdhery, Catasta et al., 2022).

However, it suffers from error propagation as the chain length increases.

(Diagram: Input → S1 → S2 → S3 → Output, a single chain of reasoning steps.)
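As a concrete illustration, a CoT-style prompt for a coding task simply asks the model to reason before it writes code. The wording is hypothetical, and `llm` is a stand-in for whatever model call you use.

    def llm(prompt: str) -> str:        # stand-in for a real model call
        return "Step 1: ...\nStep 2: ...\ndef longest_common_prefix(strs): ..."

    prompt = (
        "Task: write a Python function longest_common_prefix(strs: list[str]) -> str.\n"
        "First reason step by step about the algorithm and edge cases,\n"
        "then output the final implementation.\n"
    )
    answer = llm(prompt)                # reasoning steps followed by code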
Benchmarking code generation
Tree-of-Thoughts
Reasoning method

Tree-of-Thoughts (ToT) extends CoT by exploring multiple reasoning paths using search algorithms like BFS and DFS. (Yao et al., May 2023)

That said, it is limited by relying solely on the LLM's internal knowledge.

(Diagram: the input branches into several candidate steps at each level, forming a tree from input to output.)
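A breadth-first sketch of the idea, under the assumption that an LLM can both propose candidate next steps and score partial chains; `propose` and `score` below are trivial stand-ins, not the paper's implementation.

    def propose(task: str, chain: str, n: int) -> list[str]:
        # Stand-in: a real system would ask the LLM for n candidate next steps.
        return [f"candidate step {i} for {task}" for i in range(n)]

    def score(task: str, chain: str) -> float:
        # Stand-in: a real system would ask the LLM (or a heuristic) to rate the chain.
        return float(len(chain))

    def tree_of_thoughts(task: str, depth: int = 3, breadth: int = 3, keep: int = 2) -> str:
        frontier = [""]                                   # partial chains of thought
        for _ in range(depth):
            candidates = [chain + "\n" + step
                          for chain in frontier
                          for step in propose(task, chain, breadth)]
            candidates.sort(key=lambda c: score(task, c), reverse=True)
            frontier = candidates[:keep]                  # keep only the best few (pruned BFS)
        return frontier[0]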
Benchmarking code generation
Reasoning via Planning
Reasoning method

Reasoning via Planning (RAP) (Hao et al., October 2023) uses Monte Carlo Tree Search for planning chains of reasoning.

However, it also lacks external feedback.

(Diagram: a search tree of reasoning steps leading to multiple candidate outputs.)
Benchmarking code generation
ReAct
Decision-making method

ReAct prompts LLMs with alternating actions and observations for decision-making in interactive environments. (Yao et al., March 2023)

However, it greedily follows one trajectory and cannot adapt.

(Diagram: a single chain of steps, with an observation from the environment fed back at each step.)
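A minimal sketch of the action/observation loop, assuming a model call `llm` and an environment/tool call `run_tool`; both are stand-ins here, not ReAct's published code.

    def llm(transcript: str) -> str:                 # stand-in for a model call
        return "Thought: done.\nAction: finish[ok]"

    def run_tool(action_line: str) -> str:           # stand-in for an environment/tool call
        return "observation text"

    def react(task: str, max_steps: int = 5) -> str:
        transcript = f"Task: {task}\n"
        for _ in range(max_steps):
            step = llm(transcript)                   # model emits a thought and an action
            transcript += step + "\n"
            if "Action: finish" in step:             # the model decides it is done
                break
            transcript += "Observation: " + run_tool(step) + "\n"
        return transcript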
Benchmarking code generation
Reflexion
Decision-making method

Reflexion adds self-reflection to ReAct. This improves overall performance by allowing the LLM more time to think through the problem, similar to CoT. (Shinn et al., October 2023)

However, it does not consider alternative options at each step.

(Diagram: the ReAct loop with a reflection step feeding back into the chain.)
Benchmarking code generation
Language Agent Tree Search
Reasoning + decision-making method

LATS unifies the strengths of both reasoning and decision-making methods through principled search, while overcoming limitations via environmental feedback and self-reflection. (Zhou et al., December 2023)

GPT-4 + LATS is the current best performer on the HumanEval benchmark, with a score of 94.4.

(Diagram: a search tree of steps, with observations and reflections fed back at each level.)
Applications and agents

AI has tackled every aspect of software engineering. (Category list below not exhaustive.)

Categories: project creation, migrations, IDE, issues & tests, docs.
Example tools: Coffee by Coframe, Cody by Sourcegraph, and others.
Applications and agents
Deep dive: GPT-Migrate
Applications and agents
Deep dive: Coffee
AI x Software Engineering

● Using code generation wisely

● Prompt engineering for code gen

● AI-driven development
AI x Software Engineering
Using code generation wisely

Credit to Joshua Morony


AI x Software Engineering
Using code generation wisely
Why?

● Learning is important
● Understanding your code is important
● Maintainability and knowledge transfer are important
  ○ Fully LLM-written projects tend to produce “spaghetti code”. I know first-hand!
AI x Software Engineering
Prompt engineering for code gen

Prompt engineering is likely more important for code generation than for any other area, due to the precision required. Luckily, engineers are naturally good prompt engineers.

Principled Instructions Are All You Need (Bsharat et al., December 2023) gives 26 general prompting guidelines (see chart).

Worth adding: only include the minimum viable context; not all context windows are created equal.
AI x Software Engineering
Prompt engineering for code gen

AlphaCodium (Ridnik et al., January 2024) formalized “Flow Engineering” for software engineering workflows, which many practitioners had already been using. With GPT-4 on the CodeContests validation set, pass@5 accuracy improved from 19% with a well-crafted single prompt to 44% with AlphaCodium.
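A minimal sketch of a flow-engineering loop in that spirit (not AlphaCodium's actual pipeline): generate code, run it against tests, feed failures back into the prompt, and repeat. `generate` is a stand-in for a model call.

    from typing import Optional

    def generate(prompt: str) -> str:                 # stand-in for an LLM call
        return "def solve(x):\n    return x"

    def run_tests(code: str, tests: str) -> Optional[str]:
        scope: dict = {}
        try:
            exec(code + "\n\n" + tests, scope)        # unsafe outside a sandbox; illustration only
            return None                               # no exception -> all tests passed
        except Exception as e:
            return repr(e)

    def flow(problem: str, tests: str, max_iters: int = 4) -> str:
        code = generate(problem)
        for _ in range(max_iters):
            error = run_tests(code, tests)
            if error is None:
                break
            code = generate(f"{problem}\n\nPrevious attempt:\n{code}\n\n"
                            f"It failed with: {error}\nFix it and return the full function.")
        return code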
AI x Software Engineering
Prompt engineering for code gen

Prompt composition can become complex when you’re dealing with code-writing agents performing multiple types of software engineering tasks.

One solution is organizing prompts into a hierarchy and creating a constructor that can compose them together, along with any variables you need to pass in from your code.

The simplest way to do this is using text files in labeled directories in your /prompts/ directory. I’m sure there will be headless prompt CMSs at some point.

(Figure: prompt hierarchy in GPT-Migrate)
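A minimal sketch of such a constructor; the directory layout and file names are illustrative, not GPT-Migrate's actual structure.

    from pathlib import Path

    def compose_prompt(parts: list[str], variables: dict, root: str = "prompts") -> str:
        # Read each named section from prompts/<part>.txt and join them,
        # then substitute {placeholders} with the supplied variables.
        sections = [(Path(root) / f"{part}.txt").read_text() for part in parts]
        return "\n\n".join(sections).format(**variables)

    # Hypothetical usage:
    # prompt = compose_prompt(
    #     ["system/persona", "tasks/migrate", "guidelines/style"],
    #     {"source_lang": "Python", "target_lang": "Go"},
    # )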
AI x Software Engineering
Prompt engineering for code gen

SudoLang is a natural-language, constraint-based programming pseudolanguage with an LLM as the interpreter. What?

More simply, it combines natural language elements and simple coding conventions for better prompting.

SudoLang prompts can often be written with 20-30% fewer tokens than natural language.

The expressiveness and precision help when writing code, as well as when “programming” the LLM to serve as an application itself.
AI x Software Engineering
AI-driven development: practical pointers

● Language preference: LLMs do better with more popular languages. They also benefit from the clarity of typed languages.

● Project structure: Try to keep files small and modular. Use headers and TDDs to help the LLM navigate and generate files.

● Interface-oriented programming: LLMs need context. Interfaces (input, output, transformation, types) give this. Use IOP in prompts (see the sketch after this list).

● Logs-in-the-loop: When debugging (or in a background loop), LLMs can digest logs and error traces. Very helpful!

● Tests, tests, tests: When generating entire functions and files, test coverage is CRUCIAL. (LLMs can write these too!)

● Output structure: YAML uses as little as 50% of the tokens that JSON output does. Even with JSON mode, YAML wins.
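A small sketch combining two of the pointers above: interface-oriented prompting and YAML output. The interface, prompt wording, and `llm` call are all hypothetical; `yaml.safe_load` assumes the PyYAML package is installed.

    import yaml                                    # PyYAML, assumed available

    def llm(prompt: str) -> str:                   # stand-in for a real model call
        return "code: |\n  def parse_order(raw):\n      ...\nnotes: handles empty input"

    INTERFACE = (
        "Interface to implement:\n"
        "  parse_order(raw: str) -> Order\n"
        "  Order = {id: int, items: list[str], total: float}\n"
    )

    prompt = INTERFACE + "Write parse_order. Reply in YAML with keys `code` and `notes` only."
    reply = yaml.safe_load(llm(prompt))            # YAML is usually terser than equivalent JSON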
Acknowledgements

● Michele Catasta
● Pavlo Razumovskyi
● Glavin Wiechert
● Tinah Hong
● Alex Korshuk
● John Whaley

Thank you!
Questions

josh@coframe.ai
