Approximately Aligned Decoding
ABSTRACT
It is common to reject undesired outputs of Large Language Models (LLMs);
however, current methods to do so require an excessive amount of computation,
or severely distort the distribution of outputs. We present a method to balance the
distortion of the output distribution with computational efficiency, allowing for the
generation of long sequences of text with difficult-to-satisfy constraints, with less
amplification of low probability outputs compared to existing methods. We show
through a series of experiments that the task-specific performance of our method
is comparable to methods that do not distort the output distribution, while being
much more computationally efficient.
1 INTRODUCTION
Language models sometimes generate undesirable outputs, such as syntactically-incorrect code, hal-
lucinated PII, or profanity. These conditions, which we collectively refer to as errors for the remain-
der of the paper, can be detected with incremental parsers, regular expression matching, or even
simple substring searches. However, once detection occurs, there are several competing methods for
mitigating errors in the output.
One set of methods, constrained generation (Beurer-Kellner et al., 2024; Geng et al., 2024; Melcer
et al., 2024), avoids errors by disabling the generation of any token that immediately leads to such
an error. While this method is effective, it can lead to the amplification of low-probability outputs.
Another class of methods avoids errors without any amplification of low-probability outputs, at the
cost of additional computation. Rejection sampling is the simplest such method; i.e. if the output
contains an error, simply generate another sample until the output is acceptable. Adaptive Sam-
pling with Approximate Expected Futures (ASAp) (Park et al., 2024) provides a performance im-
provement over rejection sampling while maintaining the output distribution by effectively sampling
without replacement, but there are still many situations in which it may converge too slowly.
∗ Work performed at Amazon
We present a method that allows for a useful midpoint in the tradeoff between computational effi-
ciency and maintenance of the output distribution. In our experiments, we show that our method
obtains task-specific performance on par with ASAp, while converging significantly faster when the
constraints are difficult to satisfy.
2 PRELIMINARIES
We first describe autoregressive language models and their properties. We then discuss speculative
decoding, a method closely related to the algorithm that we will introduce.
Language models with many parameters (“Large Language Models”, or LLMs) are useful, but can
require considerable computational resources to evaluate. Moreover, the autoregressive sampling
process is inherently sequential, meaning that additional parallel computation resources cannot be
fully utilized to decrease generation latency, especially for longer sequences.
Speculative decoding (Leviathan et al., 2023; Miao et al., 2024; Cai et al., 2024; Li et al., 2024) is
one popular approach to decrease latency. This method assumes the existence of a small speculative
model (SSM) S that approximates the output of the LLM, using fewer computational resources.
Given input tokens x1...n , the SSM is autoregressively sampled to produce tokens xn+1...m . Then, the LLM P is used to compute P (xi+1 |x1...i ) for i ∈ [n . . . m]; this computation is
parallelizable. Finally, Algorithm 2 is used to select a prefix x1...k for k ∈ [n, m] of tokens to accept;
all later tokens are discarded. Additionally, because the probabilities P (·|x1...k ) have already been
computed, Algorithm 2 samples a new token xk+1 . This process maintains the property that the
distribution of sequences produced by this process matches the sequence distribution of P .
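The accept/reject rule at the heart of this procedure can be sketched as follows. This is an illustrative reimplementation of the standard rule from Leviathan et al. (2023), not the paper's Algorithm 2 itself; the function name and array conventions are ours.

```python
import numpy as np

def spec_accept(draft_tokens, ssm_probs, llm_probs, rng):
    """Accept a prefix of draft_tokens (sampled from the SSM) so that the
    accepted tokens are distributed as if sampled directly from the LLM.

    ssm_probs[j]: SSM next-token distribution used to sample draft_tokens[j].
    llm_probs[j]: LLM next-token distribution at the same position; llm_probs
                  has one extra entry, for the position after the last draft token.
    Returns the accepted tokens plus one freshly sampled token.
    """
    out = []
    for j, tok in enumerate(draft_tokens):
        # Accept the draft token with probability min(1, P(tok) / S(tok)).
        if rng.random() < min(1.0, llm_probs[j][tok] / ssm_probs[j][tok]):
            out.append(tok)
        else:
            # On rejection, resample from the residual distribution
            # proportional to max(0, P - S); this restores the LLM's marginal.
            residual = np.clip(llm_probs[j] - ssm_probs[j], 0.0, None)
            out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            return out
    # Every draft token was accepted: sample one more token from the LLM's
    # already-computed distribution at the next position.
    out.append(int(rng.choice(len(llm_probs[-1]), p=llm_probs[-1])))
    return out
```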
[Figure 1 diagram: token trees showing P(· | x1...n) under the unconstrained, ideal, and constrained distributions.]
Figure 1: Sampling in an example domain where B = {AA}. (Left) The language model assigns
equal probability to all sequences. (Center) With ‘AA’ as an error, its probability mass should be
equally redistributed to all other sequences. (Right) With constrained generation, the entire proba-
bility mass of ‘AA’ is shifted onto ‘AB’, significantly overrepresenting its probability.
Problem 1. Given an autoregressive language model P over alphabet V and error set ℬ ⊂ V∗, provide a method to sample from P̂^ℬ, i.e. the distribution of P conditioned on never producing a sequence in ℬ.
Constrained generation attempts to solve the error-free generation problem by using a greedy algorithm: during token selection, the algorithm always avoids selecting any tokens that immediately lead to an error. Note that this algorithm assumes that if string x1...n ∉ ℬ, then there exists at least one available token xn+1 ∈ V such that x1...n+1 ∉ ℬ; however, this assumption may be weakened if backtracking is allowed, in cases where every token leads to an immediate error.
The constrained generation algorithm has the effect of sampling from the following probability
distribution for each token:
$$
CP^{\mathcal{B}}(x_i \mid x_{1\ldots i-1}) = \mathrm{Normalize}\left(\begin{cases} 0 & x_{1\ldots i} \in \mathcal{B} \\ P(x_i \mid x_{1\ldots i-1}) & x_{1\ldots i} \notin \mathcal{B} \end{cases}\right) \qquad (2)
$$
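A single step of this scheme is easy to sketch; the snippet below is our illustration, assuming a vocabulary-sized probability vector from the model and an is_error predicate over token sequences (both names hypothetical).

```python
import numpy as np

def constrained_step(prefix, next_probs, is_error, rng):
    """Sample one token from CP^B(. | prefix): zero out every token that would
    immediately create an error, then renormalize the remaining mass."""
    masked = np.array(next_probs, dtype=float)
    for tok in range(len(masked)):
        if masked[tok] > 0.0 and is_error(prefix + [tok]):
            masked[tok] = 0.0
    total = masked.sum()
    if total == 0.0:
        # Every continuation is an immediate error; practical implementations
        # backtrack here (see the discussion of the prefix assumption above).
        raise RuntimeError("dead end: every next token leads to an error")
    return int(rng.choice(len(masked), p=masked / total))
```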
As in Section 2.1, a repeated sampling process results in a derived distribution CP^ℬ(x1...n) = ∏_{i∈[1...n]} CP^ℬ(xi | x1...i−1). It is often the case that for sequence x1...n, CP^ℬ(x1...n) ≫ P̂^ℬ(x1...n); i.e. low-probability samples are amplified by the constrained generation process.
The fundamental issue is that the constrained generation algorithm commits to a given prefix, even
if the most probable sequences beginning with that prefix are errors. Figure 1 provides a simple
example of this occurrence. Note that this distortion is even worse in low-entropy scenarios; if
P (B|x1...n , A) were lowered to 0.0001, it would still be the case that CPB (AB|x1...n ) = 0.5. This
amplification effect compounds exponentially for longer sequences.
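Concretely, in the Figure 1 domain with P(B | x1...n A) lowered to 0.0001, the two distributions assign the continuation 'AB' very different mass (our arithmetic, following the definitions above):

$$
CP^{\mathcal{B}}(\mathrm{AB} \mid x_{1\ldots n}) = 0.5 \cdot \frac{0.0001}{0.0001} = 0.5,
\qquad
\hat{P}^{\mathcal{B}}(\mathrm{AB} \mid x_{1\ldots n}) = \frac{0.5 \cdot 0.0001}{1 - 0.5 \cdot 0.9999} \approx 0.0001,
$$

so constrained generation overrepresents 'AB' by a factor of roughly 5000 in this single step.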
Algorithm 3 ASAp
procedure ASAP(P, ℬ, x1...n)                        ▷ x1...n is prompt
    P̂^B ← P
    while Limit not reached do
        Sample sequence xn+1...m ∼ P̂^B(· | x1...n) until error or stopping condition
        if x1...m ∉ ℬ then break
        P̂^B ← ADDBADSAMPLE(P̂^B, x1...m)            ▷ Remove x1...m as a possible sequence
    return x1...m
procedure ADDBADSAMPLE(P̂^B, x1...m)                ▷ In practice, only adjust xn+1...m
    P̂^{B∪{x}} ← P̂^B
    for xi ∈ (xm, . . . , x1) do                    ▷ Note that token sequence is reversed
        ▷ Remove probability of x1...m, without changing probability of any other sequence ◁
        P̂^{B∪{x}}(xi | x1...i−1) ← P̂^B(xi | x1...i−1) − P̂^B(xi...m | x1...i−1)
        Renormalize P̂^{B∪{x}}(· | x1...i−1)
    return P̂^{B∪{x}}
Adaptive Sampling with Approximate Expected Futures (ASAp) (Park et al., 2024) is a technique to sample exactly from the distribution P̂^ℬ. ASAp begins similarly to rejection sampling, but it iteratively builds a set B ⊆ ℬ containing all encountered samples that have been rejected so far. Because B is finite, the conditional probabilities P̂^B(xi | x1...i−1) can be efficiently calculated, allowing the algorithm to sample from P̂^B exactly. If the sampled sequence is a member of ℬ, it is added to B, and the sampling process repeats.
In the limit of repeated samples, B will approach ℬ, and therefore P̂^B will approach P̂^ℬ. Importantly, if x ∼ P̂^B is sampled such that x ∉ ℬ, this sample may be accepted, even though B ≠ ℬ. This procedure is equivalent to sampling without replacement, adapted to autoregressive generation. While ASAp succeeds in cases where only a small number of errors comprise the majority of the probability mass, its generation speed suffers when there are a large number of errors—each error must be discovered before it is added to B. When the error set is dense, its performance characteristics are similar to rejection sampling, as there are an exponential number of error sequences that must be discovered as generation length increases.
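The ADDBADSAMPLE update lends itself to a compact implementation: keep a dictionary of adjusted next-token distributions, keyed by prefix, and walk the rejected sequence backwards. The sketch below is our reconstruction of Algorithm 3's update, with base_probs standing in for a call to the underlying model (a hypothetical helper).

```python
import numpy as np

def add_bad_sample(cond, base_probs, x, n):
    """Fold a rejected sequence x (with prompt length n) into the adjusted
    conditionals `cond`, so that the adjusted model assigns it zero probability
    without changing the probability of any other sequence.

    cond: dict mapping a prefix (tuple of token ids) to its adjusted
          next-token distribution; prefixes not present fall back to base_probs.
    """
    suffix_prob = 1.0  # P-hat^B(x_{j..end} | x_{1..j-1}), accumulated from the end
    for j in range(len(x) - 1, n - 1, -1):
        prefix = tuple(x[:j])
        dist = np.array(cond.get(prefix, base_probs(prefix)), dtype=float)
        suffix_prob *= dist[x[j]]       # use the old conditional, before adjustment
        dist[x[j]] -= suffix_prob       # remove the mass of the rejected suffix
        cond[prefix] = dist / dist.sum()
    return cond
```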
We note two additional methods that, although they use very different formalizations and implementations from each other, rely on a similar core idea to approximately sample from P̂^ℬ. In both cases, for any given prefix x1...n, these methods create an estimator of ∑_{xn+1...m ∈ Σ∗} P(xn+1...m | x1...n) · 1[x1...m ∈ ℬ]; i.e. the likelihood of an error among all sequences that begin with a specific prefix, weighted by the probability of generating each sequence. This is used to estimate P̂^ℬ.
Lew et al. (2023) create this estimate using Monte Carlo sampling. They apply several optimiza-
tions such as sampling without replacement, and aggressive pruning of low-probability branches. In
contrast, Zhang et al. (2024), building on the work of Zhang et al. (2023), first distill an LLM into a
hidden Markov model (HMM) with a tractable number of states (thousands or tens of thousands). If
the constraint can be expressed as a DFA over tokens—not always possible for certain constraints—
their method takes the product system of the DFA and HMM, and then calculates the probability of
an error in this product system.
Both of these methods, while exhibiting impressive results on many tasks, also face issues in certain
dense error sets—the approximation of the posterior tends to become inaccurate when arbitrary
generations almost immediately lead to an error.
4 METHOD
4.1 PREVIOUS ITERATIONS OF ASAP ARE (ALMOST) SMALL SPECULATIVE MODELS
For some iteration of ASAp, with B as the set of observed errors so far, let x = (x1 , . . . , xn ) be a trace drawn from P̂^B, where it is discovered that x ∈ ℬ. We observe that P̂^B and P̂^{B∪{x}} are almost always near-identical probability distributions, with P̂^{B∪{x}} generally being the "more accurate" distribution because it incorporates an additional error sample.
Our method reduces computation by using the sample x ∼ P̂^B to approximate a sample x′ ∼ P̂^{B∪{x}}, in a similar manner to how speculative decoding uses a sample from an SSM to approximate a sample from an LLM. By evaluating SPECSAMPLE(x, P̂^B, P̂^{B∪{x}}), our method obtains a prefix of x that can be used as a starting point for sampling again. Because the distributions P̂^B and P̂^{B∪{x}} are so close to each other, this prefix usually covers most of the length of x. In contrast, ASAp would backtrack to the beginning of the generation. This process is given as Algorithm 4; we refer to it as Approximately Aligned Decoding, or AprAD.
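Putting the pieces together, a hedged sketch of this loop (our reconstruction of Algorithm 4, reusing the add_bad_sample helper from the ASAp sketch above and hypothetical base_probs / is_error callbacks) looks like the following:

```python
import numpy as np

def aprad_generate(base_probs, is_error, prompt, max_len, rng):
    """Approximately Aligned Decoding (illustrative sketch).

    After each rejected sample, the pre-update conditionals play the role of the
    draft model and the post-update conditionals play the role of the target, so
    a speculative accept/reject pass decides how much of the sample to keep."""
    cond = {}                      # ASAp-style adjusted conditionals (P-hat^B)
    n = len(prompt)
    x = list(prompt)
    while len(x) < max_len:
        dist = np.asarray(cond.get(tuple(x), base_probs(tuple(x))), dtype=float)
        x.append(int(rng.choice(len(dist), p=dist)))
        if not is_error(x):
            continue
        # Snapshot the conditionals the rejected continuation was sampled from ...
        old = [np.asarray(cond.get(tuple(x[:j]), base_probs(tuple(x[:j]))), dtype=float)
               for j in range(n, len(x))]
        # ... fold the rejected sequence into cond, giving P-hat^{B union {x}} ...
        add_bad_sample(cond, base_probs, x, n)
        # ... and keep the prefix that speculative sampling accepts under the update.
        kept = list(x[:n])
        for j in range(n, len(x)):
            new = np.asarray(cond.get(tuple(x[:j]), base_probs(tuple(x[:j]))), dtype=float)
            q, tok = old[j - n], x[j]
            if rng.random() < min(1.0, new[tok] / q[tok]):
                kept.append(tok)
            else:
                residual = np.clip(new - q, 0.0, None)
                kept.append(int(rng.choice(len(residual), p=residual / residual.sum())))
                break
        x = kept
    return x
```

When the distributions before and after the update are close, the accepted prefix covers most of x, so only a few tokens need to be regenerated.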
However, AprAD does not perfectly maintain the output distribution: Algorithm 4 amplifies some sequence probabilities because it only invokes SPECSAMPLE after discovering an error. To maintain the output distribution, SPECSAMPLE should always be invoked for strings x1...n where ∃i ∈ [1 . . . n], P̂^B(xi | x1...i−1) < P(xi | x1...i−1)—but the algorithm has no way of checking whether this condition holds without iterating through every suffix, negating any performance benefit.
Even though AprAD does not perfectly maintain the output distribution, we show in the following sections that it provides a very useful midpoint in the tradeoff of computational complexity versus task-specific performance and accuracy.
4.2 ANALYSIS
Let A^ℬ_P(x1...n) represent the probability of the AprAD method producing sequence x1...n.
For x1...n ∈ ℬ, A^ℬ_P(x1...n) = 0. For all other sequences, we provide evidence that AprAD more closely follows the ideal distribution, compared to constrained generation. While the nature of the iterative process makes it difficult to write a closed-form description of the probability amplification A^ℬ_P(x1...n) / P̂^ℬ(x1...n), less probability amplification occurs with AprAD than with constrained generation when an error is detected, as an error's probability mass is "distributed" over many sequences due to the
Error Set                      | ASAp KL-div | ASAp Ratio | Constrained KL-div | Constrained Ratio | AprAD (Ours) KL-div | AprAD (Ours) Ratio
∅                              | 0.0014      | 1.000      | 0.0014             | 1.000             | 0.0014              | 1.000
AAA                            | 0.0014      | 1.020      | 0.0075             | 1.000             | 0.0046              | 1.004
AAA, AAC                       | 0.0012      | 1.041      | 0.0429             | 1.000             | 0.0157              | 1.013
AAA, ACC                       | 0.0013      | 1.042      | 0.0138             | 1.000             | 0.0093              | 1.009
AAA, CCC                       | 0.0010      | 1.044      | 0.0155             | 1.000             | 0.0074              | 1.010
AAA, AAB, ABA, BAA             | 0.0013      | 1.093      | 0.0504             | 1.000             | 0.0224              | 1.024
A** except AAC                 | 0.0014      | 1.232      | 0.3836             | 1.113             | 0.1540              | 1.205
*** except AAA, AAB, ABA, BAA  | 0.0000      | 3.644      | 0.1771             | 1.670             | 0.0521              | 2.142
*** except AAA, BAA            | 0.0000      | 5.701      | 0.0000             | 1.784             | 0.0000              | 2.653
Table 1: KL-Divergence and generation ratios for simulated task with various error sets. Lower is
better for both. Stars in the error set are wildcards; i.e. AB* means ABA, ABB, and ABC. Note
that constrained generation will backtrack if all tokens for a given prefix are disallowed, resulting in
ratios greater than 1 for some error sets.
speculative sampling operation. In contrast, with constrained generation, an error’s probability mass
is moved entirely to sequences that share n − 1 prefix tokens.
We empirically show that AprAD is closer to the ideal distribution, compared to constrained decod-
ing, by creating a testbench to simulate an environment where the ideal distribution is known. The
testbench contains a simulated language model that always returns one of three tokens (A, B, and
C) with equal probability. We mark k sequences of length 3 as errors, and use the sampling method
under test to sample 10000 sequences of length 3. The ideal distribution is trivial to compute—
probability 1/(27−k) for every non-error sequence. To measure how a sampling process compares to
the ideal distribution, we compute the KL-divergence between the observed distribution and ideal.
Additionally, we measure the Generation Ratio; i.e. how many times the language model must be
evaluated, divided by the number of tokens generated in the output.
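This testbench is small enough to reproduce in a few lines; below is our own sketch, with plain rejection sampling standing in for the method under test (any of the samplers sketched above could be dropped in).

```python
import itertools, math, random
from collections import Counter

def simulate(errors, n_samples=10000, seed=0):
    """Toy testbench: a 'model' over tokens A/B/C that assigns every length-3
    sequence equal probability, with `errors` as the set of forbidden sequences."""
    rng = random.Random(seed)
    counts, model_calls, tokens_out = Counter(), 0, 0
    for _ in range(n_samples):
        while True:                      # rejection sampling as the method under test
            seq = "".join(rng.choice("ABC") for _ in range(3))
            model_calls += 3             # one model evaluation per generated token
            if seq not in errors:
                break
        counts[seq] += 1
        tokens_out += 3
    # Ideal distribution: uniform 1/(27-k) over the non-error sequences.
    ideal = {"".join(s): 1 / (27 - len(errors))
             for s in itertools.product("ABC", repeat=3) if "".join(s) not in errors}
    kl = sum((counts[s] / n_samples) * math.log((counts[s] / n_samples) / p)
             for s, p in ideal.items() if counts[s] > 0)
    return kl, model_calls / tokens_out  # KL-divergence, generation ratio

print(simulate({"AAA", "AAC"}))
```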
The results are shown in Table 1, indicating that our method approximates the ideal distribution
more closely than constrained generation, with a lower generation ratio than ASAp.
5 EVALUATION
While Section 4.2 shows that our method performs well in a simulated domain, the following exper-
iments test the sampling methods on a series of more difficult, real-world tasks.
It is common in poetry or creative writing exercises to write text without using a specific letter; a
product of this exercise is called a lipogram. Lipograms where the excluded letter is a vowel tend to
be more difficult to create than lipograms with other letters. Large language models often fail at this task and, more generally, at most tasks that depend on individual letters rather than entire tokens.
We use Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) to generate lipograms with vowels as the ex-
cluded letter. We prompt the LLM to perform one of five simple tasks (detailed in Appendix A).
Each task is appended to instructions to avoid using one of the five vowels, resulting in 25 prompts.
For each prompt, we generate a completion with four sampling methods: unconstrained generation,
constrained generation, ASAp, and AprAD, for up to 200 tokens. If the process reaches 2000 model
invocations, generation is interrupted, and the last sequence before an error was detected is returned.
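For this task, the error set can be implemented as a simple incremental predicate over the detokenized completion. The sketch below is our illustration of such a detector (the tokenizer interface and prompt-length handling are assumptions), not necessarily the exact implementation used in the experiments.

```python
def make_lipogram_detector(banned_letter, tokenizer, prompt_len_chars):
    """Return is_error(token_ids): True as soon as the banned letter appears
    anywhere in the generated portion of the detokenized text (case-insensitive).
    Any extension of an error is also an error, matching the prefix assumption."""
    banned = banned_letter.lower()

    def is_error(token_ids):
        text = tokenizer.decode(token_ids)           # assumed HF-style decode()
        generated = text[prompt_len_chars:].lower()  # only constrain the completion
        return banned in generated

    return is_error
```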
We then ask human raters to score each completion on quality, regardless of whether the constraint was followed, on a scale of 1-5. If the forbidden letter is detected in the output, we mark the constraint as violated. Otherwise, we also ask the human raters to decide whether the output violates the intent of the constraint, e.g. by answering in a foreign language, adding unnecessary accents, swapping in Cyrillic lookalike characters, etc. Additional information about the rating process is provided in Appendix A.
Summarize the history of artificial intelligence without using the letter "A". AprAD (Ours)
The history of intelligent technology unfolds through key milestones in computing. In the
mid-20th century, pioneering minds nurtured the seeds of this field through the development
of logistic models, which evolved into expert systems in the 1960s. These systems, relying on
heuristics, could solve complex problems in fields like medicine or meteorology.
Summarize the history of artificial intelligence without using the letter "A". Unconstrained
In the early days, brilliant minds pondered on creating machines that could mimic human
intelligence. These pioneers, including Alan Turing and Marvin Minsky, theorized and exper-
imented. They aimed to create intelligent machines, using logic and mathematics.
Summarize the history of artificial intelligence without using the letter "A". ASAp
History of Intelligence
Summarize the history of artificial intelligence without using the letter "A". Constrained
In the mid-20th century, scientists delved deep into investiging computionl modeles for intel-
ligencen modeled on the workings of the brn. Pioneers such как E.N. Turing, Clrk Quincy
Shrffngton, иnddd W. Ross Quilp, аnd Маrvin Minsky led ths fld. They explored logic-gаted
progrаms, thm brgth mchr tо lf understnding the cptures of th gnntl mind.
Figure 2: Representative generation samples for all four methods, using Mistral-7B-Instruct-v0.2. Appearance of the banned letter is bolded, and non-ASCII characters (all Cyrillic in this example) are colored red. Full samples are provided in Appendix D.
The results of this evaluation are provided in Table 2, and a representative sample of the outputs is provided in Figure 2. All outputs and rater scores are included in the supplemental material, and additional examples are provided in Appendix D.
As shown by these results, AprAD consistently produces high-quality outputs, nearly matching
the readability of unconstrained generation. Additionally, it consistently follows the intent of a
constraint—while all methods except for unconstrained generation follow the constraint, constrained
generation often does so in an undesirable manner, rather than by selecting appropriate words that
do not contain the banned letter. Finally, while the computation required by AprAD is greater than that of constrained generation, our method is usually able to finish generation within the allotted computation limit, while ASAp struggles to generate more than a handful of tokens under the same constraint.
Table 2: Lipogram evaluation results, including subjective rater evaluations of output quality and
adherence to the constraint. Higher is better for Quality and Constraint Intent; lower is better for
Generation Ratio. Note that constraint intent is automatically set to 1 if the banned letter is detected
in the output. Generation is stopped after 2000 model invocations and the output provided as-is; for
ASAp, this often resulted in very short outputs and thus extreme generation ratios and low scores.
Several raters interpreted these short outputs as violating constraint intent.
We created a program to incrementally detect hallucinated API calls in partial Python programs,
based on the Pyright language server (Microsoft, 2019). Note that in order to satisfy the assump-
tions in Section 3—that any string with an error as a prefix is itself an error—we constructed the
detector to act conservatively. False negatives are strongly preferred over false positives. For ex-
ample, even if name foo is never defined, the detector does not consider the incomplete program
Size | Method        | Pass@1 | Pass@5 | !NameErr@1 | !NameErr@5 | Gen. Ratio
15b  | Unconstrained | 0.214  | 0.498  | 0.831      | 0.996      | 1.000 ± 0.000
15b  | Ours          | 0.259  | 0.541  | 0.976      | 1.000      | 1.080 ± 0.385
15b  | ASAp          | 0.261  | 0.536  | 0.976      | 1.000      | 1.555 ± 3.906
15b  | Constrained   | 0.221  | 0.506  | 0.930      | 1.000      | 1.005 ± 0.053
7b   | Unconstrained | 0.119  | 0.345  | 0.800      | 0.987      | 1.000 ± 0.000
7b   | Ours          | 0.145  | 0.375  | 0.950      | 0.993      | 1.064 ± 0.402
7b   | ASAp          | 0.152  | 0.395  | 0.952      | 0.993      | 1.468 ± 2.574
7b   | Constrained   | 0.124  | 0.345  | 0.891      | 0.993      | 1.005 ± 0.029
Table 3: Subset of tasks where at least one trial results in a different output: 233 tasks (20.4%) for
15b, 304 tasks (26.7%) for 7b. For both model sizes, of the tasks where at least one model output is
different, an average of 1.5 out of 5 outputs are different. Lower is better for generation ratio; higher
is better for all others. Our method approaches the task performance of ASAp, with a generation
ratio close to that of constrained generation.
“example([Link]” as an error, because it is possible to later add text that turns the function
argument into a generator expression that binds foo.
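As a rough illustration only (not the authors' implementation), the non-conservative core of such a detector can shell out to the Pyright CLI and look for undefined-name diagnostics. The real detector must additionally verify that no later continuation could still bind the name, which we elide here; the CLI flag and diagnostic rule name reflect our understanding of Pyright and should be treated as assumptions.

```python
import json, subprocess, tempfile

def has_undefined_name(partial_program: str) -> bool:
    """Run Pyright on a (syntactically completed) partial program and report
    whether it flags any undefined variable. NOTE: a faithful detector must be
    conservative (no false positives), e.g. by ignoring names that a later
    continuation could still bind; this sketch omits that analysis."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(partial_program)
        path = f.name
    result = subprocess.run(["pyright", "--outputjson", path],
                            capture_output=True, text=True)
    try:
        report = json.loads(result.stdout)
    except json.JSONDecodeError:
        return False                     # be conservative on tool failure
    return any(diag.get("rule") == "reportUndefinedVariable"
               for diag in report.get("generalDiagnostics", []))
```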
We compare the performance of AprAD, constrained to avoid producing code with hallucinated
API calls, relative to other sampling methods by evaluating on BigCodeBench (Zhuo et al., 2024), a
benchmark that focuses on practical programming tasks, often requiring the use of common libraries.
An analysis of the solutions that several common LLMs generate reveals that their solutions often require imports that are available in the testing environment but not listed in the prompt. In order for the hallucination detection program to discover these available resources, we add all imports available in the test environment to the dataset prompt for this experiment.
For all sampling methods, we use Starcoder2 (Lozhkov et al., 2024), in the 7B and 15B model sizes.
We generate 5 samples for each task, with temperature 0.8, and a top-p of 0.95. In addition to
evaluating the pass@1 and pass@5 rates on execution-based tests, we log whether the evaluation specifically fails with a NameError or UnboundLocalError, as an indicator that the generation included a hallucinated API call, and calculate the rate at which this does not occur.
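For concreteness, these per-task metrics can be computed from the five samples with the standard combinatorial pass@k estimator; the helpers below are our own sketch (function names are illustrative), with the !NameErr rate computed the same way over samples that did not fail with a hallucination-indicating exception.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k, given n samples of which c are successes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def no_name_err_at_k(exception_names, k: int) -> float:
    """exception_names: one exception class name (or None) per sample.
    A sample 'succeeds' here if it did not fail with a NameError or
    UnboundLocalError, i.e. no hallucinated API call surfaced at runtime."""
    n = len(exception_names)
    c = sum(e not in ("NameError", "UnboundLocalError") for e in exception_names)
    return pass_at_k(n, c, k)

# Example: 5 samples per task, 2 of which pass the execution tests.
print(pass_at_k(5, 2, 1), pass_at_k(5, 2, 5))   # 0.4 1.0
```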
Note that all methods use the same random seed, so the outputs only diverge if and when the detector
activates. Table 3 shows the results for all tasks where the outputs diverge in any method; Table 4
(Appendix) also includes the tasks for which all methods return identical results. As the results
show, the output quality of AprAD is close to ASAp, while its generation ratio is much lower.
the algorithm doesn’t backtrack as far, and the resulting behavior approaches constrained decoding.
The optimal point in this tradeoff will depend on the specific task and error set, and is an area
for future research. As defined, however, AprAD occupies a useful position on this spectrum, is
straightforward to implement, and exhibits positive results on a variety of tasks.
REFERENCES
Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Guiding llms the right way: Fast, non-
invasive constrained generation, 2024. URL [Link]
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao.
Medusa: Simple LLM inference acceleration framework with multiple decoding heads, 2024.
URL [Link]
Darryl Francis. The Scunthorpe problem. Word Ways, 53(2), May 2020. URL https://[Link]/wordways/vol53/iss2/12.
Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. Grammar-constrained decoding
for structured nlp tasks without finetuning, 2024. URL [Link]13971.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chap-
lot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier,
Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril,
Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://[Link]/abs/2310.06825.
Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative
decoding, 2023. URL [Link]
Alexander K. Lew, Tan Zhi-Xuan, Gabriel Grand, and Vikash K. Mansinghka. Sequential monte
carlo steering of large language models using probabilistic programs, 2023. URL https://[Link]/abs/2306.03081.
Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE: Speculative sampling requires
rethinking feature uncertainty, 2024. URL [Link]
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Noua-
mane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, De-
nis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov,
Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo,
Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yix-
uan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xian-
gru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank
Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Can-
wen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Car-
olyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Car-
los Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von
Werra, and Harm de Vries. Starcoder 2 and the stack v2: The next generation, 2024. URL
[Link]
Daniel Melcer, Nathan Fulton, Sanjay Krishna Gouda, and Haifeng Qian. Constrained decoding for
code language models via efficient left and right quotienting of context-sensitive grammars, 2024.
URL [Link]
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae
Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan
Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating large language model serv-
ing with tree-based speculative inference and verification. In Proceedings of the 29th ACM In-
ternational Conference on Architectural Support for Programming Languages and Operating
Systems, Volume 3, ASPLOS ’24. ACM, April 2024. doi: 10.1145/3620666.3651335. URL
[Link]
Microsoft. Pyright, 2019. URL [Link]
Kanghee Park, Jiayu Wang, Taylor Berg-Kirkpatrick, Nadia Polikarpova, and Loris D’Antoni.
Grammar-aligned decoding, 2024. URL [Link]
Honghua Zhang, Meihua Dang, Nanyun Peng, and Guy Van den Broeck. Tractable control for
autoregressive language generation, 2023. URL [Link]
Honghua Zhang, Po-Nien Kung, Masahiro Yoshida, Guy Van den Broeck, and Nanyun Peng. Adapt-
able logical control for large language models, 2024. URL [Link]2406.13892.
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam
Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong
Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan
Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang,
David Lo, Binyuan Hui, Niklas Muennighoff, Daniel Fried, Xiaoning Du, Harm de Vries, and
Leandro Von Werra. Bigcodebench: Benchmarking code generation with diverse function calls
and complex instructions, 2024. URL [Link]
A LIPOGRAM EVALUATION DETAILS
We provide the following prompts to the language model, as well as the relevant special tokens to
delimit user instructions and chat turns.
Each prompt is combined with each vowel, resulting in 25 prompts. With four sampling methods,
this results in 100 total generations.
During sampling, we use a top-k of 20, and temperature of 0.8.
We create a file that only contains the 100 prompt-completion pairs, without information on which
method generated each completion. All samples are shuffled in random order.
We selected four AI researchers not otherwise directly involved in the experimental evaluation of
this method as human raters, to evaluate 25 samples each. We provided the following instructions to
the raters:
This file contains a set of prompts, and responses using one of several meth-
ods. Each prompt contains a constraint to not use a specific letter. Irrespective
of whether the response follows the constraint, rate the response quality on a scale
of 1-5 in the “Score” column, noting that generation is always cut off after 200
tokens.
Additionally, rate how well the response follows the intent of the constraint in the
“Follows Intent” column. Examples of not following the intent include working
around the constraint by excessively dropping letters, using unnecessary accents,
writing Unicode lookalike letters, or responding in a foreign language, rather than
through selecting appropriate words that satisfy the constraint. This column is
pre-filled with ‘X’ if the output contains the banned letter. Otherwise, write 1 if it
violates the intent, 2 if it is ambiguous, and 3 if it does not.
We additionally highlighted the presence of non-ASCII lookalike letters to the human raters. The
complete model outputs, and the scores that each rater assigned, are provided in the supplementary
material. Additional example outputs are provided in Appendix D.
Table 4: Results for each method on the entirety of BigCodeBench. Note that these results are identical to those in Table 3, except that they are consistently offset and scaled to include values for the tasks in which all methods return the same result.
Table 4 includes results for the entirety of BigCodeBench; not just the tasks for which the methods
diverged in their output.
Algorithm 5 Many error-free decoding methods may be generalized by their behavior after an error
procedure ERRORFREEDECODING(P, ℬ, x1...n, STRATEGY)
    P̂^B ← P
    m ← n                                               ▷ Current token index
    while Stopping condition not met do
        Sample one token xm+1 ∼ P̂^B(· | x1...m)
        Increment m
        if x1...m ∈ ℬ then
            P̂^{B∪{x}} ← ADDBADSAMPLE(P̂^B, x1...m)        ▷ Algorithm 3
            x1...m ← STRATEGY(P̂^B, P̂^{B∪{x}}, x1...m)    ▷ m may decrease
            P̂^B ← P̂^{B∪{x}}
    return x1...m
procedure APRADSTRATEGY(P̂^B, P̂^{B∪{x}}, x1...m)
    return SPECSAMPLE(P̂^B, P̂^{B∪{x}}, 0, x1...m)         ▷ Algorithm 2
procedure ASAPSTRATEGY(P̂^B, P̂^{B∪{x}}, x1...m)
    return []                                            ▷ Backtrack to beginning
procedure CONSTRAINEDDECODINGSTRATEGY(P̂^B, P̂^{B∪{x}}, x1...m)
    return x1...m−1                                      ▷ Delete the error token but don't backtrack further
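The three strategies differ only in how much of the rejected sample they keep, which makes the comparison compact in code; a minimal sketch (ours, mirroring Algorithm 5 and reusing the conventions of the earlier sketches) is:

```python
import numpy as np

def asap_strategy(old_dists, new_dists, x, n, rng):
    """ASAp: discard the entire rejected continuation and restart from the prompt."""
    return list(x[:n])

def constrained_strategy(old_dists, new_dists, x, n, rng):
    """Constrained decoding: drop only the final, error-completing token."""
    return list(x[:-1])

def aprad_strategy(old_dists, new_dists, x, n, rng):
    """AprAD: keep the prefix accepted by speculative sampling between the
    pre-update conditionals (old_dists) and post-update conditionals (new_dists)."""
    kept = list(x[:n])
    for j in range(n, len(x)):
        q, p, tok = old_dists[j - n], new_dists[j - n], x[j]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            kept.append(tok)
        else:
            residual = np.clip(p - q, 0.0, None)
            kept.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            break
    return kept
```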
Provide instructions to tie a tie without using the letter "I". AprAD (Ours)
To create an elegant neckwear look, follow these steps to adorn your collar:
1. Gather your tools: Choose a well- Press your collar down and adjust the top button.
2. Select the perfect knot: There are many knots to choose from; select a preferred one.
3. Dress your end: Take the end nearest to you and form a loop, called the "small" or "narrow"
end.
4. Create an overlap: Cross the longer end, called the "large" end, over the loop.
5. Form a second loop: Take the small end and make another loop.
6. Create a "V" shape: Place the large end over the two loops. Adjust for symmetry.
7. Repeat: Cross the small end over the large end, then the large end over the small end to
create a "V" shape once more.
8. Tuck and adjust: T
Provide instructions to tie a tie without using the letter "I". Unconstrained
To create an elegant necktie knot, follow these steps:
1. Lay the tie flat on the table with the wide end (the end you’ll be tying first) facing you. The
thin end (the side you’ll be attaching to theshirt) should be hanging on the left.
2. Cross the wide end over the thin end and bring it behind the looped thin end.
3. Now, pass the wide end over the looped thin end from left to right, going around the back
of the necktie.
4. Keeping the wide end behind the looped thin end, cross it over to the left side and bring it
in front of the looped end.
5. Now, the wide end should be on the left and the looped end on the right. Cross the wide
end over the looped end again to the right side.
6. Pass the wide end under
Provide instructions to tie a tie without using the letter "I". ASAp
To create
Provide instructions to tie a tie without using the letter "I". Constrained
To create an elegant necktube knot for your formal att attar, follow these steps:
1. Lay the necktube flat on the table, end towards you.
2. Fold one end about four to four and a half инches (10-12 cm) over the other end.
3. Create a loop, called the "small loop," by fold the part of the necktube above the folded end,
about two to three инches (5-7.5 cm) down.
4. Create a second loop, called the "large loop," by fold the part of the necktube around your
neck, and rest the large loop over the small loop.
5. Tuck the end of the necktube under the large loop, then the small loop.
6. Pass the end of the necktube under the small loop and through the hole created by the knot.
7. Tug on
Provide instructions to tie a tie without using the letter "O". AprAD (Ours)
1. Begin with the tie unwinded and the widest end (the "bracelet end") tucked under the
neckband.
2. Bring the wider end (the "tie end") behind the neckband, then tuck it under the bracelet end
and pin it in place with a thin fabric "shelf" called the "keeper."
3. Create a "Wing Tip" by lifting the neckband and slipping the tie end under it, then pull it up
evenly and adjust the width until it’s level with the bracelet end.
4. With the tie end still held in place with the neckband, create a "Shelf" by lifting the bracelet
end and passing the tie end beneath it, then adjust the tie width again by pulling the tie end
evenly.
5. Create a "Three-inch rule" by sliding the bracelet end up three inches (7.6 cm
Provide instructions to tie a tie without using the letter "O". Unconstrained
1. Begin by laying out your tie on a flat surface with the wide end, known as the "tail," facing
you. The narrower end is called the "Tip."
2. Cross the Tip over the Tail from left to right.
3. Bring the Tail around the back of the neck and cross it over the Tip.
4. Now, take the Tip and thread it under the loop created by the Tail.
5. Pull the Tip up through the loop.
6. Adjust the position of the Tip and Tail to achieve a symmetrical knot.
7. Repeat the process on the other side, crossing the Tip over the Tail, bringing the Tail around
the back, threading the Tip under the loop, and pulling it through.
8. With both sides adjusted, adjust the knot one final time to ensure a neat and symmetrical
look.</s>
Provide instructions to tie a tie without using the letter "O". ASAp
1
Provide instructions to tie a tie without using the letter "O". Constrained
1. Begin by laying the necktie flat in a figure eight shape. The wide end (called the "butt end")
will be anchered at the left side, and the thin end (called the "tip end") will be adjusted at the
right side.
2. Bring the right side (the side with the tip end) up and behind the left side, creating a "U"
shape, then pull it straight up and tuck the tip end under the wide part at the nape (the back
part) near the center.
3. Next, bring the left side up and pass it (the wide end) behind the neck, then pull it straight
up and rest it against the right side.
4. With the left side still in hand, take the right side (tip end) and bring it up and pass it behind
the left side.
5. Tuck the tip end behind and under the wide part (where the
Write a story without using the letter "E". AprAD (Ours)
In a quaint, old-world town, a mystical story unfolds. A solitary knight, Sir Gavr, hails from a
distant land. His mission was vital, as his king had sought his aid. A monstrous dragon, with
a long, sinuous form, had laid claim to a bountiful land.
Sir Gavr, galloping through twisting hills and past lilac valms, his stallion, Swiftpaw, carrying
him swiftly. Through woodlands dark and mossy, a path narrow and winding. His armor
glinting gold, a radiant sight, as knight and mount, in rhythmic flight, cut through air, crisp
and light.
A labyrinth of hills and hollows, Sir Gavr and Swiftpaw, onward bound. A looming mountain,
a daunting sight, as dragon’s lair, a
Describe elephants without using the letter "O". AprAD (Ours)
Elephants are large, gray, thick-skinned pachyderms, characteristically having extended,
curved tusks, and a fan-shaped, large ear structure. They are the largest living terrestrial an-
imals, with a distinct, massive build. These gentle giants have a wrinkled visage, and large,
dangling, gray, prehensile ears, which are effective at dissipating heat. The trunk, a lengthy,
muscular, and versatile appendage, is an integral feature, serving multiples uses, including
breathing, grasping, and trumpeting. Their legs are pillar-like, ending in large, flat feet, which
help them maneuver in diverse terrains. Their size and strength make them apex terrestrial
animals, and they are crucial elements in their respective habitats.</s>
Critique the Mona Lisa without using the letter "E". AprAD (Ours)
In this artwork, a captivating woman, known as La Joconda or La Hora, holds a puzzling grin.
A portrait, it is, of this cryptic woman. With a backdrop of lush, rolling hills, this painting is a
symphony of color and form.
This woman, clad in a rich, royal gown, sits in a chair in an unusual position, twisting slightly,
inviting an intimacy that is both alluring and intriguing. A slight smirk, or is it a sly grin? This
ambiguity is what draws us in, as if to ask, "Am I amusing or mocking?"
Boldly, shadows and light play in this work, as if dancing a waltz, with chiaroscuro as its
music. This contrast, striking and dramatic, is what adds an almost magical quality to this
artwork.
A mosaic of colors, this painting is
Critique the Mona Lisa without using the letter "E". Unconstrained
The Mona Lisa, a singular 16th-century painting, is Leoni da Vinci’s epitome. This
masterpiece, penned by the virtuoso artist, evokes a sense of enigma and captivation. The
woman portrayed, with her enigmatic smile, gazes directly at the observer, challenging one’s
interpretation.
The painting’s composition masterfully balances the human form against a lush background,
showcasing Leonardo’s virtuosic skill. The woman, dressed in a subtle, intricate robe, is
contrasted by the stark, plain backdrop. Her expression, subtly shifting between coy and
contemplative, invites endless analysis.
The intricate play of light and shadow on her face adds depth and dimension, as if the painting
itself holds a secret. The artist’s adept use of sfumato technique, where colors and tones blend
Critique the Mona Lisa without using the letter "E". ASAp
In this
Critique the Mona Lisa without using the letter "E". Constrained
I’ll do my utmost to critic this Mona Lisa without utilizing "é" or any of its variants, including
"é" as in "étudiant" or "è" as in "mèasuring." I’ll focus on "critic" as in "judging with an
analytical and authoritarian approach."
This painting, La Gioconda, by da Vinci, is an intriguing, cryptic work of art. Its woman’s
countanza, or counténanc, or visagio, or visagio da vita, or simply, its facia, is a captivating,
almost hypnotizing, sight.
With a sfumato background that dissipatas, or dissipas, into an indistinct, unbound horizon,
our ojo is drawn to its sujето principal. Its sfumato, or "smoky," background is in striking
contrast
Critique the Mona Lisa without using the letter "A". AprAD (Ours)
Leonymus Léonifor de Vinci’s iconic piece, the ’Lizone Jocundity’ or ’Smiling Lioness,’ epit-
omizes the refined intrigue of the 16th century. This work, completed in the mid-1500s, is
renowned for its subject’s elusive expression, veiled in mystery.
The ’Lizone’ is executed with the finest of techniques. Leonifor’s virtuosity is evident in the
meticulous, lifelike rendering of the figure’s physiognomy, dressed in the subdued splendor of
the epoch. The sitter, shrouded in the mists of time, seems to defy definition.
The composition is striking; the figure is posed, enshrouded in the folds of her veil, the folds
of her robe - the folds of her being.
Critique the Mona Lisa without using the letter "A". Unconstrained
The Mona Lisa, a iconic masterpiece by Leonardo da Vinci, is a 16th century portrait ren-
dered in oil on poplar wood. This artwork, renowned for its enigmatic smile, is noted for its
exceptional fine detail and the subtle interplay of light and shadow.
The subject, believed to be Lisa Gherardini, is depicted in three-quarter view against a
landscape backdrop. Her expression, often described as enigmatic or melancholic, is con-
veyed through her slightly parted lips, a subtle curve of her mouth, and her enigmatic gaze.
The painting showcases da Vinci’s mastery of sfumato, a technique that creates a soft
transition between colors and tones. This is evident in the delicate shading on the subject’s
face, hands, and the background landscape.
The background is a harmonious blend of mountains, water, and trees, which adds
Critique the Mona Lisa without using the letter "A". ASAp
The
Critique the Mona Lisa without using the letter "A". Constrained
The Moná Licá, renowned work of Leonido de Vinci, is renowned for its enigmistic smile,
subtle yet intriguing. This 16th century oil on poplár wood portritr é, housed in the Louvre,
Fránce, boggles the mind with its complexitı́. The composition, with the subject’s eyes meet-
ing the observer’s, invites introspection. The intricè intricàtions of the sitter’s dress, mirroring
the folds of the sitter’s neckline, exhibit the virtuositı́ of the mı́ster. The sfumátto technique,
used in the rendering of the shádows, highlights the subtlety of the lighting on the subject. The
delicácy of the brushwork, seen in the subtle blending of colors, reflects the meticulousness of
the mı́ster. The over
Summarize the history of artificial intelligence without using the letter "U". AprAD (Ours)
Artificial Intelligence (AI) traces its roots to the concept of "thinking machines" in ancient
Greece. Early theories on creating sentient beings were proposed by philosophers like René
Descartes and Blaise Pascal. The term "Artificial Intelligence" emerged in the mid-1950s
when scientists and engineers began working on creating programs capable of solving complex
problems, learning from experience, and interacting with the environment.
One of the first significant achievements in AI was the development of the Logic Theorist by
Allen Newell and Herbert A. Simon in 1956. This program was designed to prove mathemat-
ical theorems, and it was considered a major milestone in the field.
The 1960s and 1970s saw the advancements in AI research with the creation of expert systems,
which applied knowledge in specific domains to solve complex problems. These systems, like
MYCIN and EMYC
Summarize the history of artificial intelligence without using the letter "U". Unconstrained
Artificial Intelligence (AI) is a field of computer science that focuses on creating intelligent
machines capable of performing tasks that typically require human intelligence. Its roots can
be traced back to the mid-20th century.
In the 1940s and 1950s, early pioneers like Alan Turing and Marvin Minsky explored the
possibility of creating machines that could think and learn. Turing proposed the Turing Test,
which measures a machine’s ability to imitate human conversation. Minsky and John Mc-
Carthy coined the term "Artificial Intelligence" at a workshop at Dartmouth College in 1956.
Initially, AI research focused on symbolic AI, using rules and symbols to represent knowledge
and reason. However, this approach faced limitations, and researchers turned to other methods.
In the late 1950s and early 1960s, there was significant interest in AI and
Summarize the history of artificial intelligence without using the letter "U". ASAp
Artificial Intelligence (AI) traces
Summarize the history of artificial intelligence without using the letter "U". Constrained
Artificial Intelligence (AI) is a field of компьютерная наука, dealing with creating machines
or programs that mimic intelligent behavior. Its roots can be traced back to the mid-20th
венгарда, when Alan Тюринг, an English mathematician, proposed the concept of a "logical
machine" capable of performing any task that a person can.
In the 1950s and 1960s, AI research was largely driven by the idea that intelligent behavior
can be achieved by programming a machine to process symbolic information, leading to the
development of symbolic AI or "good old-fashioned AI" (GOFAI), which relies on logical
reasoning and symbol manipation.
However, the limitations of this approach became apparent in the late 1960s, with the realiza-
tion that symbolic reasoning alone was insignificant in tackling real-world problems
Describe elephants without using the letter "U". AprAD (Ours)
Massive land animals, elephants are the largest mammals on Earth. They have long, gray,
wrinkled skin, and large, fan-like ears. Their tails are thick at the base and taper towards the
tip. They possess two long front legs and a pair of smaller hind legs. They are herbivores,
eating a wide variety of plants, leaves, bark, and branches. Their diet provides them with an
incredible capacity to store food in their digestive tracts, which can take days to process. These
gentle giants have a keen memory and form strong family bonds, living in matriarchal herds.
They help one another, displaying behaviors of cooperation and protection, making them an
integral part of their ecosystems.</s>
Critique the Mona Lisa without using the letter "U". AprAD (Ours)
The Mona Lisa, a iconic masterpiece by Leonardo da Vinci, is a captivating image of a woman
whose enigmatic smile and mesmerizing gaze have left viewers in awe since its inception.
The portrait, renowned for its exceptional detail and realistic representation, evokes a sense of
mystery and refinement.
The lady, believed to be Lisa Gherardini, is depicted while seated before a rocky, landscaped
backdrop. Her visage is characterized by the delicate interplay of light and shadow, with the
folds of her clothing creating an intricate, almost three-dimensional effect. Her eyes, with
their faint hint of a smirk, invite the viewer into a world of enigma, daring one to decipher her
emotions.
The painting’s composition is noteworthy for its asymmetrical balance, which keeps the viewer
constantly engaged. The delicate folds of her veil, the
Critique the Mona Lisa without using the letter "U". Unconstrained
The Mona Lisa, a iconic masterpiece by Leonardo da Vinci, is a captivating image of a woman
whose enigmatic smile and inscrutable expression have baffled and intrigued viewers for
centuries. Her soft, delicate features and lifelike quality are achieved through the painter’s
virtuosic use of sfumato, a technique involving the subtle blending of colors to create a hazy,
smoky effect.
The background landscape, rendered with meticulous detail, provides an idyllic setting for the
enigmatic figure. The artist’s exceptional skill in capturing the play of light and shadow, as
well as his mastery of perspective, are evident in the way the background elements recede into
depth.
Mona Lisa’s gaze, seemingly directed at the viewer, invites one to ponder the secret thoughts
hidden behind her enigmatic expression. Her hands, delicately inter
Critique the Mona Lisa without using the letter "U". ASAp
The Mon
Critique the Mona Lisa without using the letter "U". Constrained
The Mona Lisa, a iconic masterpiece by Leonardo da Vinci, is a captivating image of a woman
whose enigmatic smile and inscrütableness have intrigighed viewers for over fivecenthies. The
painting is renowned for its exceptional detail and lifelike portrayal of the sitter.
The composition is masterly balanced, with the sitter positioned slightly off center. Her gaze,
following no discernible focal point, invites prolonged contemplation. The delicate rendering
of her facial expression, with its slight hint of a smile and the faintest of eyebrows, adds to the
overall alliteration.
The intricately patterned veil she wears adds a sense of mystery and intrigness. The delicate
play of light and shadow enhances the sense of depth and dimensionality. The sfoglio, or
smoky sfondato, in the backdrop contibtes to
Provide instructions to tie a tie without using the letter "E". AprAD (Ours)
1. Starting with your collar straight and flat:
2. Position your collar: Fold it up at about an inch (2.5 cm) and adjust it to sit comfortably.
3. Form a ’Width’ and’Slip’ or ’Four in Hand’ knot:
a. Width: Pick up a small strip of fabric (about an inch or two, 2.5 cm to 5 cm) with your
non-dominant hand. This is your ’width’ or ’wingtip’ for this knot.
b. Slip: With your dominant hand (usually your right hand), grab a loop of fabric that’s around
a foot (30 cm) long. This is your ’slip’.
4. Cross and Tuck:
a. Bring your slip hand up and across your width, crossing your width hand’s strip.
b. Tuck
Provide instructions to tie a tie without using the letter "E". Unconstrained
1. Place the tie around your neck, with one end hanging lower than the other. This is called
the "working end."
2. Cross the working end over the undone side (the side without the Windsor knot).
3. Bring the working end behind the neck, crossing it over the center of the knot.
4. Pull the working end straight down, making a loop on the side with the Windsor knot.
5. Tuck the end of the loop under the Windsor knot, towards the narrower side of the tie.
6. Pull the working end to tighten the knot, and then repeat steps 3-5 to create a second loop
on the other side.
7. Cross the second loop over the first loop, just like in step 2.
8. Bring the working end behind the neck and under the first loop, crossing it over the center
of the
Provide instructions to tie a tie without using the letter "E". ASAp
1
Provide instructions to tie a tie without using the letter "E". Constrained
1. Position your shirt: Worn collar flat and button up, slip your arms through armh Hols of
your suit or blaz Jackет. Tuck in your shirt and fastен buttons.
2. Taut knot: Grip your narrow finish, also known as "point," with your right hand, holding it
at about a 45-diagonal from your body. Wrap your "broad" or "wing tip" around your collar
and cross it ov-r your finish, making a "V" Shap at your collar.
3. Tightly wrap: Grip your broad tip with your right hand, and pull it across and o’r your
finish, going toward your l-ft (non-dominant) sid- and passing it thru’g your "working" n-d
(right) knot loop.
4. Wrap again: Cross your broad tip to your "back," going