
Large Language Model Routing with Benchmark Datasets

Tal Shnitzer∗†, Anthony Ou∗‡, Mírian Silva§, Kate Soule§, Yuekai Sun¶, Justin Solomon‡, Neil Thompson‡, Mikhail Yurochkin§

∗ Equal contribution. † Broad Institute, work done while at CSAIL, MIT. ‡ CSAIL, MIT. § MIT-IBM Watson AI Lab. ¶ University of Michigan.

arXiv:2309.15789v1 [cs.CL] 27 Sep 2023

Abstract

There is a rapidly growing number of open-source Large Language Models (LLMs) and benchmark
datasets to compare them. While some models dominate these benchmarks, no single model typically
achieves the best accuracy in all tasks and use cases. In this work, we address the challenge of selecting the
best LLM out of a collection of models for new tasks. We propose a new formulation for the problem, in
which benchmark datasets are repurposed to learn a “router” model for this LLM selection, and we show
that this problem can be reduced to a collection of binary classification tasks. We demonstrate the utility
and limitations of learning model routers from various benchmark datasets, where we consistently improve
performance upon using any single model for all tasks.

1 Introduction
Large Language Models (LLMs) have demonstrated ground-
breaking abilities to solve diverse tasks across a variety of
NLP domains (Devlin et al., 2018; Brown et al., 2020). To-
day, researchers in both academia and industry are releasing
new LLMs daily.1 These models perform tasks ranging from
text classification to question-answering, summarization, and
dialogue.
The popularity and influx of open-source LLMs and the
diversity of their potential use cases made it crucial to develop
comprehensive benchmarks, i.e., collections of datasets repre-
senting different tasks and domains to compare LLMs. For
example, HELM (Liang et al., 2022) consists of 42 scenarios
covering a variety of uses, MMLU (Hendrycks et al., 2020)
is a multiple-choice question answering benchmark with 57
tasks organized by topics, Open LLM Leaderboard (Beeching
et al., 2023) combines MMLU with other question-answering
datasets, and LM Evaluation Harness (Gao et al., 2021) sup-
ports over 200 tasks. While there always will be an LLM that is the best on average across benchmarks, there is unlikely to ever be a model that is strictly the best on each of the hundreds of datasets comprising various benchmarks. Meanwhile, a practitioner typically wants to know what is the best model for their specific use case and is less concerned about average performance on a plethora of other datasets.

Figure 1: We learn the strengths of candidate LLMs (marked with corresponding colors) on various tasks (emojis: QA, reasoning, summarization, etc.) and domains (4 sections within each box: finance, legal, general knowledge, etc.) from benchmark datasets. We accomplish this by training a binary classifier per LLM (upper part of the figure). For a new task, we score each LLM with these binary classifiers and recommend an LLM for the user (lower part).

¹ Hugging Face currently hosts 22,482 models for text generation.

In this paper, we study the problem of identifying the best LLM for a new task. To learn about the strengths and weaknesses of candidate LLMs we use benchmark datasets that give insights into the performance of LLMs across tasks and domains. For example, suppose the new
task is answering math questions. In that case, it is more intuitive to consider models that do well on other
STEM question-answering datasets and discount performance on, e.g., sociology or toxicity detection. We make
this idea precise by casting the learning of model strengths as a binary supervised learning task, where the
features are input embeddings of samples across tasks and the labels are whether the model “did well” on
the corresponding inputs, e.g., generated correct class label, answered a question correctly, or followed input
instructions sufficiently well. See Figure 1 for an illustration. Such information is collected during benchmark
evaluations and can be reused for training model routers without having to run expensive LLM inference again.
The resulting router is also efficient at test time, as it only requires calling the chosen LLM.
Our contributions are summarized below:
• We formalize the problem of learning the strengths and weaknesses of LLMs for downstream routing, i.e.,
selecting the best model, as a collection of binary classification problems. The goal of each classification
problem is to predict whether a given LLM will be “correct” on an input.
• We propose three scores for selecting LLMs for a new task using these correctness predictors. Our third score
is designed to account for mistakes a correctness predictor can make on the (out-of-distribution) data from
a new task that is likely to be different from the datasets in benchmarks used for training the correctness
predictors. We establish connections to meta-learning to obtain theoretical insights into the efficacy of these
scores.
• We verify the efficacy of our model routing scores empirically on 29 datasets from HELM (Liang et al., 2022)
representing scenarios like question answering, text classification, knowledge, and reasoning, and MixInstruct
(Jiang et al., 2023), a collection of datasets for evaluating instruction following capabilities of LLMs.
• We discuss and empirically investigate questions concerning the efficacy and utility of learning LLM routers
from benchmarks: generalization of correctness predictors to new tasks, the importance of a larger pool of
benchmarks, and the potential of routing smaller LLMs to reduce costs.

2 Related work
Benchmarking Comparing models or algorithms across various tasks is a standard practice in ML and AI
literature. Prior to Foundation Models (Bommasani et al., 2021), it was typical to apply the same learning algo-
rithm to train a model on each of the datasets and compare the performance against other learning algorithms.
The UCI Machine Learning Repository (Kelly et al., 2023) is one prominent example of such a collection of
datasets often used to compare learning algorithms. With the emergence of Foundation Models, i.e., models
with billions of parameters trained on massive datasets using large compute clusters, the paradigm changed to
evaluating the same model (or a few-shot tuned version of it) on a variety of tasks (Bojar et al., 2014; Goyal
et al., 2019; Li et al., 2022). In the context of Large Language Models, many benchmarks (Wang et al., 2018,
2019; Hendrycks et al., 2020; Gao et al., 2021; Srivastava et al., 2022; Liang et al., 2022; Beeching et al., 2023;
Jiang et al., 2023) were proposed to help determine the most capable LLM. Benchmarks typically average the
performance of models across tasks and provide a final ranking, discarding the rest of the information. In this
work, we use the byproducts of benchmark evaluations, i.e., the per-sample performance of various LLMs across
tasks, to learn about their individual strengths and identify the best LLM for a new task.

Model selection Selecting the best model, or model selection, is a classical topic in statistics and ML (Bishop
and Nasrabadi, 2006; Hastie et al., 2009; Raschka, 2018). However, the typical problem setting is quite different:
classical methods like cross-validation aim to estimate the population error of a model trained on samples from
the population distribution. In other words, the goal is to find the best model for in-distribution test data,
i.e., data sampled from the same distribution as the train data. The notion of “train” data is quite elusive
for LLMs, as they are usually trained on massive datasets with trillions of tokens with a simple task of next
token prediction (Radford et al., 2019; Brown et al., 2020). However, the tasks we evaluate them on are often
more structured, e.g., classification and question-answering, and are specific to domains that may or may not
be sufficiently represented in the train data. In addition, techniques like k-fold cross-validation require training
the model multiple times, which is infeasible for LLMs.

Out-of-distribution model selection Recognizing the limitations of the model selection methods for in-
distribution test data (Gulrajani and Lopez-Paz, 2021; Koh et al., 2021), recent work has proposed a variety
of methods to select models when deployed on data that may differ from the train data. These methods rely
on ideas such as bootstrapping (Xu and Tibshirani, 2022), reweighing (Chen et al., 2021b; Maity et al., 2023),
agreement of models or ensembles (Jiang et al., 2021; Chen et al., 2021a; Ng et al., 2023), or aligning model
accuracy in-distribution with a confidence threshold (Guillory et al., 2021; Garg et al., 2022; Yu et al., 2022).
Most of these methods are nontrivial to extend to generation use-cases of LLMs; some require training multiple
models, and some need well-defined in-distribution data related to the new task.

Routing LLMs Prior work on selecting LLMs primarily considers choosing one that produces the best gen-
eration for a given input. Liu and Liu (2021); Ravaut et al. (2022); Jiang et al. (2023) train dedicated scoring or
ranking models that can be applied to model generations. Unlike our work, these approaches require generating
outputs with every candidate LLM to make a decision, which can be computationally prohibitive with a large
pool of candidate LLMs. FrugalGPT (Chen et al., 2023) calls LLMs sequentially until a dedicated scoring model
deems the generation acceptable. Prior works in this group require training data sufficiently representative of
each of the tasks and domains of interest to train the corresponding ranking and scoring models. In this paper,
instead, we use data from benchmarks to learn the strengths and weaknesses of LLMs across tasks and domains.
The resulting model router requires generating outputs only with the chosen LLM at test time.

3 Learning from Benchmarks


We start by introducing notation to describe the majority of NLP benchmarks. Let $\{x_1^d, \ldots, x_{n_d}^d\}_{d=1}^D$ be a collection of inputs across $D$ tasks. Each input text $x_i^d$ corresponds to a reference answer $r_i^d$, i.e., an ideal generation for the corresponding input. Finally, there is a metric $F_d(x, o, r)$ that can be task-dependent and measures how well a response $o$ for an input $x$ corresponds to the reference $r$. To test an LLM$_m$, $m \in \{1, \ldots, M\}$, on the benchmark, for each task $d = 1, \ldots, D$, its responses $\{o_{im}^d = \mathrm{LLM}_m(x_i^d)\}_{i=1}^{n_d}$ are generated and compared to the corresponding references to obtain performance metrics $\{f_{im}^d = F_d(x_i^d, o_{im}^d, r_i^d)\}_{i=1}^{n_d}$.² At this point, the majority of the benchmark studies will take a (weighted) average of the performance metrics and report a single score for every LLM to rank them in performance. Instead, we reuse these evaluation results to formulate a supervised learning problem to better understand the strengths and weaknesses of various LLMs based on their performance on data points and tasks.

Supervised learning from benchmarks Our goal is to learn a simple routing function $g_m(x)$ for each LLM, $m = 1, \ldots, M$, that can predict $\{f_{im}^{d'}\}_{i=1}^{n_{d'}}$, i.e., the performance of the corresponding LLM on a new task $d'$. Then it is trivial to select the best LLM for this task. For efficiency at test time, we restrict the routers $\{g_m\}_{m=1}^M$ to only depend on the input $x$. This is in contrast to the majority of prior works on LLM routing that first obtain generations with every candidate LLM and then use them to choose the best model (Liu and Liu, 2021; Ravaut et al., 2022; Jiang et al., 2023). With thousands of open-source LLMs, it is simply infeasible to obtain generations with every LLM for every input at test time.
To complete the problem formulation, we denote the "correctness" of model $m$ on an input $x$ by $y(x, m) \in \{0, 1\}$. Correctness is evaluated as follows: generate a response $o_{im}^d$ with LLM $m$ on input $x_i^d$, compare it to the corresponding reference $r_i^d$, and output 1 if the model's response is good enough, i.e., $f_{im}^d > \eta_d$, and 0 otherwise, where $\eta_d$ is some threshold that can be task and/or metric specific. For tasks like classification or multiple-choice QA, $y(x_i^d, m) = f_{im}^d$, while for various evaluation metrics used in summarization and instruction following tasks (Zhang et al., 2020; Sellam et al., 2020; Yuan et al., 2021), the notion of correctness can help to account for the heterogeneity of popular metrics and task difficulty levels. In Section 5.2, we also present results with raw metrics instead of correctness.
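To make the thresholding step concrete, here is a minimal sketch (ours, not from the paper; the metric values and the threshold value are hypothetical) of binarizing per-sample benchmark metrics with a task-specific threshold $\eta_d$:

```python
import numpy as np

def correctness_labels(metrics, eta_d):
    """Binarize per-sample metrics f_{im}^d into correctness labels y(x_i^d, m) = 1{f > eta_d}."""
    return (np.asarray(metrics, dtype=float) > eta_d).astype(int)

# Hypothetical per-sample metric values for one LLM on one task, thresholded at eta_d = 0.7
y = correctness_labels([0.91, 0.42, 0.77, 0.65], eta_d=0.7)  # -> array([1, 0, 1, 0])
```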
To train a predictor of LLM correctness, for each LLM, $m = 1, \ldots, M$, we solve the following optimization problem:

$$\min_{g_m} \sum_{d=1}^{D} \sum_{i=1}^{n_d} \ell\big(g_m(x_i^d), y(x_i^d, m)\big), \qquad (1)$$

where we choose $\ell$ to be a binary cross-entropy loss and $g_m$ is any standard probabilistic classifier, i.e., $g_m(x)$ estimates $P(y(x, m) = 1 \mid x)$.
² We omit dependency on the prompt when generating with an LLM and, w.l.o.g., consider the same LLM with a different prompting strategy as a different LLM.

An important consideration when training correctness predictors is their ability to generalize to out-of-distribution (OOD) data, since our goal is to estimate LLM performance on a new task $d'$ that has not been seen during training. Training predictors on data from multiple domains that need to generalize to unseen domains is indeed an active area of research in the ML literature. For example, Sun and Saenko (2016); Arjovsky et al. (2019) proposed methods for improving OOD generalization when training on data from multiple domains, while Koh et al. (2021) proposed a benchmark for OOD generalization demonstrating the challenging nature of the problem in various applications.
In this work, we use a simple model for the correctness predictor: we embed all inputs with a sentence transformer (Reimers and Gurevych, 2019) and use a k-nearest neighbors classifier (Cover and Hart, 1967) as $\{g_m\}_{m=1}^M$. kNN is a simple non-parametric classifier that allows us to fit a potentially complicated decision boundary of an LLM's correctness across multiple tasks without extensive hyperparameter tuning. We choose this approach for learning correctness predictors to emphasize the utility of learning from benchmarks even with a basic method and instead focus on the question specific to our problem that has not been studied in prior works on OOD generalization: Can we improve the quality of LLM routing with an imperfect correctness predictor?
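As a concrete illustration of this setup, the sketch below (our own, not the authors' released code) embeds pooled benchmark inputs with a sentence transformer and fits one kNN correctness classifier per candidate LLM; the libraries, function names, and data layout are assumptions, though the encoder name matches the one reported in Appendix A.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import KNeighborsClassifier

def fit_correctness_predictors(inputs, correctness, k=5, encoder_name="all-mpnet-base-v2"):
    """Fit one kNN correctness predictor g_m per LLM from pooled benchmark data.

    inputs:      list of input texts pooled over all benchmark tasks
    correctness: dict {llm_name: 0/1 labels y(x, m) aligned with `inputs`}
    """
    encoder = SentenceTransformer(encoder_name)
    X = encoder.encode(inputs, normalize_embeddings=True)  # sentence embeddings
    predictors = {}
    for llm, y in correctness.items():
        g_m = KNeighborsClassifier(n_neighbors=k, metric="cosine")
        g_m.fit(X, np.asarray(y))
        predictors[llm] = g_m
    return encoder, predictors
```

At test time, the same encoder embeds inputs from the new task, and each predictor's `predict_proba` provides the estimate of $P(y(x, m) = 1 \mid x)$ that the routing scores in Section 4 aggregate.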

4 LLM routing with (imperfect) correctness predictors


The goal of LLM routing is to identify an LLM that will have the highest frequency of being correct on a new task $d'$, given the inputs $\{x_i^{d'}\}_{i=1}^{n_{d'}}$ from this task:

$$\arg\max_m \tilde{S}(m, d'), \quad \text{where} \quad \tilde{S}(m, d') = \frac{1}{n_{d'}} \sum_{i=1}^{n_{d'}} y(x_i^{d'}, m). \qquad (2)$$

Here, $\tilde{S}(m, d')$ is the "oracle" score that we want to estimate. The most intuitive estimator is simply using the correctness predictor

$$S_1(m, d') = \frac{1}{n_{d'}} \sum_{i=1}^{n_{d'}} g_m(x_i^{d'}), \qquad (3)$$

but prior work has shown that accurately estimating $P(y \mid x)$, i.e., calibration, is challenging on OOD data (Ovadia et al., 2019). Meanwhile, $g_m$ may still produce accurate predictions after thresholding the predicted probability even if the class probabilities are not estimated well, which is often the case with neural networks (Guo et al., 2017). This motivates another score:

$$S_2(m, d') = \frac{1}{n_{d'}} \sum_{i=1}^{n_{d'}} \bar{g}_m(x_i^{d'}), \quad \text{where} \quad \bar{g}_m(x_i^{d'}) = \mathbb{I}(g_m(x_i^{d'}) > t), \qquad (4)$$

where $t \in (0, 1)$ is some threshold, e.g., $t = 0.5$, $\mathbb{I}$ is an indicator function, and $\bar{g}_m(x) \in \{0, 1\}$ can be interpreted as the prediction of $g_m$ on $x$. This score, however, does not take into account the potential "imperfection" of $g_m$, i.e., lower accuracy on OOD data from task $d'$. To address this issue, we model the out-of-distribution confidence of the predictions $\bar{g}_m$.

A simple OOD confidence model We model LLM correctness as follows:

$$y(x, m) \mid x, d' = \begin{cases} \bar{g}_m(x) & \text{with probability } p(d', m) \\ 1 - \bar{g}_m(x) & \text{with probability } 1 - p(d', m), \end{cases} \qquad (5)$$

i.e., $p(d', m) \in [0, 1]$ is the probability that $\bar{g}_m$ is the correct prediction on a data point from task $d'$. The above model can be condensed as follows:

$$y(x, m) \mid x, d' \sim \mathrm{Bern}\big(\bar{g}_m(x)\, p(d', m) + (1 - \bar{g}_m(x))(1 - p(d', m))\big). \qquad (6)$$

In this simplistic (and approximate) model, we assume that $p(d', m)$ does not depend on the input $x$ after conditioning on the task $d'$. The assumption is analogous to the homoscedastic error term assumption in linear regression models and allows us to interpret $p(d', m)$ as the marginal/overall accuracy of $\bar{g}_m$ on data from the task $d'$.
Prior work has studied the problem of estimating OOD accuracy given the inputs from a new task, but existing methods are challenging to combine with our approach. For example, Garg et al. (2022) learn a threshold on model confidence, which is hard to apply when using kNN classifiers, and Ng et al. (2023) require data augmentations that can be challenging to identify given the diversity of tasks in benchmarks. Prior methods also do not take into account the partition of the train data into tasks inherent in our problem setting.
We treat the problem of estimating $p(d', m)$ as a supervised learning task, taking advantage of the task partition. Specifically, we assign a task descriptor $u(d) \in \mathbb{R}_+$ to every task that measures the distance of the data from task $d$ to the other available tasks combined. Then we collect the values of $p(d, m)$, i.e., the accuracy of $\bar{g}_m$ on $d$, and fit a non-parametric regression model to predict $p(d, m)$ from $u(d)$. At test time, we compute $u(d')$ for a new task $d'$ based on the inputs $\{x_i^{d'}\}_{i=1}^{n_{d'}}$ and predict $p(d', m)$ using the fitted regression model. In general, one can consider more sophisticated, higher-dimensional task descriptors $u(d)$, but here, for simplicity, we keep it 1-dimensional and use a Gaussian kernel smoother (also known as the Nadaraya-Watson estimator) as the non-parametric regressor. We provide details in Appendix A.
Finally, given the model of LLM correctness in equation 6, the oracle score $\tilde{S}(m, d')$ from equation 2 is a random variable distributed as a (scaled) sum of two Bernoulli random variables. To arrive at our final score for LLM routing, we take its expected value:

$$S_3(m, d') = S_2(m, d')\, p(d', m) + (1 - S_2(m, d'))(1 - p(d', m)). \qquad (7)$$

When selecting an LLM with $S_3$, we consider an alternative to the $\arg\max$ criterion based on our correctness model in equation 6, which defaults to the best model on average across benchmark datasets when we are not sufficiently confident that a candidate model will be better:

$$\begin{cases} m_3 & \text{if } P(\tilde{S}(m_3, d') > \tilde{S}(m_*, d')) > \eta \\ m_* & \text{otherwise,} \end{cases} \qquad (8)$$

where $m_3 = \arg\max_m S_3(m, d')$, i.e., the best LLM for the new task according to $S_3$, and $m_* = \arg\max_m \sum_{d=1}^D \tilde{S}(m, d)$, i.e., the best LLM across the benchmark datasets. In the experiments, we set $\eta = 0.6$.
We summarize our LLM routing procedures in Appendix A.
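As a minimal sketch of how the three scores can be computed for one candidate LLM, assuming a fitted probabilistic correctness predictor $g_m$ (such as the kNN classifier described above) and an estimate `p_hat` of its accuracy on the new task; the function and variable names here are ours:

```python
import numpy as np

def routing_scores(g_m, X_new, p_hat, t=0.5):
    """Compute S1 (eq. 3), S2 (eq. 4), and S3 (eq. 7) for one LLM on a new task d'.

    g_m:   probabilistic correctness classifier with predict_proba (classes assumed to be [0, 1])
    X_new: embedded inputs of the new task, shape (n_d', dim)
    p_hat: estimated accuracy p(d', m) of the thresholded predictor on the new task
    """
    probs = g_m.predict_proba(X_new)[:, 1]        # g_m(x), an estimate of P(y(x, m) = 1 | x)
    s1 = probs.mean()                             # S1: average predicted probability
    s2 = (probs > t).mean()                       # S2: average thresholded prediction
    s3 = s2 * p_hat + (1.0 - s2) * (1.0 - p_hat)  # S3: expected oracle score under eq. 6
    return s1, s2, s3

# Routing then selects, e.g., the argmax of the chosen score over candidate LLMs,
# optionally with the fallback rule of eq. 8 applied on top of S3.
```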

4.1 Connection to meta-learning


The OOD confidence model in equation 6 is a meta-model of routing across multiple tasks, and fitting it entails a form of meta-learning. Consider the meta-learning problem

$$\min_{g_m,\, p(\cdot, m)} \sum_{d=1}^{D} \sum_{i=1}^{n_d} \ell\big(\bar{g}_m(x_i^d)\, p(d, m) + (1 - \bar{g}_m(x_i^d))(1 - p(d, m)),\, y(x_i^d, m)\big), \qquad (9)$$

where $\bar{g}_m$ and $p(\cdot, m)$ are meta-parameters and the adaptation step $\bar{g}_m \to \bar{g}_m(\cdot)\, p(\cdot, m)$ adaptively shrinks the router output towards ambiguity. We exploit this connection to theoretically demonstrate the potential advantages of routing LLMs using $S_3$ over $S_2$.
In expectation/in the population, equation 9 fits a larger model class than equation 1, so the risk of the adaptively shrunken router is at most that of the non-adaptive router:

$$\sum_{d=1}^{D} \mathbb{E}\big[\ell(\bar{g}_m(X^d)\, p(d, m) + (1 - \bar{g}_m(X^d))(1 - p(d, m)),\, y(X^d, m))\big] \le \sum_{d=1}^{D} \mathbb{E}\big[\ell(\bar{g}_m(X^d), y(X^d, m))\big]. \qquad (10)$$

This suggests (subject to standard assumptions on the loss function) that adaptive shrinkage routing leads to
better approximations of the oracle router. Lemma 4.1 confirms this intuition.
Lemma 4.1. Let $\ell(y_1, y_2) = \rho(y_1 - y_2)$ for some subadditive $\rho: \mathbb{R} \to \mathbb{R}$ (e.g., $\rho(x) = \frac{1}{2}x^2$ for the square loss). We have

$$\ell(S_2, \tilde{S}) \le \mathbb{E}\big[\ell(\bar{g}_m(X^d), y(X^d, m))\big],$$
$$\ell(S_3, \tilde{S}) \le \mathbb{E}\big[\ell(p(d, m)\, \bar{g}_m(X^d) + (1 - p(d, m))(1 - \bar{g}_m(X^d)),\, y(X^d, m))\big].$$
We present the proof in Appendix D. Combining equation 10 and Lemma 4.1, we expect the adaptive router
based on S3 to outperform its non-adaptive counterpart based on S2 . That said, it is unclear whether adaptive
shrinkage will improve the performance of the adaptive router in finite samples: the expected performance
of the adaptive router may be offset by the inflation in variance from fitting the larger (adaptive) model
class. Fortunately, our empirical results show that task-specific adaptation, i.e., using S3 as a score for routing,
generally improves performance. The two-step method for fitting ḡm and p in Section 4 approximately minimizes
equation 9 with a single Gauss-Seidel pass through the decision variables.

Table 1: LLM routing on HELM: Comparison of various model scores for LLM routing with the Oracle model
selection and performance of the best model on average (BMA).
Method          Acc.    Ratio to Best   Pearson   Spearman   % BMA   # Params   Rank
S1 (eq. 3)      0.662   0.855           0.685     0.465      0.17    40.3B      6.172
S2 (eq. 4)      0.676   0.868           0.636     0.468      0.10    44.3B      5.897
S3 (eq. 7, 8)   0.694   0.898           0.727     0.492      0.48    49.8B      5.310
S3 true p       0.735   0.944           0.799     0.596      0.22    33.8B      3.800
LL              0.684   0.869           0.714     0.459      0.10    —          6.517
BMA             0.688   0.884           —         —          1.00    70.0B      6.069
Oracle          0.773   1.000           —         —          0.21    29.1B      1.000

5 Experiments
5.1 Model routing on HELM
We explore the benefits and challenges of learning from benchmarks using the HELM (Liang et al., 2022)
benchmark.

Data We select 29 datasets representing scenarios such as question answering (including a subset of MMLU
(Hendrycks et al., 2020)), text classification, language, knowledge, and reasoning, among others. We present
additional information about these datasets in Table 3.

Models We evaluate 18 open-source models ranging in size from 3B to 70B, including base and chat variations
of Llama 2 in different sizes. All models are summarized in Table 4.

Model routing The best model on average (BMA) across the 29 considered HELM datasets is llama-2-70b
(followed by llama-2-70b-chat). Our goal is to show that learning model routers from benchmark data can
simultaneously outperform BMA and reduce inference costs by recommending smaller LLMs for tasks where
they can perform well. We compare models selected with the three scores, S1 , S2 , and S3 , presented in Section
4 to the performance of llama-2-70b, i.e., the BMA. All correctness predictors $g_m$ are kNN classifiers with $k = 5$.
We also report the performance of the best model according to the “oracle” score S̃, which is the upper
bound on what can be achieved with model routing, and S̃3 , which corresponds to S3 with the true p(d′ , m), i.e.,
the accuracy of (an imperfect) gm on d′ . Finally, we compare to scoring LLMs with the average log-likelihood
(LL) (or negative perplexity) of the response they generate on the inputs from the task of interest. This last
baseline requires producing generations with every LLM at test time to make a selection, while all of our scores
only require generating with the chosen LLM.

Results We conduct 29 sets of experiments, each time selecting 28 of the datasets as the benchmark data for
training the LLM routers and using the remaining task as the new task d′ for evaluating the quality of the LLM
selection for this task. In Table 1 we report averages across experiments for the performance of the selected
model (Acc.), ratio of this performance to the performance of the best model for the corresponding new task
(Ratio to Best), Pearson and Spearman rank correlations between model accuracies and model scores, number
of parameters of the selected model (# Params), rank of the selected model out of 18 considered (Rank). We
also report the fraction of times the BMA is selected by a method (% BMA). Best results are highlighted with
bold and second best with an underline (excluding Oracle).
First, we notice that accounting for imperfections of the correctness predictors (their average accuracy is
0.59) has clear benefits: when we have access to the true accuracy of correctness predictors, the corresponding
score, S3 true p, noticeably outperforms all other scores. Our simple kernel smoothing estimator of this accuracy
(MAE= 0.116) allows us to obtain a practical model routing score S3 that outperforms BMA (llama-2-70b)
while choosing smaller models for some of the tasks (as evident by the average number of parameters of the
chosen models). S2 sacrifices some accuracy but chooses even smaller performant models. Overall, learning
from benchmarks allows us to obtain LLM routers that can improve overall performance while utilizing smaller
models where appropriate. Finally, we note that log-likelihood (LL) also performs well, however, routing with
it requires passing each test input through each candidate LLM, which have 347B parameters in total.

Reducing the OOD gap The average accuracy of correctness predictors across tasks and models for the experiments in Table 1 is 0.59. It is a fairly low accuracy for binary classification, which we attribute to the diversity of tasks in the HELM benchmark leading to substantial distribution shifts when predicting the correctness of LLMs on held-out tasks. We investigate the quality of model routing when we reduce this OOD gap. A simple strategy to reduce this gap is to collect a small number of labeled in-distribution samples. This can be accomplished by asking a practitioner to provide reference answers ($r_i^d$'s) for a small number of inputs from their task, allowing us to evaluate the correctness of candidate LLMs on these in-distribution inputs and use them to improve correctness predictors.

Figure 2: Using $\min(\alpha n_{d'}, 50)$ training samples from $d'$ to reduce the OOD gap: average accuracy of the selected models vs. $\alpha$ for $S_1$ (eq. 3), $S_2$ (eq. 4), $S_3$ (eq. 8), $S_3$ with true $p$, LL, BMA, and Oracle.

We simulate this scenario by moving $\min(\alpha n_{d'}, 50)$ samples from the data from a new task $d'$ to the data for training the correctness predictors. The upper limit of 50 samples is to maintain practical utility while accounting
for varying dataset sizes (see Table 3). We conduct 29 sets of experiments, repeating each one 10 times to obtain
standard deviations (randomness is due to random selection of data points from a new task for reducing the
OOD gap). We summarize the average accuracy of models selected with various routing scores for varying α in
Figure 2 (α = 0 corresponds to Table 1). Results for Pearson correlation are in Figure 6(a).
We see that even a small number of in-distribution samples (α = 0.05) can reduce the OOD gap (corre-
sponding average accuracy of correctness predictors is 0.65; see Figure 6(b)) and noticeably improves the model
routing performance of all three of our scores. When the number of in-distribution samples further increases,
S1 starts to outperform S3 . We attribute this observation to kNN being well-calibrated in-distribution, i.e.,
the correctness predictors provide reliable estimates of their own confidence P (y|x), which are used by S1 in
equation 3. Finally, we note a fairly large variance in the results due to random selection of the in-distribution
training samples from d′ , suggesting that active learning (Settles, 2009) can help to further improve LLM
routing.

5.2 Model Routing on MixInstruct

We now consider a different setting and task type, the MixInstruct benchmark dataset (Jiang et al., 2023). The dataset is composed of instruction-following tasks, divided into train/validation/test sets of 100K/5K/5K samples, and includes evaluations of $N = 11$ open-source LLMs using common metrics, e.g., BERTScore (Zhang et al., 2020), BARTScore (Yuan et al., 2021), and BLEURT (Sellam et al., 2020). In Jiang et al. (2023), this benchmark was used to compare different LLM ranking methods in per-instance model selection. We follow the same setting and apply our score $S_1(m, d')$ to the test set, per instance, where we use the 100K-sample train set as the benchmark data for training our LLM router. See Appendices A and C for details on the score computation and the experiment parameters, respectively. Due to the per-instance setting, and since the test set was constructed from in-distribution data, we focus on our simplest router model $S_1$, equation 3.

Figure 3: Average metrics (BERTScore, BARTScore, BLEURT) on subsets of the MixInstruct test set, defined by limiting the maximal average distance between test instances and their closest neighbors in the reference (train) set; curves shown for Ours, Oracle, and LL, with the percentage of the test set used at each maximal distance.
We compare our approach with the scoring methods examined by Jiang et al. (2023), as well as scoring based

Table 2: Average metrics for per-instance LLM selection on the test set of MixInstruct. MCPI denotes model
calls per instance, for N models. Best results are highlighted with bold and second best with an underline
(excluding Oracle).
Method                                 BERTScore ↑   BARTScore ↑   BLEURT ↑   MCPI
Random                                 66.36         -3.76         -0.77      -
LL                                     65.83         -4.12         -0.96      N
BMA: Open-Assistant                    74.68         -3.45         -0.39      -
BMA: Vicuna                            69.60         -3.44         -0.61      -
MLM-Scoring (Salazar et al., 2020)     64.77         -4.03         -0.88      N
SimCLS (Liu and Liu, 2021)             73.14         -3.22         -0.38      N
SummaReranker (Ravaut et al., 2022)    71.60         -3.25         -0.41      N
PairRanker (Jiang et al., 2023)        72.97         -3.14         -0.37      N
Ours                                   74.75         -3.40         -0.38      2
Oracle                                 77.67         -2.87         -0.15      N

on the average log-likelihood (LL) of the model responses to the inputs. Additionally, we present the metrics
for the best models on average (BMA), Open-Assistant (LAION-AI, 2023) and Vicuna (Chiang et al., 2023).
We report the results of BERTScore, BARTScore and BLEURT in Table 2, along with the number of model
calls per instance (MCPI) performed during inference time. All compared methods require model generations
for every point in the test set, by each of the examined LLMs, whereas our approach requires only one model
generation and one call to some general embedding function. In addition, all methods, except for LL, require
training auxiliary language models, whereas our approach is a simple kNN classifier on the embedded inputs.
While our approach does not consistently outperform the compared methods, these results demonstrate the
potential of using benchmark datasets for model routing with significantly better inference-time efficiency.

Effect of benchmark dataset sparsity To highlight the potential of our approach in this setting, we examine the effect of the reference benchmark data sparsity. We apply our method to different subsets of the test set, $X_{\text{test}}$, where the subsets are defined by limiting the maximal average distance of each test set point to its closest points from the reference (train) set, denoted by $\mathrm{NN}_{\text{train}}$, i.e.,

$$X'_C = \Big\{ x' \in X_{\text{test}} \;\Big|\; \tfrac{1}{|\mathrm{NN}_{\text{train}}(x')|} \textstyle\sum_{x \in \mathrm{NN}_{\text{train}}(x')} \mathrm{dist}(x', x) < C \Big\},$$

where $C$ is the maximal average distance and $X'_C$ is the resulting subset of the test set.
method, the oracle (best possible choices), and LL scoring. We also report the percentage of the test set that is
used in each subset. The figure shows that our predictor approaches the oracle metrics as the average distance
to the reference points decreases. This suggests that adding more benchmark datasets, to reduce the sparsity
of the reference space, may lead to better LLM selections with our approach.
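A sketch of how such distance-limited subsets can be constructed, assuming embedded test and reference (train) inputs are available as arrays; this is our own illustration of the definition above, not the authors' code:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def subset_by_reference_distance(X_test, X_train, C, k=10):
    """Indices of test points whose average cosine distance to their k nearest
    reference (train) points is below the cap C, i.e., the subset X'_C."""
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(X_train)
    dists, _ = nn.kneighbors(X_test)            # shape (n_test, k)
    return np.flatnonzero(dists.mean(axis=1) < C)
```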

6 Discussion and Conclusion

How useful are smaller LLMs? While a given LLM may work best on average, these models tend to be the biggest and therefore most expensive to run. Practitioners can achieve gains in cost, compute, and latency if we can successfully predict whether a smaller LLM can be adequate for a given task. Identifying good smaller models for tasks of interest will also redefine the cost/benefit tradeoff behind automating certain tasks, potentially incentivizing the automation of new tasks that were previously cost-prohibitive to automate with larger LLMs.

Figure 4: LLM routing with ≤ 13B parameter models compared to Llama 2 70B: average accuracy of the selected models vs. $\alpha$ for $S_1$ (eq. 3), $S_2$ (eq. 4), $S_3$ (eq. 8), $S_3$ with true $p$, LL, Llama 2 70B, and Oracle.

To evaluate the potential of smaller LLMs we revisit our HELM experiment in Figure 2. In Figure 4, we perform LLM routing using only models with ≤ 13B parameters and compare it to the performance of Llama 2 70B. Oracle's
performance demonstrates that it is conceptually possible to outperform a large model by routing smaller
LLMs. Results with our scores S1 and S2 (see Figure 7 for breakdown by scores) demonstrate that it is also
practically feasible to match the performance of the 70B model by combining learning from benchmarks with a
small number (α = 0.04, i.e., 2-40 samples) of labeled samples from a new task that a practitioner can provide
to save on the inference costs in their LLM application.

Learning from more benchmarks We anticipate learning LLM routers from benchmarks to be the most effective when new tasks are similar to the benchmark tasks, thus reducing the OOD gap without any labeling burden for a practitioner. To empirically investigate this hypothesis, in Figure 5 we visualize the relation between the quality of model routing with $S_3$, measured with the Pearson correlation between model scores and accuracies of candidate LLMs, and the distance $u(d')$ from a new task $d'$ to the available benchmark data for training the routers. In this experiment, we aggregate results across different $\alpha$ values from Figure 2. For smaller distance values the correlation approaches 1, while for large distances it sometimes deteriorates. Results for other scores demonstrate a similar trend and are presented in Appendix B.2 along with additional details. This experiment and the benchmark dataset sparsity analysis presented in Figure 3 for MixInstruct illustrate that learning with more benchmarks can improve the efficacy and reliability of LLM routers, as new tasks are more likely to be closer to a large collection of datasets.

Figure 5: Pearson correlation between $S_3$ (eq. 7) and candidate LLM accuracies vs. dataset distance $u(d')$.

Future work Our work demonstrates the potential of learning from benchmarks for LLM routing and inves-
tigates 3 model scores in the context of OOD generalization when routing LLMs for new tasks. We summarize
potential next steps for improving the quality and efficacy of LLM routers.
The major challenge of LLM routing is OOD generalization of correctness predictors. Thus, using more
benchmarks and modern methods for improving OOD generalization to learn correctness predictors is a promis-
ing next step. A practitioner can also provide labels for a few samples from their task, possibly guided by active
learning techniques, to adapt or fine-tune correctness predictors. Even when reducing the OOD gap is too chal-
lenging, our score accounting for the (potentially low) accuracy of correctness predictors demonstrated strong
results when this accuracy, p(d′ , m), is known for a new task, thus encouraging the development of methods for
estimating it better.
We also anticipate that routing “expert” LLMs fine-tuned for a specific domain can improve the results.
Regions of the sample space where such models are “correct” should mostly align with the domains of their
expertise (recall Figure 1), making it easier to learn the corresponding correctness predictors, and simplifying
LLM routing when a new task is from a specific domain.
Our experiments in Figure 4 demonstrate the utility of LLM routing with smaller models, which can reduce
costs and facilitate the use of LLMs in a broader set of domains. Thus, we want to explore modifications to our
scores that will encourage the selection of smaller LLMs when their anticipated performance is comparable to
the larger, more reliable models. Prior work on frugal API selection (Chen et al., 2020, 2023) provides a good
starting point to explore this direction.

References
Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint
arXiv:1907.02893.
Beeching, E., Fourrier, C., Habib, N., Han, S., Lambert, N., Rajani, N., Sanseviero, O., Tunstall, L., and Wolf, T.
(2023). Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.

Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern recognition and machine learning, volume 4. Springer.

Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., Monz, C., Pecina, P., Post, M., Saint-
Amand, H., et al. (2014). Findings of the 2014 workshop on statistical machine translation. In Proceedings
of the ninth workshop on statistical machine translation, pages 12–58.
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J.,
Bosselut, A., Brunskill, E., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint
arXiv:2108.07258.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry,
G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing
systems, 33:1877–1901.
Chen, J., Liu, F., Avci, B., Wu, X., Liang, Y., and Jha, S. (2021a). Detecting errors and estimating accuracy on
unlabeled data with self-training ensembles. Advances in Neural Information Processing Systems, 34:14980–
14992.

Chen, L., Zaharia, M., and Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing
Cost and Improving Performance. arXiv preprint arXiv:2305.05176.
Chen, L., Zaharia, M., and Zou, J. Y. (2020). Frugalml: How to use ml prediction apis more accurately and
cheaply. Advances in neural information processing systems, 33:10685–10696.

Chen, M., Goel, K., Sohoni, N. S., Poms, F., Fatahalian, K., and Ré, C. (2021b). Mandoline: Model evaluation
under distribution shift. In International conference on machine learning, pages 1617–1629. PMLR.
Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez,
J. E., Stoica, I., and Xing, E. P. (2023). Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt
quality.

Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE transactions on information
theory, 13(1):21–27.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint arXiv:1810.04805.

Gao, L., Tow, J., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., McDonell, K., Muennighoff,
N., Phang, J., Reynolds, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. (2021). A framework for
few-shot language model evaluation.
Garg, S., Balakrishnan, S., Lipton, Z. C., Neyshabur, B., and Sedghi, H. (2022). Leveraging unlabeled data to
predict out-of-distribution performance. In International Conference on Learning Representations.

Goyal, P., Mahajan, D., Gupta, A., and Misra, I. (2019). Scaling and benchmarking self-supervised visual
representation learning. In Proceedings of the ieee/cvf International Conference on computer vision, pages
6391–6400.
Guillory, D., Shankar, V., Ebrahimi, S., Darrell, T., and Schmidt, L. (2021). Predicting with confidence on
unseen distributions. In Proceedings of the IEEE/CVF international conference on computer vision, pages
1134–1144.
Gulrajani, I. and Lopez-Paz, D. (2021). In search of lost domain generalization. In International Conference
on Learning Representations.
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks. In
International conference on machine learning, pages 1321–1330. PMLR.
Hastie, T., Tibshirani, R., Friedman, J. H., and Friedman, J. H. (2009). The elements of statistical learning:
data mining, inference, and prediction, volume 2. Springer.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2020). Measuring
massive multitask language understanding. arXiv preprint arXiv:2009.03300.

Jiang, D., Ren, X., and Lin, B. Y. (2023). Llm-blender: Ensembling large language models with pairwise
ranking and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 14165–14178, Toronto, Canada. Association for Computational
Linguistics.
Jiang, Y., Nagarajan, V., Baek, C., and Kolter, J. Z. (2021). Assessing Generalization of SGD via Disagreement.
In International Conference on Learning Representations.
Kelly, M., Longjohn, R., and Nottingham, K. (2023). The UCI Machine Learning Repository.
Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M.,
Phillips, R. L., Gao, I., et al. (2021). Wilds: A benchmark of in-the-wild distribution shifts. In International
Conference on Machine Learning, pages 5637–5664. PMLR.
LAION-AI (2023). Open assistant.
Li, C., Liu, H., Li, L., Zhang, P., Aneja, J., Yang, J., Jin, P., Hu, H., Liu, Z., Lee, Y. J., et al. (2022). Elevater:
A benchmark and toolkit for evaluating language-augmented visual models. Advances in Neural Information
Processing Systems, 35:9287–9301.
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y.,
Kumar, A., et al. (2022). Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
Liu, Y. and Liu, P. (2021). Simcls: A simple framework for contrastive learning of abstractive summarization.
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 1065–1072.
Maity, S., Yurochkin, M., Banerjee, M., and Sun, Y. (2023). Understanding new tasks through the lens of
training data via exponential tilting. In International Conference on Learning Representations.
Ng, N. H., Hulkund, N., Cho, K., and Ghassemi, M. (2023). Predicting out-of-domain generalization with
neighborhood invariance. Transactions on Machine Learning Research.
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., and Snoek,
J. (2019). Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift.
Advances in neural information processing systems, 32.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are
unsupervised multitask learners. OpenAI blog, 1(8):9.
Raschka, S. (2018). Model evaluation, model selection, and algorithm selection in machine learning. arXiv
preprint arXiv:1811.12808.
Ravaut, M., Joty, S., and Chen, N. (2022). Summareranker: A multi-task mixture-of-experts re-ranking frame-
work for abstractive summarization. In Proceedings of the 60th Annual Meeting of the Association for Com-
putational Linguistics (Volume 1: Long Papers), pages 4504–4524.
Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. In
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for
Computational Linguistics.
Salazar, J., Liang, D., Nguyen, T. Q., and Kirchhoff, K. (2020). Masked language model scoring. In Proceedings
of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2699–2712.
Sellam, T., Das, D., and Parikh, A. (2020). BLEURT: Learning robust metrics for text generation. In Proceedings
of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892.
Settles, B. (2009). Active learning literature survey.
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A.,
Garriga-Alonso, A., et al. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities
of language models. arXiv preprint arXiv:2206.04615.

Sun, B. and Saenko, K. (2016). Deep coral: Correlation alignment for deep domain adaptation. In Computer
Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings,
Part III 14, pages 443–450. Springer.
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. (2019).
Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural
information processing systems, 32.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018). Glue: A multi-task benchmark
and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
Xu, H. and Tibshirani, R. (2022). Estimation of prediction error with known covariate shift. arXiv preprint
arXiv:2205.01849.
Yu, Y., Yang, Z., Wei, A., Ma, Y., and Steinhardt, J. (2022). Predicting out-of-distribution error with the
projection norm. In International Conference on Machine Learning, pages 25721–25746. PMLR.
Yuan, W., Neubig, G., and Liu, P. (2021). Bartscore: Evaluating generated text as text generation. Advances
in Neural Information Processing Systems, 34:27263–27277.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. (2020). BERTScore: Evaluating Text
Generation with BERT. In International Conference on Learning Representations.

A Correctness predictors and confidence estimation
We provide additional details on the correctness predictors used in our experiments, along with more details on
the dataset distance and the Gaussian kernel smoother for estimating the accuracy of the correctness predictors
on new tasks, p(d′ , m)s.

Correctness predictors in our experiments While any probabilistic classifier may fit our setting, in the experiments we mainly used a simple kNN classifier, applied in an embedded space. Recall that we have $D$ benchmark datasets with inputs $\{x_i^d\}_{i=1}^{n_d}$ for $d = 1, \ldots, D$. To compute our correctness predictor based on the benchmark datasets, we first embed all their inputs. We denote the combined set of embedded inputs from the benchmark datasets as $\mathcal{D} = \{\phi(x_1^d), \ldots, \phi(x_{n_d}^d)\}_{d=1}^D$, where $\phi$ is a sentence transformer (Reimers and Gurevych, 2019). We use all-mpnet-base-v2 from Hugging Face in all experiments. Given a sample $x_i^{d'}$ from a new task $d'$, we embed it using the same $\phi$ and define the classifier, $g_m$, for each model $m$ by:

$$g_m\big(x_i^{d'}\big) = \frac{1}{k} \sum_{e \in \mathrm{NN}(\phi(x_i^{d'}),\, k,\, \mathcal{D})} y(e, m), \qquad (11)$$

where $y(e, m) \in \{0, 1\}$ is the correctness of model $m$ on the (embedded) input $e$, and $\mathrm{NN}(\phi(x_i^{d'}), k, \mathcal{D})$ is the set of $k$ closest embedded neighbors from $\mathcal{D}$ to the new embedded sample $\phi(x_i^{d'})$, according to the cosine distance. Then, $\bar{g}_m$, as defined in equation 4, is a binary kNN classifier. Finally, we compute the per-model scores, $S_1(m, d')$ and $S_2(m, d')$, for the new task $d'$, according to equation 3 and equation 4, respectively.
Next, we describe a method for estimating the probability $p(d', m)$ in our confidence model and the $S_3(m, d')$ score, equation 7. This method comprises a dataset distance and a kernel smoother, defined as follows.

Dataset distance Our dataset distance $u(d)$ is a one-sided variant of the Chamfer distance with extended neighborhood size. We define it formally below:

$$u(d) = \frac{1}{n_d} \sum_{i=1}^{n_d} \mathrm{nn}(x_i^d, \mathcal{D}_{-d}), \qquad (12)$$

where $\mathcal{D}_{-d}$ is the set of (embedded) inputs from the $D$ datasets excluding inputs from $d$ (for a new task $d'$, $\mathcal{D}_{-d'} = \mathcal{D}$ since $d'$ is not part of the $D$ benchmark datasets we use for training LLM routers), and $\mathrm{nn}(x_i^d, \mathcal{D}_{-d})$ is the average distance from the input $x_i^d$ to its closest $\kappa$ neighbors in $\mathcal{D}_{-d}$:

$$\mathrm{nn}(x, \mathcal{D}) = \frac{1}{\kappa} \sum_{e \in \mathrm{NN}(\phi(x),\, \kappa,\, \mathcal{D})} \mathrm{cosine}(\phi(x), e), \qquad (13)$$

where $\mathrm{NN}(\phi(x), \kappa, \mathcal{D})$ is the set of $\kappa$ closest embedded neighbors of $\phi(x)$ in $\mathcal{D}$ according to cosine distance. We set $\kappa = 19$ for the dataset distance in all experiments.
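A minimal sketch of this task descriptor, assuming the embedded inputs are available as arrays (our illustration of equations 12 and 13, with assumed function names):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dataset_distance(X_d, X_rest, kappa=19):
    """Task descriptor u(d): average over the inputs of task d of the mean cosine
    distance to their kappa closest neighbors among the other tasks' inputs (D_{-d})."""
    nn = NearestNeighbors(n_neighbors=kappa, metric="cosine").fit(X_rest)
    dists, _ = nn.kneighbors(X_d)      # shape (n_d, kappa): distances to NN(phi(x), kappa, D_{-d})
    return float(dists.mean())         # mean over neighbors, then over the task's inputs
```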

Kernel smoother For each LLM $m = 1, \ldots, M$, to obtain the corresponding kernel smoother estimate we iterate over the available benchmark datasets, each time holding one out and computing pairs $(u(d), p(d, m))$ for the held-out dataset $d$, where $p(d, m)$ is the accuracy of $g_m$ on data from $d$ after training on $\mathcal{D}_{-d}$. We repeat this process 10 times for 15 values of the in-distribution mixing parameter $\alpha$ (similar to the experimental setup in Figure 2 but using benchmark datasets $d = 1, \ldots, D$ instead of $d'$) to obtain the training set of distance-accuracy pairs $\{u_z, p_z(m)\}_{z=1}^Z$. In the HELM experiments in Section 5.1, $Z = 28 \cdot 10 \cdot 15 = 4200$ (28 is the number of datasets from HELM after holding one out as the new task for evaluating the performance).
For a new task $d'$, we compute $u(d')$ using the inputs from this task and our benchmark datasets and estimate $p(d', m)$ for each $m$ with simple Gaussian kernel smoothing:

$$p(d', m) = \frac{\sum_{z=1}^{Z} p_z(m)\, K(u(d'), u_z)}{\sum_{z=1}^{Z} K(u(d'), u_z)}, \qquad (14)$$

where $K(u(d'), u_z) = \exp\left(-\frac{(u(d') - u_z)^2}{2\sigma^2}\right)$. We set $\sigma = 0.09$ in all experiments, which is the value we found to perform well through some preliminary experimentation.
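The kernel smoother itself is a short computation; a sketch under the same assumptions (the function and argument names are ours):

```python
import numpy as np

def estimate_p(u_new, u_train, p_train, sigma=0.09):
    """Nadaraya-Watson estimate of p(d', m) (eq. 14) from distance-accuracy pairs {(u_z, p_z(m))}."""
    u_train = np.asarray(u_train, dtype=float)
    p_train = np.asarray(p_train, dtype=float)
    w = np.exp(-((u_new - u_train) ** 2) / (2.0 * sigma ** 2))  # Gaussian kernel weights K(u(d'), u_z)
    return float(np.dot(w, p_train) / np.sum(w))
```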

Figure 6: Additional results for the Reducing the OOD gap experiment in Figure 2: (a) Pearson correlation between scores and model accuracies, (b) average accuracy of the correctness predictors $g_m$, and (c) MAE of estimating $p(d', m)$, each as a function of $\alpha$.

Figure 7: LLM routing with ≤ 13B parameter models compared to Llama 2 70B, broken down by score: (a) $S_1$ (equation 3), (b) $S_2$ (equation 4), and (c) $S_3$ (equation 8); average accuracy of the selected models vs. $\alpha$.

Finally, we note that the proposed confidence model, including the definitions of the dataset distance and
kernel smoother, can be combined with any classifier gm , and is not restricted to the kNN classifier used for the
correctness predictor in our experiments.

Additional notes regarding S3 Recall that when selecting a model with $S_3(m, d')$ we use an additional step, described in equation 8, that facilitates the selection of the best model on average when we are not sufficiently confident in the model with the highest $S_3(m, d')$ score. The probability expression $P(\tilde{S}(m_3, d') > \tilde{S}(m_*, d'))$ required for this step is not available in closed form, as $\tilde{S}$ is distributed as a (scaled) sum of two Bernoulli random variables, but it is straightforward to estimate via Monte Carlo sampling from the corresponding Bernoulli distributions.
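A sketch of this Monte Carlo estimate under the correctness model of equation 6 (our own illustration; the function and argument names are assumptions):

```python
import numpy as np

def prob_candidate_better(s2_cand, p_cand, s2_best, p_best, n, n_samples=10000, seed=0):
    """Estimate P(S~(m3, d') > S~(m*, d')): each oracle score is a scaled sum of two
    binomial draws, one over inputs predicted correct and one over the rest (eq. 6).

    s2_*: S2 scores (fraction of inputs predicted correct) for candidate and baseline LLMs
    p_*:  estimated accuracies p(d', m) of the corresponding correctness predictors
    n:    number of inputs n_d' in the new task
    """
    rng = np.random.default_rng(seed)

    def sample_oracle_score(s2, p):
        n_pos = int(round(s2 * n))                                  # inputs with \bar{g}_m = 1
        correct = rng.binomial(n_pos, p, size=n_samples)            # correct among predicted-correct
        flipped = rng.binomial(n - n_pos, 1.0 - p, size=n_samples)  # correct among predicted-incorrect
        return (correct + flipped) / n

    return float(np.mean(sample_oracle_score(s2_cand, p_cand) > sample_oracle_score(s2_best, p_best)))
```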
When reporting correlations for S3 (e.g., Pearson and Spearman correlations in Table 1), we use S3 (m, d′ )
as is, i.e., as defined in equation 7.

B Additional results for model routing on HELM


B.1 Reducing the OOD gap
We present additional results for this experiment in Figure 6. (a) shows Pearson correlation improvement as we increase $\alpha$, similar to the trends in accuracy improvement in Figure 2; (b) demonstrates that the accuracy of correctness predictors $g_m$ improves as we increase the number of samples from $d'$ used for training them, thus reducing the OOD gap; (c) shows the mean absolute error (MAE) of our kernel smoothing estimator of the accuracy of correctness predictors $p(d', m)$: the estimator does not improve as much with increased $\alpha$, thus $S_3$ eventually becomes worse than $S_1$ in terms of correlation and accuracy of the selected models.

Figure 8: Correlation of scores and LLM accuracies on new tasks vs. the corresponding dataset distances $u(d')$: (a) $S_1$ (equation 3), (b) $S_2$ (equation 4), and (c) $S_3$ (equation 7) with true $p(d', m)$.

B.2 Dataset distance and Pearson correlation


The dataset distance u(d′ ) is computed as in equation 12. As evident from equation 13, dataset distance will
usually decrease for larger values of α as inputs from d′ are moved into D (assuming that inputs from d′ are
on average closer to each other than they are to inputs from other tasks). In this experiment, this serves as a
mechanism to study the performance of LLM routing on closer datasets, providing insights into the benefits of
learning LLM routers on more benchmarks where it is more likely that dataset distance for a new task is small.
In Figure 8 we present relations between dataset distance u(d′ ) and Pearson correlation between various
model scores and accuracies of candidate LLMs. For results with S3 see Figure 5.

C Additional details for model routing on MixInstruct


Correctness predictor and metrics In the experiments on the MixInstruct dataset (Jiang et al., 2023), we construct $S_1(m, d')$ following the scoring approach described in Appendix A, where the MixInstruct train set was defined as the benchmark dataset. Then, instead of computing a per-dataset score for the entire test set, we compute the score for each test point, i.e., $S_1(m, x_i^{d'}) = g_m(x_i^{d'})$, and select a model $m$ per point based on this score. The reported metrics in Table 2 and Figure 3 are averaged over the output evaluations of these per-point model selections. In our experiments, to compute $g_m(x_i^{d'})$ we use the BERTScore metric on the closest train set points (as $y(x, m)$ in equation 11). This was motivated by the conceptual relation between the implementation of our approach and BERTScore, which relies on embedding space distances, and was validated empirically.
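A sketch of this per-instance selection, assuming unit-norm embeddings and a matrix of per-sample BERTScore values for the train set (our own construction, mirroring equation 11):

```python
import numpy as np

def per_instance_route(x_emb, train_embs, train_scores, k=10):
    """Select an LLM for a single test input by averaging per-sample train-set
    quality values (here BERTScore, used as y(x, m)) over the k nearest neighbors.

    x_emb:        embedded test input, shape (dim,)
    train_embs:   embedded train inputs, shape (n, dim), assumed unit-norm
    train_scores: per-sample values for each of the M LLMs, shape (n, M)
    """
    sims = train_embs @ x_emb                 # cosine similarity for unit-norm vectors
    nn_idx = np.argsort(-sims)[:k]            # indices of the k closest train points
    s1 = train_scores[nn_idx].mean(axis=0)    # S1(m, x) = g_m(x) per model
    return int(np.argmax(s1))                 # index of the selected LLM
```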

kNN parameter We set k = 10 for the kNN classifier, slightly higher than in the HELM experiments. This
choice was motivated by the in-distribution properties of the test set in MixInstruct, which is constructed from
different parts of the same datasets that comprise the train set. We note that the metrics did not significantly
vary for different choices of k ∈ [5, 100].

D Proof of Lemma 4.1

Lemma D.1 (Lemma 4.1). Let $\ell(y_1, y_2) = \rho(y_1 - y_2)$ for some subadditive $\rho: \mathbb{R} \to \mathbb{R}$ (e.g., $\rho(x) = \frac{1}{2}x^2$ for the square loss). We have

$$\ell(S_2, \tilde{S}) \le \mathbb{E}\big[\ell(\bar{g}_m(X^d), y(X^d, m))\big],$$
$$\ell(S_3, \tilde{S}) \le \mathbb{E}\big[\ell(p(d, m)\, \bar{g}_m(X^d) + (1 - p(d, m))(1 - \bar{g}_m(X^d)),\, y(X^d, m))\big].$$

Proof. We start by showing the upper bound of $\ell(S_2, \tilde{S})$:

$$\begin{aligned}
\ell(S_2, \tilde{S}) &= \rho\big(\mathbb{E}[\bar{g}_m(X^d)] - \mathbb{E}[y(X^d, m)]\big) && \text{(def. of } \ell\text{)} \\
&= \rho\big(\mathbb{E}[\bar{g}_m(X^d) - y(X^d, m)]\big) \\
&\le \mathbb{E}\big[\rho(\bar{g}_m(X^d) - y(X^d, m))\big] && \text{(convexity of } \rho\text{)} \\
&= \mathbb{E}\big[\ell(\bar{g}_m(X^d), y(X^d, m))\big], && \text{(def. of } \ell\text{)}
\end{aligned}$$

where we recalled that subadditive functions are convex in the third step. The upper bound of $\ell(S_3, \tilde{S})$ follows a similar argument:

$$\begin{aligned}
\ell(S_3, \tilde{S}) &= \rho\big(p(d, m)\,\mathbb{E}[\bar{g}_m(X^d)] + (1 - p(d, m))(1 - \mathbb{E}[\bar{g}_m(X^d)]) - \mathbb{E}[y(X^d, m)]\big) \\
&= \rho\big(\mathbb{E}[p(d, m)\,\bar{g}_m(X^d) + (1 - p(d, m))(1 - \bar{g}_m(X^d)) - y(X^d, m)]\big) \\
&\le \mathbb{E}\big[\rho(p(d, m)\,\bar{g}_m(X^d) + (1 - p(d, m))(1 - \bar{g}_m(X^d)) - y(X^d, m))\big] \\
&= \mathbb{E}\big[\ell(p(d, m)\,\bar{g}_m(X^d) + (1 - p(d, m))(1 - \bar{g}_m(X^d)), y(X^d, m))\big] \\
&\le \mathbb{E}\big[\ell(\bar{g}_m(X^d), y(X^d, m))\big].
\end{aligned}$$

Table 3: HELM dataset details

Dataset                                        Size (instances)   Type
RAFT-ADE Corpus V2                             40                 Binary Classification
RAFT-Banking 77                                40                 77 Class Classification
RAFT-NeurIPS Impact Statement Risks            40                 Binary Classification
RAFT-One Stop English                          40                 3 Class Classification
RAFT-Overruling                                40                 Binary Classification
RAFT-Semiconductor Org Types                   40                 3 Class Classification
RAFT-Systematic Review Inclusion               40                 Binary Classification
RAFT-TAI Safety Research                       40                 Binary Classification
RAFT-Terms of Service                          40                 Binary Classification
RAFT-Tweet Eval Hate                           40                 Binary Classification
RAFT-Twitter Complaints                        40                 Binary Classification
IMDB                                           1000               Binary Classification
Civil Comments-demographic=all                 1000               Binary Classification
bAbI-QA-task=all                               1000               Q&A: one word answers
BoolQ                                          1000               Binary Classification
Entity Matching-Dataset=Beer                   182                Binary Classification
Entity Matching-Dataset=Dirty iTunes Amazon    218                Binary Classification
Entity Matching-Dataset=Abt Buy                1000               Binary Classification
Entity Data Imputation-Dataset=Restaurant      242                Q&A: one word answers
Entity Data Imputation-Dataset=Buy             182                Q&A: one word answers
BBQ-subject=all                                1000               Multiple Choice Questions
Legal Support                                  1000               Multiple Choice Questions
LSAT QA-task=all                               461                Multiple Choice Questions
MMLU-Subject=Abstract Algebra                  111                Multiple Choice Questions
MMLU-Subject=College Chemistry                 108                Multiple Choice Questions
MMLU-Subject=Computer Security                 111                Multiple Choice Questions
MMLU-Subject=Econometrics                      126                Multiple Choice Questions
MMLU-Subject=US foreign policy                 111                Multiple Choice Questions
Truthful QA-task=mc single                     654                Multiple Choice Questions
Total: 29 datasets                             9946

Table 4: Candidate LLMs

Name                     Model Size, B   Average Accuracy on the 29 HELM tasks
codegen-16b-mono         16              0.451
dial-flant5-xl           3               0.454
falcon-40b               40              0.641
flan-t5-xl               3               0.650
flan-t5-xxl              11              0.658
flan-ul2                 20              0.668
gpt-jt-6b-v1             6               0.576
gpt-neox-20b             20              0.492
mpt-7b-instruct          7               0.514
mt0-xxl                  13              0.543
llama-2-13b              13              0.624
llama-2-13b-chat         13              0.623
llama-2-13b-chat-beam    13              0.603
llama-2-70b              70              0.688
llama-2-70b-chat         70              0.687
llama-2-7b               7               0.610
llama-2-7b-chat          7               0.605
starcoder                15              0.587
Total: 18 LLMs           347
