Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey

Yi Xin1, Siqi Luo1,2, Haodi Zhou1, Junlong Du3, Xiaohong Liu2, Yue Fan4, Qing Li4, Yuntao Du4*

1 Nanjing University, 2 Shanghai Jiao Tong University, 3 Youtu Lab, Tencent, 4 Beijing Institute for General Artificial Intelligence (BIGAI)
{xinyi, siqiluo, haodizhou}@[Link], jeffdu@[Link], xiaohongliu@[Link], {liqing, fanyue, duyuntao}@[Link]
* Corresponding author

arXiv:2402.02242v2 [[Link]] 8 Feb 2024

Abstract

Large-scale pre-trained vision models (PVMs) have shown great potential for adaptability across various downstream vision tasks. However, with state-of-the-art PVMs growing to billions or even trillions of parameters, the standard full fine-tuning paradigm is becoming unsustainable due to high computational and storage demands. In response, researchers are exploring parameter-efficient fine-tuning (PEFT), which seeks to exceed the performance of full fine-tuning with minimal parameter modifications. This survey provides a comprehensive overview and future directions for visual PEFT, offering a systematic review of the latest advancements. First, we provide a formal definition of PEFT and discuss model pre-training methods. We then categorize existing methods into three categories: addition-based, partial-based, and unified-based. Finally, we introduce the commonly used datasets and applications and suggest potential future research challenges. A comprehensive collection of resources is available at [Link] Awesome-Parameter-Efficient-Transfer-Learning.

1 Introduction

With the development of available datasets [Deng and et al., 2009], model architectures [Dosovitskiy and et al., 2021], and training algorithms [He and et al., 2022], a significant number of vision foundation models have been developed. Particularly, transformer-based pre-trained vision models (PVMs) [Khan and et al., 2022] have demonstrated remarkable performance across various computer vision tasks, such as image classification [Dosovitskiy and et al., 2021] and semantic segmentation [Kirillov and et al., 2023].

Owing to the powerful representational abilities of PVMs, it has become a popular paradigm to fine-tune PVMs for learning downstream tasks. However, traditional full fine-tuning, though effective, requires substantial computational and memory resources. This becomes particularly costly for models with billions or even trillions of parameters. Additionally, there is a requirement to maintain separate model weights for each dataset, which becomes impractical as the number of tasks increases, especially in the case of large PVMs.

As a promising solution, parameter-efficient fine-tuning (PEFT), which was originally proposed in NLP, overcomes the above challenges by updating a minimal number of parameters while potentially achieving comparable or superior performance to full fine-tuning [Hu and et al., 2021; Yu and et al., 2022]. These approaches hinge on recent advances showing that large pre-trained models trained with rich data have strong generalisability and that most parameters in the PVMs can be shared for new tasks [Kornblith and et al., 2019; Yu and et al., 2022]. PEFT methods reduce the number of learnable parameters, which not only facilitates more effective adaptation to novel tasks but also safeguards the pre-existing knowledge within the PVMs. Taking into account the prospects of PEFT and the fast-paced development of large-scale vision models, a survey that provides a detailed and up-to-date investigation of PEFT in the vision domain is in urgent demand.

This paper aims to provide a comprehensive and systematic study of PEFT methods in the vision domain, particularly focusing on transformer-based pre-trained models from 2019 to 2023. As shown in Fig. 1, existing visual PEFT methods can be categorized into addition-based tuning, partial-based tuning, and unified-based tuning. In Section 2, we define the problem of PEFT, introduce popular backbones, and discuss pre-training methods. In Section 3, a detailed taxonomy and in-depth analysis of the PEFT methods are presented. The real-world applications of PEFT are introduced in Section 4. Finally, in Section 5, we point out future research challenges.

2 Preliminaries

2.1 Problem Definition

Definition 1 (Parameter-Efficient Fine-Tuning). Given a pre-trained model M parametrized by θ and a downstream task D = {(x_i, y_i)}_{i=1}^{|D|}, where (x_i, y_i) is a ground-truth input-output pair of task D, parameter-efficient fine-tuning aims to adapt θ to task D by introducing a task-specific parameter increment ∆θ with |∆θ| ≪ |θ|. The optimal parameters are found by optimizing the loss L on task D:

min_{∆θ} E_{(x_i, y_i)∈D} L(M_{θ+∆θ}(ŷ_i | x_i), y_i).   (1)
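To make the definition concrete, the following sketch (our own minimal illustration, not code from any surveyed method) freezes the pre-trained parameters θ and trains only a small task-specific module standing in for ∆θ; the torchvision ViT-B/16 backbone and the 100-class head are arbitrary placeholder choices.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Minimal sketch of Definition 1: keep the pre-trained parameters (theta) frozen
# and optimize only a small task-specific increment (delta theta), here a linear head.
backbone = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)   # pre-trained PVM
backbone.heads = nn.Identity()                                 # expose the [cls] feature
for p in backbone.parameters():
    p.requires_grad = False                                    # theta is not updated

head = nn.Linear(768, 100)                 # delta theta, with |delta theta| << |theta|
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def training_step(images, labels):
    with torch.no_grad():                  # no gradient flows into the frozen PVM
        feats = backbone(images)           # [B, 768] class-token features
    loss = nn.functional.cross_entropy(head(feats), labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```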
- Addition-based Tuning (§3.1)
  - Adapter Tuning
    - Adapter Design: Convpass [Jie and et al., 2022], AIM [Yang and et al., 2023], PEA [Sharma and et al., 2023], AdaptFormer [Chen and et al., 2022], ST-Adapter [Pan and et al., 2022]
    - Optimization: LoRand [Yin and et al., 2023a], SCT [Zhao and et al., 2023], Polyhistor [Liu and et al., 2022a], VMT-Adapter [Xin and et al., 2024]
  - Prompt Tuning
    - Embedding Level: VPT [Jia and et al., 2022], LPT [Dong and et al., 2023], Pro-tuning [Nie and et al., 2023], DePT [Gao and et al., 2022], IDPT [Zha and et al., 2023], ViPT [Zhu and et al., 2023], LION [Wang and et al., 2024], CVP [Tsai and et al., 2023]
    - Pixel Level: ProSFDA [Hu and et al., 2022], EVP-L [Liu and et al., 2023], P2P [Wang and et al., 2022], VP [Bahng and et al., 2022], EVP [Wu and et al., 2022], ILM-VP [Chen and et al., 2023a], DAM-VP [Huang and et al., 2023]
  - Prefix Tuning: PATT [Yu and et al., 2022], eTT [Xu and et al., 2023a], LAE [Gao and et al., 2023], VQT [Tu and et al., 2023], Prefix-tuning [Li and et al., 2021]
  - Side Tuning
    - Param Efficient: Side-Tuning [Zhang and et al., 2020], ViT-Adapter [Chen and et al., 2023b], SAN [Xu and et al., 2023b]
    - Param & Memory Efficient: LST [Sung and et al., 2022], DTL [Fu and et al., 2024], E3VA [Yin and et al., 2023b], SAM-LST [Chai and et al., 2023]
- Partial-based Tuning (§3.2)
  - Specification Tuning: Linear Probe [Kornblith and et al., 2019], AdapterBias [Fu and et al., 2022], DP-BiTFiT [Bu and et al., 2022], LN-Tune [Basu and et al., 2024], BitFit [Zaken and et al., 2022], DiffFit [Xie and et al., 2023]
  - Reparameter Tuning: LoRA [Hu and et al., 2021], KronA [Edalati and et al., 2022], FacT [Jie and Deng, 2023], EFFT [Chen, 2023], Atten-Scale [Basu and et al., 2024], KAdaptation [He and et al., 2023], PHNNs [Grassucci and et al., 2022], SSF [Lian and et al., 2022], DnA [Jiang and et al., 2022], RepAdapter [Luo and et al., 2023]
- Unified-based Tuning (§3.3): V-PETL [Yu and et al., 2022], NOAH [Liu and et al., 2022b], U-Tuning [Jiang and et al., 2023], LAE [Gao and et al., 2023]

Figure 1: Taxonomy of Parameter-Efficient Fine-Tuning Methods for Pre-trained Vision Models.

2.2 Vision Transformer

The standard Vision Transformer [Dosovitskiy and et al., 2021] consists of a patch embedding layer and L Transformer layers. Given an image x ∈ R^{H×W×C}, the patch embedding layer first splits and flattens the image x into sequential patches x_p ∈ R^{N×(P²C)}, where (H, W) represents the height and width of the input image, (P, P) is the resolution of each image patch, C denotes the number of channels, and N = HW/P² is the number of image tokens. Then, x_p is mapped to x_0 ∈ R^{N×d} with a trainable linear projection. The combination of a prepended [cls] token and x_0 forms the input of the Transformer encoder.

Each Transformer layer consists of a multi-head attention (MHA) and a multilayer perceptron (MLP) module. In MHA, attention scores are computed using query (Q), key (K), and value (V) representations, along with projection matrices W_q, W_k, W_v ∈ R^{d×d}. Given an input x_{ℓ−1} at the ℓ-th layer, the attention is calculated as follows:

Q = x_{ℓ−1} W_q,  K = x_{ℓ−1} W_k,  V = x_{ℓ−1} W_v,   (2)

x′_ℓ = Attention(Q, K, V) = softmax(QKᵀ/√d) V.   (3)

The output tokens x′_ℓ are further sent to a LayerNorm (LN) and an MLP block, which is formulated as follows:

x_ℓ = MLP(LN(x′_ℓ)) + x′_ℓ,   (4)

where x_ℓ is the output of the ℓ-th encoder layer.
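For readers who prefer code to notation, the following sketch mirrors the simplified single-head formulation of Eqs. (2)-(4); it deliberately omits the multi-head split and the residual connection around attention that practical ViT implementations add, and all dimensions are illustrative.

```python
import math
import torch
import torch.nn as nn

class SimplifiedBlock(nn.Module):
    """One Transformer layer following Eqs. (2)-(4): single-head attention
    followed by a LayerNorm + MLP block with a residual connection."""
    def __init__(self, d: int, mlp_ratio: int = 4):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)   # W_q
        self.wk = nn.Linear(d, d, bias=False)   # W_k
        self.wv = nn.Linear(d, d, bias=False)   # W_v
        self.norm = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, mlp_ratio * d), nn.GELU(),
                                 nn.Linear(mlp_ratio * d, d))
        self.d = d

    def forward(self, x):                       # x: [B, N, d] tokens from layer l-1
        q, k, v = self.wq(x), self.wk(x), self.wv(x)                      # Eq. (2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1)
        x_prime = attn @ v                                                # Eq. (3)
        return self.mlp(self.norm(x_prime)) + x_prime                    # Eq. (4)
```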
Recent advancements in vision Transformer architectures have significantly enhanced performance in vision tasks. One line of work improves the standard ViT by integrating additional or contextual information, with notable models like DeiT [Touvron and et al., 2021] and Token-to-Token (T2T) ViT [Yuan and et al., 2021]. Another line of work focuses on multi-scale ViTs using hierarchical designs to capture spatial details at varying scales, a capability limited in standard ViTs due to fixed token numbers and dimensions. Key models in this category include Pyramid ViT (PVT) [Wang and et al., 2021] and Swin Transformer [Liu and et al., 2021]. A more comprehensive survey can be found in the literature [Khan and et al., 2022].

Figure 2: The representative backbones and pre-training methods from 2019 to 2023. Backbones: ViT, DeiT, Swin, PVT, T2T, CvT, SAM. Self-supervised methods: MoCo, BYOL, SimCLR, MAE, DINO, SimMIM, BEiT, CLIP, ALIGN, EVA, EVA-02, DINO v2.

2.3 Model Pre-training

Recently, many pre-training methods with innovative backbones for training PVMs have emerged. The pre-training methods can be mostly categorized into supervised learning and self-supervised learning.

Supervised pre-training. These methods use classification losses for pre-training on large annotated datasets, e.g., ImageNet [Deng and et al., 2009]. A renowned pre-trained model that applies supervised pre-training is SAM [Kirillov and et al., 2023], which is trained on pixel-level annotated datasets and achieves excellent results in segmentation tasks.

Self-supervised pre-training. Self-supervised learning, now a leading pre-training paradigm, includes 1) contrastive learning methods, which focus on attracting similar (positive) and repelling dissimilar (negative) samples. These include both image-based approaches such as SimCLR [Chen and et al., 2020], MoCo [He and et al., 2020], and DINO [Caron and et al., 2021], and multi-modality based approaches like CLIP [Radford and et al., 2021] and ALIGN [Jia and et al., 2021]. Note that we exclusively focus on the image-related modules of multimodal self-supervised models, ignoring other modalities. 2) Masked image modeling methods, including MAE [He and et al., 2022], SimMIM [Xie and et al., 2021], and EVA [Fang and et al., 2022], which involve masking parts of images and reconstructing them.
3 Methodology

3.1 Addition-based Methods

Addition-based methods incorporate additional trainable modules or parameters into the original PVMs to learn task-specific information. This subsection discusses four primary branches of representative addition-based methods: adapter tuning, prompt tuning, prefix tuning, and side tuning.

Adapter Tuning. As a pioneering work, the adapter was initially introduced in the NLP domain by [Houlsby and et al., 2019] to achieve PEFT. Owing to its remarkable effectiveness, it has been successfully adopted in the CV field as well. This method integrates small neural modules, termed adapters, into the Transformer layers. During the adaptation process, only these adapters are fine-tuned. The adapter architecture consists of a down-projection layer parameterized by W_down ∈ R^{d×k} and an up-projection layer parameterized by W_up ∈ R^{k×d}. Here, k (with k ≪ d) serves to reduce the dimension of the representation to a lower rank. Furthermore, a ReLU layer is positioned between the two layers to enable non-linear projection. For a given input feature map x_ℓ ∈ R^{N×d}, the adapter generates optimized features as follows:

x̂_ℓ = ReLU(x_ℓ W_down) W_up,   (5)

where W = [W_down; W_upᵀ] ∈ R^{d×2k} denotes all the trainable parameters in the adapter.
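A bottleneck adapter of this form takes only a few lines to express. The sketch below is a generic implementation of Eq. (5) rather than the exact module of any particular paper; the residual addition and the parallel-insertion comment reflect common practice (e.g., Houlsby-style and AdaptFormer-style placements), and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter implementing Eq. (5): down-project to rank k,
    apply a ReLU, and project back to dimension d."""
    def __init__(self, d: int, k: int):
        super().__init__()
        assert k < d, "the bottleneck dimension k should be much smaller than d"
        self.down = nn.Linear(d, k)    # W_down in R^{d x k}
        self.up = nn.Linear(k, d)      # W_up in R^{k x d}

    def forward(self, x):              # x: [B, N, d] features of one Transformer layer
        return self.up(torch.relu(self.down(x)))   # x_hat in Eq. (5)

# Typical usage: only the adapters (and usually the classifier) are trainable;
# in most designs the adapter output is added back to x as a residual, and
# AdaptFormer-style variants place the adapter in parallel with the MLP block.
adapter = Adapter(d=768, k=64)
x = torch.randn(2, 197, 768)
x = x + adapter(x)
```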
Adapter tuning methods in the CV domain can be broadly divided into two categories: 1) designing specific adapter architectures for various vision tasks (e.g., image classification, video understanding, etc.), and 2) employing advanced optimization techniques to reduce the trainable parameters in the adapter.

In the first category, AdaptFormer [Chen and et al., 2022] serves as a typical example. It marks the first instance of adapting vision transformers to a broad array of downstream visual recognition tasks using adapters. Notably, AdaptFormer doesn't modify the structure of the adapter but demonstrates that parallel insertion of adapters is more efficacious for vision tasks than the sequential insertion typically employed in NLP tasks. Another key contribution is Convpass [Jie and et al., 2022], which highlights that current adapters are hindered by a lack of strong inductive bias, limiting their performance. To overcome this, Convpass incorporates trainable convolutional blocks, thereby enhancing the adapter's capabilities by integrating the strengths of convolutional neural networks. Additionally, AIM [Yang and et al., 2023] introduces adapters specialized in spatial, temporal, and joint domains, while ST-Adapter [Pan and et al., 2022] offers a spatiotemporal adapter. Both methods are tailored to improve a vision model's spatiotemporal reasoning for video understanding tasks. In the field of robotic manipulation, Rob-Adapter [Sharma and et al., 2023] applies the classic bottleneck architecture, commonly used in image classification, for lossless adaptation.

In the second category, the methods focus on optimizing the adapter's architecture to reduce trainable parameters. One such example is LoRand [Yin and et al., 2023a], which creates compact adapter structures through a low-rank synthesis approach. It achieves a reduction in parameters by parameterizing both the down-projection layer W_down and the up-projection layer W_up through the multiplication of three low-rank matrices. Another distinct approach is presented in SCT [Zhao and et al., 2023], which opts for a selective channel tuning strategy, focusing on specific task-relevant channels to lower parameter costs. Furthermore, Polyhistor [Liu and et al., 2022a] decomposes a hyper-network into two separate hyper-networks and factorizes an adapter's weight matrix into two kernels. This technique is particularly beneficial in multi-task architectures, contributing to a reduction in the number of parameters. Expanding on the ideas of Polyhistor, VMT-Adapter [Xin and et al., 2024] integrates knowledge extraction modules to adapt to multiple vision tasks efficiently, demonstrating both parameter and training efficiency.

Figure 3: The detailed architecture of various PEFT methods: (a) a unified view of PEFT methods, (b) adapter tuning, (c) prompt tuning, (d) side tuning, (e) prefix tuning, and (f) reparameter tuning.

Prompt Tuning. Visual prompt tuning methods provide an alternative to injecting learnable modules into the Transformer model. In such methods, the original input, whether an image embedding or the actual image, is wrapped with visual prompts. These prompts consist of additional trainable parameters or perturbations that can be optimized according to the specific task and the training data. The primary goal is to align the input distribution with the original pre-training data through task-specific prompts. Research in visual prompt tuning typically falls into two main categories: 1) injecting a set of learnable parameters into the image embedding space, and 2) injecting learnable perturbations around the border of the original input image.

In the first category, VPT [Jia and et al., 2022] is a pioneering work. It presents two variants: VPT-Shallow (see Fig. 3(c)) and VPT-Deep. VPT-Shallow integrates l additional learnable prompts, denoted as P = [P_1], [P_2], ..., [P_l] ∈ R^{l×d}, into the input patch embeddings x_0 ∈ R^{N×d}. These prompts are then concatenated with the patch embeddings to form the final input. This process can be expressed as follows:

x_0 = concat(P, x_0) = [P, x_0] ∈ R^{(l+N)×d},   (6)

where [·, ·] is the concatenation along the token dimension. VPT-Deep advances VPT-Shallow by adding prompts to every Transformer layer's input space, updating only these prompts during fine-tuning while keeping pre-trained parameters frozen. The cost of VPT-Deep depends on the prompt length and the token embedding dimension, and the experiments show that longer prompts yield better performance. Similarly, DePT [Gao and et al., 2022] introduces learnable visual prompts into the vision Transformer, specifically for data-efficient test-time domain adaptation, and CVP [Tsai and et al., 2023] proposes a self-supervised convolutional prompt for robust visual perception. Additionally, LPT [Dong and et al., 2023] optimizes shared prompts to extract general features for long-tailed datasets. IDPT [Zha and et al., 2023] ventures into applying visual prompt tuning on pre-trained point cloud models. Moreover, some works focus on designing sub-networks to produce visual prompts. Pro-Tuning [Nie and et al., 2023] designs lightweight prompt blocks, comprising three lightweight convolutional layers, to generate task-specific discriminative prompts for each downstream input image. LION [Wang and et al., 2024] adds two implicit layers, positioned at the beginning and end of the PVMs, serving as visual prompts to enrich the visual input and representation. Lastly, ViPT [Zhu and et al., 2023] takes both RGB and auxiliary modal inputs, which are initially processed by the patch embedding to generate corresponding RGB and prompt tokens.
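To make the embedding-level idea concrete, the sketch below prepends l learnable prompt tokens to the patch embeddings as in Eq. (6); it is a generic VPT-Shallow-style illustration, with arbitrary initialization and prompt length.

```python
import torch
import torch.nn as nn

class ShallowPromptWrapper(nn.Module):
    """Prepend l learnable prompt tokens to the patch embeddings (Eq. (6))."""
    def __init__(self, num_prompts: int, d: int):
        super().__init__()
        # P = [P_1, ..., P_l] in R^{l x d}; only these tokens are trained.
        self.prompts = nn.Parameter(torch.randn(num_prompts, d) * 0.02)

    def forward(self, x0):                    # x0: [B, N, d] patch embeddings
        batch = x0.shape[0]
        p = self.prompts.unsqueeze(0).expand(batch, -1, -1)   # [B, l, d]
        return torch.cat([p, x0], dim=1)      # [B, l + N, d] final input tokens

# VPT-Deep repeats this at the input of every Transformer layer, with a separate
# prompt matrix per layer, while the PVM weights stay frozen.
wrapper = ShallowPromptWrapper(num_prompts=10, d=768)
tokens = wrapper(torch.randn(2, 196, 768))
```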
In the second category, research focuses on optimizing task-specific prompts at the pixel level, integrating these prompts directly with input images. A key method in this category is VP [Bahng and et al., 2022], which learns perturbations around image borders and does not require access to the PVM at test time. Building on VP, EVP [Wu and et al., 2022] employs a strategy where images are shrunk and subjected to data augmentations, followed by padding the area around the image with the prompt. DAM-VP [Huang and et al., 2023] adopts a divide-and-conquer strategy: it segments high-diversity datasets into subsets and learns separate prompts for each subset, addressing issues related to large data diversity. Furthermore, this category of methods is particularly effective for pixel-level tasks, such as image segmentation and point cloud analysis. For instance, EVP-L [Liu and et al., 2023] employs the high-frequency components of the input as prompts for low-level structure segmentation tasks. ProSFDA [Hu and et al., 2022] adds a zero-initialized learnable prompt to target images in medical image segmentation. P2P [Wang and et al., 2022] converts point cloud data into colorful images, which are then used as vision prompts to adapt PVMs for various point cloud analysis tasks. Additionally, to further understand and improve visual prompt effectiveness, ILM-VP [Chen and et al., 2023a] automatically remaps source labels to target labels, enhancing the target task accuracy of visual prompting.

Prefix Tuning. Inspired by the success of prompt tuning, Prefix-tuning [Li and et al., 2021] introduces learnable prefix matrices into the MHA module of the PVMs. It prepends two randomly initialized prefix matrices P_k, P_v ∈ R^{l×d} to the keys and values in the MHA, changing the attention calculation in Eq. 3 to:

Attention(Q, K, V) = softmax(Q[P_k, K]ᵀ/√d) [P_v, V].   (7)

However, random initialization may introduce noise that impacts convergence on the downstream fine-tuning tasks. To address this, PATT [Yu and et al., 2022] proposes a parallel attention mechanism to the original attention module without random initialization and uses two linear layers (with parameters W_down ∈ R^{d×k} and W_up ∈ R^{k×l}) and a Tanh layer to transform the prefix matrices (see Fig. 3(e)). Specifically, for the ℓ-th Transformer layer, given the previous layer's output x_{ℓ−1}, a pair of prefix matrices is obtained via:

P_k, P_v = Tanh(x_{ℓ−1} W_down) W_up.   (8)
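The sketch below illustrates prefix tuning at the attention level: randomly initialized P_k and P_v are prepended to the keys and values as in Eq. (7). It is a single-head schematic based on the description above, not the authors' code; PATT would generate P_k and P_v from x_{ℓ−1} via Eq. (8) instead of learning them directly, and in practice the Q/K/V projections would come from the frozen PVM.

```python
import math
import torch
import torch.nn as nn

class PrefixAttention(nn.Module):
    """Single-head attention with learnable prefixes prepended to K and V (Eq. (7))."""
    def __init__(self, d: int, prefix_len: int):
        super().__init__()
        # In a real PVM these projections would be the frozen pre-trained ones.
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.p_k = nn.Parameter(torch.randn(prefix_len, d) * 0.02)   # P_k in R^{l x d}
        self.p_v = nn.Parameter(torch.randn(prefix_len, d) * 0.02)   # P_v in R^{l x d}
        self.d = d

    def forward(self, x):                           # x: [B, N, d]
        b = x.shape[0]
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        k = torch.cat([self.p_k.expand(b, -1, -1), k], dim=1)   # [P_k, K]
        v = torch.cat([self.p_v.expand(b, -1, -1), v], dim=1)   # [P_v, V]
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1)
        return attn @ v
```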
Following PATT, eTT [Xu and et al., 2023a] uses the latest innovations in attentive prefix tuning (i.e., generating new key-value pairs) for few-shot learning, and LAE [Gao and et al., 2023] includes prefix tuning as part of its framework for continual learning. In contrast to the original Prefix-tuning, VQT [Tu and et al., 2023] only appends additional prefix vectors to the query Q, not to the value V and key K.

Side Tuning. Different from previous PEFT methods that typically insert additional modules or parameters inside the PVMs, side tuning employs a side network, a smaller and separate network that operates in parallel with the PVMs, as shown in Fig. 3(d).

Earlier side tuning methods concentrated on parameter efficiency, with a key focus on how to design the side network. Side-Tuning [Zhang and et al., 2020] utilizes a four-layer convolutional network as the additive side network. The outputs of the network are summed with the representations from the PVMs in the final layer, facilitating the resolution of various tasks. Furthermore, SAN [Xu and et al., 2023b] proposes a two-branch side adapter network. One branch is dedicated to predicting mask proposals, while the other focuses on predicting attention biases that are applied to the self-attention blocks for mask class recognition. More recently, ViT-Adapter [Chen and et al., 2023b] designs a spatial prior module along with two feature interaction operations, which enables integrating image priors into the architecture of ViT without necessitating a redesign. Such an arrangement is particularly beneficial for dense prediction tasks, as it supplements missing local information and reorganizes fine-grained, multi-scale features.
Besides prioritizing parameter efficiency, later works discovered that side tuning can also lead to GPU memory efficiency through innovative designs. LST [Sung and et al., 2022] proposes to separate the trainable parameters from the backbone model to create a small Transformer network. This separation completely obviates the need for costly backpropagation through the large backbone network, resulting in significant GPU memory savings. Building upon the ideas in LST, SAM-LST [Chai and et al., 2023] incorporates an additional convolutional neural network as a complementary encoder within SAM. This integration leads to faster training and reduced resource demands. However, as LST is not directly applicable to some PVMs like the Swin Transformer, E3VA [Yin and et al., 2023b] provides a gradient backpropagation highway for low-rank adapters. This method is compatible with all PVMs and further enhances efficiency. More recently, DTL [Fu and et al., 2024] designs a compact side network specifically for ViT to achieve both parameter and GPU memory efficiency.
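The memory saving comes from the fact that gradients never need to flow through the large backbone. The following LST-style sketch is our own simplified illustration (layer sizes, the number of tapped layers, and the fusion rule are arbitrary); the key detail is the detach() call, which stops backpropagation at the boundary between the frozen PVM and the trainable side network.

```python
import torch
import torch.nn as nn

class TinySideNetwork(nn.Module):
    """Minimal ladder-side-style network: a small trainable branch that consumes
    intermediate backbone features, while the backbone itself stays frozen and
    never receives gradients (hence the GPU memory saving)."""
    def __init__(self, d: int, r: int, num_taps: int, num_classes: int):
        super().__init__()
        self.reducers = nn.ModuleList(nn.Linear(d, r) for _ in range(num_taps))
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(r), nn.Linear(r, r), nn.GELU())
            for _ in range(num_taps))
        self.head = nn.Linear(r, num_classes)

    def forward(self, backbone_feats):          # list of [B, N, d] tensors, one per tap
        h = 0.0
        for feat, reduce, block in zip(backbone_feats, self.reducers, self.blocks):
            # detach(): gradients stop here, so backprop never enters the PVM
            h = block(reduce(feat.detach()) + h)
        return self.head(h.mean(dim=1))         # pooled prediction
```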
3.2 Partial-based Methods

Partial-based methods concentrate on updating only a small subset of inherent parameters while keeping the majority of the model's parameters unchanged during the adaptation process. These methods do not seek to change the internal structure of the model. This section covers two strategies: specification tuning and reparameter tuning.

Specification Tuning. Specification tuning is an efficient approach that directly modifies a specific subset of parameters in PVMs, such as the bias and LayerNorm terms, which are crucial for downstream tasks. This family concentrates on important parameters while discarding those deemed less relevant. The concept, while straightforward, has proven to be surprisingly effective. One of the earliest examples is Linear Probe [Kornblith and et al., 2019], which introduces a linear layer as the classifier on top of the PVMs. In this method, all parameters of the PVMs are frozen, allowing for an exploration of the pre-training capabilities of the PVMs. This technique has become a standard baseline in various PEFT methods. Moreover, BitFit [Zaken and et al., 2022] empirically demonstrates that optimizing only the bias terms within a model can be effective, which can be represented as follows:

x_ℓ = x_{ℓ−1} W_ℓ + b_ℓ,   (9)

where the weight parameters W_ℓ are kept frozen and only the bias b_ℓ is optimized during the tuning process. Remarkably, this approach enables the model to retain over 95% of its performance across several benchmarks. Building on the principles of BitFit, DP-BiTFiT [Bu and et al., 2022] combines the efficiency of the standard BitFit approach with differential privacy to address downstream tasks involving sensitive data and achieves state-of-the-art accuracy for differentially private algorithms. Similarly, DiffFit [Xie and et al., 2023] only fine-tunes the bias terms and newly added scaling factors in specific layers of diffusion models, a strategy that results in training speed-ups and reduced model storage costs. Meanwhile, AdapterBias [Fu and et al., 2022] presents a unique approach that avoids altering the bias of the PVMs. Instead, it targets the bias term at the MLP layer by using a linear layer with a weight α and a tunable vector v, which is expressed as follows:

x_ℓ = x_{ℓ−1} W_ℓ + b_ℓ + α ⊗ v.   (10)

Instead of tuning bias terms, LN-Tune [Basu and et al., 2024] introduces a strong PEFT baseline that fine-tunes only the LayerNorm parameters of the PVMs.
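In code, specification tuning largely amounts to toggling requires_grad on a chosen subset of parameters. The snippet below is a generic BitFit-style sketch; the name-matching heuristic and the torchvision ViT-B/16 backbone with a 100-class head are assumptions that would need adjusting for other implementations.

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads = nn.Linear(768, 100)            # new task head (assumed 100 classes)

for name, param in model.named_parameters():
    # BitFit: train only bias terms (plus the new head); LN-Tune would instead
    # match the LayerNorm parameters, and a Linear Probe would train only the head.
    param.requires_grad = name.endswith("bias") or name.startswith("heads")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```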
Reparameter Tuning. Reparameter tuning methods also introduce new learnable parameters during the training stage, but these parameters can be integrated into the original PVMs through reparameterization during the inference phase. LoRA [Hu and et al., 2021] is a prominent example, where trainable low-rank matrices are injected into Transformer layers to approximate updates to the weights. For a pre-trained weight matrix W_ℓ, LoRA represents its update with a low-rank decomposition:

W′_ℓ = W_ℓ + ∆W = W_ℓ + BA,   (11)

where B and A are trainable parameters. Generally, LoRA updates the query and value projection matrices in multi-head attention. Since then, there has been plenty of follow-up research in this area. KronA [Edalati and et al., 2022] shares structural similarities with LoRA but differs in replacing LoRA's low-rank decomposition with a Kronecker product decomposition, expressed as ∆W = B ⊗ A. This modification enhances computational efficiency and reduces the number of required floating-point operations (FLOPs). Building on these concepts, KAdaptation [He and et al., 2023] decomposes the update weights into a sum of n Kronecker products between shared slow weights A_i and independent fast weights B_i, and it further decomposes B_i into the product of two low-rank matrices u_i and v_i:

W + ∆W = W + Σ_{i=1}^{n} A_i ⊗ B_i = W + Σ_{i=1}^{n} A_i ⊗ (u_i v_iᵀ).   (12)

Thus, the trainable parameters are substantially reduced. Delving deeper, FacT [Jie and Deng, 2023] proposes a tensorization-decomposition framework, which involves tensorizing the weights of PVMs into a single 3D tensor and then decomposing their increments into lightweight factors. This approach efficiently stores weight increments, offering a novel way to handle the parameters of PVMs. Following FacT, EFFT [Chen, 2023] aims to minimize redundancies both within and across layers, without increasing computational latency, and exemplifies how tensor decomposition can be leveraged for more efficient model tuning. Beyond pre-trained weight matrices, other works have explored different parameters of PVMs. SSF [Lian and et al., 2022] integrates learnable scale and shift parameters to adjust features and then reparameterizes these into the MLP layer. RepAdapter [Luo and et al., 2023] demonstrates that adapter modules can be seamlessly integrated into PVMs via structural reparameterization, thereby achieving zero extra cost during inference.
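A minimal LoRA layer corresponding to Eq. (11) is sketched below; the rank, scaling factor, and initialization follow common practice but are not tied to any specific released implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer W_l plus a trainable low-rank update BA (Eq. (11))."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # W_l stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # BA = 0 at start
        self.scale = alpha / rank

    def forward(self, x):
        # Training: W_l x + scale * B A x.  At inference the update can be merged,
        # W_l' = W_l + scale * B @ A, so the extra branch adds no latency.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Typically applied to the query and value projections of each attention block.
lora_q = LoRALinear(nn.Linear(768, 768), rank=8)
```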
3.3 Unified-based Tuning

Unified-based tuning approaches offer a unified framework that integrates various fine-tuning methods into a single, harmonized architecture, streamlining the process and enhancing the overall efficiency and effectiveness of fine-tuning. For instance, NOAH [Liu and et al., 2022b] incorporates Adapter, LoRA, and VPT into each Transformer block and employs Neural Architecture Search (NAS) to determine the best design for specific downstream tasks, representing a comprehensive approach that optimizes fine-tuning by combining multiple techniques. LAE [Gao and et al., 2023] proposes a unified framework for continual learning that is designed to be adaptable, allowing any PEFT method to be reconfigured into a competitive approach for continual learning. Additionally, V-PETL [Yu and et al., 2022] provides a unified analysis of PEFT techniques for video tasks, investigating the critical aspect of fine-tuning positions and offering a cohesive view of these techniques. Similarly, U-Tuning [Jiang and et al., 2023] rethinks PEFT from an integrated perspective, re-evaluating existing tuning paradigms. It identifies a parallel form for mainstream tuning methods, including adapter, prefix, and prompt tuning, which effectively reduces the coupling in tuning structures.

3.4 Discussion

Characteristic Analysis. We summarize the characteristics of all PEFT methods in Tab. 1. The methods are compared in four aspects. 1) No Additional Modules (NAM): specification tuning is the only approach that does not introduce new modules, while the others introduce additional modules or parameters to varying degrees. 2) Structure Preserving (SP): adapter tuning changes the structure of the PVMs. By contrast, prompt tuning, prefix tuning, side tuning, and reparameter tuning maintain the structure of the original PVM while introducing new modules, and specification tuning directly optimizes a subset of parameters of the PVMs, so it does not change the model structure. 3) Inference Efficient (IE): additional modules typically increase inference latency, with reparameter tuning being an exception thanks to its reparameterization technique. 4) Memory Efficient (ME): side tuning uniquely achieves memory efficiency because its gradient backpropagation does not involve the PVMs. Overall, each PEFT method presents unique advantages and limitations, and there is no completely perfect PEFT method.

Category        NAM  SP  IE  ME   Representative Method   #Trainable Params
Adapter         ✗    ✗   ✗   ✗    AdaptFormer             L × (2dk)
Prompt          ✗    ✓   ✗   ✗    VPT-Deep                L × (ld)
Prefix          ✗    ✓   ✗   ✗    PATT                    L × 2 × (dk + kl)
Side            ✗    ✓   ✗   ✓    LST                     #Params of subnetwork
Specification   ✓    ✓   ✓   ✗    BitFit                  L × (7 × d)
Reparameter     ✗    ✓   ✓   ✗    LoRA                    L × 2 × (2dk)

Table 1: Comparison between different tuning methods (NAM = No Additional Modules, SP = Structure Preserving, IE = Inference Efficient, ME = Memory Efficient; ✓ = yes, ✗ = no; L is the number of Transformer layers, d the embedding dimension, k the bottleneck/rank dimension, and l the prompt/prefix length).

Parameter Analysis. To accurately calculate the number of trainable parameters, we select a specific, representative work for each category, as shown in Tab. 1. It is observed that BitFit has the smallest number of trainable parameters since it only updates the bias terms in the PVMs. In contrast, LST has the largest number of trainable parameters due to its parallel subnetwork, but it can achieve memory efficiency; optimization of the subnetwork structure may therefore be crucial in the future. Additionally, AdaptFormer, PATT, and LoRA share similar parameter magnitudes, as they all inject comparable structures into each Transformer layer, and VPT-Deep has a slightly higher parameter count than BitFit. In practical applications, compared to full fine-tuning, these methods possess only 0.05% to 10% of the trainable parameters, yet they achieve comparable or even better performance on downstream tasks.
4 Datasets and Applications

In this section, we briefly discuss the popular datasets and applications of visual PEFT, as shown in Tab. 2. Image recognition is the primary benchmark and application for PEFT, exemplified by datasets such as FGVC [Jia and et al., 2022] (5 downstream tasks) and VTAB-1k [Zhai and et al., 2019] (19 downstream tasks). PEFT is also influential in other domains. Beyond image classification, video action recognition is another key application area, involving datasets like Kinetics-400 [Kay and et al., 2017], SSv2 [Goyal and et al., 2017], HMDB51 [Kuehne and et al., 2011], and Diving-48 [Li and et al., 2018]. Additionally, PEFT has been utilized for dense prediction tasks, using datasets like COCO [Lin and et al., 2014], ADE20K [Zhou and et al., 2019], and PASCAL VOC [Everingham and et al., 2015]. Furthermore, the use of PEFT is expanding into new fields, including point cloud analysis and robotic manipulation. It is evident that PEFT is increasingly applied across various domains and prevails in diverse downstream tasks.

Image Recognition — Fine-Grained Visual Classification (FGVC) [Jia and et al., 2022]
Dataset              Description                               #Classes  Train size  Val size  Test size
CUB-200-2011         Fine-grained bird species recognition     200       5,394       600       5,794
NABirds              Fine-grained bird species recognition     555       21,536      2,393     24,633
Oxford Flowers       Fine-grained flower species recognition   102       1,020       1,020     6,149
Stanford Dogs        Fine-grained dog species recognition      120       10,800      1,200     8,580
Stanford Cars        Fine-grained car classification           196       7,329       815       8,041

Image Recognition — Visual Task Adaptation Benchmark (VTAB-1k) [Zhai and et al., 2019]; every task uses 800 training and 200 validation images
Dataset              Group (description)                                                          #Classes  Test size
CIFAR-100            Natural: tasks with natural images captured using standard cameras           100       10,000
Caltech101           Natural                                                                      102       6,084
DTD                  Natural                                                                      47        1,880
Flowers102           Natural                                                                      102       6,149
Pets                 Natural                                                                      37        3,669
SVHN                 Natural                                                                      10        26,032
Sun397               Natural                                                                      397       21,750
Patch Camelyon       Specialized: images captured via specialized equipment (medical, satellite)  2         32,768
EuroSAT              Specialized                                                                  10        5,400
Resisc45             Specialized                                                                  45        6,300
Retinopathy          Specialized                                                                  5         42,670
Clevr/count          Structured: tasks requiring geometric comprehension, e.g., object counting   8         15,000
Clevr/distance       Structured                                                                   6         15,000
DMLab                Structured                                                                   6         22,735
KITTI/distance       Structured                                                                   4         711
dSprites/location    Structured                                                                   16        73,728
dSprites/orientation Structured                                                                   16        73,728
SmallNORB/azimuth    Structured                                                                   18        12,150
SmallNORB/elevation  Structured                                                                   9         12,150

Video Recognition (video action recognition)
Dataset                                   #Classes  Train size  Val size  Test size
Kinetics-400 [Kay and et al., 2017]       400       240,436     N/A       19,787
SSv2 [Goyal and et al., 2017]             174       168,913     24,777    27,157
HMDB51 [Kuehne and et al., 2011]          51        3,500       1,500     1,849
Diving-48 [Li and et al., 2018]           48        15,900      N/A       2,000
UCF-101 [Soomro and et al., 2012]         101       9,537       N/A       3,783

Dense Prediction
Dataset                                    Description             #Classes  Train size  Val size  Test size
MS COCO [Lin and et al., 2014]             Instance segmentation   80        118,000     N/A       5,000
ADE20K [Zhou and et al., 2019]             Semantic segmentation   150       20,210      N/A       2,000
PASCAL VOC [Everingham and et al., 2015]   Semantic segmentation   21        1,464       N/A       1,449

Table 2: Several popular datasets and applications of visual PEFT.


5 Future Research Challenges

Explainability of Visual PEFT Methods. Despite significant advancements, the underlying reasons for the effectiveness of visual PEFT methods remain unclear, especially in terms of the interpretability of visual prompts. In the NLP domain, a prompt can be read as a textual description of the task, which is intuitive to interpret. In the CV domain, however, the main challenge is that visual prompts are learned as unordered token-based prompts, which are difficult to translate into an understandable format. Other tuning techniques, such as adapter and prefix tuning, also confront challenges in interpretability, as they strive to reduce the number of parameters required for adapting large models to specific tasks. Therefore, improving the interpretability of PEFT is a crucial area for future research.

PEFT for Generative and Multimodal Models. On one side, within the CV domain, most PEFT methods are tailored for discriminative tasks, such as image classification and video action recognition, yet exploring their application in generative tasks is highly promising. With the help of adapters and prompts, researchers have developed several PEFT methods for pre-trained generative models [Xie and et al., 2023], particularly stable diffusion models; nonetheless, these models still leave much room for deeper exploration. On the other side, large multimodal models typically require more computational and memory resources than single-modal models, so investigating PEFT methods in the multimodal domain is also desirable. Moreover, PEFT methods could facilitate cross-modality alignment, leading to significant improvements in downstream multimodal tasks. Consequently, further exploration in both of these domains represents a promising direction for future research.

Building a Visual PEFT Library. While numerous PEFT methods for the vision domain have been proposed, their direct employment or comparison is not yet convenient. In contrast, the NLP domain has developed comprehensive libraries such as the PEFT library¹, which integrate various PEFT methods and large language models (LLMs) to facilitate their application in downstream tasks. Thus, it is desirable to develop such a library for the vision domain, and even to integrate the multimodal domain, which could boost the development of PEFT.

6 Conclusion

In this paper, we conduct a comprehensive review of the visual parameter-efficient fine-tuning domain by offering an in-depth analysis of existing methods, datasets, and applications. We conclude with a detailed comparison of these methods and identify several potential research challenges in the field. Our goal is for this survey to serve as a valuable resource for researchers interested in parameter-efficient fine-tuning, providing insights that could inspire further advancements.

¹ [Link]
References

[Bahng and et al., 2022] Hyojin Bahng and et al. Exploring visual prompts for adapting large-scale models. arXiv, 2022.
[Basu and et al., 2024] Samyadeep Basu and et al. Strong baselines for parameter efficient few-shot fine-tuning. AAAI, 2024.
[Bu and et al., 2022] Zhiqi Bu and et al. Differentially private bias-term only fine-tuning of foundation models. NeurIPS, 2022.
[Caron and et al., 2021] Mathilde Caron and et al. Emerging properties in self-supervised vision transformers. ICCV, 2021.
[Chai and et al., 2023] Shurong Chai and et al. Ladder fine-tuning approach for SAM integrating complementary network. arXiv, 2023.
[Chen and et al., 2020] Ting Chen and et al. A simple framework for contrastive learning of visual representations. ICML, 2020.
[Chen and et al., 2022] Shoufa Chen and et al. AdaptFormer: Adapting vision transformers for scalable visual recognition. NeurIPS, 2022.
[Chen and et al., 2023a] Aochuan Chen and et al. Understanding and improving visual prompting: A label-mapping perspective. CVPR, 2023.
[Chen and et al., 2023b] Zhe Chen and et al. Vision transformer adapter for dense predictions. ICLR, 2023.
[Chen, 2023] Dongping Chen. Aggregate, decompose, and fine-tune: A simple yet effective factor-tuning method for vision transformer. arXiv, 2023.
[Deng and et al., 2009] Jia Deng and et al. ImageNet: A large-scale hierarchical image database. CVPR, 2009.
[Dong and et al., 2023] Bowen Dong and et al. LPT: Long-tailed prompt tuning for image classification. ICLR, 2023.
[Dosovitskiy and et al., 2021] Alexey Dosovitskiy and et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[Edalati and et al., 2022] Ali Edalati and et al. KronA: Parameter efficient tuning with Kronecker adapter. arXiv, 2022.
[Everingham and et al., 2015] Mark Everingham and et al. The PASCAL visual object classes challenge: A retrospective. IJCV, 2015.
[Fang and et al., 2022] Yuxin Fang and et al. EVA: Exploring the limits of masked visual representation learning at scale. CVPR, 2022.
[Fu and et al., 2022] Chin-Lun Fu and et al. AdapterBias: Parameter-efficient token-dependent representation shift for adapters in NLP tasks. NAACL, 2022.
[Fu and et al., 2024] Minghao Fu and et al. DTL: Disentangled transfer learning for visual recognition. AAAI, 2024.
[Gao and et al., 2022] Yunhe Gao and et al. Visual prompt tuning for test-time domain adaptation. arXiv, 2022.
[Gao and et al., 2023] Qiankun Gao and et al. A unified continual learning framework with general parameter-efficient tuning. ICCV, 2023.
[Goyal and et al., 2017] Raghav Goyal and et al. The "something something" video database for learning and evaluating visual common sense. ICCV, 2017.
[Grassucci and et al., 2022] Eleonora Grassucci and et al. PHNNs: Lightweight neural networks via parameterized hypercomplex convolutions. TNNLS, 2022.
[He and et al., 2020] Kaiming He and et al. Momentum contrast for unsupervised visual representation learning. CVPR, 2020.
[He and et al., 2022] Kaiming He and et al. Masked autoencoders are scalable vision learners. CVPR, 2022.
[He and et al., 2023] Xuehai He and et al. Parameter-efficient model adaptation for vision transformers. AAAI, 2023.
[Houlsby and et al., 2019] Neil Houlsby and et al. Parameter-efficient transfer learning for NLP. ICML, 2019.
[Hu and et al., 2021] Edward J Hu and et al. LoRA: Low-rank adaptation of large language models. ICLR, 2021.
[Hu and et al., 2022] Shishuai Hu and et al. ProSFDA: Prompt learning based source-free domain adaptation for medical image segmentation. arXiv, 2022.
[Huang and et al., 2023] Qidong Huang and et al. Diversity-aware meta visual prompting. CVPR, 2023.
[Jia and et al., 2021] Chao Jia and et al. Scaling up visual and vision-language representation learning with noisy text supervision. ICML, 2021.
[Jia and et al., 2022] Menglin Jia and et al. Visual prompt tuning. ECCV, 2022.
[Jiang and et al., 2022] Ziyu Jiang and et al. DnA: Improving few-shot transfer learning with low-rank decomposition and alignment. ECCV, 2022.
[Jiang and et al., 2023] Zeyinzi Jiang and et al. Rethinking efficient tuning methods from a unified perspective. arXiv, 2023.
[Jie and Deng, 2023] Shibo Jie and Zhi-Hong Deng. FacT: Factor-tuning for lightweight adaptation on vision transformer. AAAI, 2023.
[Jie and et al., 2022] Shibo Jie and et al. Convolutional bypasses are better vision transformer adapters. arXiv, 2022.
[Kay and et al., 2017] Will Kay and et al. The Kinetics human action video dataset. arXiv, 2017.
[Khan and et al., 2022] Salman Khan and et al. Transformers in vision: A survey. CSUR, 2022.
[Kirillov and et al., 2023] Alexander Kirillov and et al. Segment anything. arXiv, 2023.
[Kornblith and et al., 2019] Simon Kornblith and et al. Do better ImageNet models transfer better? CVPR, 2019.
[Kuehne and et al., 2011] Hildegard Kuehne and et al. HMDB: A large video database for human motion recognition. ICCV, 2011.
[Li and et al., 2018] Yingwei Li and et al. RESOUND: Towards action recognition without representation bias. ECCV, 2018.
[Li and et al., 2021] Xiang Lisa Li and et al. Prefix-tuning: Optimizing continuous prompts for generation. ACL, 2021.
[Lian and et al., 2022] Dongze Lian and et al. Scaling & shifting your features: A new baseline for efficient model tuning. NeurIPS, 2022.
[Lin and et al., 2014] Tsung-Yi Lin and et al. Microsoft COCO: Common objects in context. ECCV, 2014.
[Liu and et al., 2021] Ze Liu and et al. Swin transformer: Hierarchical vision transformer using shifted windows. ICCV, 2021.
[Liu and et al., 2022a] Yen-Cheng Liu and et al. Polyhistor: Parameter-efficient multi-task adaptation for dense vision tasks. NeurIPS, 2022.
[Liu and et al., 2022b] Yuanhan Liu and et al. Neural prompt search. arXiv, 2022.
[Liu and et al., 2023] Weihuang Liu and et al. Explicit visual prompting for low-level structure segmentations. CVPR, 2023.
[Luo and et al., 2023] Gen Luo and et al. Towards efficient visual adaption via structural re-parameterization. arXiv, 2023.
[Nie and et al., 2023] Xing Nie and et al. Pro-tuning: Unified prompt tuning for vision tasks. TCSVT, 2023.
[Pan and et al., 2022] Junting Pan and et al. ST-Adapter: Parameter-efficient image-to-video transfer learning. NeurIPS, 2022.
[Radford and et al., 2021] Alec Radford and et al. Learning transferable visual models from natural language supervision. ICML, 2021.
[Sharma and et al., 2023] Mohit Sharma and et al. Lossless adaptation of pretrained vision models for robotic manipulation. ICLR, 2023.
[Soomro and et al., 2012] Khurram Soomro and et al. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv, 2012.
[Sung and et al., 2022] Yi-Lin Sung and et al. LST: Ladder side-tuning for parameter and memory efficient transfer learning. NeurIPS, 2022.
[Touvron and et al., 2021] Hugo Touvron and et al. Training data-efficient image transformers & distillation through attention. ICML, 2021.
[Tsai and et al., 2023] Yun-Yun Tsai and et al. Convolutional visual prompt for robust visual perception. NeurIPS, 2023.
[Tu and et al., 2023] Cheng-Hao Tu and et al. Visual query tuning: Towards effective usage of intermediate representations for parameter and memory efficient transfer learning. CVPR, 2023.
[Wang and et al., 2021] Wenhai Wang and et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. ICCV, 2021.
[Wang and et al., 2022] Ziyi Wang and et al. P2P: Tuning pre-trained image models for point cloud analysis with point-to-pixel prompting. NeurIPS, 2022.
[Wang and et al., 2024] Haixin Wang and et al. LION: Implicit vision prompt tuning. AAAI, 2024.
[Wu and et al., 2022] Junyang Wu and et al. Unleashing the power of visual prompting at the pixel level. arXiv, 2022.
[Xie and et al., 2021] Zhenda Xie and et al. SimMIM: A simple framework for masked image modeling. CVPR, 2021.
[Xie and et al., 2023] Enze Xie and et al. DiffFit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning. arXiv, 2023.
[Xin and et al., 2024] Yi Xin and et al. VMT-Adapter: Parameter-efficient transfer learning for multi-task dense. AAAI, 2024.
[Xu and et al., 2023a] Chengming Xu and et al. Exploring efficient few-shot adaptation for vision transformers. TMLR, 2023.
[Xu and et al., 2023b] Mengde Xu and et al. Side adapter network for open-vocabulary semantic segmentation. CVPR, 2023.
[Yang and et al., 2023] Taojiannan Yang and et al. AIM: Adapting image models for efficient video action recognition. ICLR, 2023.
[Yin and et al., 2023a] Dongshuo Yin and et al. 1% vs 100%: Parameter-efficient low rank adapter for dense predictions. CVPR, 2023.
[Yin and et al., 2023b] Dongshuo Yin and et al. Parameter-efficient is not sufficient: Exploring parameter, memory, and time efficient adapter tuning for dense predictions. arXiv, 2023.
[Yu and et al., 2022] Bruce XB Yu and et al. Towards a unified view on visual parameter-efficient transfer learning. arXiv, 2022.
[Yuan and et al., 2021] Li Yuan and et al. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. ICCV, 2021.
[Zaken and et al., 2022] Elad Ben Zaken and et al. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. ACL, 2022.
[Zha and et al., 2023] Yaohua Zha and et al. Instance-aware dynamic prompt tuning for pre-trained point cloud models. ICCV, 2023.
[Zhai and et al., 2019] Xiaohua Zhai and et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv, 2019.
[Zhang and et al., 2020] Jeffrey Zhang and et al. Side-tuning: A baseline for network adaptation via additive side networks. ECCV, 2020.
[Zhao and et al., 2023] Henry Hengyuan Zhao and et al. SCT: A simple baseline for parameter-efficient fine-tuning via salient channels. IJCV, 2023.
[Zhou and et al., 2019] Bolei Zhou and et al. Semantic understanding of scenes through the ADE20K dataset. IJCV, 2019.
[Zhu and et al., 2023] Jiawen Zhu and et al. Visual prompt multi-modal tracking. CVPR, 2023.
