Parameter-Efficient Fine-Tuning For Pre-Trained Vision Models: A Survey
Taxonomy of visual PEFT methods:

- Addition-based Tuning (§3.1)
  - Prompt Tuning
    - Embedding Level: VPT [Jia et al., 2022], LPT [Dong et al., 2023], Pro-tuning [Nie et al., 2023], DePT [Gao et al., 2022], IDPT [Zha et al., 2023], ViPT [Zhu et al., 2023], LION [Wang et al., 2024], CVP [Tsai et al., 2023]
    - Pixel Level: ProSFDA [Hu et al., 2022], EVP-L [Liu et al., 2023], P2P [Wang et al., 2022], VP [Bahng et al., 2022], EVP [Wu et al., 2022], IML-VP [Chen et al., 2023a], DAM-VP [Huang et al., 2023]
  - Prefix Tuning: PATT [Yu et al., 2022], eTT [Xu et al., 2023a], LAE [Gao et al., 2023], VQT [Tu et al., 2023], Prefix-tuning [Li et al., 2021] (originally proposed as a PEFT method for PLMs)
  - Side Tuning
    - Param Efficient: Side-Tuning [Zhang et al., 2020], ViT-Adapter [Chen et al., 2023b], SAN [Xu et al., 2023b]
    - Param & Memory Efficient: LST [Sung et al., 2022], DTL [Fu et al., 2024], E3VA [Yin et al., 2023b], SAM-LST [Chai et al., 2023]
- Partial-based Tuning (§3.2)
  - Specification Tuning: Linear Probe [Kornblith et al., 2019], AdapterBias [Fu et al., 2022], DP-BiTFiT [Bu et al., 2022], LN-Tune [Basu et al., 2024], BitFit [Zaken et al., 2022], DiffFit [Xie et al., 2023]
  - Reparameter Tuning: LoRA [Hu et al., 2021], KronA [Edalati et al., 2022], FacT [Jie and Deng, 2023], EFFT [Chen, 2023], Atten-Scale [Basu et al., 2024], KAdaptation [He et al., 2023], PHNNs [Grassucci et al., 2022], SSF [Lian et al., 2022], DnA [Jiang et al., 2022], RepAdapter [Luo et al., 2023]
- Unified-based Tuning (§3.3): V-PETL [Yu et al., 2022], NOAH [Liu et al., 2022b], U-Tuning [Jiang et al., 2023], LAE [Gao et al., 2023]
2.2 Vision Transformer

The standard Vision Transformer (ViT) [Dosovitskiy et al., 2021] consists of a patch embedding layer and L Transformer layers. Given an image x ∈ R^{H×W×C}, the patch embedding layer first splits and flattens the image x into sequential patches x_p ∈ R^{N×(P²C)}, where (H, W) is the height and width of the input image, (P, P) is the resolution of each image patch, C denotes the number of channels, and N = HW/P² is the number of image tokens. Then, x_p is mapped to x_0 ∈ R^{N×d} with a trainable linear projection. The combination of a prepended [cls] token and x_0 forms the input to the Transformer encoder.
Each Transformer layer consists of a multi-head attention (MHA) module and a multilayer perceptron (MLP) module. In MHA, attention scores are computed from query (Q), key (K), and value (V) representations obtained with projection matrices W_q, W_k, W_v ∈ R^{d×d}. Given an input x_{ℓ-1} at the ℓ-th layer, the attention is calculated as follows:

Q = x_{ℓ-1} W_q,  K = x_{ℓ-1} W_k,  V = x_{ℓ-1} W_v,  (2)

x'_ℓ = Attention(Q, K, V) = softmax(QK^⊤ / √d) V.  (3)

The output tokens x'_ℓ are further sent to a LayerNorm (LN) and an MLP block, which is formulated as follows:

x_ℓ = MLP(LN(x'_ℓ)) + x'_ℓ,  (4)

where x_ℓ is the output of the ℓ-th encoder layer.
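To make Eqs. (2)-(4) concrete, the following is a minimal PyTorch sketch of one encoder layer. It is a simplified, single-head illustration under our own naming; real ViT implementations split the projections into multiple heads and apply an additional LayerNorm before MHA.

```python
# Minimal sketch of Eqs. (2)-(4): one simplified Transformer encoder layer.
import math
import torch
import torch.nn as nn


class EncoderLayer(nn.Module):
    def __init__(self, d: int, mlp_ratio: int = 4):
        super().__init__()
        # Projection matrices W_q, W_k, W_v in R^{d x d}
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.norm = nn.LayerNorm(d)
        self.mlp = nn.Sequential(
            nn.Linear(d, mlp_ratio * d), nn.GELU(), nn.Linear(mlp_ratio * d, d)
        )

    def forward(self, x):              # x: (N, d) token embeddings x_{l-1}
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)                 # Eq. (2)
        attn = torch.softmax(
            q @ k.transpose(-2, -1) / math.sqrt(x.size(-1)), dim=-1
        )
        x_attn = attn @ v                                               # Eq. (3)
        return self.mlp(self.norm(x_attn)) + x_attn                     # Eq. (4)
```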
Recent advancements in vision Transformer architectures have significantly enhanced performance on vision tasks. One line of work improves the standard ViT by integrating additional or contextual information, with notable models such as DeiT [Touvron et al., 2021] and Token-to-Token (T2T) ViT [Yuan et al., 2021]. Another line of work focuses on multi-scale ViTs that use hierarchical designs to capture spatial details at varying scales, a capability limited in standard ViTs due to their fixed token numbers and dimensions. Key models in this category include Pyramid ViT (PVT) [Wang et al., 2021] and Swin Transformer [Liu et al., 2021]. A more comprehensive survey can be found in [Khan et al., 2022].
2.3 Model Pre-training

Recently, many pre-training methods with innovative backbones for training PVMs have emerged. These pre-training methods can mostly be categorized into supervised learning and self-supervised learning.

Supervised pre-training. These methods use classification losses for pre-training on large annotated datasets, e.g., ImageNet [Deng et al., 2009]. A renowned pre-trained model that applies supervised pre-training is SAM [Kirillov et al., 2023], which is trained on pixel-level annotated datasets and achieves excellent results in segmentation tasks.

Self-supervised pre-training. Self-supervised learning, now a leading pre-training paradigm, includes 1) contrastive learning methods, which focus on attracting similar (positive) and repelling dissimilar (negative) samples. These include both image-based approaches such as SimCLR [Chen et al., 2020], MoCo [He et al., 2020], and DINO [Caron et al., 2021], and multi-modality-based approaches like CLIP [Radford et al., 2021] and ALIGN [Jia et al., 2021]. Note that we exclusively focus on the image-related modules of multimodal self-supervised models, ignoring other modalities. 2) Masked image modeling methods, including MAE [He et al., 2022], SimMIM [Xie et al., 2021], and EVA [Fang et al., 2022], which involve masking parts of images and reconstructing them.

Figure 2: The representative backbones and pre-training methods (2019-2023). Backbones: ViT, DeiT, Swin, PVT, T2T, CvT, SAM. Self-supervised methods: MoCo, BYOL, DINO, SimCLR, MAE, SimMIM, BEiT, CLIP, ALIGN, EVA, EVA-02, DINO v2.
3 Methodology

3.1 Addition-based Methods

Addition-based methods incorporate additional trainable modules or parameters into the original PVMs to learn task-specific information. This subsection discusses four primary branches of representative addition-based methods: adapter tuning, prompt tuning, prefix tuning, and side tuning.

Adapter Tuning. As a pioneering work, the adapter was initially introduced in the NLP domain [Houlsby et al., 2019] to achieve PEFT. Owing to its remarkable effectiveness, it has been successfully adopted in the CV field as well. This method integrates small neural modules, termed adapters, into the Transformer layers; during adaptation, only these adapters are fine-tuned. The adapter architecture consists of a down-projection layer parameterized by W_down ∈ R^{d×k} and an up-projection layer parameterized by W_up ∈ R^{k×d}. Here, k (with k ≪ d) reduces the dimension of the representation to a lower rank, and a ReLU layer between the two projections enables non-linear projection. For a given input feature map x_ℓ ∈ R^{N×d}, the adapter generates the optimized features as follows:

x̂_ℓ = ReLU(x_ℓ W_down) W_up,  (5)

where W = [W_down; W_up^⊤] ∈ R^{d×2k} denotes all the trainable parameters in the adapter.
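The bottleneck in Eq. (5) is straightforward to express in code. Below is a minimal PyTorch sketch of such an adapter module; the class and argument names are ours, and biases are omitted so that the module matches the equation exactly.

```python
# Minimal sketch of the bottleneck adapter in Eq. (5).
import torch
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self, d: int, k: int):
        super().__init__()
        assert k < d, "k << d gives the low-rank bottleneck"
        self.down = nn.Linear(d, k, bias=False)   # W_down in R^{d x k}
        self.up = nn.Linear(k, d, bias=False)     # W_up   in R^{k x d}

    def forward(self, x):                         # x: (..., d) features x_l
        return self.up(torch.relu(self.down(x)))  # Eq. (5)
```

During adaptation the backbone is kept frozen (e.g., calling p.requires_grad_(False) on its parameters) so that only adapter weights such as these, plus the task head, receive gradients. Whether the module is inserted sequentially after a sub-layer or in parallel with it is a separate design choice, discussed next.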
Adapter tuning methods in the CV domain can be broadly divided into two categories: 1) designing specific adapter architectures for various vision tasks (e.g., image classification, video understanding, etc.), and 2) employing advanced optimization techniques to reduce the number of trainable parameters in the adapter.

In the first category, AdaptFormer [Chen et al., 2022] serves as a typical example. It marks the first instance of adapting vision Transformers to a broad array of downstream visual recognition tasks using adapters. Notably, AdaptFormer does not modify the structure of the adapter but demonstrates that parallel insertion of adapters is more efficacious for vision tasks than the sequential insertion typically employed in NLP tasks. Another key contribution is Convpass [Jie et al., 2022], which highlights that current adapters are hindered by a lack of strong inductive bias, limiting their performance. To overcome this, Convpass incorporates trainable convolutional blocks, thereby enhancing the adapter's capabilities by integrating the strengths of convolutional neural networks. Additionally, AIM [Yang et al., 2023] introduces adapters specialized in spatial, temporal, and joint domains, while ST-Adapter [Pan et al., 2022] offers a spatiotemporal adapter; both methods are tailored to improve a vision model's spatiotemporal reasoning for video understanding tasks. In the field of robotic manipulation, Rob-Adapter [Sharma et al., 2023] applies the classic bottleneck architecture, commonly used in image classification, for lossless adaptation.

In the second category, methods focus on optimizing the adapter's architecture to reduce trainable parameters. One such example is LoRand [Yin et al., 2023a], which creates compact adapter structures through a low-rank synthesis approach: it reduces parameters by parameterizing both the down-projection layer W_down and the up-projection layer W_up as the product of three low-rank matrices (sketched below). Another distinct approach is SCT [Zhao et al., 2023], which opts for a selective channel tuning strategy, focusing on specific task-relevant channels to lower parameter costs. Furthermore, Polyhistor [Liu et al., 2022a] decomposes a hyper-network into two separate hyper-networks and factorizes an adapter's weight matrix into two kernels; this technique is particularly beneficial in multi-task architectures, contributing to a reduction in the number of parameters. Expanding on the ideas of Polyhistor, VMT-Adapter [Xin et al., 2024] integrates knowledge extraction modules to adapt to multiple vision tasks efficiently, demonstrating both parameter and training efficiency.
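The following is a hedged sketch of the low-rank synthesis idea behind LoRand: rather than storing W_down (d×k) and W_up (k×d) directly, each is generated as the product of three much smaller matrices. The factorization shapes, the inner rank r, and all names here are our assumptions for illustration, not the exact LoRand design.

```python
# Hedged sketch of low-rank synthesis for an adapter's projection weights.
import torch
import torch.nn as nn


class LowRankSynthesisAdapter(nn.Module):
    def __init__(self, d: int, k: int, r: int = 4):
        super().__init__()
        init = lambda *shape: nn.Parameter(torch.randn(*shape) * 0.02)
        # W_down ~ A_d @ B_d @ C_d, with shapes (d x r)(r x r)(r x k)
        self.A_d, self.B_d, self.C_d = init(d, r), init(r, r), init(r, k)
        # W_up   ~ A_u @ B_u @ C_u, with shapes (k x r)(r x r)(r x d)
        self.A_u, self.B_u, self.C_u = init(k, r), init(r, r), init(r, d)

    def forward(self, x):                         # x: (..., d)
        w_down = self.A_d @ self.B_d @ self.C_d   # synthesized (d, k) projection
        w_up = self.A_u @ self.B_u @ self.C_u     # synthesized (k, d) projection
        return torch.relu(x @ w_down) @ w_up      # same bottleneck form as Eq. (5)
```

Under the shapes assumed above (d = 768, k = 64, r = 4), the synthesized projections require roughly 7K parameters instead of about 98K for dense W_down and W_up, which is the kind of saving low-rank synthesis targets.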
Figure 3: Illustrations of representative PEFT methods, including (b) Adapter Tuning, (c) Prompt Tuning (VPT), (d) Side Tuning, and LoRA-style reparameterization.

Prompt Tuning. Visual prompt tuning methods provide an alternative to injecting learnable modules into the Transformer model. Here, the original input, whether an image embedding or the actual image, is wrapped with visual prompts: additional trainable parameters or perturbations that can be optimized for the specific task and training data. The primary goal is to align the downstream input distribution with the original pre-training data by means of task-specific prompts. Research in visual prompt tuning typically falls into two main categories: 1) injecting a set of learnable parameters into the image embedding space, and 2) injecting learnable perturbations around the border of the original input image.

In the first category, VPT [Jia et al., 2022] is a pioneering work. It presents two variants: VPT-Shallow (see Fig. 3(c)) and VPT-Deep. VPT-Shallow integrates l additional learnable prompts, denoted as P = [P_1, P_2, ..., P_l] ∈ R^{l×d}, into the input patch embeddings x_0 ∈ R^{N×d}. These prompts are concatenated with the patch embeddings to form the final input:

x_0 = concat(P, x_0) = [P, x_0] ∈ R^{(l+N)×d},  (6)

where [·, ·] is the concatenation along the token dimension. VPT-Deep advances VPT-Shallow by adding prompts to every Transformer layer's input space, updating only these prompts during fine-tuning while keeping the pre-trained parameters frozen. The cost of VPT-Deep depends on the prompt length.
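As a concrete reference for Eq. (6), here is a minimal PyTorch sketch of VPT-Shallow. The wrapper class, its names, and the assumption that the backbone encoder consumes a (B, N, d) token sequence are ours for illustration.

```python
# Minimal sketch of VPT-Shallow (Eq. (6)): prepend trainable prompt tokens
# to the patch embeddings of a frozen backbone.
import torch
import torch.nn as nn


class ShallowPromptedViT(nn.Module):
    def __init__(self, backbone: nn.Module, num_prompts: int, d: int):
        super().__init__()
        self.backbone = backbone                  # frozen PVM encoder
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        # P = [P_1, ..., P_l] in R^{l x d}: the only new trainable parameters
        self.prompts = nn.Parameter(torch.zeros(num_prompts, d))

    def forward(self, x0):                        # x0: (B, N, d) patch embeddings
        prompts = self.prompts.expand(x0.size(0), -1, -1)
        x = torch.cat([prompts, x0], dim=1)       # Eq. (6): (B, l + N, d)
        return self.backbone(x)
```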
3.2 Partial-based Tuning

Reparameter Tuning. FacT [Jie and Deng, 2023] decomposes the weight increments of a PVM into lightweight factors. This approach efficiently stores weight increments, offering a novel way to handle the parameters of PVMs. Following FacT, EFFT [Chen, 2023] aims to minimize redundancies both within and across layers without increasing computational latency, exemplifying how tensor decomposition can be leveraged for more efficient model tuning. Beyond pre-trained weight matrices, other works have explored different parameters of PVMs. SSF [Lian et al., 2022] integrates learnable scale and shift parameters to adjust features and then reparameterizes them into the MLP layer, as sketched below. RepAdapter [Luo et al., 2023] demonstrates that adapter modules can be seamlessly integrated into PVMs via structural reparameterization, thereby achieving zero cost during inference.
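To illustrate the scale-and-shift idea and why such modules can be reparameterized away at inference time, here is a hedged PyTorch sketch. The class name, the placement after a single linear layer, and the folding routine are our simplifications rather than the exact SSF or RepAdapter implementations; it only shows that a purely linear tuning module can be folded into the frozen weights, which is the spirit of both methods.

```python
# Hedged sketch: learnable scale/shift on top of a frozen linear layer,
# plus a routine that folds them back into the layer for zero-cost inference.
import torch
import torch.nn as nn


class SSFLinear(nn.Module):
    def __init__(self, frozen: nn.Linear):
        super().__init__()
        self.frozen = frozen
        for p in self.frozen.parameters():
            p.requires_grad_(False)
        d_out = frozen.out_features
        self.gamma = nn.Parameter(torch.ones(d_out))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(d_out))   # learnable shift

    def forward(self, x):
        return self.gamma * self.frozen(x) + self.beta

    @torch.no_grad()
    def reparameterize(self) -> nn.Linear:
        # gamma * (x W^T + b) + beta == x (gamma[:, None] * W)^T + (gamma * b + beta)
        fused = nn.Linear(self.frozen.in_features, self.frozen.out_features)
        bias = self.frozen.bias if self.frozen.bias is not None else 0.0
        fused.weight.copy_(self.gamma[:, None] * self.frozen.weight)
        fused.bias.copy_(self.gamma * bias + self.beta)
        return fused
```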
3.3 Unified-based Tuning

Unified-based tuning approaches offer a single, harmonized framework that integrates various fine-tuning methods, streamlining the process and enhancing the overall efficiency and effectiveness of fine-tuning. For instance, NOAH [Liu et al., 2022b] incorporates Adapter, LoRA, and VPT into each Transformer block and employs Neural Architecture Search (NAS) to determine the best design for specific downstream tasks, a comprehensive approach to optimizing fine-tuning by combining multiple techniques. LAE [Gao et al., 2023] proposes a unified framework for continual learning; the framework is designed to be adaptable, allowing any PEFT method to be reconfigured into a competitive approach for continual learning. Additionally, V-PETL [Yu et al., 2022] provides a unified analysis of PEFT techniques for video tasks, investigating the critical aspect of fine-tuning positions and offering a cohesive view of these techniques. Similarly, U-Tuning [Jiang et al., 2023] rethinks PEFT from an integrated perspective, re-evaluating existing tuning paradigms. It identifies a parallel form for mainstream tuning methods, including adapter, prefix, and prompt tuning, which effectively reduces the coupling in tuning structures.
3.4 Discussion

Characteristic Analysis. We summarize the characteristics of all PEFT methods in Tab. 1 and compare them along four aspects. 1) No Additional Modules (NAM): specification tuning is the only approach that introduces no new modules, whereas the others add modules or parameters to varying degrees. 2) Structure Preserving (SP): adapter tuning changes the structure of the PVM; by contrast, prompt tuning, prefix tuning, side tuning, and reparameter tuning maintain the structure of the original PVM while introducing new modules, and specification tuning directly optimizes a subset of the PVM's parameters, so it does not change the model structure. 3) Inference Efficient (IE): additional modules typically increase inference latency, with reparameter tuning being an exception thanks to its reparameterization technique. 4) Memory Efficient (ME): side tuning uniquely achieves memory efficiency because its gradient backpropagation does not pass through the PVM. Overall, each PEFT method presents unique advantages and limitations, and there is no completely perfect PEFT method.

| Category | NAM | SP | IE | ME | Example | # Trainable Params |
|---|---|---|---|---|---|---|
| Specification | ✓ | ✓ | ✓ | ✗ | BitFit | L × (7 × d) |
| Reparameter | ✗ | ✓ | ✓ | ✗ | LoRA | L × 2 × (2dk) |

Table 1: Comparison between different tuning methods.

Parameter Analysis. To quantify the number of trainable parameters, we select a representative work for each category, as shown in Tab. 1. BitFit has the smallest number of trainable parameters since it only updates the bias terms of the PVM. In contrast, LST has the largest number of trainable parameters due to its parallel subnetwork, although it achieves memory efficiency; optimizing the subnetwork structure may therefore be crucial in the future. Additionally, AdaptFormer, PATT, and LoRA share similar parameter magnitudes, as they all inject comparable structures into each Transformer layer, and VPT-Deep has a slightly higher parameter count than BitFit. In practical applications, these methods possess only 0.05% to 10% of the trainable parameters of full fine-tuning, yet they achieve comparable or even better performance on downstream tasks.
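As a back-of-the-envelope illustration of these formulas, the short script below plugs in a ViT-B-like configuration (L = 12, d = 768, roughly 86M backbone parameters); the LoRA rank k = 8 is our assumption for illustration.

```python
# Back-of-the-envelope check of the trainable-parameter formulas in Tab. 1.
L, d, k = 12, 768, 8
backbone = 86_000_000          # approximate ViT-B parameter count

bitfit = L * (7 * d)           # Tab. 1: BitFit, bias terms only
lora = L * 2 * (2 * d * k)     # Tab. 1: LoRA on two weight matrices per layer

for name, n in [("BitFit", bitfit), ("LoRA", lora)]:
    print(f"{name}: {n:,} trainable params "
          f"({100 * n / backbone:.3f}% of the backbone)")
# BitFit: 64,512 trainable params (0.075% of the backbone)
# LoRA: 294,912 trainable params (0.343% of the backbone)
```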
4 Datasets and Applications

In this section, we briefly discuss the popular datasets and applications in visual PEFT, as shown in Tab. 2. Image recognition is the primary benchmark and application for PEFT, exemplified by datasets such as FGVC [Jia et al., 2022] (5 downstream tasks) and VTAB-1k [Zhai et al., 2019] (19 downstream tasks). PEFT is also influential in other domains. Beyond image classification, video action recognition is another key application area, involving datasets such as Kinetics-400 [Kay et al., 2017], SSv2 [Goyal et al., 2017], HMDB51 [Kuehne et al., 2011], and Diving-48 [Li et al., 2018]. Additionally, PEFT has been utilized for dense prediction tasks, using datasets such as COCO [Lin et al., 2014], ADE20K [Zhou et al., 2019], and PASCAL VOC [Everingham et al., 2015]. Furthermore, the use of PEFT is expanding into new fields, including point cloud analysis and robotic manipulation. It is evident that PEFT is increasingly applied across various domains and prevailing in diverse downstream tasks.
Table 2: Popular datasets and applications in visual PEFT.

| Application | Dataset | Description | #Classes | Train size | Val size | Test size |
|---|---|---|---|---|---|---|
| Image Recognition: FGVC [Jia et al., 2022] | CUB-200-2011 | Fine-grained bird species recognition | 200 | 5,394 | 600 | 5,794 |
| | NABirds | Fine-grained bird species recognition | 555 | 21,536 | 2,393 | 24,633 |
| | Oxford Flowers | Fine-grained flower species recognition | 102 | 1,020 | 1,020 | 6,149 |
| | Stanford Dogs | Fine-grained dog species recognition | 120 | 10,800 | 1,200 | 8,580 |
| | Stanford Cars | Fine-grained car classification | 196 | 7,329 | 815 | 8,041 |
| Image Recognition: VTAB-1k [Zhai et al., 2019] | CIFAR-100 | Natural: tasks with natural images captured using standard cameras | 100 | 800 | 200 | 10,000 |
| | Caltech101 | | 102 | 800 | 200 | 6,084 |
| | DTD | | 47 | 800 | 200 | 1,880 |
| | Flowers102 | | 102 | 800 | 200 | 6,149 |
| | Pets | | 37 | 800 | 200 | 3,669 |
| | SVHN | | 10 | 800 | 200 | 26,032 |
| | Sun397 | | 397 | 800 | 200 | 21,750 |
| | Patch Camelyon | Specialized: tasks with images captured via specialized equipment, such as medical and satellite imagery | 2 | 800 | 200 | 32,768 |
| | EuroSAT | | 10 | 800 | 200 | 5,400 |
| | Resisc45 | | 45 | 800 | 200 | 6,300 |
| | Retinopathy | | 5 | 800 | 200 | 42,670 |
| | Clevr/count | Structured: tasks that require geometric comprehension, such as object counting | 8 | 800 | 200 | 15,000 |
| | Clevr/distance | | 6 | 800 | 200 | 15,000 |
| | DMLab | | 6 | 800 | 200 | 22,735 |
| | KITTI/distance | | 4 | 800 | 200 | 711 |
| | dSprites/location | | 16 | 800 | 200 | 73,728 |
| | dSprites/orientation | | 16 | 800 | 200 | 73,728 |
| | SmallNORB/azimuth | | 18 | 800 | 200 | 12,150 |
| | SmallNORB/elevation | | 9 | 800 | 200 | 12,150 |
| Video Recognition | Kinetics-400 [Kay et al., 2017] | Video action recognition | 400 | 240,436 | N/A | 19,787 |
| | SSv2 [Goyal et al., 2017] | Video action recognition | 174 | 168,913 | 24,777 | 27,157 |
| | HMDB51 [Kuehne et al., 2011] | Video action recognition | 51 | 3,500 | 1,500 | 1,849 |
| | Diving-48 [Li et al., 2018] | Video action recognition | 48 | 15,900 | N/A | 2,000 |
| | UCF-101 [Soomro et al., 2012] | Video action recognition | 101 | 9,537 | N/A | 3,783 |
| Dense Prediction | MS COCO [Lin et al., 2014] | Instance segmentation | 80 | 118,000 | N/A | 5,000 |
| | ADE20K [Zhou et al., 2019] | Semantic segmentation | 150 | 20,210 | N/A | 2,000 |
| | PASCAL VOC [Everingham et al., 2015] | Semantic segmentation | 21 | 1,464 | N/A | 1,449 |