

Learning Vision from Models Rivals Learning Vision from Data

Yonglong Tian1,† Lijie Fan2,†,* Kaifeng Chen1 Dina Katabi2 Dilip Krishnan1 Phillip Isola2
1Google Research   2MIT CSAIL   †equal contribution   *work done while interning at Google
Github Repo: https://github.com/google-research/syn-rep-learn

Abstract

We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, without any real data. We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption. We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs. The resulting representations transfer well to many downstream tasks, competing favorably with other general-purpose visual representation learners such as CLIP and DINO v2 in image classification tasks. Furthermore, in dense prediction tasks such as semantic segmentation, SynCLR outperforms previous self-supervised methods by a significant margin, e.g., improving over MAE and iBOT by 6.2 and 4.3 mIoU on ADE20k for ViT-B/16.

Figure 1. Three paradigms for visual representation learning. Top row: traditional methods, such as CLIP [62], learn only from real data; middle row: recent methods, such as StableRep [81], learn from real text and generated images; bottom row: our method, SynCLR, learns from synthetic text and synthetic images, and rivals the linear transfer performance of CLIP on ImageNet despite not directly observing any real data.

1. Introduction

Representation learning extracts and organizes information from raw, often unlabeled data. The quality, quantity, and diversity of the data determine how good a representation the model can learn. The model becomes a reflection of the collective intelligence that exists in the data. We get what we feed in.

Unsurprisingly, the current best-performing visual representation learning methods [59, 62] rely on large-scale real datasets. However, the collection of real data has its own dilemmas. Collecting large-scale uncurated data [71] is relatively cheap and thus quite achievable; however, for self-supervised representation learning, this approach exhibits poor scaling behavior, i.e., adding more uncurated data has little effect at large data scales [33, 80]. Collecting small-scale curated data [21] is also achievable, but models trained in this way are limited to relatively narrow tasks. The ideal would be large-scale curated datasets of real images; recent work has indeed shown that this can lead to strong performance gains at scale [59], but this path is costly to pursue.

To alleviate the cost, in this paper we ask if synthetic data, sampled from off-the-shelf generative models, is a viable path toward large-scale curated datasets that can train state-of-the-art visual representations.

We call such a paradigm learning from models, in contrast to directly learning from data. Models have several advantages as a data source for building large-scale training sets: via their latent variables, conditioning variables, and hyperparameters, they provide new controls for curating data; we will make use of these controls in the method we propose. Models can also be easier to share and store (because models are more compressed than data), and can produce an unlimited number of data samples (albeit with finite diversity).
A growing literature has studied these properties and other advantages (and disadvantages) of using generative models as a data source for training downstream models [3, 26, 40, 41, 69, 81]. Some of these methods use a hybrid mode, either mixing real and synthetic datasets [3] or needing a real dataset to generate another synthetic dataset [81]. Other methods try to learn representations from purely synthetic data [69] but lag far behind the best performing models. Instead, we show that learning from models, without training on any real data, can yield representations that match the top-performing representations learnt from real data. For instance, as illustrated in Figure 1, representations learnt by our method transfer as well as OpenAI's CLIP [62] on ImageNet (both methods using ViT-B [24]).

Our approach leverages generative models to re-define the granularity of visual classes. As shown in Figure 2, consider four images generated using two prompts: "a golden retriever, wearing sunglasses and a beach hat, rides a bike" and "a cute golden retriever sits in a house made of sushi". A traditional self-supervised method such as SimCLR [13] will treat each of these images as a different class; embeddings for different images are pushed apart with no explicit consideration of the shared semantics between images. On the other extreme, supervised learning methods (i.e., SupCE) will regard all these images as a single class (e.g., "golden retriever"). This ignores nuances in the semantics of the images, such as the fact that the dogs are riding a bike in one pair of images and sitting inside a sushi house in the other pair. Instead, our method, SynCLR, treats captions as classes, i.e., each caption describes a visual class (this level of granularity was also explored in StableRep [81]). This allows us to group images by the concepts of "riding a bike" and "sitting in a sushi house", in addition to grouping by a coarser class label like "golden retriever". This level of granularity is difficult to mine in real data, since collecting multiple images described by a given caption is non-trivial, especially when scaling up the number of captions. However, text-to-image diffusion models are fundamentally built with this ability: simply by conditioning on the same caption and using different noise inputs, a text-to-image diffusion model will produce different images that all match the same caption. In our experiments, we find that caption-level granularity outperforms both SimCLR and supervised training. Another advantage is that this definition of visual classes has good scalability: unlike ImageNet-1k/21k, where the number of classes is fixed, we can augment existing classes (or data) in an online fashion, and theoretically scale up to as many classes as needed.

Figure 2. Different learning objectives treat classification granularity differently. These images are generated by two prompts, "a golden retriever, wearing sunglasses and a beach hat, rides a bike" and "a cute golden retriever sits in a house made of sushi". SimCLR treats each image as its own class, while supervised cross-entropy treats them all as the same "golden retriever" class. The former does not consider shared semantics between images, and the latter is coarse-grained and ignores actions or relationships between subjects/background. Our approach, SynCLR, defines visual classes by sentences.

Our system consists of three steps. The first step is to synthesize a large corpus of image captions. We design a scalable approach by leveraging the in-context learning capability of large language models (LLMs), where we present examples of word-to-caption translations. Next, a text-to-image diffusion model is adopted to synthesize multiple images for each synthetic caption. This yields a synthetic dataset of 600M images. Then we train visual representation models by a combination of multi-positive contrastive learning [43] and masked image modeling [98].

Our learned representations transfer well. With SynCLR pre-training, our ViT-B and ViT-L models achieve 80.7% and 83.0% top-1 linear probing accuracy on ImageNet-1K, respectively, which is on par with OpenAI's CLIP [62]. On fine-grained classification tasks, SynCLR outperforms CLIP by 3.3% for ViT-B and 1.5% for ViT-L, and performs similarly to DINO v2 [59] models, which are distilled from a pre-trained ViT-g model. For semantic segmentation on ADE20k, SynCLR outperforms MAE pre-trained on ImageNet by 6.2 and 4.1 mIoU for ViT-B and ViT-L under the same setup, showing strong transfer ability for dense prediction tasks, similar to DINO v2, which additionally involves a training period on 518x518 resolution images that SynCLR does not have.

2. Related Works

Self-supervised representation learning approaches in vision develop domain-specific pre-text tasks, such as colorization [94], rotation prediction [31], and solving jigsaw puzzles [56].
Domain-agnostic approaches have been popular, such as contrastive learning [6, 13, 35, 38, 57, 78, 87] and masked image modeling [2, 4, 5, 29, 39, 86, 90, 98]. Contrastive learning promotes invariance [79] across two views of the same image and pushes apart representations of different images [85] (or enforces only invariance [11, 34]); the resulting representations yield strong performance for linear or zero-shot transfer. Masked image modeling reconstructs the pixels [39, 90] or local features [4], often producing excellent fine-tuning transfer performance, especially in dense prediction tasks [39]. The state-of-the-art DINO v2 [59] leverages both approaches, and our approach shares a similar spirit.

Supervised learning [36, 45, 75] used to be the dominant approach for learning transferable visual representations for various tasks [23, 32, 72]. Recent studies [37, 49] have shown that the transferability of representations learned in this way is limited, e.g., pre-training yields no improvement over random initialization for dense prediction tasks (e.g., object detection) when the fine-tuning is long enough. Such limitations persist even when the model is scaled up to 22B parameters [20]. An alternative paradigm learns visual representations from text supervision [42, 62], e.g., CLIP [62]. This approach is more flexible (i.e., it does not require classes) and provides richer supervision, often learning generalizable representations.

Generative models as representation learners. A number of papers have explored the representations that are learned by generative models for various recognition tasks [22, 48]. As might be expected intuitively, such models indeed learn especially good representations for dense tasks, such as optical flow estimation [70], semantic segmentation [8, 91], and depth estimation [95]. Another line of work [18, 47] adapts pre-trained diffusion models for zero-shot image recognition via analysis-by-synthesis. These approaches may need to be adapted when the architectures of the generative models change or a new family of generative models emerges. Our approach treats images as universal interfaces, with the hope of better generality.

Learning from synthetic data from generative models. Synthetic data has been explored to train machine learning models in various domains [27, 46, 53, 54, 65, 66, 74, 77, 92]. In computer vision, the utilization of synthetic data for training models is common, ranging from optical flow [52] and autonomous driving [1] to semantic segmentation [15] and human pose estimation [84]. Others [41, 50] have explored synthetic data for representation learning, with the predominant approach of altering the latent variables of deep generative models. Our approach aligns with this research paradigm, but diverges in its use of text-to-image models, which have also been investigated by other researchers [40, 69, 99]; however, they use synthetic data for supervised learning [26, 69]. The closest work is StableRep [81], which also conducts representation learning but still needs a real text dataset.

3. Approach

In this paper, we study the problem of learning a visual encoder f in the absence of real images or textual data. Our approach hinges on the utilization of three key resources: a language generation model (g1), a text-to-image generative model (g2), and a curated list of visual concepts (C). Our exploration includes three steps: (1) we employ g1 to synthesize a comprehensive set of image descriptions T, which encompass the range of visual concepts in C; (2) for each caption in T, we generate multiple images using g2, culminating in an extensive synthetic image dataset X; (3) we train on X to obtain a visual representation encoder f.

We use Llama-2 7B [83] and Stable Diffusion 1.5 [64] as g1 and g2, respectively, because of their fast inference speed. We anticipate that better g1 and g2 in the future will further enhance the effectiveness of this approach.

3.1. Synthesizing captions

To harness the capability of powerful text-to-image models for generating a substantial dataset of training images, we initially require a collection of captions that not only precisely depict an image but also exhibit enough diversity to encompass a broad spectrum of visual concepts.

We have developed a scalable approach to create such a large collection of captions, leveraging the in-context learning capability of LLMs [9]. Our method involves crafting specific prompt engineering templates that guide the LLM to produce the required captions. We start by gathering the concept list C from existing datasets, such as ImageNet-21k [21] and Places-365 [96]. For each concept c ∈ C, we consider three straightforward templates to generate captions effectively:

• c –> caption. As the most direct and simple approach, we have the Llama-2 model sample a sentence for the concept c.

• c, bg –> caption. We combine the visual concept c with a background or setting bg. A naïve approach would randomly select both c and bg, where bg may correspond to a class name from a places dataset such as [96]. However, this method often leads to unlikely combinations in the real world, such as a blue whale in a football field. Our ablation experiments demonstrate that this strategy results in suboptimal performance, likely because the generated captions fall far outside the training distribution of g2. Instead, we employ GPT-4 [58] to generate a list of suitable backgrounds for the chosen concepts. This approach increases the likelihood of generating plausible combinations, such as a tiger in a forest or a cat in a kitchen, enhancing the overall quality of the results.

• c, rel –> caption. Given a visual concept c, we consider pairing it with a positional relationship word, rel. For instance, if c signifies cat and rel translates to in front of, our objective is to prompt the LLM to create captions such as "a cute yellow cat is enjoying the fish in front of the sofa". To add variety, we have a selection of 10 different positional relationship words that we randomly choose from.
Template: c –> caption
  revolver –> Multiple antique revolvers lie on a wooden table, gleaming under soft, ambient light.
  closet –> The compact closet, brimming with clothes and shoes, exudes a feeling of organization.
  zebra –> A zebra is gallantly trotting across the vast, sunlit plains of the African savannah, creating a captivating black and white spectacle.
  bus station –> The bustling bus station thrums with restless energy, as travelers navigate through the crowded space, awaiting their journeys amid the echoes of departing buses.

Template: c, bg –> caption
  tiger, forest –> Two tigers are running together in the forest.
  lighter, motorhome –> In the cozy, cluttered environment of a well-traveled motorhome, a sleek silver lighter holds dominion on the rustic wooden table.
  sunset, lake –> Golden sunset hues reflect on a calm lake, silhouetting a lone canoeist against a backdrop of fiery clouds.

Template: c, rel –> caption
  kit fox, in front of –> A group of small, fluffy, golden kit foxes is playfully gathered in front of a lush, green, towering forest backdrop.
  cabbage, besides –> A vibrant image portrays a lush, green cabbage, glistening with dewdrops, nestled besides a rustic, wooden crate full of freshly harvested vegetables.

Table 1. We show examples for the three synthesis templates. Such examples are used as demonstrations for Llama-2 to perform the in-context learning task. We have 176 such examples in total. Most of them are generated by prompting GPT-4 [58], while a handful of others are human generated (in a 10M-scale pilot study of synthetic captions, we do not notice significant differences between including or excluding the human generated examples).

For each of the three templates, we have prepared multiple demonstration examples that serve as instructions for the LLM to complete the caption synthesis task. Table 1 shows a couple of examples for each template. In total, we have 106 examples for c –> caption, 50 examples for c, bg –> caption, and 20 examples for c, rel –> caption. Such examples are mostly collected by prompting GPT-4, with a handful written by humans. In a pilot study, we did not observe a difference between including or excluding the human generated examples.

In the stage of generating captions in-context, we select a concept and one of the three templates. Next, we randomly pick three examples from the chosen template and frame the caption generation as a text completion task. This process is illustrated in Figure 3.

Figure 3. In-context caption generation using Llama-2 [83]. We randomly sample three in-context examples for each inference run.
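To make the three-shot completion framing concrete, the sketch below shows one way such a prompt could be assembled. The example list and the llama2_complete call are illustrative assumptions (a stand-in for any Llama-2 7B text-completion interface), not the authors' exact prompt or code.

import random

# Hypothetical in-context examples for the "c, bg --> caption" template,
# in the spirit of Table 1 (the real system uses 50 such examples).
CBG_EXAMPLES = [
    ("tiger, forest", "Two tigers are running together in the forest."),
    ("sunset, lake", "Golden sunset hues reflect on a calm lake, silhouetting "
                     "a lone canoeist against a backdrop of fiery clouds."),
    ("lighter, motorhome", "In the cozy, cluttered environment of a well-traveled "
                           "motorhome, a sleek silver lighter holds dominion on "
                           "the rustic wooden table."),
    ("cat, kitchen", "A curious cat inspects a steaming pot on the kitchen counter."),
]

def build_caption_prompt(concept: str, background: str, num_shots: int = 3) -> str:
    """Frame caption synthesis as text completion: three randomly sampled
    demonstrations followed by the new (concept, background) pair that the
    LLM is asked to finish."""
    shots = random.sample(CBG_EXAMPLES, num_shots)
    lines = [f"{inp} --> {caption}" for inp, caption in shots]
    lines.append(f"{concept}, {background} --> ")
    return "\n".join(lines)

prompt = build_caption_prompt("golden retriever", "beach")
# caption = llama2_complete(prompt)  # hypothetical Llama-2 7B completion call
print(prompt)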

3.2. Synthesizing Images

For each text caption, we generate a variety of images by initiating the reverse diffusion process with different random noise. The Classifier-Free Guidance (CFG) scale is a crucial factor in this process. A higher CFG scale enhances the quality of the samples and the alignment between text and image, whereas a lower scale results in more diverse samples and better adherence to the original conditional distribution of images given the text. Following the approach used in StableRep [81], we opt for a lower CFG scale, specifically 2.5, and produce 4 images for each caption. Examples of these images can be seen in Figure 4.

Figure 4. Random examples of synthetic captions and images generated in our SynCLR pipeline. Each caption comes with 4 images. Example captions: "A plate of paella, a mixed rice dish with chicken, beans, and seafood"; "An industrial power plant with its smokestacks belching black smoke."; "A fluffy, black and white junco bird perches on a snow-covered fence, overlooking a dark forest."
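As a rough sketch of this generation step (not the authors' pipeline code), the snippet below uses the Hugging Face diffusers StableDiffusionPipeline with guidance_scale=2.5 and four samples per caption; the checkpoint name, step count, and output paths are assumptions for illustration.

import torch
from diffusers import StableDiffusionPipeline

# Assumed Stable Diffusion 1.5 checkpoint; any SD 1.5 weights would work.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

caption = "a cute golden retriever sits in a house made of sushi"

# A low CFG scale (2.5) favors sample diversity over per-image fidelity;
# the 4 images per caption become the positive set for the contrastive loss.
images = pipe(
    caption,
    guidance_scale=2.5,
    num_images_per_prompt=4,
    num_inference_steps=50,
).images

for i, img in enumerate(images):
    img.save(f"synclr_sample_{i}.png")  # illustrative output path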
3.3. Representation Learning

Our representation learning method is built upon StableRep [81]. The key component of our approach is the multi-positive contrastive learning loss [43], which works by aligning (in the embedding space) images generated from the same caption. We additionally combine multiple techniques from other self-supervised learning methods, including a patch-level masked image modeling objective. We briefly review StableRep and elaborate on the added modules.

StableRep [81] minimizes the cross-entropy loss between a ground-truth assignment distribution and a contrastive assignment distribution. Consider an encoded anchor sample a and a set of encoded candidates {b_1, b_2, ..., b_K}. The contrastive assignment distribution q describes how likely the model predicts a and each b to be generated from the same caption, and the ground-truth distribution is the actual match between a and b (a is allowed to match multiple b):

    q_i = \frac{\exp(a \cdot b_i / \tau)}{\sum_{j=1}^{K} \exp(a \cdot b_j / \tau)}    (1)

    p_i = \frac{\mathbb{1}_{\mathrm{match}(a, b_i)}}{\sum_{j=1}^{K} \mathbb{1}_{\mathrm{match}(a, b_j)}}    (2)

where τ ∈ R+ is the scalar temperature, a and all b have been ℓ2 normalized, and the indicator function 1_match(·,·) indicates whether two samples are generated from the same caption. The contrastive loss for a is given as

    \mathcal{L}(a) = H(p, q) = -\sum_{i=1}^{K} p_i \log q_i    (3)
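A minimal PyTorch sketch of Eqs. (1)-(3) is given below. It assumes the anchor and candidate embeddings are already ℓ2-normalized and that integer caption ids identify which samples come from the same caption; it illustrates the loss definition only, not the full StableRep/SynCLR training code.

import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(anchors, candidates,
                                    anchor_caption_ids, candidate_caption_ids,
                                    tau: float = 0.08):
    """Cross-entropy H(p, q) between the ground-truth assignment p (Eq. 2)
    and the contrastive assignment q (Eq. 1), averaged over anchors (Eq. 3).

    anchors:    (N, D) l2-normalized embeddings
    candidates: (K, D) l2-normalized embeddings
    *_caption_ids: integer ids; a matches b iff they share a caption id
    """
    logits = anchors @ candidates.t() / tau            # (N, K): a . b / tau
    log_q = F.log_softmax(logits, dim=1)               # log of Eq. (1)

    match = (anchor_caption_ids[:, None] == candidate_caption_ids[None, :]).float()
    p = match / match.sum(dim=1, keepdim=True)         # Eq. (2)

    return -(p * log_q).sum(dim=1).mean()              # Eq. (3)

# Toy usage: 8 crops drawn from 2 captions (4 images per caption).
emb = F.normalize(torch.randn(8, 256), dim=1)
ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
loss = multi_positive_contrastive_loss(emb, emb.clone(), ids, ids)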
iBOT [98] is a masked image modeling objective, wherein a localized patch is masked and the model is tasked with predicting the tokenized representation of the masked patch. It adapts the DINO [11] objective from the image level to the patch level. We follow [67] and replace the softmax-centering method with the iterative Sinkhorn-Knopp (SK) algorithm [19]. We run SK for 3 iterations to build the prediction target.
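For reference, a SwAV-style Sinkhorn-Knopp normalization run for 3 iterations could look like the sketch below. The epsilon value, the stability shift, and the toy prototype count are assumptions; the paper does not spell out these details.

import torch

@torch.no_grad()
def sinkhorn_knopp(teacher_logits, eps: float = 0.05, iters: int = 3):
    """Turn teacher patch logits (B, P) over P prototypes into a soft assignment
    whose prototypes are used roughly equally, replacing softmax-centering when
    building iBOT prediction targets."""
    logits = teacher_logits / eps
    logits = logits - logits.max()            # numerical stability; cancels in normalization
    Q = torch.exp(logits).t()                 # (P, B)
    Q /= Q.sum()
    P, B = Q.shape
    for _ in range(iters):
        Q /= Q.sum(dim=1, keepdim=True)       # normalize rows (prototypes)
        Q /= P
        Q /= Q.sum(dim=0, keepdim=True)       # normalize columns (samples)
        Q /= B
    Q *= B                                    # each column sums to 1 again
    return Q.t()                              # (B, P) target distribution

# 4096 prototypes for illustration (the paper uses 65536).
targets = sinkhorn_knopp(0.05 * torch.randn(16, 4096))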
Exponential Moving Average (EMA) was first introduced into self-supervised learning by MoCo [38]. We use EMA to encode crops as b and to produce the targets for the iBOT loss. We update the EMA model as θ_ema ← λθ_ema + (1 − λ)θ, following a cosine schedule for λ from 0.994 to 1 during training [34, 59]. We find the EMA module not only increases the final performance, but also improves the training stability for long training schedules.
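A sketch of the EMA (teacher) update with the cosine schedule for λ might look like the following; the function and variable names are illustrative, not part of a released API.

import math
import torch

@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, lam: float):
    """theta_ema <- lam * theta_ema + (1 - lam) * theta, applied parameter-wise."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(lam).add_(p_s.detach(), alpha=1.0 - lam)

def cosine_momentum(step: int, total_steps: int,
                    base: float = 0.994, final: float = 1.0) -> float:
    """Cosine schedule for lambda, going from 0.994 to 1.0 over training."""
    return final - (final - base) * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# After each optimizer step:
#   ema_update(student, teacher, cosine_momentum(step, total_steps))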
Multi-crop strategy was introduced by [10] as a smart way to improve computation efficiency, and is adopted in this paper. For these local crops, we only employ the contrastive loss, omitting the iBOT loss. Local crops are encoded only by the student network, and matched to global crops from the same caption encoded by the EMA model. Such reuse of global crops saves computation. For each image x, for which we generate a single global crop x^g alongside n local crops x^l_i, the final loss can be expressed as follows:

    \mathcal{L}(x^g) + \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(x^l_i) + \mathcal{L}_{\mathrm{iBOT}}(x^g)    (4)
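Putting the pieces together, Eq. (4) could be assembled roughly as below, reusing the multi-positive loss from the earlier sketch. encode_student, ibot_loss, and teacher_bank are hypothetical stand-ins for the student encoder, the masked-image-modeling term, and the batch of EMA-encoded global crops; this is a structural illustration, not the authors' implementation.

def synclr_image_loss(global_crop, local_crops, caption_id, teacher_bank,
                      contrastive_loss, ibot_loss, encode_student):
    """Eq. (4): L(x^g) + (1/n) * sum_i L(x^l_i) + L_iBOT(x^g).

    teacher_bank = (cand_emb, cand_ids): EMA-encoded global crops of the batch
    and their caption ids, reused as candidates for every crop."""
    cand_emb, cand_ids = teacher_bank

    z_g = encode_student(global_crop)                       # student global embedding, (1, D)
    loss = contrastive_loss(z_g, cand_emb, caption_id, cand_ids)

    local_term = 0.0
    for crop in local_crops:                                # local crops: contrastive only
        z_l = encode_student(crop)
        local_term = local_term + contrastive_loss(z_l, cand_emb, caption_id, cand_ids)
    loss = loss + local_term / max(len(local_crops), 1)

    return loss + ibot_loss(global_crop)                    # masked image modeling on the global crop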
3.4. Implementation

Concept list. We concatenate class names from various datasets, including IN-1k [21], IN-21k (we keep the most frequent 13k classes), Aircraft [51], Cars [44], DTD [17], Flowers [55], Pets [60], Sun397 [88], Caltech-101 [30], Food-101 [7], and Places-365 [96]. If the concept is a place (i.e., SUN397 and Places) or a texture (i.e., DTD), we only apply the c –> caption template. For fine-grained classes such as pets or flowers, we employ GPT-4 to generate a consolidated list of probable backgrounds, rather than producing distinct lists for each specific class. We favor more frequent sampling from IN-1k, Food101, Cars, Aircraft, and Flowers.

Batches. For each training batch, we sample 2048 captions (except when noted), and use all of the 4 images generated for each caption. We generate 1 global and 4 local crops for each image. As a result, each batch contains 8192 global crops, which is similar to prior work [13, 14, 34, 81].

Masking. For the iBOT loss, we randomly choose 50% of the images inside a batch to mask, and randomly mask 50% of the tokens in each chosen image. We use 65536 prototypes. While the target from the EMA model is obtained using the SK algorithm, we apply softmax normalization to the output of the student model.

Projection heads. We follow the designs in MoCo v3 [14] and DINO [11] for the contrastive and iBOT loss heads, respectively, ensuring consistency with established methods.

Other hyper-parameters. We set the temperature in the contrastive loss to 0.08. For the temperature used in the iBOT loss, we linearly increase it from 0.04 to 0.07 over 4000 iterations, and keep it at 0.07 afterwards, as in DINO [11]. Additionally, the weight decay parameter is incrementally adjusted from 0.04 to 0.2, adhering to a cosine schedule.

4. Experiment

We first perform an ablation study to evaluate the efficacy of various designs and modules within our pipeline. Then we proceed to scale up the volume of synthetic data.

4.1. Study different components

We analyze each component of SynCLR, and ablate their effectiveness with two measurements: (1) linear probing performance on IN-1k; (2) average accuracy of linear transfer on the fine-grained datasets Aircraft [51], Cars [44], DTD [17], Flowers [55], Pets [60], Sun397 [88], Caltech-101 [30], Food-101 [7], and Pascal VOC [25]. For the analysis conducted in this subsection, we train ViT-B/16 [24] models for 85000 iterations, and use the cls token as the image representation.
captions           | StableRep IN / avg. | SynCLR IN / avg.
cc12m              | 73.0 / 81.6         | 77.1 / 85.3
IN+h+Places        | 75.4 / 80.0         | 78.7 / 83.0
IN+Places+LLM      | 73.7 / 76.9         | 77.6 / 81.8
IN+OurBG+LLM       | 75.3 / 78.5         | 78.2 / 81.9
our final config.  | 75.8 / 85.7         | 78.8 / 88.1

Table 2. Comparison of different caption synthesis strategies. We report top-1 ImageNet linear evaluation accuracy and the average accuracy over 9 fine-grained datasets. Every item here includes 10M captions and 4 images per caption.

CFG      | 2    | 3    | 4
IN top-1 | 72.8 | 72.6 | 72.6

Table 3. Classifier-free guidance scale (CFG). The contrastive loss prefers a small CFG scale but is not very sensitive to it.

method    | EMA | iBOT | MC | IN   | avg. | ADE20k
StableRep |     |      |    | 75.8 | 85.7 | -
          | ✓   |      |    | 76.7 | 86.7 | 48.0
          | ✓   | ✓    |    | 77.6 | 87.1 | 50.5
          | ✓   |      | ✓  | 78.6 | 87.8 | 49.5
SynCLR    | ✓   | ✓    | ✓  | 78.8 | 88.1 | 50.8

Table 4. Important components for our model. ViT-B/16 models are trained for 85000 iterations. We study the modules that affect the ImageNet linear evaluation, the fine-grained classification (avg.), and ADE20k segmentation.

method        | IN   | avg.
Supervised CE | 71.9 | 75.0
SimCLR        | 63.6 | 67.9
SynCLR        | 75.3 | 78.5

Table 5. Comparison of different learning objectives. These objectives assume different levels of classification granularity, as shown in Figure 2. Our modeling, i.e., defining classes as captions, outperforms the other two. To accommodate Supervised CE training, all items here use the IN+OurBG+LLM entry in Table 2.

Synthesize captions. Following [81], we use cc12m [12] real captions as our baseline, which has 10M sentences. To synthesize captions, we design the following variants: (a) IN+h+Places randomly combines one IN class, plus its hypernyms in the WordNet graph, with one place class; (b) IN+Places+LLM uses the c, bg –> caption in-context synthesis template with c from IN and bg from Places; (c) IN+OurBG+LLM uses the background classes output by GPT-4, instead of Places; (d) ours means our full configuration specified in Section 3.1. For each of the configs, we generate 10M captions, duplicating captions if not enough are produced.

Results are summarized in Table 2, where we train both StableRep and SynCLR to avoid biases favored by a single method. Compared to a real caption dataset cc12m, simply concatenating IN and Places class names improves the ImageNet linear accuracy but reduces the fine-grained classification performance. Interestingly, naively asking Llama to combine IN and Places classes into captions yields the worst performance. Replacing random backgrounds from Places with GPT-4 generated backgrounds improves the accuracy. This shows the importance of synthesizing captions that follow the distribution of the real captions which were used to train the text-to-image model. Finally, our full configuration achieves the best accuracy on both ImageNet and fine-grained classification. Another advantage of our synthesis method is its scalability: it can scale up to hundreds of millions of captions with little duplication. In contrast, if we concatenate IN classes with Places classes, there are at most 365k unique captions.

Synthesize images. There are two major parameters in this process: the number of images per caption and the classifier-free guidance scale. For the former, we find that generating 4 images almost reproduces StableRep [81]'s performance (10 images) when using cc12m captions (ours 73.0% vs. StableRep 73.5% on ImageNet). Thus we stick to 4. For the guidance scale, we find in a pilot study that the contrastive loss is not very sensitive to CFG, as shown in Table 3. Thus we stick to 2.5, similar to StableRep [81].

Model components. We present the improvement of accuracy brought by different modules in Table 4. Compared to the baseline StableRep, adding a teacher EMA model improves the IN linear accuracy by 0.9%. Further adding the iBOT local objective or the multi-crop strategy increases the accuracy by 0.9% and 1.9%, respectively. Combining all of them results in our full SynCLR model, which achieves 78.8% top-1 IN linear accuracy. The fine-grained classification performance follows a similar trend, and reaches 88.1%. Besides, we test the transfer ability to semantic segmentation on ADE20k. The iBOT objective brings 1.0 more mIoU than the multi-crop strategy, demonstrating the effectiveness of masked image modeling for dense prediction tasks.

Compare to SimCLR and supervised training. We compare the three different representation learning objectives shown in Figure 2, which classify images at different levels of granularity. Since supervised cross-entropy training requires a fixed set of balanced classes (indeed, both the fixed set and the balance are limitations of such a method), we use the IN+OurBG+LLM configuration, where we have 1000 balanced classes (i.e., each class has 40k images). The supervised training recipe follows [76]. For a fair comparison with SimCLR, we remove all unmatched modules (i.e., EMA, iBOT, and MC) to make sure that the only difference between SimCLR and our SynCLR is the classification granularity defined by the contrastive loss. For all of them, we do pre-training and then linear probing on the target dataset.
Table 5 presents the comparison. Our multi-positive objective, which defines images as the same class if they are generated by the same caption, achieves the best performance. It outperforms supervised cross-entropy training and SimCLR by 3.4% and 11.7% in top-1 accuracy on ImageNet linear evaluation, and by 3.5% and 10.6% on fine-grained classification tasks. Besides, our objective does not require balance between samples from a fixed set of classes, making it easier to scale up.

4.2. Scaling up

After we have ablated the different components, we scale up our experiments. Specifically, we synthesize a dataset of 150M captions, called SynCaps-150M, from which we generate 600M images. We train both ViT-B/16 and ViT-L/14 (no SwiGLU [73] or LayerScale [82]), and extend the training schedules to 500k steps with a batch size of 8192 captions. We use 224x224 resolution for all pre-training tasks.

We compare SynCLR with OpenAI's CLIP [62], OpenCLIP [16], and DINO v2 [59], which represent learning from data. We note that the ViT-B/14 and ViT-L/14 models from DINO v2 are distilled from a ViT-g [93] model, which makes DINO v2 advantageous in our comparison. We also include StableRep [81], which uses the hybrid paradigm.

method    | text | img  | # imgs | arch     | ImageNet | Aircraft | Cars | DTD  | Flowers | Pets | SUN397 | Caltech-101 | Food-101 | VOC2007 | Average
StableRep | real | syn  | 100M   | ViT-B/16 | 75.7     | 59.2     | 83.5 | 80.1 | 97.3    | 88.3 | 74.3   | 94.7        | 85.1     | 87.9    | 83.4
CLIP      | real | real | 400M   | ViT-B/16 | 80.2     | 59.5     | 86.7 | 79.2 | 98.1    | 93.1 | 78.4   | 94.7        | 92.8     | 89.2    | 85.7
CLIP      | real | real | 400M   | ViT-L/14 | 83.9     | 69.4     | 90.9 | 82.1 | 99.2    | 95.1 | 81.8   | 96.5        | 95.2     | 89.6    | 88.9
OpenCLIP  | real | real | 400M   | ViT-B/16 | 78.9     | 61.1     | 92.3 | 81.9 | 98.2    | 91.5 | 77.9   | 95.2        | 90.9     | 88.0    | 86.3
OpenCLIP  | real | real | 400M   | ViT-L/14 | 82.3     | 67.1     | 94.0 | 83.6 | 98.8    | 92.5 | 81.0   | 96.4        | 93.4     | 88.8    | 88.4
OpenCLIP  | real | real | 2B     | ViT-L/14 | 83.4     | 71.7     | 95.3 | 85.3 | 99.0    | 94.2 | 82.2   | 97.5        | 94.1     | 88.9    | 89.8
DINO v2*  | -    | real | 142M   | ViT-B/14 | 83.9†    | 79.4     | 88.2 | 83.3 | 99.6    | 96.2 | 77.3   | 96.1        | 92.8     | 88.2    | 89.0
DINO v2*  | -    | real | 142M   | ViT-L/14 | 85.7†    | 81.5     | 90.1 | 84.0 | 99.7    | 96.6 | 78.7   | 97.5        | 94.3     | 88.3    | 90.1
SynCLR    | syn  | syn  | 600M   | ViT-B/16 | 80.7     | 81.7     | 93.8 | 79.9 | 99.1    | 93.6 | 76.2   | 95.3        | 91.6     | 89.4    | 89.0
SynCLR    | syn  | syn  | 600M   | ViT-L/14 | 83.0     | 85.6     | 94.2 | 82.1 | 99.2    | 94.1 | 78.4   | 96.1        | 93.4     | 90.3    | 90.4

Table 6. Comparison on ImageNet linear evaluation and fine-grained classification. SynCLR achieves comparable results with OpenAI's CLIP and DINO v2 models, despite only using synthetic data. *DINO v2 models are distilled from a ViT-g model, and thus advantageous in this comparison. † we rerun using only the cls token instead of concatenating multiple layers as presented in the original DINO v2 paper [59].

ImageNet linear evaluation. For a fair comparison, the cls token from the last block is used as the representation across all models (whereas in DINO v2, results are obtained by concatenating multiple layers). As shown in Table 6, SynCLR achieves 80.7% with ViT-B and 83.0% with ViT-L. This is similar to CLIP, but still lags behind DINO v2 by 3.2% and 2.7%, respectively, partially because of the extra distillation in DINO v2. We note SynCLR has already outperformed other self-supervised methods pre-trained directly on ImageNet-1k (e.g., DINO achieves 78.2% with ViT-B/16 and iBOT reaches 81.0% with ViT-L/16).
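As an illustration of this evaluation protocol (frozen backbone, cls-token feature, linear classifier), a minimal sketch could look like the following; the backbone loader, feature dimension, and training hyper-parameters are assumptions and do not correspond to a released SynCLR API.

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Linear classifier on top of a frozen backbone's cls-token embedding."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 768, num_classes: int = 1000):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False            # the backbone stays frozen
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, images):
        with torch.no_grad():
            feats = self.backbone(images)      # assumed to return the cls token, (B, feat_dim)
        return self.head(feats)

# Training-loop sketch: only the linear head receives gradients.
# probe = LinearProbe(vit_b16_backbone)
# opt = torch.optim.SGD(probe.head.parameters(), lr=0.1, momentum=0.9)
# loss = nn.functional.cross_entropy(probe(images), labels)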

method    | pre-train data       | distill | ViT-B | ViT-L
StableRep | hybrid, 100M         |         | 49.4  | -
MoCo v3   | real, IN1K-1M        |         | 47.3  | 49.1
BEiT      | real, IN1K-1M+DALLE  |         | 47.1  | 53.3
MAE       | real, IN1K-1M        |         | 48.1  | 53.6
iBOT      | real, IN1K-1M        |         | 50.0  | -
CLIP      | real, WIT-400M       |         | 52.6  | -
BEiT v2   | real, WIT-400M, IN1K | ✓       | 53.1  | 56.7
DINO v2   | real, LVD-142M       | ✓       | 54.4† | 57.5†
SynCLR    | synthetic, 600M      |         | 54.3  | 57.7†

Table 7. ADE20K semantic segmentation (mIoU) using UperNet, with a single scale at 512x512 resolution. † use a patch size of 14x14, thus adapted to 518x518 resolution.

Fine-grained classification. On the nine fine-grained datasets we have evaluated in Table 6, SynCLR achieves very similar average accuracy to DINO v2, e.g., 89.0% vs. 89.0% for ViT-B, and 90.1% vs. 90.4% for ViT-L. Both SynCLR and DINO v2 have curated the pre-training data to include the distribution of these datasets (but in different ways and portions), and end up with similar performance. Interestingly, SynCLR outperforms the others on Aircraft and Cars, possibly because we favor more frequent sampling towards them. This can be an advantage of synthetic data when we know what downstream tasks need to be solved. Besides, SynCLR outperforms CLIP and StableRep by 3.3% and 5.6% for ViT-B, respectively.

Semantic segmentation. To evaluate the pixel-level understanding ability of SynCLR, we fine-tune the pre-trained models on ADE20k [97], following the setup in [5, 39]. UperNet [89] is used as the task layer, and we evaluate with a single scale, i.e., 512x512. Besides CLIP and DINO v2, we also compare to self-supervised methods pre-trained on ImageNet, as well as BEiT v2 [61], which distills from CLIP. Table 7 shows that our SynCLR outperforms self-supervised methods trained on IN-1k by a clear margin, e.g., 4.3 higher mIoU than iBOT. Despite not involving a high-resolution pre-training period like DINO v2 (e.g., 518x518), SynCLR performs similarly to DINO v2 (0.1 lower for ViT-B, possibly because DINO v2 uses a smaller patch size of 14x14, but 0.2 higher for ViT-L). This suggests SynCLR pre-training is suitable for dense prediction tasks.
ImageNet fine-tuning. We evaluate the fine-tuning transfer ability of SynCLR on ImageNet. Our SynCLR achieves 87.9% top-1 accuracy with ViT-L, outperforming models trained on ImageNet images or large-scale image datasets. Specifically, SynCLR outperforms OpenCLIP ViT-L (87.1% top-1) trained on Laion-2B, which is the dataset Stable Diffusion (the text-to-image model we used) is trained on. This contrasts with [26, 69], which show that directly training a classifier on synthetic images yields poor classification accuracy. Our finding suggests synthetic images are good for training representations, which can later be easily adapted to a downstream task with a limited amount of real data. Detailed comparisons are provided in Appendix C.

PCA visualization. Following the method used in DINO v2 [59], we present visualizations derived from Principal Component Analysis (PCA) conducted on patch features extracted using our model, SynCLR. As depicted in Figure 5, a comparative analysis is conducted between SynCLR and DINO v2, both utilizing the ViT-L/14 architecture. The results demonstrate that SynCLR effectively accentuates the features of cars and planes, while efficiently minimizing background clutter.

Figure 5. PCA visualization. Following DINO v2 [59], we compute a PCA between the image patches from the same set and colorize them by their first 3 components. Compared to DINO v2, SynCLR produces more accurate maps for cars (e.g., zoom in to see the two bars on the roof of the first car, and the three side windows of the third car) and airplanes (e.g., the boundaries), while being slightly worse for dogs (e.g., heads). We use ViT-L/14 for both methods. Images are resized to 336x448 resolution, yielding 24x32 visualization grids.

5. Discussions and Conclusion

Why learn from generative models? One compelling reason is that a generative model can act like hundreds of datasets simultaneously. Traditionally, researchers have to spend separate effort collecting datasets for different image categories, e.g., cars, flowers, cats, dogs, and so on. DINO v2 [59] achieves robust representations by curating and amalgamating numerous such datasets. Such a process introduces complexities such as clustering and search challenges. In contrast, advanced text-to-image generative models like Stable Diffusion [63] or Imagen [68] have the capability to generate many diverse datasets. These models provide the flexibility to produce an infinite number of samples (albeit with finite diversity) and to control the generation process through textual input. Thus, generative models offer a convenient and effective way to curate training data. In our study, we harness this advantage to synthesize images encompassing a broad spectrum of visual concepts.

What can be further improved? Enhanced caption sets can be achieved through various methods, such as enriching the set of in-context examples, optimizing the sampling ratios among different concepts, and utilizing more advanced LLMs. In terms of the learning process, one approach is to distill knowledge from a larger model, and to incorporate an additional high-resolution training phase (as discussed in [59]) or an intermediate IN-21k fine-tuning stage (as in [5, 61]). Regarding architectural improvements, the integration of SwiGLU and LayerScale, coupled with superior model initialization strategies (referenced in [28]), can be beneficial. However, due to limited resources and the scope of this paper not being focused on achieving the highest possible metrics, we propose these areas for further exploration in future research endeavors.

In summary, this paper studies a new paradigm for visual representation learning: learning from generative models. Without using any real data, SynCLR learns visual representations that are comparable with those achieved by state-of-the-art general-purpose visual representation learners.
References

[1] Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars Mescheder, Andreas Geiger, and Carsten Rother. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. IJCV, 2018.
[2] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. In ECCV, 2022.
[3] Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mohammad Norouzi, and David J Fleet. Synthetic data from diffusion models improves imagenet classification. arXiv preprint arXiv:2304.08466, 2023.
[4] Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. In ICML, 2022.
[5] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
[6] Suzanna Becker and Geoffrey E Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 1992.
[7] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In ECCV, 2014.
[8] Emmanuel Asiedu Brempong, Simon Kornblith, Ting Chen, Niki Parmar, Matthias Minderer, and Mohammad Norouzi. Denoising pretraining for semantic segmentation. In CVPR, 2022.
[9] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020.
[10] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
[11] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
[12] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
[13] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[14] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, 2021.
[15] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. In CVPR, 2019.
[16] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In CVPR, 2023.
[17] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, 2014.
[18] Kevin Clark and Priyank Jaini. Text-to-image diffusion models are zero-shot classifiers. arXiv preprint arXiv:2303.15233, 2023.
[19] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2013.
[20] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In ICML, 2023.
[21] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[22] Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. NeurIPS, 2019.
[23] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[24] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[25] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010.
[26] Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, and Yonglong Tian. Scaling laws of synthetic images for model training ... for now. arXiv preprint arXiv:2312.04567, 2023.
[27] Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving clip training with language rewrites. In NeurIPS, 2023.
[28] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331, 2023.
[29] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In CVPR, 2023.
[30] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPR, 2004.
[31] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[32] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[33] Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In ICCV, 2019.
[34] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent - a new approach to self-supervised learning. In NeurIPS, 2020.
[35] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
[36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[37] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking imagenet pre-training. In ICCV, 2019.
[38] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[39] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
[40] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. Is synthetic data from generative models ready for image recognition? arXiv preprint arXiv:2210.07574, 2022.
[41] Ali Jahanian, Xavier Puig, Yonglong Tian, and Phillip Isola. Generative models as a data source for multiview representation learning. arXiv preprint arXiv:2106.05258, 2021.
[42] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
[43] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In NeurIPS, 2020.
[44] Jonathan Krause, Jia Deng, Michael Stark, and Li Fei-Fei. Collecting a large-scale dataset of fine-grained cars. Tech report, 2013.
[45] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012.
[46] Varun Kumar, Ashutosh Choudhary, and Eunah Cho. Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245, 2020.
[47] Alexander C Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, and Deepak Pathak. Your diffusion model is secretly a zero-shot classifier. arXiv preprint arXiv:2303.16203, 2023.
[48] Tianhong Li, Huiwen Chang, Shlok Mishra, Han Zhang, Dina Katabi, and Dilip Krishnan. Mage: Masked generative encoder to unify representation learning and image synthesis. In CVPR, 2023.
[49] Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollar, Kaiming He, and Ross Girshick. Benchmarking detection transfer learning with vision transformers. arXiv preprint arXiv:2111.11429, 2021.
[50] Hao Liu, Tom Zahavy, Volodymyr Mnih, and Satinder Singh. Palm up: Playing in the latent manifold for unsupervised pretraining. arXiv preprint arXiv:2210.10913, 2022.
[51] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[52] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
[53] Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han. Generating training data with language models: Towards zero-shot language understanding. arXiv preprint arXiv:2202.04538, 2022.
[54] Masato Mimura, Sei Ueno, Hirofumi Inaguma, Shinsuke Sakai, and Tatsuya Kawahara. Leveraging sequence-to-sequence speech synthesis for enhancing acoustic-to-word speech recognition. In SLT, 2018.
[55] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
[56] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
[57] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[58] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[59] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[60] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, 2012.
[61] Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, and Furu Wei. Beit v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022.
[62] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[63] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[64] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[65] Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye Jia, Pedro Moreno, Yonghui Wu, and Zelin Wu. Speech recognition with augmented synthesized speech. In ASRU, 2019.
[66] Nick Rossenbach, Albert Zeyer, Ralf Schlüter, and Hermann Ney. Generating synthetic audio data for attention-based speech recognition systems. In ICASSP, 2020.
[67] Yangjun Ruan, Saurabh Singh, Warren Morningstar, Alexander A Alemi, Sergey Ioffe, Ian Fischer, and Joshua V Dillon. Weighted ensemble self-supervised learning. arXiv preprint arXiv:2211.09981, 2022.
[68] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
[69] Mert Bulent Sariyildiz, Karteek Alahari, Diane Larlus, and Yannis Kalantidis. Fake it till you make it: Learning transferable representations from synthetic imagenet clones. In CVPR, 2023.
[70] Saurabh Saxena, Charles Herrmann, Junhwa Hur, Abhishek Kar, Mohammad Norouzi, Deqing Sun, and David J Fleet. The surprising effectiveness of diffusion models for optical flow and monocular depth estimation. arXiv preprint arXiv:2306.01923, 2023.
[71] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
[72] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In CVPR Workshops, 2014.
[73] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
[74] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 2017.
[75] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[76] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your vit? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.
[77] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models, 2023.
[78] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
[79] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? In NeurIPS, 2020.
[80] Yonglong Tian, Olivier J Henaff, and Aäron van den Oord. Divide and contrast: Self-supervised learning from uncurated data. In ICCV, 2021.
[81] Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, and Dilip Krishnan. Stablerep: Synthetic images from text-to-image models make strong visual representation learners. In NeurIPS, 2023.
[82] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In ICCV, 2021.
[83] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[84] Gul Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In CVPR, 2017.
[85] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020.
[86] Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. In CVPR, 2022.
[87] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
[88] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[89] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.
[90] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. In CVPR, 2022.
[91] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In CVPR, 2023.
[92] Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. Generative data augmentation for commonsense reasoning. arXiv preprint arXiv:2004.11546, 2020.
[93] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, 2022.
[94] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
[95] Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, and Jiwen Lu. Unleashing text-to-image diffusion models for visual perception. arXiv preprint arXiv:2303.02153, 2023.
[96] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
[97] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. IJCV, 2019.
[98] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.
[99] Yongchao Zhou, Hshmat Sahak, and Jimmy Ba. Training on thin air: Improve image classification with generated data. arXiv preprint arXiv:2305.15316, 2023.
