This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.

Learning Video Representations from Large Language Models

Yue Zhao1,2*   Ishan Misra1   Philipp Krähenbühl2   Rohit Girdhar1
1 FAIR, Meta AI   2 University of Texas, Austin
[Link]/LaViLa

Abstract

We introduce LAVILA, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-language embedding learned contrastively with these narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LAVILA obtains an absolute gain of 10.1% on EGTEA classification and 5.9% on the Epic-Kitchens-100 multi-instance retrieval benchmark. Furthermore, LAVILA trained with only half the narrations from the Ego4D dataset outperforms models trained on the full set, and shows positive scaling behavior on increasing pre-training data and model size.

[Figure 1: radar chart comparing LAVILA (Ours) against the previous SOTA on CharadesEgo Recognition (mAP), EK-100 Multi-Instance Retrieval (mAP), EK-100 Multi-Instance Retrieval (nDCG), EK-100 Recognition (top-1 action acc.), HMDB-51 Recognition (linear probing mean acc.), UCF-101 Recognition (linear probing mean acc.), EGTEA Recognition (mean acc.), and Ego4D EgoMCQ (intra-video acc.).]
Figure 1. LAVILA sets a new state-of-the-art across a number of first- and third-person video understanding tasks (cf. Table 1 for details), by learning a video-language representation using supervision from large language models as narrators.

1. Introduction

Learning visual representation using web-scale image-text data is a powerful tool for computer vision. Vision-language approaches [31, 49, 80] have pushed the state-of-the-art across a variety of tasks, including zero-shot classification [49], novel object detection [87], and even image generation [52]. Similar approaches for videos [4, 39, 46], however, have been limited by the small size of paired video-text corpora compared to the billion-scale image-text datasets [31, 49, 84], even though access to raw video data has exploded in the past decade. In this work, we show it is possible to automatically generate text pairing for such videos by leveraging Large Language Models (LLMs), thus taking full advantage of the massive video data. Learning video-language models with these automatically generated annotations leads to stronger representations, and as Figure 1 shows, sets a new state-of-the-art on six popular first- and third-person video benchmarks.

Our method, called LAVILA: Language-model augmented Video-Language pre-training, leverages pre-trained LLMs, e.g. GPT-2 [50], which encode within their weights a treasure trove of factual knowledge and conversational ability. As shown in Figure 2, we repurpose these LLMs to be "visually-conditioned narrators", and finetune them on all accessible paired video-text clips. Once trained, we use the model to densely annotate thousands of hours of videos by generating rich textual descriptions. This pseudo-supervision can thus pervade the entire video, in between and beyond the annotated snippets. Paired with another LLM trained to rephrase existing narrations, LAVILA is able to create a much larger and more diverse set of text targets for video-text contrastive learning. In addition to setting a new state-of-the-art as noted earlier, the stronger representation learned by LAVILA even outperforms prior work using only half the ground-truth annotations (Figure 5).

LAVILA's strong performance can be attributed to a number of factors. First, LAVILA can provide temporally dense supervision for long-form videos, where the associated captions are either too sparse, or the video-level "Alt-Text" (in the case of web videos) does not describe all the nuanced activities happening in it. Second, the generated text is well-aligned with the visual input. Although prior work has leveraged automatic speech transcription on How-To videos [45] to automatically extract clips paired with text from the speech, such datasets have relatively poor alignment between the visual and textual content (≤ 50%, cf. [25, 45]), limiting the quality of the learned representations. Third, LAVILA can significantly expand annotations when only a little is available. For instance, videos of mundane day-to-day activities, especially from an egocentric viewpoint, could be very useful for assistive and augmented reality applications. Such videos, however, are rare on the internet, and hence do not readily exist with associated web text. Recent work [24] instead opted to manually capture and narrate such video data. These narrations, however, required significant manual effort: 250K hours of annotator time were spent narrating 3.6K hours of video. In contrast, LAVILA is able to automatically narrate each video multiple times and far more densely, and hence learns much stronger representations.

We extensively evaluate LAVILA across multiple video-text pre-training datasets and downstream tasks to validate its effectiveness. Specifically, after being pre-trained on Ego4D, the largest egocentric video dataset with narrations, LAVILA can re-narrate the whole dataset 10x over. The resulting model learned on these expanded narrations sets a new state-of-the-art on a wide range of downstream tasks across challenging datasets, including multi-instance video retrieval on Epic-Kitchens-100 (5.9% absolute gain on mAP), multiple-choice question answering on Ego4D (5.9% absolute gain on intra-video accuracy), and action recognition on EGTEA (10.1% absolute gain on mean accuracy). It obtains gains both when evaluated for zero-shot transfer to the new dataset, as well as after fine-tuning on that dataset. Similar gains are shown on third-person video data. When training LAVILA after densely re-narrating HowTo100M, we outperform prior work on downstream action classification on UCF-101 and HMDB-51. In a case study of semi-supervised learning, we show that our model, which only ever sees 50% of the human-labeled data, is capable of outperforming the baseline model trained with all the narrations. Moreover, the gains progressively increase as we go to larger data regimes and larger backbones, suggesting the scalability of our method.

* Work done during an internship at Meta.

[Figure 2: a long egocentric video over time. In conventional video-language representation learning, sparse human narrations ("C looks around the open space.", "C operates the phone.") or ASR feed the dual-encoder model. In LAVILA, a large LM NARRATOR densely narrates the same video ("A lady walks past a car.", "C walks on the pavement.", "A woman converses with C.", "C takes a selfie with the phone.", "A man walks towards a building.") before training the dual-encoder model.]
Figure 2. LAVILA leverages Large Language Models (LLMs) to densely narrate long videos, and uses those narrations to train strong dual-encoder models. While prior work uses sparsely labeled text by humans, or weakly aligned text transcribed from speech, LAVILA is able to leverage dense, diverse, and well-aligned text generated by an LLM.

2. Related Work

Vision-language representation learning maps visual and textual embeddings into a common space using metric-learning techniques [21, 73]. Recently, different pretext tasks have been proposed to learn a finer-grained association between the visual and textual modalities, e.g. masked language modeling (MLM) [10, 41, 62] and captioning [16, 80]. Another line of research focuses on scaling up both models and pre-training data. For instance, CLIP [49] is pre-trained on 400M image-text pairs with a contrastive loss (InfoNCE [48, 59]), while CoCa [80] unifies contrastive and generative approaches within a single foundation model. Similar trends are also witnessed in the video-text domain [36, 64, 88]. However, collecting high-quality video-text data is more difficult than image-text. Therefore, efforts have been made to learn from uncurated videos with machine-generated audio transcripts via contrastive learning [44, 77, 82] or unsupervised alignment [25], while other works focus on either adapting well-performing image-text models to videos [32, 40, 47, 78], or curriculum learning from a single frame to multiple frames [4]. In contrast, our approach leverages language models to generate temporally dense textual supervision for long-form videos.

Generative Visual Language Models (VLMs) were first used for image/video captioning with recurrent networks [17, 68] and Transformer-based architectures [42, 56]. More recently, generative VLMs have unified multiple vision tasks [11, 89] by training multi-modal Transformers on visual-text pairs [30, 81]. Meanwhile, generative VLMs also excel at multimodal tasks via zero-shot or few-shot prompting [1, 65, 83] by leveraging multi-billion-parameter LLMs pre-trained on massive text corpora [7, 28, 50]. In our work, we demonstrate that generative VLMs can narrate long videos and that the resulting video-text data benefits video-language representation learning.
Large-scale multimodal video datasets are crucial for video understanding tasks but are hard to collect. Conventional video-text datasets [8, 53, 86] either cover limited scenarios, e.g. cooking, or are not large enough to learn generic video representations. Miech et al. [45] scrape over 100 million video-text pairs via automatic audio transcription from long-form How-To videos. However, ASR introduces textual noise and visual-text misalignment [25]. WebVid [4] contains 10 million short videos with textual descriptions, but it is still several orders of magnitude smaller than its image counterparts [49, 54] and is harder to scale up since it is sourced from stock footage sites. The recently released Ego4D [24] dataset offers 3,600 hours of egocentric videos in which written sentence narrations are manually annotated every few seconds, but this requires significant manual effort. In contrast, our method shows a promising alternative by automatically narrating videos using supervision from an LLM.

Data augmentation techniques in NLP, including word-level replacement based on synonyms [72, 85] or nearest-neighbor retrieval [19, 70], improve text classification accuracy. We refer readers to [20] for a comprehensive survey. In this paper, we show that sentence-level paraphrasing based on text-to-text models [51] is helpful for video-language pre-training.

[Figure 3: qualitative examples of human narrations (e.g. "C separates the yarn with both hands.", "C wipes the countertop with a sponge.") alongside REPHRASER paraphrases and NARRATOR re-captions such as "C divides the yarn by a few stitches.", "C stretches the thread.", "C pulls out the yarn with her right hand.", "C lifts container.", "C raises the container.", and "C moves the container."]
Figure 3. Generated samples by NARRATOR and REPHRASER. NARRATOR generates new descriptions of the action taking place, potentially focusing on other objects being interacted with. REPHRASER not only changes the word order of the human narration but also diversifies it by using related verbs or nouns.

3. Preliminaries

A video V is a stream of moving images I. The number of frames |V| can be arbitrarily long, while video models typically operate on shorter clips, often in the range of a few seconds. Therefore, we skim through a long-form video and represent it by a set of N short clips, i.e. X. Each clip x_i is defined by a specific start and end frame, x_i = {I_{t_i}, ..., I_{e_i}}, where 0 < t_i < e_i <= |V|, and is typically associated with some annotation y_i. This annotation could be a class label or a free-form textual description of the clip. We denote a video by the set of annotated clips with their corresponding annotations, i.e. (X, Y) = {(x_1, y_1), ..., (x_N, y_N)}. Note that the annotated clips often cannot densely cover the entire video due to the annotation cost and visual redundancy, i.e. \bigcup_i [t_i, e_i] \subsetneq [0, |V|].

A typical video model F(X, Y) learns from these clip-level annotations using a standard training objective such as a cross-entropy loss when the annotations are class labels with a fixed vocabulary. More recently, however, dual-encoder-based contrastive approaches like CLIP [49, 77] have gained popularity. They work with free-form textual annotations, which are tokenized [55] into sequences of discrete symbols, i.e. y = (s_1, s_2, ..., s_L) \in \{1, 0\}^{|S| \times L}. The model consists of a visual encoder f_v : \mathbb{R}^{T \times 3 \times H \times W} \mapsto \mathbb{R}^{D_v} plus a projection head h_v : \mathbb{R}^{D_v} \mapsto \mathbb{R}^{d}, and a textual encoder f_t : \{1, 0\}^{|S| \times L} \mapsto \mathbb{R}^{D_t} plus a projection head h_t : \mathbb{R}^{D_t} \mapsto \mathbb{R}^{d}, applied in parallel to obtain the global visual and textual embeddings respectively:

v = h_v(f_v(x)),  u = h_t(f_t(y)).

A contrastive loss, such as InfoNCE [48], learns global embeddings that associate corresponding video and text embeddings within a batch of samples B,

\frac{1}{|B|} \sum_{(x,y) \in B} \big( \mathrm{InfoNCE}(v, u) + \mathrm{InfoNCE}(u, v) \big).    (1)
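For concreteness, a minimal PyTorch-style sketch of the symmetric contrastive objective in Eq. (1) is given below. It assumes the projected embeddings v and u are already L2-normalized and uses a fixed temperature hyperparameter (CLIP [49] learns it); function and variable names are ours, not the released code.

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(v, u, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video (v) and text (u) embeddings.

    v, u: tensors of shape (B, d), assumed L2-normalized.
    Matching pairs sit on the diagonal of the similarity matrix.
    """
    logits = v @ u.t() / temperature                  # (B, B) similarity scores
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)       # InfoNCE(v, u), averaged over the batch
    loss_t2v = F.cross_entropy(logits.t(), targets)   # InfoNCE(u, v), averaged over the batch
    # cross_entropy already averages over B, so the sum matches Eq. (1).
    return loss_v2t + loss_t2v
```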
4. LAVILA

In LAVILA, we leverage large language models (LLMs) as supervision to train the dual-encoder model, where the LLMs serve as vision-conditioned narrators and automatically generate textual descriptions from video clips. In particular, we exploit supervision from two LLMs: (1) NARRATOR (§ 4.1) is a visually-conditioned LLM that pseudo-labels existing and new clips with narrations, generating new annotations (X', Y'). (2) REPHRASER (§ 4.2) is a standard LLM that paraphrases narrations in existing clips, augmenting those annotations to (X, Y''). As illustrated in Figure 3, NARRATOR generates new descriptions of the action taking place, potentially focusing on other objects being interacted with. REPHRASER serves to augment the text input, e.g. it changes the word order of the human narration and additionally replaces common verbs or nouns, making annotations more diverse. Finally, we train the dual-encoders (§ 4.3) on all these annotations combined, i.e. (X, Y) ∪ (X', Y') ∪ (X, Y'').

4.1. NARRATOR
Traditional LLMs, such as GPT-2 [50], are trained to generate a sequence of text tokens (s_1 ... s_L) from scratch by modeling the probability of the next token given all tokens seen so far: p(s_\ell | s_{<\ell}). NARRATOR repurposes existing LLMs to be conditioned on the visual input and is trained on the original annotations (X, Y). The resulting model produces dense new annotations (X', Y') on the full video. Following the formulation of factorized probabilities in language models [5], we model the visually conditioned text likelihood as follows:

p_{\mathrm{NARRATOR}}(y' \mid x') = \prod_{\ell=1}^{L} p(s'_\ell \mid s'_{<\ell}, x').    (2)

[Figure 4: architectures of the two language models. REPHRASER (left) is an encoder-decoder model that maps, e.g., "paraphrase: C stirs gravy sauce in the pan </s>" to "<s> C stirs food in the cooking pan </s>". NARRATOR (right) feeds video frames through a video encoder with attention pooling, and inserts tanh-gated cross-attention modules (cross-attention + FFN) before each GPT2Block of the text decoder, which autoregressively generates, e.g., "<s> C holds a spatula </s>".]
Figure 4. Language supervision from REPHRASER and NARRATOR. REPHRASER (left) takes the narration as input, passes it through a text encoder and uses a text decoder to autoregressively generate the rephrased output. NARRATOR (right) takes video frames as input and obtains the visual embeddings through a video encoder followed by attentional pooling. Equipped with a few additional cross-attention modules, the text decoder autoregressively generates new narrations for those new frames.

Architecture. We design NARRATOR to closely follow the architecture of standard LLMs, with only a few additional cross-attention modules added to provide visual conditioning, as illustrated in Figure 4 (right). This enables NARRATOR to be initialized from pre-trained weights, which is crucial for our task, since the data we use to train NARRATOR (narrations associated with video clips) is far smaller in scale than the large text corpora typically used to train LLMs. Moreover, video narrations are less diverse and noisier because they are either collected by only a few annotators or automatically transcribed from speech. Similar "frozen-LM" approaches have shown effectiveness for multimodal few-shot adaptation in recent work [1, 65].

Specifically, we take a frozen pre-trained LLM and add a cross-attention module before each Transformer decoder layer so that the text input can attend to visual information. The cross-attended output is then summed with the input text feature via a residual connection [26] and passed to the Transformer decoder layer. Each cross-attention module comprises a cross-attention layer, which takes textual tokens as queries and visual embeddings as keys and values, followed by a feed-forward network (FFN). Layer Normalization [3] is applied at the beginning of both the cross-attention and the FFN. We add tanh-gating [27], with an initial value of zero, such that the output of the new model is the same as that of the original language model at the start of training.
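The gated cross-attention insertion described above can be sketched as follows. This is an illustrative re-implementation under stated assumptions (pre-norm placement, a standard multi-head attention layer, hidden sizes passed as arguments), not the released LAVILA code; all module and variable names are ours.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Cross-attention + FFN inserted before a frozen Transformer decoder layer.

    Text tokens attend to visual tokens; tanh gates are initialized to zero so
    the block acts as an identity at the start of training (cf. Sec. 4.1).
    """
    def __init__(self, d_model, n_heads, d_visual):
        super().__init__()
        self.norm_attn = nn.LayerNorm(d_model)            # LN at the start of cross-attention
        self.cross_attn = nn.MultiheadAttention(
            d_model, n_heads, kdim=d_visual, vdim=d_visual, batch_first=True)
        self.alpha_attn = nn.Parameter(torch.zeros(1))    # tanh-gate, starts at 0

        self.norm_ffn = nn.LayerNorm(d_model)             # LN at the start of the FFN
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.alpha_ffn = nn.Parameter(torch.zeros(1))     # tanh-gate, starts at 0

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (B, L, d_model); visual_tokens: (B, M, d_visual)
        q = self.norm_attn(text_tokens)                   # textual queries
        attn_out, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.alpha_attn) * attn_out          # gated residual
        x = x + torch.tanh(self.alpha_ffn) * self.ffn(self.norm_ffn(x))   # gated FFN residual
        return x   # fed into the (frozen) decoder layer that follows
```

Because both gates start at zero, stacking these blocks in front of the frozen GPT-2 layers leaves the language model's outputs unchanged before finetuning, which is the property the paragraph above describes.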
While features from any video model are applicable for conditioning, for convenience we adopt the video encoder from F in § 3, trained contrastively on the ground-truth data (X, Y). We use features before global pooling to allow the LLM to leverage fine-grained spatio-temporal information.

Training. We train NARRATOR on all of, or a subset of, the ground-truth annotations (X, Y). For each pair (x, y), the captioning loss is the sum of the negative log-likelihoods of the correct word at each step,

\mathcal{L}_{\mathrm{NARRATOR}}(x, y) = - \sum_{\ell=1}^{L} \log p(s_\ell \mid s_{<\ell}, x).    (3)

Inference. At inference time, we query NARRATOR by feeding the visual input x plus a special start-of-sentence token <s>. We sample from the distribution recursively, i.e. \tilde{s}_\ell \sim p(s \mid [\text{<s>}, \dots, \tilde{s}_{\ell-1}], x), until an end-of-sentence token </s> is reached. Following [29], at each step we sample from a subset of tokens that contains the vast majority of the probability mass, which is known as nucleus sampling. The effect of nucleus sampling is two-fold. On the one hand, it generates more diverse, open-ended, and human-like text than maximum-likelihood-based methods such as beam search and its variants [67]. On the other hand, the generated text may contain irrelevant or noisy information, since we sample without post-processing based on sentence-level likelihood. To address this, we repeat the sampling process K times on the same visual input. We later demonstrate that the contrastive pre-training objective is robust to the noise caused by sampling, and that the final performance benefits from a more diverse set of narrations.
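A minimal sketch of this nucleus-sampling inference loop (top-p sampling repeated K times) is shown below. Here `narrator_step` is a hypothetical single-step interface to the visually conditioned decoder, and the <s>/</s> token ids are assumptions; only the p = 0.95, K-repeat procedure follows the paper.

```python
import torch

@torch.no_grad()
def nucleus_sample(narrator_step, visual_feats, bos_id=0, eos_id=1,
                   top_p=0.95, max_len=77, num_return=10):
    """Generate `num_return` candidate narrations for one clip via top-p sampling.

    narrator_step(tokens, visual_feats) -> tensor of shape (1, seq_len, vocab)
    with next-token logits (hypothetical interface to the NARRATOR).
    """
    candidates = []
    for _ in range(num_return):                          # K independent samples
        tokens = [bos_id]
        for _ in range(max_len):
            logits = narrator_step(torch.tensor([tokens]), visual_feats)[0, -1]
            probs = torch.softmax(logits, dim=-1)
            sorted_p, sorted_idx = probs.sort(descending=True)
            cum = sorted_p.cumsum(-1)
            keep = (cum - sorted_p) < top_p               # keep tokens until mass reaches top_p
            trimmed = torch.zeros_like(probs).scatter_(0, sorted_idx[keep], sorted_p[keep])
            next_tok = torch.multinomial(trimmed / trimmed.sum(), 1).item()
            if next_tok == eos_id:                        # stop at </s>
                break
            tokens.append(next_tok)
        candidates.append(tokens[1:])                     # drop <s>
    return candidates
```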
To sample video clips for captioning, we start by simply re-captioning the existing clips labeled in the dataset X, resulting in expanded annotations. Furthermore, long-form videos are typically sparsely narrated, meaning that the temporal union of all labeled clips cannot cover the entire video. Hence, we use NARRATOR to annotate the remainder of the video to obtain additional annotations by pseudo-captioning. With the simple assumption that video is a stationary process, we uniformly sample clips from the unlabeled intervals. The clip duration is set to the average duration of all ground-truth clips, i.e. \Delta = \frac{1}{N} \sum_{i=1}^{N} (e_i - t_i), and the sampling stride is computed likewise.
Finally, combining both re-captioned and pseudo-captioned narrations, we refer to the final set of annotations generated by NARRATOR as (X', Y').

Post-processing. Exhaustive pseudo-captioning may contain some uninformative visual clips and generate text that is not useful. Thus, we add a filtering process to eliminate low-quality clips and their associated descriptions. We use the baseline dual-encoder model F, which is trained on the ground-truth paired clips, to compute the visual and textual embeddings of pseudo-labeled pairs and filter based on the similarity score, i.e. Filter(f_v(x'_j)^\top \cdot f_t(y'_j)), where Filter(·) can be either a top-k selection over all generated text or a threshold filter. In the experiments, we use a threshold of 0.5.
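Putting the pseudo-captioning and filtering steps together, a rough sketch of the threshold variant might look as follows. All interfaces (narrator, encoders, clip layout) are hypothetical stand-ins passed in as arguments; only the 0.5 threshold and K = 10 candidates follow the paper.

```python
def filter_pseudo_captions(candidate_clips, narrator, encode_video, encode_text,
                           sim_threshold=0.5, num_candidates=10):
    """Keep only well-aligned (clip, text) pairs, as judged by the baseline dual encoder.

    candidate_clips: list of (frames, start, end) sampled uniformly from the unlabeled
    intervals, with duration equal to the average ground-truth clip length (Sec. 4.1).
    narrator(frames, k) -> list of k sampled narrations (hypothetical interface).
    encode_video / encode_text -> L2-normalized embeddings from the baseline dual encoder.
    """
    kept = []
    for frames, start, end in candidate_clips:
        texts = narrator(frames, k=num_candidates)
        v = encode_video(frames)          # shape (d,)
        u = encode_text(texts)            # shape (k, d)
        sims = (u @ v).tolist()           # cosine similarity, since embeddings are normalized
        kept.extend((start, end, t) for t, s in zip(texts, sims) if s > sim_threshold)
    return kept
```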
4.2. REPHRASER

The data generated by NARRATOR is several times larger than the ground-truth pairs. To ensure that we do not overfit to the pseudo-labeled data, we increase the number of ground-truth narrations by paraphrasing. In particular, we use a text-to-text LLM which models the conditional text likelihood:

p_{\mathrm{REPHRASER}}(y'' \mid y) = \prod_{\ell=1}^{L} p(s''_\ell \mid s''_{<\ell}, y).

The text-to-text model is implemented by an encoder-decoder architecture, e.g. T5 [51], which auto-regressively generates a new sentence given the original one. We observe that REPHRASER is able to perform basic manipulations such as replacing synonyms or changing word order, which serves as an efficient form of automatic data augmentation. The resulting annotations are referred to as (X, Y'').

4.3. Training the Dual-Encoders

We train the dual-encoders as described in Algorithm 1 in Appendix E. In each iteration, we first sample a batch B of video clips. It comprises a subset of clips B_l with labeled timestamps and narrations, and a subset B_u whose clips are randomly sampled from videos without narrations. For each clip x_i in B_u, we obtain a pseudo-caption y'_i by querying the NARRATOR, y'_i ~ p_NARRATOR(y' | x_i), resulting in a set of clips with LLM-generated narrations \tilde{B}_u. For each clip (x_i, y_i) in B_l, the text supervision is obtained from either the REPHRASER or the NARRATOR, each with a probability of 0.5. Hence, the effective number of iterations per epoch for LAVILA is the same as that for the baseline dual-encoder. We denote the resulting set of pairs by \tilde{B}_l. Following CLIP [49], we use the symmetric cross-entropy loss over the similarity scores of samples in the batch \tilde{B}_l ∪ \tilde{B}_u.

In practice, we run REPHRASER and NARRATOR in advance and cache the resulting video-narration pairs, so there is no computational overhead during pre-training. Therefore, training a dual-encoder in LAVILA is as fast as training a standard dual-encoder contrastive model.
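As a sketch of how one such batch might be assembled from the cached LLM outputs (the authoritative description is Algorithm 1 in Appendix E), the snippet below follows the 0.5 mixing probability stated above; the cache layout, clip fields, and helper names are hypothetical simplifications of ours.

```python
import random

def build_batch(labeled_clips, unlabeled_clips, rephrased, recaptioned, pseudo_captions):
    """Pick one text target per clip for a training batch (cf. Sec. 4.3).

    labeled_clips: clips with ground-truth narrations (subset B_l).
    unlabeled_clips: clips sampled from un-narrated video spans (subset B_u).
    rephrased / recaptioned / pseudo_captions: dicts mapping a clip id to lists of
    cached LLM outputs (hypothetical cache layout; the rephrased pool may also
    contain the original human narration).
    """
    batch = []
    for clip in labeled_clips:
        # With probability 0.5 use a REPHRASER paraphrase, otherwise a NARRATOR re-caption.
        pool = rephrased if random.random() < 0.5 else recaptioned
        batch.append((clip.frames, random.choice(pool[clip.id])))
    for clip in unlabeled_clips:
        # Un-narrated clips always take a cached NARRATOR pseudo-caption.
        batch.append((clip.frames, random.choice(pseudo_captions[clip.id])))
    return batch  # fed to the symmetric InfoNCE objective of Eq. (1)
```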
Datasets | Task | Ego? | Metrics | Eval. Prot.
EK-100 [14] | MIR | yes | mAP, nDCG | ZS, FT
EK-100 [14] | CLS | yes | top-1 action acc. | FT
Ego4D [24] | MCQ | yes | inter-/intra-video acc. | ZS
Ego4D [24] | NLQ | yes | Recall@N | FT
EGTEA [37] | CLS | yes | top-1, mean acc. | ZS, FT
CharadesEgo [58] | CLS | yes | video-level mAP | ZS, FT
UCF-101 [60] | CLS | no | mean acc. | LP
HMDB-51 [35] | CLS | no | mean acc. | LP

Table 1. Downstream datasets and metrics used for evaluation. We evaluate LAVILA on a wide range of tasks including Multi-Instance Retrieval (MIR), Multiple-Choice Question (MCQ), Natural Language Query (NLQ), and Action Recognition (CLS). The evaluation protocols include zero-shot (ZS), fine-tuning (FT), and linear-probing (LP). Please refer to Appendix C for more details.

5. Experiments

Dual-Encoder Architecture. The video-language model follows a dual-encoder architecture, as in CLIP [49]. The visual encoder is a TimeSformer (TSF) [6], whose spatial attention modules are initialized from a ViT [18] that is contrastively pre-trained on large-scale paired image-text data as in CLIP [49]. We sample 4 frames per clip during pre-training and 16 when finetuning on downstream tasks. The text encoder is a 12-layer Transformer [50, 66]. We use a BPE tokenizer [55] to pre-process the full sentence corresponding to the video clip and keep at most 77 tokens.

NARRATOR's architecture is a visually conditioned auto-regressive language model. The visual encoder is by default a TimeSformer-L, while the text decoder is a GPT-2 XL. During inference, we use nucleus sampling [29] with p = 0.95 and return K = 10 candidate outputs.

REPHRASER. We use an open-source paraphraser [23] based on T5-large [51]. It is pre-trained on C4 [51] and then finetuned on a cleaned subset of ParaNMT [74]. During inference, we use Diverse Beam Search [67] with the group number equal to the beam number (G = B = 20) and set the diversity penalty to 0.7. We keep 3 candidates per sentence, remove punctuation, and do basic de-duplication.
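As a concrete illustration of this setup, the sketch below paraphrases one narration with a T5-based paraphraser through the HuggingFace transformers generate API using diverse beam search. The checkpoint identifier is inferred from the open-source paraphraser cited above [23] and the "paraphrase:" prompt format from Figure 4; both should be treated as assumptions rather than the exact pipeline.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Checkpoint name assumed from [23]; the exact identifier may differ.
name = "ramsrigouthamg/t5-large-paraphraser-diverse-high-quality"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

narration = "C stirs gravy sauce in the pan"           # example narration from Figure 4
inputs = tokenizer("paraphrase: " + narration, return_tensors="pt")

# Diverse beam search with as many groups as beams (G = B = 20),
# diversity penalty 0.7, keeping 3 candidates per sentence (Sec. 5).
outputs = model.generate(
    **inputs,
    num_beams=20,
    num_beam_groups=20,
    diversity_penalty=0.7,
    num_return_sequences=3,
    max_length=64,
)
paraphrases = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```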
Pre-training dataset. We train on the video-narration pairs from Ego4D [13, 24], the largest egocentric video dataset to date. We exclude videos that appear in the validation and test sets of the Ego4D benchmark and determine each clip's interval using the same pairing strategy as in [39]. This results in around 4M video-text pairs with an average clip length of 1 second. We also experiment with third-person videos by pre-training on HowTo100M [45] in § 5.2.

Evaluation protocols. We evaluate the learned video-text encoders using three evaluation protocols. (1) Zero-Shot (ZS), meaning that we apply the pre-trained video-text encoders directly on the downstream validation dataset to perform video-to-text and text-to-video retrieval tasks, without any tuning on the downstream dataset.
Zero-shot classification is performed similarly, where we compute the similarity score between the video clip and the textual descriptions of all possible categories. (2) Finetuned (FT), where we take the pre-trained video-text model and perform end-to-end fine-tuning on the training split of the target downstream dataset. (3) Linear-Probe (LP), where we compute the video features from a frozen encoder and train a linear SVM on top of the training split of the downstream dataset.
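As an illustration of the zero-shot protocol, the sketch below classifies a clip by ranking the similarity between its embedding and the embeddings of textual class descriptions. The encoder handles are hypothetical stand-ins for the pre-trained dual encoder, not the evaluation code itself.

```python
import torch

@torch.no_grad()
def zero_shot_classify(clip_frames, class_names, encode_video, encode_text):
    """Zero-shot action recognition with a video-text dual encoder.

    class_names: textual descriptions of all categories (e.g. "open drawer").
    encode_video / encode_text: hypothetical handles to the pre-trained encoders,
    returning L2-normalized embeddings.
    """
    v = encode_video(clip_frames)          # (d,)
    u = encode_text(class_names)           # (num_classes, d)
    scores = u @ v                         # cosine similarity per class
    return class_names[int(scores.argmax())], scores
```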
Downstream benchmarks. We use multiple benchmarks across four first-person (egocentric) and two third-person datasets, as enumerated in Table 1. We summarize them here and refer the reader to Appendix C for details on datasets and metrics. (1) Two tasks on Epic-Kitchens-100: Multi-Instance Retrieval (EK-100 MIR) and Action Recognition (EK-100 CLS) [14]. EK-100 is a very popular and challenging egocentric video recognition benchmark. The MIR task requires retrieving text given videos (V->T) and videos given text (T->V). The CLS task requires classifying each video into one of 97 verbs and 300 nouns, resulting in a combination of 3,806 action categories. (2) Two downstream tasks of Ego4D: Multiple-Choice Questions (EgoMCQ) and Natural Language Query (EgoNLQ). EgoMCQ requires selecting the correct textual description from five choices given a query video clip, while EgoNLQ asks the model to output the relevant temporal interval of a video given a text query. We select these two benchmarks because they require reasoning about both visual and textual information. (3) Action Recognition on EGTEA [37], which requires classifying into 106 classes of fine-grained cooking activities. (4) Action Recognition on CharadesEgo [58], which requires classification into 157 classes of daily indoor activities. Note that CharadesEgo is very different from EK-100, Ego4D and EGTEA, since its videos are captured by head-mounted phone cameras in a crowd-sourced manner.

In all tables, we bold and underline the best and second-best performing methods with comparable backbone architectures. We highlight the overall best performing method, which typically uses a larger backbone, if applicable.

5.1. Main Results

EK-100. We compare LAVILA with prior work on EK-100 MIR in Table 2. In the zero-shot setup, LAVILA remarkably surpasses an improved version of EgoVLP [39] under similar model complexity: we use TSF-Base+GPT-2 as the dual-encoder architecture while EgoVLP uses TSF-Base+DistilBERT. With a stronger video encoder, i.e. TSF-Large, the performance improves further. In the finetuned setting, LAVILA significantly outperforms all previous supervised approaches, including MME, JPoSE [75] and EgoVLP [39]. We also compare LAVILA on EK-100 CLS in Appendix E, and establish a new state-of-the-art.

Method | Backbone | mAP V->T | mAP T->V | mAP Avg. | nDCG V->T | nDCG T->V | nDCG Avg.
(Zero-shot)
EgoVLP [39] | TSF-B | 19.4 | 13.9 | 16.6 | 24.1 | 22.0 | 23.1
EgoVLP* [39] | TSF-B | 26.0 | 20.6 | 23.3 | 28.8 | 27.0 | 27.9
LAVILA | TSF-B | 35.1 | 26.6 | 30.9 | 33.7 | 30.4 | 32.0
LAVILA | TSF-L | 40.0 | 32.2 | 36.1 | 36.1 | 33.2 | 34.6
(Finetuned)
MME [75] | TBN | 43.0 | 34.0 | 38.5 | 50.1 | 46.9 | 48.5
JPoSE [75] | TBN | 49.9 | 38.1 | 44.0 | 55.5 | 51.6 | 53.5
EgoVLP [39] | TSF-B | 49.9 | 40.5 | 45.0 | 60.9 | 57.9 | 59.4
LAVILA | TSF-B | 55.2 | 45.7 | 50.5 | 66.5 | 63.4 | 65.0
LAVILA | TSF-L | 54.7 | 47.1 | 50.9 | 68.1 | 64.9 | 66.5

Table 2. EK-100 MIR. LAVILA outperforms prior work across all settings, metrics and directions of retrieval, with larger gains when switching to a larger model. Specifically, our best model achieves over 10% absolute gain in the zero-shot setting and 5.9-7.1% gain in the finetuned setting. EgoVLP* refers to our improved version of [39], details of which are given in Appendix F.

Ego4D. We evaluate the pre-trained LAVILA model on the EgoMCQ and EgoNLQ tasks and compare the results in Table 3. On EgoMCQ, our method achieves 93.8% inter-video accuracy and 59.9% intra-video accuracy, outperforming EgoVLP by a noticeable margin. Note that EgoVLP's performance reported in Table 3 is obtained using the EgoNCE loss [39], a variant of InfoNCE specialized for Ego4D, while ours uses a standard InfoNCE loss. EgoVLP with InfoNCE has lower performance (89.4% inter-video and 51.5% intra-video accuracy). On EgoNLQ, LAVILA achieves comparable results with EgoVLP under similar model complexity.

Method | EgoMCQ Inter-video Acc. (%) | EgoMCQ Intra-video Acc. (%) | EgoNLQ R@1 (IoU=0.3) | R@5 (IoU=0.3) | R@1 (IoU=0.5) | R@5 (IoU=0.5)
SlowFast [24] | - | - | 5.45 | 10.74 | 3.12 | 6.63
EgoVLP [39] | 90.6 | 57.2 | 10.84 | 18.84 | 6.81 | 13.45
LAVILA (B) | 93.8 | 59.9 | 10.53 | 19.13 | 6.69 | 13.68
LAVILA (L) | 94.5 | 63.1 | 12.05 | 22.38 | 7.43 | 15.44

Table 3. Ego4D EgoMCQ and EgoNLQ. LAVILA outperforms prior work on both Multiple-Choice Questions and Natural Language Queries on Ego4D, with nearly 6% absolute gain on the challenging intra-video MCQ task, which requires reasoning over multiple clips from the same video to answer a question.

EGTEA. We evaluate the learned video representation by finetuning the video encoder for action classification on another popular egocentric dataset, EGTEA [37], in Table 4. Our method surpasses the previous state-of-the-art, which takes multiple modalities including visual, auditory and textual inputs [33], by more than a 10% absolute margin on the mean accuracy metric. Since previous methods are based on different backbones, we experiment with a TSF-Base ("Visual only") model pre-trained on Kinetics [9] as a fair baseline for LAVILA.
We observe that its accuracy is comparable to previous methods but much lower than LAVILA's, implying the effectiveness of learning visual representations on large-scale egocentric videos with an LLM as textual supervision during pre-training.

Method | Backbone | Pretrain | Top-1 Acc. | Mean Acc.
Li et al. [37] | I3D | K400 | - | 53.30
LSTA [63] | ConvLSTM | IN-1k | 61.86 | 53.00
IPL [71] | I3D | K400 | - | 60.15
MTCN [33] | SlowFast (V+A+T) | K400+VGG-Sound | 73.59 | 65.87
Visual only | TSF-B | IN-21k+K400 | 65.58 | 59.32
LAVILA | TSF-B | WIT+Ego4D | 77.45 | 70.12
LAVILA | TSF-L | WIT+Ego4D | 81.75 | 76.00

Table 4. EGTEA Classification. LAVILA obtains significant gains on this task, outperforming prior work by over 10% mean accuracy. Since the backbones used are not all comparable, we also report a comparable baseline with TSF-B ("Visual only").

CharadesEgo. Next, we compare LAVILA's representation on the CharadesEgo action classification task. As shown in Table 5, LAVILA's representation excels on this task as well, which is notable as CharadesEgo videos are significantly different from Ego4D, being captured by crowdsourced workers using mobile cameras.

Method | Backbone | mAP (ZS) | mAP (FT)
ActorObserverNet [57] | ResNet-152 | - | 20.0
SSDA [12] | I3D | - | 25.8
Ego-Exo [38] | SlowFast-R101 | - | 30.1
EgoVLP [39] | TSF-B | 25.0 | 32.1
LAVILA | TSF-B | 26.8 | 33.7
LAVILA | TSF-L | 28.9 | 36.1

Table 5. CharadesEgo Action Recognition. LAVILA sets a new state-of-the-art in both zero-shot (ZS) and finetuned (FT) settings. Note that CharadesEgo videos are visually different from the Ego4D videos on which LAVILA is pretrained.

5.2. Application to Third-Person Video Pre-training

We apply LAVILA to third-person videos by experimenting with the HowTo100M [45] dataset. Specifically, we use the temporally aligned subset provided by [25], which contains 3.3M sentences from 247k videos. We evaluate the video representation on two third-person video datasets, i.e. UCF-101 [60] and HMDB-51 [35], for action classification using the linear probing protocol. For more details, please refer to Appendix D. From Table 6, we see that LAVILA outperforms previous methods such as MIL-NCE [44] and TAN [25] by a large margin. Since we use a different backbone, we also report a baseline without the LLM and show that LAVILA indeed benefits from the language supervision.

Method | Vis. Enc. | UCF-101 | HMDB-51
MIL-NCE [44] | S3D | 82.7 | 54.3
TAN [25] | S3D | 83.2 | 56.7
Baseline (w/o LLM) | TSF-B | 86.5 | 59.4
LAVILA | TSF-B | 87.4 | 57.2
LAVILA | TSF-L | 88.1 | 61.5

Table 6. LAVILA on third-person videos. We measure the linear-probing action classification performance of the video model after pre-training on HowTo100M [45].

5.3. Application to Semi-supervised Learning

While LAVILA is very effective at leveraging existing narrations to augment them, we now show that it is also applicable when only a limited number of narrations is available to begin with. We first divide each long video from Ego4D into 15-second chunks and assume that only the annotated clips within every N-th chunk are available during pre-training, leading to approximately 100/N % of the full set, where N is in {2, 5, 10}. This can be considered a practical scenario where we want to annotate as many videos as possible for diversity while the annotation budget is limited. In the remaining (1 - 1/N) fraction that is skipped, we uniformly sample the same number of clips per chunk, with the same clip length as in the seen chunks. Both the dual-encoder model and NARRATOR are trained on the 100/N % available annotations.

We plot the zero-shot performance curve of pre-training with different proportions in Figure 5. We can see that LAVILA consistently outperforms the ground-truth-only baseline at all points (10, 20, 50, and 100%).

[Figure 5: zero-shot performance versus fraction of narrations used, for LAVILA, the baseline, and the SOTA [39], on (a) EK-100 MIR mAP, (b) EK-100 MIR nDCG, (c) EGTEA mean accuracy, and (d) EgoMCQ intra-video accuracy.]
Figure 5. LAVILA is effective in a semi-supervised setting where only a limited amount of narrations is given. Comparing zero-shot performance of pre-training, LAVILA consistently outperforms the groundtruth-only baseline when 10, 20, 50, or 100% of the data is used. We also achieve comparable results with the state-of-the-art using only 50% of the annotated data.
The improvement tends to be larger when more data is available, indicating the method's scalability as more videos are narrated in the future. Furthermore, we observe that our method can achieve a similar level of performance as the baseline while often using less than 50% of the data. We also achieve a comparable result with the state-of-the-art using much less data.

5.4. Ablation Studies

Contributions of Different Language Supervisions. We ablate the different language supervisions in Table 8 on EK-100 MIR (zero-shot), EgoMCQ and EGTEA. Using the text-only REPHRASER ("Rephr.") or the visually conditioned NARRATOR ("Recap.") separately improves the ground-truth baseline noticeably. Combining both REPHRASER and NARRATOR gives an improvement of 3.5% average mAP on EK-100 MIR. We see that dense captioning on the entire video ("Pseudo-cap.") is also helpful. Though the gain on EK-100 MIR is not as significant, it shows nontrivial improvements on EgoMCQ intra-video accuracy and EGTEA mean accuracy. Our conjecture for this marginal gain is that informative clips are mostly covered in Ego4D, because all videos are inspected by two annotators.

Rephr. | Recap. | Pseudo-cap. | EK-100 MIR Avg. mAP | EK-100 MIR Avg. nDCG | EgoMCQ Inter-video | EgoMCQ Intra-video | EGTEA Mean | EGTEA Top-1
 -  |  -  |  -  | 26.0 | 28.8 | 93.6 | 54.3 | 27.3 | 30.1
yes |  -  |  -  | 28.0 | 30.1 | 93.5 | 56.9 | 29.8 | 30.8
 -  | yes |  -  | 27.1 | 29.9 | 93.2 | 59.2 | 26.8 | 31.2
yes | yes |  -  | 29.7 | 31.5 | 93.6 | 58.3 | 29.4 | 36.6
yes | yes | yes | 29.9 | 31.4 | 93.6 | 59.1 | 31.1 | 36.0

Table 8. Contributions of different language supervision. We can see that (1) using REPHRASER ("Rephr.") and NARRATOR ("Recap.") improves downstream zero-shot performance complementarily, and (2) dense pseudo-captioning further improves performance on 3 out of 6 metrics.

Generation Quality of NARRATOR. We study how NARRATOR's configuration affects the quality of the generated text and the downstream performance. The generation quality is measured by standard unsupervised automatic metrics including METEOR, ROUGE, and CIDEr [43]. We use a NARRATOR with a smaller GPT-2 as the text decoder and consider two scenarios in Table 7a: (1) the LM is randomly initialized but jointly trained with the gated cross-attention modules, and (2) the LM is initialized from the original GPT-2. The generation quality decreases compared to GPT-2 XL in both cases, and the zero-shot retrieval result on EK-100 MIR is worse. This indicates that the language model should be sufficiently large and pre-trained on web text data.

Sampling. In Table 7b, we investigate different sampling methods for text generation from NARRATOR. We see that nucleus sampling works much better than beam search, while repeated sampling shows a marginal further improvement.

Scaling effect. In Table 7c, we compare the zero-shot retrieval results while progressively increasing the size of NARRATOR's video encoder from TSF-B to TSF-L and TSF-L@HR, which increases the input resolution to be narrated from 224 to 336, while fixing the dual-encoder architecture. The retrieval performance steadily increases as NARRATOR becomes stronger. We conduct this experiment while varying the dual-encoder architecture, namely TSF-Base and TSF-Large, and observe similar trends. Both phenomena suggest that LAVILA can scale to larger models.

Text Dec. arch. | Text Dec. init. | Freeze LM | METEOR | ROUGE | CIDEr | Avg. mAP
(baseline) | - | - | - | - | - | 26.0
GPT-2 | random | no | 0.284 | 0.524 | 0.882 | 24.3
GPT-2 | WebText | yes | 0.270 | 0.505 | 0.804 | 24.0
GPT-2 XL | WebText | yes | 0.289 | 0.530 | 0.940 | 26.2
(a) Generation Quality. Using a sufficiently large language model as the text decoder is crucial for good text generation quality and downstream performance.

Sampling method | # of sentences | Avg. mAP
(baseline) | - | 26.0
Beam search | 1 | 27.9
Nucleus | 1 | 29.6
Nucleus | 10 | 29.7
(b) Sampling. LAVILA benefits more from narrations produced by nucleus sampling than beam search.

(c) Scaling effect of LAVILA. [Bar chart: zero-shot average mAP as NARRATOR's video encoder grows from the default (original narrations only) to TSF-B, TSF-L, and TSF-L@HR, for dual-encoders with TSF-B and TSF-L video architectures.] Gains increase on scaling the video encoder in NARRATOR. Default refers to only using the original narrations.

Table 7. Ablations of NARRATOR. We report zero-shot average mAP on EK-100 MIR for comparing downstream performance. We study NARRATOR from the perspective of generation quality (a), sampling techniques (b), and scaling effect (c).

6. Conclusion and Future Work

In this paper, we proposed LAVILA, a new approach to video-language representation learning that automatically narrates long videos with LLMs. We achieve strong improvements over baselines trained with the same amount of human-narrated videos and set a new state-of-the-art on six popular benchmark tasks across first- and third-person video understanding. LAVILA also shows positive scaling behavior when adding more training narrations, using larger visual backbones, and using stronger LLMs, all of which are promising areas for future work.

Acknowledgements: We thank Naman Goyal, Stephen Roller and Susan Zhang for help with language models, Kevin Qinghong Lin for help with EgoVLP, and the Meta AI team for helpful discussions and feedback. This material is based upon work in part supported by the National Science Foundation under Grant No. IIS-1845485.
References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
[2] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In ICCV, 2021.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021.
[5] Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model. In NIPS, 2000.
[6] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, 2021.
[7] Tom Brown, Benjamin Mann, Nick Ryder, et al. Language models are few-shot learners. NeurIPS, 2020.
[8] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
[9] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[10] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Universal image-text representation learning. In ECCV, 2020.
[11] Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In ICML, 2021.
[12] Jinwoo Choi, Gaurav Sharma, Manmohan Chandraker, and Jia-Bin Huang. Unsupervised and semi-supervised domain adaptation for action recognition from drones. In WACV, 2020.
[13] Ego4D Consortium. Egocentric live 4D perception (Ego4D) database: A large-scale first-person video database, supporting research in multi-modal machine perception for daily life activity. [Link] ego4d/home. Accessed: 2022-11-22.
[14] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, et al. Rescaling egocentric vision: Collection, pipeline and challenges for Epic-Kitchens-100. IJCV, 2022.
[15] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[16] Karan Desai and Justin Johnson. VirTex: Learning visual representations from textual annotations. In CVPR, 2021.
[17] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
[19] Marzieh Fadaee, Arianna Bisazza, and Christof Monz. Data augmentation for low-resource neural machine translation. In ACL, 2017.
[20] Steven Y Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. A survey of data augmentation approaches for NLP. In ACL Findings, 2021.
[21] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. NeurIPS, 2013.
[22] Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, and Ishan Misra. Omnivore: A single model for many visual modalities. In CVPR, 2022.
[23] Ramsri Goutham Golla. High-quality sentence paraphraser using transformers in NLP. [Link] co/ramsrigouthamg/t5-large-paraphraser-diverse-high-quality. Accessed: 2022-06-01.
[24] Kristen Grauman, Andrew Westbury, Eugene Byrne, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.
[25] Tengda Han, Weidi Xie, and Andrew Zisserman. Temporal alignment networks for long-term video. In CVPR, 2022.
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[27] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
[28] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
[29] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In ICLR, 2020.
[30] Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. In CVPR, 2022.
[31] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
[32] Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. In ECCV, 2022.
[33] Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, and Dima Damen. With a little help from my temporal context: Multimodal egocentric action recognition. In BMVC, 2021.
[34] Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, and Boqing Gong. MoViNets: Mobile video networks for efficient video recognition. In CVPR, 2021.
[35] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
[36] Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L Berg, Mohit Bansal, and Jingjing Liu. Less is more: ClipBERT for video-and-language learning via sparse sampling. In CVPR, 2021.
[37] Yin Li, Miao Liu, and James M Rehg. In the eye of beholder: Joint learning of gaze and actions in first person video. In ECCV, 2018.
[38] Yanghao Li, Tushar Nagarajan, Bo Xiong, and Kristen Grauman. Ego-Exo: Transferring visual representations from third-person to first-person videos. In CVPR, 2021.
[39] Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, et al. Egocentric video-language pre-training. In NeurIPS, 2022.
[40] Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Frozen CLIP models are efficient video learners. In ECCV, 2022.
[41] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS, 2019.
[42] Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. UniVL: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.
[43] Maluuba. nlg-eval. [Link] nlg-eval. Accessed: 2022-06-01.
[44] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In CVPR, 2020.
[45] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.
[46] Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, and Cordelia Schmid. Learning audio-video modalities from image captions. In ECCV, 2022.
[47] Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for general video recognition. In ECCV, 2022.
[48] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[49] Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[50] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 2019.
[51] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
[52] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[53] Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. Movie description. IJCV, 2017.
[54] Christoph Schuhmann, Romain Beaumont, Cade W Gordon, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS D&B, 2022.
[55] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.
[56] Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid. End-to-end generative pretraining for multimodal video captioning. In CVPR, 2022.
[57] Gunnar A Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. Actor and observer: Joint modeling of first and third-person videos. In CVPR, 2018.
[58] Gunnar A Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. Charades-Ego: A large-scale dataset of paired third and first person videos. arXiv preprint arXiv:1804.09626, 2018.
[59] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. NeurIPS, 2016.
[60] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[61] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers. TMLR, 2022.
[62] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In ICLR, 2020.
[63] Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz. LSTA: Long short-term attention for egocentric action recognition. In CVPR, 2019.
[64] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In ICCV, 2019.
[65] Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. In NeurIPS, 2021.
[66] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[67] Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424, 2016.
[68] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[69] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022.
[70] William Yang Wang and Diyi Yang. That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In EMNLP, 2015.
[71] Xiaohan Wang, Linchao Zhu, Heng Wang, and Yi Yang. Interactive prototype learning for egocentric action recognition. In ICCV, 2021.
[72] Jason Wei and Kai Zou. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In EMNLP, 2019.
[73] Jason Weston, Samy Bengio, and Nicolas Usunier. Large scale image annotation: Learning to rank with joint word-image embeddings. Machine Learning, 2010.
[74] John Wieting and Kevin Gimpel. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In ACL, 2018.
[75] Michael Wray, Diane Larlus, Gabriela Csurka, and Dima Damen. Fine-grained action retrieval through multiple parts-of-speech embeddings. In ICCV, 2019.
[76] Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. MeMViT: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In CVPR, 2022.
[77] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. In EMNLP, 2021.
[78] Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, and Jiebo Luo. CLIP-ViP: Adapting pre-trained image-text model to video-language representation alignment. In ICLR, 2023.
[79] Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, and Cordelia Schmid. Multiview transformers for video recognition. In CVPR, 2022.
[80] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. TMLR, 2022.
[81] Lu Yuan, Dongdong Chen, Yi-Ling Chen, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
[82] Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. MERLOT: Multimodal neural script knowledge models. NeurIPS, 2021.
[83] Andy Zeng, Adrian Wong, Stefan Welker, et al. Socratic models: Composing zero-shot multimodal reasoning with language. In ICLR, 2023.
[84] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, 2022.
[85] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NeurIPS, 2015.
[86] Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018.
[87] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. In ECCV, 2022.
[88] Linchao Zhu and Yi Yang. ActBERT: Learning global-local video-text representations. In CVPR, 2020.
[89] Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Hongsheng Li, Xiaohua Wang, and Jifeng Dai. Uni-Perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks. In CVPR, 2022.
