[CVPR 2024] MeaCap: Memory-Augmented Zero-shot Image Captioning

Authors: Zequn Zeng, Yan Xie, Hao Zhang, Chiyu Chen, Zhengjue Wang, Bo Chen

Official implementation of MeaCap.

arXiv: https://arxiv.org/abs/2403.03715

Catalogue:

- Introduction
- Citation
- Data Preparation
- Model Zoo
- Inference
- Experiments
- Acknowledgements


Introduction

Zero-shot image captioning (IC) without well-paired image-text data can be divided into two categories: training-free and text-only-training. The main difference between them is whether a textual corpus is used to train the LM. Though they achieve attractive performance on some metrics, existing methods often share common drawbacks: training-free methods tend to produce hallucinations, while text-only-training methods often lose generalization capability. To move forward, in this paper we propose a novel Memory-Augmented zero-shot image Captioning framework (MeaCap). Specifically, equipped with a textual memory, we introduce a retrieve-then-filter module to extract key concepts that are highly related to the image. By deploying our proposed memory-augmented visual-related fusion score in a keywords-to-sentence LM, MeaCap can generate concept-centered captions that remain highly consistent with the image, with fewer hallucinations and more world knowledge.
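To make the pipeline concrete, below is a minimal sketch of the retrieve-then-filter idea, not the official implementation: CLIP retrieves the memory captions most similar to the image, a scene-graph parser extracts candidate concepts from them, and the concepts are filtered by their own CLIP similarity to the image. The checkpoint names, the `top_k`/`top_m` parameters, and the use of the `sng_parser` package are illustrative assumptions.

```python
# Illustrative sketch of retrieve-then-filter (not the official MeaCap code).
import torch
import sng_parser  # a scene-graph parser, assumed here for concept extraction
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_then_filter(image_path, memory_captions, memory_clip_embeddings,
                         top_k=5, top_m=4):
    """Retrieve top_k memory captions for the image, then keep the top_m concepts.

    `memory_clip_embeddings` is assumed to hold L2-normalized CLIP text
    embeddings, one row per caption in `memory_captions`.
    """
    image = Image.open(image_path).convert("RGB")
    pixels = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        img_emb = model.get_image_features(**pixels)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    # Retrieve: rank memory captions by CLIP image-text similarity.
    sims = (img_emb @ memory_clip_embeddings.T).squeeze(0)
    retrieved = [memory_captions[i] for i in sims.topk(top_k).indices.tolist()]

    # Parse: collect candidate concepts (entity heads) from the retrieved captions.
    concepts = sorted({e["head"] for c in retrieved
                       for e in sng_parser.parse(c)["entities"]})

    # Filter: keep the concepts that themselves score highest against the image.
    tokens = processor(text=concepts, return_tensors="pt", padding=True)
    with torch.no_grad():
        txt_emb = model.get_text_features(**tokens)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ txt_emb.T).squeeze(0)
    keep = scores.topk(min(top_m, len(concepts))).indices.tolist()
    return [concepts[i] for i in keep]
```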

Citation

If you find MeaCap useful, please cite our paper!

@article{zeng2024meacap,
  title={MeaCap: Memory-Augmented Zero-shot Image Captioning},
  author={Zeng, Zequn and Xie, Yan and Zhang, Hao and Chen, Chiyu and Wang, Zhengjue and Chen, Bo},
  journal={arXiv preprint arXiv:2403.03715},
  year={2024}
}

Data Preparation

Environment

Prepare the Python environment:

pip install -r requirements.txt

Memory bank

We have preprocessed the textual corpora of CC3M, SS1M, COCO, and Flickr30k and transformed them into CLIP and SentenceBERT embeddings for fast retrieval. Download our preprocessed memory files and put them into ./data/memory/, as:

data
└── memory
    ├── cc3m
    │   ├── memory_captions.json
    │   ├── memory_clip_embeddings.pt
    │   └── memory_wte_embeddings.pt
    ├── coco
    │   ├── memory_captions.json
    │   ├── memory_clip_embeddings.pt
    │   └── memory_wte_embeddings.pt
    └── ...    

You can also preprocess a new textual memory bank, for example:

python prepare_embedding.py --memory_id coco --memory_path data/memory/coco/memory_captions.json
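For intuition, the following is a minimal sketch of what such preprocessing amounts to, assuming memory_captions.json is a flat list of caption strings; prepare_embedding.py is the authoritative script, and the SentenceBERT checkpoint named here is an assumption.

```python
# Minimal sketch of building a memory bank; prepare_embedding.py is the real script.
import json
import torch
from sentence_transformers import SentenceTransformer
from transformers import CLIPModel, CLIPTokenizer

with open("data/memory/coco/memory_captions.json") as f:
    captions = json.load(f)  # assumed: a flat list of caption strings

# CLIP text embeddings, used for image-to-text retrieval at inference time.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_chunks = []
with torch.no_grad():
    for i in range(0, len(captions), 256):  # batch to keep memory bounded
        batch = tokenizer(captions[i:i + 256], padding=True,
                          truncation=True, return_tensors="pt")
        emb = clip.get_text_features(**batch)
        clip_chunks.append(emb / emb.norm(dim=-1, keepdim=True))
torch.save(torch.cat(clip_chunks), "data/memory/coco/memory_clip_embeddings.pt")

# SentenceBERT embeddings, used for sentence-level similarity.
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # checkpoint name is an assumption
wte = sbert.encode(captions, convert_to_tensor=True, normalize_embeddings=True)
torch.save(wte, "data/memory/coco/memory_wte_embeddings.pt")
```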

Model Zoo

MeaCap uses multiple pretrained models to serve different purposes. The default language model is CBART. We provide download links for pretrained CBART and caption-finetuned CBART. Please download these weights and put them into ./checkpoints/.

| Methods | Training Datasets | Download link | Purposes |
| --- | --- | --- | --- |
| **(Required)** | | | |
| CLIP | - | link | Image-text similarity computation |
| SceneGraphParser | - | link | Parse caption into scene graph |
| SentenceBERT | - | link | Sentence similarity computation |
| **(Optional)** | | | |
| CBART-large | One-Billion-Word | link | Keywords-to-sentence LM for $MeaCap_{TF}$ |
| CBART-large | CC3M | link | Keywords-to-sentence LM for $MeaCap_{ToT}$ |
| CBART-large | SS1M | link | Keywords-to-sentence LM for $MeaCap_{ToT}$ |
| CBART-large | COCO | link | Keywords-to-sentence LM for $MeaCap_{ToT}$ |
| CBART-large | Flickr30K | link | Keywords-to-sentence LM for $MeaCap_{ToT}$ |
| ViECap | COCO/Flickr30k | link | Baseline of $MeaCap_{InvLM}$ |

If you want to finetune CBART on your own caption corpus, please follow the official training instructions from CBART.


Inference

Training-free

For the training-free version $MeaCap_{TF}$, we use a pretrained CBART. To bridge the gap between its One-Billion-Word pretraining corpus and caption-style text, we use the default prompt "The image depicts that". We also support prompt ensembling by setting --prompt_ensembling.

python inference.py --use_prompt  --memory_id cc3m --img_path ./image_example --lm_model_path ./checkpoints/CBART_one_billion 
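As a rough illustration of what prompt ensembling means here, one can average the CLIP text features of a candidate caption over several caption-style prompts before scoring it against the image. The sketch below is an assumption about the idea, not the behavior of --prompt_ensembling; the prompt list beyond the default "The image depicts that" is hypothetical.

```python
# Illustrative prompt ensembling for CLIP text scoring (prompt list is an assumption).
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

PROMPTS = ["The image depicts that", "A photo of", "There is"]  # hypothetical set

def ensembled_text_feature(candidate: str) -> torch.Tensor:
    """Average CLIP text features of `candidate` over all prompts, then re-normalize."""
    texts = [f"{p} {candidate}" for p in PROMPTS]
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**batch)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize each variant
    mean = feats.mean(dim=0)
    return mean / mean.norm()  # the ensembled, unit-norm text feature
```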

Text-only-training

For the text-only-training version $MeaCap_{ToT}$, we use a finetuned CBART, so no prompt is needed.

python inference.py --memory_id coco --img_path ./image_example --lm_model_path ./checkpoints/CBART_COCO 

Memory concepts + ViECap

We also support adding memory concepts to the strong baseline ViECap in a plug-and-play way, namely $MeaCap_{InvLM}$. We simply replace the entity module with our proposed retrieve-then-filter module at the inference stage, and the performance improves. Details are shown in the Appendix of our paper.

python viecap_inference.py --memory_id coco --image_path "*.jpg" --weight_path "checkpoints/train_coco/coco_prefix-0014.pt"

Experiments

Zero-shot captioning

| Methods | Training | Memory | MSCOCO (CIDEr) | NoCaps val (CIDEr, In / Near / Out / Overall) |
| --- | --- | --- | --- | --- |
| ConZIC | - | - | 5.0 | 15.4 / 16.0 / 20.3 / 17.5 |
| CLIPRe | - | CC3M | 25.6 | 23.3 / 26.8 / 36.5 / 28.2 |
| $MeaCap_{TF}$ | - | CC3M | 42.5 | 35.3 / 39.0 / 45.1 / 40.2 |
| DeCap | CC3M | CC3M | 42.1 | 34.8 / 37.7 / 49.9 / 39.7 |
| $MeaCap_{ToT}$ | CC3M | CC3M | 48.3 | 38.5 / 43.6 / 50.0 / 45.1 |
| DeCap | SS1M | SS1M | 50.6 | 41.9 / 41.7 / 46.2 / 42.7 |
| $MeaCap_{TF}$ | - | SS1M | 51.7 | 42.0 / 42.8 / 45.4 / 43.8 |
| $MeaCap_{ToT}$ | SS1M | SS1M | 54.9 | 44.1 / 46.0 / 49.7 / 47.3 |

In/Cross-domain captioning

All metrics are CIDEr.

| Methods | COCO | Flickr30k | COCO $\Rightarrow$ Flickr30k | Flickr30k $\Rightarrow$ COCO |
| --- | --- | --- | --- | --- |
| MAGIC | 49.3 | 20.4 | 17.5 | 18.3 |
| CLIPRe | 53.4 | 31.7 | 30.1 | 26.5 |
| $MeaCap_{TF}$ | 56.9 | 36.5 | 34.4 | 46.4 |
| $MeaCap_{ToT}$ | 84.8 | 50.2 | 40.3 | 51.7 |
| DeCap | 91.2 | 56.7 | 35.7 | 44.4 |
| CapDec | 91.8 | 39.1 | 35.7 | 27.3 |
| ViECap | 92.9 | 47.9 | 38.4 | 54.2 |
| $MeaCap_{InvLM}$ | 95.4 | 59.4 | 43.9 | 56.4 |

Acknowledgements

This code heavily depends on ConZIC, CBART, and ViECap.

Thanks for their good work.
