ViTamin: Designing Scalable Vision Models in the Vision-language Era

🔥 Officially supported by timm and OpenCLIP. Thanks @rwightman!

Once timm is imported, one line of code creates ViTamin:

import timm
model = timm.create_model('vitamin_xlarge_384')

ViTamin-XL, with only 436M parameters and trained on the public DataComp-1B dataset, achieves an impressive 82.9% zero-shot ImageNet accuracy.

ViTamin-L sets a new SOTA across seven benchmarks for open-vocabulary segmentation, and also significantly pushes forward the capabilities of large multi-modal models (e.g., LLaVA).

🤗 The HuggingFace collection of ViTamin model cards has been released! Check out the model cards!

Get Started

This repository currently includes code and models for the following tasks:

ViTamin Pre-training: See ./ViTamin/README.md for a quick start, which includes CLIP pre-training / fine-tuning pipelines and zero-shot evaluation pipelines.

Open-vocabulary Detection and Segmentation: See ViTamin for Open-vocab Detection and ViTamin for Open-vocab Segmentation.

Large Multi-Modal Models: See ViTamin for Large Multi-Modal Models.

ViTamin can also be loaded through Hugging Face transformers, for example with the model jienengchen/ViTamin-XL-384px:

import torch
import open_clip
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# trust_remote_code=True lets transformers run the model code shipped in the Hub repository.
model = AutoModel.from_pretrained(
    'jienengchen/ViTamin-XL-384px',
    trust_remote_code=True).to(device).eval()

# Load and preprocess the input image.
image = Image.open('./image.png').convert('RGB')
image_processor = CLIPImageProcessor.from_pretrained('jienengchen/ViTamin-XL-384px')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
# Use the selected device rather than assuming CUDA.
pixel_values = pixel_values.to(device=device, dtype=torch.bfloat16)

# Tokenize the candidate captions.
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K')
text = tokenizer(["a photo of vitamin", "a dog", "a cat"]).to(device)

# Run inference under mixed precision on whichever device was selected.
with torch.no_grad(), torch.autocast(device_type=device):
    image_features, text_features, logit_scale = model(pixel_values, text)
    # Scaled image-text similarities, turned into probabilities over the captions.
    text_probs = (100.0 * image_features @ text_features.to(torch.float).T).softmax(dim=-1)

print("Label probs:", text_probs)
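
The last step of the snippet above, turning image and text features into label probabilities, can be sketched in plain Python. This is a minimal illustration, not the model's own code: the helper names and the three-dimensional toy vectors are hypothetical, and it normalizes the features explicitly, whereas the snippet appears to rely on the model returning features that are already L2-normalized.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_feature, text_features, scale=100.0):
    """Softmax over scaled cosine similarities between one image and N captions."""
    img = l2_normalize(image_feature)
    logits = []
    for text_feature in text_features:
        txt = l2_normalize(text_feature)
        logits.append(scale * sum(a * b for a, b in zip(img, txt)))
    return softmax(logits)


# Toy vectors (made up for illustration): the image is closest to the first caption.
image_feature = [0.9, 0.1, 0.0]
text_features = [
    [1.0, 0.0, 0.0],  # stands in for "a photo of vitamin"
    [0.0, 1.0, 0.0],  # stands in for "a dog"
    [0.0, 0.0, 1.0],  # stands in for "a cat"
]
probs = zero_shot_probs(image_feature, text_features)
print("Label probs:", probs)
```

The fixed factor of 100.0 mirrors the constant used in the snippet; a sharper or flatter distribution results from a larger or smaller scale.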

Main Results with CLIP Pre-training on DataComp-1B

We will release 61 trained VLMs (48 benchmarked + 13 best performing) on Hugging Face for community use. Stay tuned!

image encoder	🤗 HuggingFace	image size	num patches	text encoder depth/width	seen samples (B)	trainable params Image+Text (M)	MACs Image+Text (G)	ImageNet Acc.	avg. 38 datasets	ImageNet dist. shift.	VTAB	retrieval
ViTamin-L	Link	224	196	12/768	12.8	333.3+123.7	72.6+6.6	80.8	66.7	69.8	65.3	60.3
ViTamin-L	Link	256	256	12/768	12.8+0.2	333.4+123.7	94.8+6.6	81.2	67.0	71.1	65.3	61.2
ViTamin-L	Link	336	441	12/768	12.8+0.2	333.6+123.7	163.4+6.6	81.6	67.0	72.1	64.4	61.6
ViTamin-L	Link	384	576	12/768	12.8+0.2	333.7+123.7	213.4+6.6	81.8	67.2	72.4	64.7	61.8
ViTamin-L2	Link	224	196	24/1024	12.8	333.6+354.0	72.6+23.3	80.9	66.4	70.6	63.4	61.5
ViTamin-L2	Link	256	256	24/1024	12.8+0.5	333.6+354.0	94.8+23.3	81.5	67.4	71.9	64.1	63.1
ViTamin-L2	Link	336	441	24/1024	12.8+0.5	333.8+354.0	163.4+23.3	81.8	67.8	73.0	64.5	63.6
ViTamin-L2	Link	384	576	24/1024	12.8+0.5	334.0+354.0	213.4+23.3	82.1	68.1	73.4	64.8	63.7
ViTamin-XL	Link	256	256	27/1152	12.8+0.5	436.1+488.7	125.3+33.1	82.1	67.6	72.3	65.4	62.7
ViTamin-XL	Link	384	576	27/1152	12.8+0.5	436.1+488.7	281.9+33.1	82.6	68.1	73.6	65.6	63.8
ViTamin-XL	Link	256	256	27/1152	40	436.1+488.7	125.3+33.1	82.3	67.5	72.8	64.0	62.1
ViTamin-XL	Link	336	441	27/1152	40+1	436.1+488.7	215.9+33.1	82.7	68.0	73.9	64.1	62.6
ViTamin-XL	Link	384	576	27/1152	40+1	436.1+488.7	281.9+33.1	82.9	68.1	74.1	64.0	62.5
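
As a reading aid for the table, the "num patches" column appears consistent with an effective stride of 16 pixels per patch side, and the "Image+Text" cells can be summed to get a combined figure. The sketch below checks both; the stride of 16 is an assumption inferred from the table values rather than something stated here, and the helper names are hypothetical.

```python
PATCH_STRIDE = 16  # assumption inferred from the table, not stated in the text


def num_patches(image_size, stride=PATCH_STRIDE):
    """Number of patch tokens for a square image at the given stride."""
    side = image_size // stride
    return side * side


def combined(cell):
    """Sum an 'image+text' table cell such as '436.1+488.7'."""
    image_part, text_part = (float(part) for part in cell.split("+"))
    return image_part + text_part


for size in (224, 256, 336, 384):
    print(size, num_patches(size))

# Combined trainable parameters (M) for ViTamin-XL with its text encoder.
print(round(combined("436.1+488.7"), 1))
```

Under that assumption the function reproduces the table's 196, 256, 441, and 576 patches for the four image sizes.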

Main Results on Downstream tasks

Open-Vocab Detection

image encoder	detector	OV-COCO (AP₅₀^novel)	OV-LVIS (AP_r)
ViT-L/14	Sliding F-ViT	36.1	32.5
ViTamin-L	Sliding F-ViT	37.5	35.6

Open-Vocab Segmentation

image encoder	segmentor	ADE	Cityscapes	MV	A-150	A-847	PC-459	PC-59	PAS-21
ViT-L/14	Sliding FC-CLIP	24.6	40.7	16.5	31.8	14.3	18.3	55.1	81.5
ViTamin-L	Sliding FC-CLIP	27.3	44.0	18.2	35.6	16.1	20.4	58.4	83.4

Note: The panoptic datasets (ADE, Cityscapes, MV) are evaluated with PQ. The semantic datasets (A-150, A-847, PC-459, PC-59, PAS-21) are evaluated with mIoU.

Large Multi-modal Models

image encoder	image size	VQAv2	GQA	VizWiz	SQA	T-VQA	POPE	MME	MM-Bench	MM-B-CN	SEED	LLaVA-Wild	MM-Vet
ViTamin-L	336	78.4	61.6	51.1	66.9	58.7	84.6	1421	65.4	58.4	57.7	64.5	33.6
ViTamin-L	384	78.9	61.6	55.4	67.6	59.8	85.5	1447	64.5	58.3	57.9	66.1	33.6

Citing ViTamin

@inproceedings{chen2024vitamin,
  title={ViTamin: Designing Scalable Vision Models in the Vision-language Era},
  author={Chen, Jieneng and Yu, Qihang and Shen, Xiaohui and Yuille, Alan and Chen, Liang-Chieh},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
