Skip to content

[CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era"

License

Notifications You must be signed in to change notification settings

Beckschen/ViTamin

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Jun 9, 2024
24f7475 Â· Jun 9, 2024

History

48 Commits
May 5, 2024
Apr 9, 2024
Apr 9, 2024
Apr 9, 2024
Apr 3, 2024
Jun 9, 2024
Apr 8, 2024

Repository files navigation

🔥 Officially supported by timm and OpenCLIP. Thanks @rwightman!

One line of code to call ViTamin:

model = timm.create_model('vitamin_xlarge_384')

ViTamin-XL, with only 436M parameters and trained on the public DataComp-1B dataset, achieves an impressive 82.9% zero-shot ImageNet accuracy.

ViTamin-L sets a new SOTA across seven benchmarks for open-vocabulary segmentation, and also push forward the capabilities of large multi-modal models (e.g., LLaVA) significantly.

🤗 The HuggingFace collection of ViTamin model cards has been released! Check out the model cards!

teaser

Get Started

It currently includes code and models for the following tasks:

ViTamin Pre-training: See ./ViTamin/README.md for a quick start, which includes CLIP pre-training / fine-tuning pipelines and zero-shot evaluation pipelines.

Open-vocabulary Detection and Segmentation: See ViTamin for Open-vocab Detection and ViTamin for Open-vocab Segmentation.

Large Multi-Modal Models: See ViTamin for Large Multi-Modal Models.

We also support ViTamin with Hugging Face model jienengchen/ViTamin-XL-384px.

import torch
import open_clip
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    'jienengchen/ViTamin-XL-384px',
    trust_remote_code=True).to(device).eval()

image = Image.open('./image.png').convert('RGB')
image_processor = CLIPImageProcessor.from_pretrained('jienengchen/ViTamin-XL-384px')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K')
text = tokenizer(["a photo of vitamin", "a dog", "a cat"]).to(device)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features, text_features, logit_scale = model(pixel_values, text)
    text_probs = (100.0 * image_features @ text_features.to(torch.float).T).softmax(dim=-1)

print("Label probs:", text_probs) 

Main Results with CLIP Pre-training on DataComp-1B

We will provide 61 trained VLMs (48 benchmarked + 13 best performing) in Hugging Face for community use. Stay tuned!

image encoder 🤗 HuggingFace image size num patches text encoder depth/width seen samples (B) trainable params Image+Text (M) MACs Image+Text (G) ImageNet Acc. avg. 38 datasets ImageNet dist. shift. VTAB retrieval
ViTamin-L Link 224 196 12/768 12.8 333.3+123.7 72.6+6.6 80.8 66.7 69.8 65.3 60.3
ViTamin-L Link 256 256 12/768 12.8+0.2 333.4+123.7 94.8+6.6 81.2 67.0 71.1 65.3 61.2
ViTamin-L Link 336 441 12/768 12.8+0.2 333.6+123.7 163.4+6.6 81.6 67.0 72.1 64.4 61.6
ViTamin-L Link 384 576 12/768 12.8+0.2 333.7+123.7 213.4+6.6 81.8 67.2 72.4 64.7 61.8
ViTamin-L2 Link 224 196 24/1024 12.8 333.6+354.0 72.6+23.3 80.9 66.4 70.6 63.4 61.5
ViTamin-L2 Link 256 256 24/1024 12.8+0.5 333.6+354.0 94.8+23.3 81.5 67.4 71.9 64.1 63.1
ViTamin-L2 Link 336 441 24/1024 12.8+0.5 333.8+354.0 163.4+23.3 81.8 67.8 73.0 64.5 63.6
ViTamin-L2 Link 384 576 24/1024 12.8+0.5 334.0+354.0 213.4+23.3 82.1 68.1 73.4 64.8 63.7
ViTamin-XL Link 256 256 27/1152 12.8+0.5 436.1+488.7 125.3+33.1 82.1 67.6 72.3 65.4 62.7
ViTamin-XL Link 384 576 27/1152 12.8+0.5 436.1+488.7 281.9+33.1 82.6 68.1 73.6 65.6 63.8
ViTamin-XL Link 256 256 27/1152 40 436.1+488.7 125.3+33.1 82.3 67.5 72.8 64.0 62.1
ViTamin-XL Link 336 441 27/1152 40+1 436.1+488.7 215.9+33.1 82.7 68.0 73.9 64.1 62.6
ViTamin-XL Link 384 576 27/1152 40+1 436.1+488.7 281.9+33.1 82.9 68.1 74.1 64.0 62.5

Main Results on Downstream tasks

Open-Vocab Detection

image encoder detector OV-COCO (AP50novel) OV-LVIS (APr)
ViT-L/14 Sliding F-ViT 36.1 32.5
ViTamin-L Sliding F-ViT 37.5 35.6

Open-Vocab Segmentation

image encoder segmentor ADE Cityscapes MV A-150 A-847 PC-459 PC-59 PAS-21
ViT-L/14 Sliding FC-CLIP 24.6 40.7 16.5 31.8 14.3 18.3 55.1 81.5
ViTamin-L Sliding FC-CLIP 27.3 44.0 18.2 35.6 16.1 20.4 58.4 83.4

Note: Panoptic dataset (ADE, CityScapes, MV) are with the metric of PQ. Semantic dataset (A-150, A-847, PC-459, PC-59, PAS-21) are with the metric of mIoU.

Large Multi-modal Models

image encoder image size VQAv2 GQA VizWiz SQA T-VQA POPE MME MM-Bench MM-B-CN SEED LLaVA-Wild MM-Vet
ViTamin-L 336 78.4 61.6 51.1 66.9 58.7 84.6 1421 65.4 58.4 57.7 64.5 33.6
ViTamin-L 384 78.9 61.6 55.4 67.6 59.8 85.5 1447 64.5 58.3 57.9 66.1 33.6

Citing ViTamin

@inproceedings{chen2024vitamin,
  title={ViTamin: Designing Scalable Vision Models in the Vision-language Era},
  author={Chen, Jieneng and Yu, Qihang and Shen, Xiaohui and Yuille, Alan and Chen, Liang-Chieh},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}

About

[CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era"

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published