The mCLIPEval (multilingual evaluation for CLIP) toolkit provides generic evaluation for pretrained vision-language models (such as CLIP, Contrastive Language–Image Pre-training). More precisely, mCLIPEval provides evaluations:

- on multilingual (12 languages) and monolingual (English/Chinese) datasets, for tasks such as zero-shot image/video classification, zero-shot image-to-text and text-to-image retrieval, and zero-shot visio-linguistic compositionality;
- adapted to various open-source pretrained models, such as FlagAI pretrained models (AltCLIP, EVA-CLIP), OpenCLIP pretrained models, Chinese CLIP models, Multilingual CLIP models, and the Taiyi series of pretrained models. Customized models can also be supported with model scripts.
mCLIPEval provides APIs to quickly download and prepare public datasets from torchvision, Hugging Face, and Kaggle, as well as to download and use open-source pretrained models on these datasets.

mCLIPEval also provides visualization of evaluation results through a Streamlit web app, to compare performances across specific languages, tasks, or model parameters.
Below is a scatter plot showing the performances of some open-source models.
You can use the online demo to experience mCLIPEval.
- Easy-to-compare capabilities of open-source models
  - on specific datasets, languages, tasks, or model parameters,
  - through interactive interfaces that switch between different configuration settings,
  - without the computational resources needed for inference and evaluation.
- Easy-to-evaluate pretrained models using checkpoint files
  - trained with various frameworks, such as FlagAI, OpenCLIP, or Transformers,
  - or with any customized framework, via model scripts,
  - on specific datasets, languages, and tasks.
- Easy-to-build an evaluation framework from scratch, including
  - download and preparation of datasets and models,
  - evaluation with various configuration settings,
  - and visualization of the evaluation results.
To use mCLIPEval, you need:

- PyTorch version >= 1.8.0
- Python version >= 3.8
- For evaluating models on GPUs, you also need to install CUDA and NCCL (a quick environment check is sketched below)
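
To confirm that these requirements are met before running any evaluation, a minimal sanity check along these lines can help (this snippet is only an illustration and is not part of mCLIPEval):

```python
# Minimal environment check: report Python/PyTorch versions and CUDA visibility.
import sys
import torch

print(f"Python:  {sys.version.split()[0]}")    # should be >= 3.8
print(f"PyTorch: {torch.__version__}")         # should be >= 1.8.0
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```
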
- [Recommended] For complete usage, you are recommended to install the required packages through:

  ```bash
  pip install -r requirements.txt
  ```

- [Optional] To use datasets from Hugging Face (`imagenet1k` in any language, `winoground`), you need to:
  - generate a Hugging Face API token (select the "read" role) following the instructions;
  - run the command below and add the generated token as a git credential (a programmatic alternative is sketched at the end of this item):

    ```bash
    huggingface-cli login
    ```

    or modify the `download/constants.py` file with the generated token:

    ```python
    _HUGGINGFACE_AUTH_TOKEN = "hf_..."
    ```

  - click the `Agree and access repository` button on the dataset pages (imagenet-1k and winoground) to accept the license agreements of the datasets.
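
  If you prefer to set the token from Python rather than via `huggingface-cli login`, the `huggingface_hub` package (if installed) exposes a `login` helper. The snippet below is only a sketch; the token value is a placeholder for your own "read" token:

  ```python
  # Programmatic alternative to `huggingface-cli login` (illustrative sketch).
  from huggingface_hub import login

  login(token="hf_...")  # placeholder: use your own Hugging Face "read" token
  ```
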
- [Optional] To use datasets from Kaggle (`fer2013`, `flickr30k`, `flickr30k_cn`, `multi30k`), you need to:
  - generate an API token from Kaggle following the instructions (see the sketch at the end of this item for where to place the token file);
  - install the `unzip` command: for Debian/Ubuntu Linux, use `sudo apt-get install unzip`; for CentOS/RHEL Linux, use `yum install unzip`; for macOS, use `brew install unzip`.
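
  The official Kaggle API looks for the downloaded `kaggle.json` token in `~/.kaggle/` (or for the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables). As a small illustrative sketch, the token file can be put in place from Python like this:

  ```python
  # Illustrative sketch: move a downloaded kaggle.json to where the Kaggle API expects it.
  import shutil
  from pathlib import Path

  token_src = Path("kaggle.json")            # wherever the token was downloaded
  kaggle_dir = Path.home() / ".kaggle"
  kaggle_dir.mkdir(exist_ok=True)
  shutil.copy(token_src, kaggle_dir / "kaggle.json")
  (kaggle_dir / "kaggle.json").chmod(0o600)  # keep the credentials private
  ```
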
- [Partial] For partial usage, you need to install the required packages:
  - [Necessary] for basic evaluation: `pip install torch torchvision scikit-learn`
  - [Optional] for data preparation: `pip install -r requirements_dataset.txt`
  - [Optional] for usage of pretrained models: `pip install -r requirements_models.txt`
  - [Optional] for visualization: `pip install -r requirements_visual.txt`
The complete usage of mCLIPEval consists of three standalone modules: data preparation, evaluation, and visualization.
Module | Entry | Function | Documentation |
---|---|---|---|
Data Preparation | download.py | Download datasets and organize the data directories properly for evaluation | Data Doc |
Evaluation | evaluate.py | Evaluate a model on selected datasets and output results in a json file | Evaluation Doc |
Visualization | visual.py | Visualize the evaluation results through an interactive web app | Visualization Doc |
- To immediately see the comparison results of built-in open-source models, we provide `outputs` with already-computed evaluation results. You just need to run:

  ```bash
  streamlit run visual.py -- --json="outputs/*.json"
  ```

- To evaluate a pretrained model with checkpoint files, you need to:
  - specify the model script, for example `models/altclip.py`;
  - choose the evaluation datasets (for example, `cifar10` is a classical image classification dataset);
  - download and prepare the datasets with:

    ```bash
    python download.py --datasets=cifar10
    ```

  - evaluate the pretrained model in the directory `[MODEL_DIR]`:

    ```bash
    python evaluate.py --model_name=[MODEL_NAME] --model_dir=[MODEL_DIR] --model_script=models.altclip --datasets=cifar10 --output=[MODEL_NAME].json
    ```

  - the evaluation results are saved in the `[MODEL_NAME].json` file (a quick way to inspect it is sketched after this list);
  - [Tips] if the `--datasets` parameter is not specified, all supported datasets are chosen (the process of data preparation and evaluation would take a long time).
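
For details of the output format, see the Evaluation Doc. As a quick illustrative sketch (the file name is whatever `--output` value you used above), the results can be inspected with standard-library JSON tools:

```python
# Illustrative sketch: pretty-print the evaluation results written by evaluate.py.
import json

output_file = "[MODEL_NAME].json"  # replace with the --output path you used
with open(output_file, encoding="utf-8") as f:
    results = json.load(f)

print(json.dumps(results, indent=2, ensure_ascii=False))
```
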
Function | Description |
---|---|
Multi-datasets Preparation | Download and prepare the datasets specified by names |
Full-datasets Preparation | Download and prepare all supported datasets |
Specified-datasets Evaluation | Evaluation on specified datasets |
Specified-languages Evaluation | Evaluation on specified languages |
Specified-tasks Evaluation | Evaluation on specified tasks |
Built-in Model Evaluation | Evaluate a built-in model specified by name |
Pretrained checkpoint Evaluation | Evaluate a pretrained model with supported pretraining framework and checkpoint directories |
Customized-model Evaluation | Evaluate a customized model with customized pretraining framework |
Visualization | Visualize the evaluation result json/jsonl files |
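
For the Customized-model Evaluation listed above, the interface that `evaluate.py` expects from a model script is defined by the Evaluation Doc and the built-in scripts such as `models/altclip.py`, which should be used as the reference. Purely as an illustration of the idea (all class and method names below are hypothetical, not the actual mCLIPEval API), a customized model script typically wraps a pretrained model so that images and texts can be encoded into a shared embedding space:

```python
# Hypothetical sketch of a customized model wrapper; the real interface is
# defined by mCLIPEval's model scripts (e.g. models/altclip.py), not by this code.
from typing import List

import torch


class MyCustomCLIPWrapper:
    """Hypothetical wrapper around a customized vision-language checkpoint."""

    def __init__(self, model_dir: str, device: str = "cuda"):
        # Load the customized framework's checkpoint, tokenizer, and preprocessing here.
        self.device = device
        self.model = self._load_checkpoint(model_dir).to(device).eval()

    def _load_checkpoint(self, model_dir: str) -> torch.nn.Module:
        # Placeholder: load and return your model from `model_dir`.
        raise NotImplementedError

    @torch.no_grad()
    def encode_images(self, images: torch.Tensor) -> torch.Tensor:
        # Preprocessed image batch -> L2-normalized image embeddings.
        feats = self.model.encode_image(images.to(self.device))
        return feats / feats.norm(dim=-1, keepdim=True)

    @torch.no_grad()
    def encode_texts(self, texts: List[str]) -> torch.Tensor:
        # Raw caption strings -> L2-normalized text embeddings.
        tokens = self.model.tokenize(texts)  # tokenization is model-specific
        feats = self.model.encode_text(tokens.to(self.device))
        return feats / feats.norm(dim=-1, keepdim=True)
```
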
Dataset Names | Languages | Task | Instructions |
---|---|---|---|
imagenet1k | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenet1k, imagenet1k_cn, imagenet1k_jp, imagenet1k_it |
imagenet-a | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenet-a, imagenet-a_cn, imagenet-a_jp, imagenet-a_it |
imagenet-r | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenet-r, imagenet-r_cn, imagenet-r_jp, imagenet-r_it |
imagenet-sketch | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenet-sketch, imagenet-sketch_cn, imagenet-sketch_jp, imagenet-sketch_it |
imagenetv2 | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenetv2, imagenetv2_cn, imagenetv2_jp, imagenetv2_it |
objectnet | EN | Image Classification | objectnet |
torchvision datasets | EN | Image Classification, OCR, Geo-Localization | including: caltech101, cars, cifar10, cifar100, country211, dtd, eurosat, fer2013, fgvc-aircraft, flowers, food101, gtsrb, mnist, objectnet, pcam, pets, renderedsst2, resisc45, stl10, sun397, voc2007, voc2007_multilabel |
winoground | EN | Image-Text Compositionality | visio-linguistic compositional reasoning, winoground |
mscoco_captions | EN/CN | Image-Text Retrieval | apart from English version, there are also 1k/5k Chinese translation version with different splits, including mscoco_captions, mscoco_captions_cn_1k, mscoco_captions_cn_5k |
xtd | EN/CN/DE/ES/FR/IT/JP/KO/PL/RU/TR | Image-Text Retrieval | multilingual translation of MSCOCO captions, with the same image test splits, including xtd_en, xtd_de, xtd_es, xtd_fr, xtd_it, xtd_jp, xtd_ko, xtd_pl, xtd_ru, xtd_tr, xtd_zh |
flickr_30k | EN/CN | Image-Text Retrieval | apart from English version, there is also Chinese translation version with different splits, including flickr30k, flickr30k_cn |
multi30k | EN/FR/DE/CS | Image-Text Retrieval | multilingual translation of Flickr30k captions, with the same image test splits, including multi30k_en, multi30k_fr, multi30k_de, multi30k_cs |
birdsnap | EN | Image Classification | download not supported, birdsnap |
kinetics | EN | Video Action Recognition | download not supported, including kinetics400, kinetics600, kinetics700 |
ucf101 | EN | Video Action Recognition | download not supported, including ucf101 |
Model Name | Text Encoder | Vision Encoder | Description |
---|---|---|---|
openai-clip-L | CLIP-L | VIT-L | openai's CLIP with Vit-L-14 as image encoder and the default transformer as text encoder [paper] [github] |
openai-clip-L-336 | CLIP-L | VIT-L | the same as openai-clip-L, but with an input image size of 336*336 |
openclip-L | CLIP-L | VIT-L | openclip's implementation of openai-clip-L, trained with laion 2B data [github] |
openclip-L-v0 | CLIP-L | VIT-L | openclip's implementation of openai-clip-L, trained with laion 400m data |
openclip-H | CLIP-H | VIT-H | openclip's pretrained model with Vit-H-14 as vision encoder |
openclip-H-XLMR-L | XLMR-L | VIT-H | openclip's pretrained model with Vit-H-14 as vision encoder and XLMR-Large as text encoder |
openclip-B-XLMR-B | XLMR-B | VIT-B | openclip's pretrained model with Vit-B-32 as vision encoder and XLMR-Base as text encoder |
cn-clip-L | RoBERTa-wwm-L | VIT-L | damo's Chinese CLIP model with Vit-L-14 as image encoder and RoBERTa-wwm-Large as text encoder [paper] [github] |
cn-clip-L-336 | RoBERTa-wwm-L | VIT-L | the same as cn-clip-L, but with an input image size of 336*336 |
M-CLIP | XLMR-L | VIT-L | RISE's multilingual clip model with Vit-L-14 as image encoder and XLMR-Large as text encoder [paper] [github] |
AltCLIP-XLMR-L | XLMR-L | VIT-L | BAAI's bilingual clip model with Vit-L-14 as image encoder and XLMR-Large as text encoder [paper] [github] |
AltCLIP-XLMR-L-m9 | XLMR-L | VIT-L | BAAI's multilingual clip model with Vit-L-14 as image encoder and XLMR-Large as text encoder |
eva-clip | CLIP-L | eva VIT-g | BAAI's clip model with the pretrained eva (size Vit-g-14) as image encoder and the default transformer as text encoder [paper] [github] |
Taiyi-CLIP-L | RoBERTa-wwm-L | VIT-L | IDEA-CCNL's Chinese CLIP model with Vit-L-14 as image encoder and RoBERTa-wwm-Large as text encoder [paper] [github] |
Thanks for your interest in contributing! Apart from regular commits, we also welcome contributions to resources (datasets, models, tasks).
- Thanks to the CLIP Benchmark authors; the zero-shot classification and retrieval code and the dataset building code are adapted from there.
- Thanks to the CLIP, LiT, SLIP, imagenet_classes_chinese, japanese_clip, and clip-italian authors for the original zero-shot templates and multilingual classnames.
- Thanks to the Winoground, ImageNet-Sketch, ImageNet-A, and ImageNet-R authors for providing the original datasets.
- Thanks to the COCO-CN and Flickr30k CN authors for providing the original datasets.
The majority of mCLIPEval is licensed under the Apache 2.0 license; however, portions of the project are available under separate license terms:

- The usage of CLIP_benchmark is licensed under the MIT license
- The usage of the ImageNet1k dataset is under the huggingface datasets license and the ImageNet license