The mCLIPEval (multilingual evaluation for CLIP) toolkit provides generic evaluation for pretrained vision-language models (such as CLIP, Contrastive Language–Image Pre-training). More precisely, mCLIPEval provides evaluations:

- on multilingual (12 languages) and monolingual (English/Chinese) datasets, for tasks such as zero-shot image/video classification, zero-shot image-to-text and text-to-image retrieval, and zero-shot visio-linguistic compositionality;
- adapted to various open-source pretrained models, such as FlagAI pretrained models (AltCLIP, EVA-CLIP), OpenCLIP pretrained models, Chinese CLIP models, Multilingual CLIP models, and the Taiyi series of pretrained models. Customized models can also be supported with model scripts.
mCLIPEval provides APIs to quickly download and prepare public datasets from torchvision, Hugging Face, and Kaggle, as well as to download and use open-source pretrained models on these datasets.

mCLIPEval also provides visualization of evaluation results through a Streamlit web app, to compare performances across specific languages, tasks, or model parameters.
Below is a scatter plot showing the performances of some open-source models.
You can use the online demo to experience mCLIPEval.
- Easy-to-compare capabilities of open-source models
  - on specific datasets, languages, tasks, or model parameters,
  - through interactive interfaces that switch between different configuration settings,
  - without the computational resources needed for inference and evaluation.
- Easy-to-evaluate pretrained models using checkpoint files
  - trained with various frameworks, such as FlagAI, OpenCLIP, or Transformers,
  - or with any customized framework, via model scripts,
  - on specific datasets, languages, and tasks.
- Easy-to-build an evaluation framework from scratch, including
  - download and preparation of datasets and models,
  - evaluation with various configuration settings,
  - and visualization of the evaluation results.
To use mCLIPEval, you need:

- PyTorch version >= 1.8.0
- Python version >= 3.8
- For evaluating models on GPUs, you also need to install CUDA and NCCL (a quick environment check is sketched below)
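
To confirm that these requirements are met before running any evaluation, a minimal sanity check along these lines can help (this snippet is only an illustration and is not part of mCLIPEval):

```python
# Minimal environment check: report Python/PyTorch versions and CUDA visibility.
import sys
import torch

print(f"Python:  {sys.version.split()[0]}")    # should be >= 3.8
print(f"PyTorch: {torch.__version__}")         # should be >= 1.8.0
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```
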
- [Recommended] For complete usage, you are recommended to install the required packages through:

  ```bash
  pip install -r requirements.txt
  ```

- [Optional] To use datasets from Hugging Face (`imagenet1k` in any language, `winoground`), you need to:
  - generate a Hugging Face API token (select the "read" role) following the instructions;
  - run the command below and add the generated token as a git credential (a programmatic alternative is sketched at the end of this item):

    ```bash
    huggingface-cli login
    ```

    or modify the `download/constants.py` file with the generated token:

    ```python
    _HUGGINGFACE_AUTH_TOKEN = "hf_..."
    ```

  - click the `Agree and access repository` button on the dataset pages (imagenet-1k and winoground) to accept the license agreements of the datasets.
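
  If you prefer to set the token from Python rather than via `huggingface-cli login`, the `huggingface_hub` package (if installed) exposes a `login` helper. The snippet below is only a sketch; the token value is a placeholder for your own "read" token:

  ```python
  # Programmatic alternative to `huggingface-cli login` (illustrative sketch).
  from huggingface_hub import login

  login(token="hf_...")  # placeholder: use your own Hugging Face "read" token
  ```
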
- [Optional] To use datasets from Kaggle (`fer2013`, `flickr30k`, `flickr30k_cn`, `multi30k`), you need to:
  - generate an API token from Kaggle following the instructions (see the sketch at the end of this item for where to place the token file);
  - install the `unzip` command: for Debian/Ubuntu Linux, use `sudo apt-get install unzip`; for CentOS/RHEL Linux, use `yum install unzip`; for macOS, use `brew install unzip`.
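
  The official Kaggle API looks for the downloaded `kaggle.json` token in `~/.kaggle/` (or for the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables). As a small illustrative sketch, the token file can be put in place from Python like this:

  ```python
  # Illustrative sketch: move a downloaded kaggle.json to where the Kaggle API expects it.
  import shutil
  from pathlib import Path

  token_src = Path("kaggle.json")            # wherever the token was downloaded
  kaggle_dir = Path.home() / ".kaggle"
  kaggle_dir.mkdir(exist_ok=True)
  shutil.copy(token_src, kaggle_dir / "kaggle.json")
  (kaggle_dir / "kaggle.json").chmod(0o600)  # keep the credentials private
  ```
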
- [Partial] For partial usage, you need to install the required packages:
  - [Necessary] for basic evaluation: `pip install torch torchvision scikit-learn`
  - [Optional] for data preparation: `pip install -r requirements_dataset.txt`
  - [Optional] for usage of pretrained models: `pip install -r requirements_models.txt`
  - [Optional] for visualization: `pip install -r requirements_visual.txt`
The complete usage of mCLIPEval consists of three standalone modules: data preparation, evaluation, and visualization.
Module | Entry | Function | Documentation |
---|---|---|---|
Data Preparation | download.py | Download datasets and organize the data directories properly for evaluation | Data Doc |
Evaluation | evaluate.py | Evaluate a model on selected datasets and output results in a json file | Evaluation Doc |
Visualization | visual.py | Visualize the evaluation results through an interactive web app | Visualization Doc |
- To immediately see the comparison results of built-in open-source models, we provide `outputs` with already-computed evaluation results. You just need to run:

  ```bash
  streamlit run visual.py -- --json="outputs/*.json"
  ```

- To evaluate a pretrained model with checkpoint files, you need to:
  - specify the model script, for example `models/altclip.py`;
  - choose the evaluation datasets (for example, `cifar10` is a classical image classification dataset);
  - download and prepare the datasets with:

    ```bash
    python download.py --datasets=cifar10
    ```

  - evaluate the pretrained model in the directory `[MODEL_DIR]`:

    ```bash
    python evaluate.py --model_name=[MODEL_NAME] --model_dir=[MODEL_DIR] --model_script=models.altclip --datasets=cifar10 --output=[MODEL_NAME].json
    ```

  - the evaluation results are saved in the `[MODEL_NAME].json` file (a quick way to inspect it is sketched after this list);
  - [Tips] if the `--datasets` parameter is not specified, all supported datasets are chosen (the process of data preparation and evaluation would take a long time).
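
For details of the output format, see the Evaluation Doc. As a quick illustrative sketch (the file name is whatever `--output` value you used above), the results can be inspected with standard-library JSON tools:

```python
# Illustrative sketch: pretty-print the evaluation results written by evaluate.py.
import json

output_file = "[MODEL_NAME].json"  # replace with the --output path you used
with open(output_file, encoding="utf-8") as f:
    results = json.load(f)

print(json.dumps(results, indent=2, ensure_ascii=False))
```
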
Function | Description |
---|---|
Multi-datasets Preparation | Download and prepare the datasets specified by names |
Full-datasets Preparation | Download and prepare all supported datasets |
Specified-datasets Evaluation | Evaluation on specified datasets |
Specified-languages Evaluation | Evaluation on specified languages |
Specified-tasks Evaluation | Evaluation on specified tasks |
Built-in Model Evaluation | Evaluate a built-in model specified by name |
Pretrained checkpoint Evaluation | Evaluate a pretrained model with supported pretraining framework and checkpoint directories |
Customized-model Evaluation | Evaluate a customized model with customized pretraining framework |
Visualization | Visualize the evaluation result json/jsonl files |
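
For the Customized-model Evaluation listed above, the interface that `evaluate.py` expects from a model script is defined by the Evaluation Doc and the built-in scripts such as `models/altclip.py`, which should be used as the reference. Purely as an illustration of the idea (all class and method names below are hypothetical, not the actual mCLIPEval API), a customized model script typically wraps a pretrained model so that images and texts can be encoded into a shared embedding space:

```python
# Hypothetical sketch of a customized model wrapper; the real interface is
# defined by mCLIPEval's model scripts (e.g. models/altclip.py), not by this code.
from typing import List

import torch


class MyCustomCLIPWrapper:
    """Hypothetical wrapper around a customized vision-language checkpoint."""

    def __init__(self, model_dir: str, device: str = "cuda"):
        # Load the customized framework's checkpoint, tokenizer, and preprocessing here.
        self.device = device
        self.model = self._load_checkpoint(model_dir).to(device).eval()

    def _load_checkpoint(self, model_dir: str) -> torch.nn.Module:
        # Placeholder: load and return your model from `model_dir`.
        raise NotImplementedError

    @torch.no_grad()
    def encode_images(self, images: torch.Tensor) -> torch.Tensor:
        # Preprocessed image batch -> L2-normalized image embeddings.
        feats = self.model.encode_image(images.to(self.device))
        return feats / feats.norm(dim=-1, keepdim=True)

    @torch.no_grad()
    def encode_texts(self, texts: List[str]) -> torch.Tensor:
        # Raw caption strings -> L2-normalized text embeddings.
        tokens = self.model.tokenize(texts)  # tokenization is model-specific
        feats = self.model.encode_text(tokens.to(self.device))
        return feats / feats.norm(dim=-1, keepdim=True)
```
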
Dataset Names | Languages | Task | Instructions |
---|---|---|---|
imagenet1k | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenet1k, imagenet1k_cn, imagenet1k_jp, imagenet1k_it |
imagenet-a | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenet-a, imagenet-a_cn, imagenet-a_jp, imagenet-a_it |
imagenet-r | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenet-r, imagenet-r_cn, imagenet-r_jp, imagenet-r_it |
imagenet-sketch | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenet-sketch, imagenet-sketch_cn, imagenet-sketch_jp, imagenet-sketch_it |
imagenetv2 | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenetv2, imagenetv2_cn, imagenetv2_jp, imagenetv2_it |
objectnet | EN | Image Classification | objectnet |
torchvision datasets | EN | Image Classification, OCR, Geo-Localization | including: caltech101, cars, cifar10, cifar100, country211, dtd, eurosat, fer2013, fgvc-aircraft, flowers, food101, gtsrb, mnist, objectnet, pcam, pets, renderedsst2, resisc45, stl10, sun397, voc2007, voc2007_multilabel |
winoground | EN | Image-Text Compositionality | visio-linguistic compositional reasoning, winoground |
mscoco_captions | EN/CN | Image-Text Retrieval | apart from English version, there are also 1k/5k Chinese translation version with different splits, including mscoco_captions, mscoco_captions_cn_1k, mscoco_captions_cn_5k |
xtd | EN/CN/DE/ES/FR/IT/JP/KO/PL/RU/TR | Image-Text Retrieval | multilingual translation of MSCOCO captions, with the same image test splits, including xtd_en, xtd_de, xtd_es, xtd_fr, xtd_it, xtd_jp, xtd_ko, xtd_pl, xtd_ru, xtd_tr, xtd_zh |
flickr_30k | EN/CN | Image-Text Retrieval | apart from English version, there is also Chinese translation version with different splits, including flickr30k, flickr30k_cn |
multi30k | EN/FR/DE/CS | Image-Text Retrieval | multilingual translation of Flickr30k captions, with the same image test splits, including multi30k_en, multi30k_fr, multi30k_de, multi30k_cs |
birdsnap | EN | Image Classification | download not supported, birdsnap |
kinetics | EN | Video Action Recognition | download not supported, including kinetics400, kinetics600, kinetics700 |
ucf101 | EN | Video Action Recognition | download not supported, including ucf101 |
Model Name | Text Encoder | Vision Encoder | Description |
---|---|---|---|
openai-clip-L | CLIP-L | VIT-L | openai's CLIP with Vit-L-14 as image encoder and the default transformer as text encoder [paper] [github] |
openai-clip-L-336 | CLIP-L | VIT-L | the same as openai-clip-L, but with an input image size of 336*336 |
openclip-L | CLIP-L | VIT-L | openclip's implementation of openai-clip-L, trained with laion 2B data [github] |
openclip-L-v0 | CLIP-L | VIT-L | openclip's implementation of openai-clip-L, trained with laion 400m data |
openclip-H | CLIP-H | VIT-H | openclip's pretrained model with Vit-H-14 as vision encoder |
openclip-H-XLMR-L | XLMR-L | VIT-H | openclip's pretrained model with Vit-H-14 as vision encoder and XLMR-Large as text encoder |
openclip-B-XLMR-B | XLMR-B | VIT-B | openclip's pretrained model with Vit-B-32 as vision encoder and XLMR-Base as text encoder |
cn-clip-L | RoBERTa-wwm-L | VIT-L | damo's Chinese CLIP model with Vit-L-14 as image encoder and RoBERTa-wwm-Large as text encoder [paper] [github] |
cn-clip-L-336 | RoBERTa-wwm-L | VIT-L | the same as cn-clip-L, but with an input image size of 336*336 |
M-CLIP | XLMR-L | VIT-L | RISE's multilingual clip model with Vit-L-14 as image encoder and XLMR-Large as text encoder [paper] [github] |
AltCLIP-XLMR-L | XLMR-L | VIT-L | BAAI's bilingual clip model with Vit-L-14 as image encoder and XLMR-Large as text encoder [paper] [github] |
AltCLIP-XLMR-L-m9 | XLMR-L | VIT-L | BAAI's multilingual clip model with Vit-L-14 as image encoder and XLMR-Large as text encoder |
eva-clip | CLIP-L | eva VIT-g | BAAI's clip model with the pretrained eva (size Vit-g-14) as image encoder and the default transformer as text encoder [paper] [github] |
Taiyi-CLIP-L | RoBERTa-wwm-L | VIT-L | IDEA-CCNL's Chinese CLIP model with Vit-L-14 as image encoder and RoBERTa-wwm-Large as text encoder [paper] [github] |
Thanks for your interest in contributing! Apart from regular commits, we also welcome contributions to resources (datasets, models, tasks).
- Thanks to the CLIP Benchmark authors; the zero-shot classification and retrieval code and the dataset building code are adapted from there.
- Thanks to the CLIP, LiT, SLIP, imagenet_classes_chinese, japanese_clip, and clip-italian authors for the original zero-shot templates and multilingual classnames.
- Thanks to the Winoground, ImageNet-Sketch, ImageNet-A, and ImageNet-R authors for providing the original datasets.
- Thanks to the COCO-CN and Flickr30k CN authors for providing the original datasets.
The majority of mCLIPEval is licensed under the Apache 2.0 license; however, portions of the project are available under separate license terms:

- The usage of CLIP_benchmark is licensed under the MIT license
- The usage of the ImageNet1k dataset is under the huggingface datasets license and the ImageNet license