𝕞CLIP𝔼val

An easy-to-use and easily extensible evaluation toolkit for vision-language models.


The mCLIPEval (multilingual evaluation for CLIP) toolkit provides generic evaluation for pretrained vision-language models (such as CLIP, Contrastive Language–Image Pre-training). More precisely, mCLIPEval provides evaluations:

  • On multilingual (12 languages) and monolingual (English/Chinese) datasets, for tasks like zero-shot image/video classification, zero-shot image-to-text and text-to-image retrieval, and zero-shot visio-linguistic compositionality.

  • Adapted to various open-source pretrained models, like FlagAI pretrained models (AltCLIP, EVA-CLIP), OpenCLIP pretrained models, Chinese CLIP models, Multilingual CLIP models, and Taiyi series pretrained models. Customized models can also be supported with model scripts.

mCLIPEval provides APIs to quickly download and prepare public datasets from torchvision, huggingface, and kaggle, as well as to download and run those open-source pretrained models on these datasets.

mCLIPEval provides visualization of evaluation results through a streamlit web app, to compare performance across specific languages, tasks, or model parameters.

Below is a scatter plot comparing the performance of some open-source models.

[Figure: snapshot0.png]


Online Demo

You can use the online demo to experience mCLIPEval.

With mCLIPEval, you can ...

  1. Easily compare the capabilities of open-source models

    • on specific datasets, languages, tasks, and model parameters
    • through interactive interfaces that let you switch between different configuration settings
    • without needing computational resources for inference and evaluation.
  2. Easily evaluate pretrained models from checkpoint files

    • trained with various frameworks, like FlagAI, OpenCLIP, or Transformers
    • or with any customized framework, given a model script
    • on specific datasets, languages, and tasks.
  3. Easily build an evaluation pipeline from scratch

    • including download and preparation of datasets and models,
    • evaluation with various configuration settings,
    • and visualization of the evaluation results.

Guide

Requirements Installation

To use mCLIPEval, you need:

  • PyTorch version >= 1.8.0
  • Python version >= 3.8
  • For evaluating models on GPUs, you'll also need to install CUDA and NCCL

[Recommended] For complete usage, install the required packages with:

pip install -r requirements.txt
  • [Optional] To use datasets from huggingface (imagenet1k in any language, winoground), you need to:

      1. generate a huggingface API TOKEN (select the role "read") from huggingface following the instructions;
      2. run huggingface-cli login and add the generated token as a git credential, or modify the download/constants.py file with the generated token: _HUGGINGFACE_AUTH_TOKEN = "hf_...";
      3. click the Agree and access repository button on the dataset pages (imagenet-1k and winoground) to accept the license agreements of the datasets.

  • [Optional] To use datasets from kaggle (fer2013, flickr30k, flickr30k_cn, multi30k), you need to:

      1. generate an API token from kaggle following the instructions (a sketch for placing the token file follows this list);
      2. install the unzip command: for Debian/Ubuntu Linux, use sudo apt-get install unzip; for CentOS/RHEL Linux, use yum install unzip; for macOS, use brew install unzip.
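
The Kaggle API expects the token file at ~/.kaggle/kaggle.json. Here is a minimal sketch for placing it there programmatically, assuming you have downloaded kaggle.json into the current working directory (this snippet is our own illustration, not part of mCLIPEval):

    # Move a downloaded kaggle.json to the location expected by the Kaggle API
    # (~/.kaggle/kaggle.json) and restrict its permissions.
    import shutil
    from pathlib import Path

    kaggle_dir = Path.home() / ".kaggle"
    kaggle_dir.mkdir(exist_ok=True)
    shutil.move("kaggle.json", str(kaggle_dir / "kaggle.json"))
    (kaggle_dir / "kaggle.json").chmod(0o600)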

[Partial] For partial usage, install only the packages you need:

  • [Necessary] for basic evaluation: pip install torch torchvision scikit-learn

  • [Optional] for data preparation: pip install -r requirements_dataset.txt

  • [Optional] for usage of pretrained models: pip install -r requirements_models.txt

  • [Optional] for visualization: pip install -r requirements_visual.txt

How to use?

Structure

The complete use of mCLIPEval consists of three standalone modules: data preparation, evaluation, and visualization.

| Module | Entry | Function | Documentation |
| --- | --- | --- | --- |
| Data Preparation | download.py | Download datasets and organize the data directories properly for evaluation | Data Doc |
| Evaluation | evaluate.py | Evaluate a model on selected datasets and output results in a json file | Evaluation Doc |
| Visualization | visual.py | Visualize the evaluation results through an interactive web app | Visualization Doc |

Quick tour

  • To immediately see the comparison results of built-in open-source models, we provide previously-run evaluation results in the outputs/ directory. You just need to run:

    streamlit run visual.py -- --json="outputs/*.json"
    
  • To evaluate a pretrained model with checkpoint files, you need to:

    • specify the model script, for example models/altclip.py
    • choose the evaluation datasets (for example, cifar10, a classic image classification dataset)
    • download and prepare the datasets with:
      python download.py --datasets=cifar10
      
    • evaluate the pretrained model in the directory [MODEL_DIR]:
      python evaluate.py --model_name=[MODEL_NAME] --model_dir=[MODEL_DIR] --model_script=models.altclip --datasets=cifar10 --output=[MODEL_NAME].json
      
    • the evaluation results are saved in the [MODEL_NAME].json file (see the sketch after this list for inspecting it)
    • [Tip] if the parameter --datasets is not specified, all supported datasets are selected (data preparation and evaluation will then take a long time).
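
The exact contents of the results file depend on the datasets and metrics chosen; as a minimal sketch, assuming evaluate.py wrote a single JSON object to the path given via --output, you can inspect it like this:

    # Load and pretty-print the evaluation results produced by evaluate.py.
    import json

    with open("MODEL_NAME.json") as f:  # replace with the path passed to --output
        results = json.load(f)

    print(json.dumps(results, indent=2, ensure_ascii=False))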

Advanced usage examples

| Function | Description |
| --- | --- |
| Multi-datasets Preparation | Download and prepare the datasets specified by names |
| Full-datasets Preparation | Download and prepare all supported datasets |
| Specified-datasets Evaluation | Evaluation on specified datasets |
| Specified-languages Evaluation | Evaluation on specified languages |
| Specified-tasks Evaluation | Evaluation on specified tasks |
| Built-in Model Evaluation | Evaluate a built-in model specified by name |
| Pretrained checkpoint Evaluation | Evaluate a pretrained model with a supported pretraining framework and checkpoint directories |
| Customized-model Evaluation | Evaluate a customized model with a customized pretraining framework (see the sketch below) |
| Visualization | Visualize the evaluation result json/jsonl files |
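
For customized models, the actual wrapper interface and registration mechanism are described in the Evaluation Doc, with models/altclip.py as a working reference. Purely as an illustration of the kind of wrapper a model script provides, here is a rough sketch; the class name, constructor arguments, and method names below are our own assumptions, not the toolkit's API:

    # Hypothetical customized model wrapper: exposes text/image encoders that
    # return L2-normalized embeddings, which is what CLIP-style zero-shot
    # classification and retrieval need. Consult the Evaluation Doc and
    # models/altclip.py for the real interface expected by evaluate.py.
    import torch


    class MyCustomCLIPWrapper:
        def __init__(self, model, tokenizer, image_transform, device="cpu"):
            self.model = model.to(device).eval()
            self.tokenizer = tokenizer               # callable: list[str] -> token tensor
            self.image_transform = image_transform   # preprocessing for PIL images
            self.device = device

        @torch.no_grad()
        def encode_text(self, texts):
            tokens = self.tokenizer(texts).to(self.device)
            features = self.model.encode_text(tokens)
            return features / features.norm(dim=-1, keepdim=True)

        @torch.no_grad()
        def encode_image(self, images):
            features = self.model.encode_image(images.to(self.device))
            return features / features.norm(dim=-1, keepdim=True)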

Datasets and Models

Supported Datasets

| Dataset Names | Languages | Task | Instructions |
| --- | --- | --- | --- |
| imagenet1k | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenet1k, imagenet1k_cn, imagenet1k_jp, imagenet1k_it |
| imagenet-a | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenet-a, imagenet-a_cn, imagenet-a_jp, imagenet-a_it |
| imagenet-r | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenet-r, imagenet-r_cn, imagenet-r_jp, imagenet-r_it |
| imagenet-sketch | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenet-sketch, imagenet-sketch_cn, imagenet-sketch_jp, imagenet-sketch_it |
| imagenetv2 | EN/CN/JP/IT | Image Classification | multilingual classnames and prompts, including imagenetv2, imagenetv2_cn, imagenetv2_jp, imagenetv2_it |
| objectnet | EN | Image Classification | objectnet |
| torchvision datasets | EN | Image Classification, OCR, Geo-Localization | including: caltech101, cars, cifar10, cifar100, country211, dtd, eurosat, fer2013, fgvc-aircraft, flowers, food101, gtsrb, mnist, objectnet, pcam, pets, renderedsst2, resisc45, stl10, sun397, voc2007, voc2007_multilabel |
| winoground | EN | Image-Text Compositionality | visio-linguistic compositional reasoning, winoground |
| mscoco_captions | EN/CN | Image-Text Retrieval | apart from the English version, there are also 1k/5k Chinese translation versions with different splits, including mscoco_captions, mscoco_captions_cn_1k, mscoco_captions_cn_5k |
| xtd | EN/CN/DE/ES/FR/IT/JP/KO/PL/RU/TR | Image-Text Retrieval | multilingual translations of MSCOCO captions, with the same image test splits, including xtd_en, xtd_de, xtd_es, xtd_fr, xtd_it, xtd_jp, xtd_ko, xtd_pl, xtd_ru, xtd_tr, xtd_zh |
| flickr_30k | EN/CN | Image-Text Retrieval | apart from the English version, there is also a Chinese translation version with different splits, including flickr30k, flickr30k_cn |
| multi30k | EN/FR/DE/CS | Image-Text Retrieval | multilingual translations of Flickr30k captions, with the same image test splits, including multi30k_en, multi30k_fr, multi30k_de, multi30k_cs |
| birdsnap | EN | Image Classification | download not supported, birdsnap |
| kinetics | EN | Video Action Recognition | download not supported, including kinetics400, kinetics600, kinetics700 |
| ucf101 | EN | Video Action Recognition | download not supported, including ucf101 |

Built-in Models

| Model Name | Text Encoder | Vision Encoder | Description |
| --- | --- | --- | --- |
| openai-clip-L | CLIP-L | ViT-L | openai's CLIP with ViT-L-14 as image encoder and the default transformer as text encoder [paper] [github] |
| openai-clip-L-336 | CLIP-L | ViT-L | the same as openai-clip-L, but with an input image size of 336*336 |
| openclip-L | CLIP-L | ViT-L | openclip's implementation of openai-clip-L, trained on LAION-2B data [github] |
| openclip-L-v0 | CLIP-L | ViT-L | openclip's implementation of openai-clip-L, trained on LAION-400M data |
| openclip-H | CLIP-H | ViT-H | openclip's pretrained model with ViT-H-14 as vision encoder |
| openclip-H-XLMR-L | XLMR-L | ViT-H | openclip's pretrained model with ViT-H-14 as vision encoder and XLMR-Large as text encoder |
| openclip-B-XLMR-B | XLMR-B | ViT-B | openclip's pretrained model with ViT-B-32 as vision encoder and XLMR-Base as text encoder |
| cn-clip-L | RoBERTa-wwm-L | ViT-L | DAMO's Chinese CLIP model with ViT-L-14 as image encoder and RoBERTa-wwm-Large as text encoder [paper] [github] |
| cn-clip-L-336 | RoBERTa-wwm-L | ViT-L | the same as cn-clip-L, but with an input image size of 336*336 |
| M-CLIP | XLMR-L | ViT-L | RISE's multilingual CLIP model with ViT-L-14 as image encoder and XLMR-Large as text encoder [paper] [github] |
| AltCLIP-XLMR-L | XLMR-L | ViT-L | BAAI's bilingual CLIP model with ViT-L-14 as image encoder and XLMR-Large as text encoder [paper] [github] |
| AltCLIP-XLMR-L-m9 | XLMR-L | ViT-L | BAAI's multilingual CLIP model with ViT-L-14 as image encoder and XLMR-Large as text encoder |
| eva-clip | CLIP-L | EVA ViT-g | BAAI's CLIP model with the pretrained EVA (ViT-g-14) as image encoder and the default transformer as text encoder [paper] [github] |
| Taiyi-CLIP-L | RoBERTa-wwm-L | ViT-L | IDEA-CCNL's Chinese CLIP model with ViT-L-14 as image encoder and RoBERTa-wwm-Large as text encoder [paper] [github] |

Contributing

Thanks for your interest in contributing! Apart from regular commits, we also welcome contributions of resources (datasets, models, tasks).

Credits

License

The majority of mCLIPEval is licensed under the Apache 2.0 license; however, portions of the project are available under separate license terms: