
Transductive zero-shot and few-shot CLIP

This GitHub repository contains the code from our CVPR 2024 paper, in which we tackle zero-shot and few-shot classification with the vision-language model CLIP. Our approach classifies batches of unlabeled images jointly, improving accuracy over traditional methods that treat each image independently. We build a new classification framework that operates on probability features, together with an optimization procedure that mimics the Expectation-Maximization algorithm. On zero-shot tasks with test batches of 75 samples, our methods EM-Dirichlet and Hard EM-Dirichlet yield a near-20% improvement in ImageNet accuracy over CLIP's zero-shot performance.

1. Getting started

1.1. Requirements

  • torch 1.13.1 (or later)
  • torchvision
  • tqdm
  • numpy
  • pillow
  • pyyaml
  • scipy
  • clip

1.2. Download datasets and splits

To download the datasets and splits, we follow the instructions given in the GitHub repository of TIP-Adapter. We use the train/val/test splits from CoOp's GitHub for all datasets except ImageNet, where the validation set is used as the test set. The splits for ImageNet can be downloaded at https://drive.google.com/drive/fol .

The downloaded datasets should be placed in the data/ folder as follows:

.
├── ...
├── data           
│   ├── food101       
│   ├── eurosat       
│   ├── dtd       
│   ├── oxfordpets       
│   ├── flowers101     
│   ├── caltech101      
│   ├── ucf101       
│   ├── fgvcaircraft                
│   ├── stanfordcars      
│   ├── sun397        
│   └── imagenet               
└── ...

1.3. Extracting and saving the features

For a fixed temperature ($T=30$ recommended), we extract and save the softmax features defined as

$$z_n = \text{softmax}\!\left(T \cos\!\left(f_{\text{im}}(x_n), f_{\text{text}}(t_k)\right)\right).$$

To do so, run bash scripts/extract_softmax_features.sh. For instance, for the dataset eurosat, the temperature T=30, and the backbone RN50, the features will be saved under

eurosat
├── saved_features
│   ├── test_softmax_RN50_T30.plk
│   ├── val_softmax_RN50_T30.plk
│   └── train_softmax_RN50_T30.plk
└── ...
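
For intuition, here is a minimal sketch of how such a softmax feature can be computed for a single image with the openai clip package. The class names, prompt template, and image path below are placeholders; the actual extraction (dataset loading, batching, and saving) is handled by scripts/extract_softmax_features.sh.

```python
# Illustrative sketch (not the repository's extraction script): computing the
# softmax feature z_n for one image. Class names, the prompt template, and the
# image path are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

T = 30  # temperature recommended in the paper
class_names = ["annual crop", "forest", "river"]  # placeholder class list
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    f_im = model.encode_image(image)      # image embedding f_im(x_n)
    f_text = model.encode_text(prompts)   # text embeddings f_text(t_k)

# Cosine similarities, then a temperature-scaled softmax over the classes.
f_im = f_im / f_im.norm(dim=-1, keepdim=True)
f_text = f_text / f_text.norm(dim=-1, keepdim=True)
z_n = (T * f_im @ f_text.t()).softmax(dim=-1)  # shape (1, num_classes)
```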

Alternatively, to reproduce the comparisons in the paper, you can also compute the visual embeddings directly by running bash scripts/extract_visual_features.sh.

Extracting the features can be time-consuming, but once it is done, the methods run quite efficiently.

2. Reproducing the zero-shot results

You can reproduce the results displayed in Table 1 in the paper by using the config/main_config.yaml file. Small variations in the results may be observed due to the randomization of the tasks.

The zero-shot methods are EM-Dirichlet (em_dirichlet), Hard EM-Dirichlet (hard_em_dirichlet), Hard K-means (hard_kmeans), Soft K-means (soft_kmeans), EM-Gaussian (identity covariance) (em_gaussian), EM-Gaussian (diagonal covariance) (em_dirichlet_cov), and KL K-means (kl_kmeans).
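
As background, EM-Dirichlet and Hard EM-Dirichlet model the softmax features, which live on the probability simplex, with Dirichlet distributions. The sketch below evaluates a standard Dirichlet log-density with SciPy; it is shown for intuition only and is not the repository's implementation, and the feature and concentration values are arbitrary.

```python
# Background sketch: log-density of a Dirichlet distribution evaluated at a
# softmax feature z (a point on the probability simplex). Standard formula,
# shown for intuition; not the repository's implementation.
import numpy as np
from scipy.special import gammaln

def dirichlet_log_pdf(z, alpha):
    """log p(z | alpha) = log Gamma(sum_k alpha_k) - sum_k log Gamma(alpha_k)
                          + sum_k (alpha_k - 1) log z_k"""
    z = np.asarray(z, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    log_norm = gammaln(alpha.sum()) - gammaln(alpha).sum()
    return log_norm + ((alpha - 1.0) * np.log(z)).sum()

# Example: a 3-class softmax feature and an arbitrary concentration parameter.
z = np.array([0.7, 0.2, 0.1])
alpha = np.array([5.0, 2.0, 1.5])
print(dirichlet_log_pdf(z, alpha))
```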

The methods can be tested either on the softmax features, by setting use_softmax_features=True, or on the visual features, by setting use_softmax_features=False.

For example, to run the method EM-Dirichlet on Caltech101 on 1000 realistic transductive zero-shot tasks:

python main.py --opts shots 0 dataset caltech101 batch_size 100 number_tasks 1000 use_softmax_feature True

3. Reproducing the few-shot results

You can reproduce the results displayed in Table 2 in the paper by using the config/main_config.yaml file. Small variations in the results may be observed due to the randomization of the tasks.

The few-shot methods are EM-Dirichlet (em_dirichlet), Hard EM-Dirichlet (hard_em_dirichlet), $\alpha$-TIM (alpha_tim), PADDLE (paddle), Laplacian Shot (laplacian_shot), and BDCSPN (bdcpsn).

The methods that have a hyper-parameter ($\alpha$-TIM, PADDLE, Laplacian Shot, BDCSPN) have been tuned beforehand on the validation set (the results are stored in results_few_shot/val/).

For example, to run the method EM-Dirichlet on Caltech101 on 1000 realistic transductive 4-shot tasks:

python main.py --opts shots 4 dataset caltech101 batch_size 100 number_tasks 1000 use_softmax_feature True

Acknowledgments

This repository was inspired by the publicly available code from the papers Realistic evaluation of transductive few-shot learning and TIP-Adapter.
