Official PyTorch implementation of "EMOv2: Pushing 5M Vision Model Frontier", an extended version of "Rethinking Mobile Block for Efficient Attention-based Models" (ICCV'23).
Abstract: This paper focuses on developing parameter-efficient and lightweight models for dense prediction while trading off parameters, FLOPs, and performance, exploring the potential of 5M-magnitude lightweight models on various downstream tasks. The Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterpart has been recognized for attention-based designs. This work rethinks the lightweight infrastructure of the efficient IRB and the effective components of the Transformer from a unified perspective, extending the CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following a neat but effective design criterion, we deduce a modern Improved Inverted Residual Mobile Block (i2RMB) and build a hierarchical Efficient MOdel (EMOv2) with no elaborate, complex structures. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth, and ensuring model performance, this paper investigates the performance upper limit of lightweight models with a magnitude of 5M. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods: e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1 accuracy, significantly surpassing equal-order CNN-/attention-based models, while EMOv2-5M achieves 41.5 mAP with RetinaNet on the high-resolution detection task, surpassing the previous EMO-5M by +2.6↑.
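For intuition, here is a minimal PyTorch sketch of the one-residual MMBlock abstraction described above: expand, apply an efficient token mixer (a depth-wise convolution yields the CNN/IRB instantiation; enabling self-attention yields the attention-based counterpart), then project back, all inside a single residual. Class and argument names are illustrative, not the repository's actual modules.

```python
import torch
import torch.nn as nn

class MetaMobileBlock(nn.Module):
    """Illustrative one-residual block: expand -> token mixer -> project."""
    def __init__(self, dim, expand_ratio=4, use_attn=False, num_heads=4, kernel_size=5):
        super().__init__()
        hidden = int(dim * expand_ratio)
        self.norm = nn.BatchNorm2d(dim)
        self.expand = nn.Conv2d(dim, hidden, 1, bias=False)
        # Depth-wise conv is the CNN (IRB-style) efficient operator; an
        # optional attention mixer gives the attention-based counterpart.
        self.dw = nn.Conv2d(hidden, hidden, kernel_size,
                            padding=kernel_size // 2, groups=hidden, bias=False)
        self.attn = (nn.MultiheadAttention(hidden, num_heads, batch_first=True)
                     if use_attn else None)
        self.act = nn.SiLU()
        self.project = nn.Conv2d(hidden, dim, 1, bias=False)

    def forward(self, x):
        shortcut = x
        x = self.expand(self.norm(x))
        if self.attn is not None:
            b, c, h, w = x.shape
            t = x.flatten(2).transpose(1, 2)           # (B, HW, C) tokens
            t, _ = self.attn(t, t, t)
            x = x + t.transpose(1, 2).reshape(b, c, h, w)
        x = self.act(self.dw(x))
        return shortcut + self.project(x)              # the single residual
```

`MetaMobileBlock(64)` behaves like an IRB, while `MetaMobileBlock(64, use_attn=True)` additionally mixes tokens with attention, which is the direction i2RMB pushes.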
Image Classification on ImageNet-1K (†: with knowledge distillation; *: with a stronger training strategy):
Model | #Params | FLOPs | Resolution | Top-1 | Log |
---|---|---|---|---|---|
EMOv2-1M | 1.4M | 285M | 224 x 224 | 72.3 | log |
EMOv2-1M† | 1.4M | 285M | 224 x 224 | 73.5 | log |
EMOv2-2M | 2.3M | 487M | 224 x 224 | 75.8 | log |
EMOv2-2M† | 2.3M | 487M | 224 x 224 | 76.7 | log |
EMOv2-5M | 5.1M | 1035M | 224 x 224 | 79.4 | log |
EMOv2-5M† | 5.1M | 1035M | 224 x 224 | 80.9 | log |
EMOv2-5M* | 5.1M | 5627M | 512 x 512 | 82.9 | log |
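The #Params and FLOPs columns can be cross-checked with the profiling tools installed below (`torchprofile`, `fvcore`). A minimal sketch; `resnet18` is a stand-in, so substitute the EMOv2 model name this repo registers, and note that `torchprofile` reports multiply-accumulates (MACs), which lightweight-model papers commonly list as FLOPs:

```python
import torch
import timm
from torchprofile import profile_macs
from fvcore.nn import parameter_count

model = timm.create_model('resnet18').eval()   # placeholder architecture
x = torch.randn(1, 3, 224, 224)                # matches the 224 x 224 rows
with torch.no_grad():
    macs = profile_macs(model, x)              # multiply-accumulate count
params = parameter_count(model)['']            # '' keys the model total
print(f'#Params: {params / 1e6:.1f}M  MACs: {macs / 1e6:.0f}M')
```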
Object Detection Performance on COCO2017:
SSDLite:
Backbone | #Params (M) | Resolution | FLOPs | mAP | mAP50 | mAP75 | mAPS | mAPM | mAPL | Log |
---|---|---|---|---|---|---|---|---|---|---|
EMOv2-1M | 2.4 | 300×300 | 0.7G | 22.3 | 37.5 | 22.4 | 2.0 | 21.3 | 43.4 | log |
EMOv2-1M | 2.4 | 512×512 | 2.3G | 26.6 | 44.4 | 27.5 | 7.3 | 31.4 | 43.0 | log |
EMOv2-2M | 3.3 | 300×300 | 1.2G | 26.0 | 43.0 | 26.5 | 3.6 | 26.6 | 50.2 | log |
EMOv2-2M | 3.3 | 512×512 | 4.0G | 30.7 | 49.8 | 31.7 | 9.9 | 37.1 | 47.3 | log |
EMOv2-5M | 6.0 | 300×300 | 2.4G | 29.6 | 47.6 | 30.1 | 5.5 | 32.2 | 54.8 | log |
EMOv2-5M | 6.0 | 512×512 | 8.0G | 34.8 | 54.7 | 36.4 | 13.7 | 42.0 | 52.0 | log |
EMOv2-20M | 21.2 | 300×300 | 9.1G | 33.1 | 51.9 | 33.9 | 8.9 | 36.8 | 57.3 | log |
EMOv2-20M | 21.2 | 512×512 | 30.3G | 38.3 | 58.4 | 40.7 | 17.9 | 45.2 | 54.6 | log |
RetinaNet:
Backbone | #Params (M) | FLOPs | mAP | mAP50 | mAP75 | mAPS | mAPM | mAPL | Log |
---|---|---|---|---|---|---|---|---|---|
EMOv2-1M | 10.5 | 142G | 36.9 | 57.1 | 39.0 | 22.1 | 39.8 | 49.5 | log |
EMOv2-2M | 11.5 | 146G | 39.3 | 60.0 | 41.4 | 23.9 | 43.1 | 51.6 | log |
EMOv2-5M | 14.4 | 158G | 41.5 | 62.7 | 44.1 | 25.7 | 45.5 | 55.5 | log |
EMOv2-20M | 29.8 | 220G | 43.8 | 65.0 | 47.1 | 28.0 | 47.4 | 59.0 | log |
Backbone | #Params (M) | FLOPs | mAP | mAP50 | mAP75 | mAPS | mAPM | mAPL | Log |
---|---|---|---|---|---|---|---|---|---|
EMOv2-1M | 21.2 | 165G | 37.1 | 59.2 | 39.6 | 21.8 | 39.9 | 49.5 | log |
EMOv2-2M | 22.1 | 170G | 39.5 | 61.8 | 42.4 | 22.9 | 43.0 | 52.6 | log |
EMOv2-5M | 24.8 | 181G | 42.3 | 64.3 | 46.3 | 25.8 | 45.6 | 56.3 | log |
EMOv2-20M | 39.8 | 244G | 44.2 | 66.2 | 48.7 | 27.4 | 47.6 | 58.7 | log |
Semantic Segmentation Performance on ADE20K:
Backbone | #Params (M) | FLOPs | mIoU | aAcc | mAcc | Log |
---|---|---|---|---|---|---|
EMOv2-1M | 5.6 | 3.3G | 34.6 | 75.9 | 45.5 | log |
EMOv2-2M | 6.6 | 5.0G | 36.8 | 77.1 | 48.6 | log |
EMOv2-5M | 9.9 | 9.1G | 39.8 | 78.3 | 51.5 | log |
EMOv2-20M | 26.0 | 31.6G | 43.3 | 79.6 | 56.0 | log |
Backbone | #Params (M) | FLOPs | mIoU | aAcc | mAcc | Log |
---|---|---|---|---|---|---|
EMOv2-1M | 5.3 | 23.4G | 37.1 | 78.2 | 47.6 | log |
EMOv2-2M | 6.2 | 25.1G | 39.9 | 79.3 | 51.1 | log |
EMOv2-5M | 8.9 | 29.1G | 42.4 | 80.8 | 53.4 | log |
EMOv2-20M | 23.9 | 51.5G | 46.8 | 82.2 | 58.3 | log |
Backbone | #Params (M) | FLOPs | mIoU | aAcc | mAcc | Log |
---|---|---|---|---|---|---|
EMOv2-1M | 1.4 | 5.0G | 37.0 | 77.7 | 47.5 | log |
EMOv2-2M | 2.6 | 10.3G | 40.2 | 79.0 | 51.1 | log |
EMOv2-5M | 5.3 | 14.4G | 43.0 | 80.5 | 53.9 | log |
EMOv2-20M | 20.4 | 36.8G | 47.3 | 82.1 | 58.7 | log |
Backbone | #Params (M) | FLOPs | mIoU | aAcc | mAcc | Log |
---|---|---|---|---|---|---|
EMOv2-1M | 4.2 | 2.9G | 33.6 | 75.8 | 44.8 | log |
EMOv2-2M | 5.2 | 4.6G | 35.7 | 76.7 | 47.0 | log |
EMOv2-5M | 8.1 | 8.6G | 39.1 | 78.2 | 51.0 | log |
EMOv2-20M | 23.6 | 30.9G | 43.4 | 79.6 | 55.7 | log |
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install inplace_abn timm==0.9.16 mmselfsup pandas transformers openpyxl numpy-hilbert-curve pyzorder imgaug numba protobuf==3.20.1 scikit-image faiss-gpu
pip install timm==0.6.5 tensorboardX einops torchprofile fvcore
pip install mmcv==2.1.0 -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.1/index.html
(Opt.) git clone https://github.com/NVIDIA/apex && cd apex && pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
(Downstream) pip install terminaltables pycocotools prettytable xtcocotools
(Downstream) pip install mmdet==3.3.0
(Downstream) pip install mmsegmentation==1.2.2
(Downstream) pip install mmaction2==1.2.0
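After installing, a quick import check (our own sketch, not a repo script) confirms the core environment:

```python
import torch, torchvision, timm, mmcv

print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('torchvision', torchvision.__version__)
print('timm', timm.__version__)
print('mmcv', mmcv.__version__)
```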
Download and extract the ImageNet-1K dataset into the following directory structure:
├── imagenet
│   ├── train
│   │   ├── n01440764
│   │   │   ├── n01440764_10026.JPEG
│   │   │   └── ...
│   │   └── ...
│   ├── train.txt (optional)
│   ├── val
│   │   ├── n01440764
│   │   │   ├── ILSVRC2012_val_00000293.JPEG
│   │   │   └── ...
│   │   └── ...
│   └── val.txt (optional)
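This is the standard `torchvision` `ImageFolder` layout, so extraction can be sanity-checked with a short sketch (the `imagenet/` root path below is an assumption; point it at your own location):

```python
from torchvision import datasets, transforms

# Standard ImageNet eval preprocessing: resize, then center-crop to 224 x 224.
tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
val_set = datasets.ImageFolder('imagenet/val', transform=tf)
print(len(val_set), 'validation images across', len(val_set.classes), 'classes')
```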
- Download pre-trained weights to `resources/Cls/`.
- Test with 8 GPUs in one node (a checkpoint sanity-check sketch follows these commands):
EMOv2-1M
```python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/emo2/emo2_224.py -m test model.name=EMO2_1M_k5_hybrid model.model_kwargs.checkpoint_path=resources/Cls/EMOv2_1M_224.pth``` ==> `Top-1: 72.326`
EMOv2-1M†
```python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/emo2/emo2_224.py -m test model.name=EMO2_1M_k5_hybrid model.model_kwargs.checkpoint_path=resources/Cls/EMOv2_1M_224_KD.pth``` ==> `Top-1: 73.5`
EMOv2-2M
```python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/emo2/emo2_224.py -m test model.name=EMO2_2M_k5_hybrid model.model_kwargs.checkpoint_path=resources/Cls/EMOv2_2M_224.pth``` ==> `Top-1: 75.8`
EMOv2-2M†
```python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/emo2/emo2_224.py -m test model.name=EMO2_2M_k5_hybrid model.model_kwargs.checkpoint_path=resources/Cls/EMOv2_2M_224_KD.pth``` ==> `Top-1: 76.7`
EMOv2-5M
```python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/emo2/emo2_224.py -m test model.name=EMO2_5M_k5_hybrid model.model_kwargs.checkpoint_path=resources/Cls/EMOv2_5M_224.pth``` ==> `Top-1: 79.4`
EMOv2-5M†
```python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/emo2/emo2_224.py -m test model.name=EMO2_5M_k5_hybrid model.model_kwargs.checkpoint_path=resources/Cls/EMOv2_5M_224_KD.pth``` ==> `Top-1: 80.9`
EMOv2-5M*
```python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/emo2/emo2_512.py -m test model.name=EMO2_5M_k5_hybrid model.model_kwargs.checkpoint_path=resources/Cls/EMOv2_5M_512_KD.pth``` ==> `Top-1: 82.9`
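Before launching distributed testing, a downloaded checkpoint can be inspected outside `run.py` with a minimal sketch (whether the weights sit under a `state_dict` key is an assumption about the file layout):

```python
import torch

ckpt = torch.load('resources/Cls/EMOv2_5M_224.pth', map_location='cpu')
# Unwrap a possible {'state_dict': ...} wrapper; otherwise use the dict as-is.
state = ckpt.get('state_dict', ckpt) if isinstance(ckpt, dict) else ckpt
n_params = sum(v.numel() for v in state.values() if hasattr(v, 'numel'))
print(f'{len(state)} tensors, {n_params / 1e6:.1f}M parameters')
```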
- Train with 8 GPUs in one node:
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/emo2/emo2_224.py -m train model.name=EMO2_5M_k5_hybrid trainer.checkpoint=runs/emo2
- Train with 8 GPUs in one node with KD:
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/emo2/emo2_224_kd.py -m train model.name=EMO2_5M_k5_hybrid trainer.checkpoint=runs/emo2
- Refer to MMDetection for the environments.
- Configs can be found in `downstreams/det/configs`.
- E.g., run `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 PORT=29502 ./tools/dist_train.sh configs/ssd/ssdlite_emo2_5M_8gpu_2lr_coco.py 8` for SSDLite with EMOv2-5M; a hedged config sketch follows.
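For orientation, a hypothetical MMDetection 3.x config sketch of the usual backbone-swap pattern; the registered backbone name, checkpoint layout, and stage channels below are assumptions, so consult the real files in `downstreams/det/configs`:

```python
# Hypothetical config sketch; not this repo's actual file.
_base_ = '../retinanet/retinanet_r50_fpn_1x_coco.py'   # stock MMDetection base

model = dict(
    backbone=dict(
        _delete_=True,                        # discard the inherited ResNet-50 settings
        type='EMO2_5M_k5_hybrid',             # assumed registered backbone name
        init_cfg=dict(type='Pretrained',
                      checkpoint='resources/Cls/EMOv2_5M_224.pth')),
    neck=dict(in_channels=[48, 72, 160, 288]),  # illustrative stage channels
)
```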
- Refer to MMSegmentation for the environments.
- Configs can be found in `downstreams/seg/configs`.
- E.g., run `CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29502 ./tools/dist_train.sh configs/deeplabv3/deeplabv3_emo2_5M-80k_ade20k-512x512.py 4` for DeepLabv3 with EMOv2-5M; the backbone-swap pattern mirrors the detection config sketch above.
If our work is helpful for your research, please consider citing:
@inproceedings{emo2,
title={Rethinking Mobile Block for Efficient Attention-based Models},
author={Zhang, Jiangning and Li, Xiangtai and Li, Jian and Liu, Liang and Xue, Zhucun and Zhang, Boshen and Jiang, Zhengkai and Huang, Tianxin and Wang, Yabiao and Wang, Chengjie},
booktitle={ICCV},
pages={1--8},
year={2023}
}
We thank, among others, the following repositories for their assistance with our research: