Semantic Segment Anything
Jiaqi Chen, Zeyu Yang, and Li Zhang
Zhang Vision Group, Fudan University
SAM is a powerful model for arbitrary object segmentation, and SA-1B is the largest segmentation dataset to date. However, SAM cannot predict a semantic category for each mask. (I) To address this limitation, we propose a pipeline on top of SAM, called Semantic Segment Anything (SSA), that predicts a semantic category for each mask. (II) Moreover, SSA can serve as an automated dense open-vocabulary annotation engine, called the Semantic Segment Anything labeling engine (SSA-engine), which provides rich semantic category annotations for SA-1B or any other dataset. This engine significantly reduces the need for manual annotation and its associated costs.
- SAM is a highly generalizable object segmentation algorithm that provides precise masks. SA-1B is the largest image segmentation dataset to date and provides fine-grained mask annotations. Neither SAM nor SA-1B provides category predictions or annotations for each mask. This makes it difficult for researchers to use SAM directly for semantic segmentation tasks or to use SA-1B to train their own models.
- Advanced close-set segmenters like Segformer and OneFormer, open-set segmenters like CLIPSeg, and image captioning methods like BLIP can provide rich semantic annotations. However, their mask predictions may not be as comprehensive and accurate as those generated by SAM, which have highly precise and detailed boundaries.
- Therefore, by combining the fine image segmentation masks from SAM and SA-1B with the rich semantic annotations provided by these advanced models, we can generate semantic segmentation models with stronger generalization ability, as well as a large-scale densely categorized image segmentation dataset.
- SSA: This is the first open framework that uses SAM for the semantic segmentation task. It lets users integrate their existing semantic segmenters with SAM without retraining or fine-tuning SAM's weights, so they can achieve better generalization and more precise mask boundaries.
- SSA-engine: SSA-engine provides dense open-vocabulary category annotations for the large-scale SA-1B dataset. After manual review and refinement, these annotations can be used to train segmentation models or fine-grained CLIP models.
Before the introduction of SAM, most semantic segmentation application scenarios already had their own models. These models could provide rough category classifications for regions, but were blurry and imprecise at the edges, lacking accurate masks. To address this issue, we propose an open framework called SSA that leverages SAM to enhance the performance of existing models. Specifically, the original semantic segmentation models provide category predictions while the powerful SAM provides masks.
If you have already trained a semantic segmentation model on your dataset, you don't need to train a new SAM-based model to get more accurate segmentation. Instead, you can keep using the existing model as the Semantic branch. SAM's strong generalization and image segmentation abilities can improve the performance of the original model. Note that SSA is suited to scenarios where the mask boundaries predicted by the original segmentor are not highly accurate. If the original model's segmentation is already very accurate, SSA may not provide a significant improvement.
SSA consists of two branches, Mask branch and Semantic branch, as well as a voting module that determines the category for each mask.
- (I) Mask branch (blue). SAM serves as the Mask branch and provides a set of masks with clear boundaries.
- (II) Semantic branch (purple). This branch provides the category for each pixel. It is implemented by a semantic segmentor, and users can customize both the segmentor's architecture and the categories of interest. The segmentor does not need to produce highly detailed boundaries, but it should classify each region as accurately as possible.
- (III) Semantic Voting module (red). For each mask, this module crops out the pixel categories that fall inside the mask. The most frequent (top-1) category among these pixels is taken as the classification result for that mask.
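The voting step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's implementation: the function name `vote_mask_labels` and the toy arrays are hypothetical, and it assumes the Mask branch yields boolean masks and the Semantic branch yields a map of integer class ids.

```python
import numpy as np


def vote_mask_labels(masks, semantic_map):
    """Assign each binary mask the most frequent class among its pixels.

    masks: list of boolean arrays of shape (H, W), e.g. from the Mask branch.
    semantic_map: non-negative integer array of shape (H, W) holding one
        class id per pixel, e.g. from the Semantic branch.
    Returns one class id per mask (None for an empty mask).
    """
    labels = []
    for mask in masks:
        pixel_classes = semantic_map[mask]   # class ids inside the mask
        if pixel_classes.size == 0:
            labels.append(None)
            continue
        counts = np.bincount(pixel_classes)  # number of votes per class id
        labels.append(int(counts.argmax()))  # top-1 category wins
    return labels


# Toy 4x4 example (made-up data): left half is class 1, right half is class 2.
semantic_map = np.array([
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [1, 1, 2, 2],
])
mask_a = np.zeros((4, 4), dtype=bool)
mask_a[:, :3] = True   # 8 pixels of class 1 and 4 pixels of class 2
mask_b = np.zeros((4, 4), dtype=bool)
mask_b[:, 3] = True    # only class 2 pixels

voted = vote_mask_labels([mask_a, mask_b], semantic_map)
print(voted)  # [1, 2]
```

Because the label comes from a majority vote, a mask keeps SAM's boundary even when the Semantic branch is wrong on a minority of its pixels, which is why a coarse segmentor can still be useful here.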
SSA-engine is an automated annotation engine that provides initial semantic labels for the SA-1B dataset, although human review and refinement may be required for more accurate labeling. Thanks to its combined architecture of close-set segmentation and open-vocabulary segmentation, SSA-engine produces satisfactory labels for most samples and can provide more detailed annotations using an image captioning method.
This tool fills the gap in SA-1B's limited fine-grained semantic labeling, while also significantly reducing the need for manual annotation and associated costs. It has the potential to serve as a foundation for training large-scale visual perception models and more fine-grained CLIP models.
The SSA-engine consists of three components:
- (I) Close-set semantic segmentor (green). Two close-set semantic segmentation models trained on COCO and ADE20K datasets respectively are used to segment the image and obtain rough category information. The predicted categories only include simple and basic categories to ensure that each mask receives a relevant label.
- (II) Open-vocabulary classifier (blue). An image captioning model is utilized to describe the cropped image patch corresponding to each mask. Nouns or phrases are then extracted as candidate open-vocabulary categories. This process provides more diverse category labels.
- (III) Final decision module (orange). The SSA-engine uses a Class proposal filter (i.e., CLIP) to select the top-k most reasonable predictions from the mixed class list. Finally, the Open-vocabulary Segmentor predicts the most suitable category within the mask region based on the top-k classes and the image patch.
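As a rough sketch of the top-k selection in the final decision module, the snippet below ranks a mixed class list by a per-class score and keeps the best k. Everything here is illustrative: `top_k_proposals` is a hypothetical helper, the scores are made-up placeholders standing in for the image-text similarities a CLIP-style filter would produce, and merging duplicate names is an assumption rather than documented behavior.

```python
def top_k_proposals(class_names, scores, k=3):
    """Keep the k candidate class names with the highest scores.

    class_names: mixed list of close-set labels and caption-derived nouns.
    scores: one score per candidate (placeholder for a CLIP-style similarity).
    Duplicate names are merged, keeping their best score (an assumption).
    """
    best = {}
    for name, score in zip(class_names, scores):
        if name not in best or score > best[name]:
            best[name] = float(score)
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:k]


# Made-up candidates and scores for a single mask.
candidates = ["person", "wall", "face", "sun glasses", "person"]
scores = [0.21, 0.02, 0.55, 0.15, 0.07]

proposals = top_k_proposals(candidates, scores, k=3)
print(proposals)  # ['face', 'person', 'sun glasses']
```

The open-vocabulary segmentor then only has to choose among these k names for the mask region instead of the full mixed list.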
🔥 2023/04/14: SSA benchmarks semantic segmentation on ADE20K and Cityscapes.
🔥 2023/04/10: Semantic Segment Anything (SSA and SSA-engine) is released.
🔥 2023/04/05: SAM and SA-1B are released.
All results were tested on a single NVIDIA A6000 GPU.
Dataset | Model | Inference time per image (s) | Inference time per mask (s) |
---|---|---|---|
SA-1B | SSA (Close set) | 1.149 | 0.012 |
SA-1B | SSA-engine (Open-vocabulary) | 33.333 | 0.334 |
Dataset | Model | GPU Memory (MB) |
---|---|---|
ADE20K | SSA | 8798 |
Cityscapes | SSA | 19012 |
Dataset | Model | GPU Memory without SAM (MB) | GPU Memory with SAM (MB) |
---|---|---|---|
SA-1B | SSA-engine-small | 11914 | 28024 |
SA-1B | SSA-engine-base | 14466 | 30576 |
For convenience, we used different versions of Segformer from Hugging Face, which vary in parameter size and accuracy (B0, B2, and B5), to simulate Semantic branches with less accurate masks. The results show that when the accuracy of the original Semantic branch is not very high, SSA can improve mIoU.
Model | Semantic branch | mIoU of Semantic branch | mIoU of SSA |
---|---|---|---|
SSA | Segformer-B0 | 31.78 | 33.60 |
SSA | Segformer-B2 | 41.38 | 42.92 |
SSA | Segformer-B5 | 45.92 | 47.14 |
Model | Semantic branch | mIoU of Semantic branch | mIoU of SSA |
---|---|---|---|
SSA | Segformer-B0 | 52.52 | 55.14 |
SSA | Segformer-B2 | 59.76 | 62.25 |
SSA | Segformer-B5 | 71.67 | 72.99 |
Note that all Segformer checkpoints and data pipelines are sourced from the Hugging Face models released by NVIDIA, which show lower mIoU than those in the official repository.
We also evaluate the performance of SSA on the Foggy Driving dataset, with OneFormer as the Semantic branch. The weights and data pipeline of OneFormer are sourced from Hugging Face.
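For readers unfamiliar with the metric in these tables, here is a minimal sketch of mean intersection-over-union (mIoU) computed from a confusion matrix. It is not the repository's `scripts/evaluation.py`: the function name `mean_iou`, the `ignore_index=255` convention, and the toy arrays are assumptions, and real benchmarks usually accumulate the confusion matrix over the whole validation set rather than a single image.

```python
import numpy as np


def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in the prediction or ground truth.

    pred, gt: integer arrays of the same shape with one class id per pixel.
    Pixels whose ground truth equals ignore_index are skipped (assumed
    convention, not taken from the source).
    """
    valid = gt != ignore_index
    pred = pred[valid].astype(np.int64)
    gt = gt[valid].astype(np.int64)
    # confusion[i, j] = number of pixels with ground truth i predicted as j
    confusion = np.bincount(
        gt * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both pred and gt
    return float((intersection[present] / union[present]).mean())


# Toy 2x2 example (made-up data): one pixel of class 0 is predicted as class 1.
gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
miou = mean_iou(pred, gt, num_classes=2)
print(round(miou, 4))  # 0.5833
```

In the toy case the IoU is 1/2 for class 0 and 2/3 for class 1, so the mean is 7/12.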
Model | Training dataset | Validation dataset | mIoU |
---|---|---|---|
SSA | Cityscapes | Foggy Driving | 55.61 |
- Additional examples of open-vocabulary annotations
- Python 3.7+
- CUDA 11.1+
git clone git@github.com:fudan-zvg/Semantic-Segment-Anything.git
cd Semantic-Segment-Anything
conda env create -f environment.yaml
conda activate ssa
python -m spacy download en_core_web_sm
# install segment-anything
cd ..
git clone git@github.com:facebookresearch/segment-anything.git
cd segment-anything; pip install -e .; cd ../Semantic-Segment-Anything
Download the ADE20K or Cityscapes dataset and unzip it into the data folder.
Folder structure:
├── Semantic-Segment-Anything
├── data
│   ├── ade
│   │   ├── ADEChallengeData2016
│   │   │   ├── images
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   │   │   │   │   ├── ADE_val_00002000.jpg
│   │   │   │   │   ├── ...
│   │   │   │   ├── test
│   │   │   ├── annotations
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   │   │   │   │   ├── ADE_val_00002000.png
│   │   │   │   │   ├── ...
│   ├── cityscapes
│   │   ├── leftImg8bit
│   │   │   ├── train
│   │   │   ├── val
│   │   │   │   ├── frankfurt
│   │   │   │   ├── lindau
│   │   │   │   ├── munster
│   │   │   │   │   ├── munster_000173_000019_leftImg8bit.png
│   │   ├── gtFine
│   │   │   ├── train
│   │   │   ├── val
│   │   │   │   ├── frankfurt
│   │   │   │   ├── lindau
│   │   │   │   ├── munster
│   │   │   │   │   ├── munster_000173_000019_gtFine_labelTrainIds.png
│   │   ├── ...
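To make the Cityscapes layout above concrete, the following sketch maps an image under leftImg8bit to its label file under gtFine, using only the file names shown in the tree. The helper `cityscapes_label_path` is hypothetical and is not part of the repository's scripts.

```python
from pathlib import Path


def cityscapes_label_path(image_path, gt_root):
    """Map a leftImg8bit image path to its gtFine labelTrainIds path.

    Assumes the naming shown in the folder structure:
    <city>/<name>_leftImg8bit.png -> <city>/<name>_gtFine_labelTrainIds.png
    """
    image_path = Path(image_path)
    city = image_path.parent.name  # e.g. "munster"
    stem = image_path.name.replace("_leftImg8bit.png", "")
    return Path(gt_root) / city / f"{stem}_gtFine_labelTrainIds.png"


label = cityscapes_label_path(
    "data/cityscapes/leftImg8bit/val/munster/munster_000173_000019_leftImg8bit.png",
    "data/cityscapes/gtFine/val",
)
print(label.as_posix())
# data/cityscapes/gtFine/val/munster/munster_000173_000019_gtFine_labelTrainIds.png
```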
Download the SAM checkpoint and put it in the ckp folder.
mkdir ckp && cd ckp
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
cd ..
Run our SSA on ADE20K with 8 GPUs:
python scripts/main_ssa.py --ckpt_path ./ckp/sam_vit_h_4b8939.pth --save_img --world_size 8 --dataset ade20k --data_dir data/ade20k/ADEChallengeData2016/images/validation/ --gt_path data/ade20k/ADEChallengeData2016/annotations/validation/ --out_dir output_ade20k
Run our SSA on Cityscapes with 8 GPUs:
python scripts/main_ssa.py --ckpt_path ./ckp/sam_vit_h_4b8939.pth --save_img --world_size 8 --dataset cityscapes --data_dir data/cityscapes/leftImg8bit/val/ --gt_path data/cityscapes/gtFine/val/ --out_dir output_cityscapes
Run our SSA on Foggy Driving with 8 GPUs:
python scripts/main_ssa.py --data_dir data/Foggy_Driving/leftImg8bit/test/ --ckpt_path ckp/sam_vit_h_4b8939.pth --out_dir output_foggy_driving --save_img --world_size 8 --dataset foggy_driving --eval --gt_path data/Foggy_Driving/gtFine/test/ --model oneformer
Get the evaluation results on ADE20K:
python scripts/evaluation.py --gt_path data/ade20k/ADEChallengeData2016/annotations/validation --result_path output_ade20k/ --dataset ade20k
Get the evaluation results on Cityscapes:
python scripts/evaluation.py --gt_path data/cityscapes/gtFine/val/ --result_path output_cityscapes/ --dataset cityscapes
Get the evaluation results on Foggy Driving:
# If you haven't downloaded the Foggy Driving dataset, run the following command to download and unzip it.
wget -P data https://data.vision.ee.ethz.ch/csakarid/shared/SFSU_synthetic/Downloads/Foggy_Driving.zip && unzip data/Foggy_Driving.zip -d data/
python scripts/evaluation.py --gt_path data/Foggy_Driving/gtFine/test/ --result_path output_foggy_driving/ --dataset foggy_driving
Organize your dataset as follows:
├── Semantic-Segment-Anything
├── data
│   ├── <The name of your dataset>
│   │   ├── img_name_1.jpg
│   │   ├── img_name_2.jpg
│   │   ├── ...
Run our SSA-engine-base with 8 GPUs (the GPU memory needed depends on the size of the input images):
python scripts/main_ssa_engine.py --data_dir=data/<The name of your dataset> --out_dir=output --world_size=8 --save_img --sam --ckpt_path=ckp/sam_vit_h_4b8939.pth
If you want to run the SSA-engine-small, use the following command (add the --light_mode flag):
python scripts/main_ssa_engine.py --data_dir=data/<The name of your dataset> --out_dir=output --world_size=8 --save_img --sam --ckpt_path=ckp/sam_vit_h_4b8939.pth --light_mode
Download the SA-1B dataset and unzip it into the data/sa_1b folder.
Alternatively, use your own dataset.
Folder structure:
├── Semantic-Segment-Anything
├── data
│   ├── sa_1b
│   │   ├── sa_223775.jpg
│   │   ├── sa_223775.json
│   │   ├── ...
Run our SSA-engine-base with 8 GPUs:
python scripts/main_ssa_engine.py --data_dir=data/sa_1b --out_dir=output --world_size=8 --save_img
Run the SSA-engine-small with 8 GPUs (add the --light_mode flag):
python scripts/main_ssa_engine.py --data_dir=data/sa_1b --out_dir=output --world_size=8 --save_img --light_mode
For each mask, we add two new fields (e.g. 'class_name': 'face' and 'class_proposals': ['face', 'person', 'sun glasses']). The class name is the most likely category for the mask, and the class proposals are the top-k most likely categories from the Class proposal filter. k is set to 3 by default.
{
'bbox': [81, 21, 434, 666],
'area': 128047,
'segmentation': {
'size': [1500, 2250],
'counts': 'kYg38l[18oeN8mY14aeN5\\Z1>'
},
'predicted_iou': 0.9704002737998962,
'point_coords': [[474.71875, 597.3125]],
'crop_box': [0, 0, 1381, 1006],
'id': 1229599471,
'stability_score': 0.9598413705825806,
'class_name': 'face',
'class_proposals': ['face', 'person', 'sun glasses']
}
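As a small usage sketch, the two added fields can be consumed with standard Python once an annotation file has been loaded. The sample below is made-up data that mirrors only the fields shown above, the top-level "annotations" list is an assumption about the file layout, and `area_by_class` is a hypothetical helper.

```python
from collections import Counter

# Made-up stand-in for one annotated image; only a few fields are included.
annotated = {
    "annotations": [
        {"id": 1, "area": 128047, "class_name": "face",
         "class_proposals": ["face", "person", "sun glasses"]},
        {"id": 2, "area": 5210, "class_name": "person",
         "class_proposals": ["person", "shirt", "face"]},
        {"id": 3, "area": 300, "class_name": "face",
         "class_proposals": ["face", "nose", "person"]},
    ]
}


def area_by_class(annotations):
    """Sum the mask area (in pixels) assigned to each predicted class name."""
    totals = Counter()
    for ann in annotations:
        totals[ann["class_name"]] += ann["area"]
    return totals


totals = area_by_class(annotated["annotations"])
print(totals.most_common())  # [('face', 128347), ('person', 5210)]
```

A summary like this is one quick way to spot classes that dominate an image before manual review.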
We hope that researchers in the community will come up with new improvements and ideas and build further work on SSA. Some of our ideas are as follows:
- (I) The masks in SA-1B often come in three levels: whole, part, and subpart. SSA-engine often cannot provide accurate descriptions for very small part or subpart regions and falls back to broad categories instead. For example, SSA-engine may predict "person" for body parts like the neck or a hand. Therefore, an architecture for more detailed semantic prediction is needed.
- (II) SSA and SSA-engine are ensembles of multiple models, which makes inference slower than with end-to-end models. We look forward to more efficient designs in the future.
- (III) For semantic segmentation models with poor boundary segmentation, SSA can utilize SAM and the semantic voting mechanism to provide more accurate masks. However, for models that already have excellent segmentation performance, SSA cannot bring about a significant improvement. On the other hand, if the original segmentation model is too poor and misses many semantic categories, SSA cannot help it recall those categories either. Exploring better ways to utilize SAM is worth further investigation.
- Segment Anything provides the SA-1B dataset.
- Hugging Face provides code and pre-trained models.
- CLIPSeg, Segformer, OneFormer, BLIP and CLIP provide powerful semantic segmentation, image caption and classification models.
If you find this work useful for your research, please cite our GitHub repo:
@misc{chen2023semantic,
title = {Semantic Segment Anything},
author = {Chen, Jiaqi and Yang, Zeyu and Zhang, Li},
howpublished = {\url{https://github.com/fudan-zvg/Semantic-Segment-Anything}},
year = {2023}
}