Abstract
The problem we study in this paper is amodal image segmentation: predicting entire object segmentation masks including both visible and invisible (occluded) parts. In previous work, amodal segmentation ground truth on real images is usually obtained by manual annotation and is therefore subjective. In contrast, we use 3D data to establish an automatic pipeline that determines authentic ground-truth amodal masks for partially occluded objects in real images. This pipeline is used to construct an amodal completion evaluation benchmark, MP3D-Amodal, covering a large variety of object categories and labels. To better handle the amodal completion task in the wild, we explore two architecture variants: a two-stage model that first infers the occluder and then completes the amodal mask; and a one-stage model that exploits the representation power of Stable Diffusion for amodal segmentation across many categories. Without bells and whistles, our method achieves new state-of-the-art performance on amodal segmentation datasets that cover a large variety of objects, including COCOA and our new MP3D-Amodal dataset.
Publications
Amodal Ground Truth and Completion in the Wild
Guanqi Zhan,
Chuanxia Zheng,
Weidi Xie,
Andrew Zisserman
CVPR 2024
@InProceedings{Zhan24,
author = "Guanqi Zhan and Chuanxia Zheng and Weidi Xie and Andrew Zisserman",
title = "Amodal Ground Truth and Completion in the Wild",
booktitle = "CVPR",
year = "2024",
}
MP3D-Amodal Dataset
Comparison of Different Amodal Datasets. Our MP3D-Amodal dataset is the first amodal dataset to provide authentic amodal ground truth for occluded objects from a large variety of categories in real scenes.
Examples of Our MP3D-Amodal Dataset.
Statistics of the MP3D-Amodal dataset.
Generation Process of the MP3D-Amodal dataset.
Dataset Download
- Evaluation Dataset:
- Training Dataset:
- Annotations: (Same format as COCOA)
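Since the annotations follow the COCOA format, they can be filtered with standard JSON tooling. The sketch below is a hedged example: the field names (`annotations`, `regions`, `occlude_rate`) reflect our understanding of the released COCOA format and should be checked against the downloaded files before use.

```python
import json

def occluded_regions(annotations, min_occlude_rate=0.05):
    """Return all annotated regions whose occluded fraction meets a threshold.

    `annotations` may be a path to a COCOA-style JSON file or an
    already-loaded dict. Field names are assumptions based on the
    COCOA annotation format.
    """
    if not isinstance(annotations, dict):
        with open(annotations) as f:
            annotations = json.load(f)
    regions = []
    for ann in annotations.get("annotations", []):
        for region in ann.get("regions", []):
            if region.get("occlude_rate", 0.0) >= min_occlude_rate:
                regions.append(region)
    return regions
```

For example, passing the evaluation annotations with `min_occlude_rate=0.05` would keep only instances with a non-trivial occluded area.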
Architecture
Two-Stage Architecture (OccAmodal) for Amodal Prediction. Left: A lightweight U-Net based architecture is used to predict the occluder mask for each object. Right: The amodal predictor takes the predicted occluder mask, together with the modal mask and image as input to predict the amodal segmentation mask.
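The decomposition behind the two-stage design can be illustrated on a toy grid (this is an illustration of the mask relationships, not the trained model): stage one localizes the occluder, and stage two recovers the invisible part of the object, which by construction lies under the occluder mask.

```python
import numpy as np

# Toy 6x6 scene: an occluder covers the right half of the image and
# hides part of an object. All masks are boolean arrays.
modal = np.zeros((6, 6), dtype=bool)
modal[2:5, 0:3] = True          # visible (modal) part of the object
occluder = np.zeros((6, 6), dtype=bool)
occluder[:, 3:6] = True         # occluder mask (stage-one target)
amodal = np.zeros((6, 6), dtype=bool)
amodal[2:5, 0:5] = True         # full visible + hidden extent (stage-two target)

# The region stage two must hallucinate is the amodal mask minus the
# modal mask, and it sits entirely inside the occluder mask.
invisible = amodal & ~modal
assert invisible.sum() > 0
assert np.all(~invisible | occluder)   # invisible region is under the occluder
```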
One-Stage Architecture (SDAmodal) for Amodal Prediction. The image is fed into a pre-trained Stable Diffusion model to obtain multi-scale representations containing occlusion information. The image and modal mask features are concatenated and forwarded to multiple decoding layers for amodal prediction. The Stable Diffusion model is frozen during training.
Experiment Results
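Amodal segmentation quality is typically scored by the intersection-over-union between predicted and ground-truth amodal masks; the exact benchmark protocol (e.g. averaging over occluded instances only) follows each dataset. A minimal sketch of that mask-IoU measure:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union between two binary masks.

    Returns 1.0 when both masks are empty, so the metric is defined
    for every instance.
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return np.logical_and(pred, gt).sum() / union
```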
Acknowledgements
This research is supported by EPSRC Programme Grant VisualAI EP/T028572/1, a Royal Society Research Professorship RP\R1\191132, AWS credit funding, a China Oxford Scholarship, and ERC-CoG UNION 101001212.
Webpage template modified from Richard Zhang.