Amodal Ground Truth and Completion in the Wild

Guanqi Zhan 1
Chuanxia Zheng 1
Weidi Xie 1,2
Andrew Zisserman 1
1Visual Geometry Group, University of Oxford    2Shanghai Jiao Tong University

{guanqi,cxzheng,weidi,az}@robots.ox.ac.uk

[Teaser figure]

Abstract

The problem we study in this paper is amodal image segmentation: predicting entire object segmentation masks including both visible and invisible (occluded) parts. In previous work, the amodal segmentation ground truth on real images is usually estimated by manual annotation and is therefore subjective. In contrast, we use 3D data to establish an automatic pipeline that determines authentic ground truth amodal masks for partially occluded objects in real images. This pipeline is used to construct an amodal completion evaluation benchmark, MP3D-Amodal, consisting of a variety of object categories and labels. To better handle the amodal completion task in the wild, we explore two architecture variants: a two-stage model that first infers the occluder, followed by amodal mask completion; and a one-stage model that exploits the representation power of Stable Diffusion for amodal segmentation across many categories. Without bells and whistles, our method achieves new state-of-the-art performance on amodal segmentation datasets that cover a large variety of objects, including COCOA and our new MP3D-Amodal dataset.


Publications

Amodal Ground Truth and Completion in the Wild
Guanqi Zhan, Chuanxia Zheng, Weidi Xie, Andrew Zisserman
CVPR 2024



MP3D-Amodal Dataset

[Figure: dataset comparison]

Comparison of Different Amodal Datasets. Our MP3D-Amodal dataset is the first amodal dataset to provide authentic amodal ground truth for occluded objects across a large variety of categories in real scenes.

 
[Figure: MP3D-Amodal examples]

Examples of Our MP3D-Amodal Dataset.

 
[Figures: MP3D-Amodal statistics and category distribution]

Statistics of the MP3D-Amodal dataset.

 
[Figure: MP3D-Amodal generation pipeline]

Generation Process of the MP3D-Amodal dataset.
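For intuition, the core of the pipeline can be sketched in a few lines: given the scene's 3D geometry, each object is depth-rendered once on its own and once within the full scene from the same camera; the amodal mask is everything the object projects to, and the modal mask is where it is the nearest surface. The function names, depth tolerance, and occlusion threshold below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def amodal_and_modal_masks(object_depth, scene_depth, tol=1e-3):
    """Derive amodal / modal masks from two depth renderings of one camera view.

    object_depth : (H, W) depth map of the object rendered alone
                   (np.inf where the object does not project).
    scene_depth  : (H, W) depth map of the full scene from the same camera.
    """
    # Amodal mask: every pixel the object would cover if nothing occluded it.
    amodal = np.isfinite(object_depth)
    # Modal (visible) mask: pixels where the object is the nearest surface.
    modal = amodal & (object_depth <= scene_depth + tol)
    return amodal, modal

def is_occluded(amodal, modal, min_hidden=0.05):
    """Flag an object as partially occluded when a noticeable fraction
    of its amodal mask is hidden behind other geometry (threshold assumed)."""
    hidden = amodal & ~modal
    return hidden.sum() / max(amodal.sum(), 1) > min_hidden
```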



Dataset Download


Architecture

[Figure: two-stage architecture (OccAmodal)]

Two-Stage Architecture (OccAmodal) for Amodal Prediction. Left: A lightweight U-Net-based architecture predicts the occluder mask for each object. Right: The amodal predictor takes the predicted occluder mask, together with the modal mask and the image, as input to predict the amodal segmentation mask.
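A minimal PyTorch sketch of the two-stage inference flow, with plain convolutional stacks standing in for the actual networks (the paper uses a lightweight U-Net for the occluder predictor); module names and channel sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class OccluderPredictor(nn.Module):
    """Stage 1: predict the occluder mask from the image + modal mask."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(4, 32), nn.Conv2d(32, 1, 1))
    def forward(self, image, modal):   # image: (B,3,H,W), modal: (B,1,H,W)
        return torch.sigmoid(self.net(torch.cat([image, modal], dim=1)))

class AmodalPredictor(nn.Module):
    """Stage 2: complete the amodal mask given image, modal mask,
    and the stage-1 occluder mask."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(5, 32), nn.Conv2d(32, 1, 1))
    def forward(self, image, modal, occluder):
        return torch.sigmoid(self.net(torch.cat([image, modal, occluder], dim=1)))

# Two-stage inference on dummy inputs
occ_net, amo_net = OccluderPredictor(), AmodalPredictor()
image, modal = torch.randn(1, 3, 256, 256), torch.zeros(1, 1, 256, 256)
occluder = occ_net(image, modal)
amodal = amo_net(image, modal, occluder)
```

Conditioning stage 2 on the predicted occluder mask tells the completion network where the object may plausibly extend beyond its visible region.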

 
[Figure: one-stage architecture (SDAmodal)]

One-Stage Architecture (SDAmodal) for Amodal Prediction. The image is fed into a pre-trained Stable Diffusion model to obtain multi-scale representations containing occlusion information. The image and modal mask features are concatenated and forwarded to multiple decoding layers for amodal prediction. The Stable Diffusion model is frozen during training.
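A hedged sketch of the one-stage design: the frozen Stable Diffusion features are treated here as a single pre-extracted tensor (pooling multi-scale features from an actual SD model is omitted), and a small decoder fuses them with encoded modal-mask features; all names and channel sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SDAmodalHead(nn.Module):
    """Decode an amodal mask from frozen Stable Diffusion image features
    concatenated with modal-mask features. `sd_features` is assumed to be
    a (B, C, H/8, W/8) map already pooled from the multi-scale SD features."""
    def __init__(self, sd_channels=1280, mask_channels=64):
        super().__init__()
        # Downsample the modal mask to the SD feature resolution (x1/8).
        self.mask_encoder = nn.Sequential(
            nn.Conv2d(1, mask_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mask_channels, mask_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mask_channels, mask_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        # Fuse and decode back to full resolution.
        self.decoder = nn.Sequential(
            nn.Conv2d(sd_channels + mask_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 1, 1))
    def forward(self, sd_features, modal_mask):
        m = self.mask_encoder(modal_mask)        # (B, 64, H/8, W/8)
        x = torch.cat([sd_features, m], dim=1)   # fuse image + mask features
        return torch.sigmoid(self.decoder(x))    # (B, 1, H, W) amodal mask

head = SDAmodalHead()
sd_feats = torch.randn(1, 1280, 32, 32)  # stand-in for frozen SD features (256x256 input)
modal = torch.zeros(1, 1, 256, 256)
amodal = head(sd_feats, modal)           # -> (1, 1, 256, 256)
```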



Experiment Results

[Figure: ablation study]

[Figure: comparison with the state of the art]

[Figure: qualitative results]



Acknowledgements

This research is supported by EPSRC Programme Grant VisualAI EP/T028572/1, a Royal Society Research Professorship RP\R1\191132, AWS credit funding, a China Oxford Scholarship, and ERC-CoG UNION 101001212.



Webpage template modified from Richard Zhang.