A Light Touch Approach to Teaching Transformers Multi-view Geometry
In CVPR 2023


Yash Bhalgat
João F. Henriques
Andrew Zisserman


Visual Geometry Group, University of Oxford





Predicted Cross-attention Maps.

An epipolar-guided training method that incorporates multi-view geometric priors into Transformer models. Shown above: predicted cross-attention maps for a test image pair (i.e., a pair never seen during training), without any input pose information. Given two images, the Transformer implicitly estimates the epipolar geometry and uses it for downstream predictions, e.g., pose-invariant object retrieval.




Abstract

Transformers are powerful visual learners, in large part due to their conspicuous lack of manually specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry, due to the near-infinite possible variations in 3D shapes and viewpoints (requiring flexibility), and the precise nature of projective geometry (obeying rigid laws). To resolve this conundrum, we propose a "light touch" approach, guiding visual Transformers to learn multiple-view geometry but allowing them to break free when needed. We achieve this by using epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along these lines, since they contain geometrically plausible matches. Unlike previous methods, our proposal does not require any camera pose information at test time. We focus on pose-invariant object instance retrieval, where standard Transformer networks struggle due to the large differences in viewpoint between query and retrieved images. Experimentally, our method outperforms state-of-the-art approaches at object retrieval without needing pose information at test time.
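To make the guidance concrete, below is a minimal PyTorch sketch of how such an epipolar cross-attention loss could be implemented. This is an illustration under our own assumptions, not the paper's exact loss: the function name, the margin-based epipolar band, and the token-coordinate inputs are all hypothetical. Since each attention row sums to one, penalizing the off-line attention mass simultaneously encourages higher attention along the epipolar lines.

    import torch

    def epipolar_attention_loss(attn, coords1, coords2, F, margin=2.0):
        # attn:    (B, N1, N2) cross-attention from image-1 tokens to image-2
        #          tokens (each row sums to 1 after softmax).
        # coords1: (B, N1, 2) pixel coordinates of image-1 token centres.
        # coords2: (B, N2, 2) pixel coordinates of image-2 token centres.
        # F:       (B, 3, 3) fundamental matrices, available only at training time.
        B, N1, _ = coords1.shape
        x1 = torch.cat([coords1, torch.ones(B, N1, 1, device=coords1.device)], dim=-1)
        x2 = torch.cat([coords2, torch.ones(B, coords2.shape[1], 1, device=coords2.device)], dim=-1)
        lines = x1 @ F.transpose(1, 2)            # (B, N1, 3): epipolar line F @ x1 per query
        # point-to-line distance |l . x2| / ||(a, b)|| for every (query, key) pair
        num = (lines @ x2.transpose(1, 2)).abs()  # (B, N1, N2)
        den = lines[..., :2].norm(dim=-1, keepdim=True).clamp(min=1e-8)
        dist = num / den
        off_line = (dist > margin).float()        # keys farther than `margin` pixels from the line
        # attention mass falling outside the epipolar band; since rows sum to 1,
        # minimizing this also pushes mass onto the lines
        return (attn * off_line).sum(dim=-1).mean()

In practice, a term like this would be added, suitably weighted, to the downstream training objective (here, the retrieval loss).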




Video




Method

Teaser figure.

Proposed epipolar-guided training method. Pose information (or epipolar geometry) is required only during training. During inference, the Transformer implicitly uses geometric reasoning in its predictions.
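Since pose is consumed only at training time, the fundamental matrix defining the epipolar lines can be assembled from the ground-truth intrinsics and relative pose. The helper below is the standard construction F = K2^-T [t]x R K1^-1, sketched under our own naming conventions; at inference none of these quantities are needed.

    import torch

    def fundamental_from_pose(K1, K2, R, t):
        # K1, K2: (3, 3) intrinsics of the two views.
        # R, t:   rotation (3, 3) and translation (3,) mapping camera-1
        #         coordinates to camera-2 coordinates (x2 = R @ x1 + t).
        t0, t1, t2 = t.tolist()
        tx = torch.tensor([[0.0, -t2,  t1],
                           [ t2, 0.0, -t0],
                           [-t1,  t0, 0.0]])    # skew-symmetric [t]x
        E = tx @ R                              # essential matrix
        F = torch.linalg.inv(K2).T @ E @ torch.linalg.inv(K1)
        return F / F.norm()                     # scale is arbitrary; normalize for stability

The epipolar line in image 2 of a homogeneous point x1 in image 1 is then l = F @ x1, which is exactly what the attention loss sketched above consumes.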




Qualitative Comparisons on the CO3D-Retrieve Benchmark

Example 1

Example 2

Example 3


Paper

Paper thumbnail

A Light Touch Approach to Teaching Transformers Multi-view Geometry

Yash Bhalgat, João F. Henriques, Andrew Zisserman

In CVPR 2023.

@InProceedings{bhalgat2023light,
    title = {A Light Touch Approach to Teaching Transformers Multi-view Geometry},
    author = {Bhalgat, Yash and Henriques, João F and Zisserman, Andrew},
    booktitle = {CVPR},
    year = {2023},
}



Acknowledgements

We are grateful for funding from EPSRC AIMS CDT EP/S024050/1, AWS, the Royal Academy of Engineering (RF\201819\18\163), EPSRC Programme Grant VisualAI EP/T028572/1, and a Royal Society Research Professorship RP\R1\191132.

This webpage template was originally made by Phillip Isola and Richard Zhang for a colorful project, and inherits the modifications made by Jason Zhang.