AutoAD III: The Prequel -- Back to the Pixels

Tengda Han 1
Max Bain 1
Arsha Nagrani 1
Gül Varol 1,2
Weidi Xie 1,3
Andrew Zisserman 1
1Visual Geometry Group, University of Oxford
2LIGM, École des Ponts, Univ Gustave Eiffel, CNRS
3CMIC, Shanghai Jiao Tong University

htd@robots.ox.ac.uk


AD-III figure

Cinema is a matter of what's in the frame and what's out.
-- Martin Scorsese


Abstract

Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and their evaluation is hampered by performance measures that are not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.



Qualitative Examples from AutoAD-III


In this stitched example, the AD ground truth and our model's predictions are shown as subtitles in the format `[title] ground-truth | prediction`. The AD voice is generated with the OpenAI text-to-speech API and mixed with the original movie soundtrack.
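
For readers who want to process such subtitle lines programmatically, the following is a minimal sketch of a parser for the `[title] ground-truth | prediction` format. It is not part of the released code: the names `SubtitleEntry` and `parse_subtitle` are hypothetical, the exact whitespace handling is an assumption, and the sample line is invented purely for illustration.

```python
import re
from typing import NamedTuple, Optional


class SubtitleEntry(NamedTuple):
    """One subtitle line split into its three parts (hypothetical container)."""
    title: str
    ground_truth: str
    prediction: str


# Assumed layout: "[title] ground-truth | prediction".
# The first "|" after the title is treated as the separator.
_SUBTITLE_PATTERN = re.compile(
    r"^\[(?P<title>[^\]]+)\]\s*(?P<gt>.*?)\s*\|\s*(?P<pred>.*)$"
)


def parse_subtitle(line: str) -> Optional[SubtitleEntry]:
    """Return the parsed entry, or None if the line does not match the format."""
    match = _SUBTITLE_PATTERN.match(line.strip())
    if match is None:
        return None
    return SubtitleEntry(
        title=match["title"].strip(),
        ground_truth=match["gt"].strip(),
        prediction=match["pred"].strip(),
    )


if __name__ == "__main__":
    # Invented sample line, not taken from the dataset.
    sample = "[Example Movie] She opens the door. | A woman opens a door."
    print(parse_subtitle(sample))
```

Returning `None` for non-matching lines lets a caller skip ordinary dialogue subtitles without raising an exception.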

The movie clips in the video are from: Meet the Parents (2000), Inferno (2016), Liar Liar (1997), Back to the Future (1985), The Night Before (2015), Jason Bourne (2016), Sing (2016). All of them are from the CMD-AD-Eval set.


New Datasets

Dataset | Download | Video source | Text source | # AD | Information
CMD-AD | [18MB, csv] | CMD | AudioVault | 101k | AD dataset with video pixels. Texts from AudioVault aligned with videos from CMD.
HowTo-AD | [coming soon] | HowTo100M | modified from HowToCaption | 3.4M | Large pre-training dataset for AD. Video captions modified to mimic AD style.



New Metrics




Publications

AutoAD III: The Prequel -- Back to the Pixels.
Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
CVPR, 2024

----- Previous works in the trilogy: -----
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description.
Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
ICCV, 2023

AutoAD: Movie Description in Context.
Tengda Han*, Max Bain*, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
[Highlight] CVPR, 2023






Acknowledgements

This research is funded by EPSRC PG VisualAI EP/T028572/1, and ANR-21-CE23-0003-01 CorVis.



Webpage template modified from Richard Zhang.