htd@robots.ox.ac.uk
Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and their evaluation is hampered by performance measures that are not specialized to the AD domain. In this paper, we make three contributions: (i) we propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) we develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) we provide new evaluation metrics to benchmark AD quality that are well matched to human performance. Taken together, we improve the state of the art on AD generation.
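To illustrate the general idea of a Q-former-style bridge between a frozen visual encoder and a frozen large language model, here is a minimal PyTorch sketch. The module name (`QFormerAD`), dimensions (`vis_dim=768`, `llm_dim=4096`), number of query tokens, and layer counts are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class QFormerAD(nn.Module):
    """Minimal sketch of a Q-former-style bridge for AD generation.

    Learnable query tokens cross-attend to frozen visual features and are
    projected into the LLM embedding space, where they act as a soft visual
    prompt prepended to the text tokens. All sizes are assumptions.
    """

    def __init__(self, vis_dim=768, llm_dim=4096, num_queries=32,
                 num_layers=4, num_heads=8):
        super().__init__()
        # Learnable query tokens that distill the video features.
        self.queries = nn.Parameter(torch.randn(1, num_queries, vis_dim) * 0.02)
        # Lightweight transformer decoder: queries self-attend and
        # cross-attend to the frozen visual features.
        layer = nn.TransformerDecoderLayer(
            d_model=vis_dim, nhead=num_heads, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Project query outputs into the LLM's embedding space.
        self.to_llm = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats):
        # vis_feats: (B, T*P, vis_dim) features from the frozen visual encoder.
        b = vis_feats.size(0)
        q = self.queries.expand(b, -1, -1)
        q = self.qformer(tgt=q, memory=vis_feats)
        return self.to_llm(q)  # (B, num_queries, llm_dim) visual prompt


if __name__ == "__main__":
    # Toy forward pass with random "frozen" features: 8 frames x 16 patches.
    bridge = QFormerAD()
    feats = torch.randn(2, 8 * 16, 768)
    prompt = bridge(feats)
    print(prompt.shape)  # torch.Size([2, 32, 4096])
```

In this sketch only the bridge is trainable; the visual encoder and the language model stay frozen, and the projected query tokens are concatenated with the text embeddings before decoding.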
| Dataset | Download | Video source | Text source | # ADs | Description |
|---|---|---|---|---|---|
| CMD-AD | [18MB, csv] | CMD | AudioVault | 101k | AD dataset with video pixels: texts from AudioVault aligned with videos from CMD. |
| HowTo-AD | [coming soon] | HowTo100M | modified from HowToCaption | 3.4M | Large pre-training dataset for AD: video captions modified to mimic AD style. |
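As a rough illustration of inspecting the CMD-AD release (not the official loader), the csv can be read with pandas; the file name and the column used for grouping below are placeholders to be checked against the downloaded file:

```python
import pandas as pd

# Minimal sketch, assuming the AD annotations are stored one per row.
# The path and column names are placeholders -- verify them against the
# actual CMD-AD csv after downloading.
ad = pd.read_csv("CMD-AD.csv")
print(len(ad), "AD entries")
print(ad.columns.tolist())

# Hypothetical example: group AD sentences by movie identifier so each
# group can be paired with the corresponding CMD video clips.
# for movie_id, group in ad.groupby("imdbid"):
#     process(movie_id, group)
```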
Webpage template modified from Richard Zhang.