The VoxCeleb1 Dataset


VoxCeleb1 contains over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.

Verification split
                     dev        test
# of speakers        1,211      40
# of videos          21,819     677
# of utterances      148,642    4,874

Identification split
                     dev        test
# of speakers        1,251      1,251
# of videos          21,245     1,251
# of utterances      145,265    8,251
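
As a quick consistency check on the figures above, the two splits appear to cover the same overall pool of videos and utterances. The following is a minimal sketch: the dictionary layout and the helper name `total` are illustrative assumptions, and only the counts come from the table.

```python
# Counts copied from the table above as (dev, test); the layout is illustrative.
splits = {
    "verification": {
        "speakers": (1211, 40),
        "videos": (21819, 677),
        "utterances": (148642, 4874),
    },
    "identification": {
        "speakers": (1251, 1251),
        "videos": (21245, 1251),
        "utterances": (145265, 8251),
    },
}


def total(split, key):
    """Sum the dev and test counts for one row of one split."""
    dev, test = splits[split][key]
    return dev + test


# Both splits add up to the same number of videos and utterances.
assert total("verification", "videos") == total("identification", "videos")
assert total("verification", "utterances") == total("identification", "utterances")

# The verification speaker counts sum to the 1,251 celebrities in the dataset,
# which is consistent with dev and test speakers being disjoint in that split.
assert total("verification", "speakers") == 1251
```

In the identification split every speaker is counted in both dev and test, so the speaker counts there are not meant to be summed.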



Updates

26/10/2017 Overlap with SITW: The authors of the Speakers in the Wild (SITW) dataset have kindly released the overlap between the speakers in their dataset and VoxCeleb. The SITW codes of the speakers present in both datasets can be found here. Those wishing to use both datasets (SITW and VoxCeleb) must therefore remove the overlapping speakers from SITW, which reduces its overall size.
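
One way to apply the overlap list is to filter SITW entries by speaker code before use. The sketch below is only an illustration: the function name `drop_overlapping_speakers`, the `(speaker_code, utterance_path)` layout, and the example codes are all assumptions, not the format of the released list or of SITW itself.

```python
def drop_overlapping_speakers(sitw_utterances, overlap_codes):
    """Return SITW entries whose speaker code is not in the overlap list.

    sitw_utterances: iterable of (speaker_code, utterance_path) pairs
                     (hypothetical layout, adapt to your own index).
    overlap_codes:   iterable of SITW speaker codes shared with VoxCeleb,
                     e.g. the lines of the released overlap file.
    """
    # Strip whitespace and ignore empty lines so raw file lines can be passed in.
    overlap = {code.strip() for code in overlap_codes if code.strip()}
    return [(spk, path) for spk, path in sitw_utterances if spk not in overlap]


# Made-up example data, not real SITW codes.
sitw = [("spk_a", "a/1.flac"), ("spk_b", "b/1.flac"), ("spk_a", "a/2.flac")]
kept = drop_overlapping_speakers(sitw, ["spk_a\n", ""])
print(kept)  # [('spk_b', 'b/1.flac')]
```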

11/10/2017 Models: Pretrained models for Speaker Identification and Verification can be found here.

29/9/2017 VoxCeleb 1.1: After deduplicating the dataset, we found a small number of repeated videos (34 videos). The list of duplicates can be found here. Note that these videos appear only in the training set for identification; the test set remains unchanged.



Downloads


Terms and Conditions

The VoxCeleb dataset consists of YouTube URLs with timestamps for utterances. For privacy issues with the dataset, please refer to our Dataset Privacy Notice.

The provided VoxCeleb metadata is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

URLs and timestamps

The URLs and timestamps for the VoxCeleb dataset are no longer available from this website.

Audio files

The audio files for the VoxCeleb dataset are no longer available from this website.

Metadata

The identifying metadata files for the VoxCeleb dataset are no longer available from this website.

Verification Set
Dataset split for identification

List of trial pairs - VoxCeleb1
List of trial pairs - VoxCeleb1 (cleaned)
List of trial pairs - VoxCeleb1-H
List of trial pairs - VoxCeleb1-H (cleaned)
List of trial pairs - VoxCeleb1-E
List of trial pairs - VoxCeleb1-E (cleaned)

The VoxCeleb1-E and VoxCeleb1-H lists are drawn from the VoxCeleb1 training set. Therefore, you cannot use any files in VoxCeleb1 for training if you are using these lists for testing.
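
To guard against this kind of train/test leakage, a trial list can be parsed and compared against the files used for training. The sketch below assumes each trial line has the form `<label> <file_a> <file_b>`, with label 1 for a same-speaker pair and 0 otherwise; that layout, the function names, and the example paths are assumptions for illustration, so check them against the list you actually download.

```python
def parse_trials(lines):
    """Parse trial lines assumed to look like '<label> <file_a> <file_b>'."""
    trials = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        label, file_a, file_b = parts
        trials.append((int(label), file_a, file_b))
    return trials


def leaked_files(trials, training_files):
    """Return, in sorted order, the training files that appear in any trial."""
    trial_files = {f for _, file_a, file_b in trials for f in (file_a, file_b)}
    return sorted(trial_files & set(training_files))


# Made-up example lines and paths, not taken from the real lists.
example_lines = [
    "1 spk1/vid1/00001.wav spk1/vid2/00001.wav",
    "0 spk1/vid1/00001.wav spk2/vid1/00003.wav",
    "",
]
trials = parse_trials(example_lines)
leaks = leaked_files(trials, ["spk2/vid1/00003.wav", "spk3/vid1/00001.wav"])
print(leaks)  # ['spk2/vid1/00003.wav']
```

An empty result only shows that no trial file is reused directly. The note above is stricter: with the -E and -H lists, no VoxCeleb1 file should be used for training at all.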




Related Links
A download script and unofficial baseline code can be found here.


Please cite the following if you make use of the dataset.

A. Nagrani*, J. S. Chung*, A. Zisserman  
INTERSPEECH, 2017.