Visual Domain Decathlon

Part of PASCAL in Detail Workshop Challenge, CVPR 2017, July 26th, Honolulu, Hawaii, USA

This taster challenge tests the ability of visual recognition algorithms to cope with (or take advantage of) many different visual domains.


Overview

The goal of this challenge is to solve simultaneously ten image classification problems representative of very different visual domains. The data for each domain is obtained from the following image classification benchmarks:

  1. ImageNet [6].
  2. CIFAR-100 [2].
  3. Aircraft [1].
  4. Daimler pedestrian classification [3].
  5. Describable textures [4].
  6. German traffic signs [5].
  7. Omniglot [7].
  8. SVHN [8].
  9. UCF101 Dynamic Images [9a,9b].
  10. VGG-Flowers [10].

The union of the images from the ten datasets is split into training, validation, and test subsets. Different domains contain different image categories and different numbers of images.

The task is to train the best possible classifier to address all ten classification tasks using the training and validation subsets, apply the classifier to the test set, and send us the resulting annotation file for assessment. The winner will be determined based on a weighted average of the classification performance on each domain, using the scoring scheme described below. At test time, your model is allowed to know the ground-truth domain of each test image (ImageNet, CIFAR-100, ...) but, of course, not its category.

It is up to you how to make use of the data: you can either train a single model for all ten tasks or ten independent ones. However, you are not allowed to use any external data source for training. Furthermore, we ask you to report the overall size of the model(s) used.

Competition results

The CVPR 2017 competition winner is:

You can check the detailed breakdown here and here (legacy).

Submission

The 2017 competition is now finished. However, you can keep submitting entries to the competition and checking your results on the leaderboards, as explained below.

Original text:

For the final submission (phase 2 of the competition), generate the results.json file for the images in the test set, as explained below. Pack it into a ZIP file and submit your entry using the VGG Codalab server. This phase opens July 10 2017 and closes July 20 2017 (midnight UTC).

For development (phase 1 of the competition), you can submit results on the validation subset instead of the test set.

Data download

In order to enter the challenge, please download the devkit (code and annotations) and the single TAR archive containing all the images.

Data

Visual Decathlon contains the following datasets:

Dataset        $E^\text{base}$ (%)   no. classes   training   validation   testing
Aircraft                   39.66            100       3334         3333      3333
CIFAR-100                  17.88            100      40000        10000     10000
Daimler Ped                 7.18              2      23520         5880     19600
D. Textures                44.47             47       1880         1880      1880
GTSRB                       2.47             43      31367         7842     12630
ImageNet                   40.13           1000    1232167        49000     48238
Omniglot                   12.31           1623      19476         6492      6492
SVHN                        3.45             10      47217        26040     26032
UCF101 Dyn                 48.80            101       7629         1908      3783
VGG-Flowers                18.59            102       1020         1020      6149

The datasets have been pre-processed as follows:

  1. All images have been resized isotropically to have a shorter side of 72 pixels. For some datasets, such as ImageNet, this is a substantial reduction in resolution, which makes training models much faster (the baselines show that very good performance can still be obtained at this resolution).

  2. When the original dataset did not provide publicly-available test annotations or pre-established splits, we created ad hoc train, validation, and test splits.

  3. All images have been stored in a single directory hierarchy data/{aircraft,cifar100,...,vgg-flowers}/.

  4. Annotations are provided in Microsoft COCO format. There are three files for each domain (annotations/aircraft_train.json, annotations/aircraft_val.json, annotations/aircraft_test_stripped.json, ...). The test annotations contain only the image names, not the class labels. The format of these files is described below, and a minimal loading sketch follows this list.
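For orientation, the annotation files can be loaded directly in MATLAB; a minimal sketch, assuming MATLAB R2016b or later for jsondecode (depending on the exact fields, the decoded arrays may be struct arrays or cell arrays; the devkit may also ship its own JSON reader):

% Load the training annotations for one domain and inspect them.
anno = jsondecode(fileread('annotations/aircraft_train.json')) ;

numel(anno.images)          % number of training images in this domain
anno.images(1).file_name    % path of the first image
anno.annotations(1)         % image_id / category_id pair for the first image
anno.categories(1).name     % human-readable name of the first category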

In order to enter the challenge, evaluate your method on the test data and prepare a single result file results.json, comprising responses for all ten domains, in the format described below. Then follow the submission procedure described above.

Annotation format

Each annotation file uses the MS COCO JSON format and has the following structure:

{"info":{"year":2017,"version":1,...},
 "images":[image1,image2,...],
 "annotations":[anno1,anno2,...],
 "categories":[cat1,cat2,...]}

where images has the format:

image1={"id":10000001, "width":320, "height":256, "file_name":"images/dataset1/train/image1.jpg", ...}

annotations has the format:

anno1={"id":10000001, "image_id": 10000001, "category_id":10000003,
       "segmentation":[], "area":81920, "bbox":[0 0 320 256], "iscrowd":0}

and categories has the format:

cat1={"id":10000001,"name":"category1","supercategory":"dataset1"}

The MS COCO format is somewhat redundant for an image classification task. You only need to know the list of images in each domain and (for training) the corresponding category labels. Images and categories are given numeric IDs, using the formats $10^7 \times \textsf{domainNumber} + \textsf{imageNumber}$ and $10^7 \times \textsf{domainNumber} + \textsf{categoryNumber}$ respectively. Indexing is 1-based, so $\textsf{domainNumber}$, $\textsf{categoryNumber}$, and $\textsf{imageNumber}$ all start from 1. Each annotation relates an image ID to the corresponding category ID, and there is exactly one annotation per image. You can ignore the bounding box information. Annotations also have their own ID; since there is exactly one annotation per image, it is set equal to the ID of the corresponding image.
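As a quick sanity check of this scheme, the domain and within-domain indices can be recovered from an ID with simple arithmetic; a minimal MATLAB sketch (the variable names are ours):

% Split a decathlon ID into its domain number and within-domain number.
% IDs follow 10^7 * domainNumber + imageNumber (or categoryNumber), 1-based.
id           = 10000003 ;           % e.g. the third image of the first domain
domainNumber = floor(id / 1e7) ;    % -> 1
imageNumber  = mod(id, 1e7) ;       % -> 3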

Result format

The results.json file also uses the MS COCO format, as follows:

[res1,res2,...]

where each entry in the array is an image_id, category_id pair:

res1={"image_id":10000001,"category_id":10000001}

Note that results.json must contain exactly one annotation for each test image, across all ten domains, in a single file. If for any reason you decide to give up on a domain, fill in the corresponding annotations with random category labels.
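As a concrete illustration, the result file can be assembled in a few lines of MATLAB; a minimal sketch, assuming jsonencode (MATLAB R2016b or later) and placeholder predictions in place of your model's output:

% Write results.json from predicted category IDs (placeholder values shown).
% The two vectors must cover every test image of all ten domains.
imageIds             = [10000001; 10000002] ;
predictedCategoryIds = [10000001; 10000042] ;

results = struct('image_id', num2cell(imageIds), ...
                 'category_id', num2cell(predictedCategoryIds)) ;

fid = fopen('results.json', 'w') ;
fprintf(fid, '%s', jsonencode(results)) ;
fclose(fid) ;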

If uncertain, the result file can be validated using the MATLAB evaluation code described below. Malformed files will not be accepted.

Evaluation code

To simplify entering the challenge, we make available a devkit with code and annotations and a single TAR archive with all the images (see the Data download section above).

We provide a reference implementation of the evaluation procedure in code/evaluation.m (MATLAB). To run it, use the following MATLAB fragment:

cd decathlon-1.0 ;
addpath code ;
evaluation('path/to/results_test.json') ;

By default, this runs the evaluation code assuming that results_test.json contains annotations for the test images. Since we do not ship the test labels, the reported results are not meaningful, but the call can still be used to validate the format of the results_test.json file.

To evaluate on, e.g., the validation set instead, use:

evaluation('path/to/results_val.json','evaluationSet','val') ;

Benchmark measures

Each of the ten domains $d=1,\dots,10$ is a classification problem, evaluated in terms of average prediction error $E_d \in [0,1]$. This is the fraction of test images incorrectly classified, also known as top-1 classification error.
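Concretely, writing $T_d$ for the test set of domain $d$, and $y_i$ and $\hat{y}_i$ for the ground-truth and predicted categories of test image $i$ (notation introduced here for illustration only), the per-domain error is

$E_d = \frac{1}{|T_d|} \sum_{i \in T_d} \mathbf{1}[\hat{y}_i \neq y_i]$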

The overall score of an algorithm is computed as

$S = \sum_{d=1}^{10} \alpha_d \max\{0,\, E_d^\text{max} - E_d\}^{\gamma_d}$

where the sum runs over the ten domains. Here:

  1. The maximum error $E_d^\text{max} = 2 E_d^\text{base}$ is set to twice the baseline error $E_d^\text{base}$ of domain $d$; errors at or above $E_d^\text{max}$ receive no points. The baseline performances are determined from preliminary experiments as well as by consulting the state-of-the-art performance figures available in the literature.

  2. The coefficients $\alpha_d$ and exponents $\gamma_d$ are set such that the baseline method obtains a score of 250 points for each task and 2,500 points in total (see the technical report for details).

In order to do well in the decathlon challenge, it is necessary to do well on all, or at least most, of the domains!
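As a worked example, the per-domain score can be computed as follows; a minimal MATLAB sketch, assuming the settings reported in the technical report, namely $\gamma_d = 2$ and $\alpha_d = 1000\,(E_d^\text{max})^{-\gamma_d}$, so that a perfect classifier earns 1,000 points per domain:

% Per-domain decathlon score (assumed settings: gamma = 2, alpha = 1000 / Emax^2).
Ebase = 0.4013 ;                % e.g. the ImageNet baseline error (40.13%)
Emax  = 2 * Ebase ;             % maximum error: twice the baseline
gamma = 2 ;
alpha = 1000 / Emax^gamma ;

score = @(E) alpha * max(0, Emax - E)^gamma ;

score(Ebase)                    % baseline error   -> 250 points
score(0)                        % perfect result   -> 1000 points
score(Emax)                     % error >= Emax    -> 0 points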

Discussion

The scoring system is designed to reward error reductions more strongly when the error is already significantly below the baseline, reflecting the fact that further error reductions are proportionally harder to obtain. One might instead consider a logarithmic rule, such as $\alpha_d \log(E_d^\text{max}/E_d)$. Unfortunately, such a rule would have the unwanted property that a perfect result would receive an infinite number of points. The power law used in the decathlon strikes a balance, as shown in the figure below:

[Figure: scoring system comparison — points awarded vs. classification error for the power-law and logarithmic rules]

The figure plots the number of points received by an algorithm as a function of its classification error $E$, where the maximum error $E^\text{max}$ is set to a 5% error. The logarithmic point system diverges to infinity, whereas the decathlon system assigns at most 1,000 points. For $\gamma_d=1$ the number of points is proportional to the error reduction, while for $\gamma_d>1$ further reductions in error are rewarded more strongly, similar to a logarithmic rule but not as extreme.
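For reference, the comparison shown in the figure can be reproduced with a short MATLAB sketch; the scaling of the logarithmic rule is chosen here purely for illustration, so that it matches the $\gamma_d=2$ power-law curve at the baseline error:

% Points awarded as a function of the classification error, for E^max = 5%.
Emax  = 0.05 ;
Ebase = Emax / 2 ;
E     = linspace(1e-4, Emax, 500) ;

powerRule = @(E, gamma) 1000 * (max(0, Emax - E) / Emax).^gamma ;
logRule   = @(E) 250 / log(Emax / Ebase) * log(Emax ./ E) ;

plot(E, powerRule(E, 1), E, powerRule(E, 2), E, logRule(E)) ;
legend('\gamma_d = 1', '\gamma_d = 2', 'logarithmic') ;
xlabel('classification error E') ; ylabel('points') ;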

The evaluation protocol and baselines are discussed more thoroughly in this paper.

Acknowledgments

The organisers would like to thank the authors of the ten public benchmark datasets for allowing us to use their data in this challenge.

Data, code, and baselines for this challenge were prepared by Hakan Bilen, Sylvestre Rebuffi, Tomas Jakab from the Oxford Visual Geometry Group.

This challenge was presented as part of the "PASCAL in Detail" workshop at the Conference on Computer Vision and Pattern Recognition (CVPR), 2017, Honolulu. We would like to thank the workshop organisers.

This research is supported by ERC 677195-IDIU.

References

[1] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.

[2] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.

[3] S. Munder and D. M. Gavrila. An experimental study on pedestrian classification. PAMI, 28(11):1863-1868, 2006.

[4] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proc. CVPR, 2014.

[5] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32(0):323-332, 2012.

[6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge, 2014.

[7] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.

[8] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[9a] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

[9b] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, S. Gould. Dynamic Image Networks for Action Recognition. In Proc. CVPR, 2016.

[10] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proc. ICVGIP, Dec 2008.