Part of PASCAL in Detail Workshop Challenge, CVPR 2017, July 26th, Honolulu, Hawaii, USA
This taster challenge tests the ability of visual recognition algorithms to cope with (or take advantage of) many different visual domains.
The goal of this challenge is to simultaneously solve ten image classification problems representative of very different visual domains. The data for each domain is obtained from ten existing image classification benchmarks (see the table below and references [1]-[10] at the end of this page).
The union of the images from the ten datasets is split into training, validation, and test subsets. Different domains contain different image categories as well as a different number of images.
The task is to train the best possible classifier to address all ten classification tasks using the training and validation subsets, apply the classifier to the test set, and send us the resulting annotation file for assessment. The winner will be determined based on a weighted average of the classification performance on each domain, using the scoring scheme described below. At test time, your model is allowed to know the ground-truth domain of each test image (ImageNet, CIFAR-100, ...) but, of course, not its category.
It is up to you how to make use of the data: you can either train a single model for all tasks or ten independent ones. However, you are not allowed to use any external data source for training. Furthermore, we ask you to report the overall size of the model(s) used.
The CVPR 2017 competition winner is:
You can check the detailed breakdown here and here (legacy).
The 2017 competition is now finished. However, you can keep submitting entries and check the results in the leaderboards, as explained below.
Original submission instructions:
For the final submission (phase 2 of the competition), generate the `results.json` file for the images in the test set, as explained below. Pack it into a ZIP file and submit your entry using the VGG Codalab server. This phase opens July 10 2017 and closes July 20 2017 (midnight UTC).
For development (phase 1 of the competition), you can submit results on the validation subset instead of the test set.
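As a minimal sketch (assuming your predictions have already been written to a `results.json` file in the format described below; the archive name is arbitrary), the submission archive can be produced with a few lines of Python:

```python
# Minimal sketch: pack results.json into a ZIP archive for upload.
# Assumes results.json already exists in the current directory.
import zipfile

with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("results.json")
```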
In order to enter the challenge, please download the following files: the devkit with code and annotations, and the single TAR archive containing all the images.
Visual Decathlon contains the following datasets:
| Dataset | $E^\text{base}$ (%) | no. classes | training images | validation images | testing images |
|---|---|---|---|---|---|
| Aircraft | 39.66 | 100 | 3334 | 3333 | 3333 |
| CIFAR-100 | 17.88 | 100 | 40000 | 10000 | 10000 |
| Daimler Ped | 7.18 | 2 | 23520 | 5880 | 19600 |
| D. Textures | 44.47 | 47 | 1880 | 1880 | 1880 |
| GTSRB | 2.47 | 43 | 31367 | 7842 | 12630 |
| ImageNet | 40.13 | 1000 | 1232167 | 49000 | 48238 |
| Omniglot | 12.31 | 1623 | 19476 | 6492 | 6492 |
| SVHN | 3.45 | 10 | 47217 | 26040 | 26032 |
| UCF101 Dyn | 48.80 | 101 | 7629 | 1908 | 3783 |
| VGG-Flowers | 18.59 | 102 | 1020 | 1020 | 6149 |
The datasets have been pre-processed as follows:
All images have been resized isotropically to have a shorter side of 72 pixels (a minimal resize sketch is given below). For some datasets, such as ImageNet, this is a substantial reduction in resolution, which makes training models much faster (the baselines show that very good performance can still be obtained at this resolution).
When the original dataset did not provide publicly-available test annotations or pre-established splits, we created ad-hoc training, validation, and test splits.
All images have been stored in a single directory hierarchy `data/{aircraft,cifar100,...,vgg-flowers}/`.
Annotations are provided in Microsoft COCO format. There are three files for each domain (`annotations/aircraft_train.json`, `annotations/aircraft_val.json`, and `annotations/aircraft_test_stripped.json`, ...). The test annotations contain only the image names, not the class labels. The format of these files is described below.
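For reference, the isotropic resizing step mentioned above can be reproduced along the following lines (a sketch using Pillow with its default resampling filter; this is our illustration, not the organisers' exact preprocessing code):

```python
# Sketch of an isotropic resize: scale so the shorter side becomes 72 pixels,
# preserving the aspect ratio. Illustrative only, not the official pipeline.
from PIL import Image

def resize_shorter_side(path, target=72):
    img = Image.open(path)
    w, h = img.size
    scale = target / min(w, h)
    return img.resize((round(w * scale), round(h * scale)))
```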
In order to enter the challenge, evaluate your method on the test data and prepare a single result file `results.json`, comprising responses for all ten domains, in the format described below. Then follow this procedure for submission.
Each annotation file uses the MS COCO JSON format and has the following structure:
{"info":{"year":2017,"version":1,...},
"images":[image1,image2,...],
"annotations":[anno1,anno2,...],
"categories":[cat1,cat2,...]}
where `images` has the format:
image1={"id":10000001, "width":320, "height":256, "file_name":"images/dataset1/train/image1.jpg", ...}
`annotations` has the format:
anno1={"id":10000001, "image_id":10000001, "category_id":10000003,
"segmentation":[], "area":81920, "bbox":[0, 0, 320, 256], "iscrowd":0}
and `categories` has the format:
cat1={"id":10000001,"name":"category1","supercategory":"dataset1"}
The MS COCO format is a bit redundant for an image classification task. You only need to know the list of images in each domain and (for training) the corresponding category labels. Images and categories are given numeric IDs, using the formats $10^7 \times \textsf{domainNumber} + \textsf{imageNumber}$ and $10^7 \times \textsf{domainNumber} + \textsf{categoryNumber}$ respectively. We use 1-based indexing, so $\textsf{domainNumber}$, $\textsf{categoryNumber}$, and $\textsf{imageNumber}$ start from 1. Each annotation relates an image ID to the corresponding category ID (there is exactly one annotation per image). You can ignore the bounding box information. Annotations also have their own ID; since there is exactly one annotation per image, this is set equal to the ID of the corresponding image.
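For illustration, the following Python sketch (standard library only) loads one of the annotation files and decodes the domain, image, and category numbers from the numeric IDs; the file path is just an example and the helper name is ours:

```python
# Sketch: read a COCO-style annotation file and decode the ID scheme above.
import json

def split_id(decathlon_id):
    """Return (domainNumber, localNumber) from a 1-based decathlon ID."""
    return decathlon_id // 10**7, decathlon_id % 10**7

with open("annotations/aircraft_train.json") as f:
    coco = json.load(f)

# Exactly one annotation per image: map image ID -> category ID.
labels = {a["image_id"]: a["category_id"] for a in coco["annotations"]}

image_id, category_id = next(iter(labels.items()))
print("domain:", split_id(image_id)[0],
      "image:", split_id(image_id)[1],
      "category:", split_id(category_id)[1])
```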
The `results.json` file also uses the MS COCO format, as follows:
[res1,res2,...]
where each entry in the array is an `image_id`, `category_id` pair:
res1={"image_id":10000001,"category_id":10000001}
Note that `results.json` must contain exactly one annotation for each test image across all ten domains, in a single file. If for any reason you decide to give up on a domain, please fill the corresponding annotations randomly.
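As an illustration of the expected structure, the sketch below assembles a `results.json` file with exactly one entry per test image in every domain; the random category choice is only a stand-in for your model's predictions, and the file paths assume the devkit layout described above:

```python
# Sketch: build results.json with one (image_id, category_id) pair per test
# image. random.choice stands in for an actual classifier; category IDs are
# read from the corresponding *_train.json file of each domain.
import glob
import json
import random

random.seed(0)
results = []
for test_path in sorted(glob.glob("annotations/*_test_stripped.json")):
    train_path = test_path.replace("_test_stripped.json", "_train.json")
    with open(test_path) as f:
        test = json.load(f)
    with open(train_path) as f:
        train = json.load(f)
    category_ids = [c["id"] for c in train["categories"]]
    for image in test["images"]:
        results.append({"image_id": image["id"],
                        "category_id": random.choice(category_ids)})

with open("results.json", "w") as f:
    json.dump(results, f)
```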
If uncertain, the result file can be validated using the MATLAB evaluation code described below. Malformed files will not be accepted.
To simplify entering the challenge, we make available a devkit with code and annotations and a single TAR archive with all the images (see download buttons above).
We provide a reference implementation of the evaluation procedure in `code/evaluation.m` (MATLAB). To use it, run the following MATLAB fragment:
cd decathlon-1.0 ;
addpath code ;
evaluation('path/to/results_test.json') ;
By default, this runs the evaluation code assuming that `results_test.json` contains annotations for the test images. Since we do not ship the test labels, the resulting scores are not meaningful, but the call can still be used to validate the `results_test.json` file.
To evaluate on, e.g., the validation set instead, use:
evaluation('path/to/results_val.json','evaluationSet','val') ;
Each of the ten domains $d=1,\dots,10$ is a classification problem, evaluated in terms of average prediction error $E_d \in [0,1]$. This is the fraction of test images incorrectly classified, also known as top-1 classification error.
The overall score of an algorithm is computed as

$$ S = \sum_{d=1}^{10} \alpha_d \, \max\{0,\; E_d^\text{max} - E_d\}^{\gamma_d} $$

where:
$E_d^\text{max}$ is the maximum error allowed in order to receive points for a given domain. It is determined based on the performance of a reasonable baseline algorithm, as explained below.
$\gamma_d \geq 1$ is an exponent that rewards error reductions proportionally more as the error approaches zero. It is set to $\gamma_d=2$ for all domains.
$\alpha_d = 1000\,(E_d^\text{max})^{-\gamma_d}$ is a scaling coefficient chosen so that a perfect result ($E_d = 0$) receives 1000 points for that domain.
The maximum error $E_d^\text{max}=2E_d^\text{base}$ is set to twice the baseline error $E_d^\text{base}$. The baseline performances are determined from preliminary experiments as well as by consulting the state-of-the-art performance figures available in the literature. The errors are set such that the baseline method obtains a score of 250 points for each task and 2,500 points in total (see the tech report for details). In order to do well in the decathlon challenge, it is necessary to do well on all, or at least most, of the domains!
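To make the arithmetic concrete, here is a small Python sketch of the scoring rule as defined above (the function name and inputs are ours):

```python
# Sketch of the decathlon score: sum over domains of
# alpha_d * max(0, E_max_d - E_d)^gamma_d, with gamma_d = 2 and
# alpha_d = 1000 / E_max_d^gamma_d, so E_d = 0 earns 1000 points and
# E_d = E_max_d / 2 (the baseline error) earns 250 points.

def decathlon_score(errors, max_errors, gamma=2.0):
    """errors, max_errors: per-domain top-1 errors and maximum errors in [0, 1]."""
    total = 0.0
    for e, e_max in zip(errors, max_errors):
        alpha = 1000.0 / e_max ** gamma
        total += alpha * max(0.0, e_max - e) ** gamma
    return total

# A method matching the baseline on every domain scores 10 * 250 = 2500 points.
e_max = [0.10] * 10                                   # hypothetical maximum errors
print(decathlon_score([e / 2 for e in e_max], e_max))  # -> 2500.0
```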
The scoring system is designed to reward error reductions more when the error is already significantly lower than the baseline, reflecting the fact that further reductions are proportionally harder. One may think of using a logarithmic rule instead, such as $\alpha_d \log(E_d^\text{max}/E_d)$. Unfortunately, such a rule would have the unwanted property that a perfect result would receive an infinite number of points. The power law used in the decathlon strikes a balance, as shown in the figure below:
The figure plots the number of points received by an algorithm as a function of its classification error $E$, where the maximum error $E^\text{max}$ is set to 5%. The logarithmic point system diverges to infinity, whereas the decathlon system assigns at most 1000 points. For $\gamma_d=1$, the number of points is proportional to the error reduction, and for $\gamma_d>1$ further reductions in error are rewarded more strongly, similar to a logarithmic rule but not as extreme.
The evaluation protocol and baselines are discussed more thoroughly in this paper.
The organisers would like to thank the authors of the ten public benchmark datasets for allowing us to use their data in this challenge.
Data, code, and baselines for this challenge were prepared by Hakan Bilen, Sylvestre Rebuffi, Tomas Jakab from the Oxford Visual Geometry Group.
This challenge was presented as part of the "PASCAL in Detail" workshop at the Conference on Computer Vision and Pattern Recognition (CVPR), 2017, Honolulu. We would like to thank the workshop organisers:
This research is supported by ERC 677195-IDIU.
[1] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
[2] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[3] S. Munder and D. M. Gavrila. An experimental study on pedestrian classification. PAMI, 28(11):1863-1868, 2006.
[4] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proc. CVPR, 2014.
[5] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32(0):323-332, 2012.
[6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge, 2014.
[7] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.
[8] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[9a] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[9b] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, S. Gould. Dynamic Image Networks for Action Recognition. In Proc. CVPR, 2016.
[10] M-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proc. ICVGIP, Dec 2008.