Places Challenge 2017: Deep Scene Understanding is held jointly with the COCO Challenge at ICCV'17. Scene understanding is one of the hallmark tasks of computer vision, because it defines the context for object recognition. The goal of the Places Challenge is to stimulate the computer vision community to develop new algorithms and models that improve the state of the art in visual scene understanding. Winners will be invited to present at the Joint COCO and Places Challenge Workshop at ICCV 2017.

There are three tasks in Places Challenge 2017: Scene Parsing, Scene Instance Segmentation, and Semantic Boundary Detection. The data for all three tasks come from the fully annotated image dataset ADE20K, with 20K images for training, 2K images for validation, and 3K images for testing. Teams may participate in one, two, or all three of the tasks. The details of each task are listed below:

Task 1: Scene Parsing

Scene parsing segments an image into object and stuff categories. It is a pixel-wise classification task similar to semantic segmentation in Pascal, except that every pixel in every test image must be classified into a semantic category, either a stuff concept such as sky, grass, or road, or a discrete object such as person, car, or building. There are 150 semantic categories in total, covering 89% of all pixels in the dataset. The challenge data is divided into 20K images for training, 2K images for validation, and 3K images for testing. For each image, a segmentation algorithm produces a semantic segmentation mask that predicts the semantic category of each pixel. Performance is evaluated as the mean of the pixel-wise accuracy and the Intersection over Union (IoU) averaged over all 150 semantic categories.

Task 2: Instance Segmentation

Scene instance segmentation segments an image into object instances. Like Task 1, it is a pixel-wise classification task, but it also requires the algorithm to extract each individual object instance from the image. The motivation for this task is twofold: 1) to push research in semantic segmentation towards instance segmentation, and 2) to encourage more synergy among object detection, semantic segmentation, and scene parsing. The data share semantic categories with Task 1 but come with object instance annotations for 100 categories. The evaluation metric is Average Precision (AP) over all 100 semantic categories.

Task 3: Semantic Boundary Detection

Semantic boundary detection detects the boundaries of each object instance in an image. It is related to edge detection but focuses on associating boundaries with their object instances. The existing pixel-level annotations of all object instances in the ADE20K dataset make it possible to build a benchmark for semantic boundary detection that is much larger than the earlier BSDS500. The data for this task are the same images used in Task 1 and Task 2, with a total of 150 semantic categories. Submitted models will be evaluated using the F-measure at optimal dataset scale (F-ODS).


  • June 26, 2017: Development kit and data are made available.
  • Sep. 4, 2017: Testing set will be released.
  • Sep. 15, 2017: Submission deadline.
  • Sep. 26, 2017: Challenge results released.
  • Oct. 29, 2017: Winner(s) present at the ICCV'17 Joint COCO and Places Challenge Workshop.
Downloads

    Data and toolkit

    We release the training/validation data and the development toolkit as follows:
    Data & Toolkit for all the three tasks
    Training set
    20,210 images

    Validation set
    2,000 images

    Test set
    To be released.

    Rules & Evaluation

    General: Classification networks pre-trained on ImageNet and Places may be used. However, models trained on pixel-wise annotations, and pixel-wise annotation data from other sources, may not be used.

    Scene Parsing: To evaluate the segmentation algorithms, we will take the mean of the pixel-wise accuracy and the class-wise IoU as the final score. Pixel-wise accuracy is the ratio of pixels that are correctly predicted, while class-wise IoU is the Intersection over Union of pixels, averaged over all 150 semantic categories.
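
As a rough illustration of the scoring rule above, the following sketch computes pixel-wise accuracy, class-wise mean IoU, and their mean for a pair of label maps. It is not the official toolkit code. The function name `scene_parsing_score`, the convention that label 0 marks unlabeled pixels, and the choice to skip categories absent from both prediction and ground truth are all assumptions made for this example; the development toolkit remains the reference.

```python
import numpy as np

def scene_parsing_score(pred, gt, num_classes=150, ignore_label=0):
    """Return (pixel_acc, mean_iou, final_score) for integer label maps.

    Assumptions (not taken from the official toolkit): labels run from
    1 to num_classes, ignore_label marks unlabeled ground-truth pixels,
    and categories absent from both maps are left out of the IoU average.
    """
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    valid = gt != ignore_label  # only labeled pixels are scored

    # Pixel-wise accuracy: fraction of labeled pixels predicted correctly.
    pixel_acc = float((pred[valid] == gt[valid]).mean()) if valid.any() else 0.0

    # Class-wise IoU: intersection over union per category, then averaged.
    ious = []
    for c in range(1, num_classes + 1):
        p = (pred == c) & valid
        g = gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # category does not occur in either map
        ious.append(np.logical_and(p, g).sum() / union)
    mean_iou = float(np.mean(ious)) if ious else 0.0

    # Final score: mean of pixel-wise accuracy and class-wise IoU.
    return pixel_acc, mean_iou, (pixel_acc + mean_iou) / 2.0

if __name__ == "__main__":
    gt = [[1, 1], [2, 0]]
    pred = [[1, 2], [2, 2]]
    print(scene_parsing_score(pred, gt, num_classes=2))
```

To score a whole validation set with this sketch, one option is to accumulate the per-category intersections and unions over all images before dividing, rather than averaging per-image scores.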

    Instance Segmentation: The performance of the instance segmentation algorithms will be evaluated by Average Precision (AP, or mAP), following the COCO evaluation metrics. For each image, we take at most the 255 top-scoring instance masks across all categories. An instance mask prediction is counted only when its IoU with the ground truth is above a given threshold. We use 10 IoU thresholds, 0.50:0.05:0.95, for evaluation. The final AP is averaged across the 10 IoU thresholds and the 100 categories. Refer to the COCO API for the evaluation criteria.
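
The thresholding step described above can be sketched as follows. This example only shows how a single predicted mask is compared with a single ground-truth mask at the 10 IoU thresholds; it does not compute AP, which also involves score ranking and matching across all instances. The helper names `mask_iou` and `thresholds_passed` are hypothetical, and treating an IoU exactly equal to a threshold as a match is an assumption about tie handling.

```python
import numpy as np

# The 10 IoU thresholds 0.50:0.05:0.95 named in the rules.
IOU_THRESHOLDS = np.linspace(0.5, 0.95, 10)

def mask_iou(pred_mask, gt_mask):
    """Intersection over Union of two binary masks of the same shape."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 0.0  # both masks empty: treated as no overlap here
    return float(np.logical_and(pred_mask, gt_mask).sum() / union)

def thresholds_passed(pred_mask, gt_mask):
    """Boolean array: at which thresholds the prediction counts as a match.

    Assumption: a match requires IoU >= threshold.
    """
    return mask_iou(pred_mask, gt_mask) >= IOU_THRESHOLDS

if __name__ == "__main__":
    gt = np.zeros((4, 4), dtype=bool)
    gt[0:2, 0:3] = True    # 6 ground-truth pixels
    pred = np.zeros((4, 4), dtype=bool)
    pred[0:3, 0:3] = True  # 9 predicted pixels, IoU = 6/9
    print(mask_iou(pred, gt), thresholds_passed(pred, gt))
```

A prediction that overlaps loosely with its ground truth therefore contributes at the lower thresholds only, which is why averaging over the 10 thresholds rewards precise masks.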

    Semantic Boundary Detection: The performance of semantic boundary detection will be determined by the F-measure at optimal dataset scale (F-ODS). The evaluation toolkit also provides the F-measure at optimal image scale (F-OIS) and average precision (AP) to evaluate your model. Refer to the toolkit for details.
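
To make the difference between F-ODS and F-OIS concrete, here is a minimal sketch that follows the usual BSDS-style definitions as we understand them: F-ODS uses one detection threshold shared by the whole dataset, while F-OIS picks the best threshold separately for each image. The function name `boundary_f_scores` and the input layout (per-image, per-threshold match counts that a boundary matching step has already produced) are assumptions for this example; the evaluation toolkit is the authoritative implementation.

```python
import numpy as np

def _safe_div(a, b):
    """Elementwise a / b, returning 0 where b is 0."""
    a = np.atleast_1d(np.asarray(a, dtype=float))
    b = np.atleast_1d(np.asarray(b, dtype=float))
    return np.divide(a, b, out=np.zeros_like(a), where=b > 0)

def _f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return _safe_div(2.0 * p * r, p + r)

def boundary_f_scores(cnt_r, sum_r, cnt_p, sum_p):
    """Return (f_ods, f_ois) from arrays of shape (num_images, num_thresholds).

    cnt_r / sum_r: matched and total ground-truth boundary pixels.
    cnt_p / sum_p: matched and total predicted boundary pixels.
    (Hypothetical input layout chosen for this sketch.)
    """
    cnt_r, sum_r, cnt_p, sum_p = (
        np.asarray(x, dtype=float) for x in (cnt_r, sum_r, cnt_p, sum_p)
    )

    # F-ODS: pool counts over images, then pick the single best threshold.
    r_ods = _safe_div(cnt_r.sum(axis=0), sum_r.sum(axis=0))
    p_ods = _safe_div(cnt_p.sum(axis=0), sum_p.sum(axis=0))
    f_ods = float(_f_measure(p_ods, r_ods).max())

    # F-OIS: pick the best threshold per image, then pool those counts.
    f_img = _f_measure(_safe_div(cnt_p, sum_p), _safe_div(cnt_r, sum_r))
    best = f_img.argmax(axis=1)
    rows = np.arange(cnt_r.shape[0])
    r_ois = _safe_div(cnt_r[rows, best].sum(), sum_r[rows, best].sum())
    p_ois = _safe_div(cnt_p[rows, best].sum(), sum_p[rows, best].sum())
    f_ois = float(_f_measure(p_ois, r_ois)[0])
    return f_ods, f_ois

if __name__ == "__main__":
    # Two images, two thresholds (made-up counts for illustration only).
    cnt_r = [[8, 5], [9, 4]]
    sum_r = [[10, 10], [10, 10]]
    cnt_p = [[8, 5], [9, 4]]
    sum_p = [[16, 5], [10, 4]]
    print(boundary_f_scores(cnt_r, sum_r, cnt_p, sum_p))
```

Because F-OIS is free to choose a threshold per image, it is never lower than F-ODS under these definitions, which is one reason the single-threshold F-ODS is the stricter ranking metric.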

    Submissions should be uploaded through the challenge server (TBA).


    Bolei Zhou

    Hang Zhao

    Xavier Puig

    Zhiding Yu

    Sanja Fidler
    University of Toronto

    Antonio Torralba


    If you find this scene parsing challenge or the data useful, please cite the following papers:

    Scene Parsing through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. Computer Vision and Pattern Recognition (CVPR), 2017.

    Semantic Understanding of Scenes through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. arXiv:1608.05442.