Many datasets used by the deep learning community consist of a single predefined training and test split. For example, in the previous experiments on CIFAR-10 we stated that a set of 50,000 images was used for training, and another set of 10,000 images was used for testing. A single split yields only one performance measurement per algorithm configuration. To perform a significance test, and thus have some degree of confidence in our results and the conclusions we draw from them, we must gather multiple measurements of how well models trained using a particular algorithm configuration perform. To this end, we propose the Scaled ImageNet Subset (SINS-10) dataset, a set of 100,000 colour images retrieved from the ImageNet collection. The images are evenly divided into 10 different classes, and each of these classes is associated with multiple synsets from the ImageNet database. All images were first resized, with their aspect ratio maintained, such that their smallest dimension was 96 pixels. Then, the central 96×96 pixel subwindow of each image was extracted to be used as the final instance.
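The resize-then-centre-crop step can be sketched as plain arithmetic. This is only an illustration and not the code used to build the dataset: the function name `resize_and_crop_box` is hypothetical, and the rounding of the resized dimensions and the choice of crop offset when the margin is odd are assumptions, since the text does not specify them.

```python
def resize_and_crop_box(width, height, target=96):
    """Return the resized (width, height) and the central crop box.

    The box is (left, top, right, bottom) in the resized image's
    coordinates. Rounding behaviour is an assumption of this sketch.
    """
    # Scale so that the smallest dimension becomes `target` pixels,
    # keeping the aspect ratio.
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))

    # Centre a target x target window inside the resized image.
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)
```

With an image library such as Pillow, the first tuple could be passed to `Image.resize` and the second to `Image.crop`, which takes a `(left, upper, right, lower)` box.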
An important difference between the proposed dataset and currently available benchmark datasets is how it has been split into training and testing data. The entire dataset is divided into 10 equal-sized, predefined folds of 10,000 instances. The first 9,000 images in each fold are intended for training a model, and the remaining 1,000 for testing it. One can then apply a machine learning technique to each fold in the dataset, and repeat the process for each technique one wishes to compare against. This will result in 10 performance measurements for each algorithm. A paired t-test can then be used to determine whether there is a significant difference, at a chosen significance level, between the performance of the different techniques.
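As a minimal sketch of the comparison step, the paired t statistic can be computed from the per-fold scores of two techniques using only the standard library. The function name `paired_t_statistic` and the constant name are hypothetical. The critical value 2.262 is the two-sided 5% point of Student's t distribution with 9 degrees of freedom, which is the relevant case when there are 10 folds.

```python
import math
import statistics

# Two-sided 5% critical value of Student's t with 9 degrees of freedom
# (10 paired measurements, one per fold).
T_CRIT_DF9_5PCT = 2.262


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two equal-length lists of per-fold scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both techniques need one score per fold")
    # Work on the per-fold differences, since the measurements are paired.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(n))


def is_significant(scores_a, scores_b, t_crit=T_CRIT_DF9_5PCT):
    """True if |t| exceeds the critical value (default assumes 10 folds)."""
    return abs(paired_t_statistic(scores_a, scores_b)) > t_crit
```

Given two lists holding the ten per-fold test accuracies of two techniques, `is_significant(acc_a, acc_b)` reports whether the difference is significant at the 5% level. If SciPy is available, `scipy.stats.ttest_rel(acc_a, acc_b)` returns the same statistic together with a p-value.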
Note that the protocol for SINS-10 is different to the commonly used cross-validation technique. When performing cross-validation, the training sets overlap significantly, and the measurements for the test fold performance are therefore not independent. To mitigate this, one can use a heuristic for correcting the paired t-test (Nadeau and Bengio, 2000). Rather than use this heuristic, we simply avoid fitting models using overlapping training (or test) sets, and can therefore use the standard paired t-test.
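For contrast, the correction heuristic mentioned above can be sketched as follows. This assumes the commonly cited form of the corrected resampled t-test, in which the usual 1/k variance factor is inflated by the ratio of test-set size to training-set size. The function name and arguments are hypothetical, and the cited paper should be consulted for the exact formulation. The SINS-10 protocol does not need this correction.

```python
import math
import statistics


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected t statistic for k score differences from overlapping splits.

    Assumed form: mean / sqrt((1/k + n_test/n_train) * sample variance).
    Setting n_test to 0 recovers the standard paired t statistic.
    """
    k = len(diffs)
    mean_diff = statistics.mean(diffs)
    var_diff = statistics.variance(diffs)  # sample variance (k - 1)
    return mean_diff / math.sqrt((1.0 / k + n_test / n_train) * var_diff)
```

Because the extra term only enlarges the denominator, the corrected statistic is never larger in magnitude than the standard one, which makes the test more conservative.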
- SINS-10 binary files (2.4GB)
The SINS-10 archive contains two binary files, X.bin and y.bin.
The X.bin file contains all 100,000 images in the dataset, each of which is stored as an array of raw unsigned bytes. The images have three channels and are 96×96 pixels, so each image is exactly 27,648 bytes. The first 9,216 bytes of each image are the red channel, followed by the green channel, and finally the blue channel. The pixels in each channel are stored in row-major order. There is no delimiter between images. The first 10,000 images in this file form the first fold, the next 10,000 images the second fold, and so on.
The y.bin file is exactly 100,000 bytes long, and each byte is the zero-indexed label of the corresponding image in X.bin.
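The layout described above can be read with NumPy roughly as follows. This is a sketch rather than an official loader: the function names `load_sins10`, `get_fold` and `to_hwc` are hypothetical, and the default paths assume the two files have been extracted into the working directory. A memory map is used for X.bin so that the whole file does not have to be held in memory at once.

```python
import numpy as np

IMAGE_SHAPE = (3, 96, 96)   # channels (R, G, B), rows, columns
IMAGE_BYTES = 3 * 96 * 96   # 27,648 bytes per image
FOLD_SIZE = 10_000          # images per fold
TRAIN_PER_FOLD = 9_000      # first 9,000 of each fold are for training


def load_sins10(x_path="X.bin", y_path="y.bin"):
    """Return (X, y): images as (N, 3, 96, 96) uint8, labels as (N,) uint8."""
    # Each image is stored channel by channel, each channel in row-major
    # order, with no delimiter, so a plain reshape recovers the layout.
    X = np.memmap(x_path, dtype=np.uint8, mode="r").reshape(-1, *IMAGE_SHAPE)
    y = np.fromfile(y_path, dtype=np.uint8)
    if len(X) != len(y):
        raise ValueError("X.bin and y.bin describe different image counts")
    return X, y


def get_fold(X, y, fold, fold_size=FOLD_SIZE, n_train=TRAIN_PER_FOLD):
    """Return ((X_train, y_train), (X_test, y_test)) for a zero-indexed fold."""
    start = fold * fold_size
    split = start + n_train
    stop = start + fold_size
    return (X[start:split], y[start:split]), (X[split:stop], y[split:stop])


def to_hwc(image):
    """Convert one (3, 96, 96) image to (96, 96, 3) for plotting libraries."""
    return np.transpose(image, (1, 2, 0))
```

For example, `X, y = load_sins10()` followed by `get_fold(X, y, 0)` gives the training and test portions of the first fold. The slices are contiguous and never overlap between folds.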