Data/Software for choosing between learning algorithms

This directory contains the data and software used for the paper on calibrated
hypothesis tests. It is provided in the hope that it can be used to apply
alternative tests whose outcomes can be compared directly with those of the
calibrated test.

Licences

The software is (mostly) GPL; some parts were copied from elsewhere (see the
code for details). The data is free.

Everything is provided as is, without any warranty, etc.

Data description

There are 4 data sources, each of which generated 1000 training files of 300
instances (stored in the set1,...,set4 subdirectories). The '.dat' files
contain the outcomes of various experiments and can be used as input data for
hypothesis tests.

Each line in a file contains the difference in accuracy for one fold of an
r x k fold cv (r runs of k-fold cross-validation).

The format of the files is as follows.
For each of the 1000 training sets there is one block, starting with the name
of the training set (e.g. out000.arff for the first training set) followed by
the outcomes of the r x k fold cv experiment. The first line after the label
contains the accuracy difference between C4.5 and naive Bayes for run 1, fold
1. The next line contains the outcome for run 1, fold 2, and so on up to run
1, fold k. Then the run number is increased and the fold reset to 1. This is
repeated until all runs and folds have been listed.
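
The block layout described above can be read with a few lines of code. The
following is only a minimal sketch in Python, not the reader contained in
test.jar. It assumes that a label line is any non-blank line that does not
parse as a number, and that every block holds exactly r * k values; the names
parse_blocks and to_matrix are made up for this illustration.

```python
def to_matrix(values, r, k):
    """Arrange a flat list of r*k values as r rows (runs) of k folds."""
    if len(values) != r * k:
        raise ValueError("expected %d values, got %d" % (r * k, len(values)))
    return [values[i * k:(i + 1) * k] for i in range(r)]


def parse_blocks(lines, r, k):
    """Return {training set name: matrix[run][fold]} for one .dat file.

    Assumption: a line that is not a number is the label of a new block.
    """
    blocks = {}
    name, values = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        try:
            values.append(float(line))
        except ValueError:
            # Non-numeric line: close the previous block, start a new one.
            if name is not None:
                blocks[name] = to_matrix(values, r, k)
            name, values = line, []
    if name is not None:
        blocks[name] = to_matrix(values, r, k)
    return blocks
```

For a real file, r and k are taken from the filename, for example
parse_blocks(open("10x20d1.dat"), 10, 20). Entry [i][j] of a block is then the
accuracy difference for run i+1, fold j+1.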

There are two types of files: 
1. files based on class probabilities of 50% (the majority of the files)

Filenames are of the form
<r>x<k>d<set>.dat
where
r = nr of runs
k = nr of folds
set = nr of dataset (as explained in the data set description of the t-test.tex article)

2. files based on class probabilities of other than 50%

Filenames are of the form
<r>x<k>d.<prob>.dat
where
r = nr of runs
k = nr of folds
prob = class probability
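
The two filename forms above can be taken apart mechanically, which is handy
for looping over all files. This is a hedged sketch in Python; the function
name parse_name and the returned dictionary keys are assumptions made for this
example and are not part of test.jar.

```python
import re

# <r>x<k>d<set>.dat, e.g. 100x10d1.dat
_SET_FORM = re.compile(r"^(\d+)x(\d+)d(\d+)\.dat$")
# <r>x<k>d.<prob>.dat, e.g. 10x10d.0.10.dat
_PROB_FORM = re.compile(r"^(\d+)x(\d+)d\.(\d+\.\d+)\.dat$")


def parse_name(filename):
    """Split a .dat filename into runs, folds and set number or class probability."""
    m = _SET_FORM.match(filename)
    if m:
        return {"runs": int(m.group(1)), "folds": int(m.group(2)),
                "set": int(m.group(3)), "prob": None}
    m = _PROB_FORM.match(filename)
    if m:
        return {"runs": int(m.group(1)), "folds": int(m.group(2)),
                "set": None, "prob": float(m.group(3))}
    raise ValueError("unrecognised filename: " + filename)
```

For files of the first form the "prob" entry is left as None, since the
filename itself does not carry a class probability.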

File list

test.jar source code used in the experiments.
It contains classes to read the data files (see below) and perform
hypothesis test experiments. Admittedly, documentation is a bit sparse
(you are looking at it right now...)

100x10d1.dat  100 runs, 10 fold, set 1 (see my ICML 2003 paper for a description of the sets)
100x10d2.dat    "        "       set 2
100x10d3.dat    "        "       set 3
100x10d4.dat    "        "       set 4
100x2d1.dat     "       2 fold   set 1
100x2d2.dat     "       2 fold   set 2
100x2d3.dat     "       2 fold   set 3
100x2d4.dat     "       2 fold   set 4
100x5d1.dat     "       5 fold   set 1
100x5d2.dat     "       5 fold   set 2
100x5d3.dat     "       5 fold   set 3
100x5d4.dat     "       5 fold   set 4
10x20d1.dat     10 runs, 20 fold, set 1
10x20d2.dat      "        "       set 2
10x20d3.dat      "        "       set 3
10x20d4.dat      "        "       set 4
10x30d1.dat      "       30 fold, set 1
10x30d2.dat     etc
10x30d3.dat
10x30d4.dat
10x40d1.dat
10x40d2.dat
10x40d3.dat
10x40d4.dat
10x50d1.dat
10x50d2.dat
10x50d3.dat
10x50d4.dat
set1/           input arff files for set 1
set2/           input arff files for set 2
set3/           input arff files for set 3
set4/           input arff files for set 4
10x10d.0.10.dat 10 runs, 10 fold, set with 10% class probability
10x10d.0.20.dat   "        "      set with 20% class probability
10x10d.0.30.dat   "        "      set with 30% class probability
10x10d.0.40.dat   "        "      set with 40% class probability
10x10d.0.50.dat   "        "      set with 50% class probability
readme.txt      this file