Data/Software for choosing between learning algorithms
This directory contains the data and software used for the paper on calibrated
hypothesis tests. It is provided in the hope that it can be used to apply
alternative tests whose outcomes can be compared directly with those of the
calibrated tests.
Licences
Software is (mostly) GPL (some parts are copied from elsewhere; see the code for details),
data is free.
Everything is provided as is, without any warranty.
Data description
There are 4 data sources, each of which generated 1000 training files of 300
instances each (stored in the set1,...,set4 subdirectories). The '.dat' files
contain the outcomes of various experiments that can be used as input data for
hypothesis tests.
Each data line in a file contains the difference in accuracy for one fold of an
r x k fold cross-validation (cv).
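As a rough illustration of how such differences can feed a hypothesis test, the
sketch below computes an ordinary one-sample t statistic over the r x k
differences of a single training set. This is NOT the calibrated test from the
paper and is not taken from test.jar; the function name naive_t_statistic is
hypothetical. The fold differences of a repeated cv are not independent, so
this statistic is only a baseline to compare other tests against.

```python
import math
import statistics


def naive_t_statistic(differences):
    """One-sample t statistic for H0: mean accuracy difference is zero.

    `differences` is a flat list of the r*k per-fold accuracy differences
    of one training set. Assumes nothing about how they were produced.
    """
    n = len(differences)
    if n < 2:
        raise ValueError("need at least two differences")
    mean = statistics.fmean(differences)
    sd = statistics.stdev(differences)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / (sd / math.sqrt(n))
```

The result would be compared against a t distribution with n - 1 degrees of
freedom, where n = r * k is the number of differences.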
The format of the files is as follows:
For each of the 1000 training sets there is one block, starting with the name
of the training set (e.g. out000.arff for the first training set) followed by
the outcomes of the r x k fold cv experiment. The first line after the label
contains the accuracy difference between C4.5 and naive Bayes for run 1, fold
1. The next line contains the outcome for run 1, fold 2, and so on up to run
1, fold k. Then the run number is increased and the fold reset to 1. This is
repeated until all runs and folds have been listed.
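The block layout described above can be read with a few lines of Python. This
is a minimal sketch, not the reader shipped in test.jar. It assumes that label
lines (such as out000.arff) never parse as numbers, that every other non-empty
line holds exactly one number, and that the caller supplies r and k. The name
parse_blocks is hypothetical.

```python
def parse_blocks(lines, runs, folds):
    """Return {label: rows} where rows[run][fold] is an accuracy difference.

    `lines` is any iterable of text lines, e.g. an open '.dat' file.
    Values are stored run by run, so value i belongs to run i // folds
    and fold i % folds (both counted from zero).
    """
    blocks = {}
    label = None
    values = []

    def store(name, vals):
        if len(vals) != runs * folds:
            raise ValueError(
                f"{name}: expected {runs * folds} values, got {len(vals)}")
        blocks[name] = [vals[i * folds:(i + 1) * folds] for i in range(runs)]

    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        try:
            number = float(text)
        except ValueError:
            # Not a number: this line is the label of a new block.
            if label is not None:
                store(label, values)
            label, values = text, []
            continue
        if label is None:
            raise ValueError("value found before the first block label")
        values.append(number)

    if label is not None:
        store(label, values)
    return blocks
```

For example, with a file opened as f, parse_blocks(f, runs=100, folds=10) would
be the call for 100x10d1.dat, and blocks["out000.arff"][0][1] would then be the
difference for run 1, fold 2.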
There are two types of files:
1. files based on class probabilities of 50% (the majority of the files)
   Filenames are of the form
     <r>x<k>d<set>.dat
   where
     r   = nr of runs
     k   = nr of folds
     set = nr of dataset (as explained in the data set description of the
           t-test.tex article)
2. files based on class probabilities other than 50%
   Filenames are of the form
     <r>x<k>d.<prob>.dat
   where
     r    = nr of runs
     k    = nr of folds
     prob = class probability
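Both naming schemes can be decoded with a single regular expression, which is
handy for obtaining r and k before reading a file. The following is a sketch
under the assumption that <prob> is always written as a decimal such as 0.10
(as in the file list); the names DatName and parse_dat_name are hypothetical
and not part of test.jar.

```python
import re
from typing import NamedTuple, Optional


class DatName(NamedTuple):
    runs: int
    folds: int
    dataset: Optional[int]       # set for type 1 names, else None
    class_prob: Optional[float]  # set for type 2 names, else None


# <r>x<k>d<set>.dat   or   <r>x<k>d.<prob>.dat
_DAT_NAME = re.compile(r"(\d+)x(\d+)d(?:(\d+)|\.(\d+\.\d+))\.dat")


def parse_dat_name(filename):
    """Split a '.dat' filename into runs, folds and set or probability."""
    match = _DAT_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"unrecognised file name: {filename!r}")
    runs, folds, dataset, prob = match.groups()
    return DatName(
        runs=int(runs),
        folds=int(folds),
        dataset=int(dataset) if dataset is not None else None,
        class_prob=float(prob) if prob is not None else None,
    )
```

With this, parse_dat_name("100x10d1.dat") gives runs=100, folds=10, dataset=1,
and parse_dat_name("10x10d.0.10.dat") gives runs=10, folds=10, class_prob=0.1.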
File list
test.jar          source code used in the experiments.
                  It contains classes to read the data files (see below) and
                  perform hypothesis test experiments. Admittedly, documentation
                  is a bit sparse (you are looking at it right now...)
100x10d1.dat      100 runs, 10 fold, set 1 (see my ICML 2003 paper for a description of the sets)
100x10d2.dat      "         "        set 2
100x10d3.dat      "         "        set 3
100x10d4.dat      "         "        set 4
100x2d1.dat       "         2 fold,  set 1
100x2d2.dat       "         2 fold,  set 2
100x2d3.dat       "         2 fold,  set 3
100x2d4.dat       "         2 fold,  set 4
100x5d1.dat       "         5 fold,  set 1
100x5d2.dat       "         5 fold,  set 2
100x5d3.dat       "         5 fold,  set 3
100x5d4.dat       "         5 fold,  set 4
10x20d1.dat       10 runs,  20 fold, set 1
10x20d2.dat       "         "        set 2
10x20d3.dat       "         "        set 3
10x20d4.dat       "         "        set 4
10x30d1.dat       "         30 fold, set 1
10x30d2.dat       etc.
10x30d3.dat
10x30d4.dat
10x40d1.dat
10x40d2.dat
10x40d3.dat
10x40d4.dat
10x50d1.dat
10x50d2.dat
10x50d3.dat
10x50d4.dat
set1/             input arff files for set 1
set2/             input arff files for set 2
set3/             input arff files for set 3
set4/             input arff files for set 4
10x10d.0.10.dat   10 runs,  10 fold, set with 10% class probability
10x10d.0.20.dat   "         "        set with 20% class probability
10x10d.0.30.dat   "         "        set with 30% class probability
10x10d.0.40.dat   "         "        set with 40% class probability
10x10d.0.50.dat   "         "        set with 50% class probability
readme.txt        this file