Hello again! In the last lesson, we looked at training and testing. We saw that we can evaluate a classifier on an independent test set, or using a percentage split, with a certain percentage of the dataset used to train and the rest used for testing, or -- and this is generally a very bad idea -- we can evaluate it on the training set itself, which gives misleadingly optimistic performance figures.
In this lesson, we're going to look a little bit more at training and testing. In fact, what we're going to do is repeatedly train and test using percentage split. Now, in the last lesson we saw that if you simply repeat the training and testing, you get the same result each time because Weka initializes the random number generator before it does each run to make sure that you know what's going on when you do the same experiment again tomorrow. But there is a way of overriding that. So we will be using independent random numbers on different occasions to produce a percentage split of the dataset into a training and test set.
I'm going to open the segment-challenge data again, that's what we used before. Notice there are 1500 instances here; that's quite a lot. I'm going to go to Classify. I'm going to choose J48, our standard method I guess. I'm going to use a percentage split, and because we've got 1500 instances, I'm going to choose 90% for training and just 10% for testing. I reckon that 10% -- that's 150 instances -- for testing is going to give us a reasonable estimate, and we might as well train on as many as we can to get the most accurate classifier.
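To make the idea of a seeded percentage split concrete, here is a minimal sketch in Python. It is not Weka's own code: the function name `percentage_split` and the shuffle-then-cut approach are assumptions made for illustration. It only shows the general principle that the same seed reproduces the same 90%/10% split, while a different seed gives a different one.

```python
import random

def percentage_split(n_instances, train_fraction, seed):
    """Hypothetical helper: shuffle instance indices with a seeded
    generator, then cut them into a training part and a test part."""
    rng = random.Random(seed)            # a fresh generator per call, so a seed always repeats its shuffle
    indices = list(range(n_instances))
    rng.shuffle(indices)
    cut = round(n_instances * train_fraction)
    return indices[:cut], indices[cut:]

# 1500 instances, 90% for training, as in the lesson.
train_a, test_a = percentage_split(1500, 0.9, seed=1)
train_b, test_b = percentage_split(1500, 0.9, seed=1)   # same seed again
train_c, test_c = percentage_split(1500, 0.9, seed=2)   # different seed

print(len(train_a), len(test_a))   # sizes of the two parts
print(test_a == test_b)            # same seed: identical test set
print(test_a == test_c)            # different seed: a different test set
```

Because the test set changes with the seed, the accuracy measured on it changes too, which is the variation explored in the rest of this lesson.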
I'm going to run this, and the accuracy figure I get -- this is what I got in the last lesson -- is 96.6667%. Now, that's a misleadingly precise figure; I'm going to call it 96.7%, or 0.967. Then I'm going to do it again and see how much that figure varies, initializing the random number generator with a different seed each time.
If I go to the "More options" menu, I get a number of options here which are quite useful: outputting the model (we're doing that); outputting statistics; we can output different evaluation measures; we're doing the confusion matrix; we're storing the predictions for visualization; we can output the predictions if we want; we can do a cost-sensitive evaluation; and we can set the random seed for cross-validation or percentage split. That's set by default to 1. I'm going to change it to 2, a different random seed. We could also output the source code for the classifier if we wanted, but I just want to change the random seed. Then I want to run it again. Before we got 0.967, and this time we get 0.94, 94%. Quite different, you see. If I change the seed again to, say, 3, and run it again, I get 94% again. If I change it again to 4 and run it again, I get 96.7%. Let's do one more: change it to 5, run it again, and now I get 95.3%.
Here's a table with these figures in. If we run it 10 times, we get this set of results. Given this set of experimental results, we can calculate the mean and standard deviation. The sample mean is the sum of all of these error figures -- or these success rates, I should say -- divided by the number, 10 of them. That's 0.949, about 95%. That's really what we would expect to get. That's a better estimate than the 96.7% that we started out with. A more reliable estimate.
We can calculate the sample variance. We take the deviations from the mean: we subtract the mean from each of these numbers, square the results, add them up, and divide, not by n, but by n - 1. That might surprise you, perhaps. The reason for n - 1 is that we've actually calculated the mean from this sample. When the mean is calculated from the sample, you need to divide by n - 1, leading to a slightly larger variance estimate than if you were to divide by n.
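Written out, the calculation just described looks like this. The symbols are notation chosen here, not taken from the lesson: x_i is the success rate from run i, n is the number of runs, x-bar is the sample mean, and s is the sample standard deviation.

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
\qquad
s = \sqrt{s^2}
```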
We take the square root of that, and in this case, we get a standard deviation of 1.8%. Now you can see that the real performance of J48 on the segment-challenge dataset is approximately 95% accuracy, plus or minus approximately 2%. Anywhere, let's say, between 93% and 97% accuracy.
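As a small worked example, the same calculation can be sketched in Python with the standard `statistics` module. Note the assumption: only the five success rates quoted earlier in this lesson (seeds 1 to 5) are used, because the full table of ten runs is not reproduced in this transcript. The numbers it prints therefore differ a little from the 0.949 and 1.8% obtained from all ten runs.

```python
import statistics

# Success rates for seeds 1 to 5, as quoted in the lesson.
# Assumption: only these five runs are used; the ten-run table is not available here.
rates = [0.967, 0.940, 0.940, 0.967, 0.953]

mean = statistics.mean(rates)
sd_sample = statistics.stdev(rates)        # divides by n - 1 (sample standard deviation)
sd_population = statistics.pstdev(rates)   # divides by n, shown only for comparison

print(f"mean              = {mean:.4f}")
print(f"std dev (n - 1)   = {sd_sample:.4f}")
print(f"std dev (n)       = {sd_population:.4f}")
```

For these five runs the mean is 0.9534 and the n - 1 standard deviation is about 0.0135, roughly 1.4%. Dividing by n instead gives a smaller figure, which is the difference the lesson describes.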
These figures that you get -- that Weka puts out for you -- are misleading. You need to be careful how you interpret them, because the result is certainly not 95.3333%. There's a lot of variation in a lot of these figures.
Remember, the basic assumption is the training and test sets are sampled independently from an infinite population, and you should expect a slight variation in results -- perhaps more than just a slight variation in results. You can estimate the variation in results by setting the random-number seed and repeating the experiment. You can calculate the mean and the standard deviation experimentally, which is what we just did.
Off you go now, and do the activity associated with this lesson. I'll see you in the next lesson. Bye!