Hi! Good to see you again. One of the things I like to do with my time is play music, and that little bit of Mozart you hear at the beginning of these videos, that's me and two friends playing a clarinet trio. I play in an orchestra, and last night I was playing some jazz with a little trio. If you want to hear us play, if you go to Google and just find my home page -- type my name, Ian Witten -- you'll get me here, and every time you visit this page, I'll play you a tune [plays "Goodbye Pork Pie Hat"]. If you refresh the page, I'll play you another tune [plays "Lullaby of Birdland"]. Yeah, that's what I do.
Anyway, that's not what we're here for. We're here to talk about Lesson 2.6, which is about cross-validation results. We learned about cross-validation in the last lesson. I said that cross-validation was a better way of evaluating your machine learning algorithm, evaluating your classifier, than repeated holdout, repeating the holdout method. Cross-validation does things 10 times. You can use holdout to do things 10 times, but cross-validation is a better way of doing things.
Let's just do a little experiment here. I'm going to start up Weka and open the diabetes dataset. Here it is, diabetes.arff. The baseline accuracy, which ZeroR gives me -- that's the default classifier, by the way, rules > ZeroR -- if I just run that, well, it will evaluate it using cross-validation. Actually, for a true baseline, I should just use the training set. That'll just look at the percentage chance of getting a correct result if we simply guess the most likely class, in this case 65.1%. That's the baseline accuracy.
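To make the baseline concrete, here is a minimal sketch of what a ZeroR-style evaluation on the training set amounts to: count the classes and always guess the most frequent one. The class counts used here (500 tested_negative, 268 tested_positive) are an assumption about diabetes.arff and are not stated in the lesson, and `zero_r_accuracy` is a hypothetical helper, not Weka's code.

```python
from collections import Counter


def zero_r_accuracy(labels):
    """Return the majority class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)


# Assumed class distribution for diabetes.arff (not given in the lesson).
labels = ["tested_negative"] * 500 + ["tested_positive"] * 268

majority_class, baseline = zero_r_accuracy(labels)
print(f"{majority_class}: {baseline:.1%}")  # tested_negative: 65.1%
```

Under that assumed distribution, the figure agrees with the 65.1% that ZeroR reports.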
That's the first thing you should do with any dataset. Then we're going to look at J48, which is down here under "trees". There it is. I'm going to evaluate it with 10-fold cross-validation. It takes just a second to do that. I get a result of 73.8%, and we can change the random-number seed like we did before. The default is 1; let's put a random-number seed of 2. Run it again; I get 75%. Do it again; change it to, say, 3 -- I can choose anything I want, of course. Run it again, and I get 75.5%.
These are the numbers I get on this slide with 10 different random-number seeds. Those are the same numbers on this slide, in the right-hand column, the 10 values I got: 73.8%, 75.0%, 75.5%, and so on. I can calculate the mean, the sample mean, which for that right-hand column is 74.5%, and the sample standard deviation, which is 0.9%, using just the same formulas that we used before.
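As a reminder of those formulas, here is a small sketch of the sample mean and the sample standard deviation (dividing by n - 1). Only the three accuracies quoted above are used, because the other seven values are on the slide and not in this text, so the printed numbers will not reproduce the 74.5% and 0.9% computed from all ten runs. The function names are illustrative.

```python
import math


def sample_mean(xs):
    """Arithmetic mean of the values."""
    return sum(xs) / len(xs)


def sample_std(xs):
    """Sample standard deviation, using n - 1 in the denominator."""
    m = sample_mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


# Only the three values quoted in the lesson; the slide lists ten.
accuracies = [73.8, 75.0, 75.5]

print(f"mean = {sample_mean(accuracies):.1f}%")
print(f"std  = {sample_std(accuracies):.1f}%")
```

With all ten accuracies from the slide in the list, the same two functions would give the mean and standard deviation reported for cross-validation.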
Before, we used these formulas for the holdout method -- we repeated the holdout 10 times. These are the results you get on this dataset if you repeat holdout, that is, using 90% for training and 10% for testing -- which is, of course, what we're doing with 10-fold cross-validation. I would get those results there, and if I average those, I get a mean of 74.8%, which is satisfactorily close to 74.5%, but I get a larger standard deviation, quite a lot larger: a standard deviation of 4.6% as opposed to 0.9% with cross-validation.
Now, you might be asking yourself why use 10-fold cross-validation. With Weka we can use 20-fold cross-validation or anything; we just set the number of folds here beside the cross-validation box to whatever we want. So we can use 20-fold cross-validation. What that would do is to divide the dataset into 20 equal parts and repeat 20 times: take one part out, train on the other 95% of the dataset, and test on the part we took out. Then it would do it a 21st time on the whole dataset.
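The fold arithmetic can be sketched in a few lines. This is only an illustration of the splitting idea, not Weka's implementation: it shuffles instance indices with a seed and deals them into folds, without stratifying by class. The dataset size of 768 instances is an assumption about diabetes.arff, and `k_fold_indices` is a hypothetical helper.

```python
import random


def k_fold_indices(n_instances, n_folds, seed=1):
    """Yield (train, test) index lists, one pair per fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # the seed controls the shuffle
    # Deal the shuffled indices into n_folds nearly equal parts.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Assumed dataset size (768 instances); 20 folds as in the example above.
splits = list(k_fold_indices(768, 20))
for train, test in splits[:3]:
    print(len(train), len(test), f"{len(train) / 768:.1%} used for training")
```

Each instance lands in exactly one test fold, so every instance is tested once, and each model is trained on roughly 95% of the data.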
So why 10? Why not 20? Well, it's a good question really, and there's not a very good answer. We want to use quite a lot of data for training, because, in the final analysis, we're going to use the entire dataset for training. If we're using 10-fold cross-validation, then we're using 90% of the dataset for training. Maybe it would be a little better to use 95% of the dataset for training, with 20-fold cross-validation. On the other hand, we want to make sure that what we evaluate on is a valid statistical sample.
With more folds, each test set gets smaller, so in general, it's not necessarily a good idea to use a large number of folds with cross-validation. Also, of course, 20-fold cross-validation will take twice as long as 10-fold cross-validation. The upshot is that there isn't a really good answer to this question, but the standard thing to do is to use 10-fold cross-validation and that's why it's Weka's default.
We've shown in this lesson that cross-validation really is better than repeated holdout. Remember, on the last slide, we found that we got about the same mean for repeated holdout as for cross-validation, but we got a much smaller variance for cross-validation. We know that when we evaluate this machine learning method, J48, on this dataset, diabetes, we get 74.5% accuracy, probably somewhere between 73.5% and 75.5%. That is actually substantially larger than the baseline of 65.1%. So J48 is doing something for us better than the baseline. Cross-validation reduces the variance of the estimate.
That's the end of this class. Off you go and do the activity. I'll see you at the next class. Bye for now!