Hi! In this lesson, Lesson 2.5, I want to introduce you to the standard way of evaluating the performance of a machine learning algorithm, which is called cross-validation.
A couple of lessons back, we looked at evaluating on an independent test set, and we also talked about evaluating on the training set (don't do that). We also talked about evaluating using the "holdout" method by taking the one dataset and holding out a little bit for testing and using the rest for training.
There's a fourth option on Weka's Classify panel, which is called cross-validation, and that's what we're going to talk about here.
Cross-validation is a way of improving upon repeated holdout. We tried using the holdout method with different random-number seeds each time. That's called "repeated holdout". Cross-validation is a systematic way of doing repeated holdout that actually improves upon it by reducing the variance of the estimate.
We take a training set and we create a classifier. Then we're looking to evaluate the performance of that classifier, and there is a certain amount of variance in that evaluation, because it's all statistical underneath. We want to keep the variance in the estimation as low as possible.
Cross-validation is a way of reducing the variance, and a variant on cross-validation called "stratified cross-validation" reduces it even further. I'm going to explain that in this class.
In a previous lesson, we held out 10% for testing and we repeated that 10 times; that's the "repeated holdout" method. We've got one dataset, and we divided it independently 10 separate times into a training set and a test set.
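To make the repeated holdout idea concrete, here is a minimal sketch in plain Python. It is not what Weka runs internally; the function name `repeated_holdout_splits` and its parameters are made up for illustration, and it works on instance indices rather than a real dataset.

```python
import random
from collections import Counter

def repeated_holdout_splits(n_instances, test_fraction=0.1, repeats=10, seed=1):
    """Return a list of (train_indices, test_indices), one per independent split.

    Hypothetical helper: each repetition reshuffles the whole dataset with a
    different random-number seed, so the splits are independent of each other.
    """
    n_test = max(1, round(n_instances * test_fraction))
    splits = []
    for r in range(repeats):
        rng = random.Random(seed + r)      # a different seed each time
        indices = list(range(n_instances))
        rng.shuffle(indices)
        splits.append((indices[n_test:], indices[:n_test]))
    return splits

splits = repeated_holdout_splits(100)

# Count how often each instance ends up in a test set across the 10 splits.
times_tested = Counter(i for _, test in splits for i in test)
never_tested = 100 - len(times_tested)
print("instances never used for testing:", never_tested)
```

Because the 10 splits are made independently, nothing stops one instance from landing in several test sets while another is never tested at all. That unevenness is what cross-validation removes.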
With cross-validation, we divide it just once, but we divide into, say, 10 pieces. Then we take 9 of the pieces and use them for training, and the last piece we use for testing. Then, with the same division, we take another 9 pieces and use them for training and the held-out piece for testing. We do the whole thing 10 times, using a different segment for testing each time. In other words, we divide the dataset into 10 pieces, and then we hold out each of these pieces in turn for testing, train on the rest, do the testing and average the 10 results. That would be "10-fold cross-validation".
Divide the dataset into 10 parts (these are called "folds"), hold out each part in turn, and average the results. So each data point in the dataset is used once for testing and 9 times for training. That's 10-fold cross-validation.
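Here is a small sketch of that procedure in plain Python, just to show the bookkeeping. The names `cross_validation_folds` and `cross_validate` are hypothetical, and `evaluate` stands for whatever function you supply to train on one set of indices and score on another; none of this is Weka's own code.

```python
import random

def cross_validation_folds(n_instances, k=10, seed=1):
    """Divide the instance indices once into k roughly equal folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_instances, evaluate, k=10, seed=1):
    """Hold out each fold in turn, train on the rest, and average the results.

    `evaluate(train, test)` is an assumed callback that trains on the `train`
    indices and returns a score (for example, accuracy) on the `test` indices.
    """
    folds = cross_validation_folds(n_instances, k, seed)
    scores = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(evaluate(train, test))
    return sum(scores) / k

folds = cross_validation_folds(150)

# Each instance is in exactly one fold, so it is tested once and used for
# training in the other k - 1 rounds.
times_tested = {i: sum(i in fold for fold in folds) for i in range(150)}

# A dummy callback that just reports the fraction of data used for training.
mean_train_fraction = cross_validate(150, lambda train, test: len(train) / 150)
```

The only randomness is in the single shuffle at the start; after that, the same division is reused for all 10 rounds.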
"Stratified" cross-validation is a simple variant where, when we do the initial division into 10 parts, we ensure that each fold has got approximately the correct proportion of each of the class values. Of course, there are many different ways of dividing a dataset into 10 equal parts. We just make sure we choose a division that has approximately the right representation of class values in each of the folds. That's "stratified cross-validation". It helps reduce the variance in the estimate a little bit more.
Then, once we've done the cross-validation, what Weka does is run the algorithm an eleventh time on the whole dataset. That will then produce a classifier that we might deploy in practice. We use 10-fold cross-validation in order to get an evaluation result and an estimate of the error, and then finally we run the learning algorithm one more time, on all the data, to get an actual classifier to use in practice.
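The sketch below counts those runs. It is plain Python, not Weka: `evaluate_then_build`, `learn_majority`, and `accuracy` are hypothetical names, the learner is a trivial majority-class baseline chosen only to keep the example short, and the folds are taken without shuffling for simplicity.

```python
from collections import Counter

def evaluate_then_build(instances, learn, accuracy, k=10):
    """Run k-fold cross-validation, then train once more on everything."""
    folds = [instances[i::k] for i in range(k)]   # no shuffling, for brevity
    scores = []
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = learn(train)                      # runs 1 to k: evaluation only
        scores.append(accuracy(model, folds[i]))
    final_model = learn(instances)                # run k + 1: the one to deploy
    return sum(scores) / k, final_model

calls = []                                        # training-set size of each run

def learn_majority(train):
    """Toy learner: always predict the most common class in the training data."""
    calls.append(len(train))
    return Counter(label for _, label in train).most_common(1)[0][0]

def accuracy(model, test):
    return sum(1 for _, label in test if label == model) / len(test)

# Made-up data: 150 (id, label) pairs, two thirds "yes" and one third "no".
data = [(i, "yes" if i % 3 else "no") for i in range(150)]
estimate, final_model = evaluate_then_build(data, learn_majority, accuracy)
print("learner invoked", len(calls), "times; last run used", calls[-1], "instances")
```

The 10 models built during cross-validation are thrown away once they have been scored. Only the eleventh, trained on all the data, is kept.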
That's what I wanted to tell you. Cross-validation is better than repeated holdout, and we'll look at that in the next lesson. Stratified cross-validation is even better. Weka does stratified cross-validation by default. With 10-fold cross-validation, Weka invokes the learning algorithm 11 times, once for each fold of the cross-validation and then a final time on the entire dataset. A practical rule of thumb is that if you've got lots of data you can use a percentage split, and evaluate it just once. Otherwise, if you don't have too much data, you should use stratified 10-fold cross-validation.
How big is "lots"? Well, this is what everyone asks. How long is a piece of string? It's hard to say, but it depends on a few things, such as the number of classes in your dataset. If you've got a two-class dataset and you had, say, 100 to 1000 data points, that would probably be good enough for a pretty reliable evaluation. If you did a 90%/10% split into training and test sets and you had, say, 10,000 data points in a two-class problem, then I think you'd have lots and lots of data. You wouldn't need to go to cross-validation. If, on the other hand, you had 100 different classes, then that's different, right? You would need a larger dataset, because you want a fair representation of each class when you do the evaluation.
It's really hard to say exactly; it depends on the circumstances. If you've got thousands and thousands of data points, you might just do things once with holdout. If you've got fewer than a thousand data points, even with a two-class problem, then you might as well do 10-fold cross-validation. It doesn't really take much longer. Well, it takes about 10 times as long, but the times are generally pretty short.
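If you want a rough feel for how the number of test instances affects reliability, one standard back-of-the-envelope tool is the binomial standard error of an accuracy estimate. The sketch below is my own illustration, not something from Weka: the function name is made up, and it assumes the test instances are independent and that the true accuracy is around 80%.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test instances.

    Assumes each test instance is an independent trial that is classified
    correctly with probability `accuracy`.
    """
    return math.sqrt(accuracy * (1 - accuracy) / n_test)

# Assumed true accuracy of 0.8, with test sets of increasing size.
for n_test in (10, 100, 1000):
    print(n_test, round(accuracy_standard_error(0.8, n_test), 3))
```

Under those assumptions, 100 test instances give a standard error of about 4 percentage points, and 1000 give a little over 1 point. With only 10 test instances it is more than 12 points, which is why a single holdout on a small dataset tells you so little.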
You can read more about this in Section 5.3 of the course text on cross-validation. And now it's time for you to go and do the activity associated with this lesson. See you soon!