Hi! In this lesson, Lesson 2.5, I want to introduce you to the standard way of evaluating the performance of a machine learning algorithm, which is called cross-validation.

A couple of lessons back, we looked at evaluating on an independent test set, and we also talked about evaluating on the training set (don't do that). We also talked about evaluating using the holdout method: taking the one dataset, holding out a little bit for testing, and using the rest for training. There is a fourth option on Weka's Classify panel, which is called cross-validation, and that's what we're going to talk about here.

Cross-validation is a way of improving upon repeated holdout. We tried using the holdout method with different random-number seeds each time; that's called repeated holdout. Cross-validation is a systematic way of doing repeated holdout that actually improves upon it by reducing the variance of the estimate.

We take a training set and we create a classifier. Then we want to evaluate the performance of that classifier, and there is a certain amount of variance in that evaluation, because it's all statistical underneath. We want to keep the variance in the estimate as low as possible. Cross-validation is a way of reducing the variance, and a variant on cross-validation called stratified cross-validation reduces it even further. I'm going to explain all of that in this lesson.

In a previous lesson, we held out 10% for testing and repeated that 10 times. That's the repeated holdout method: we've got one dataset, and we divided it independently 10 separate times into a training set and a test set.

With cross-validation, we divide the dataset just once, but we divide it into, say, 10 pieces. Then we take 9 of the pieces and use them for training, and the last piece we use for testing. Then, with the same division, we take another 9 pieces and use them for training and the held-out piece for testing. We do the whole thing 10 times, using a different segment for testing each time. In other words, we divide the dataset into 10 pieces, and then we hold out each of these pieces in turn for testing, train on the rest, do the testing, and average the 10 results. That's 10-fold cross-validation: divide the dataset into 10 parts (these are called "folds"), hold out each part in turn, and average the results.
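If it helps to see those mechanics written down, here is a minimal sketch of that loop against Weka's Java API. It assumes weka.jar is on the classpath; the class name, the ARFF file name, and the choice of J48 are just placeholders. Note that this plain loop doesn't stratify the folds; that refinement comes next.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TenFoldCV {
        public static void main(String[] args) throws Exception {
            // Placeholder file name: any ARFF with a nominal class will do.
            Instances data = DataSource.read("weather.nominal.arff");
            data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute

            int folds = 10;
            data.randomize(new Random(1));                  // shuffle once, then split into folds

            double sumPctCorrect = 0;
            for (int i = 0; i < folds; i++) {
                Instances train = data.trainCV(folds, i);   // 9 parts for training
                Instances test  = data.testCV(folds, i);    // the held-out part for testing

                Classifier cls = new J48();                 // any classifier would do here
                cls.buildClassifier(train);

                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(cls, test);
                sumPctCorrect += eval.pctCorrect();
            }
            // Each instance has now been tested exactly once; average the 10 results.
            System.out.println("Mean accuracy over " + folds + " folds: "
                    + (sumPctCorrect / folds) + " %");
        }
    }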
So each data point in the dataset is used once for testing and 9 times for training. That's 10-fold cross-validation.

Stratified cross-validation is a simple variant where, when we do the initial division into 10 parts, we ensure that each fold has approximately the correct proportion of each of the class values. Of course, there are many different ways of dividing a dataset into 10 equal parts; we just make sure we choose a division that has approximately the right representation of class values in each of the folds. That's stratified cross-validation. It helps reduce the variance in the estimate a little bit more.

Then, once we've done the cross-validation, what Weka does is run the algorithm an eleventh time on the whole dataset. That will then produce a classifier that we might deploy in practice. We use 10-fold cross-validation to get an evaluation result and an estimate of the error, and then finally we build the classifier one more time to get an actual classifier to use in practice. (There's a sketch of this two-step recipe at the end of this transcript.)

That's what I wanted to tell you. Cross-validation is better than repeated holdout, and we'll look at that in the next lesson. Stratified cross-validation is even better, and Weka does it by default. With 10-fold cross-validation, Weka invokes the learning algorithm 11 times: once for each fold of the cross-validation, and then a final time on the entire dataset.

The practical rule of thumb is that if you've got lots of data, you can use a percentage split and evaluate it just once. Otherwise, if you don't have too much data, you should use stratified 10-fold cross-validation.

How big is "lots"? Well, that's what everyone asks. How long is a piece of string? It's hard to say, but it depends on a few things. It depends on the number of classes in your dataset. If you've got a two-class dataset with, say, 100 to 1,000 data points, that would probably be good enough for a pretty reliable evaluation if you did a 90%/10% split into training and test sets. If you had, say, 10,000 data points in a two-class problem, then I think you'd have lots and lots of data; you wouldn't need to go to cross-validation. If, on the other hand, you had 100 different classes, that's different, right? You would need a larger dataset, because you want a fair representation of each class when you do the evaluation.
It's really hard to say exactly; it depends on the circumstances. If you've got thousands and thousands of data points, you might just do things once with a holdout (there's a percentage-split sketch at the end of this transcript, too). If you've got less than a thousand data points, even with a two-class problem, then you might as well do 10-fold cross-validation. It really doesn't take much longer. Well, it takes 10 times as long, but the times are generally pretty short.

You can read more about cross-validation in Section 5.3 of the course text. Now it's time for you to go and do the activity associated with this lesson. See you soon!
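If you want to reproduce the cross-validate-then-deploy recipe outside the Explorer, here is a minimal sketch against Weka's Java API; as before, the class name, the ARFF file name, and the choice of J48 are just placeholders. Evaluation.crossValidateModel stratifies the folds when the class attribute is nominal, which matches the Classify panel's default behaviour.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class StratifiedCVThenDeploy {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.nominal.arff");   // placeholder file
            data.setClassIndex(data.numAttributes() - 1);

            // Step 1: estimate performance with stratified 10-fold cross-validation.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString("\n10-fold cross-validation\n", false));

            // Step 2: the "eleventh" run, building on the entire dataset the
            // classifier we would actually deploy.
            J48 deployed = new J48();
            deployed.buildClassifier(data);
            System.out.println(deployed);
        }
    }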
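And for the "lots of data" case, here is the one-off percentage-split alternative from the rule of thumb, again only a rough sketch with a hypothetical file name; the 90%/10% proportions match the example above.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PercentageSplit {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("big-two-class-dataset.arff");   // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            // One-off 90%/10% holdout: shuffle, then cut the dataset once.
            data.randomize(new Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.9);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

            J48 cls = new J48();
            cls.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(cls, test);
            System.out.println("Accuracy on the 10% holdout: " + eval.pctCorrect() + " %");
        }
    }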