Hello again! In the last lesson, we looked at training and testing. We saw that we can evaluate a classifier on an independent test set, or using a percentage split, with a certain percentage of the dataset used to train and the rest used for testing, or -- and this is generally a very bad idea -- we can evaluate it on the training set itself, which gives misleadingly optimistic performance figures.

In this lesson, we're going to look a little bit more at training and testing. In fact, what we're going to do is repeatedly train and test using a percentage split.

Now, in the last lesson, we saw that if you simply repeat the training and testing, you get the same result each time, because Weka initializes the random number generator before each run to make sure that you know what's going on when you do the same experiment again tomorrow. But there is a way of overriding that. So we will be using independent random numbers on different occasions to produce a percentage split of the dataset into a training set and a test set.

I'm going to open the segment-challenge data again. That's what we used before. Notice there are 1500 instances here; that's quite a lot. I'm going to go to Classify, and I'm going to choose J48, our standard method, I guess.
I'm going to use a percentage split, and because we've got 1500 instances, I'm going to choose 90% for training and just 10% for testing. I reckon that 10% -- that's 150 instances -- for testing is going to give us a reasonable estimate, and we might as well train on as many instances as we can to get the most accurate classifier.

I'm going to run this, and the accuracy figure I get -- this is what I got in the last lesson -- is 96.6667%. Now, this is a misleadingly high accuracy figure. I'm going to call it 96.7%, or 0.967. Then I'm going to do it again and see how much variation we get in that figure, initializing the random number generator differently each time.

If I go to the More options menu, I get a number of options which are quite useful: outputting the model -- we're doing that; outputting statistics; outputting different evaluation measures; the confusion matrix -- we're doing that; storing the predictions for visualization; outputting the predictions if we want; doing a cost-sensitive evaluation; and setting the random seed for cross-validation or percentage split. That's set by default to 1. I'm going to change it to 2, a different random seed.
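To see why the seed matters, here is a minimal Python sketch -- not Weka's actual implementation -- of how a seeded shuffle produces a 90%/10% percentage split. The same seed always reproduces the same split; a different seed gives a different one.

```python
import random

def percentage_split(n_instances, train_fraction=0.9, seed=1):
    """Shuffle instance indices with a seeded RNG, then cut them into
    a training set and a test set. A sketch of what a percentage
    split does, not Weka's actual code."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)   # seeded, so reproducible
    cut = int(n_instances * train_fraction)
    return indices[:cut], indices[cut:]

# 1500 instances, 90% train / 10% test, as in the lesson
train1, test1 = percentage_split(1500, seed=1)
train2, test2 = percentage_split(1500, seed=2)

print(len(train1), len(test1))   # 1350 150
# Same seed reproduces the identical split:
print(test1 == percentage_split(1500, seed=1)[1])   # True
# A different seed gives (almost certainly) a different split:
print(test1 == test2)
```

Weka's default seed of 1 plays the role of `seed=1` here: repeat the run without changing it and you test on exactly the same 150 instances every time.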
We could also output the source code for the classifier if we wanted, but I just want to change the random seed. Then I want to run it again. Before we got 0.967, and this time we get 0.94 -- 94%. Quite different, you see. If I then change the seed to, say, 3, and run it again, I again get 94%. If I change it to 4 and run it again, I get 96.7%. Let's do one more: change it to 5, run it again, and now I get 95.3%.

Here's a table with these figures in it. If we run it 10 times, we get this set of results. Given this set of experimental results, we can calculate the mean and standard deviation. The sample mean is the sum of all of these error figures -- or these success rates, I should say -- divided by the number of them, 10. That's 0.949, about 95%. That's really what we would expect to get, and it's a better estimate than the 96.7% we started out with -- a more reliable estimate.

We can also calculate the sample variance. We take the deviation from the mean -- we subtract the mean from each of these numbers and square it -- add those up, and divide, not by n, but by n - 1. That might surprise you, perhaps.
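The calculation just described -- sample mean, then squared deviations divided by n - 1 -- can be sketched in Python. The five accuracies below are the ones read off during the runs above (seeds 1 to 5); the lesson's full table has 10 runs, giving mean 0.949 and standard deviation 1.8%, so these five only illustrate the arithmetic.

```python
import math

# Accuracies from the five runs shown above (seeds 1-5); the
# lesson's full table of 10 runs gives mean 0.949, std 1.8%.
accuracies = [0.967, 0.940, 0.940, 0.967, 0.953]

n = len(accuracies)
mean = sum(accuracies) / n

# Sample variance: squared deviations from the mean, divided by
# n - 1 (not n), because the mean was estimated from this sample.
variance = sum((x - mean) ** 2 for x in accuracies) / (n - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean:.4f}")    # mean = 0.9534
print(f"std  = {std_dev:.4f}")  # std  = 0.0135
```

The standard library's `statistics.stdev` computes exactly this n - 1 ("sample") version, while `statistics.pstdev` is the divide-by-n ("population") version.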
The reason for it being n - 1 is that we've actually calculated the mean from this very sample. When the mean is calculated from the sample, you need to divide by n - 1, which leads to a slightly larger variance estimate than if you were to divide by n. We take the square root of that, and in this case we get a standard deviation of 1.8%.

Now you can see that the real performance of J48 on the segment-challenge dataset is approximately 95% accuracy, plus or minus approximately 2% -- anywhere, let's say, between 93% and 97% accuracy. The figures that Weka puts out for you are misleading, and you need to be careful how you interpret them, because the result is certainly not 95.333%. There's a lot of variation in these figures.

Remember, the basic assumption is that the training and test sets are sampled independently from an infinite population, so you should expect a slight variation in results -- perhaps more than just a slight variation. You can estimate that variation by setting the random-number seed and repeating the experiment, and you can calculate the mean and standard deviation experimentally, which is what we just did.

Off you go now, and do the activity associated with this lesson. I'll see you in the next lesson. Bye!