1
00:00:16,309 --> 00:00:22,070
Hi! This is Lesson 2.2 in Data Mining with Weka, and here we're going to look at training

2
00:00:22,070 --> 00:00:27,710
and testing in a little bit more detail.

3
00:00:27,710 --> 00:00:29,369
Here's a situation.

4
00:00:29,369 --> 00:00:34,519
We've got a machine learning algorithm, and we feed training data into it, and it produces

5
00:00:34,519 --> 00:00:38,379
a classifier -- a basic machine learning situation.

6
00:00:38,379 --> 00:00:43,620
We can test that classifier with some independent test data.

7
00:00:43,620 --> 00:00:49,820
We can put that into the classifier and get some evaluation results, and, separately,

8
00:00:49,820 --> 00:00:55,159
we can deploy the classifier in some real situation to make predictions on fresh data

9
00:00:55,159 --> 00:00:58,589
coming from the environment.

10
00:00:58,589 --> 00:01:03,530
It's really important in classification, when you're looking at your evaluation results, to remember that

11
00:01:03,530 --> 00:01:09,029
you only get reliable evaluation results if the test data is different from the training data.

12
00:01:10,080 --> 00:01:14,250
That's what we're going to look at in this lesson.

13
00:01:14,250 --> 00:01:19,090
What if you only have one dataset? If you just have one dataset, you should divide it

14
00:01:19,090 --> 00:01:20,689
into two parts.

15
00:01:20,689 --> 00:01:24,189
Maybe use some of it for training and some of it for testing.

16
00:01:24,189 --> 00:01:27,549
Perhaps 2/3 of it for training and 1/3 of it for testing.

17
00:01:27,549 --> 00:01:32,369
It's really important that the training data is different from the test data.

18
00:01:32,369 --> 00:01:38,759
Both training and test sets are produced by independent sampling from an infinite population.

19
00:01:38,759 --> 00:01:43,479
That's the basic scenario here, but they're different independent samples.

20
00:01:43,479 --> 00:01:44,950
It's not the same data.

21
00:01:44,950 --> 00:01:49,479
If it is the same data, then your evaluation results are misleading.

22
00:01:49,479 --> 00:01:56,479
They don't reflect what you should actually expect on new data when you deploy your classifier.

23
00:01:57,060 --> 00:02:02,600
Here we're going to look at the segment dataset, which we used in the last lesson.

24
00:02:02,600 --> 00:02:09,600
I'm going to open the segment-challenge dataset.

25
00:02:09,759 --> 00:02:12,640
I'm going to use a supplied test set.

26
00:02:12,640 --> 00:02:19,110
First of all, I'm going to use the J48 tree learner.

27
00:02:19,110 --> 00:02:21,530
I'm going to use a supplied test set,

28
00:02:21,530 --> 00:02:25,579
and I will set it to the appropriate segment-test file, segment-test.arff.

29
00:02:32,879 --> 00:02:38,579
I'm going to open that. Now we've got a test set, and let's see how it does.

30
00:02:38,879 --> 00:02:45,510
In the last lesson, on the same data with the user classifier, I think I got 79% accuracy.

31
00:02:45,510 --> 00:02:49,140
J48 does much better;

32
00:02:49,140 --> 00:02:55,989
it gets 96% accuracy on the same test set.

33
00:02:55,989 --> 00:03:00,670
Suppose I were to evaluate it on the training set. I can do that by just specifying, under

34
00:03:00,670 --> 00:03:03,049
Test options, "Use training set".

35
00:03:03,049 --> 00:03:08,069
Now it will train it again and evaluate it on the training set, which is not what you're

36
00:03:08,069 --> 00:03:12,319
supposed to do, because you get misleading results.

37
00:03:12,319 --> 00:03:17,739
Here, it's saying the accuracy is 99% on the training set.

38
00:03:17,739 --> 00:03:24,640
That is not representative of what we would get using this on independent data.

39
00:03:24,640 --> 00:03:30,540
If we had just one dataset and no separate test dataset, we could do a percentage split.

40
00:03:30,540 --> 00:03:31,900
Here's a percentage split.

41
00:03:31,900 --> 00:03:37,219
This is going to be 66% training data and 34% test data.

42
00:03:37,219 --> 00:03:40,200
That's going to make a random split of the dataset.

43
00:03:40,200 --> 00:03:47,019
If I run that, I get 95%.

44
00:03:47,019 --> 00:03:50,160
That's just about the same as what we got when we had an independent test set,

45
00:03:50,160 --> 00:03:52,009
just slightly worse.

46
00:03:54,109 --> 00:04:01,109
If I were to run it again with a different split, we'd expect a slightly different result,

47
00:04:01,819 --> 00:04:08,640
but actually, I get exactly the same results, 95.098%.

48
00:04:08,640 --> 00:04:14,719
That's because, before it does a run, Weka reinitializes the random number generator.

49
00:04:14,719 --> 00:04:18,220
The reason is to make sure that you can get repeatable results.

50
00:04:18,220 --> 00:04:22,120
If it didn't do that, then the results that you got would not be repeatable.

51
00:04:22,120 --> 00:04:27,940
However, if you wanted to have a look at the differences that you might get on different

52
00:04:27,940 --> 00:04:32,560
runs, then there is a way of resetting the random number seed between runs.

53
00:04:32,560 --> 00:04:37,880
We're going to look at that in the next lesson.

54
00:04:37,880 --> 00:04:38,630
That's this lesson.

55
00:04:38,630 --> 00:04:42,440
The basic assumption of machine learning is that the training and test sets are independently

56
00:04:42,440 --> 00:04:46,729
sampled from an infinite population, the same population.

57
00:04:46,729 --> 00:04:52,750
If you have just one dataset, you should hold part of it out for testing, maybe 34% as we

58
00:04:52,750 --> 00:04:56,009
just did or perhaps 10%.

59
00:04:56,009 --> 00:05:00,550
We would expect a slight variation in results each time if we hold out a different set,

60
00:05:00,550 --> 00:05:05,669
but Weka produces the same results each time by design, because it reinitializes

61
00:05:05,669 --> 00:05:09,449
the random number generator before each run.

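The repeatable percentage split described above can be illustrated with a minimal sketch. This is not Weka's own code: the function name `percentage_split`, the default seed value of 1, and the 100-item stand-in dataset are all assumptions made purely for illustration, and Python's standard `random` module stands in for whatever generator Weka uses. The idea it shows is that building a fresh generator from the same seed before every run yields the same shuffle, and therefore the same training and test sets.

```python
import random


def percentage_split(instances, train_percent=66, seed=1):
    """Randomly split instances into (train, test) lists.

    Hypothetical helper for illustration only; not Weka's implementation.
    A new generator is built from `seed` on every call, so repeated calls
    with the same seed produce the same split.
    """
    rng = random.Random(seed)      # reinitialized on every call
    shuffled = list(instances)     # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


data = list(range(100))            # stand-in for the dataset's instances

train_a, test_a = percentage_split(data)            # first run
train_b, test_b = percentage_split(data)            # second run, same seed
train_c, test_c = percentage_split(data, seed=2)    # different seed

print(len(train_a), len(test_a))   # sizes of the 66% / 34% parts
print(test_a == test_b)            # same seed: identical test set
print(test_a == test_c)            # different seed: a different test set
```

Changing the seed is what produces a different split, and so a slightly different accuracy figure, from one run to the next.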
62
00:05:09,449 --> 00:05:12,389
We ran J48 on the segment-challenge dataset.

63
00:05:12,389 --> 00:05:16,080
If you'd like, you can go and look at the course text on

64
00:05:16,080 --> 00:05:18,180
Training and testing, Section 5.1,

65
00:05:18,180 --> 00:05:21,380
and please go and do the activity associated with this lesson.

66
00:05:21,580 --> 00:05:23,180
Bye for now!