Hi! Welcome back for another few minutes in New Zealand.

In the last lesson, Lesson 5.1, we learned that Weka only helps you with a small part of the overall data mining process, the technical part, which is perhaps the easy part. In this lesson, we're going to learn that there are many pitfalls and pratfalls even in that part. Let me just define these for you. A "pitfall" is a hidden or unsuspected danger or difficulty, and there are plenty of those in the field of machine learning. A "pratfall" is a stupid and humiliating action, which is very easy to commit when you're working with data.

The first lesson is that you should be skeptical. In data mining, it's very easy to cheat, and whether you're cheating consciously or unconsciously, it's easy to mislead yourself or others about the significance of your results. For a reliable test, you should use a completely fresh sample of data that has never been seen before. Save something for the very end: data you don't touch until you've selected your algorithm, decided how you're going to apply it, chosen your filters, and so on. At the very, very end, having done all that, run it on the fresh data to get an estimate of how it will perform. Don't be tempted to then change things to get better results on that data; always do your final run on fresh data. We've talked a lot about overfitting, and this is basically the same kind of problem. Of course, you know not to test on the training set -- we've talked about that endlessly throughout this course. But data that's been used for development in any way is tainted: any time you use some data to help you choose the filter, or the classifier, or how you're going to treat your problem, that data is tainted. You should use completely fresh data to get evaluation results, so leave some evaluation data aside for the very end of the process. That's the first piece of advice.
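Here's what that discipline might look like outside the Explorer. This is only a minimal sketch using Weka's Java API; the file name, the random seed, and the 80/20 split are placeholder choices for illustration, not something from this lesson.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class FinalHoldout {
        public static void main(String[] args) throws Exception {
            // Load a dataset ("mydata.arff" is a placeholder name).
            Instances data = new DataSource("mydata.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Shuffle, then lock away about 20% as the final test set.
            data.randomize(new Random(42));
            int testSize = data.numInstances() / 5;
            int devSize = data.numInstances() - testSize;
            Instances dev = new Instances(data, 0, devSize);
            Instances test = new Instances(data, devSize, testSize);

            // All experimentation -- filters, classifiers, parameters --
            // happens on 'dev' only. Suppose J48 is what we settled on.
            J48 chosen = new J48();
            chosen.buildClassifier(dev);

            // One single run on the untouched data, at the very end.
            Evaluation eval = new Evaluation(dev);
            eval.evaluateModel(chosen, test);
            System.out.println(eval.pctCorrect() + "% on fresh data");
        }
    }

The point is that nothing about the model is ever adjusted to please the instances in "test".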
Another thing I haven't told you about in this course so far is missing values. In real datasets, it's very common for some of the data values to be missing: they haven't been recorded. They might be unknown, we might have forgotten to record them, or they might be irrelevant.

There are two basic strategies for dealing with missing values in a dataset. You can omit instances where the attribute value is missing, or somehow find a way of omitting that particular attribute value in that instance. Or you can treat "missing" as a separate possible value.

You need to ask yourself: is there significance in the fact that a value is missing? They say that if you've got something wrong with you and go to the doctor, and he does some tests on you, then if you record just the tests he does -- not the results of the tests, just the ones he chooses to do -- there's a very good chance you can work out what's wrong with you from the existence of the tests alone, not from their results. That's because the doctor chooses tests intelligently. The fact that he doesn't choose a certain test doesn't mean the value is accidentally absent; there's huge significance in the fact that he's chosen not to do certain tests. This is a situation where "missing" should be treated as a separate possible value, because there's significance in the fact that a value is missing. But in other situations, a value might be missing simply because a piece of equipment malfunctioned, or because someone forgot to record it. Then there's no significance in the fact that it's missing.

Pretty well all machine learning algorithms deal with missing values. In an ARFF file, if you put a question mark as a data value, it's treated as a missing value. All methods in Weka can deal with missing values, but they make different assumptions about them, and if you don't appreciate this, it's easy to be misled. Let me take two simple and (to us) well-known examples, OneR and J48. They deal with missing values in different ways.

I'm going to load the nominal weather data and run OneR on it: I get 43%. Then I run J48 on it and get 50%. Now I'm going to edit this dataset, changing the value of "outlook" for the first four "no" instances to "missing" -- that's how we do it here in the editor. If we were to write this file out in ARFF format, we'd find these values written into the file as question marks. Now, if we look at "outlook", you can see it says there are 4 missing values. If you count up these labels -- 2, 4, and 4 -- that's 10 labels, plus another 4 that are missing, to make the 14 instances.
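If you were to open the edited file, this is roughly what it would look like -- an abbreviated sketch of weather.nominal.arff with question marks standing in for the missing outlook values (only the first few data rows are shown):

    @relation weather.symbolic

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature {hot, mild, cool}
    @attribute humidity {high, normal}
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    ?,hot,high,FALSE,no
    ?,hot,high,TRUE,no
    overcast,hot,high,FALSE,yes
    rainy,mild,high,FALSE,yes
    ...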
Let's go back to J48 and run it again. We still get 50%, the same result. Of course, this is a tiny dataset, but the fact is that J48's results are not affected by a few of the values being missing. However, if we run OneR, I get a much higher accuracy: 93%. The rule I've got is "branch on outlook", which is what we had before, I think. But here it says there are 4 possibilities: if it's sunny, it's a yes; if it's overcast, it's a yes; if it's rainy, it's a yes; and if it's missing, it's a no. OneR is using the fact that a value is missing as significant, as something you can branch on. Whereas if you were to look at a J48 tree, it would never have a branch that corresponded to a missing value. The two methods treat missing values differently, and that's very important to know and remember.
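In case you want to reproduce this comparison outside the Explorer, here's a rough sketch using Weka's Java API. It assumes weather.nominal.arff is in the working directory and uses 10-fold cross-validation, which is what the Explorer does by default; treat the details as illustrative rather than canonical.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MissingValuesDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("weather.nominal.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Set "outlook" (attribute 0) to missing for the first four "no" instances.
            double no = data.classAttribute().indexOfValue("no");
            int changed = 0;
            for (int i = 0; i < data.numInstances() && changed < 4; i++) {
                if (data.instance(i).classValue() == no) {
                    data.instance(i).setMissing(0);
                    changed++;
                }
            }

            // Cross-validate OneR and J48 on the edited data.
            for (Classifier c : new Classifier[] { new OneR(), new J48() }) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(c, data, 10, new Random(1));
                System.out.println(c.getClass().getSimpleName()
                        + ": " + eval.pctCorrect() + "%");
            }
        }
    }

OneR can treat the question marks as a value to branch on; J48 never branches on a missing value, which is why the two react so differently to the same edit.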
The final thing I want to tell you about in this lesson is the "no free lunch" theorem. There's no free lunch in data mining. Here's a way to illustrate it. Suppose you've got a two-class problem with 100 binary attributes, and a huge training set with a million instances and their classifications. The number of possible instances is 2^100, because there are 100 binary attributes, and you know the classes of 10^6 of them. So you don't know the classes of 2^100 - 10^6 examples. Since 2^100 is about 1.3 x 10^30, that 2^100 - 10^6 is 99.999...% of 2^100 -- a percentage with more than twenty 9s. There's this huge number of examples that you just don't know the classes of. How could you possibly figure them out? If you apply a data mining scheme to this, it will happily classify them all, but how could it possibly get them right from the tiny amount of data it's been given? In order to generalize, every learner must embody some knowledge or assumptions beyond the data it's given. Each learning algorithm implicitly embodies a set of assumptions. The best way to think about those assumptions is to think back to the Boundary Visualizer we looked at in Lesson 4.1. You saw that different machine learning schemes are capable of drawing different kinds of boundaries in instance space, and those boundaries correspond to a set of assumptions about the sorts of decisions the scheme can make. There's no universally best algorithm; there's no free lunch. There's no single best algorithm.

Data mining is an experimental science, and that's why we've been teaching you how to experiment with data mining yourself.

Here's a summary. Be skeptical: when people tell you about data mining results and quote a certain accuracy, to be sure about it you want them to test their classifier on new, fresh data that they've never seen before. Overfitting has many faces. Different learning schemes make different assumptions about missing values, which can really change the results. There is no universal best learning algorithm. Data mining is an experimental science, and it's very easy to be misled by people quoting the results of data mining experiments.

That's it for now. Off you go and do the activity. We'll see you in the next lesson. Bye for now!