Hi! We've just finished Class 3, and here are some of the issues that arose. I have a list of them here, so let's start at the top.

Numeric precision in activities has caused a little bit of unnecessary angst, so we've simplified our policy. In general, we're asking you to round your percentages to the nearest integer. We certainly don't want you typing in four decimal places, because that accuracy is misleading.

Some people are getting the wrong results in Weka. One reason you might get the wrong results is that the random seed is not set to the default value. Whenever you change the random seed, it stays there until you change it back or until you restart Weka. Just restart Weka or reset the random seed to 1. Another thing you should do is check your version of Weka. We asked you to download 3.6.10. There have been some bug fixes since the previous version, so you really do need to use this new version.

One of the activities asked you to copy an attribute, and some people found surprising things, with Weka claiming 100% accuracy. If you accidentally ask Weka to predict something that's already there as an attribute, it will do very well, with very high accuracy! It's very easy to mislead yourself when you're doing data mining. You just need to make sure you know what you're trying to predict, you know what the attributes are, and you haven't accidentally included a copy of the class attribute as one of the attributes being used for prediction.

There's been some discussion on the mailing list about whether OneR is really always better than ZeroR on the training set. In fact, it is. Someone proved it. (Thank you Jurek for sharing that proof with us.) Someone else found a counterexample! "If we had a dataset with 10 instances, 6 belonging to Class A and 4 belonging to Class B, with attribute values selected randomly, wouldn't ZeroR outperform OneR? -- OneR would be fooled by the randomness of attribute values."

It's kind of anthropomorphic to talk about OneR being "fooled by" things. It's not fooled by anything. It's not a person; it's not a being: it's just an algorithm. It just gets an input and does its thing with the data. If you think that OneR might be fooled, then why don't you try it? Set up this dataset with 10 instances, 6 in A and 4 in B, select the attribute values randomly, and see what happens.
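If you want to try it outside Weka, here's a minimal sketch in Python -- not Weka's own implementation -- of ZeroR and OneR evaluated on the training set of exactly this kind of random dataset. The two three-valued attributes and their value names are just assumptions for illustration. However the random values come out, OneR's training-set accuracy can't drop below ZeroR's, because within each attribute value OneR predicts that value's majority class.

```python
# Minimal sketch of ZeroR and OneR (not Weka's implementation), evaluated on
# the training set of a random 10-instance dataset: 6 of class A, 4 of class B.
import random
from collections import Counter

random.seed(1)

classes = ["A"] * 6 + ["B"] * 4
# Two nominal attributes with values chosen at random, as in the counterexample.
data = [([random.choice(["x", "y", "z"]) for _ in range(2)], c) for c in classes]

def zero_r_accuracy(data):
    # ZeroR always predicts the most frequent class overall.
    majority_count = Counter(c for _, c in data).most_common(1)[0][1]
    return majority_count / len(data)

def one_r_accuracy(data):
    # OneR picks the single attribute whose one-level rule set makes the
    # fewest errors on the training data.
    best_correct = 0
    for a in range(len(data[0][0])):
        correct = 0
        for value in set(attrs[a] for attrs, _ in data):
            # For each attribute value, predict that value's majority class.
            counts = Counter(c for attrs, c in data if attrs[a] == value)
            correct += counts.most_common(1)[0][1]
        best_correct = max(best_correct, correct)
    return best_correct / len(data)

print("ZeroR training accuracy:", zero_r_accuracy(data))
print("OneR  training accuracy:", one_r_accuracy(data))
```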
I think you'll be able to convince yourself quite easily that this counterexample isn't a counterexample at all. It is definitely true that OneR is always better than ZeroR on the training set. That doesn't necessarily mean it's going to be better on an independent test set, of course.

The next thing is Activity 3.3, which asks you to repeat attributes with NaiveBayes. Some people asked "why are we doing this?" It's just an exercise! We're just trying to understand NaiveBayes a bit better, and what happens when you get highly correlated attributes, like repeated attributes. With NaiveBayes, enough repetitions mean that the other attributes won't matter at all. This is because all attributes contribute equally to the decision, so multiple copies of an attribute skew it in that direction. This is not true of other learning algorithms: it's true for NaiveBayes, but not for OneR or J48, for example. Copying attributes doesn't affect OneR at all. The copying exercise is just to illustrate what happens with NaiveBayes when you have non-independent attributes. It's not something you'd do in real life -- although you might copy an attribute in order to transform it in some way, for example.

Someone asked about the mathematics. In Bayes' formula, if an attribute is repeated k times, you get Pr[E1|H]^k in the numerator. How does this work mathematically? First of all, I'd just like to say that the Bayes formulation assumes independent attributes; the expansion is not valid if the attributes are dependent. But the algorithm works off it anyway, so let's see what would happen.

If you can stomach a bit of mathematics, here's the equation for the probability of the hypothesis given the evidence, Pr[H|E]. H might be "play = yes" or "play = no", for example, in the weather data:

Pr[H|E] = Pr[E1|H]^k × ... × Pr[H] / Pr[E]

where E1 is the attribute repeated k times and "..." stands for the factors contributed by all the other attributes. What the algorithm does, because we don't know Pr[E], is calculate Pr[yes|E] and Pr[no|E] from this formula and normalize the two so that they add up to 1. That gives Pr[yes|E] as Pr[E1|yes]^k times the other factors, divided by that same quantity plus the corresponding quantity for "no".

If you look at this formula and just forget about the "...", these probabilities are less than 1, so when we raise them to the kth power they get very small as k gets bigger -- in fact, they approach 0. But one of them approaches 0 faster than the other. Whichever one is bigger -- say the "yes" one is bigger than the "no" one -- is going to dominate. The normalized probability is then 1 if the "yes" probability is bigger than the "no" probability, and 0 otherwise. That's what actually happens in this formula as k approaches infinity: the result is as though there were only one attribute, E1. That's a mathematical explanation of what happens when you copy attributes in NaiveBayes. Don't worry if you didn't follow that; it was just for someone who asked.
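If numbers are easier to follow than algebra, here's a tiny sketch of that normalization. The likelihood values in it are made up purely for illustration -- they're not from the weather data -- but they show how the repeated attribute's factor takes over as k grows.

```python
# Made-up likelihoods (not the weather data), just to show the normalization:
# compare p_yes^k * rest_yes against p_no^k * rest_no as k grows.
p_yes, rest_yes = 0.6, 0.2   # Pr[E1|yes] and the product of the remaining factors
p_no,  rest_no  = 0.4, 0.3   # Pr[E1|no]  and the product of the remaining factors

for k in [1, 2, 5, 10, 20, 50]:
    num_yes = (p_yes ** k) * rest_yes
    num_no  = (p_no ** k) * rest_no
    pr_yes = num_yes / (num_yes + num_no)   # normalize so the two add up to 1
    print(f"k = {k:2d}   Pr[yes|E] = {pr_yes:.6f}")

# Because Pr[E1|yes] > Pr[E1|no] here, Pr[yes|E] approaches 1 as k grows --
# as though E1 were the only attribute.
```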
Decision trees and bits. Someone said on the mailing list that in the lecture there was a condition that resulted in branches with all "yes" or all "no" results, completely determining things. Why was the information gain only 0.971 bits and not the full 1 bit? This is the picture they were talking about: here, humidity determines the outcome -- these are all "no" and these are all "yes", for high and normal humidity respectively. When you calculate the information gain -- and this is the formula for information gain -- you get 0.971 bits. You might expect 1 (and I would agree), and you would get 1 if you had 3 no's and 3 yes's here, or if you had 2 no's and 2 yes's. But because there is a slight imbalance between the number of no's and the number of yes's, you don't actually get 1 bit under these circumstances.
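Here's that calculation worked through as a quick sketch: the branch on the slide has 3 "no"s and 2 "yes"s, and humidity separates them perfectly, so the gain is just the entropy of the 3/2 split.

```python
# Information gain for a split that separates 3 "no"s and 2 "yes"s perfectly.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

before = entropy([3, 2])                                        # parent node: 3 no, 2 yes
after = (3 / 5) * entropy([3, 0]) + (2 / 5) * entropy([0, 2])   # both branches are pure
print(f"information gain = {before - after:.3f} bits")          # 0.971, not 1
print(f"entropy of a 3/3 split = {entropy([3, 3]):.3f}")        # a balanced node gives exactly 1 bit
```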
There were some questions on Class 2 about stratified cross-validation, which tries to get the same proportion of class values in each fold. Some people suggested that maybe you should choose the number of folds so that it can do this exactly, instead of approximately: if you chose as the number of folds an exact divisor of the number of elements in each class, it could be done exactly. "Would that be a good thing to do?" was the question. The answer is no, not really. These things are all estimates, and that would be treating them as though they were exact answers. They are all just estimates. There are more important considerations to take into account when determining the number of folds for your cross-validation: you want a large enough test set to get an accurate estimate of the classification performance, and you want a large enough training set to train the classifier adequately. Don't worry about stratification being approximate. The whole thing is pretty approximate, actually.

Someone else asked why there is a "Use training set" option on the Classify tab. It's very misleading to take the evaluation you get on the training data seriously, as we know. So why is it there in Weka? Well, we might want it for some purposes. For example, it gives you a quick upper bound on an algorithm's performance: it couldn't possibly do better than it does on the training set. That might be useful, allowing you to quickly reject a learning algorithm. The important thing here is to understand what is wrong with using the training set for a performance estimate, and what overfitting is. Rather than changing the interface so you can't do bad things, I would rather protect you by educating you about what the issues are.

There have been quite a few suggested topics for a follow-up course: attribute selection, clustering, the Experimenter, parameter optimization, the KnowledgeFlow interface, and the simple command line interface. We're considering a follow-up course, and we'll be asking you for feedback on that at the end of this course.

Finally, someone said "Please let me know if there is a way to make a small donation" -- he's enjoying the course so much! Well, thank you very much. We'll make sure there is a way to make a small donation at the end of the course.

That's it for now. On with Class 4. I hope you enjoy Class 4, and we'll talk again later. Bye for now!