Hi! Well, it's summertime here in New Zealand. Summer's just arrived, and, as you can see, I'm sitting outside for a change of venue. This is Class 5 of the MOOC -- the last class! Here are a few comments on Class 4, some issues that came up.

We had a couple of errors in the activities; we corrected those pretty quickly. Some of the activities are getting harder -- you will have noticed that! But if you're doing the activities you'll be learning a lot, so keep it up. And the Class 5 activities are much easier.

There was a question about converting nominal variables to numeric in Activity 4.2. Someone said the result of the supervised nominal-to-binary filter was weird. Yes, well, it is a little bit weird. If you click the "More" button for that filter, it says that k-1 new binary attributes are generated in the manner described in this book (if you can get hold of it). Let me tell you a little bit more about this.

I've come up with an example of a nominal attribute called "fruit", and it has 3 values: orange, apple, and banana. In this dataset, the class is "juicy"; it's a numeric measure of juiciness. I don't know about where you live, but in New Zealand oranges are juicier than apples, and apples are juicier than bananas. I'm assuming that in this dataset, if you average the juiciness of all the instances where the fruit attribute equals orange, you get a larger value than if you do the same for all the instances where the fruit attribute equals apple, and that in turn is larger than for banana. That effectively orders these values.

Let's consider ways of making "fruit" into a set of binary attributes. The simplest method, and the one that's used by the unsupervised conversion filter, is Method 1. We create 3 new binary attributes; I've just called them "fruit=orange", "fruit=apple", and "fruit=banana". The first attribute is 1 if the fruit is an orange and 0 otherwise; the second, "fruit=apple", is 1 if it's an apple and 0 otherwise; and the same for banana. Of course, of these three binary attributes, exactly one has to be 1 for any instance.
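To make Method 1 concrete, here's a minimal sketch of that encoding in plain Java. It isn't Weka's code -- just the same one-attribute-per-value idea applied to the fruit example:

```java
// Minimal sketch of Method 1: one new 0/1 attribute per nominal value.
// The fruit values are just the lecture's example; not Weka's implementation.
public class Method1Demo {
    static final String[] VALUES = {"orange", "apple", "banana"};

    // Returns {fruit=orange, fruit=apple, fruit=banana} for one instance.
    static int[] encode(String fruit) {
        int[] bits = new int[VALUES.length];
        for (int i = 0; i < VALUES.length; i++) {
            bits[i] = fruit.equals(VALUES[i]) ? 1 : 0;   // exactly one of these is 1
        }
        return bits;
    }

    public static void main(String[] args) {
        for (String fruit : VALUES) {
            System.out.println(fruit + " -> " + java.util.Arrays.toString(encode(fruit)));
        }
    }
}
```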
Here's another way of doing it, Method 2. We take each possible subset: as well as "orange", "apple", and "banana", we have another binary attribute for "orange_or_apple", another for "orange_or_banana", and another for "apple_or_banana". For example, if the value of fruit was "orange", then the first attribute ("fruit=orange") would be 1, the fourth attribute ("orange_or_apple") would be 1, and the fifth attribute ("orange_or_banana") would be 1; all of the others would be 0. This effectively creates a binary attribute for each subset of possible values of the "fruit" attribute -- actually, we don't create one for the empty subset or for the full subset with all 3 values in it, so we get 2^k - 2 new attributes for a k-valued attribute. That's impractical in general, because 2^k grows very fast as k grows.

The third method is the one that is actually used, and it's the one described in that book. We create 2 new attributes (k-1 in general, for a k-valued attribute): "fruit=orange_or_apple" and "fruit=apple". For oranges, the first attribute is 1 and the second is 0; for apples, they're both 1; and for bananas, they're both 0. That's assuming this ordering of class values: orange is largest in average juiciness and banana is smallest. There's a theorem that, if you're building a decision tree, the best way of splitting a node on a nominal attribute with k values is at one of these k-1 positions -- well, you can read this. That theorem is what Method 3 reflects: it gives the best way of splitting these attribute values. Whether it's a good thing in practice or not, well, I don't know -- you should try it and see. Perhaps you can try Method 3 with the supervised conversion filter and Method 1 with the unsupervised conversion filter and see which produces the best results on your dataset; there's a small sketch of that comparison below. Weka doesn't implement Method 2, because the number of attributes explodes with the number of possible values, and you could end up with some very large datasets.
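If you'd like to run that comparison from code rather than from the Explorer, here's a rough sketch using Weka's two NominalToBinary filters. The file name "fruit.arff" is just a placeholder for a small dataset with the nominal "fruit" attribute and the numeric "juicy" class as the last attribute:

```java
// Sketch only: compare Weka's unsupervised and supervised NominalToBinary filters.
// "fruit.arff" is a hypothetical file with a nominal "fruit" attribute and a
// numeric "juicy" class as the last attribute.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;

public class NominalToBinaryDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("fruit.arff");
        data.setClassIndex(data.numAttributes() - 1);   // numeric class "juicy"

        // Method 1: one new attribute per value, ignores the class.
        Filter unsup = new weka.filters.unsupervised.attribute.NominalToBinary();
        unsup.setInputFormat(data);
        printAttributes(Filter.useFilter(data, unsup), "unsupervised");

        // Method 3: k-1 new attributes, ordered by average class value.
        Filter sup = new weka.filters.supervised.attribute.NominalToBinary();
        sup.setInputFormat(data);
        printAttributes(Filter.useFilter(data, sup), "supervised");
    }

    static void printAttributes(Instances inst, String label) {
        System.out.println(label + ":");
        for (int i = 0; i < inst.numAttributes(); i++) {
            System.out.println("  " + inst.attribute(i).name());
        }
    }
}
```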
The next question is about simulating multiresponse linear regression: "Please explain!" Well, we're looking at a Weka screen like this. We're running linear regression on the iris dataset, where we've mapped the class values so that the class is 1 for any Virginica instance and 0 for the others. We've done that with this kind of configuration -- the default configuration of the MakeIndicator filter. It's working on the last attribute, which is the class. In this case the value index is "last", which means we're looking at the last class value, which is in fact Virginica; we could put a number here to get the first, second, or third value instead. That's how we get the dataset, and then we run linear regression on it to get a linear model.

Now, I want to look at the output for the first 4 instances. We've got an actual class of 1, 1, 0, 0 and the corresponding predicted values. I've written those down in this little table over here: 1, 1, 0, 0 and these numbers. That's for the dataset where all of the Virginicas are mapped to 1 and the other irises are mapped to 0. When we do the corresponding mapping for Versicolor, we get this as the actual class -- we just run Weka and look at what appears on the screen -- and this is the predicted value. And we get these for Setosa.

So, you can see that the first instance is actually a Virginica -- the actual values are 1, 0, 0 across the three models. I've put the largest of the 3 predictions in bold. The largest is 0.966, which is bigger than 0.117 and -0.065, so multiresponse linear regression is going to predict Virginica for instance 1, because that model gives the largest value -- and that's correct. For the second instance, it's also a Virginica, and the Virginica model again gives the largest of the 3 values in its row. For the third instance, it's actually a Versicolor: the actual value is 1 for the Versicolor model, but the largest prediction is still the one from the Virginica model, so it's going to predict Virginica for an iris that's actually a Versicolor -- that's a mistake. In the fourth case, it's actually a Setosa -- the actual column is 1 for Setosa -- and the Setosa model gives the largest value in the row, so it's going to correctly predict Setosa. That's how multiresponse linear regression works.
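If it helps, here's a tiny sketch of that final "take the largest output" step. The numbers are the three model outputs quoted above for instance 1; which of 0.117 and -0.065 belongs to the Versicolor model and which to the Setosa model is my assumption here, and it doesn't affect the prediction:

```java
// Sketch of the "pick the class with the largest regression output" step
// of multiresponse linear regression, using the numbers for instance 1.
// The three one-vs-rest models themselves come from the three runs above.
public class MultiResponseDemo {
    public static void main(String[] args) {
        String[] classes = {"Virginica", "Versicolor", "Setosa"};
        // 0.966 is the Virginica model's output; the other two are assumed
        // to be the Versicolor and Setosa outputs, in that order.
        double[] outputs = {0.966, 0.117, -0.065};

        int best = 0;
        for (int i = 1; i < outputs.length; i++) {
            if (outputs[i] > outputs[best]) best = i;
        }
        System.out.println("Predicted class: " + classes[best]);   // Virginica
    }
}
```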
"How does OneR use the rules it generates? Please explain!" Well, here's the rule generated by OneR. It hinges on attribute 6. Of course, if you click the "Edit" button in the Preprocess panel, you can see the value of this attribute for each instance. And this is what we see in the Explorer when we run OneR: you can see the predictions here -- g, b, g, b, g, g, and so on. The question is, how does it get these predictions?

This is the value of attribute 6 for instance 1. What the OneR code does is go through each of these conditions in turn and check whether it's satisfied. Is 0.02 less than -0.2? No, it's not. Is it less than -0.01? No. Is it less than 0.001? No. (It's surprisingly hard to get these right, especially with all of the other decimal places in the list here.) Is it less than 0.1? Yes, it is. So rule 4 fires -- this is rule 4 -- and it predicts "g". I've written down here the number of the rule clause that fires for each instance. For instance 2, the value of the attribute is -0.4, and that satisfies the first rule, so clause 1 fires and we predict "b". And so on down the list. That's what OneR does: it goes through the rule evaluating each of these clauses until it finds one that is true, and then it uses the corresponding prediction as its output.
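As a sketch, that evaluation loop looks something like this. The thresholds are the ones read out above; the predictions for the two middle clauses and the fall-through weren't read out, so they're just placeholders for illustration:

```java
// Sketch of how OneR applies its rule: test each clause in order and
// return the prediction of the first one that is satisfied.
// Thresholds follow the values read out above; the "?" predictions and
// the fall-through were not given, so they are placeholders.
public class OneRDemo {
    static final double[] THRESHOLDS = {-0.2, -0.01, 0.001, 0.1};
    static final String[] PREDICTIONS = {"b", "?", "?", "g"};
    static final String FALL_THROUGH = "g";   // assumed default prediction

    static String classify(double attr6) {
        for (int i = 0; i < THRESHOLDS.length; i++) {
            if (attr6 < THRESHOLDS[i]) {
                return PREDICTIONS[i];   // clause i+1 fires
            }
        }
        return FALL_THROUGH;             // no "<" clause matched
    }

    public static void main(String[] args) {
        System.out.println(classify(0.02));   // clause 4 fires -> g
        System.out.println(classify(-0.4));   // clause 1 fires -> b
    }
}
```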
Moving on to ensemble learning. There were some questions about the ten OneR models. "Are these ten alternative ways of classifying the data?" Well, in a sense, but they are used together: AdaBoostM1 combines them. In practice you don't just pick one of them and use that; AdaBoostM1 combines these models inside itself, and the predictions it prints are produced by the combined model. The weights are used in the combination to decide how much influence to give each model. And when Weka reports a certain accuracy, that's for the combined model -- it's not the average, it's not the best, it's the models combined in the way that AdaBoostM1 combines them. That's all done internally in the algorithm. I didn't really explain the details of how the algorithm works; you'll have to look that up, I guess. The point is that AdaBoostM1 combines these models for you -- you don't have to think of them as separate models.

Someone complained that we're supposed to be looking for simplicity, and this seems pretty complicated. That's true. The real disadvantage of ensemble models is that it's hard to look at the rules -- it's hard to see inside and understand what they're doing -- so perhaps you should be a bit wary of that. But they can produce very good results, and you know how to test machine learning methods reliably using cross-validation or whatever, so sometimes they're good to use.

"How does Weka make predictions? How can you use Weka to make predictions?" You can use the "Supplied test set" option on the Classify panel to put in a test set and see the predictions on that. Alternatively, if you can run Java programs, you can do it from the command line: you run "java weka.classifiers.trees.J48" with your ARFF data file, in which you put question marks for the class values, and you give it the model you've saved from the Explorer. You can look up how to do this on the Weka wiki, in the FAQ entry "Using Weka to make predictions".

Can you bootstrap learning? Someone talked about some friends of his who were using training data to train a classifier, using the results of the classification to create further training data, and continuing the cycle -- a kind of bootstrapping. That sounds very attractive, but it can also be unstable. It might work, but I think you'd be pretty lucky for it to work well. It's a potentially rather unreliable way of doing things -- believing the classifications on new data and using them to further train the classifier. He also said these friends of his don't really look into the classification algorithm. I'm trying to tell you a little bit about how each classification algorithm works, because I think it really does help to know that. You should be looking inside and thinking about what's going on in your data mining method.

A couple of suggestions of things not covered in this MOOC: the FilteredClassifier, and association rules with the Apriori association rule learner. As I said before, maybe we'll produce a follow-up MOOC and include topics like these.

That's it for now. Class 5 is the last class. It's a short class, so go ahead and do it. Please complete the assessments and finish off the course. It'll be open this week, and it'll remain open for one further week if you're getting behind; after that, it'll be closed, so you need to get on with it. We'll talk to you later. Bye!