Hi! Well, Class 2 has gone flying by, and here are some things I'd like to discuss.

First of all, we made some mistakes in the answers to the activities. Sorry about that. We've corrected them.

Secondly -- a general point -- some people have been asking questions, for example, about huge datasets. How big a dataset can Weka deal with? The answer is: pretty big, actually. But it depends on what you do, and it's a fairly complicated question to discuss. If that's not big enough, there are ways of improving things. Anyway, issues like that should be discussed on the Weka mailing list, or you should look in the Weka FAQ, where there's quite a lot of discussion of this particular issue.

The Weka API: the programming interface to Weka. You can incorporate the Weka routines in your own programs. It's wonderful stuff, but it's not covered in this MOOC, so the right place to discuss those issues is also the Weka mailing list.

Finally, personal emails to me. You know, there are 5,000 people on this MOOC, and I can't cope with personal emails, so please send them to the mailing list and not to me personally.

I'd like to discuss the issue of numeric precision in Weka. Weka prints percentages -- indeed, most numbers -- to 4 decimal places. That's misleadingly high precision, so don't take these figures at face value. For example, here we've done an experiment using a 40% percentage split, and 92.3333% accuracy is printed out. Well, that's the exact right answer to the wrong question. We're not interested in the performance on this particular test set; what we're interested in is how Weka will do in general on data from this source, and we certainly can't infer that percentage to 4 decimal places. In Class 2, we're trying to sensitize you to the fact that these figures aren't to be taken at face value. For example, with a 40% split we get 92.3333%; if we do a 30% split we get 92.381%. The difference between these two numbers is completely insignificant. You shouldn't be saying one is better than the other: they are both the same, really, within the amount of statistical fuzz involved in the experiment.
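For anyone curious about the API mentioned above, here's a minimal sketch of reproducing a percentage-split evaluation in Java. It assumes weka.jar is on the classpath and mimics what the Explorer does (shuffle, then split); the ARFF file name is just a placeholder for whatever dataset you're using.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PercentageSplitDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder path: substitute your own ARFF file.
        Instances data = DataSource.read("segment-challenge.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Mimic the Explorer's percentage split: shuffle, then take
        // the first 40% for training and the remaining 60% for testing.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.40);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize,
                data.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);

        // Weka reports four decimal places; don't read significance into them.
        System.out.printf("Accuracy: %.4f%% (report it as %.0f%%)%n",
                eval.pctCorrect(), eval.pctCorrect());
    }
}
```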
We're trying to train you to write your answers to the nearest percentage point, or perhaps 1 decimal place. Those are the answers that are being accepted as correct. The reason we're doing that is to try to train you to think about these numbers and what they really represent, rather than just copy/pasting whatever Weka prints out. These numbers need to be interpreted. For example, in Activity 2.6, Question 2, the 4-decimal-place answer would be 0.7354, and 0.7 and 0.74 are the only accepted answers. In Question 5, the 4-decimal-place answer is 1.7256%, and we would accept 1.73%, 1.7%, and 2%. We're a bit selective in what we'll accept here.

I want to move on to the user classifier now. Some people got confusing results, because they created splits that involved the class attribute. When you're dealing with the test set, you don't know the value of the class attribute -- that's what you're trying to find out. So it doesn't make sense to create splits in the decision tree that test the class attribute. If you do that, you're going to get 0% accuracy on test data, because the class value cannot be evaluated on the test data. That was the cause of that confusion.

Here's the league table for the user classifier. J48 gets 96.2%, just as a reference point. Magda did really well and got very close to that, with 93.9%; it took her 6.5-7 minutes, according to the script that she mailed in. Myles did pretty well too -- 93.5%. In the class, I got 78% in just a few seconds. I think if you get over 90% you're doing pretty well on this dataset with the user classifier. The point is not to get a good result; it's to think about the process of classification.

Let's move to Activity 2.2, partitioning the datasets for training and testing. Question 1 asked you to evaluate J48 with percentage split, using 10%, 20%, 40%, 60%, and 80% for the training set. What you observed is that the accuracy increases as we go through that set of numbers: "performance always increases" for those numbers. It doesn't always increase in general. In general, you would expect an increasing trend -- the more training data, the better the performance -- flattening off at some point, but with some fluctuation, so sometimes you would expect it to go down and then up again. In this particular case, performance always increases.
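If you want to reproduce that trend outside the Explorer, here's a sketch along the same lines as the earlier one (again, the dataset path is a placeholder):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SplitTrendDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment-challenge.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate J48 with increasing training percentages.
        for (int percent : new int[] {10, 20, 40, 60, 80}) {
            Instances copy = new Instances(data);
            copy.randomize(new Random(1));
            int trainSize = (int) Math.round(copy.numInstances() * percent / 100.0);
            Instances train = new Instances(copy, 0, trainSize);
            Instances test = new Instances(copy, trainSize,
                    copy.numInstances() - trainSize);

            J48 tree = new J48();
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);

            // Expect a generally increasing, but noisy, trend.
            System.out.printf("%d%% train: %.1f%% correct%n",
                    percent, eval.pctCorrect());
        }
    }
}
```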
In Question 4, you were asked to estimate J48's true accuracy on the segment-challenge dataset. Well, what do we mean by "true accuracy"? I guess it's not very well defined, but what one thinks of is this: with a large enough training set, the performance of J48 is going to increase up to some kind of point -- and what would that point be? Actually, if you do this (in fact, you've done it!), you'll find that training sets of between 60% and 97-98%, using the percentage split option, consistently yield correctly classified instances in the range 94-97%. So 95% is probably the best fit from this selection of possible numbers. It's true, by the way, that greater weight is normally given to the training portion of this split: usually, when we use percentage split, we would use 2/3, or maybe 3/4, or maybe 90% of the data for training, and the smaller amount for testing.

Questions 6 and 7 were confusing, and we've changed them. The issue there was how a classifier's performance, and the reliability of the estimate of that performance, are expected to change with the size of the dataset. The performance is expected to increase as the volume of training data increases, and the reliability of the estimate is expected to increase as the volume of test data increases. With the percentage split option, there's a trade-off between the amount of training data and the amount of test data. That's what that question is trying to get at.

Activity 2.3, Question 5: "How do the mean and standard deviation estimates depend on the number of samples?" Well, roughly speaking, both stay the same. As you increase the number of samples, you expect the estimated mean to converge to the true value of the mean, and the estimated standard deviation to converge to the true standard deviation. So they would both stay about the same, and that answer is now marked as correct. Actually, because of the "n - 1" in the denominator of the formula for variance, the standard deviation estimate does decrease a tiny bit, but it's a very small effect, so we've also accepted that answer as correct. That's how the mean and standard deviation estimates depend on the number of samples.

Perhaps a more important question is how the reliability of the mean changes. What decreases is the standard error of the estimate of the mean, which is the standard deviation of the theoretical distribution of a large population of such estimates. The estimate of the mean becomes a better, more reliable estimate as the number of samples grows.
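To make those two ideas concrete, here are the standard textbook estimators (nothing Weka-specific) for n samples x_1, ..., x_n:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(x_i - \bar{x}\bigr)^2,
\qquad
\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}}.
```

As n grows, the estimates of the mean and standard deviation settle down to the true values -- the n - 1 rather than n in the denominator of the variance is why the standard deviation estimate shrinks ever so slightly -- while the standard error of the mean genuinely decreases, roughly as 1 over the square root of n.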
"The supermarket dataset is weird." Yes, it is weird: it's intended to be weird. In the supermarket dataset, each instance represents a supermarket trolley. Instead of putting a 0 for every item you don't buy -- and of course, when we go to the supermarket, we don't buy most of the items in it -- the ARFF file codes that as a question mark, which stands for "missing value". We're going to discuss missing values in Class 5. This dataset is suitable for association rule learning, which we're not doing in this course. The message I'm trying to emphasize here is that you need to understand what you're doing, not just process datasets blindly. Yes, it is weird.

There's been some discussion on the mailing list about cross-validation and the extra model. When you do cross-validation, you're trying to do two things: get an estimate of the expected accuracy of a classifier, and actually produce a really good classifier. To produce a really good classifier to use in the future, you want to use the entire training set to train it up. To get an estimate of its accuracy, however, you can't do that unless you have an independent test set. So cross-validation takes 90% for training and 10% for testing, repeats that 10 times, and averages the results to get an estimate. Once you've got the estimate, if you want an actual classifier to use, the best classifier is the one built on the full training set. The same is true with the percentage split option: Weka evaluates on the percentage split, but then prints the classifier it produces from the entire training set, to give you a classifier to use on your problem in the future.
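In API terms, that two-step view looks something like the following sketch (dataset path again a placeholder): crossValidateModel gives you the estimate, and a separate buildClassifier call on the full dataset gives you the model you'd actually deploy -- the "extra", eleventh model.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment-challenge.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Step 1: estimate accuracy with 10-fold cross-validation.
        // This builds 10 throwaway models internally.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.printf("Estimated accuracy: %.1f%%%n", eval.pctCorrect());

        // Step 2: build the classifier you'll actually use
        // on the entire dataset.
        J48 deployed = new J48();
        deployed.buildClassifier(data);
        System.out.println(deployed);
    }
}
```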
There's been a little bit of discussion of more advanced stuff; I think maybe a follow-up course might be a good idea here. Someone noticed that if you apply a filter to the training set, you need to apply exactly the same filter to the test set, which is sometimes a bit difficult to do, particularly if the training and test sets are produced by cross-validation. There's an advanced classifier called the FilteredClassifier which addresses that problem.

In his response to a question on the supermarket dataset, Peter mentioned "unbalanced" datasets and the cost of different kinds of error. This is something that Weka can take into account with cost-sensitive evaluation, and there is a classifier called the CostSensitiveClassifier that allows you to do that.

Finally, someone asked a question on attribute selection: how do you select a good subset of attributes? Excellent question! There's a whole attribute selection panel, which we're not able to talk about in this MOOC -- this is just an introductory MOOC on Weka. Maybe we'll come up with an advanced, follow-up MOOC where we're able to discuss some of these more advanced issues.

That's it. I just want to finish with a picture that someone sent in of two wekas in an enclosure. It's rare to see wekas in the wild -- I've seen them a couple of times myself, but not very often. More likely, to see a weka you'll need to go to a place where they keep captive wekas for you to look at. Here are two wekas that Leah from Vancouver sent in.

Class 3 is up now, so off you go with Class 3. Good luck! We'll talk to you later. Bye for now!