Hi! This is the third class of Data Mining with Weka, and in this class we're going to look at some simple machine learning methods and how they work. We're going to start out emphasizing the message that simple algorithms often work very well. In data mining, maybe in life in general, you should always try simple things before you try more complicated things.

There are many different kinds of simple structure. For example, it might be that one attribute in the dataset does all the work: everything depends on the value of one of the attributes. Or it might be that all of the attributes contribute equally and independently. Or a simple structure might be a decision tree that tests just a few of the attributes. We might calculate the distance from an unknown sample to the nearest training sample, or a result may depend on a linear combination of attributes. We're going to look at all of these simple structures in the next few lessons.

There's no universally best learning algorithm; the success of a machine learning method depends on the domain. Data mining really is an experimental science.

We're going to look at the OneR rule learner, where one attribute does all the work. It's extremely simple, trivial actually, but we're going to start with simple things and build up to more complex things.
OneR learns what you might call a one-level decision tree, or a set of rules that all test one particular attribute: a tree that branches only at the root node depending on the value of a particular attribute, or, equivalently, a set of rules that test the value of that particular attribute. In the basic version of OneR, there's one branch for each value of the attribute. We choose an attribute, and we make one branch for each possible value of that attribute. Each branch assigns the most frequent class that comes down that branch. The error rate is the proportion of instances that don't belong to the majority class of their corresponding branch. We choose the attribute with the smallest error rate.

Let's look at what this actually means. Here's the algorithm. For each attribute, we're going to make some rules. For each value of the attribute, we count how often each class appears, find the most frequent class, and make the rule assign that most frequent class to this attribute-value combination; then we calculate the error rate of this attribute's rules. We repeat that for each of the attributes in the dataset, and choose the attribute with the smallest error rate.
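The procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch, not Weka's actual OneR implementation; the data format (a list of dicts) is an assumption for the example, and ties between classes are broken by whichever class the counter happens to list first.

```python
from collections import Counter

def one_r(instances, attributes, class_attr):
    """Return (best_attribute, rules, errors), where rules maps each
    value of the chosen attribute to the majority class for that value."""
    best = None
    for attr in attributes:
        rules, errors = {}, 0
        for value in {inst[attr] for inst in instances}:
            # Count class frequencies among instances with this attribute value
            counts = Counter(inst[class_attr] for inst in instances
                             if inst[attr] == value)
            majority, hits = counts.most_common(1)[0]
            rules[value] = majority
            errors += sum(counts.values()) - hits  # non-majority instances
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Tiny made-up dataset where attribute "a" perfectly predicts the class
data = [
    {"a": "x", "b": "p", "cls": "yes"},
    {"a": "x", "b": "q", "cls": "yes"},
    {"a": "y", "b": "p", "cls": "no"},
    {"a": "y", "b": "q", "cls": "no"},
]
print(one_r(data, ["a", "b"], "cls"))  # picks "a" with 0 errors
```

Branching on "b" would make 2 errors here, so OneR keeps the rule set for "a".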
Here's the weather data again. What OneR does is look at each attribute in turn, outlook, temperature, humidity, and windy, and form rules based on it. For outlook, there are three possible values: sunny, overcast, and rainy. We just count: out of the 5 sunny instances, 2 of them are yeses and 3 of them are nos. So we're going to choose the rule: if it's sunny, choose no. We're going to get 2 errors out of 5. For overcast, all 4 overcast values of outlook lead to yes values for the class play, so we're going to choose the rule: if outlook is overcast, then yes, giving us 0 errors. Finally, for outlook is rainy we're going to choose yes as well, and that gives us 2 errors out of the 5 instances. If we branch on outlook, we've got a total of 4 errors.

We can branch on temperature and do the same thing. When temperature is hot, there are 2 nos and 2 yeses. We just choose arbitrarily in the case of a tie, so if it's hot, let's predict no, getting 2 errors.
If temperature is mild, we'll predict yes, getting 2 errors out of 6; and if the temperature is cool, we'll predict yes, getting 1 of the 4 instances wrong. And the same for humidity and windy. We look at the total error values and choose the attribute with the lowest total, either outlook or humidity. That's a tie, so we'll just choose arbitrarily, and choose outlook. That's how OneR works; it's as simple as that.

Let's just try it. Here's Weka. I'm going to open the nominal weather data and go to Classify. This is such a trivial dataset that the results aren't very meaningful, but if I just run ZeroR to start off with, I get a success rate of 64%. If I now choose OneR and run that, I get a rule, and the rule I get branches on outlook: if it's sunny, choose no; if overcast, choose yes; and if rainy, choose yes. We get 10 out of 14 instances correct on the training set. We're evaluating this using cross-validation, which doesn't really make much sense on such a small dataset.
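The error counts in this walkthrough can be checked with a short Python sketch over the 14 nominal weather instances. This is just a hand-rolled tally for illustration, not Weka code; the dataset itself is the standard weather data that ships with Weka.

```python
from collections import Counter

# The 14-instance nominal weather data: (outlook, temperature, humidity, windy, play)
weather = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
names = ["outlook", "temperature", "humidity", "windy"]

def total_errors(attr_index):
    """Errors made by the one-level rule set that branches on one attribute."""
    errors = 0
    for value in {row[attr_index] for row in weather}:
        # Class counts among instances that go down this branch
        counts = Counter(row[-1] for row in weather if row[attr_index] == value)
        errors += sum(counts.values()) - counts.most_common(1)[0][1]
    return errors

for i, name in enumerate(names):
    print(name, total_errors(i))  # outlook 4, temperature 5, humidity 4, windy 5
```

Outlook and humidity tie on 4 total errors, matching the arbitrary choice of outlook made above.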
Interestingly, though, the success rate we get, 42%, is pretty bad, worse than ZeroR. Actually, with any 2-class problem you would expect to get a success rate of at least 50%; tossing a coin would give you 50%. This OneR scheme is not performing very well on this trivial dataset. Notice where the rule it finally prints out comes from: since we're using 10-fold cross-validation, it does the whole thing 10 times, and then on the 11th time it calculates a rule from the entire dataset, and that's what it prints out. That's where this rule comes from.

OneR: one attribute does all the work. This is a very simple method of machine learning described in 1993, 20 years ago, in a paper called "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets" by a guy called Rob Holte, who lives in Canada. He did an experimental evaluation of the OneR method on 16 commonly used datasets. He used cross-validation, just as we've told you, to evaluate these things, and he found that the simple rules from OneR often outperformed far more complex methods that had been proposed for these datasets. How can such a simple method work so well?
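For comparison, the ZeroR baseline mentioned earlier is easy to reproduce by hand: it ignores every attribute and always predicts the majority class. A minimal sketch, assuming the class values from the same 14 weather instances:

```python
from collections import Counter

# Class values (play) for the 14 weather instances: 9 yes, 5 no
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

# ZeroR always predicts the majority class, here "yes"
majority, hits = Counter(play).most_common(1)[0]
accuracy = hits / len(play)
print(majority, round(100 * accuracy, 1))  # yes 64.3
```

That 64% figure is the baseline any learner should beat, which is why OneR's 42% under cross-validation on this tiny dataset is so striking.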
Some datasets really are simple, and others are so small, noisy, or complex that you can't learn anything from them. So it's always worth trying the simplest things first. Section 4.1 of the course text talks about OneR. Now it's time for you to go and do the activity associated with this lesson. Bye for now!