Hi! This is the third class of Data Mining with Weka, and in this class we're going to look at some simple machine learning methods and how they work. We're going to start out by emphasizing the message that simple algorithms often work very well. In data mining -- maybe in life in general -- you should always try simple things before you try more complicated things.

There are many different kinds of simple structure. For example, it might be that one attribute in the dataset does all the work: everything depends on the value of that one attribute. Or it might be that all of the attributes contribute equally and independently. Or a simple structure might be a decision tree that tests just a few of the attributes. We might calculate the distance from an unknown sample to the nearest training sample. Or the result might depend on a linear combination of attributes. We're going to look at all of these simple structures in the next few lessons. There's no universally best learning algorithm: the success of a machine learning method depends on the domain. Data mining really is an experimental science.

We're going to look at the "OneR" rule learner, where one attribute does all the work. It's extremely simple -- trivial, actually -- but we're going to start with simple things and build up to more complex things. OneR learns what you might call a "one-level decision tree": a tree that branches only at the root node, depending on the value of a particular attribute, or, equivalently, a set of rules that all test the value of that one attribute.

Here's the basic version of OneR. We choose an attribute and make one branch for each possible value of that attribute. Each branch assigns the most frequent class that comes down that branch. The error rate is the proportion of instances that don't belong to the majority class of their corresponding branch. We choose the attribute with the smallest error rate.

Let's look at what this actually means. Here's the algorithm. For each attribute, we make a set of rules: for each value of the attribute, we count how often each class appears, find the most frequent class, and make a rule that assigns that most frequent class to this attribute-value combination. Then we calculate the error rate of that attribute's rules. We repeat this for each of the attributes in the dataset and choose the attribute with the smallest error rate.
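To make that procedure concrete, here is a minimal sketch of it in plain Java. This is an illustration of the idea, not Weka's own implementation; it assumes nominal attributes and breaks ties arbitrarily. The 14 instances hard-coded below are the standard weather data used throughout the course.

```java
import java.util.*;

// A minimal sketch of the OneR procedure described above, in plain Java
// (not Weka's own implementation). Instances are arrays of nominal attribute
// values; the class labels ("play") are held in a parallel array.
public class OneRSketch {

    // Build the rule for one attribute: map each attribute value to the class
    // that occurs most often with that value (ties broken arbitrarily).
    static Map<String, String> ruleFor(int attr, String[][] data, String[] classes) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (int i = 0; i < data.length; i++) {
            counts.computeIfAbsent(data[i][attr], v -> new HashMap<>())
                  .merge(classes[i], 1, Integer::sum);
        }
        Map<String, String> rule = new HashMap<>();
        counts.forEach((value, classCounts) -> rule.put(value,
                Collections.max(classCounts.entrySet(), Map.Entry.comparingByValue()).getKey()));
        return rule;
    }

    // Count how many training instances the rule gets wrong.
    static int errors(Map<String, String> rule, int attr, String[][] data, String[] classes) {
        int e = 0;
        for (int i = 0; i < data.length; i++) {
            if (!rule.get(data[i][attr]).equals(classes[i])) e++;
        }
        return e;
    }

    public static void main(String[] args) {
        // The standard 14-instance weather data: outlook, temperature, humidity, windy -> play.
        String[] attrNames = {"outlook", "temperature", "humidity", "windy"};
        String[][] data = {
            {"sunny","hot","high","false"},     {"sunny","hot","high","true"},
            {"overcast","hot","high","false"},  {"rainy","mild","high","false"},
            {"rainy","cool","normal","false"},  {"rainy","cool","normal","true"},
            {"overcast","cool","normal","true"},{"sunny","mild","high","false"},
            {"sunny","cool","normal","false"},  {"rainy","mild","normal","false"},
            {"sunny","mild","normal","true"},   {"overcast","mild","high","true"},
            {"overcast","hot","normal","false"},{"rainy","mild","high","true"}
        };
        String[] play = {"no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"};

        // Try each attribute in turn and keep the one whose rule makes the fewest errors.
        int bestAttr = -1, bestErrors = Integer.MAX_VALUE;
        for (int a = 0; a < attrNames.length; a++) {
            Map<String, String> rule = ruleFor(a, data, play);
            int e = errors(rule, a, data, play);
            System.out.println(attrNames[a] + ": " + e + "/" + data.length + " errors  " + rule);
            if (e < bestErrors) { bestErrors = e; bestAttr = a; }
        }
        System.out.println("OneR branches on \"" + attrNames[bestAttr] + "\"");
    }
}
```

On this data it tallies 4 errors for "outlook" -- the same count the walkthrough below arrives at -- and picks "outlook" as the branching attribute.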
Here's the weather data again. What OneR does is look at each attribute in turn -- "outlook", "temperature", "humidity", and "windy" -- and form rules based on it. For "outlook" there are three possible values: "sunny", "overcast", and "rainy". We just count. Out of the 5 "sunny" instances, 2 are yes's and 3 are no's, so we make the rule: if it's "sunny", choose "no". That gets 2 errors out of 5. For "overcast", all 4 "overcast" instances have "yes" for the class, "play", so we make the rule: if "outlook" is "overcast" then "yes", giving 0 errors. Finally, for "outlook" is "rainy" we choose "yes" as well, which gives 2 errors out of the 5 instances. So if we branch on "outlook" we get a total of 4 errors.

We can branch on "temperature" and do the same thing. When "temperature" is "hot" there are 2 no's and 2 yes's. We just choose arbitrarily in the case of a tie, so we'll choose: if it's "hot", predict "no", getting 2 errors. If "temperature" is "mild" we'll predict "yes", getting 2 errors out of 6, and if "temperature" is "cool" we'll predict "yes", getting 1 error out of the 4 instances. We do the same for "humidity" and "windy". Then we look at the total error values and choose the rule with the lowest total -- either "outlook" or "humidity". That's a tie, so we'll just choose arbitrarily and take "outlook". That's how OneR works: it's as simple as that.

Let's try it. Here's Weka. I'm going to open the nominal weather data and go to Classify. This is such a trivial dataset that the results aren't very meaningful, but if I run ZeroR first, I get a success rate of 64%. If I now choose OneR and run that, I get a rule, and the rule is: branch on "outlook"; if it's "sunny" choose "no", if "overcast" choose "yes", and if "rainy" choose "yes". That gets 10 out of 14 instances correct on the training set.

We're evaluating this using cross-validation, which doesn't really make much sense on such a small dataset. It's interesting, though, that the success rate we get, 42%, is pretty bad -- worse than ZeroR. With any 2-class problem you would expect a success rate of at least 50%: tossing a coin would give you 50%. So this OneR scheme is not performing very well on this trivial dataset. Notice the rule it finally prints out: since we're using 10-fold cross-validation, Weka does the whole thing 10 times and then, on the 11th time, calculates a rule from the entire dataset, and that's the rule it prints. That's where this rule comes from.

OneR: one attribute does all the work. This very simple method of machine learning was described in 1993, 20 years ago, in a paper called "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets" by a guy called Rob Holte, who lives in Canada. He did an experimental evaluation of the OneR method on 16 commonly used datasets. He used cross-validation, just as we've told you to do when evaluating these things, and he found that the simple rules from OneR often outperformed far more complex methods that had been proposed for those datasets. How can such a simple method work so well? Some datasets really are simple, and others are so small, noisy, or complex that you can't learn anything from them. So it's always worth trying the simplest things first.

Section 4.1 of the course text talks about OneR. Now it's time for you to go and do the activity associated with this lesson. Bye for now!
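For anyone who would rather script the experiment from this lesson than click through the Explorer, here is a minimal sketch using Weka's Java API. The file path and the random seed are assumptions, and the exact cross-validation percentages depend on how the folds fall, so treat the Explorer demo above as the authoritative version.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Reproduces the Explorer steps from code: load the nominal weather data,
// evaluate ZeroR and OneR with 10-fold cross-validation, then build OneR on
// the full dataset to see the rule it prints.
public class WeatherOneRDemo {
    public static void main(String[] args) throws Exception {
        // The path is an assumption; weather.nominal.arff ships in Weka's data/ directory.
        Instances data = DataSource.read("data/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);   // "play" is the class attribute

        // ZeroR baseline: always predicts the majority class.
        ZeroR zeroR = new ZeroR();
        Evaluation zEval = new Evaluation(data);
        zEval.crossValidateModel(zeroR, data, 10, new Random(1));
        System.out.printf("ZeroR: %.1f%% correct%n", zEval.pctCorrect());

        // OneR: branches on the single best attribute.
        OneR oneR = new OneR();
        Evaluation oEval = new Evaluation(data);
        oEval.crossValidateModel(oneR, data, 10, new Random(1));
        System.out.printf("OneR:  %.1f%% correct%n", oEval.pctCorrect());

        // As in the Explorer, the rule that gets printed is built from the entire dataset.
        oneR.buildClassifier(data);
        System.out.println(oneR);
    }
}
```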