Hi! Well, it's summertime here in New Zealand. Summer's just arrived, and, as you can see, I'm sitting outside for a change of venue.
This is Class 5 of the MOOC -- the last class! Here are a few comments on Class 4, some issues that came up.
We had a couple of errors in the activities; we corrected those pretty quickly. Some of the activities are getting harder -- you will have noticed that! But I think if you're doing the activities you'll be learning a lot. You learn a lot through doing the activities, so keep it up! And the Class 5 activities are much easier.
There was a question about converting nominal variables to numeric in Activity 4.2. Someone said the result of the supervised NominalToBinary filter was weird. Yes, well, it is a little bit weird. If you click the "More" button for that filter, it says that k-1 new binary attributes are generated in the manner described in this book (if you can get hold of it). Let me just tell you a little bit more about this.
I've come up with an example of a nominal attribute called "fruit", and it has 3 values: orange, apple, and banana. In this dataset, the class is "juicy"; it's a numeric measure of juiciness. I don't know about where you live, but in New Zealand oranges are juicier than apples, and apples are juicier than bananas. I'm assuming that in this dataset, if you average the juiciness of all the instances where the fruit attribute equals orange you get a larger value than if you do this with all the instances where the fruit attribute equals apple, and that's larger than for banana. That sort of orders these values.
Let's consider ways of making "fruit" into a set of binary attributes. The simplest method, and the one that's used by the unsupervised conversion filter, is Method 1 here. We create 3 new binary attributes; I've just called them "fruit=orange", "fruit=apple", and "fruit=banana". The first attribute value is 1 if it's an orange and 0 otherwise. The second attribute, "fruit=apple", is 1 if it's an apple and 0 otherwise, and the same for banana. Of course, of these three binary attributes, exactly one of them has to be "1" for any instance.
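Method 1 can be sketched in a few lines. This is only an illustration of the idea, not Weka's code; the function name `one_hot` is hypothetical.

```python
FRUITS = ["orange", "apple", "banana"]

def one_hot(value, values):
    """Method 1: one 0/1 attribute per possible value of the nominal attribute."""
    return {f"fruit={v}": int(v == value) for v in values}

apple_encoded = one_hot("apple", FRUITS)
print(apple_encoded)
# Exactly one of the new attributes is 1 for any instance.
print(sum(apple_encoded.values()))
```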
Here's another way of doing it, Method 2. We take each possible subset: as well as "orange", "apple" and "banana", we have another binary variable for "orange_or_apple", another for "orange_or_banana", and another for "apple_or_banana". For example, if the value of fruit was "orange", then the first attribute ("fruit=orange") would be 1, the fourth attribute ("orange_or_apple") would be 1, and the fifth attribute ("orange_or_banana") would be 1. All of the others would be 0. This effectively creates a binary attribute for each subset of possible values of the "fruit" attribute. Actually, we don't create one for the empty subset or the full subset (with all 3 of the values in). We get 2^k-2 new attributes for a k-valued attribute. That's impractical in general, because 2^k grows very fast as k grows.
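To see how quickly Method 2 blows up, here is a small sketch that enumerates the subsets. The helper names are hypothetical, and this is an illustration rather than anything Weka does.

```python
from itertools import combinations

def subset_attributes(values):
    """Every non-empty, proper subset of the values: one binary attribute each."""
    subsets = []
    for size in range(1, len(values)):  # skips the empty and the full subset
        subsets.extend(combinations(values, size))
    return subsets

def encode_by_subsets(value, values):
    """An attribute is 1 when the instance's value belongs to that subset."""
    return {"_or_".join(s): int(value in s) for s in subset_attributes(values)}

orange_encoded = encode_by_subsets("orange", ["orange", "apple", "banana"])
print(orange_encoded)

# Number of new attributes for a few values of k: always 2**k - 2.
counts = {k: len(subset_attributes(list(range(k)))) for k in (3, 5, 10)}
print(counts)
```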
The third method is the one that is actually used, and this is the one that's described in that book. We create 2 new attributes (k-1, in general, for a k-valued attribute): "fruit=orange_or_apple" and "fruit=orange". For oranges, they're both 1; for apples, the first attribute is 1 and the second is 0; and for bananas, they're both 0. Each new attribute corresponds to one cut point in the ordering of the values. That's assuming this ordering of class values: orange is largest in juiciness, and banana is smallest in juiciness.
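Here is a minimal sketch of a k-1 attribute encoding of this kind. It assumes each new attribute marks one cut point in the juiciness ordering, so the attribute names and 0/1 patterns it prints follow from that assumption. The juiciness numbers are made up for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

def order_by_mean_class(rows):
    """Sort the nominal values by average class value, largest first."""
    totals = defaultdict(lambda: [0.0, 0])
    for value, target in rows:
        totals[value][0] += target
        totals[value][1] += 1
    return sorted(totals, key=lambda v: totals[v][0] / totals[v][1], reverse=True)

def ordered_encoding(value, ordered_values):
    """k-1 binary attributes, one per cut point in the ordering."""
    rank = ordered_values.index(value)
    bits = {}
    for cut in range(1, len(ordered_values)):
        name = "fruit=" + "_or_".join(ordered_values[:cut])
        bits[name] = int(rank < cut)  # 1 if the value lies above this cut
    return bits

# Made-up (fruit, juiciness) pairs; only the resulting ordering matters.
toy_rows = [("orange", 0.9), ("apple", 0.6), ("banana", 0.2),
            ("orange", 0.8), ("apple", 0.5)]
ordering = order_by_mean_class(toy_rows)
print(ordering)
for fruit in ordering:
    print(fruit, ordered_encoding(fruit, ordering))
```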
There's a theorem that, if you're making a decision tree, the best way of splitting a node for a nominal variable with k values is at one of the k-1 positions in the ordering of those values by average class value -- well, you can read this. In fact, this theorem is reflected in Method 3. That is the best way of splitting these attribute values. Whether it's a good thing in practice or not, well, I don't know. You should try it and see. Perhaps you can try Method 3 with the supervised conversion filter and Method 1 with the unsupervised conversion filter and see which produces the best results on your dataset. Weka doesn't implement Method 2, because the number of attributes explodes with the number of possible values, and you could end up with some very large datasets.
The next question is about simulating multiresponse linear regression: "Please explain!" Well, we're looking at a Weka screen like this. We're running linear regression on the iris dataset where we've mapped the values so that the class for any Virginica instance is 1 and 0 for the others. We've done it with this kind of configuration. This is the default configuration of the MakeIndicator filter. It's working on the last attribute -- that's the class. In this case, the value index is last, which means we're looking at the last value, which, in fact, is Virginica. We could put a number here to get the first, second, or third values. That's how we get the dataset, and then we run linear regression on this to get a linear model.
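The mapping itself is simple. Here is a sketch of what an indicator mapping does to the class column; it is not the filter's actual code, and the function name `make_indicator` and the four example labels are illustrative.

```python
def make_indicator(labels, target_value):
    """Replace each class label by 1 if it equals target_value, else 0."""
    return [1 if label == target_value else 0 for label in labels]

# Four example class labels (illustrative).
labels = ["virginica", "virginica", "versicolor", "setosa"]

virginica_target = make_indicator(labels, "virginica")
versicolor_target = make_indicator(labels, "versicolor")
setosa_target = make_indicator(labels, "setosa")
print(virginica_target, versicolor_target, setosa_target)
```

Each of the three 0/1 columns becomes the numeric class for its own linear regression run.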
Now, I want to look at the output for the first 4 instances. We've got an actual class of 1, 1, 0, 0 and the predicted value of these numbers. I've written those down in this little table over here: 1, 1, 0, 0 and these numbers. That's for the dataset where all of the Virginicas are mapped to 1 and the other irises are mapped to 0. When we do the corresponding mapping with Versicolors, we get this as the actual class -- we just run Weka and look at what appeared on the screen -- and this is the predicted value. We get these for Setosa. So, you can see that the first instance is actually a Virginica -- 1, 0, 0. I've put in bold the largest of these 3 numbers. This is the largest, 0.966, which is bigger than 0.117 and -0.065, so multiresponse linear regression is going to predict Virginica for instance 1. It's got the largest value. And that's correct.
For the second instance, it's also a Virginica, and it's also the largest of the 3 values in its row. For the third instance, it's actually a Versicolor. The actual output is 1 for the Versicolor model, but the largest prediction is still for the Virginica model. It's going to predict Virginica for an iris that's actually Versicolor. That's going to be a mistake. In the fourth case, it's actually a Setosa -- the actual column is 1 for Setosa -- and this is the largest value in the row, so it's going to correctly predict Setosa. That's how multiresponse linear regression works.
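The decision step can be sketched as picking the class whose regression model gives the largest output. The three numbers are the ones quoted for instance 1; which of 0.117 and -0.065 belongs to Versicolor and which to Setosa is an assumption here, and the function name is hypothetical.

```python
def predict_class(outputs):
    """Return the class whose per-class regression output is largest."""
    return max(outputs, key=outputs.get)

# Outputs of the three linear models for instance 1.
# Assumption: 0.117 is the Versicolor model, -0.065 the Setosa model.
instance_1 = {"virginica": 0.966, "versicolor": 0.117, "setosa": -0.065}

prediction = predict_class(instance_1)
print(prediction)
```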
"How does OneR use the rules it generates? Please explain!" Well, here's the rule generated by OneR. It hinges on attribute 6. Of course, if you click the "Edit" button in the Preprocess panel, you can see the value of this attribute for each instance. This is what we see in the Explorer when we run OneR. You can see the predicted instances here. These are the predicted instances -- g, b, g, b, g, g, etc. These are the predictions. The question is, how does it get these predictions. This is the value of attribute 6 for instance 1. What the OneR code does is go through each of these conditions and looks to see if it's satisfied. Is 0.02 less than -0.2? -- no, it's not. Is it less than -0.01? -- no, it's not. Is it less than 0.001? -- no, it's not. (It's surprisingly hard to get these right, especially when you've got all of the other decimal places in the list here.) Is it less than 0.1? -- yes, it is. So rule 4 fires -- this is rule 4 -- and predicts "g". I've written down here the number of the rule clause that fires. In this case, for instance 2, the value of the attribute is -0.4, and that satisfies the first rule. So this satisfies number 1, and we predict "b". And so on down the list. That's what OneR does. It goes through the rule evaluating each of these clauses until it finds one that is true, and then it uses the corresponding prediction as its output.
Moving on to ensemble learning questions. There were some questions on ensemble learning, about these ten OneR models. "Are these ten alternative ways of classifying the data?" Well, in a sense, but they are used together: AdaBoostM1 combines them. In practice you don't just pick one of them and use that: AdaBoostM1 combines these models inside itself -- the predictions it prints are produced by its combined model. The weights are used in the combination to decide how much weight to give each of these models. And when Weka reports a certain accuracy, that's for the combined model. It's not the average; it's not the best; it's combined in the way that AdaBoostM1 combines them. That's all done internally in the algorithm. I didn't really explain the details of how the algorithm works; you'll have to look that up, I guess. The point is AdaBoostM1 combines these models for you. You don't have to think of them as separate models.
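As a rough picture of "combined using weights", here is a sketch of a weighted vote over the individual models' predictions. It leaves out how boosting arrives at the weights; the weights and predictions below are made up, and the function name is hypothetical.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Add up each model's weight behind the class it predicts; the largest total wins."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Three hypothetical models: two say "b", but the heavily weighted one says "g".
combined = weighted_vote(["g", "b", "b"], [2.0, 0.5, 0.6])
print(combined)
```

This is why the reported accuracy is neither the average nor the best of the individual models: a minority of models can outvote the majority if their weights are large enough.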
Someone complained that we're supposed to be looking for simplicity, and this seems pretty complicated. That's true. The real disadvantage of these kinds of models, ensemble models, is that it's hard to look at the rules. It's hard to see inside to see what they're doing. Perhaps you should be a bit wary of that. But they can produce very good results. You know how to test machine learning methods reliably using cross-validation or whatever. So, sometimes they're good to use.
"How does Weka make predictions? How can you use Weka to make predictions?" You can use the "Supplied test set" option on the Classify panel to put in a test set and see the predictions on that. Or, alternatively, there is a program -- if you can run Java programs -- there's a program here. This is how you run it: "java weka.classifiers.trees.J48" with your ARFF data file, and you put question marks there to indicate the class. Then you give it the model, which you've output from the Explorer. You can look at how to do this on the Weka Wiki on the FAQ list: "using Weka to make predictions".
Can you bootstrap learning? Someone talked about some friends of his who were using training data to train a classifier and using the results of the classification to create further training data, and continuing the cycle -- kind of bootstrapping. That sounds very attractive, but it can also be unstable. It might work, but I think you'd be pretty lucky for it to work well. It's a potentially rather unreliable way of doing things -- believing the classifications on new data and using that to further train the classifier. He also said these friends of his don't really look into the classification algorithm. I guess I'm trying to tell you a little bit about how each classification algorithm works, because I think it really does help to know that. You should be looking inside and thinking about what's going on inside your data mining method.
A couple of suggestions of things not covered in this MOOC: FilteredClassifier and association rules, the Apriori association rule learner. As I said before, maybe we'll produce a follow-up MOOC and include topics like this in it.
That's it for now. Class 5 is the last class. It's a short class. Go ahead and do it. Please complete the assessments and finish off the course. It'll be open this week, and it'll remain open for one further week if you're getting behind. But after that, it'll be closed. So, you need to get on with it.
We'll talk to you later. Bye!