Hi! Welcome back! In the last lesson, we looked at linear regression -- the problem of predicting, not a nominal class value, but a numeric class value. The regression problem.
In this lesson, we're going to look at how to use regression techniques for classification. It sounds a bit weird, but regression techniques can be really good under certain circumstances, and we're going to see if we can apply them to ordinary classification problems.
In a 2-class problem, it's quite easy really. We're going to call the 2 classes 0 and 1 and just use those as numbers. Then we come up with a regression line that, presumably, has a pretty low value for most 0 instances and a larger value for most 1 instances. Finally, we come up with a threshold: if the regression output is less than that threshold, we're going to predict class 0; if it's greater, we're going to predict class 1.
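To make the 2-class idea concrete, here is a minimal sketch in plain Python. It is not what Weka does internally: the data is made up, there is only one attribute, and the threshold of 0.5 is simply assumed rather than optimized.

```python
# Toy data (made up for illustration): one numeric attribute, class coded as 0 or 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_class(x, slope, intercept, threshold=0.5):
    """Predict 1 if the regression output reaches the (assumed) threshold, else 0."""
    return 1 if slope * x + intercept >= threshold else 0

slope, intercept = fit_line(xs, ys)
predictions = [predict_class(x, slope, intercept) for x in xs]
print(predictions)
```

Note that the raw regression output is just a number: for the smallest attribute value in this toy data it actually falls below 0, and only the threshold turns it into a class.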
If we want to generalize that to more than 2 classes, we can use a separate regression for each class. We set the output to 1 for instances that belong to the class, and 0 for instances that don't. Then we come up with a separate regression line for each class, and given an unknown test example, we're going to choose the class with the largest output. That would give us n regressions for a problem where there are n different classes. We could alternatively use pairwise regression: take every pair of classes -- that's n(n-1)/2 pairs, roughly n squared over 2 -- and have a linear regression line for each pair of classes, discriminating an instance in one class of that pair from the other class of that pair.
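The bookkeeping for the two multi-class schemes can be sketched as follows. The class labels and the regression outputs here are hypothetical, chosen only to show the counting and the "largest output wins" rule.

```python
from itertools import combinations

classes = ["a", "b", "c", "d"]  # hypothetical class labels, n = 4

def one_vs_rest_targets(labels, cls):
    """Multi-response: the target is 1 for instances of cls, 0 for everything else."""
    return [1 if label == cls else 0 for label in labels]

# Multi-response needs one regression per class: n of them.
targets_for_a = one_vs_rest_targets(["a", "b", "a", "d"], "a")

# Pairwise needs one regression per unordered pair of classes: n(n-1)/2 of them.
pairs = list(combinations(classes, 2))

# Hypothetical outputs of the n per-class regressions for one test instance.
outputs = {"a": 0.1, "b": 0.7, "c": 0.3, "d": -0.05}
predicted = max(outputs, key=outputs.get)  # choose the class with the largest output

print(len(classes), len(pairs), predicted)
```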
We're going to work with a 2-class problem, and we're going to investigate 2-class classification by regression.
I'm going to open diabetes.arff. Before I convert the class, let's just try to apply regression to this as it stands. I'm going to try LinearRegression. You see it's grayed out here. That means it's not applicable. I can select it, but I can't start it. It's not applicable because linear regression applies to a dataset where the class is numeric, and we've got a dataset where the class is nominal. We need to fix that. We're going to change this from these 2 labels to 0 and 1, respectively.
We'll do that with a filter. We want to change an attribute. It's unsupervised. We want to change a nominal to a binary attribute, so that's the NominalToBinary filter. The default will apply it to all the attributes, but we just want to apply it to the 9th attribute. I'm hoping it will change this attribute from nominal to binary. Unfortunately, it doesn't. It doesn't have any effect, and the reason is that these attribute filters don't operate on the class attribute. I can change which attribute is the class: we're going to set this to "No class", so now this is not the class attribute for the dataset. Run the filter again. Now I've got what I want: this attribute "class" is either 0 or 1. In fact, this is the histogram -- there are this number of 0's and this number of 1's, which correspond to the two different values in the original dataset.
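The effect of that conversion on a two-valued attribute can be sketched like this. This is a plain Python illustration, not the Weka filter itself; the label names are hypothetical, and mapping the first label to 0 and the second to 1 is an assumption about the ordering.

```python
def nominal_to_binary(values, labels):
    """Map a two-valued nominal attribute to 0/1 using the order of its labels."""
    if len(labels) != 2:
        raise ValueError("this sketch only handles exactly two labels")
    index = {label: i for i, label in enumerate(labels)}
    return [index[v] for v in values]

labels = ["negative", "positive"]  # hypothetical label names
column = ["negative", "positive", "positive", "negative", "negative"]

binary = nominal_to_binary(column, labels)

# The histogram is unchanged: the same counts, now under the values 0 and 1.
counts = {0: binary.count(0), 1: binary.count(1)}
print(binary, counts)
```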
Now, we've got our LinearRegression, and we can just run it. This is the regression line. It's a line, 0.02 times the "pregnancy" attribute, plus this times the "plas" attribute, and so on, plus this times the "age" attribute, plus this number. That will give us a number for any given instance. We can see that number if we select "Output predictions" and run it again. Here is a table of predictions for each instance in the dataset. This is the instance number; this is the actual class of the instance, which is 0 or 1; this is the predicted class, which is a number -- sometimes it's less than 0. We would hope that these numbers are generally fairly small for 0's and generally larger for 1's. They sort of are, although it's not really easy to tell. This is the error value here in the fourth column.
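Each row of that prediction table is just the regression formula applied to one instance. In the sketch below only the 0.02 coefficient comes from the lesson; the other weights, the intercept, and the instance values are made up, and the error is taken here to be predicted minus actual.

```python
# Coefficients: 0.02 is quoted in the lesson; the rest are made-up placeholders.
weights = {"preg": 0.02, "plas": 0.006, "age": 0.003}
intercept = -0.5  # made-up placeholder

def regression_output(instance):
    """Weighted sum of the attribute values plus the intercept."""
    return sum(weights[name] * instance[name] for name in weights) + intercept

instance = {"preg": 2.0, "plas": 100.0, "age": 30.0}  # hypothetical instance
actual = 0  # its class after conversion to 0/1

predicted = regression_output(instance)
error = predicted - actual  # assumption: error means predicted minus actual

print(round(predicted, 3), round(error, 3))
```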
I'm going to do more extensive investigation, and you might ask why are we bothering to do this? First of all, it's an interesting idea that I want to explore. It will lead to quite good performance for classification by regression, and it will lead into the next lesson on logistic regression, which is an excellent classification technique. Perhaps most importantly, we'll learn how to do some cool things with the Weka interface.
My strategy is to add a new attribute called "classification" that gives this predicted number, and then we're going to use OneR to optimize a split point for the two classes. We'll have to restore the class back to its original nominal value, because, remember, I just converted it to numeric.
Here it is in detail. We're going to use a supervised attribute filter [AddClassification]. This is actually pretty cool, I think. We're going to add a new attribute called "classification". We're going to choose a classifier for that -- LinearRegression. We need to set "outputClassification" to "True". If we just run this, it will add a new attribute to the dataset. It's called "classification", and it's got these numeric values, which correspond exactly to the numeric values that were predicted here by the linear regression scheme. Now, we've got this "classification" attribute, and what I'd like to do now is to convert the class attribute back to nominal from numeric. I want to use OneR now, and OneR will only work with a nominal class. Let me convert that. I want NumericToNominal. I want to run that on attribute number 9. Let me apply that, and now, sure enough, I've got the two labels 0 and 1. This is a nominal attribute with these two labels. I'll be sure to make that one the class attribute. Then I get the colors back -- 2 colors for the 2 classes. Really, I want to predict this "class" based on the value of "classification", that numeric value. I'm going to delete all the other attributes. I'm going to go to my Classify panel here. I'm going to predict "class" -- this nominal value "class" -- and I'm going to use OneR. I think I'll stop outputting the predictions because they just get in the way; and run that.
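The data preparation in that walkthrough amounts to three steps: append the regression output as a new attribute, turn the class back into a label, and drop everything else. Here is a rough Python sketch of those steps. The rows are made up, and `model` is a hypothetical stand-in for the fitted regression, not the one Weka produces.

```python
# Made-up rows with the class already coded as a number (0 or 1).
rows = [
    {"plas": 85.0, "age": 31.0, "class": 0},
    {"plas": 183.0, "age": 32.0, "class": 1},
    {"plas": 89.0, "age": 21.0, "class": 0},
]

def model(row):
    """Hypothetical stand-in for the fitted linear regression."""
    return 0.006 * row["plas"] + 0.003 * row["age"] - 0.5

# Step 1: add a "classification" attribute holding the regression output.
with_prediction = [dict(row, classification=model(row)) for row in rows]

# Steps 2 and 3: make the class nominal again (a label, not a number)
# and keep only the new attribute and the class.
reduced = [
    {"classification": r["classification"], "class": str(r["class"])}
    for r in with_prediction
]

for r in reduced:
    print(r)
```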
It's 72-73%, and that's a bit disappointing. But actually, when you look at this, OneR has produced this really overfitted rule. We want a single split point: if it's less than this then predict 0, otherwise predict 1. We can get around that by changing this "b" parameter, the minBucketSize parameter, to be something much larger. I'm going to change it to 100 and run it again. Now I've got much better performance, about 77% accuracy, and this is the kind of split I've got: if the classification -- that is the regression value -- is less than 0.47 I'm going to call it a 0; otherwise I'm going to call it a 1.
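What we are after is a single threshold on the regression value. The sketch below finds one by brute force on made-up data. It is a simplified stand-in, not OneR's actual bucketing algorithm: it simply tries the midpoint between each pair of neighbouring values and keeps the one with the best training accuracy.

```python
def best_split(values, labels):
    """Return (threshold, accuracy) for the rule: value < threshold -> 0, else 1."""
    ordered = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best = None
    for t in candidates:
        correct = sum((0 if v < t else 1) == y for v, y in zip(values, labels))
        accuracy = correct / len(values)
        if best is None or accuracy > best[1]:
            best = (t, accuracy)
    return best

# Made-up regression outputs and their true classes.
values = [-0.1, 0.2, 0.3, 0.45, 0.5, 0.7, 0.9, 1.1]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

threshold, accuracy = best_split(values, labels)
print(threshold, accuracy)
```

A single threshold cannot fit every training instance here, and that is the point: forcing one split instead of many small intervals is what a large minimum bucket size encourages.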
So I've got what I wanted, classification by regression. We've extended linear regression to classification. This performance of 76.8% is actually quite good for this problem. It was easy to do with 2 classes, 0 and 1; otherwise you need to have a regression for each class -- multi-response linear regression -- or else for each pair of classes -- pairwise linear regression. We learned quite a few things about Weka. We learned about unsupervised attribute filters to convert nominal attributes to binary, and numeric attributes back to nominal. We learned about this cool filter AddClassification, which adds the classification according to a machine learning scheme as an attribute in the dataset. We learned about setting and unsetting the class of the dataset, and we learned about the minimum bucket size parameter to prevent OneR from overfitting.
That's classification by regression. In the next lesson, we're going to do better. We're going to look at logistic regression, an advanced technique which does classification by regression in an even more effective way. We'll see you soon. Bye!