Hello, again, and welcome to Data Mining with Weka, back here in New Zealand. In this class, Class 4, we're going to look at some pretty cool machine learning methods.
We're going to look at linear regression, classification by regression, logistic regression, support vector machines, and ensemble learning. The last few of these are contemporary methods, which haven't been around very long. They are kind of state-of-the-art machine learning methods.
Remember, there are 5 classes in this course, so next week is Class 5, the last class. We'll be tidying things up and summarizing things then. You're well over halfway through; you're doing well. Just hang on in there.
In this lesson, we're going to start by looking at classification boundaries for different machine learning methods. We're going to use Weka's Boundary Visualizer, which is another Weka tool that we haven't encountered yet.
I'm going to use a 2-dimensional dataset. I've prepared iris.2d.arff. It's a 2-dimensional version of the iris dataset. I took the regular iris dataset and deleted a couple of attributes -- sepallength and sepalwidth -- leaving me with this 2D dataset, and the class.
We're going to look at that using the Boundary Visualizer. You get that from the Visualization menu on the Weka GUI Chooser. There are a lot of tools in Weka, and we're just going to look at this one here, the Boundary Visualizer. I'm going to open the same file in the Boundary Visualizer, the 2-dimensional iris dataset. Here we've got a plot of the data. You can see that we're plotting petalwidth on the y-axis against petallength on the x-axis. This is a picture of the dataset with the 3 classes: setosa in red, versicolor in green, and virginica in blue.
I'm going to choose a classifier. Let's begin with the OneR classifier, which is in rules. I'm going to "plot training data" and just let it rip. The color diagram shows the decision boundaries, with the training data superimposed on it. Let's look at what OneR does to this dataset in the Explorer. OneR has chosen to split on petalwidth. If it's less than a certain amount, we get a setosa; if it's intermediate, we get a versicolor; and if it's greater than the upper boundary, we get a virginica.
It's the same as what's being shown here. We're splitting on petalwidth. If it's less than a certain amount, we get a setosa; in the middle, a versicolor; and at the top, a virginica.
This is a spatial representation of the decision boundary that OneR creates on this dataset. That's what the Boundary Visualizer does; it draws decision boundaries. It shows here that OneR chooses an attribute -- in this case petalwidth -- to split on. It might have chosen petallength, in which case we'd have vertical decision boundaries. Either way, we're going to get stripes from OneR.
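Here's a minimal sketch of why a one-attribute rule gives stripes. The two cut points are illustrative placeholders, not values read from Weka's output, and the function name is made up for this example.

```python
# Hypothetical cut points on petalwidth; placeholders for illustration only.
LOW_CUT = 0.8
HIGH_CUT = 1.75

def one_r_predict(petallength, petalwidth):
    """Classify using petalwidth only; petallength is deliberately ignored."""
    if petalwidth < LOW_CUT:
        return "setosa"
    if petalwidth < HIGH_CUT:
        return "versicolor"
    return "virginica"

# Sweep petallength along a fixed petalwidth: the label never changes,
# which is why the regions show up as horizontal stripes.
row = [one_r_predict(x, 1.2) for x in (1.0, 3.0, 5.0, 7.0)]
print(row)
```

Had the rule tested petallength instead, the same sweep would change label as x grows, and the stripes would be vertical.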
I'm going to go ahead and look at some boundaries for other schemes. Let's look at IBk, which is a "lazy" classifier. That's the instance-based learner we looked at in the last class. I'm going to run that.
Here we get a different kind of pattern. I'd like to plot the training data. We've got diagonal lines. Down here are the setosas underneath this diagonal line; the versicolors in the intermediate region; and the virginicas, by and large, in the top right-hand corner.
Remember what [IBk] does. It takes a test instance. Let's say we had an instance here, just on this side of the boundary, in the red. Then it chooses the nearest instance to that. That would be this one, I guess. That's kind of nearer than this one here. This is a red point. If I were to cross over the boundary here, it would choose a green class, because this would be the nearest instance then. If you think about it, this boundary goes halfway between this nearest red point and this nearest green point. Similarly, if I take a point up here, I guess the two nearest instances are this blue one and this green one. This blue one is closer. In this case, the boundary goes along this straight line here. You can see that it's not just a single line: this is a piecewise linear line, so this part of the boundary goes exactly halfway between these two points quite close to it. Down here, the boundary goes exactly halfway between these two points. It's the perpendicular bisector of the line joining these points. So we get a piecewise linear boundary made up of little pieces.
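To make the perpendicular bisector idea concrete, here's a small sketch of nearest-neighbor classification with k = 1. The two training points are made up for illustration; they are not taken from iris.2d.arff, and this is plain Python rather than Weka's IBk code.

```python
import math

# Two made-up training points as (petallength, petalwidth) with a class label.
train = [((2.0, 1.0), "setosa"), ((4.0, 2.0), "versicolor")]

def nearest_label(point):
    """Return the class of the single closest training instance (k = 1)."""
    return min(train, key=lambda item: math.dist(point, item[0]))[1]

# The midpoint of the two training points lies on the decision boundary.
mid = (3.0, 1.5)
# Step a little off the midpoint, toward each training point in turn.
toward_red = (mid[0] - 0.2, mid[1] - 0.1)
toward_green = (mid[0] + 0.2, mid[1] + 0.1)
print(nearest_label(toward_red), nearest_label(toward_green))

# Moving from the midpoint at right angles to the joining line keeps the
# two distances equal, so this point is on the boundary as well.
on_bisector = (2.0, 3.5)
d_red = math.dist(on_bisector, train[0][0])
d_green = math.dist(on_bisector, train[1][0])
print(math.isclose(d_red, d_green))
```

With more training points, each pair of neighboring points from different classes contributes its own short bisector segment, which is where the piecewise linear shape comes from.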
It's kind of interesting to see what happens if we change the parameter: if we look at, say, 5 nearest neighbors instead of just 1. Now we get a slightly blurry picture, because whereas down here in the pure red region the 5 nearest neighbors to a point are all red points, if we look in the intermediate region here, then the nearest neighbors to a point here -- this is going to be in the 5, and this might be another one in the 5, and there might be a couple more down here in the 5. So we get an intermediate color here, and IBk takes a vote. If we had 3 reds and 2 greens, then we'd be in the red region and that would be depicted as this darker red here. If it had been the other way round with more greens than reds, we'd be in the green region. So we've got a blurring of these boundaries. These are probabilistic descriptions of the boundary.
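The vote can be sketched like this. It's a plain unweighted vote over made-up points, written to illustrate the idea; Weka's IBk may compute its class distribution differently in detail, and the function name is hypothetical.

```python
import math
from collections import Counter

# Made-up training points as (petallength, petalwidth) with a class label.
train = [
    ((1.0, 0.2), "setosa"), ((1.2, 0.2), "setosa"), ((1.4, 0.3), "setosa"),
    ((3.0, 1.0), "versicolor"), ((3.2, 1.1), "versicolor"),
    ((4.5, 1.5), "versicolor"), ((4.7, 1.4), "versicolor"),
]

def knn_vote(point, k):
    """Return the fraction of the k nearest neighbors belonging to each class."""
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / k for label, n in counts.items()}

# A point between the two clusters: 3 of its 5 neighbors are setosa.
votes = knn_vote((2.2, 0.6), k=5)
print(votes)

# A point deep inside the setosa cluster: every neighbor agrees.
pure = knn_vote((1.1, 0.2), k=3)
print(pure)
```

A split vote like 3 to 2 is what gets drawn as an intermediate shade, while a unanimous vote gives the pure color.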
Let me just change k to 20 and see what happens. Now we get the same shape, but even more blurry boundaries. The Boundary Visualizer reveals the way that machine learning schemes are thinking, if you like: the internal representation of the dataset. It helps you think about the sorts of things that machine learning methods do.
Let's choose another scheme. I'm going to choose NaiveBayes. When we talked about NaiveBayes, we only talked about discrete attributes. With continuous attributes, I'm going to choose a supervised discretization method. Don't worry about this detail; it's the most common way of using NaiveBayes with numeric attributes. Let's look at that picture.
This is interesting. When you think about NaiveBayes, it treats each of the two attributes as contributing equally and independently to the decision. It sort of decides what it should be along this dimension and decides what it should be along this dimension and multiplies the two together. Remember the multiplication that went on in NaiveBayes. When you multiply these things together, you get a checkerboard pattern of probabilities, multiplying up the probabilities. That's because the attributes are being treated independently.
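Here's a rough sketch of that multiplication on discretized attributes. All the probabilities are made up, there are only two bins per attribute and two classes to keep it short, and the table and function names are hypothetical.

```python
# Made-up conditional probabilities P(bin | class) for two discretized attributes.
P_LENGTH = {"setosa": [0.9, 0.1], "virginica": [0.2, 0.8]}  # bins: short, long
P_WIDTH = {"setosa": [0.8, 0.2], "virginica": [0.1, 0.9]}   # bins: narrow, wide
PRIOR = {"setosa": 0.5, "virginica": 0.5}

def posterior(length_bin, width_bin):
    """Multiply prior and per-attribute probabilities, then normalize."""
    scores = {
        c: PRIOR[c] * P_LENGTH[c][length_bin] * P_WIDTH[c][width_bin]
        for c in PRIOR
    }
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# One posterior per (width bin, length bin) cell. Every point that falls in
# the same cell gets the same probabilities, so the plot looks like a grid
# of rectangles.
grid = [[posterior(lb, wb) for lb in (0, 1)] for wb in (0, 1)]
for row in grid:
    print([round(cell["setosa"], 3) for cell in row])
```

Because each attribute contributes its own factor, the cell edges always line up with the bin edges on the two axes, never with a diagonal.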
That's a very different kind of decision boundary from what we saw with instance-based learning. That's what's so good about the Boundary Visualizer: it helps you think about how things are working inside.
I'm going to do one more example. I'm going to do J48, which is in trees. Here we get this kind of structure. Let's take a look at what happens in the Explorer if we choose J48. We get this little decision tree: split first on petalwidth; if it's less than 0.6 it's a setosa for sure. Then split again on petalwidth; if it's greater than 1.7, it's a virginica for sure. Then, in between, split on petallength and then again on petalwidth, getting a mixture of versicolors and virginicas.
We split first on petalwidth; that's this split here. Remember the vertical axis is the petalwidth axis. If it's less than a certain amount, it's a setosa for sure. Then we split again on the same axis. If it's greater than a certain amount, it's a virginica for sure. If it's in the intermediate region, we split on the other axis, which is petallength. Down here, it's a versicolor for sure, and here we're going to split again on the petalwidth attribute.
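The tree reads naturally as nested tests, each one cutting the plot parallel to an axis. In this sketch the 0.6 and 1.7 thresholds come from the tree described above; the other two cut points, the labels on the last two leaves, and the handling of values exactly on a threshold are assumptions made for illustration.

```python
# Hypothetical cut points for the two lower splits; not stated in the lesson.
LENGTH_CUT = 4.9
WIDTH_CUT = 1.5

def j48_like_tree(petallength, petalwidth):
    """Axis-parallel tests, applied in the order the tree applies them."""
    if petalwidth <= 0.6:
        return "setosa"          # bottom band of the plot
    if petalwidth > 1.7:
        return "virginica"       # top band of the plot
    if petallength <= LENGTH_CUT:
        return "versicolor"      # left part of the middle band
    # Right part of the middle band: split once more on petalwidth.
    # These two leaf labels are assumed for the sake of the example.
    if petalwidth <= WIDTH_CUT:
        return "virginica"
    return "versicolor"

print(j48_like_tree(1.4, 0.2), j48_like_tree(4.0, 1.3), j48_like_tree(6.0, 2.2))
```

Every test compares one attribute against one constant, so every boundary is a horizontal or vertical line segment and the regions are rectangles.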
Let's change the minNumObj parameter, which controls the minimum size of the leaves. If we increase that, we're going to get a simpler tree. We discussed this parameter in one of the lessons of Class 3. If we run now, then we get a simpler version, corresponding to the simpler rules we get with this parameter set. Or we can set the parameter to a higher value, say 10, and run it again. We get even simpler rules, very similar to the rules produced by OneR.
We've looked at classification boundaries. Classifiers create boundaries in instance space and different classifiers have different capabilities for carving up instance space. That's called the "bias" of the classifier -- the way in which it's capable of carving up the instance space. We looked at OneR, IBk, NaiveBayes, and J48, and found completely different biases, completely different ways they carve up the instance space. Of course, this kind of visualization is restricted to numeric attributes and 2-dimensional plots, so it's not a very general tool, but it certainly helps you think about these different classifiers.
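If you want a feel for how such a picture comes about, here's a toy text-mode version: classify the center of every cell in a grid and print one symbol per cell. This is only a sketch of the general idea, not how Weka's Boundary Visualizer is implemented, and the stand-in classifier and its thresholds are made up.

```python
def render(classify, x_range, y_range, cols=12, rows=5):
    """Return text rows (top row = largest y), one symbol per grid cell."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    lines = []
    for r in range(rows):
        # Use the center of each cell as the test point.
        y = y_hi - (r + 0.5) * (y_hi - y_lo) / rows
        line = ""
        for c in range(cols):
            x = x_lo + (c + 0.5) * (x_hi - x_lo) / cols
            line += classify(x, y)
        lines.append(line)
    return lines

def stripes(x, y):
    """Stand-in classifier that looks at y only (hypothetical thresholds)."""
    if y < 0.8:
        return "s"
    if y < 1.7:
        return "v"
    return "V"

picture = render(stripes, x_range=(1.0, 7.0), y_range=(0.0, 2.5))
print("\n".join(picture))
```

Swapping in a different `classify` function changes the shape of the regions, which is exactly the difference in bias between the schemes.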
You can read about classification boundaries in Section 17.3 of the course text. Now off you go and do the activity associated with this lesson.