Hi! I'm sitting here in New Zealand. It's on the globe behind me -- that's New Zealand, at the top of the world, surrounded by water. But that's not where I'm from originally. I moved here about 20 years ago. Here on this map -- of course, this is New Zealand (Google puts things with the north at the top, which is probably what you're used to) -- I came here from the University of Calgary in Canada, where I was for many years. I used to be head of Computer Science at the University of Calgary. But originally I'm from Belfast, Northern Ireland, which is here in the United Kingdom. So my accent actually is Northern Irish, not New Zealand. This is not a New Zealand accent.
We're going to talk here in the last lesson of Class 3 about another machine learning method called the "nearest-neighbor", or "instance-based", machine learning method.
When people talk about "rote learning", they just mean remembering stuff without really thinking about it. It's the simplest kind of learning. "Nearest-neighbor" implements rote learning. It just remembers the training instances, and then, to classify a new instance, it searches the training set for the one that is most like the new instance. The representation of the knowledge here is just the set of instances. It's a kind of "lazy" learning -- the learner does nothing until it has to make a prediction. Confusingly, it's also called "instance-based" learning. Nearest-neighbor learning and instance-based learning are the same thing.
Here is a little picture of two-dimensional instance space. The blue points and the white points are two different classes -- "yes" and "no", for example. Then we've got an unknown instance, the red one. We want to know which class it's in. So we simply find the closest instance in each of the classes and see which is closest. In this case, it's the blue class, so we would classify that red point as though it belonged to the blue class.
If you think about this, that's implicitly drawing a line between the two clouds of points. It's a straight line here, the perpendicular bisector of the line that joins the two closest points. So the nearest-neighbor method produces a linear decision boundary. Actually, it's a little bit more complicated than that. It produces a "piecewise" linear decision boundary, with sometimes a bunch of little linear pieces as the decision boundary.
Of course, the trick is, what do we mean by "most like"? We need a similarity function, and conventionally, people use the regular distance function, the "Euclidean distance", which is the sum of the squares of the differences between the attributes. Actually, it's the square root of the sum of the squares, but since we're just comparing two instances, we don't need to take the square root. Or you might use the "Manhattan" or "city block" distance, which is the sum of the absolute differences between the attribute values. Of course, I've been talking about numeric attributes here. If attributes are nominal, we need the distance between different attribute values. Conventionally, people say the distance is 1 if the attribute values are different and 0 if they are the same.
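The distance functions just described can be sketched in a few lines of Python. This is my own illustration of the idea, not Weka's internal code; the function names are made up for the example.

```python
def euclidean_sq(a, b):
    """Squared Euclidean distance: the sum of the squared differences
    between attribute values. The square root is omitted because it
    doesn't change which neighbor comes out nearest."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    """Manhattan (city-block) distance: the sum of the absolute
    differences between attribute values."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nominal_distance(x, y):
    """Conventional distance for nominal values: 0 if they are the
    same, 1 if they are different."""
    return 0 if x == y else 1
```

For example, `euclidean_sq([0, 0], [3, 4])` gives 25, while `manhattan([0, 0], [3, 4])` gives 7.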
It might be a good idea with nearest-neighbor learning to normalize the attributes so that they all lie between 0 and 1, so the distance isn't skewed by some attribute that happens to be on some gigantic scale.
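That normalization step might look like this, assuming numeric attributes stored row by row; again, a sketch of the idea rather than what Weka does internally.

```python
def normalize(dataset):
    """Rescale each attribute (column) to lie between 0 and 1,
    using the minimum and maximum observed in the data, so no
    attribute dominates the distance just because of its scale."""
    cols = list(zip(*dataset))          # transpose: rows -> columns
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0   # guard constant columns
         for v, l, h in zip(row, lo, hi)]
        for row in dataset
    ]
```

So a column ranging over 0..10 and a column ranging over 10..30 both end up on the same 0-to-1 scale.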
What about noisy instances? If we have a noisy dataset, then by accident we might find an incorrectly classified training instance as the nearest one to our test instance. You can guard against that by using k nearest neighbors: k might be 3 or 5, and you look for the 3 or the 5 nearest neighbors and choose the majority class amongst those when classifying an unknown point. That's the k-nearest-neighbors method. In Weka, it's called IBk (instance-based learning with parameter k), and it's in the "lazy" class.
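The whole k-nearest-neighbor idea fits in a few lines. This is a minimal sketch in the spirit of IBk, assuming numeric attributes and unweighted majority voting -- not Weka's actual implementation.

```python
from collections import Counter

def knn_classify(train, query, k=1):
    """train is a list of (attributes, class) pairs. Find the k
    training instances nearest to query (by squared Euclidean
    distance) and return the majority class among them."""
    def dist_sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    neighbors = sorted(train, key=lambda inst: dist_sq(inst[0], query))[:k]
    votes = Counter(cls for _, cls in neighbors)
    return votes.most_common(1)[0][0]
```

With k = 3, a single mislabeled training point near the query gets outvoted by its two correctly labeled neighbors, which is exactly the guard against noise described above.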
Let's open the glass dataset, go to Classify, and choose the lazy classifier IBk. Let's just run it. We get an accuracy of 70.6%. The model is not printed here, because there is no model -- it's just the set of training instances. We're using 10-fold cross-validation, of course.
Let's change the value of k: the kNN parameter is the k value. It's set by default to 1 (the number of neighbors to use). We'll change that to, say, 5 and run that. In this case, we get a slightly worse result, 67.8% with k = 5. This is not such a noisy dataset, I guess. If we change it to 20 and run it again, we get 65% accuracy -- slightly worse again. If we had a noisy dataset we might find that the accuracy figure improves as k gets a little bit larger. But then it would always start to decrease again -- if we set k to an extreme value, close to the size of the whole dataset, then we're taking the distance of the test instance to all the points in the dataset and averaging those, which will probably give us something close to the baseline accuracy. Here, if I set k to a ridiculous value like 100, I'm going to take the 100 nearest instances and average their classes. We get an accuracy of 35%, which I think is pretty close to the baseline accuracy for this dataset. Let me just find that out with ZeroR: the baseline accuracy is indeed 35%.
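You can see why an extreme k degenerates to the baseline with a tiny sketch (my own illustration): when k equals the size of the training set, every query sees the same neighborhood, so the vote always returns the overall majority class -- the same prediction ZeroR would make.

```python
from collections import Counter

def knn_classify(train, query, k):
    """Majority vote among the k training instances nearest to query."""
    def dist_sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    neighbors = sorted(train, key=lambda inst: dist_sq(inst[0], query))[:k]
    return Counter(cls for _, cls in neighbors).most_common(1)[0][0]

train = [([0, 0], "no"), ([0, 1], "no"), ([9, 9], "yes")]
# k equal to the dataset size: the vote covers all three instances,
# so the answer is the majority class "no" -- even for a query
# sitting right on top of the lone "yes" point.
print(knn_classify(train, [9, 9], k=3))
```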
Nearest-neighbor is a really good method. It's often very accurate. It can be slow -- a simple implementation would involve scanning the entire training dataset to make each prediction, because we've got to calculate the distance of the test instance from all the training instances to see which is closest. But there are more sophisticated data structures that can make this faster, so you don't need to scan the whole dataset every time.
It assumes all attributes are equally important. If that isn't the case, you might want to look at schemes for selecting attributes, or weighting attributes according to their importance. If we've got noisy instances, then we can use a majority vote over the k nearest neighbors, or we might weight instances according to their prediction accuracy. Or we might try to identify reliable prototypes, one for each of the classes.
This is a very old method. Statisticians have used k-nearest-neighbor since the 1950s. There's an interesting theoretical result: if the number n of training instances approaches infinity, and k grows with it in such a way that k/n approaches 0 while k itself also approaches infinity, then the error of the k-nearest-neighbor method approaches the theoretical minimum error for that dataset. So there's a theoretical guarantee that with a huge dataset and large values of k, you're going to get good results from nearest-neighbor learning.
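The conditions in that result can be written compactly. The notation here is mine; the "theoretical minimum error" is what statisticians call the Bayes error.

```latex
n \to \infty, \qquad k \to \infty, \qquad \frac{k}{n} \to 0
\;\;\Longrightarrow\;\;
\operatorname{err}(k\text{-NN}) \;\to\; \operatorname{err}_{\text{Bayes}}
```

Intuitively, k/n going to 0 keeps the neighborhood local, while k going to infinity makes the vote within that neighborhood reliable.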
There's a section in the text, Section 4.7, on Instance-based learning.
This is the last lesson of Class 3. Off you go and do the activity, and I'll see you in Class 4. Bye for now!