Hi! I'm sitting here in New Zealand. It's on the globe behind me. That's New Zealand, at the top of the world, surrounded by water. But that's not where I'm from originally. I moved here about 20 years ago. Here on this map, of course, this is New Zealand -- Google puts things with the north at the top, which is probably what you're used to. I came here from the University of Calgary in Canada, where I was for many years. I used to be head of computer science at the University of Calgary. But originally I'm from Belfast, Northern Ireland, which is here in the United Kingdom. So my accent is actually Northern Irish, not New Zealand. This is not a New Zealand accent.

In this last lesson of Class 3, we're going to talk about another machine learning method called nearest neighbor, or instance-based, learning. When people talk about rote learning, they mean just remembering stuff without really thinking about it. It's the simplest kind of learning. Nearest neighbor implements rote learning. It just remembers the training instances, and then, to classify a new instance, it searches the training set for the one that is most like the new instance. The representation of the knowledge here is just the set of instances. It's a kind of lazy learning: the learner does nothing until it has to make some predictions. Confusingly, it's also called instance-based learning. Nearest neighbor learning and instance-based learning are the same thing.

Here is a little picture of a 2-dimensional instance space. The blue points and the white points are two different classes -- yes and no, for example. Then we've got an unknown instance, the red one, and we want to know which class it's in. So we simply find the closest instance in each of the classes and see which is closer. In this case, it's the blue class, so we would classify that red point as though it belonged to the blue class. If you think about it, that's implicitly drawing a line between the two clouds of points. It's a straight line here: the perpendicular bisector of the line that joins the two closest points. So the nearest neighbor method produces a linear decision boundary. Actually, it's a little bit more complicated than that: it produces a piecewise linear decision boundary, sometimes with a whole bunch of little linear pieces.
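To make the rote-learning idea concrete, here is a minimal, self-contained Java sketch of 1-nearest-neighbor classification. It is purely illustrative and is not Weka's implementation: the class name and the tiny 2-D dataset in main are made up. "Training" just stores the instances, and prediction scans them all for the closest one using squared Euclidean distance (the distance measure discussed next).

```java
// Minimal sketch of rote learning / 1-nearest-neighbor classification.
// Training is just "remember the instances"; prediction scans them all.
import java.util.List;

public class OneNearestNeighbor {
    private final List<double[]> trainingPoints;  // numeric attribute values
    private final List<String> trainingLabels;    // class of each stored instance

    // "Training" stores the instances verbatim; nothing else happens (lazy learning).
    public OneNearestNeighbor(List<double[]> points, List<String> labels) {
        this.trainingPoints = points;
        this.trainingLabels = labels;
    }

    // Squared Euclidean distance; the square root is omitted because
    // it does not change which instance is closest.
    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Classify by returning the class of the closest stored instance.
    public String classify(double[] query) {
        double best = Double.POSITIVE_INFINITY;
        String bestLabel = null;
        for (int i = 0; i < trainingPoints.size(); i++) {
            double d = squaredDistance(trainingPoints.get(i), query);
            if (d < best) {
                best = d;
                bestLabel = trainingLabels.get(i);
            }
        }
        return bestLabel;
    }

    public static void main(String[] args) {
        // Tiny made-up 2-D example: two "clouds" of points, classes "blue" and "white".
        List<double[]> points = List.of(
                new double[]{1.0, 1.0}, new double[]{1.2, 0.8}, new double[]{0.9, 1.1},
                new double[]{4.0, 4.0}, new double[]{4.2, 3.9}, new double[]{3.8, 4.1});
        List<String> labels = List.of("blue", "blue", "blue", "white", "white", "white");

        OneNearestNeighbor nn = new OneNearestNeighbor(points, labels);
        System.out.println(nn.classify(new double[]{1.5, 1.5}));  // expected: blue
        System.out.println(nn.classify(new double[]{3.5, 3.5}));  // expected: white
    }
}
```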
Of course, the trick is what we mean by "most like". We need a similarity function, and conventionally people use the regular distance function, the Euclidean distance, which is based on the sum of the squares of the differences between the attribute values. Strictly speaking, it's the square root of the sum of the squares, but since we're only comparing distances to see which instance is closest, we don't need to take the square root. Or you might use the Manhattan, or city block, distance, which is the sum of the absolute differences between the attribute values.

Of course, I've been talking about numeric attributes here. If attributes are nominal, we need to define the distance between different attribute values. Conventionally, people just say the distance is 1 if the attribute values are different and 0 if they are the same. It might also be a good idea with nearest neighbor learning to normalize the attributes so that they all lie between 0 and 1, so the distance isn't skewed by some attribute that happens to be on a gigantic scale.

What about noisy instances? If we have a noisy dataset, then by accident we might find an incorrectly classified training instance as the nearest one to our test instance. You can guard against that by using the k nearest neighbors. k might be 3 or 5: you look for the 3 or the 5 nearest neighbors and choose the majority class amongst those when classifying an unknown point. That's the k-nearest-neighbor method. In Weka, it's called IBk (instance-based learning with parameter k), and it's in the lazy category.

Let's open the glass dataset. Go to Classify and choose the lazy classifier IBk. Let's just run it. We get an accuracy of 70.6%. No model is really printed here, because there is no model: it's just the set of training instances. We're using 10-fold cross-validation, of course.

Let's change the value of k -- this kNN parameter is the k value. It's set by default to 1; it's the number of neighbors to use. We'll change that to, say, 5 and run that. In this case, we get a slightly worse result, 67.8% with k as 5. This is not such a noisy dataset, I guess. If we change it to 20 and run it again, we get 65% accuracy, slightly worse again.
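The same experiment can also be scripted outside the Explorer. Below is a minimal sketch using the Weka Java API; it assumes weka.jar is on the classpath and that glass.arff (the dataset shipped with Weka) is in the working directory, and the class name is my own. It evaluates IBk with 10-fold cross-validation for the k values tried above; the percentages should be close to the ones seen in the Explorer, though they depend on the cross-validation seed.

```java
// Sketch: reproduce the Explorer experiment with the Weka Java API.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IBkGlassExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");   // adjust the path as needed
        data.setClassIndex(data.numAttributes() - 1);     // class is the last attribute

        // Try the same k values as in the lesson: 1 (the default), 5, and 20.
        for (int k : new int[]{1, 5, 20}) {
            IBk ibk = new IBk();
            ibk.setKNN(k);  // the kNN parameter seen in the Explorer

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(ibk, data, 10, new Random(1));  // 10-fold CV
            System.out.printf("k = %2d : %.1f%% correct%n", k, eval.pctCorrect());
        }
    }
}
```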
If we had a noisy dataset, we might find that the accuracy figures improved as k got a little bit larger, and then at some point started to decrease again. If we set k to an extreme value, close to the size of the whole dataset, then for each test instance we're essentially looking at nearly all of the points in the dataset and taking the majority class among them, which will probably give us something close to the baseline accuracy. Here, if I set k to a ridiculous value like 100, I'm going to take the 100 nearest instances and vote among their classes. We get an accuracy of 35%, which I think is pretty close to the baseline accuracy for this dataset. Let me just check that with ZeroR: the baseline accuracy is indeed 35%.

Nearest neighbor is a really good method; it's often very accurate. It can be slow: a simple implementation involves scanning the entire training dataset to make each prediction, because we've got to calculate the distance of the unknown test instance from all of the training instances to see which is closest. There are more sophisticated data structures that can make this faster, so you don't need to scan the whole dataset every time. It also assumes all attributes are equally important. If that's not the case, you might want to look at schemes for selecting or weighting attributes depending on their importance. If we've got noisy instances, then we can use a majority vote over the k nearest neighbors, or we might weight instances according to their prediction accuracy, or we might try to identify reliable prototypes, one for each of the classes.

This is a very old method; statisticians have used k-nearest-neighbor since the 1950s. There's an interesting theoretical result: if the number n of training instances approaches infinity, and k also gets larger in such a way that k/n approaches 0 but k itself still approaches infinity, then the error of the k-nearest-neighbor method approaches the theoretical minimum error for that dataset. So there is a theoretical guarantee that with a huge dataset and large values of k, you're going to get good results from nearest neighbor learning.

There's a section in the text, Section 4.7, on instance-based learning. This is the last lesson of Class 3. Off you go and do the activity, and I'll see you in Class 4. Bye for now!
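As a companion to the lesson, here is a second illustrative Java sketch that extends the earlier 1-NN example with two of the refinements discussed above: it normalizes every attribute to the range 0 to 1 and takes a majority vote over the k nearest neighbors. The class name and the small dataset in main are invented; Weka's IBk already handles all of this for you, so this is only a sketch of the idea, not a replacement for it.

```java
// Sketch: k-nearest-neighbor with min-max normalization and a majority vote.
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class KNearestNeighbors {
    private final double[][] points;   // normalized training attribute values
    private final String[] labels;     // class of each training instance
    private final double[] min, max;   // per-attribute ranges for normalization
    private final int k;

    public KNearestNeighbors(double[][] rawPoints, String[] labels, int k) {
        int n = rawPoints.length, m = rawPoints[0].length;
        this.labels = labels;
        this.k = k;
        min = new double[m];
        max = new double[m];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] p : rawPoints)
            for (int j = 0; j < m; j++) {
                min[j] = Math.min(min[j], p[j]);
                max[j] = Math.max(max[j], p[j]);
            }
        points = new double[n][m];
        for (int i = 0; i < n; i++) points[i] = normalize(rawPoints[i]);
    }

    // Scale each attribute to [0, 1] so no attribute dominates the distance.
    private double[] normalize(double[] p) {
        double[] q = new double[p.length];
        for (int j = 0; j < p.length; j++) {
            double range = max[j] - min[j];
            q[j] = range == 0 ? 0 : (p[j] - min[j]) / range;
        }
        return q;
    }

    public String classify(double[] rawQuery) {
        double[] query = normalize(rawQuery);

        // Squared Euclidean distance to every training instance.
        Integer[] order = new Integer[points.length];
        double[] dist = new double[points.length];
        for (int i = 0; i < points.length; i++) {
            order[i] = i;
            double sum = 0;
            for (int j = 0; j < query.length; j++) {
                double diff = points[i][j] - query[j];
                sum += diff * diff;
            }
            dist[i] = sum;
        }
        Arrays.sort(order, (a, b) -> Double.compare(dist[a], dist[b]));

        // Majority vote over the k nearest neighbors.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, points.length); i++)
            votes.merge(labels[order[i]], 1, Integer::sum);
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }

    public static void main(String[] args) {
        // Made-up data: the second attribute is on a much larger scale,
        // which is exactly when normalization matters.
        double[][] train = {{1, 100}, {2, 110}, {1.5, 90}, {8, 800}, {9, 820}, {8.5, 790}};
        String[] labels = {"yes", "yes", "yes", "no", "no", "no"};
        KNearestNeighbors knn = new KNearestNeighbors(train, labels, 3);
        System.out.println(knn.classify(new double[]{2.5, 120}));  // expected: yes
    }
}
```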