Hi! Welcome back to Data Mining with Weka. This is Class 2.

In the first class, we downloaded Weka and looked around the Explorer and a few datasets; we used a classifier, the J48 classifier; we used a filter to remove attributes and to remove some instances; and we visualized some data, including classification errors on a dataset. Along the way we looked at a few datasets: the weather data, in both its nominal and numeric versions, the glass data, and the iris dataset.

This class is all about evaluation. In Lesson 1.4, we built a classifier using J48. In this first lesson of the second class, we're going to see what it's like to actually be a classifier ourselves. Then, in subsequent lessons in this class, we'll look further at evaluation: training and testing, baseline accuracy, and cross-validation.

First of all, we're going to see what it's like to be a classifier. We're going to construct a decision tree ourselves, interactively. I'm going to open up Weka here, the Weka Explorer, and load the segment-challenge dataset. segment-challenge.arff, that's the one I want. We're going to look at this dataset. Let's first of all look at the class.
The class values are brickface, sky, foliage, cement, window, path, and grass. It looks like this is an image analysis dataset. When we look at the attributes, we see things like the centroid of columns and rows, pixel counts, line densities, means of intensities, and various other things, such as saturation and hue. The class, as I said before, is different kinds of texture: brickface, sky, foliage, and so on. That's the segment-challenge dataset.

I'm going to select the user classifier. The user classifier is a tree classifier; we'll see what it does in just a minute.

Before I start, and this is really quite important, I'm going to use a supplied test set. I'm going to set the test set, which is used to evaluate the classifier, to be segment-test. The training set is segment-challenge; the test set is segment-test. Now we're all set, and I'm going to start the classifier.

What we see is a window with two panels: the Tree Visualizer and the Data Visualizer. Let's start with the Data Visualizer. We looked at visualization in the last class, and at how you can select different attributes for the x and y axes. I'm going to plot region-centroid-row against intensity-mean. That's the plot I get.
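The key point here is the workflow: learn from one dataset, then score on a separate, supplied test set. As a minimal sketch of that idea in plain Python (the data and the one-rule "classifier" below are invented for illustration; this is not Weka's API):

```python
# Illustrative train/test workflow: fit on training data, evaluate on a
# *separate* test set, mirroring the "Supplied test set" option in Weka.
# All names, data, and the one-rule learner are made up for this sketch.

def fit_one_rule(train):
    """Learn one threshold on a single attribute (a 1-level decision 'tree')."""
    sky   = [x[0] for x, y in train if y == "sky"]
    other = [x[0] for x, y in train if y != "sky"]
    # Put the threshold halfway between the two class means.
    return (sum(sky) / len(sky) + sum(other) / len(other)) / 2

def predict(threshold, instance):
    return "sky" if instance[0] < threshold else "other"

# (attribute value,) paired with a class label; think of region-centroid-row.
train = [((10,), "sky"), ((20,), "sky"), ((80,), "other"), ((90,), "other")]
test  = [((15,), "sky"), ((85,), "other"), ((50,), "other")]

model = fit_one_rule(train)                      # threshold = 50.0
correct = sum(predict(model, x) == y for x, y in test)
print(f"test accuracy: {correct}/{len(test)}")   # scored on test data only
```

The point is not the toy learner but the separation: nothing from `test` is used during fitting, just as segment-test plays no part in building the tree.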
Now we're going to make a selection. I'm going to select Rectangle. If I draw out a rectangle here with my mouse, I get a rectangle that's pretty well pure reds, as far as I can see. I'm going to submit this rectangle. You can see that that area has gone and the picture has been rescaled.

I'm building up a tree here. If I look at the Tree Visualizer, I've got a tree. We've split on these two attributes, region-centroid-row and intensity-mean. Here we've got sky: these are all sky instances. Here we've got a mixture of brickface, foliage, cement, window, path, and grass. We're going to build up this tree step by step. What I want to do is take this node and refine it a bit more.

Here is the Data Visualizer again. I'm going to select a rectangle containing these items here, and submit that. They've gone from this picture. You can see that I've created another split on region-centroid-row and intensity-mean, and this node here is almost all path: 233 path instances, and then a mixture here. That's a pure node we've got over there; this is almost a pure node; and this is the one I want to work on. I'm going to cover some of those instances now.
Let's take this lot here and submit that. Then I'm going to take this lot here and submit that. Maybe I'll take those ones there and submit that. This little cluster here seems pretty uniform: submit that.

I haven't actually changed the axes, but, of course, at any time I could change these axes to better separate the remaining classes. I could experiment with them. Actually, a quick way to do it is to click here on these bars: left-click for x and right-click for y. That way I can quickly explore different pairs of axes to see if I can get a better split.

Here's the tree I've created. I'm going to fit it to the screen; it looks like this. You can see that we have successively elaborated down this branch here. When I've finished with this, I can accept the tree.

Actually, before I do that, let me just show you that we were selecting rectangles here, but there are other things I can select: a polygon or a polyline. If I don't want to use rectangles, I can use polygons or polylines; if you like, you can experiment with those to select differently shaped areas. There's an area I've got selected that I just can't quite finish off. All right, I right-clicked to finish it off. I could submit that.
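Each submitted rectangle amounts to a tree node that tests a pair of axis-parallel conditions on the two plotted attributes. A sketch of what the hand-built tree boils down to (the thresholds and structure below are invented; they don't reproduce the tree built in the video):

```python
# What the user classifier's tree amounts to: each submitted rectangle
# becomes a node testing axis-parallel conditions on the plotted attributes.
# Thresholds here are made up for illustration only.

def classify(region_centroid_row, intensity_mean):
    # First rectangle: this corner of the plot looked like pure sky.
    if region_centroid_row < 50 and intensity_mean > 100:
        return "sky"
    # Second rectangle, carved out of the remainder: mostly path.
    if region_centroid_row > 180 and intensity_mean < 30:
        return "path"
    # Anything not yet covered falls through to a mixed leaf,
    # which is the node you would keep refining with more rectangles.
    return "mixed"

print(classify(30, 120))   # inside the first rectangle
print(classify(200, 10))   # inside the second rectangle
print(classify(100, 50))   # not covered yet
```

Refining a node, as in the video, just means replacing the `"mixed"` leaf with further nested tests.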
So I'm not confined to rectangles; I can use different shapes. I'm not going to do that, though: I'm satisfied with this tree for the moment, so I'm going to accept the tree. Once I do this, there is no going back, so you want to be sure. If I accept the tree, it asks "Are you sure?" Yes.

Here I've got a confusion matrix, and I can look at the errors. My tree classifies 78% of the instances correctly, nearly 79%, and 21% incorrectly. That's not too bad, especially considering how quickly I built that tree.

It's over to you now. I'd like you to play around and see if you can do better than this by spending a little bit longer on getting a nice tree. I'd also like you to reflect on a couple of things. First of all, what strategy are you using to build this tree? Basically, we're covering different regions of the instance space, trying to pick out pure regions to create pure branches. This is a kind of bottom-up covering strategy: we cover this area, and this area, and this area. That's not how J48 works. When it builds its trees, it tries to make a judicious split through the whole dataset.
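The "percentage correct" figure comes straight from the confusion matrix: correct predictions sit on the diagonal, so accuracy is the diagonal sum over the grand total. A small sketch (the 3x3 matrix below is invented; the actual segment data has seven classes):

```python
# Percentage correct from a confusion matrix: rows are actual classes,
# columns are predicted classes, correct predictions lie on the diagonal.
# These counts are made up for illustration.

matrix = [
    [50,  3,  2],
    [ 4, 40,  6],
    [ 1,  5, 44],
]

correct = sum(matrix[i][i] for i in range(len(matrix)))  # trace = 134
total = sum(sum(row) for row in matrix)                  # all cells = 155
print(f"{100 * correct / total:.1f}% correct")
```

The off-diagonal cells are just as informative: they show *which* classes get confused with which, something a single accuracy number hides.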
At the very top level, it splits the entire dataset into two in a way that doesn't necessarily separate out particular classes, but makes things easier when it starts working on each half of the dataset, splitting further in a top-down manner to try to produce an optimal tree. It will produce trees much better than the one I just produced with the user classifier.

I'd also like you to reflect on what it is we're trying to do here. Given enough time, you could produce a 'perfect' tree for the dataset, but don't forget that the dataset we've loaded is the training dataset. We're going to evaluate this tree on a different dataset, the test dataset, which hopefully comes from the same source but is not identical to the training dataset. We're not trying to fit the training dataset precisely; we're trying to fit it in a way that generalizes the kinds of patterns exhibited in the dataset. We're looking for something that will perform well on the test data. That highlights the importance of evaluation in machine learning, and that's what this class is going to be about: different ways of evaluating your classifier.

That's it. There's some information in the course text about the user classifier, which you can read if you like. Please go on and do the activity associated with this lesson and produce your own classifier.
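The danger of a 'perfect' tree can be made concrete: a classifier that simply memorizes its training instances scores 100% on the training data yet has nothing useful to say about unseen instances. A tiny sketch with invented data:

```python
# Why perfectly fitting the training set is not the goal: a classifier
# that memorizes training instances is "perfect" on them and useless on
# anything else. All data here is invented for illustration.

train = {(1.0, 2.0): "grass", (3.0, 1.0): "cement", (2.5, 4.0): "grass"}
test  = {(1.1, 2.1): "grass", (3.2, 0.9): "cement"}   # similar but not identical

def memorizer(x):
    # Exact lookup: no generalization at all.
    return train.get(x, "unknown")

train_acc = sum(memorizer(x) == y for x, y in train.items()) / len(train)
test_acc  = sum(memorizer(x) == y for x, y in test.items())  / len(test)
print(train_acc, test_acc)  # 1.0 on training data, 0.0 on the test set
```

A classifier that instead captured the pattern (say, "points near (1, 2) and (2.5, 4) are grass") would do worse than 100% on training data but far better on the test set, which is the score that matters.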
Hopefully, given 5-10 minutes, you'll be able to do much better than me. Good luck!