1 00:00:18,109 --> 00:00:20,490 Hi! You probably learned a bit about flowers if you did the activity 2 00:00:20,490 --> 00:00:23,150 associated with the last lesson. 3 00:00:23,150 --> 00:00:26,509 Now, we're going to actually build a classifier: Lesson 1.4 4 00:00:26,509 --> 00:00:28,810 Building a classifier. 5 00:00:28,810 --> 00:00:30,349 We're going to use a 6 00:00:30,349 --> 00:00:35,730 system called J48—I'll tell you why it's called J48 in a minute— 7 00:00:35,730 --> 00:00:38,450 to analyze the glass dataset 8 00:00:38,450 --> 00:00:41,820 that we looked at in the last lesson. 9 00:00:41,820 --> 00:00:44,990 I've got the glass dataset open here. 10 00:00:44,990 --> 00:00:49,180 I'm going to go to the Classify panel. 11 00:00:49,180 --> 00:00:52,590 I choose a classifier here. 12 00:00:52,590 --> 00:00:56,750 There are different kinds of classifiers. Weka has 13 00:00:56,750 --> 00:01:01,310 bayes classifiers, functions classifiers, lazy classifiers, meta classifiers, and so on. 14 00:01:02,000 --> 00:01:08,890 We're going to use a tree classifier. J48 is a tree classifier. I'm going to open trees and click 15 00:01:08,890 --> 00:01:10,700 J48. 16 00:01:10,700 --> 00:01:15,240 Here is the J48 classifier. 17 00:01:15,240 --> 00:01:19,590 Let's run it. If we just press Start, we've got the dataset, we've got the classifier, 18 00:01:19,590 --> 00:01:21,040 and lo and behold, 19 00:01:21,040 --> 00:01:22,430 it's done it. 20 00:01:22,430 --> 00:01:24,290 It's a bit of an anticlimax, really. 21 00:01:24,290 --> 00:01:26,270 Weka makes things very easy 22 00:01:26,270 --> 00:01:27,650 for you to do. 23 00:01:27,650 --> 00:01:30,130 The problem is understanding what it is that you have done. 24 00:01:30,130 --> 00:01:32,190 Let's take a look. 25 00:01:32,190 --> 00:01:35,390 Here is some information about the dataset, 26 00:01:35,390 --> 00:01:38,890 the glass dataset: the number of instances and attributes. 
27 00:01:38,890 --> 00:01:42,720 Then it's printed out a representation of a tree here. 28 00:01:43,860 --> 00:01:46,900 We'll look at these trees later on, 29 00:01:46,900 --> 00:01:50,159 but just note that this tree has 30 00:01:50,159 --> 00:01:54,880 30 leaves and 59 nodes altogether. 31 00:01:54,880 --> 00:01:57,120 The overall accuracy 32 00:01:57,120 --> 00:02:00,180 is 66.8%. 33 00:02:00,180 --> 00:02:01,330 So, it's done pretty well. 34 00:02:02,619 --> 00:02:05,410 Down at the bottom, 35 00:02:05,410 --> 00:02:08,929 we've got a confusion matrix. Remember there were about seven different 36 00:02:08,929 --> 00:02:10,260 kinds of glass. 37 00:02:10,260 --> 00:02:11,320 This is 38 00:02:11,320 --> 00:02:15,329 building windows made of float glass. 39 00:02:15,329 --> 00:02:19,109 You can see that 50 of these have been classified as 'a', which is 40 00:02:19,109 --> 00:02:20,959 correctly classified. 41 00:02:20,959 --> 00:02:23,369 15 of them have been classified as 'b', 42 00:02:23,369 --> 00:02:26,719 which is building windows non-float glass, so those are errors, 43 00:02:26,719 --> 00:02:28,579 and 3 have been classified as 'c', 44 00:02:28,579 --> 00:02:29,649 and so on. 45 00:02:29,649 --> 00:02:32,619 This is a confusion matrix. 46 00:02:32,619 --> 00:02:36,019 Most of the weight is down the main diagonal, which 47 00:02:36,019 --> 00:02:39,780 we like to see because that indicates correct classifications. 48 00:02:39,780 --> 00:02:41,549 Everything off the main diagonal 49 00:02:41,549 --> 00:02:46,049 indicates a misclassification. 50 00:02:46,049 --> 00:02:50,360 That's the confusion matrix. 51 00:02:50,360 --> 00:02:52,260 Let's investigate this a bit further. 52 00:02:52,260 --> 00:02:55,950 We're going to open a configuration panel for J48. 53 00:02:55,950 --> 00:02:57,689 Remember I chose it 54 00:02:57,689 --> 00:03:00,979 by clicking the Choose button. 
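A small aside on the confusion matrix described above: accuracy is the sum of the main diagonal divided by the sum of all cells, and everything off the diagonal is an error. The sketch below is only an illustration under stated assumptions. The function names `accuracy` and `misclassified` are made up for this note, the first row borrows the counts quoted in the lesson (50, 15, 3), and the other two rows are invented. The real glass matrix has more classes, so the numbers here do not reproduce the 66.8% figure.

```python
def accuracy(matrix):
    """Fraction of instances on the main diagonal (correct classifications)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def misclassified(matrix):
    """Count of instances in off-diagonal cells (errors)."""
    return sum(
        count
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
        if i != j
    )


# Rows are actual classes, columns are predicted classes.
# Row 'a' uses the counts mentioned in the lesson; rows 'b' and 'c' are invented.
confusion = [
    [50, 15, 3],   # actual a: 50 correct, 15 predicted as b, 3 predicted as c
    [10, 60, 6],   # actual b (hypothetical counts)
    [2, 4, 11],    # actual c (hypothetical counts)
]

print(f"accuracy: {accuracy(confusion):.3f}")
print(f"misclassified: {misclassified(confusion)}")
```

A matrix with most of its weight on the diagonal gives an accuracy close to 1, which is the pattern the lesson says we like to see.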
55 00:03:00,979 --> 00:03:03,099 Now, if I click it here, 56 00:03:03,099 --> 00:03:05,489 I get a configuration panel. 57 00:03:05,489 --> 00:03:10,559 I clicked J48 in this menu, and I get a configuration panel, which 58 00:03:10,559 --> 00:03:12,969 gives a bunch of parameters. 59 00:03:12,969 --> 00:03:14,359 I'm not going to 60 00:03:14,359 --> 00:03:18,659 really talk about these parameters. Let's just look at one of them, the unpruned 61 00:03:18,659 --> 00:03:20,849 parameter, which by default is false. 62 00:03:20,849 --> 00:03:22,730 What we've just done is to build a 63 00:03:22,730 --> 00:03:26,659 pruned tree, because unpruned is False. 64 00:03:26,659 --> 00:03:28,709 We can change this to 65 00:03:28,709 --> 00:03:31,949 make it true and build an unpruned tree. 66 00:03:31,949 --> 00:03:33,499 We've changed the configuration. 67 00:03:33,499 --> 00:03:36,059 We can run it again. 68 00:03:36,059 --> 00:03:38,999 It just ran again, and now we have 69 00:03:38,999 --> 00:03:43,209 a potentially different result. 70 00:03:43,209 --> 00:03:48,149 Let's just have a look. We have 67% correct classification. 71 00:03:48,149 --> 00:03:49,739 What did we have before? 72 00:03:49,739 --> 00:03:52,579 These are the runs. This is the previous run, 73 00:03:52,579 --> 00:03:54,040 and there we had 74 00:03:54,040 --> 00:03:57,139 66.8%. 75 00:03:57,139 --> 00:04:01,109 Now, in this run that we've just done with 76 00:04:01,109 --> 00:04:06,939 the unpruned tree, we've got 67% accuracy, 77 00:04:06,939 --> 00:04:11,559 and the tree is the same size. 78 00:04:11,559 --> 00:04:14,619 That's one option. 79 00:04:14,619 --> 00:04:18,239 I'm just going to look at another option, and then we'll look at some trees. 80 00:04:18,239 --> 00:04:20,430 I'm going to click the configuration panel again, 81 00:04:20,430 --> 00:04:24,930 and I'm going to change 82 00:04:26,330 --> 00:04:30,439 the minNumObj parameter. 83 00:04:30,439 --> 00:04:32,229 What is that? 
84 00:04:32,229 --> 00:04:36,470 That is the minimum number of instances per leaf. 85 00:04:36,470 --> 00:04:38,969 I'm going to change that from 2 86 00:04:38,969 --> 00:04:41,169 up to 15 87 00:04:41,169 --> 00:04:44,599 to have larger leaves. 88 00:04:44,599 --> 00:04:47,090 These are the leaves of the tree here, 89 00:04:47,090 --> 00:04:49,610 and these numbers in brackets are the number of 90 00:04:49,610 --> 00:04:53,419 instances that get to the leaf. When there are two numbers, this means that one 91 00:04:53,419 --> 00:04:56,699 incorrectly classified instance got to this leaf and five correctly 92 00:04:56,699 --> 00:04:59,159 classified instances got there. 93 00:04:59,159 --> 00:05:00,000 You can see that all of 94 00:05:00,000 --> 00:05:01,730 these leaves are pretty small, 95 00:05:01,730 --> 00:05:03,810 with sometimes just two or three 96 00:05:03,810 --> 00:05:05,530 or here is one with 31 97 00:05:05,530 --> 00:05:09,630 instances. We've constrained now this number, 98 00:05:09,630 --> 00:05:12,730 the tree is going to be generated, and this number is always going to be 99 00:05:12,730 --> 00:05:16,670 15 or more. Let's run it again. 100 00:05:16,670 --> 00:05:17,630 Now we've got 101 00:05:17,630 --> 00:05:22,080 a worse result, 61% correct classification, but a much 102 00:05:22,080 --> 00:05:25,920 smaller tree, 103 00:05:25,920 --> 00:05:30,920 with only eight leaves. 104 00:05:32,470 --> 00:05:35,630 Now, we can visualize this tree. 105 00:05:35,630 --> 00:05:37,660 If I right click 106 00:05:37,660 --> 00:05:41,910 on the line—these are the lines that describe each of the runs that we've done, and this 107 00:05:41,910 --> 00:05:45,360 is the third run—if I right click on that, I get a little menu, 108 00:05:45,360 --> 00:05:49,220 and I can visualize the tree. 109 00:05:49,220 --> 00:05:53,660 There it is. If I right click on empty space, I can fit this to the screen. 110 00:05:54,880 --> 00:05:57,940 This is the decision tree. 
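As a side note on the bracketed numbers at the leaves: the sketch below assumes a leaf is printed with a trailing annotation such as `(6.0/1.0)`, where the first number is how many instances reached the leaf and the optional second number is how many of those were misclassified. That format, the helper name `parse_leaf`, and the sample lines (including their thresholds) are assumptions made for illustration, not output copied from the lesson.

```python
import re

# Matches a trailing "(reached)" or "(reached/wrong)" annotation at the end of a line.
LEAF_PATTERN = re.compile(r"\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def parse_leaf(line):
    """Return (correct, incorrect) instance counts for one printed leaf line."""
    match = LEAF_PATTERN.search(line)
    if match is None:
        raise ValueError(f"no leaf annotation found in: {line!r}")
    reached = float(match.group(1))
    # A leaf with a single number has no misclassified instances.
    wrong = float(match.group(2)) if match.group(2) else 0.0
    return reached - wrong, wrong


# Hypothetical leaf lines; the attribute thresholds are invented.
print(parse_leaf("K <= 0.03: tableware (6.0/1.0)"))
print(parse_leaf("Ba > 0.27: headlamps (31.0)"))
```

Under this reading, a leaf annotated `(6.0/1.0)` matches the case described in the lesson: five correctly classified instances and one incorrectly classified instance. With minNumObj set to 15, the first number at every leaf would be 15 or more.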
This says first look at the 111 00:05:57,940 --> 00:05:59,850 barium (Ba) content. 112 00:05:59,850 --> 00:06:02,910 If it's large, then it must be headlamps. 113 00:06:02,910 --> 00:06:05,700 If it's small, then look at magnesium (Mg). 114 00:06:05,700 --> 00:06:11,280 If that's small, then let's look at potassium (K), and if that's small, then we've got tableware. 115 00:06:11,280 --> 00:06:16,320 That sounds like a pretty good thing to me; I don't want too much potassium in my tableware. 116 00:06:16,320 --> 00:06:18,560 This is a visualization of the tree 117 00:06:18,560 --> 00:06:24,470 and it's the same tree that you can see by looking here. 118 00:06:24,470 --> 00:06:30,580 This is a different representation of the same tree. 119 00:06:30,580 --> 00:06:33,540 I'll just show you one more thing about this configuration panel, 120 00:06:33,540 --> 00:06:36,930 the More button. This gives you more information 121 00:06:36,930 --> 00:06:39,350 about the classifier, 122 00:06:39,350 --> 00:06:41,190 about J48. 123 00:06:41,190 --> 00:06:44,230 It's always useful to look at that to see where these classifiers have come from. 124 00:06:47,970 --> 00:06:49,000 In this case, 125 00:06:49,000 --> 00:06:52,910 let me explain why it's called J48. It's based on a famous 126 00:06:52,910 --> 00:06:56,070 system that's called C4.5, which was described in a book. 127 00:06:56,070 --> 00:06:57,880 The book is referenced here. 128 00:06:57,880 --> 00:06:59,260 In fact, I think I've got it 129 00:06:59,260 --> 00:07:01,290 on my shelf here. This book here, 130 00:07:01,290 --> 00:07:05,830 "C4.5: Programs for Machine Learning" by an Australian 131 00:07:05,830 --> 00:07:09,250 computer scientist called Ross Quinlan. 132 00:07:09,250 --> 00:07:12,460 He started out with a system called ID3— 133 00:07:12,460 --> 00:07:14,740 I think that might have been in his PhD thesis— 134 00:07:14,740 --> 00:07:18,630 and then C4.5 became quite famous. 
This kind of morphed through various 135 00:07:18,630 --> 00:07:20,750 versions into C4.5. 136 00:07:20,750 --> 00:07:25,080 It became famous; the book came out, and so on. He continued to work on this system. 137 00:07:25,080 --> 00:07:26,880 It went up to C4.8, 138 00:07:26,880 --> 00:07:30,950 and then he went commercial. Up until then, these were all open source 139 00:07:30,950 --> 00:07:32,070 systems. 140 00:07:32,070 --> 00:07:33,890 When we built Weka, 141 00:07:33,890 --> 00:07:37,420 we took the latest version 142 00:07:37,420 --> 00:07:39,900 of C4.5, 143 00:07:39,900 --> 00:07:41,380 which was C4.8, 144 00:07:41,380 --> 00:07:45,500 and we rewrote it. Weka's written in Java, so we called it J48. 145 00:07:45,500 --> 00:07:47,410 Maybe it's not a 146 00:07:47,410 --> 00:07:48,810 very good name, 147 00:07:48,810 --> 00:07:50,500 but that's the name that stuck. 148 00:07:50,500 --> 00:07:54,240 There's a little bit of history for you. 149 00:07:54,240 --> 00:07:57,950 We've talked about classifiers in Weka. 150 00:07:57,950 --> 00:08:00,380 I've shown you where you find the classifiers. We classified the glass 151 00:08:00,380 --> 00:08:04,260 dataset. We looked at how to interpret the output from J48, in 152 00:08:04,260 --> 00:08:09,170 particular the confusion matrix. We looked at the configuration panel for J48. 153 00:08:09,170 --> 00:08:12,810 We looked at a couple of options: pruned versus unpruned trees and the option to 154 00:08:12,810 --> 00:08:14,330 avoid small leaves. 155 00:08:14,330 --> 00:08:15,530 I told you how 156 00:08:15,530 --> 00:08:18,850 J48 really corresponds to the machine learning system that 157 00:08:18,850 --> 00:08:24,670 most people know as C4.5. C4.5 and C4.8 were really pretty similar, 158 00:08:24,670 --> 00:08:26,030 so we just talk 159 00:08:26,030 --> 00:08:30,450 about J48 as if it's synonymous with C4.5. 
160 00:08:30,450 --> 00:08:32,220 You can read about this in the book— 161 00:08:32,220 --> 00:08:35,930 Section 11.1 about Building a decision tree and Examining the output. 162 00:08:35,930 --> 00:08:40,520 Now, off you go, and do the activity associated with this lesson. 163 00:08:40,520 --> 00:08:47,520 See you again soon!