Hi! Here in Lesson 3.4, we're continuing our exploration of simple classifiers by looking at classifiers that produce decision trees. We're going to look at J48. We've used this classifier quite a bit so far. Let's have a look at how it works inside.

J48 is based on a top-down strategy, a recursive divide-and-conquer strategy. You select which attribute to split on at the root node, and then you create a branch for each possible attribute value. That splits the instances into subsets, one for each branch that extends from the root node. Then you repeat the procedure recursively for each branch, selecting an attribute at each node, and you use only the instances that reach that branch to make the selection. At the end you stop; perhaps you continue splitting until all the instances at a node have the same class. The trick, the real question, is how you select a good attribute for the root node.
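To make that procedure concrete, here is a minimal Python sketch of the recursive divide-and-conquer idea. It's only an illustration, not Weka's implementation: the names build_tree and choose_attribute are made up for this sketch, and choose_attribute stands in for exactly the decision we haven't made yet, namely which attribute to split on.

```python
from collections import Counter

def build_tree(rows, attributes, choose_attribute):
    """Top-down, recursive divide-and-conquer induction of a decision tree.

    rows             -- list of tuples, with the class label in the last position
    attributes       -- dict mapping attribute name -> column index
    choose_attribute -- heuristic that picks which attribute to split on at this node
    """
    classes = Counter(row[-1] for row in rows)
    # Stop and make a leaf when the node is pure (every instance has the
    # same class) or there is nothing left to split on.
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]

    # Otherwise select an attribute for this node...
    attr = choose_attribute(rows, attributes)
    index = attributes[attr]
    remaining = {name: i for name, i in attributes.items() if name != attr}

    # ...create one branch per attribute value, and recurse using only the
    # instances that reach that branch.
    branches = {}
    for row in rows:
        branches.setdefault(row[index], []).append(row)
    return (attr, {value: build_tree(subset, remaining, choose_attribute)
                   for value, subset in branches.items()})
```

The scoring heuristic you plug in as choose_attribute is the interesting part, and that is what the rest of this lesson is about.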
This is the weather data, and as you can see, outlook has been selected for the root node. Here are the four possibilities: outlook, windy, humidity, and temperature. These are the consequences of splitting on each of these attributes. What we're really looking for is a pure split, a split into pure nodes. We would be delighted if we found an attribute that split exactly into one node where they are all yeses, another node where they are all nos, and perhaps a third node where they are all yeses again. That would be the best thing. What we don't want is mixtures, because when we get mixtures of yeses and nos at a node, then we've got to split again. You can see that splitting on outlook looks pretty good. We get one branch with two yeses and three nos, then we get a pure yes branch for overcast, and, when outlook is rainy, we get three yeses and two nos. How are we going to quantify this to decide which one of these attributes produces the purest nodes? We're on a quest here for purity.

The aim is to get the smallest tree, and top-down tree induction methods use some kind of heuristic. The most popular heuristic to produce pure nodes is an information theory-based heuristic. I'm not going to explain information theory to you; that would be another MOOC of its own (quite an interesting one, actually). Information theory was founded by Claude Shannon, an American mathematician and scientist who died about 12 years ago. He was an amazing guy. He did some amazing things. One of the most amazing things, I think, is that he could ride a unicycle and juggle clubs at the same time when he was in his 80s. That's pretty impressive.

He came up with the whole idea of information theory and quantifying entropy, which measures information in bits. This is the formula for entropy: entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 ... - pn log pn, the sum of p log p terms for each of the possible outcomes (logarithms to base 2, so the answer comes out in bits). I'm not really going to explain it to you. All of those minus signs are there because logarithms are negative when numbers are less than 1, and probabilities are always less than 1, so the entropy comes out to be a positive number.

What we do is look at the information gain: how much information, in bits, do you gain by knowing the value of an attribute? That is, the entropy of the distribution before the split minus the entropy of the distribution after the split.

Here's how it works out for the weather data. These are the numbers of bits. If you split on outlook, you gain 0.247 bits. I know you might be surprised to see fractional numbers of bits; normally we think of 1 bit or 8 bits or 32 bits, but information theory shows how you can regard bits as fractions, and these calculations produce fractional numbers of bits. I don't want to go into the details. You can see that knowing the value of windy gives you only 0.048 bits of information. Humidity is quite a bit better; temperature is way down there at 0.029 bits. We're going to choose the attribute that gains the most bits of information, and that, in this case, is outlook. At the top level of this tree, the root node, we're going to split on outlook.

Having decided to split on outlook, we need to look at each of the 3 branches that emanate from outlook, corresponding to the 3 possible values of outlook, and consider what to do at each of those branches. At the first branch, we might split on temperature, windy, or humidity. We're not going to split on outlook again, because we know that outlook is sunny: for all the instances that reach this place, the outlook is sunny. For each of those 3 attributes, we do exactly the same thing. We evaluate the information gain at that point for temperature, for windy, and for humidity, and we choose the best. In this case, it's humidity, with a gain of 0.971 bits.
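As a sanity check on those numbers, here is a small Python sketch that computes entropy and information gain on the 14-instance nominal weather data. It's an illustration of the calculation, not Weka code; the helper names (entropy, info_gain, ATTRS) are invented for this sketch.

```python
import math
from collections import Counter

# The 14-instance nominal weather data.
# Columns: outlook, temperature, humidity, windy, then the class (play).
weather = [
    ("sunny",    "hot",  "high",   "false", "no"),
    ("sunny",    "hot",  "high",   "true",  "no"),
    ("overcast", "hot",  "high",   "false", "yes"),
    ("rainy",    "mild", "high",   "false", "yes"),
    ("rainy",    "cool", "normal", "false", "yes"),
    ("rainy",    "cool", "normal", "true",  "no"),
    ("overcast", "cool", "normal", "true",  "yes"),
    ("sunny",    "mild", "high",   "false", "no"),
    ("sunny",    "cool", "normal", "false", "yes"),
    ("rainy",    "mild", "normal", "false", "yes"),
    ("sunny",    "mild", "normal", "true",  "yes"),
    ("overcast", "mild", "high",   "true",  "yes"),
    ("overcast", "hot",  "normal", "false", "yes"),
    ("rainy",    "mild", "high",   "true",  "no"),
]
ATTRS = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3}

def entropy(rows):
    """Entropy, in bits, of the class distribution (last column) of rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, attributes=ATTRS):
    """Entropy before the split minus the weighted entropy after splitting on attr."""
    index = attributes[attr]
    subsets = {}
    for row in rows:
        subsets.setdefault(row[index], []).append(row)
    after = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(rows) - after

# Gains at the root node.
for attr in ATTRS:
    print(f"{attr:11s} {info_gain(weather, attr):.3f} bits")

# Restrict to the sunny branch and re-evaluate humidity there.
sunny = [row for row in weather if row[0] == "sunny"]
print(f"humidity on the sunny branch: {info_gain(sunny, 'humidity'):.3f} bits")
```

Run as-is, it should print roughly 0.247 bits for outlook, 0.152 for humidity, 0.048 for windy, 0.029 for temperature, and 0.971 for humidity on the sunny branch, matching the figures quoted above. If you wanted to plug this into the earlier build_tree sketch, a chooser like lambda rows, attrs: max(attrs, key=lambda a: info_gain(rows, a, attrs)) would do the job.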
You can see that, if we branch on humidity, then we get pure nodes: 3 nos in one and 2 yeses in the other. When we get that, we don't need to split anymore. We're on a quest for purity. That's how it works. It just carries on until it reaches the end, until it has pure nodes.

Let's open up Weka and just do this with the nominal weather data. Of course, we've done this before, but I'll just do it again; it won't take long. J48 is the workhorse data mining algorithm. There's the data. We're going to choose J48; it's a tree classifier. We're going to run this, and we get a tree, the very tree I showed you before, split first on outlook: sunny, overcast, and rainy. Then, if it's sunny, split on humidity: when humidity is high, 3 no instances reach that node; when it's normal, 2 yes instances reach that node; and so on. We can look at the tree using Visualize tree in the right-click menu. Here it is. These are the number of yes instances that reach this node and the number of no instances. In the case of this particular tree, of course, we're using cross-validation here, so it's done an 11th run on the whole dataset, and it's given us these numbers by looking at the training set. In fact, this becomes a pure node here. Sometimes you get 2 numbers here, like 3/2 or 3/1. The first number indicates the number of correct things that reach that node, so in this case the number of nos. If there were another number following the 3, that would indicate the number of yeses, that is, incorrect things that reach that node. But that doesn't occur in this very simple situation.

There you have it, J48: top-down induction of decision trees. It's soundly based in information theory. It's a pretty good data mining algorithm. 10 years ago I might have said it's the best data mining algorithm, but some even better ones, I think, have been produced since then. However, the real advantage of J48 is that it's reliable and robust, and, most importantly, it produces a tree that people can understand. It's very easy to understand the output of J48. That's really important when you're applying data mining.

There are a lot of different criteria you could use for attribute selection. Here we're using information gain. Actually, in practice, these don't normally make a huge difference.
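For instance, the Gini index used by CART-style trees is one well-known alternative to information gain. The sketch below is just an illustration of that (it is not what J48 uses); it reuses the weather rows and ATTRS mapping from the earlier snippet, and the names gini and gini_gain are invented here. It ranks the four attributes by reduction in Gini impurity instead, and outlook still comes out on top for this dataset, which illustrates why the choice of criterion often doesn't change much.

```python
from collections import Counter

def gini(rows):
    """Gini impurity of the class distribution (last column) of rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_gain(rows, attr, attributes=ATTRS):
    """Reduction in Gini impurity from splitting on attr."""
    index = attributes[attr]
    subsets = {}
    for row in rows:
        subsets.setdefault(row[index], []).append(row)
    after = sum(len(s) / len(rows) * gini(s) for s in subsets.values())
    return gini(rows) - after

# Rank the four attributes by Gini gain; the ordering matches information gain here.
for attr in sorted(ATTRS, key=lambda a: gini_gain(weather, a), reverse=True):
    print(f"{attr:11s} {gini_gain(weather, attr):.3f}")
```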
There are some important modifications that need to be made to this algorithm for it to be useful in practice. I've only really explained the basic principles here; the actual J48 incorporates some more complex stuff to make it work under different circumstances in practice. We'll talk about those in the next lesson.

Section 4.3 of the text, Divide-and-conquer: Constructing decision trees, explains the simple version of J48 that I've explained here. Now you should go and do the activity associated with this lesson. Good luck! See you next time!