Hi! Here in Lesson 3.4 we're continuing our exploration of simple classifiers by looking at classifiers that produce decision trees.
We're going to look at J48. We've used this classifier quite a bit so far. Let's have a look at how it works inside. J48 is based on a top-down strategy, a recursive divide and conquer strategy. You select which attribute to split on at the root node, and then you create a branch for each possible attribute value, and that splits the instances into subsets, one for each branch that extends from the root node. Then you repeat the the procedure recursively for each branch, selecting an attribute at each node, and you use only instances that reach that [node] to make the selection. At the end you stop: perhaps you might continue until all instances have the same class. The trick is, the question is, how do you select a good attribute for the root node?
This is the weather data, and as you can see, "outlook" has been selected for the root node. Here are the four possibilities: outlook, windy, humidity, and temperature. These are the consequences of splitting on each of these attributes. What we're really looking for is a pure split, a split into pure nodes. We would be delighted if we found an attribute that split exactly into one node where they are all yes's, another node where they are all no's, and perhaps a third node where they are all yes's again. That would be the best thing. What we don't want is mixtures, because when we get mixtures of yes's and no's at a node, then we've got to split again.
You can see that splitting on "outlook" looks pretty good. We get one branch with two yes's and three no's, then we get a pure yes branch for "overcast", and when "outlook" is "rainy" we get three yes's and two no's.
How are we going to quantify this to decide which one of these attributes produces the purest nodes? We're on a quest here for purity.
The aim is to get the smallest tree, and top-down tree induction methods use some kind of heuristic. The most popular heuristic to produce pure nodes is an information theory-based heuristic. I'm not going to explain information theory to you: that would be another MOOC of its own -- quite an interesting one, actually.
Information theory was founded by Claude Shannon, an American mathematician and scientist who died about 12 years ago. He was an amazing guy. He did some amazing things. One of the most amazing things, I think, is that he could ride a unicycle and juggle clubs at the same time -- when he was in his 80's. That's pretty impressive. He came up the whole idea of information theory and quantifying entropy, which measures information in bits.
This is the formula for entropy: the sum of p log p's for each of the possible outcomes. I'm not really going to explain it to you. All of those minus signs are there because logarithms are negative if numbers are less than 1, and probabilities always are less than 1, so the entropy comes out to be a positive number.
What we do is we look at the information gain. How much information in bits do you gain by knowing the value of an attribute? That is, the entropy of the distribution before the split minus the entropy of the distribution after the split.
Here's how it works out for the weather data. These are the number of bits. If you split on "outlook", you gain 0.247 bits. Now I know you might be surprised to see fractional numbers of bits -- normally we think of 1 bit or 8 bits or 32 bits -- but information theory shows how you can regard bits as fractions, and these produce fractional numbers of bits. I don't want to go into the details. You can see that knowing the value for "windy" gives you only 0.048 bits of information. "Humidity" is quite a bit better; "temperature" is way down there at 0.029 bits.
We're going to choose the attribute that gains the most bits of information, and that, in this case, is "outlook". So at the top level of this tree, the root node, we're going to split on "outlook".
Having decided to split on "outlook", we need to look at each of the 3 branches that emanate from "outlook" corresponding to the 3 possible values of "outlook", and consider what to do at each of those branches. At the first branch, we might split on "temperature", "windy" or "humidity". We're not going to split on "outlook" again, because we know that outlook is sunny -- for all instances that reach this place, the outlook is sunny. For the other 3 things, we do exactly the same thing. We evaluate the information gain for "temperature" at that point, for "windy", and for "humidity", and we choose the best. In this case, it's humidity, with a gain of 0.971 bits. You can see that if we branch on humidity, then we get pure nodes: 3 no's in one and 2 yes's in the other. When we get that, we don't need to split anymore. We're on a quest for purity.
That's how it works. It just carries on until it reaches the end, until it has pure nodes. Let's open up Weka, and just do this with the nominal weather data. Of course, we've done this before, but I'll just do it again, it won't take long. J48 is a workhorse data mining algorithm.
There's the data. We're going to choose J48. It's a tree classifier. We're going to run this, and we get a tree -- the very tree I showed you before. Split first on "outlook", with values "sunny", "overcast" and "rainy". Then if it's "sunny", split on "humidity": 3 instances reach that node. Then split on ["windy"]: 3 "yes" instances reach that node, and so on. We can look at the tree using "Visualize the tree" in the right-click menu; here it is. These are the number of "yes" instances that reach this node, and the number of "no" instances.
In the case of this particular tree -- of course, we're using cross-validation here, so it's done an 11th run on the whole dataset. It's given us these numbers by looking at the training set. In fact, this becomes a pure node here. Sometimes you get 2 numbers here -- 3/2 or 3/1. The first number indicates the number of correct things that reach that node, so in this case the number of no's. If there was another number following the 3, that would indicate the number of yes's, that is, incorrect things that reach that node. But that doesn't occur in this very simple situation.
There you have it, J48: top-down induction of decision trees. It's soundly based in information theory. It's a pretty good data mining algorithm. Ten years ago I might have said it's the best data mining algorithm, but some even better ones, I think, have been produced since then. However, the real advantage of J48 is that it's reliable and robust, and, most importantly, it produces a tree that people can understand. It's very easy to understand the output of J48. That's really important when you're applying data mining.
There are a lot of different criteria you could use for attribute selection. Here we're using information gain. Actually, in practice, these don't normally make a huge difference.
There are some important modifications that need to be done to this algorithm to be useful in practice. I've only really explained the basic principles. The actual J48 incorporates some more complex stuff to make it work under different circumstances in practice. We'll talk about those in the next lesson.
Section 4.3 of the text, "Divide-and-conquer: Constructing decision trees", explains the simple version of J48 that I've explained here.
Now you should go and do the activity associated with this lesson. Good luck! See you next time!