Hi! We've just finished Class 3, and here are some of the issues that arose. I have a list of them here, so let's start at the top. Numeric precision in activities has caused a little bit of unnecessary angst. So we've simplified our policy. In general, we're asking you to round your percentages to the nearest integer. We certainly don't want you typing in those 4 decimal places, because that accuracy is misleading.
Some people are getting the wrong results in Weka. One reason you might get the wrong results is that the random seed is not set to its default value. Whenever you change the random seed, it stays changed until you change it back or until you restart Weka. So just restart Weka, or reset the random seed to 1. Another thing you should do is check your version of Weka. We asked you to download 3.6.10. There have been some bug fixes since the previous version, so you really do need to use this new version.
One of the activities asked you to copy an attribute, and some people found some surprising things with Weka claiming 100% accuracy. If you accidentally ask Weka to predict something that's already there as an attribute, it will do very well, with very high accuracy! It's very easy to mislead yourself when you're doing data mining. You just need to make sure you know what you're trying to predict, you know what the attributes are, and you haven't accidentally included a copy of the class attribute as one of the attributes that's being used for prediction.
There's been some discussion on the mailing list about whether OneR is really always better than ZeroR on the training set. In fact, it is, in the sense that it never does worse. Someone proved it. (Thank you Jurek for sharing that proof with us.)
Someone else found a counterexample! "If we had a dataset with 10 instances, 6 belonging to Class A and 4 belonging to Class B, with attribute values selected randomly, wouldn't ZeroR outperform OneR? -- OneR would be fooled by the randomness of attribute values." It's kind of anthropomorphic to talk about OneR being "fooled by" things. It's not fooled by anything. It's not a person; it's not a being: it's just an algorithm. It just gets an input and does its thing with the data. If you think that OneR might be fooled, then why don't you try it? Set up this dataset with 10 instances, 6 in A and 4 in B, select the attribute values randomly, and see what happens. I think you'll be able to convince yourself quite easily that this counterexample isn't a counterexample at all. Each OneR rule predicts the majority class for its attribute value, so on the training data it can't do worse than predicting the overall majority class everywhere, which is what ZeroR does. It is definitely true that OneR is never worse than ZeroR on the training set. That doesn't necessarily mean it's going to be better on an independent test set, of course.
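If you'd like to try the experiment outside Weka, here is a minimal sketch in Python. It is not Weka's OneR implementation: it assumes nominal attributes only, and the function names, the number of attributes, and the number of attribute values are all made up for illustration. It counts how often OneR gets fewer training instances right than ZeroR over many random 6-A/4-B datasets.

```python
import random
from collections import Counter, defaultdict

def zero_r_correct(labels):
    """Training instances ZeroR gets right: the size of the majority class."""
    return max(Counter(labels).values())

def one_r_correct(rows, labels):
    """Training instances OneR gets right with its best single attribute.

    For each attribute, every attribute value predicts the majority class
    among the training instances that have that value.
    """
    best = 0
    for a in range(len(rows[0])):
        by_value = defaultdict(Counter)
        for row, label in zip(rows, labels):
            by_value[row[a]][label] += 1
        correct = sum(max(c.values()) for c in by_value.values())
        best = max(best, correct)
    return best

def run_trials(n_trials=1000, n_attributes=3, n_values=3, seed=1):
    """Count random 6-A/4-B datasets where OneR trails ZeroR on training data."""
    rng = random.Random(seed)
    labels = ["A"] * 6 + ["B"] * 4
    worse = 0
    for _ in range(n_trials):
        # Hypothetical setup: attribute values drawn uniformly at random.
        rows = [[rng.randrange(n_values) for _ in range(n_attributes)]
                for _ in labels]
        if one_r_correct(rows, labels) < zero_r_correct(labels):
            worse += 1
    return worse

print(run_trials())  # prints 0
```

The count is 0 whatever the seed, because within each attribute value the majority class count is at least the number of A's with that value, and those add up to at least 6.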
The next thing is Activity 3.3, which asks you to repeat attributes with NaiveBayes. Some people asked "why are we doing this?" It's just an exercise! We're just trying to understand NaiveBayes a bit better, and what happens when you get highly correlated attributes, like repeated attributes. With NaiveBayes, enough repetitions mean that the other attributes won't matter at all. This is because all attributes contribute equally to the decision, so multiple copies of an attribute skew it in that direction. This is not true with other learning algorithms. It's true for NaiveBayes, but it's not true for OneR or J48, for example. Copied attributes don't affect OneR at all. The copying exercise is just to illustrate what happens with NaiveBayes when you have non-independent attributes. It's not something you do in real life, although you might copy an attribute in order to transform it in some way, for example.
Someone asked about the mathematics. In Bayes' formula you get Pr[E1|H]^k in the top line if the attribute E1 is repeated k times. How does this work mathematically? First of all, I'd just like to say that the Bayes formulation assumes independent attributes. The expansion is not valid if the attributes are dependent. But the algorithm is based on it, so let's see what would happen. If you can stomach a bit of mathematics, here's the equation for the probability of the hypothesis given the evidence (Pr[H|E]). H might be Play is "yes" or Play is "no", for example, in the weather data. It's equal to a fairly complicated formula, which I'll simplify by writing "..." for all the other terms: Pr[H|E] = Pr[E1|H]^k ... / Pr[E], where E1 is repeated k times. What the algorithm does: because we don't know Pr[E], it calculates the top line for "yes" and for "no", and normalizes the 2 so that they add up to 1. That gives Pr[yes|E] = Pr[E1|yes]^k ... / (Pr[E1|yes]^k ... + Pr[E1|no]^k ...). If you look at this formula and just forget about the "...", what's going to happen is that these probabilities are less than 1. If we take them to the k'th power, they are going to get very small as k gets bigger. In fact, they're going to approach 0. But one of them is going to approach 0 faster than the other one. Whichever one is bigger -- for example, if the "yes" one is bigger than the "no" one -- is going to dominate. The normalized probability then is going to be 1 if the "yes" probability is bigger than the "no" probability, otherwise 0. That's what's actually going to happen in this formula as k approaches infinity. The result is as though there is only one attribute: E1. That's a mathematical explanation of what happens when you copy attributes in NaiveBayes.
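To see that limit numerically, here is a small sketch of the normalized formula. The numbers are invented for illustration, not taken from the weather data: Pr[E1|yes] = 0.6, Pr[E1|no] = 0.4, and the "..." terms (the other likelihoods times the prior) are assumed to be 0.1 for "yes" and 0.3 for "no", so the other evidence favours "no".

```python
def posterior_yes(k, like_yes, like_no, rest_yes, rest_no):
    """Normalized Pr[yes|E] when attribute E1 appears k times.

    like_yes, like_no: Pr[E1|yes] and Pr[E1|no].
    rest_yes, rest_no: the "..." terms (other likelihoods times the prior).
    """
    top_yes = like_yes ** k * rest_yes
    top_no = like_no ** k * rest_no
    return top_yes / (top_yes + top_no)

# Hypothetical values: E1 favours "yes", everything else favours "no".
for k in (1, 2, 3, 10, 50):
    print(k, round(posterior_yes(k, 0.6, 0.4, 0.1, 0.3), 4))
```

With these made-up numbers the prediction is "no" for one or two copies of E1 and flips to "yes" from the third copy onward, and the normalized probability then climbs toward 1: the repeated attribute ends up overriding everything else.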
Don't worry if you didn't follow that; that was just for someone who asked.
Decision trees and bits. Someone said on the mailing list that in the lecture there was a condition that resulted in branches with all "yes" or all "no" results completely determining things. Why was the information gain only 0.971 and not the full 1 bit? This is the picture they were talking about. Here, "humidity" determines these are all "no" and these are all "yes" for high and normal humidity, respectively. When you calculate the information gain -- and this is the formula for information gain -- you get 0.971 bits. You might expect 1 (and I would agree), and you would get 1 if you had 3 no's and 3 yes's here, or if you had 2 no's and 2 yes's. But because there is a slight imbalance between the number of no's and the number of yes's, you don't actually get 1 bit under these circumstances.
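Here is a quick way to check those figures. This is a sketch that assumes the node in the picture holds 3 no's and 2 yes's, which humidity splits perfectly; the function names are my own, and the counts are an assumption, since the picture itself isn't reproduced here.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = sum(parent)
    remainder = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - remainder

# Assumed counts: 3 no's and 2 yes's at the node, split perfectly by humidity.
print(round(information_gain([3, 2], [[3, 0], [0, 2]]), 3))  # 0.971
# A balanced node with the same perfect split gives the full bit.
print(round(information_gain([3, 3], [[3, 0], [0, 3]]), 3))  # 1.0
```

Both children are pure, so the gain is simply the entropy of the node before the split, and that only reaches 1 bit when the two classes are equally frequent.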
There were some questions on Class 2 about stratified cross-validation, which tries to get the same proportion of class values in each fold. Some suggested maybe you should choose the number of folds so that it can do this exactly, instead of approximately. If you chose as the number of folds an exact divisor of the number of elements in each class, we'd be able to do this exactly. "Would that be a good thing to do?" was the question. The answer is no, not really. These things are all estimates, and you're treating them as though they were exact answers. They are all just estimates. There are more important considerations to take into account when determining the number of folds to do in your cross-validation. Like: you want a large enough test set to get an accurate estimate of the classification performance, and you want a large enough training set to train the classifier adequately. Don't worry about stratification being approximate. The whole thing is pretty approximate actually.
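To make "approximate" concrete, here is a minimal sketch of stratified fold assignment. It is a simple round-robin scheme of my own, not Weka's implementation, and the example labels assume a 9 yes / 5 no class split like the weather data's.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_folds):
    """Assign each instance index to a fold, dealing each class out in turn.

    A simple round-robin scheme (not Weka's own implementation): instances
    are grouped by class and dealt to folds like cards, so the per-class
    counts in any two folds differ by at most one.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for indices in by_class.values():
        for index in indices:
            folds[position % n_folds].append(index)
            position += 1
    return folds

# Assumed example: 14 instances, 9 "yes" and 5 "no", split into 3 folds.
labels = ["yes"] * 9 + ["no"] * 5
for fold in stratified_folds(labels, 3):
    print(dict(Counter(labels[i] for i in fold)))
```

Two of the folds get 3 yes's and 2 no's and the third gets 3 yes's and 1 no, because 5 doesn't divide by 3. That's as close as you can get, and it's close enough.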
Someone else asked "why is there a 'Use training set' option on the Classify tab?" It's very misleading to take the evaluation you get on the training data seriously, as we know. So why is it there in Weka? Well, we might want it for some purposes. For example, it does give you a quick upper bound on an algorithm's performance: you wouldn't expect it to do better on new data than it does on the training set. That might be useful, allowing you to quickly reject a learning algorithm. The important thing here is to understand what is wrong with using the training set for a performance estimate, and what overfitting is. Rather than changing the interface so you can't do bad things, I would rather protect you by educating you about what the issues are here.
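Here is an extreme illustration of what's wrong with the training-set estimate. It's a toy sketch, not anything in Weka: a made-up "memorizer" that stores every training instance by a hypothetical ID attribute and falls back to the majority class for anything it hasn't seen. The labels are random, so there is nothing real to learn.

```python
import random
from collections import Counter

def train_memorizer(ids, labels):
    """Memorize every training instance; fall back to the majority class."""
    table = dict(zip(ids, labels))
    fallback = Counter(labels).most_common(1)[0][0]
    return lambda instance_id: table.get(instance_id, fallback)

def accuracy(classifier, ids, labels):
    """Fraction of instances the classifier labels correctly."""
    hits = sum(classifier(i) == y for i, y in zip(ids, labels))
    return hits / len(labels)

rng = random.Random(1)
train_ids = list(range(100))
train_labels = [rng.choice(["yes", "no"]) for _ in train_ids]
test_ids = list(range(100, 200))
test_labels = ["yes"] * 50 + ["no"] * 50  # unseen instances, balanced classes

model = train_memorizer(train_ids, train_labels)
print(accuracy(model, train_ids, train_labels))  # 1.0
print(accuracy(model, test_ids, test_labels))    # 0.5
```

The memorizer looks perfect on the training set and is no better than a coin flip on new data. That gap is overfitting, and it's why the training-set figure is only useful as an optimistic bound.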
There have been quite a few suggested topics for a follow-up course: attribute selection, clustering, the Experimenter, parameter optimization, the KnowledgeFlow interface, and the simple command-line interface. We're considering a follow-up course, and we'll be asking you for feedback on that at the end of this course.
Finally, someone said "Please let me know if there is a way to make a small donation" -- he's enjoying the course so much! Well, thank you very much. We'll make sure there is a way to make a small donation at the end of the course.
That's it for now. On with Class 4. I hope you enjoy Class 4, and we'll talk again later. Bye for now!