Hi! Well, Class 2 has gone flying by, and here are some things I'd like to discuss.
First of all, we made some mistakes in the answers to the activities. Sorry about that. We've corrected them.
Secondly -- a general point -- some people have been asking questions, for example, about huge datasets. How big a dataset can Weka deal with? The answer is pretty big, actually. But it depends on what you do, and it's a fairly complicated question to discuss. If it's not big enough, there are ways of improving things. Anyway, issues like that should be discussed on the Weka mailing list, or you should look in the Weka FAQ, where there's quite a lot of discussion on this particular issue.
Then there's the Weka API, the programming interface to Weka. You can incorporate the Weka routines in your own program. It's wonderful stuff, but it's not covered in this MOOC, so the right place to discuss those issues is the Weka mailing list.
Finally, personal emails to me. You know, there are 5,000 people on this MOOC, and I can't cope with personal emails, so please send them to the mailing list and not to me personally.
I'd like to discuss the issue of numeric precision in Weka. Weka prints percentages to 4 decimal places; it prints most numbers to 4 decimal places. That's misleadingly high precision. Don't take these figures at face value. For example, here we've done an experiment using a 40% percentage split, and we get 92.3333% accuracy printed out. Well, that's exactly the right answer to the wrong question. We're not interested in the performance on this particular test set. What we're interested in is how Weka will do in general on data from this source. We certainly can't infer that general figure to 4 decimal places from this one experiment.
In Class 2, we're trying to sensitize you to the fact that these figures aren't to be taken at face value. For example, there we are with a 40% split. If we do a 30% split we get 92.381%. The difference between these two numbers is completely insignificant. You shouldn't be saying this is better than the other number. They are both the same, really, within the amount of statistical fuzz that's involved in the experiment.
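To get a feel for how large that statistical fuzz is, here is a minimal Python sketch that simulates it. The "true" accuracy of 92.3% and the test-set size of 900 instances are assumptions chosen for illustration; neither figure is stated in the lecture, and the simulation is not something Weka does.

```python
import random

def simulated_accuracies(true_accuracy, test_size, trials, seed=0):
    """Accuracy measured on many random test sets drawn from the same source."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        # Each test instance is classified correctly with probability true_accuracy.
        correct = sum(rng.random() < true_accuracy for _ in range(test_size))
        results.append(correct / test_size)
    return results

# Assumed values for illustration only: a classifier that is right 92.3% of
# the time, evaluated on test sets of 900 instances.
accuracies = simulated_accuracies(true_accuracy=0.923, test_size=900, trials=200)
print(f"lowest  {100 * min(accuracies):.4f}%")
print(f"highest {100 * max(accuracies):.4f}%")
```

Under these assumptions the printed accuracy moves by whole percentage points from one test set to the next, so a difference in the second decimal place carries no information.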
We're trying to train you to write your answers to the nearest percentage point, or perhaps 1 decimal place. Those are the answers that are being accepted as correct. The reason we're doing that is to try to train you to think about these numbers and what they really represent, rather than just copy/pasting whatever Weka prints out. These numbers need to be interpreted. For example, in Activity 2.6 in question 2, the 4-digit answer would be 0.7354%, and 0.7 and 0.74 are the only accepted answers. In question 5, the 4-decimal place accuracy is 1.7256%, and we would accept 1.73%, 1.7% and 2%. We're a bit selective in what we'll accept here.
I want to move on to the user classifier now. Some people got some confusing results, because they created splits that involved the class attribute. When you're dealing with the test set, you don't know the class attribute -- that's what you're trying to find out. So it doesn't make sense to create splits in the decision tree that involve testing the class attribute. If you do that, you're going to get 0 accuracy on test data, because the class value cannot be evaluated on the test data. That was the cause of that confusion.
Here's the league table for the user classifier. J48 gets 96.2%, just as a reference point. Magda did really well and got very close to that, with 93.9%. It took her 6.5-7 minutes, according to the script that she mailed in. Myles did pretty well -- 93.5%. In the class, I got 78% in just a few seconds. I think if you get over 90% you're doing pretty well on this dataset for the user classifier. The point is not to get a good result, it's to think about the process of classification.
Let's move to Activity 2.2, partitioning the datasets for training and testing. Question 1 asked you to evaluate J48 with percentage split, using 10% for the training set, 20%, 40%, 60%, and 80%. What you observed is that the accuracy increases as we go through that set of numbers. "Performance always increases" for those numbers. It doesn't always increase in general. In general, you would expect an increasing trend -- the more training data the better the performance, asymptoting off at some point. You would expect some fluctuation, though, so sometimes you would expect it to go down and up again. In this particular case, performance always increases.
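The procedure in Question 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the classifier is a simple nearest-centroid stand-in rather than J48, and the function names are hypothetical. It shows the mechanics of a percentage split at several training sizes, not the results from the activity.

```python
import random

def make_data(n, seed=1):
    """Synthetic two-class data: two noisy numeric attributes per instance."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 0.0 if label == 0 else 1.5
        data.append(((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label))
    return data

def train_centroids(train):
    """Stand-in classifier (not J48): store the mean point of each class."""
    sums = {}
    for (x, y), label in train:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def predict(centroids, point):
    """Predict the class whose centroid is closest to the point."""
    x, y = point
    return min(centroids,
               key=lambda c: (centroids[c][0] - x) ** 2 + (centroids[c][1] - y) ** 2)

def percentage_split_accuracy(data, train_fraction, seed=1):
    """Shuffle, train on the first part, test on the remainder."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    train, test = shuffled[:cut], shuffled[cut:]
    model = train_centroids(train)
    hits = sum(predict(model, point) == label for point, label in test)
    return hits / len(test)

data = make_data(1500)
for percent in (10, 20, 40, 60, 80):
    accuracy = percentage_split_accuracy(data, percent / 100)
    print(f"{percent}% training: {100 * accuracy:.1f}% correct")
```

With a different seed or dataset the figures need not rise at every step, which is the fluctuation mentioned above.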
You were asked to estimate J48's true accuracy on the segment-challenge dataset in Question 4. Well, "true accuracy" -- what do we mean by "true accuracy"? I guess maybe it's not very well defined, but what one thinks of is this: if you have a large enough training set, the performance of J48 is going to increase up to some kind of point, and what would that point be? Actually, if you do this -- in fact, you've done it! -- you find that training sets of between 60% and 97-98%, using the percentage split option, consistently yield correctly classified instances in the range 94-97%. So 95% is probably the best fit from this selection of possible numbers.
It's true, by the way, that greater weight is normally given to the training portion of the split. Usually when we use percentage split, we would use 2/3, or maybe 3/4, or maybe 90% of the data for training, and the smaller amount for testing.
Questions 6 and 7 were confusing, and we've changed those. The issue there was how a classifier's performance, and secondly the reliability of the estimate of that performance, change with the size of the dataset. The performance is expected to increase as the volume of training data increases, and the reliability of the estimate is expected to increase as the volume of test data increases. With the percentage split option, there's a trade-off between the amount of test data and the amount of training data. That's what those questions are trying to get at.
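The trade-off can be put into rough numbers. The sketch below assumes a dataset of 1500 instances and a classifier whose accuracy is about 95%, both illustrative assumptions, and uses the usual binomial formula sqrt(p(1 - p) / n) for the standard error of an accuracy measured on n test instances. The function name is hypothetical.

```python
import math

def split_tradeoff(total, train_percent, assumed_accuracy=0.95):
    """Return (training size, test size, standard error of the accuracy estimate)."""
    n_train = total * train_percent // 100
    n_test = total - n_train
    # Binomial standard error of a proportion measured on n_test instances.
    se = math.sqrt(assumed_accuracy * (1 - assumed_accuracy) / n_test)
    return n_train, n_test, se

for percent in (10, 50, 66, 90):
    n_train, n_test, se = split_tradeoff(1500, percent)
    print(f"{percent}% train: {n_train} training, {n_test} test, "
          f"standard error {100 * se:.2f} points")
```

More training data leaves fewer test instances, so the estimate gets noisier even as the classifier itself is expected to improve.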
Activity 2.3 Question 5: "How do the mean and standard deviation estimates depend on the number of samples?" Well, the answer is that, roughly speaking, both stay the same. As you increase the number of samples, you expect the estimated mean to converge to the true value of the mean, and the estimated standard deviation to converge to the true standard deviation. So they would both stay about the same. This is, in fact, now marked as correct. Actually, because of the "n - 1" in the denominator of the formula for variance, it's true that the standard deviation decreases a tiny bit, but it's a very small effect. So we've also accepted that answer as correct.
That's how the mean and standard deviation estimates depend on the number of samples. Perhaps a more important question is how the reliability of the mean would change. What decreases is the standard error of the estimate of the mean, which is the standard deviation of the theoretical distribution of the large population of such estimates. With a larger number of samples, the estimate of the mean is a better, more reliable estimate.
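A small Python sketch makes the distinction concrete. The accuracy figures here are drawn from an assumed normal distribution with mean 95 and standard deviation 1.5, which is an invented source and not data from the activity. The standard error is computed as the sample standard deviation divided by the square root of the number of samples.

```python
import math
import random
import statistics

def summarize(samples):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # uses n - 1 in the denominator
    return mean, stdev, stdev / math.sqrt(len(samples))

rng = random.Random(42)
# Assumed source of accuracy figures: mean 95, standard deviation 1.5.
for n in (10, 100, 1000):
    runs = [rng.gauss(95.0, 1.5) for _ in range(n)]
    mean, stdev, stderr = summarize(runs)
    print(f"n={n:5d}  mean={mean:.2f}  stdev={stdev:.2f}  standard error={stderr:.3f}")
```

The mean and standard deviation should hover around the same values at every n, while the standard error shrinks roughly in proportion to one over the square root of n.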
"The supermarket dataset is weird." Yes, it is weird: it's intended to be weird. Actually, in the supermarket dataset, each instance represents a supermarket trolley, and, instead of putting a 0 for every item you don't buy -- of course, when we go to the supermarket, we don't buy most of the items in the supermarket -- the ARFF file codes that as a question mark, which stands for "missing value". We're going to discuss missing values in Class 5. This dataset is suitable for association rule learning, which we're not doing in this course. The message I'm trying to emphasize here is that you need to understand what you're doing, not just process datasets blindly. Yes, it is weird.
There's been some discussion on the mailing list about cross-validation and the extra model. When you do cross-validation, you're trying to do two things. You're trying to get an estimate of the expected accuracy of a classifier, and you're trying to actually produce a really good classifier. To produce a really good classifier to use in the future, you want to use the entire training set to train up the classifier. To get an estimate of its accuracy, however, you can't do that unless you have an independent test set. So cross-validation takes 90% for training and 10% for testing, repeats that 10 times, and averages the results to get an estimate. Once you've got the estimate, if you want an actual classifier to use, the best classifier is one built on the full training set. The same is true with a percentage split option. Weka will evaluate the percentage split, but then it will print the classifier that it produces from the entire training set to give you a classifier to use on your problem in the future.
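Here is a minimal sketch of that bookkeeping in Python. It is not Weka's implementation: the folds are not stratified, the learner is a trivial majority-class stand-in, and the function names are hypothetical. The point is only to show where the extra model comes from.

```python
def majority_class(train_labels):
    """Stand-in learner: always predict the most common training label."""
    return max(sorted(set(train_labels)), key=train_labels.count)

def cross_validate(labels, folds=10):
    """Return (estimated accuracy, final model, number of models built)."""
    models_built = 0
    correct = 0
    for k in range(folds):
        test = labels[k::folds]  # every folds-th instance forms the test fold
        train = [y for i, y in enumerate(labels) if i % folds != k]
        model = majority_class(train)  # used only for evaluation, then discarded
        models_built += 1
        correct += sum(y == model for y in test)
    final_model = majority_class(labels)  # the extra model, built on all the data
    models_built += 1
    return correct / len(labels), final_model, models_built

labels = ["yes"] * 70 + ["no"] * 30
accuracy, model, built = cross_validate(labels)
print(f"estimated accuracy {100 * accuracy:.0f}%, "
      f"final model predicts '{model}', {built} models built")
```

With 10 folds the count comes to 11: ten models that exist only to produce the estimate, plus the one trained on everything that you would actually use.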
There's been a little bit of discussion on advanced stuff. I think maybe a follow-up course might be a good idea here. Someone noticed that if you apply a filter to the training set, you need to apply exactly the same filter to the test set, which is sometimes a bit difficult to do, particularly if the training and test sets are produced by cross-validation. There's an advanced classifier called the "FilteredClassifier" which addresses that problem.
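The idea behind that can be sketched in Python. To be clear, this is not Weka's FilteredClassifier or its API; the class names below are hypothetical, the filter is a toy standardization, and the classifier is a toy threshold rule. What it illustrates is the principle: the filter's parameters come from the training data only, and the very same filter is then applied to anything that gets classified.

```python
import statistics

class Standardize:
    """Toy filter: learns mean and standard deviation from training data only."""
    def fit(self, values):
        self.mean = statistics.mean(values)
        self.stdev = statistics.stdev(values)
        return self

    def apply(self, values):
        return [(v - self.mean) / self.stdev for v in values]

class MidpointClassifier:
    """Toy classifier: threshold halfway between the two class means."""
    def fit(self, values, labels):
        ones = [v for v, y in zip(values, labels) if y == 1]
        zeros = [v for v, y in zip(values, labels) if y == 0]
        self.threshold = (statistics.mean(ones) + statistics.mean(zeros)) / 2
        return self

    def predict(self, values):
        return [1 if v > self.threshold else 0 for v in values]

class FilterThenClassify:
    """Hypothetical wrapper: the same idea as a filtered classifier, not Weka's code."""
    def __init__(self, filter_, classifier):
        self.filter_ = filter_
        self.classifier = classifier

    def fit(self, values, labels):
        self.filter_.fit(values)  # statistics come from the training data only
        self.classifier.fit(self.filter_.apply(values), labels)
        return self

    def predict(self, values):
        # Test data goes through the filter exactly as it was fitted in training.
        return self.classifier.predict(self.filter_.apply(values))

model = FilterThenClassify(Standardize(), MidpointClassifier())
model.fit([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0, 0, 0, 1, 1, 1])
print(model.predict([2.5, 9.0]))
```

Because the wrapper behaves like a single classifier, it can be handed to cross-validation as one unit, and each fold then refits the filter on that fold's training portion.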
In his response to a question on the supermarket dataset, Peter mentioned "unbalanced" datasets, and the cost of different kinds of error. This is something that Weka can take into account with a cost sensitive evaluation, and there is a classifier called the CostSensitiveClassifier that allows you to do that.
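As a rough illustration of cost-sensitive evaluation, the sketch below weights each cell of a confusion matrix by a cost. All the numbers are invented, the function names are hypothetical, and this is plain Python rather than anything from Weka.

```python
def total_cost(confusion, costs):
    """Sum of count * cost over every (actual, predicted) cell."""
    return sum(
        confusion[i][j] * costs[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )

def accuracy(confusion):
    """Fraction of instances on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)

# Rows are actual classes, columns are predicted classes. All numbers are invented.
costs = [[0, 1],
         [10, 0]]  # missing an actual class-1 instance costs 10 times more
classifier_a = [[90, 0],
                [8, 2]]
classifier_b = [[80, 10],
                [2, 8]]

for name, confusion in (("A", classifier_a), ("B", classifier_b)):
    print(f"classifier {name}: accuracy {100 * accuracy(confusion):.0f}%, "
          f"total cost {total_cost(confusion, costs)}")
```

On this unbalanced example, classifier A has the higher accuracy but the higher total cost, because most of its errors are of the expensive kind.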
Finally, someone just asked a question on attribute selection: how do you select a good subset of attributes? Excellent question! There's a whole "Select attributes" panel, which we're not able to talk about in this MOOC. This is just an introductory MOOC on Weka. Maybe we'll come up with an advanced follow-up MOOC where we're able to discuss some of these more advanced issues.
That's it. I just want to finish with a picture that someone sent in of two wekas in an enclosure. It's rare to see wekas in the wild -- I've seen them a couple of times myself, but not very often. More likely, to see a weka you need to go to a place where they keep captured wekas for you to look at. Here are two wekas that Leah from Vancouver sent in.
Class 3 is up now, so off you go with Class 3. Good luck! We'll talk to you later. Bye for now!