Hello again! This is the last class of Data Mining with Weka, and we're going to step back a little bit and take a look at some more global issues with regard to the data mining process. It's a short class with just four lessons: the data mining process, pitfalls and pratfalls, data mining and ethics, and finally, a quick summary.
Let's get on with Lesson 5.1. This might be your vision of the data mining process. You've got some data or someone gives you some data. You've got Weka. You apply Weka to the data, you get some kind of cool result from that, and everyone's happy.
If so, I've got bad news for you. It's not going to be like that at all. Really, this would be a better way to think about it. You're going to have a circle; you're going to go round and round the circle. It's true that Weka is important -- it's in the very middle of the circle here. It's going to be crucial, but it's only a small part of what you have to do.
Perhaps the biggest problem is going to be to ask the right kind of question. You need to be answering a question, not just vaguely exploring a collection of data. Then, you need to get together the data that you can get hold of that gives you a chance of answering this question using data mining techniques. It's hard to collect the data. You're probably going to have an initial dataset, but you might need to add some demographic data, or some weather data, or some data about other stuff. You're going to have to go to the web and find more information to augment your dataset. Then you'll merge all that together: do some database hacking to get a dataset that contains all the attributes that you think you might need -- or that you think Weka might need.
Then you're going to have to clean the data. The bad news is that real-world data is always very messy. Cleaning is a long and painstaking process of looking around, looking at the data, trying to understand it, trying to figure out what the anomalies are and whether it's good to delete them or not. That's going to take a while. Then you're going to need to define some new features, probably. This is the feature engineering process, and it's the key to successful data mining. Then, finally, you're going to use Weka, of course. You might go around this circle a few times to get a nice algorithm for classification, and then you're going to need to deploy the algorithm in the real world.
Each of these processes is difficult. You need to think about the question that you want to answer. "Tell me something cool about this data" is not a good enough question. You need to know what you want to know from the data. Then you need to gather it. There's a lot of data around, like I said at the very beginning, but the trouble is that we need classified data to use classification techniques in data mining. We need expert judgements on the data, expert classifications, and there's not so much data around that includes expert classifications, or correct results.
They say that more data beats a clever algorithm. So rather than spending time trying to optimize the exact algorithm you're going to use in Weka, your time might be better employed getting more and more data. Then you've got to clean it, and like I said before, real data is very mucky. That's going to be a painstaking matter of looking through it and looking for anomalies.
Feature engineering, the next step, is the key to data mining. We'll talk about how Weka can help you a little bit in a minute. Then you've got to deploy the result. Implementing it -- well, that's the easy part. The difficult part is to convince your boss to use this result from this data mining process that he probably finds very mysterious and perhaps doesn't trust very much. Getting anything actually deployed in the real world is a pretty tough call.
The key technical part of all this is feature engineering, and Weka has a lot of filters that will help with this. Here are just a few of them. It might be worthwhile defining a new feature, a new attribute that's a mathematical expression involving existing attributes. Or you might want to modify an existing attribute. With AddExpression, you can use any kind of mathematical formula to create a new attribute from existing ones.
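To make the idea of a derived attribute concrete, here is a minimal Python sketch. It is not Weka's AddExpression filter, just the same idea in plain code; the function name `add_expression` and the height/weight data are hypothetical, made up for illustration.

```python
def add_expression(rows, expression):
    """Return copies of the rows with one extra value computed from each row."""
    return [row + [expression(row)] for row in rows]


# Hypothetical instances: [height in metres, weight in kilograms]
people = [[2.0, 80.0], [1.5, 45.0]]

# New attribute: weight divided by height squared (body mass index)
with_bmi = add_expression(people, lambda r: r[1] / (r[0] ** 2))
print(with_bmi)
```

The original rows are left untouched; each output row simply carries the new attribute at the end.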
You might want to normalize or center your data, or standardize it statistically. Transform a numeric attribute to have a zero mean -- that's "center". Or transform it to a given numeric range -- that's "normalize". Or give it a zero mean and unit variance -- that's a statistical operation called "standardization".
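The three operations are easy to confuse, so here is a small Python sketch of each one on a single numeric attribute. This is an illustration under stated assumptions rather than Weka's own code: the function names are hypothetical, and dividing by the population standard deviation in `standardize` is a choice made for this sketch.

```python
from statistics import mean, pstdev


def center(values):
    """Shift the values so that their mean is zero."""
    m = mean(values)
    return [v - m for v in values]


def normalize(values, lo=0.0, hi=1.0):
    """Rescale the values linearly into the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:  # constant attribute: nothing to scale
        return [lo for _ in values]
    scale = (hi - lo) / (vmax - vmin)
    return [lo + (v - vmin) * scale for v in values]


def standardize(values):
    """Give the values zero mean and unit variance."""
    m = mean(values)
    s = pstdev(values)  # assumption: population standard deviation
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


temps = [2.0, 4.0, 6.0]  # hypothetical attribute values
centered = center(temps)
normalized = normalize(temps)
standardized = standardize(temps)
```

Centering only moves the values, normalizing squeezes them into a fixed range, and standardizing does both a shift and a rescale based on the spread of the data.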
You might want to take those numeric attributes and discretize them into nominal values. Weka has both supervised and unsupervised attribute discretization filters. There are a lot of other transformations. For example, the PrincipalComponents transformation involves a matrix analysis of the data to select the principal components in a linear space. That's mathematical, and Weka contains a good implementation.
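As an illustration of the unsupervised kind of discretization, here is a minimal Python sketch of equal-width binning, one simple way to turn numbers into nominal bins. The function name and the data are hypothetical, and this is not a description of how Weka's filters are implemented.

```python
def equal_width_bins(values, bins):
    """Map each numeric value to a bin index 0..bins-1 using equal-width intervals."""
    vmin, vmax = min(values), max(values)
    width = (vmax - vmin) / bins
    if width == 0:  # constant attribute: everything falls in one bin
        return [0 for _ in values]
    # The maximum value would land in bin `bins`, so clamp it into the last bin
    return [min(int((v - vmin) / width), bins - 1) for v in values]


ages = [0.0, 10.0, 25.0, 40.0]  # hypothetical attribute values
labels = equal_width_bins(ages, 4)
print(labels)
```

A supervised method would choose the bin boundaries by looking at the class values as well, which this sketch does not do.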
RemoveUseless will remove attributes that don't vary at all, or vary too much. Actually, I think we encountered that in one of our activities.
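Here is a rough Python sketch of that idea: drop attributes that are constant, and drop nominal attributes where almost every instance has a different value (an ID code, for example). The function name, the 0.99 threshold, and the rule of applying the "varies too much" test only to string-valued columns are assumptions made for this sketch, not a statement of Weka's exact behaviour.

```python
def remove_useless(rows, max_distinct_fraction=0.99):
    """Drop constant columns and nominal columns with too many distinct values."""
    n = len(rows)
    keep = []
    for col in range(len(rows[0])):
        distinct = len({row[col] for row in rows})
        if distinct == 1:
            continue  # never varies: carries no information
        is_nominal = isinstance(rows[0][col], str)
        if is_nominal and distinct / n > max_distinct_fraction:
            continue  # varies too much: nearly unique per instance
        keep.append(col)
    return [[row[c] for c in keep] for row in rows]


# Hypothetical instances: [id, always-the-same flag, numeric reading, outlook]
data = [
    ["id1", "yes", 1.0, "sunny"],
    ["id2", "yes", 2.0, "rainy"],
    ["id3", "yes", 3.0, "sunny"],
]
cleaned = remove_useless(data)
print(cleaned)
```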
Then, there are a couple of filters that help you deal with time series, when your instances represent a series over time. You probably want to take the difference between one instance and the next, or a difference with some kind of lag -- one instance and the one 5 before it, or 10 before it.
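The differencing idea can be sketched in a few lines of Python. This is a plain illustration, not one of Weka's filters; the function name and the sales figures are hypothetical.

```python
def lagged_difference(series, lag=1):
    """Difference between each value and the one `lag` steps before it."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


sales = [10, 12, 15, 11, 20, 26]  # hypothetical series, one value per time step

step_changes = lagged_difference(sales)      # one instance and the next
long_changes = lagged_difference(sales, 5)   # one instance and the one 5 before it
print(step_changes, long_changes)
```

Note that the first `lag` instances have nothing to be compared with, so the result is shorter than the original series.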
These are just a few of the filters that Weka contains to help you with your feature engineering.
The message of this lesson is that Weka is only a small part of the entire data mining process, and it's the easiest part. In this course, we've chosen to tell you about the easiest part of the process! I'm sorry about that. The other bits are, in practice, much more difficult. There's an old programmer's blessing: "May all your problems be technical ones". It's the other problems -- the political problems in getting hold of the data, and deploying the result -- those are the ones that tend to be much more onerous in the overall data mining process. So good luck!
There's some stuff about this in the course text. Section 1.3 contains information on Fielded Applications, all of which have gone through this kind of process in order to get them out there and used in the field.
There's an activity associated with this lesson. Off you go and do it, and we'll see you in the next lesson. Bye for now!