1 00:00:17,630 --> 00:00:24,750 Hello again! This is the last class of Data Mining with Weka, and we're going to step 2 00:00:24,750 --> 00:00:29,070 back a little bit and take a look at some more global issues with regard to the data 3 00:00:29,070 --> 00:00:29,859 mining process. 4 00:00:29,859 --> 00:00:38,400 It's a short class with just four lessons: the data mining process, pitfalls and pratfalls, 5 00:00:38,400 --> 00:00:41,730 data mining and ethics, and finally, a quick summary. 6 00:00:42,760 --> 00:00:45,760 Let's get on with Lesson 5.1. 7 00:00:45,760 --> 00:00:50,570 This might be your vision of the data mining process. 8 00:00:50,570 --> 00:00:53,100 You've got some data or someone gives you some data. 9 00:00:53,100 --> 00:00:54,860 You've got Weka. 10 00:00:54,860 --> 00:01:00,720 You apply Weka to the data, you get some kind of cool result from that, and everyone's happy. 11 00:01:02,820 --> 00:01:05,509 If so, I've got bad news for you. 12 00:01:05,509 --> 00:01:08,500 It's not going to be like that at all. 13 00:01:08,500 --> 00:01:11,579 Really, this would be a better way to think about it. 14 00:01:11,579 --> 00:01:15,650 You're going to have a circle; you're going to go round and round the circle. 15 00:01:15,650 --> 00:01:19,770 It's true that Weka is important -- it's in the very middle of the circle here. 16 00:01:19,770 --> 00:01:26,069 It's going to be crucial, but it's only a small part of what you have to do. 17 00:01:26,069 --> 00:01:30,590 Perhaps the biggest problem is going to be to ask the right kind of question. 18 00:01:30,590 --> 00:01:37,380 You need to be answering a question, not just vaguely exploring a collection of data. 19 00:01:38,420 --> 00:01:44,680 Then, you need to get together the data that you can get hold of that gives you a chance 20 00:01:44,689 --> 00:01:49,329 of answering this question using data mining techniques. 21 00:01:49,329 --> 00:01:50,950 It's hard to collect the data. 
22 00:01:50,950 --> 00:01:56,670 You're probably going to have an initial dataset, but you might need to add some demographic 23 00:01:56,670 --> 00:02:00,319 data, or some weather data, or some data about other stuff. 24 00:02:00,319 --> 00:02:05,079 You're going to have to go to the web and find more information to augment your dataset. 25 00:02:05,079 --> 00:02:11,819 Then you'll merge all that together: do some database hacking to get a dataset that contains 26 00:02:11,819 --> 00:02:17,410 all the attributes that you think you might need -- or that you think Weka might need. 27 00:02:17,410 --> 00:02:19,069 Then you're going to have to clean the data. 28 00:02:19,069 --> 00:02:24,890 The bad news is that real world data is always very messy. 29 00:02:24,890 --> 00:02:29,610 That's a long and painstaking process of looking around, looking at the data, trying to understand it, 30 00:02:29,610 --> 00:02:35,390 trying to figure out what the anomalies are and whether it's good to delete them or not. 31 00:02:35,390 --> 00:02:37,260 That's going to take a while. 32 00:02:37,260 --> 00:02:40,550 Then you're going to need to define some new features, probably. 33 00:02:40,550 --> 00:02:44,810 This is the feature engineering process, and it's the key to successful data mining. 34 00:02:44,810 --> 00:02:49,030 Then, finally, you're going to use Weka, of course. 35 00:02:49,030 --> 00:02:54,860 You might go around this circle a few times to get a nice algorithm for classification, 36 00:02:54,860 --> 00:03:00,420 and then you're going to need to deploy the algorithm in the real world. 37 00:03:00,420 --> 00:03:03,340 Each of these processes is difficult. 38 00:03:04,340 --> 00:03:08,340 You need to think about the question that you want to answer. 39 00:03:08,440 --> 00:03:13,330 "Tell me something cool about this data" is not a good enough question. 40 00:03:13,330 --> 00:03:17,890 You need to know what you want to know from the data. 
41 00:03:17,890 --> 00:03:19,660 Then you need to gather it. 42 00:03:19,660 --> 00:03:23,110 There's a lot of data around, like I said at the very beginning, but the trouble is 43 00:03:23,110 --> 00:03:30,110 that we need classified data to use classification techniques in data mining. 44 00:03:30,290 --> 00:03:36,080 We need expert judgements on the data, expert classifications, and there's not so much data 45 00:03:36,080 --> 00:03:42,810 around that includes expert classifications, or correct results. 46 00:03:42,810 --> 00:03:45,680 They say that more data beats a clever algorithm. 47 00:03:45,680 --> 00:03:49,910 So rather than spending time trying to optimize the exact algorithm you're going to use in 48 00:03:49,910 --> 00:03:53,670 Weka, you might be better off getting more and more data. 49 00:03:53,670 --> 00:04:00,570 Then you've got to clean it, and like I said before, real data is very mucky. 50 00:04:00,570 --> 00:04:04,650 That's going to be a painstaking matter of looking through it and looking for anomalies. 51 00:04:04,650 --> 00:04:08,000 Feature engineering, the next step, is the key to data mining. 52 00:04:08,000 --> 00:04:12,930 We'll talk about how Weka can help you a little bit in a minute. 53 00:04:12,930 --> 00:04:16,340 Then you've got to deploy the result. 54 00:04:16,340 --> 00:04:18,490 Implementing it -- well, that's the easy part. 55 00:04:18,490 --> 00:04:24,430 The difficult part is to convince your boss to use this result from this data mining process 56 00:04:24,430 --> 00:04:29,620 that he probably finds very mysterious and perhaps doesn't trust very much. 57 00:04:29,620 --> 00:04:36,620 Getting anything actually deployed in the real world is a pretty tough call. 58 00:04:37,060 --> 00:04:43,370 The key technical part of all this is feature engineering, and Weka has a lot of [filters] 59 00:04:43,370 --> 00:04:44,200 that will help with this. 60 00:04:44,200 --> 00:04:46,150 Here are just a few of them. 
61 00:04:46,150 --> 00:04:53,150 It might be worthwhile defining a new feature, a new attribute that's a mathematical expression 62 00:04:54,530 --> 00:04:56,120 involving existing attributes. 63 00:04:56,120 --> 00:04:59,890 Or you might want to modify an existing attribute. 64 00:04:59,890 --> 00:05:05,240 With AddExpression, you can use any kind of mathematical formula to create a new attribute 65 00:05:05,240 --> 00:05:08,050 from existing ones. 66 00:05:08,050 --> 00:05:13,730 You might want to normalize or center your data, or standardize it statistically. 67 00:05:13,730 --> 00:05:18,210 Transform a numeric attribute to have a zero mean -- that's "center". 68 00:05:18,210 --> 00:05:21,830 Or transform it to a given numeric range -- that's "normalize". 69 00:05:21,830 --> 00:05:28,830 Or give it a zero mean and unit variance -- that's a statistical operation called "standardization". 70 00:05:30,530 --> 00:05:37,500 You might want to take those numeric attributes and discretize them into nominal values. 71 00:05:37,500 --> 00:05:43,440 Weka has both supervised and unsupervised attribute discretization filters. 72 00:05:44,790 --> 00:05:46,000 There are a lot of other transformations. 73 00:05:46,000 --> 00:05:51,480 For example, the PrincipalComponents transformation involves a matrix analysis of the data to 74 00:05:51,480 --> 00:05:54,150 select the principal components in a linear space. 75 00:05:54,150 --> 00:05:58,920 That's mathematical, and Weka contains a good implementation. 76 00:05:58,920 --> 00:06:04,220 RemoveUseless will remove attributes that don't vary at all, or vary too much. 77 00:06:04,220 --> 00:06:07,800 Actually, I think we encountered that in one of our activities. 78 00:06:07,800 --> 00:06:14,800 Then, there are a couple of filters that help you deal with time series, when your instances 79 00:06:14,830 --> 00:06:17,300 represent a series over time. 
80 00:06:17,300 --> 00:06:21,080 You probably want to take the difference between one instance and the next, or a difference 81 00:06:21,080 --> 00:06:27,680 with some kind of lag -- one instance and the one 5 before it, or 10 before it. 82 00:06:27,680 --> 00:06:33,650 These are just a few of the filters that Weka contains to help you with your feature engineering. 83 00:06:33,650 --> 00:06:39,250 The message of this lesson is that Weka is only a small part of the entire data mining 84 00:06:39,250 --> 00:06:41,810 process, and it's the easiest part. 85 00:06:41,810 --> 00:06:46,310 In this course, we've chosen to tell you about the easiest part of the process! I'm sorry 86 00:06:46,310 --> 00:06:46,780 about that. 87 00:06:46,780 --> 00:06:50,230 The other bits are, in practice, much more difficult. 88 00:06:50,230 --> 00:06:56,270 There's an old programmer's blessing: "May all your problems be technical ones". 89 00:06:56,270 --> 00:07:01,170 It's the other problems -- the political problems in getting hold of the data, and deploying 90 00:07:01,170 --> 00:07:06,610 the result -- those are the ones that tend to be much more onerous in the overall data 91 00:07:06,610 --> 00:07:07,330 mining process. 92 00:07:07,330 --> 00:07:09,920 So good luck! 93 00:07:09,920 --> 00:07:12,400 There's some stuff about this in the course text. 94 00:07:12,400 --> 00:07:17,810 Section 1.3 contains information on Fielded Applications, all of which have gone through 95 00:07:17,810 --> 00:07:24,480 this kind of process in order to get them out there and used in the field. 96 00:07:24,480 --> 00:07:26,200 There's an activity associated with this lesson. 97 00:07:26,200 --> 00:07:29,180 Off you go and do it, and we'll see you in the next lesson. 98 00:07:29,180 --> 00:07:36,180 Bye for now!
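Supplementary sketch: the lesson describes "center", "normalize" and "standardize", and taking the difference between one instance and an earlier one with some lag. The short Python example below illustrates those four transformations on a single numeric attribute. It is not Weka's implementation and does not use Weka's API. The function names, the default range of 0 to 1, the use of population variance, the handling of constant attributes, and the sample values are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def center(values):
    """Shift the values so that their mean is zero."""
    mean = fmean(values)
    return [v - mean for v in values]


def normalize(values, low=0.0, high=1.0):
    """Rescale the values linearly into the range [low, high]."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # Assumption: a constant attribute has no spread, so map it to `low`.
        return [low for _ in values]
    scale = (high - low) / (largest - smallest)
    return [low + (v - smallest) * scale for v in values]


def standardize(values):
    """Give the values zero mean and unit variance (population variance assumed)."""
    mean = fmean(values)
    spread = pstdev(values)
    if spread == 0:
        # Assumption: a constant attribute is mapped to all zeros.
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]


def lagged_difference(values, lag=1):
    """Difference between each value and the one `lag` positions before it."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]


# Hypothetical attribute values, in time order.
temperatures = [10.0, 12.0, 14.0, 16.0, 18.0]

print(center(temperatures))
print(normalize(temperatures))
print(standardize(temperatures))
print(lagged_difference(temperatures, lag=2))
```

In this sketch the lagged difference simply drops the first `lag` instances, which have nothing to be compared with. A real filter has to decide what to do with them, and that choice is left open here.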