1
00:00:17,789 --> 00:00:20,869
Hi! Welcome back to Data Mining with Weka.
2
00:00:20,869 --> 00:00:26,449
In the last lesson, we looked at classification
by regression, how to use linear regression
3
00:00:26,449 --> 00:00:33,079
to perform classification tasks. In this lesson
we're going to look at a more powerful way
4
00:00:33,079 --> 00:00:37,059
of doing the same kind of thing. It's called
"logistic regression". It's fairly mathematical,
5
00:00:37,059 --> 00:00:43,399
and we're not going to go into the dirty details
of how it works, but I'd like to give you
6
00:00:43,399 --> 00:00:48,879
a flavor of the kinds of things it does and
the basic principles that underlie logistic
7
00:00:48,879 --> 00:00:53,629
regression. Then, of course, you can use it
yourself in Weka without any problem.
8
00:00:55,560 --> 00:00:59,750
One of the things about data mining is that
you can sometimes do better by using prediction
9
00:00:59,750 --> 00:01:05,970
probabilities rather than actual classes.
Instead of predicting whether it's going to
10
00:01:05,970 --> 00:01:10,820
be a "yes" or a "no", you might do better
to predict the probability with which you
11
00:01:10,820 --> 00:01:15,790
think it's going to be a "yes" or a "no".
For example, the weather is 95% likely to
12
00:01:15,790 --> 00:01:21,420
be rainy tomorrow, or 72% likely to be sunny,
instead of saying it's definitely going to
13
00:01:21,420 --> 00:01:26,080
be rainy or it's definitely going to be sunny.
14
00:01:26,080 --> 00:01:32,110
Probabilities are really useful things in
data mining. NaiveBayes produces probabilities;
15
00:01:32,110 --> 00:01:36,360
it works in terms of probabilities. We've
seen that in an earlier lesson.
16
00:01:36,360 --> 00:01:43,360
I'm going to open diabetes and run NaiveBayes.
17
00:01:49,640 --> 00:01:55,660
I'm going to use a percentage split with 90%,
18
00:01:55,660 --> 00:02:07,280
so that leaves 10% as a test set. Then I'm
going to make sure I output the predictions
19
00:02:07,280 --> 00:02:14,280
on those 10%, and run it. I want to look at
the predictions that have been output.
20
00:02:14,960 --> 00:02:20,840
This is a 2-class dataset; the classes are tested_negative
and tested_positive, and these are the instances
21
00:02:20,840 --> 00:02:25,569
-- number 1, number 2, number 3, etc. This
is the actual class -- tested_negative, tested_positive,
22
00:02:25,569 --> 00:02:29,959
tested_negative, etc. This is the predicted
class -- tested_negative, tested_negative,
23
00:02:29,959 --> 00:02:34,819
tested_negative, tested_negative, etc. There's
a plus under the error column to show where
24
00:02:34,819 --> 00:02:41,459
there's an error, so there's an error with
instance number 2. These are the actual probabilities
25
00:02:41,459 --> 00:02:43,019
that come out of NaiveBayes.
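
As an aside, here's a minimal sketch of reproducing this experiment with the Weka Java API. The filename diabetes.arff and the shuffling seed are assumptions, so the exact test instances may differ from what the Explorer shows:

```java
import java.util.Random;

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictionProbabilities {
    public static void main(String[] args) throws Exception {
        // Load the diabetes data; the last attribute is the class.
        Instances data = DataSource.read("diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 90%/10% percentage split (the Explorer shuffles with a seed
        // first, so the exact split may differ from the video).
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.9);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize,
                data.numInstances() - trainSize);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(train);

        // Print the class distribution for each test instance, like the
        // Explorer's "Output predictions" option; index 0 corresponds to
        // tested_negative, the first class value in this dataset.
        for (Instance inst : test) {
            double[] dist = nb.distributionForInstance(inst);
            System.out.printf("actual=%s  p(neg)=%.2f  p(pos)=%.2f%n",
                    inst.stringValue(inst.classIndex()), dist[0], dist[1]);
        }
    }
}
```
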
26
00:02:43,019 --> 00:02:51,020
So for instance 1 we've got a 99% probability
that it's negative, and a 1% probability that
27
00:02:51,029 --> 00:02:56,340
it's positive. So we predict it's going to
be negative; that's why that's tested_negative.
28
00:02:56,340 --> 00:03:02,489
And in fact we're correct; it is tested_negative.
For this instance, where the prediction is actually incorrect,
29
00:03:02,489 --> 00:03:07,909
we're predicting 67% for negative
and 33% for positive, so we decide it's a
30
00:03:07,909 --> 00:03:14,549
negative, and we're wrong. We might have done
better to say that here we're really sure
31
00:03:14,549 --> 00:03:18,760
it's going to be a negative, and we're right;
here we think it's going to be a negative,
32
00:03:18,760 --> 00:03:24,260
but we're not sure, and it turns out that
we're wrong. Sometimes it's a lot better to
33
00:03:24,260 --> 00:03:31,150
think of the output in terms of probabilities,
rather than being forced to make a binary,
34
00:03:31,150 --> 00:03:34,620
black-or-white classification.
35
00:03:34,620 --> 00:03:41,620
Other data mining methods produce probabilities,
as well. If I look at ZeroR, and run that,
36
00:03:46,689 --> 00:03:53,689
these are the probabilities -- 65% versus
35%. All of them are the same.
37
00:03:55,000 --> 00:04:00,650
Of course, it's ZeroR! -- it always produces the same
thing. In this case, it always says tested_negative
38
00:04:00,650 --> 00:04:05,699
and always has the same probabilities. The
reason why the numbers are like that, if you
39
00:04:05,699 --> 00:04:11,650
look at the slide here, is that we've chosen
a 90% training set and a 10% test set, and
40
00:04:11,650 --> 00:04:18,650
the training set contains 448 negative instances
and 243 positive instances.
41
00:04:18,650 --> 00:04:28,180
Remember the "Laplace Correction" in Lesson 3.2? -- we add 1 to
each of those counts to get 449 and 244.
42
00:04:29,560 --> 00:04:37,620
That gives us a 65% probability of being a negative
instance. That's where these numbers come from.
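
Spelled out, the arithmetic from the slide is:

$$\Pr[\text{negative}] = \frac{448 + 1}{448 + 243 + 2} = \frac{449}{693} \approx 0.65, \qquad \Pr[\text{positive}] = \frac{244}{693} \approx 0.35.$$
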
43
00:04:40,150 --> 00:04:50,800
If we look at J48 and run that, then we get
more interesting probabilities here --
44
00:04:51,920 --> 00:04:56,560
the negative and positive probabilities, respectively.
45
00:04:56,560 --> 00:04:58,200
You can see where the errors are.
46
00:04:58,200 --> 00:05:00,430
These probabilities are all different.
47
00:05:00,430 --> 00:05:06,110
Internally, J48 uses probabilities in order
to do its pruning operations.
48
00:05:06,110 --> 00:05:11,820
We talked about that when we discussed J48's
pruning, although I didn't explain explicitly
49
00:05:11,820 --> 00:05:15,400
how the probabilities are derived.
50
00:05:15,400 --> 00:05:21,380
The idea of logistic regression is to make
linear regression produce probabilities, too.
51
00:05:21,380 --> 00:05:23,990
This gets a little bit hairy.
52
00:05:23,990 --> 00:05:29,380
Remember, when we use linear regression for
classification, we calculate a linear function
53
00:05:29,380 --> 00:05:36,380
using regression and then apply a threshold
to decide whether it's a 0 or a 1.
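
As a reminder, with the class coded as 0 or 1, that thresholding rule can be written as (assuming the usual 0.5 threshold from the last lesson):

$$\hat{y} = \begin{cases} 1 & \text{if } w_0 + w_1 a_1 + \cdots + w_k a_k \ge 0.5,\\ 0 & \text{otherwise.} \end{cases}$$
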
54
00:05:36,650 --> 00:05:41,200
It's tempting to imagine that you can interpret
these numbers as probabilities, instead of
55
00:05:41,200 --> 00:05:43,660
thresholding like that, but that's a mistake.
56
00:05:43,660 --> 00:05:45,690
They're not probabilities.
57
00:05:45,690 --> 00:05:48,960
These numbers that come out on the regression
line are sometimes negative, and sometimes
58
00:05:48,960 --> 00:05:50,100
greater than 1.
59
00:05:50,100 --> 00:05:54,710
They can't be probabilities, because probabilities
don't work like that.
60
00:05:54,710 --> 00:06:01,660
In order to get better probability estimates,
a slightly more sophisticated technique is used.
61
00:06:01,660 --> 00:06:04,350
In linear regression, we have a linear sum.
62
00:06:04,350 --> 00:06:10,020
In logistic regression, we have the same linear
sum down here -- the same kind of linear sum
63
00:06:10,020 --> 00:06:13,540
that we saw before -- but we embed it in this
kind of formula.
64
00:06:13,540 --> 00:06:16,120
This is called a "logit transform".
65
00:06:16,120 --> 00:06:21,460
The logit transform here is multi-dimensional,
with a lot of different a's.
66
00:06:21,460 --> 00:06:27,340
If we've got just one dimension, one variable,
a1, then if this is the input to the logit
67
00:06:27,340 --> 00:06:32,360
transform, the output looks like this: it's
between 0 and 1.
68
00:06:32,360 --> 00:06:36,090
It's sort of an S-shaped curve -- a softer
kind of function.
69
00:06:36,090 --> 00:06:42,540
Rather than a hard 0-to-1 step function,
it's a soft version of a step function that
70
00:06:42,540 --> 00:06:49,800
never gets below 0, never gets above 1, and
has a smooth transition in between.
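
Written out, the formula on the slide is the standard logistic function applied to the linear sum, with the w's being the weights to be learned:

$$\Pr[1 \mid a_1, \ldots, a_k] = \frac{1}{1 + \exp\!\big({-(w_0 + w_1 a_1 + \cdots + w_k a_k)}\big)},$$

which always lies strictly between 0 and 1.
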
71
00:06:49,800 --> 00:06:54,930
When you're working with a logit transform,
instead of minimizing the squared error (remember,
72
00:06:54,930 --> 00:07:00,460
when we do linear regression we minimize the
squared error), it's better to choose weights
73
00:07:00,460 --> 00:07:05,860
to maximize a probabilistic function called
the "log-likelihood function", which is this
74
00:07:05,860 --> 00:07:10,210
pretty scary looking formula down at the bottom.
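
For completeness, that formula is the conditional log-likelihood of the training data: writing $x^{(i)}$ for the 0/1 class of the $i$-th training instance, the weights are chosen to maximize

$$\sum_{i=1}^{n} \Big( (1 - x^{(i)}) \log\big(1 - \Pr[1 \mid a^{(i)}]\big) + x^{(i)} \log \Pr[1 \mid a^{(i)}] \Big).$$
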
75
00:07:10,210 --> 00:07:12,620
That's the basis of logistic regression.
76
00:07:12,620 --> 00:07:15,889
We won't talk about the details any more:
let me just do it.
77
00:07:15,889 --> 00:07:19,139
We're going to use the diabetes dataset.
78
00:07:19,139 --> 00:07:23,360
In the last lesson we got 76.8% with classification
by regression.
79
00:07:23,360 --> 00:07:29,370
Let me tell you: if you do ZeroR, NaiveBayes,
and J48, you get these numbers here.
80
00:07:29,370 --> 00:07:35,460
I'm going to find the logistic regression
scheme.
81
00:07:35,460 --> 00:07:38,310
It's in "functions", and called "Logistic".
82
00:07:38,310 --> 00:07:41,620
I'm going to use 10-fold cross-validation.
83
00:07:41,620 --> 00:07:43,540
I'm not going to output the predictions.
84
00:07:45,360 --> 00:07:50,540
I'll just run it -- and I get 77.2% accuracy.
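
For reference, the same run is easy to script against the Weka Java API -- again the filename is an assumption, and the exact figure depends on the seed used to form the folds:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunLogistic {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of Logistic; the accuracy reported
        // depends on the random seed used to form the folds.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new Logistic(), data, 10, new Random(1));

        // pctCorrect() is the accuracy figure the Explorer reports.
        System.out.printf("Accuracy: %.1f%%%n", eval.pctCorrect());
    }
}
```
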
85
00:07:52,080 --> 00:07:59,290
That's the best figure in this column, though
it's not much better than NaiveBayes, so you
86
00:07:59,290 --> 00:08:02,070
might be a bit skeptical about whether it
really is better.
87
00:08:02,070 --> 00:08:07,639
I did this 10 times and calculated the means
myself, and we get these figures for the mean
88
00:08:07,639 --> 00:08:08,930
of 10 runs.
89
00:08:08,930 --> 00:08:15,480
ZeroR stays the same, of course, at 65.1%;
it produces the same accuracy on each run.
90
00:08:15,480 --> 00:08:21,910
NaiveBayes and J48 are different, and here
logistic regression gets an average of 77.5%,
91
00:08:21,910 --> 00:08:27,970
which is appreciably better than the other
figures in this column.
92
00:08:27,970 --> 00:08:30,880
You can extend the idea to multiple classes.
93
00:08:30,880 --> 00:08:37,880
When we did this in the previous lesson, we
performed a regression for each class, a multi-response
94
00:08:37,880 --> 00:08:38,810
regression.
95
00:08:38,810 --> 00:08:44,209
That actually doesn't work well with logistic
regression, because you need the probabilities
96
00:08:44,209 --> 00:08:48,149
to sum to 1 over the various different classes.
97
00:08:48,149 --> 00:08:50,700
That introduces more computational complexity
98
00:08:50,700 --> 00:08:55,000
and needs to be tackled as a joint optimization problem.
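
One standard way to set up that joint problem -- and the kind of formulation Weka's Logistic uses -- is to give each class j its own linear function f_j(a) and define the probabilities together, so they necessarily sum to 1:

$$\Pr[j \mid a] = \frac{\exp\big(f_j(a)\big)}{\sum_{l=1}^{m} \exp\big(f_l(a)\big)}, \qquad \sum_{j=1}^{m} \Pr[j \mid a] = 1.$$
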
99
00:08:57,040 --> 00:09:02,850
The result is logistic regression, a popular
and powerful machine learning method that
100
00:09:02,850 --> 00:09:07,009
uses the logit transform to predict probabilities directly.
101
00:09:07,009 --> 00:09:12,749
It works internally with probabilities, like
NaiveBayes does.
102
00:09:12,749 --> 00:09:17,250
We also learned in this lesson about prediction
probabilities that can be obtained from other
103
00:09:17,250 --> 00:09:21,699
methods, and how to calculate probabilities
from ZeroR.
104
00:09:21,699 --> 00:09:26,520
You can read about logistic regression in
Section 4.6 of the course text.
105
00:09:26,520 --> 00:09:30,500
Now you should go and do the activity associated
with this lesson.
106
00:09:30,500 --> 00:09:31,500
See you soon.
107
00:09:31,500 --> 00:09:33,000
Bye for now!