1
00:00:17,560 --> 00:00:21,910
Hi! This is Lesson 3.3 on using probabilities.
2
00:00:21,910 --> 00:00:26,160
It's the one bit of Data Mining with Weka
where we're going to see a little bit of mathematics,
3
00:00:26,160 --> 00:00:31,609
but don't worry, I'll take you through it
gently.
4
00:00:31,609 --> 00:00:36,879
The OneR strategy that we've just been studying
assumes that there is one of the attributes
5
00:00:36,879 --> 00:00:40,930
that does all the work, that takes responsibility
for the decision.
6
00:00:41,520 --> 00:00:43,120
That's a simple strategy.
7
00:00:43,120 --> 00:00:48,420
Another simple strategy is the opposite: to
assume that all of the attributes contribute equally
8
00:00:48,420 --> 00:00:51,659
and independently to the decision.
9
00:00:51,659 --> 00:00:54,229
This is called the "Naive Bayes" method --
10
00:00:54,229 --> 00:00:55,869
I'll explain the name later on.
11
00:00:56,580 --> 00:01:02,159
There are two assumptions that underlie Naive
Bayes: that the attributes are equally important
12
00:01:02,159 --> 00:01:05,070
and that they are statistically independent,
13
00:01:05,070 --> 00:01:09,909
that is, knowing the value of one of the attributes
doesn't tell you anything about the value
14
00:01:09,909 --> 00:01:12,619
of any of the other attributes.
15
00:01:12,619 --> 00:01:17,780
This independence assumption is never actually
correct, but the method based on it often
16
00:01:17,780 --> 00:01:23,509
works well in practice.
17
00:01:23,509 --> 00:01:30,159
There's a theorem in probability called "Bayes
Theorem" after this guy Thomas Bayes from the
18
00:01:30,159 --> 00:01:33,030
18th century.
19
00:01:33,030 --> 00:01:39,369
It's about the probability of a hypothesis
H given evidence E.
20
00:01:39,369 --> 00:01:46,100
In our case, the hypothesis is the class of
an instance and the evidence is the attribute
21
00:01:46,100 --> 00:01:48,899
values of the instance.
22
00:01:48,899 --> 00:01:55,319
The theorem is that Pr[H|E] -- the probability of the class
given the instance, the hypothesis
23
00:01:55,319 --> 00:02:02,109
given the evidence -- is equal to Pr[E|H] times Pr[H] divided
24
00:02:02,109 --> 00:02:06,119
by Pr[E].
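(For reference, here's the theorem written out as a display equation -- this is just the statement above in standard notation:)

```latex
\Pr[H \mid E] = \frac{\Pr[E \mid H]\,\Pr[H]}{\Pr[E]}
```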
25
00:02:06,119 --> 00:02:13,119
Pr[H] by itself is called the [prior] probability
of the hypothesis H.
26
00:02:13,290 --> 00:02:18,480
That's the probability of the event before
any evidence is seen.
27
00:02:18,480 --> 00:02:22,800
That's really the baseline probability of
the event.
28
00:02:22,800 --> 00:02:29,370
For example, in the weather data, I think
there are 9 yeses and 5 nos, so the baseline
29
00:02:29,370 --> 00:02:38,280
probability of the hypothesis "play equals
yes" is 9/14 and "play equals no" is 5/14.
30
00:02:38,280 --> 00:02:44,920
What this equation says is how to update that
probability Pr[H] when you see some evidence,
31
00:02:44,920 --> 00:02:51,340
to get what's called the "a posteriori" probability
of H, that means after the evidence.
32
00:02:51,340 --> 00:02:58,340
The evidence in our case is the attribute
values of an unknown instance. That's E.
33
00:03:01,159 --> 00:03:02,129
That's Bayes Theorem.
34
00:03:02,129 --> 00:03:08,430
Now, what makes this method "naive"? The naive
assumption is -- I've said it before -- that the
35
00:03:08,430 --> 00:03:13,140
evidence splits into parts that are statistically
independent.
36
00:03:13,140 --> 00:03:19,390
The parts of the evidence in our case are
the four different attribute values in the
37
00:03:19,390 --> 00:03:20,950
weather data.
38
00:03:20,950 --> 00:03:28,280
When you have independent events, the probabilities
multiply, so Pr[H|E],
39
00:03:28,280 --> 00:03:33,719
according to the top equation, is the product
of Pr[E|H] times the prior probability
40
00:03:33,719 --> 00:03:37,379
Pr[H] divided by Pr[E].
41
00:03:37,379 --> 00:03:43,079
Pr[E|H] splits up into
these parts: Pr[E1|H],
42
00:03:43,079 --> 00:03:48,030
the first attribute value; Pr[E2|H],
the second attribute value; and so on for all
43
00:03:48,030 --> 00:03:51,030
of the attributes.
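(Written out in full, the independence assumption factorizes the numerator into one term per attribute value -- n = 4 for the weather data:)

```latex
\Pr[H \mid E] = \frac{\Pr[E_1 \mid H]\,\Pr[E_2 \mid H]\cdots\Pr[E_n \mid H]\,\Pr[H]}{\Pr[E]}
```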
44
00:03:51,030 --> 00:03:56,650
That's maybe a bit abstract, so let's look at
the actual weather data.
45
00:03:56,650 --> 00:03:59,829
On the right-hand side is the weather data.
46
00:03:59,829 --> 00:04:03,930
In the large table at the top, we've taken
each of the attributes.
47
00:04:03,930 --> 00:04:09,799
Let's start with "outlook". Under the "yes" hypothesis and the "no" hypothesis, we've looked at
48
00:04:09,799 --> 00:04:11,959
how many times the outlook is "sunny".
49
00:04:11,959 --> 00:04:14,849
It's sunny twice under yes and 3 times under no.
50
00:04:14,849 --> 00:04:18,220
That comes straight from the data in the table.
51
00:04:18,220 --> 00:04:19,840
Overcast.
52
00:04:19,840 --> 00:04:25,120
When the outlook is overcast, it's always
a "yes" instance, so there were 4 of those,
53
00:04:25,120 --> 00:04:26,950
and zero "no" instances.
54
00:04:26,950 --> 00:04:31,250
Then, rainy is 3 "yes" instances and 2 "no"
instances.
55
00:04:31,250 --> 00:04:35,979
Those numbers just come straight from the
data table given the instance values.
56
00:04:35,979 --> 00:04:40,380
Then, we take those numbers and underneath
we make them into probabilities.
57
00:04:40,380 --> 00:04:43,259
Let's say we know the hypothesis.
58
00:04:43,259 --> 00:04:46,160
Let's say we know it's a "yes".
59
00:04:46,160 --> 00:04:52,960
Then the probability of it being "sunny" is
2/9ths, "overcast" is 4/9ths, and "rainy" 3/9ths,
60
00:04:52,960 --> 00:04:56,460
simply because when you add up 2 plus 4 plus
3 you get 9.
61
00:04:56,460 --> 00:04:59,400
Those are the probabilities.
62
00:04:59,400 --> 00:05:06,860
If we know that the outcome is "no", the probabilities
are "sunny" 3/5ths, "overcast" 0/5ths, and "rainy"
63
00:05:06,860 --> 00:05:08,340
2/5ths.
64
00:05:08,340 --> 00:05:10,169
That's for the "outlook" attribute.
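(If you'd like to see that counting step as code, here's a minimal Python sketch -- not Weka's implementation -- that turns the outlook counts from the slide into the conditional probabilities Pr[E|H]:)

```python
# Minimal sketch (not Weka's code): turn the "outlook" counts from the
# slide into conditional probabilities Pr[outlook value | class].
outlook_counts = {
    "yes": {"sunny": 2, "overcast": 4, "rainy": 3},
    "no":  {"sunny": 3, "overcast": 0, "rainy": 2},
}

for hypothesis, counts in outlook_counts.items():
    total = sum(counts.values())  # 9 under "yes", 5 under "no"
    for value, count in counts.items():
        print(f"Pr[outlook={value} | {hypothesis}] = {count}/{total}"
              f" = {count / total:.3f}")
```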
65
00:05:11,740 --> 00:05:18,060
That's what we're looking for, you see, the
probability of each of these attribute values
66
00:05:18,060 --> 00:05:21,729
given the hypothesis H.
67
00:05:21,729 --> 00:05:25,889
The next attribute is temperature, and we
just do the same thing with that to get the
68
00:05:25,889 --> 00:05:30,729
probabilities of the 3 values -- hot, mild,
and cool -- under the "yes" hypothesis or the
69
00:05:30,729 --> 00:05:32,199
"no" hypothesis.
70
00:05:32,199 --> 00:05:39,960
The same with humidity and windy. Play,
that's the prior probability -- Pr[H].
71
00:05:39,960 --> 00:05:45,669
It's "yes" 9/14ths of the time, "no" 5/14ths of the
time, even if you don't know anything about
72
00:05:45,669 --> 00:05:47,810
the attribute values.
73
00:05:47,810 --> 00:05:52,669
The equation we're looking at is this one
below, and we just need to work it out.
74
00:05:52,669 --> 00:05:54,090
Here's an example.
75
00:05:54,090 --> 00:05:56,970
Here's an unknown day, a new day.
76
00:05:56,970 --> 00:06:03,970
We don't know what the value of "play" is, but
we know it's sunny, cool, high, and windy.
77
00:06:05,280 --> 00:06:07,509
We can just multiply up these probabilities.
78
00:06:07,509 --> 00:06:13,819
If we multiply for the yes hypothesis, we
get 2/9th times 3/9ths times 3/9ths times
79
00:06:13,819 --> 00:06:22,300
3/9ths -- those are just the numbers from the
previous slide: Pr[E1|H], Pr[E2|H], Pr[E3|H],
80
00:06:22,300 --> 00:06:28,400
Pr[E4|H] -- and finally Pr[H], which is 9/14ths.
81
00:06:28,400 --> 00:06:36,560
That gives us a likelihood of 0.0053 when
you multiply them.
82
00:06:36,560 --> 00:06:43,560
Then, for the "no" class, we do the same to
get a likelihood of 0.0206.
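(Those two likelihoods are easy to check; here's the arithmetic as a small Python sketch, with the fractions taken straight from the tables on the slide:)

```python
from fractions import Fraction as F

# Likelihood of "yes" for a sunny, cool, high-humidity, windy day:
# Pr[E1|H] * Pr[E2|H] * Pr[E3|H] * Pr[E4|H] * Pr[H]
like_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)

# Likelihood of "no" for the same day, using the "no" columns:
like_no = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

print(float(like_yes))  # ~0.0053
print(float(like_no))   # ~0.0206
```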
83
00:06:44,120 --> 00:06:46,720
These numbers are not probabilities.
84
00:06:46,720 --> 00:06:48,129
Probabilities have to add up to 1.
85
00:06:48,129 --> 00:06:49,639
They are likelihoods.
86
00:06:49,639 --> 00:06:55,610
But we can get the probabilities from them
by using a straightforward normalization technique.
87
00:06:55,610 --> 00:06:56,500
We take those likelihoods for "yes"
88
00:06:56,500 --> 00:07:02,440
and "no" and we normalize them as shown below
to make them add up to 1.
89
00:07:02,440 --> 00:07:09,440
That's how we get the probability of "play"
on a new day with different attribute values.
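(Normalization is just dividing each likelihood by their sum; reusing the two numbers above:)

```python
like_yes, like_no = 0.0053, 0.0206

total = like_yes + like_no
print(like_yes / total)  # ~0.205 -- probability of "yes"
print(like_no / total)   # ~0.795 -- probability of "no"
```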
90
00:07:10,030 --> 00:07:11,380
Just to go through that again.
91
00:07:11,380 --> 00:07:17,340
The evidence is "outlook" is "sunny", "temperature"
is "cool", "humidity" is "high", "windy" is "true" --
92
00:07:17,340 --> 00:07:19,550
and we don't know what play is.
93
00:07:19,550 --> 00:07:26,990
The [likelihood] of a "yes" given the evidence
is the product of those 4 probabilities -- one
94
00:07:26,990 --> 00:07:33,000
for outlook, temperature, humidity and windy
-- times the prior probability, which is
95
00:07:33,000 --> 00:07:37,000
just the baseline probability of a "yes".
96
00:07:37,000 --> 00:07:40,650
That product of fractions is divided by Pr[E].
97
00:07:40,650 --> 00:07:45,160
We don't know what Pr[E] is, but it doesn't
matter, because we can do the same calculation
98
00:07:45,160 --> 00:07:52,240
for the "no" hypothesis, which gives us another
equation just like this, and then we can calculate
99
00:07:52,240 --> 00:07:56,870
the actual probabilities by normalizing them
so that the two probabilities add up to 1.
100
00:07:56,870 --> 00:08:01,560
Pr[E] for "yes" plus Pr[E] for "no" equals 1.
101
00:08:02,220 --> 00:08:07,850
It's actually quite simple when you look at
it in numbers, and it's simple when you look
102
00:08:07,850 --> 00:08:09,660
at it in Weka, as well.
103
00:08:09,660 --> 00:08:15,490
I'm going to go to Weka here, and I'm going
to open the nominal weather data,
104
00:08:15,490 --> 00:08:19,920
which is here.
105
00:08:19,920 --> 00:08:22,540
We've seen that before, of course, many times.
106
00:08:22,540 --> 00:08:25,590
I'm going to go to Classify.
107
00:08:25,590 --> 00:08:29,150
I'm going to use the NaiveBayes method.
108
00:08:29,150 --> 00:08:30,800
It's under this bayes category here.
109
00:08:30,800 --> 00:08:34,280
There are a lot of implementations of different
variants of Bayes.
110
00:08:34,280 --> 00:08:38,240
I'm just going to use the straightforward
NaiveBayes method here.
111
00:08:38,650 --> 00:08:42,480
I'll just run it.
112
00:08:42,480 --> 00:08:43,960
This is what we get.
113
00:08:44,870 --> 00:08:48,170
We get the success probability, calculated
according to cross-validation.
114
00:08:48,170 --> 00:08:51,570
More interestingly, we get the model.
115
00:08:51,570 --> 00:08:56,900
The model is just like the table I showed
you before, divided under the "yes" class and
116
00:08:56,900 --> 00:08:58,320
the "no" class.
117
00:08:58,320 --> 00:09:04,600
We've got the four attributes -- outlook,
temperature, humidity, and windy -- and then,
118
00:09:04,600 --> 00:09:10,020
for each of the attribute values, we've got
the number of times that attribute value appears.
119
00:09:10,630 --> 00:09:15,400
Now, there's one small but important difference
between this table and the one I showed you before.
121
00:09:15,420 --> 00:09:18,490
Let me go back to my slide and look at these
numbers.
122
00:09:18,490 --> 00:09:26,670
You can see that for outlook under "yes" on
my slide, I've got 2, 4, and 3, and Weka has
123
00:09:26,670 --> 00:09:29,410
got 3, 5, and 4.
124
00:09:29,410 --> 00:09:35,960
That's 1 more each time for a total of 12,
instead of a total of 9.
125
00:09:35,960 --> 00:09:39,410
Weka adds 1 to all of the counts.
126
00:09:39,410 --> 00:09:42,990
The reason it does this is to get
rid of the zeros.
127
00:09:42,990 --> 00:09:50,580
In the original table under outlook, the
probability of "overcast" given "no" is
128
00:09:50,580 --> 00:09:53,670
zero, and we're going to be multiplying that
into things.
129
00:09:53,670 --> 00:09:58,200
What that would mean in effect, if we took
that zero at face value, is that the probability
130
00:09:58,200 --> 00:10:06,050
of the class being "no" given any day for which
the outlook was overcast would be zero.
131
00:10:06,050 --> 00:10:09,230
Anything multiplied by zero is zero.
132
00:10:09,230 --> 00:10:13,970
These zeros in probability terms have sort
of a veto over all of the other numbers, and
133
00:10:13,970 --> 00:10:14,940
we don't want that.
134
00:10:14,940 --> 00:10:21,010
We don't want to categorically conclude that
it can't be a "no" day just because it's overcast
135
00:10:21,010 --> 00:10:25,590
and we've never seen an overcast outlook on
a "no" day before.
136
00:10:26,270 --> 00:10:30,800
That's called the "zero-frequency problem", and
Weka's solution -- the most common solution
137
00:10:30,800 --> 00:10:34,650
-- is very simple, we just add 1 to all the
counts.
138
00:10:34,650 --> 00:10:39,690
That's why all those numbers in the Weka table
are 1 bigger than the numbers in the table
139
00:10:39,690 --> 00:10:41,290
on the slide.
140
00:10:42,030 --> 00:10:45,540
Aside from that, it's all exactly the same.
141
00:10:45,540 --> 00:10:50,780
We're avoiding zero frequencies by effectively
starting all counts at 1 instead of starting
142
00:10:50,780 --> 00:10:56,480
them at 0, so they can't end up at 0.
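(Here's that add-one fix as a Python sketch -- a hand-worked illustration of what Weka's table shows, not its actual implementation:)

```python
# Add-one ("Laplace") smoothing for the outlook counts under "no".
# Starting every count at 1 instead of 0 means no probability can be 0.
raw = {"sunny": 3, "overcast": 0, "rainy": 2}

smoothed = {value: count + 1 for value, count in raw.items()}
total = sum(smoothed.values())  # 8 instead of 5

for value, count in smoothed.items():
    print(f"Pr[outlook={value} | no] = {count}/{total} = {count/total:.3f}")
```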
143
00:10:57,090 --> 00:10:59,480
That's the Naive Bayes method.
144
00:10:59,480 --> 00:11:04,210
The assumption is that all attributes contribute
equally and independently to the outcome.
145
00:11:04,210 --> 00:11:09,710
That works surprisingly well, even in situations
where the independence assumption is clearly violated.
146
00:11:11,040 --> 00:11:13,520
Why does it work so well when the assumption
is wrong?
147
00:11:13,520 --> 00:11:15,450
That's a good question.
148
00:11:15,450 --> 00:11:19,170
Basically, classification doesn't need accurate
probability estimates.
149
00:11:19,170 --> 00:11:25,110
We're just going to choose as the class the
outcome with the largest probability.
150
00:11:25,110 --> 00:11:29,600
As long as the greatest probability is assigned
to the correct class, it doesn't matter if
151
00:11:29,600 --> 00:11:33,540
the probability estimates aren't all that accurate.
152
00:11:33,540 --> 00:11:38,330
Having said that, if you add redundant
attributes you do get problems with Naive Bayes.
153
00:11:38,330 --> 00:11:44,630
The extreme case of dependence is where two
attributes have the same values -- identical
154
00:11:44,630 --> 00:11:46,160
attributes.
155
00:11:46,160 --> 00:11:49,780
That will cause havoc with the Naive Bayes
method.
156
00:11:49,780 --> 00:11:54,550
However, Weka contains methods for attribute
selection to allow you to select a subset
157
00:11:54,550 --> 00:12:00,100
of fairly independent attributes, after which
you can safely use Naive Bayes.
158
00:12:01,610 --> 00:12:07,100
There's quite a bit of stuff on statistical
modeling in Section 4.2 of the course text.
159
00:12:07,890 --> 00:12:12,530
Now you need to go and do that activity.
160
00:12:12,530 --> 00:12:14,070
See you soon!