1
00:00:17,560 --> 00:00:21,910
Hi! This is Lesson 3.3 on using probabilities.
2
00:00:21,910 --> 00:00:26,160
It's the one bit of Data Mining with Weka
where we're going to see a little bit of mathematics,
3
00:00:26,160 --> 00:00:31,609
but don't worry, I'll take you through it
gently.
4
00:00:31,609 --> 00:00:36,879
The OneR strategy that we've just been studying
assumes that there is one of the attributes
5
00:00:36,879 --> 00:00:40,930
that does all the work, that takes responsibility
for the decision.
6
00:00:41,520 --> 00:00:43,120
That's a simple strategy.
7
00:00:43,120 --> 00:00:48,420
Another simple strategy is the opposite: to
assume that all of the attributes contribute equally
8
00:00:48,420 --> 00:00:51,659
and independently to the decision.
9
00:00:51,659 --> 00:00:54,229
This is called the "Naive Bayes" method --
10
00:00:54,229 --> 00:00:55,869
I'll explain the name later on.
11
00:00:56,580 --> 00:01:02,159
There are two assumptions that underlie Naive
Bayes: that the attributes are equally important
12
00:01:02,159 --> 00:01:05,070
and that they are statistically independent,
13
00:01:05,070 --> 00:01:09,909
that is, knowing the value of one of the attributes
doesn't tell you anything about the value
14
00:01:09,909 --> 00:01:12,619
of any of the other attributes.
15
00:01:12,619 --> 00:01:17,780
This independence assumption is never actually
correct, but the method based on it often
16
00:01:17,780 --> 00:01:23,509
works well in practice.
17
00:01:23,509 --> 00:01:30,159
There's a theorem in probability called "Bayes
Theorem" after this guy Thomas Bayes from the
18
00:01:30,159 --> 00:01:33,030
18th century.
19
00:01:33,030 --> 00:01:39,369
It's about the probability of a hypothesis
H given evidence E.
20
00:01:39,369 --> 00:01:46,100
In our case, the hypothesis is the class of
an instance and the evidence is the attribute
21
00:01:46,100 --> 00:01:48,899
values of the instance.
22
00:01:48,899 --> 00:01:55,319
The theorem is that Pr[H|E] -- the probability of the class
given the instance, the hypothesis
23
00:01:55,319 --> 00:02:02,109
given the evidence -- is equal to Pr[E|H] times Pr[H] divided
24
00:02:02,109 --> 00:02:06,119
by Pr[E].
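(For reference, here's the theorem written out as a display equation -- this is just the statement above in standard notation:)

```latex
\Pr[H \mid E] = \frac{\Pr[E \mid H]\,\Pr[H]}{\Pr[E]}
```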
25
00:02:06,119 --> 00:02:13,119
Pr[H] by itself is called the [prior] probability
of the hypothesis H.
26
00:02:13,290 --> 00:02:18,480
That's the probability of the event before
any evidence is seen.
27
00:02:18,480 --> 00:02:22,800
That's really the baseline probability of
the event.
28
00:02:22,800 --> 00:02:29,370
For example, in the weather data, I think
there are 9 yeses and 5 nos, so the baseline
29
00:02:29,370 --> 00:02:38,280
probability of the hypothesis "play equals
yes" is 9/14 and "play equals no" is 5/14.
30
00:02:38,280 --> 00:02:44,920
What this equation says is how to update that
probability Pr[H] when you see some evidence,
31
00:02:44,920 --> 00:02:51,340
to get what's called the "a posteriori" probability
of H, that means after the evidence.
32
00:02:51,340 --> 00:02:58,340
The evidence in our case is the attribute
values of an unknown instance. That's E.
33
00:03:01,159 --> 00:03:02,129
That's Bayes Theorem.
34
00:03:02,129 --> 00:03:08,430
Now, what makes this method "naive"? The naive
assumption is -- I've said it before -- that the
35
00:03:08,430 --> 00:03:13,140
evidence splits into parts that are statistically
independent.
36
00:03:13,140 --> 00:03:19,390
The parts of the evidence in our case are
the four different attribute values in the
37
00:03:19,390 --> 00:03:20,950
weather data.
38
00:03:20,950 --> 00:03:28,280
When you have independent events, the probabilities
multiply, so Pr[H|E],
39
00:03:28,280 --> 00:03:33,719
according to the top equation, is the product
of Pr[E|H] times the prior probability
40
00:03:33,719 --> 00:03:37,379
Pr[H] divided by Pr[E].
41
00:03:37,379 --> 00:03:43,079
Pr[E|H] splits up into
these parts: Pr[E1|H],
42
00:03:43,079 --> 00:03:48,030
the first attribute value; Pr[E2|H],
the second attribute value; and so on for all
43
00:03:48,030 --> 00:03:51,030
of the attributes.
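(Written out in full, the independence assumption factorizes the numerator into one term per attribute value -- n = 4 for the weather data:)

```latex
\Pr[H \mid E] = \frac{\Pr[E_1 \mid H]\,\Pr[E_2 \mid H]\cdots\Pr[E_n \mid H]\,\Pr[H]}{\Pr[E]}
```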
44
00:03:51,030 --> 00:03:56,650
That's maybe a bit abstract, so let's look at
the actual weather data.
45
00:03:56,650 --> 00:03:59,829
On the right-hand side is the weather data.
46
00:03:59,829 --> 00:04:03,930
In the large table at the top, we've taken
each of the attributes.
47
00:04:03,930 --> 00:04:09,799
Let's start with "outlook". Under the "yes" hypothesis and the "no" hypothesis, we've looked at
48
00:04:09,799 --> 00:04:11,959
how many times the outlook is "sunny".
49
00:04:11,959 --> 00:04:14,849
It's sunny twice under yes and 3 times under no.
50
00:04:14,849 --> 00:04:18,220
That comes straight from the data in the table.
51
00:04:18,220 --> 00:04:19,840
Overcast.
52
00:04:19,840 --> 00:04:25,120
When the outlook is overcast, it's always
a "yes" instance, so there were 4 of those,
53
00:04:25,120 --> 00:04:26,950
and zero "no" instances.
54
00:04:26,950 --> 00:04:31,250
Then, rainy is 3 "yes" instances and 2 "no"
instances.
55
00:04:31,250 --> 00:04:35,979
Those numbers just come straight from the
data table given the instance values.
56
00:04:35,979 --> 00:04:40,380
Then, we take those numbers and underneath
we make them into probabilities.
57
00:04:40,380 --> 00:04:43,259
Let's say we know the hypothesis.
58
00:04:43,259 --> 00:04:46,160
Let's say we know it's a "yes".
59
00:04:46,160 --> 00:04:52,960
Then the probability of it being "sunny" is
2/9ths, "overcast" is 4/9ths, and "rainy" 3/9ths,
60
00:04:52,960 --> 00:04:56,460
simply because when you add up 2 plus 4 plus
3 you get 9.
61
00:04:56,460 --> 00:04:59,400
Those are the probabilities.
62
00:04:59,400 --> 00:05:06,860
If we know that the outcome is "no", the probabilities
are "sunny" 3/5ths, "overcast" 0/5ths, and "rainy"
63
00:05:06,860 --> 00:05:08,340
2/5ths.
64
00:05:08,340 --> 00:05:10,169
That's for the "outlook" attribute.
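(If you'd like to see that counting step as code, here's a minimal Python sketch -- not Weka's implementation -- that turns the outlook counts from the slide into the conditional probabilities Pr[E|H]:)

```python
# Minimal sketch (not Weka's code): turn the "outlook" counts from the
# slide into conditional probabilities Pr[outlook value | class].
outlook_counts = {
    "yes": {"sunny": 2, "overcast": 4, "rainy": 3},
    "no":  {"sunny": 3, "overcast": 0, "rainy": 2},
}

for hypothesis, counts in outlook_counts.items():
    total = sum(counts.values())  # 9 under "yes", 5 under "no"
    for value, count in counts.items():
        print(f"Pr[outlook={value} | {hypothesis}] = {count}/{total}"
              f" = {count / total:.3f}")
```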
65
00:05:11,740 --> 00:05:18,060
That's what we're looking for, you see, the
probability of each of these attribute values
66
00:05:18,060 --> 00:05:21,729
given the hypothesis H.
67
00:05:21,729 --> 00:05:25,889
The next attribute is temperature, and we
just do the same thing with that to get the
68
00:05:25,889 --> 00:05:30,729
probabilities of the 3 values -- hot, mild,
and cool -- under the "yes" hypothesis or the
69
00:05:30,729 --> 00:05:32,199
"no" hypothesis.
70
00:05:32,199 --> 00:05:39,960
The same with humidity and windy. Play,
that's the prior probability -- Pr[H].
71
00:05:39,960 --> 00:05:45,669
It's "yes" 9/14ths of the time, "no" 5/14ths of the
time, even if you don't know anything about
72
00:05:45,669 --> 00:05:47,810
the attribute values.
73
00:05:47,810 --> 00:05:52,669
The equation we're looking at is this one
below, and we just need to work it out.
74
00:05:52,669 --> 00:05:54,090
Here's an example.
75
00:05:54,090 --> 00:05:56,970
Here's an unknown day, a new day.
76
00:05:56,970 --> 00:06:03,970
We don't know what the value of "play" is, but
we know it's sunny, cool, high, and windy.
77
00:06:05,280 --> 00:06:07,509
We can just multiply up these probabilities.
78
00:06:07,509 --> 00:06:13,819
If we multiply for the yes hypothesis, we
get 2/9th times 3/9ths times 3/9ths times
79
00:06:13,819 --> 00:06:22,300
3/9ths -- those are just the numbers from the
previous slide: Pr[E1|H], Pr[E2|H], Pr[E3|H],
80
00:06:22,300 --> 00:06:28,400
Pr[E4|H] -- and finally Pr[H], which is 9/14ths.
81
00:06:28,400 --> 00:06:36,560
That gives us a likelihood of 0.0053 when
you multiply them.
82
00:06:36,560 --> 00:06:43,560
Then, for the "no" class, we do the same to
get a likelihood of 0.0206.
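(Those two likelihoods are easy to check; here's the arithmetic as a small Python sketch, with the fractions taken straight from the tables on the slide:)

```python
from fractions import Fraction as F

# Likelihood of "yes" for a sunny, cool, high-humidity, windy day:
# Pr[E1|H] * Pr[E2|H] * Pr[E3|H] * Pr[E4|H] * Pr[H]
like_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)

# Likelihood of "no" for the same day, using the "no" columns:
like_no = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

print(float(like_yes))  # ~0.0053
print(float(like_no))   # ~0.0206
```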
83
00:06:44,120 --> 00:06:46,720
These numbers are not probabilities.
84
00:06:46,720 --> 00:06:48,129
Probabilities have to add up to 1.
85
00:06:48,129 --> 00:06:49,639
They are likelihoods.
86
00:06:49,639 --> 00:06:55,610
But we can get the probabilities from them
by using a straightforward normalization technique.
87
00:06:55,610 --> 00:06:56,500
We take those likelihoods for "yes"
88
00:06:56,500 --> 00:07:02,440
and "no" and we normalize them as shown below
to make them add up to 1.
89
00:07:02,440 --> 00:07:09,440
That's how we get the probability of "play"
on a new day with different attribute values.
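(Normalization is just dividing each likelihood by their sum; reusing the two numbers above:)

```python
like_yes, like_no = 0.0053, 0.0206

total = like_yes + like_no
print(like_yes / total)  # ~0.205 -- probability of "yes"
print(like_no / total)   # ~0.795 -- probability of "no"
```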
90
00:07:10,030 --> 00:07:11,380
Just to go through that again.
91
00:07:11,380 --> 00:07:17,340
The evidence is "outlook" is "sunny", "temperature"
is "cool", "humidity" is "high", "windy" is "true" --
92
00:07:17,340 --> 00:07:19,550
and we don't know what play is.
93
00:07:19,550 --> 00:07:26,990
The [likelihood] of a "yes" given the evidence
is the product of those 4 probabilities -- one
94
00:07:26,990 --> 00:07:33,000
for outlook, temperature, humidity and windy
-- times the prior probability, which is
95
00:07:33,000 --> 00:07:37,000
just the baseline probability of a "yes".
96
00:07:37,000 --> 00:07:40,650
That product of fractions is divided by Pr[E].
97
00:07:40,650 --> 00:07:45,160
We don't know what Pr[E] is, but it doesn't
matter, because we can do the same calculation
98
00:07:45,160 --> 00:07:52,240
for the "no" hypothesis, which gives us another
equation just like this, and then we can calculate
99
00:07:52,240 --> 00:07:56,870
the actual probabilities by normalizing them
so that the two probabilities add up to 1.
100
00:07:56,870 --> 00:08:01,560
Pr[E] for "yes" plus Pr[E] for "no" equals 1.
101
00:08:02,220 --> 00:08:07,850
It's actually quite simple when you look at
it in numbers, and it's simple when you look
102
00:08:07,850 --> 00:08:09,660
at it in Weka, as well.
103
00:08:09,660 --> 00:08:15,490
I'm going to go to Weka here, and I'm going
to open the nominal weather data,
104
00:08:15,490 --> 00:08:19,920
which is here.
105
00:08:19,920 --> 00:08:22,540
We've seen that before, of course, many times.
106
00:08:22,540 --> 00:08:25,590
I'm going to go to Classify.
107
00:08:25,590 --> 00:08:29,150
I'm going to use the NaiveBayes method.
108
00:08:29,150 --> 00:08:30,800
It's under this bayes category here.
109
00:08:30,800 --> 00:08:34,280
There are a lot of implementations of different
variants of Bayes.
110
00:08:34,280 --> 00:08:38,240
I'm just going to use the straightforward
NaiveBayes method here.
111
00:08:38,650 --> 00:08:42,480
I'll just run it.
112
00:08:42,480 --> 00:08:43,960
This is what we get.
113
00:08:44,870 --> 00:08:48,170
We get the success probability, calculated
according to cross-validation.
114
00:08:48,170 --> 00:08:51,570
More interestingly, we get the model.
115
00:08:51,570 --> 00:08:56,900
The model is just like the table I showed
you before, divided under the "yes" class and
116
00:08:56,900 --> 00:08:58,320
the "no" class.
117
00:08:58,320 --> 00:09:04,600
We've got the four attributes -- outlook,
temperature, humidity, and windy -- and then,
118
00:09:04,600 --> 00:09:10,020
for each of the attribute values, we've got
the number of times that attribute value appears.
119
00:09:10,630 --> 00:09:15,400
Now, there's one small but important difference
between this table and the one I showed you before.
121
00:09:15,420 --> 00:09:18,490
Let me go back to my slide and look at these
numbers.
122
00:09:18,490 --> 00:09:26,670
You can see that for outlook under "yes" on
my slide, I've got 2, 4, and 3, and Weka has
123
00:09:26,670 --> 00:09:29,410
got 3, 5, and 4.
124
00:09:29,410 --> 00:09:35,960
That's 1 more each time for a total of 12,
instead of a total of 9.
125
00:09:35,960 --> 00:09:39,410
Weka adds 1 to all of the counts.
126
00:09:39,410 --> 00:09:42,990
The reason it does this is to get
rid of the zeros.
127
00:09:42,990 --> 00:09:50,580
In the original table under outlook, the
probability of "overcast" given "no" is
128
00:09:50,580 --> 00:09:53,670
zero, and we're going to be multiplying that
into things.
129
00:09:53,670 --> 00:09:58,200
What that would mean in effect, if we took
that zero at face value, is that the probability
130
00:09:58,200 --> 00:10:06,050
of the class being "no" given any day for which
the outlook was overcast would be zero.
131
00:10:06,050 --> 00:10:09,230
Anything multiplied by zero is zero.
132
00:10:09,230 --> 00:10:13,970
These zeros in probability terms have sort
of a veto over all of the other numbers, and
133
00:10:13,970 --> 00:10:14,940
we don't want that.
134
00:10:14,940 --> 00:10:21,010
We don't want to categorically conclude that
it can't be a "no" day just because it's overcast
135
00:10:21,010 --> 00:10:25,590
and we've never seen an overcast outlook on
a "no" day before.
136
00:10:26,270 --> 00:10:30,800
That's called the "zero-frequency problem", and
Weka's solution -- the most common solution
137
00:10:30,800 --> 00:10:34,650
-- is very simple, we just add 1 to all the
counts.
138
00:10:34,650 --> 00:10:39,690
That's why all those numbers in the Weka table
are 1 bigger than the numbers in the table
139
00:10:39,690 --> 00:10:41,290
on the slide.
140
00:10:42,030 --> 00:10:45,540
Aside from that, it's all exactly the same.
141
00:10:45,540 --> 00:10:50,780
We're avoiding zero frequencies by effectively
starting all counts at 1 instead of starting
142
00:10:50,780 --> 00:10:56,480
them at 0, so they can't end up at 0.
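(Here's that add-one fix as a Python sketch -- a hand-worked illustration of what Weka's table shows, not its actual implementation:)

```python
# Add-one ("Laplace") smoothing for the outlook counts under "no".
# Starting every count at 1 instead of 0 means no probability can be 0.
raw = {"sunny": 3, "overcast": 0, "rainy": 2}

smoothed = {value: count + 1 for value, count in raw.items()}
total = sum(smoothed.values())  # 8 instead of 5

for value, count in smoothed.items():
    print(f"Pr[outlook={value} | no] = {count}/{total} = {count/total:.3f}")
```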
143
00:10:57,090 --> 00:10:59,480
That's the Naive Bayes method.
144
00:10:59,480 --> 00:11:04,210
The assumption is that all attributes contribute
equally and independently to the outcome.
145
00:11:04,210 --> 00:11:09,710
That works surprisingly well, even in situations
where the independence assumption is clearly violated.
146
00:11:11,040 --> 00:11:13,520
Why does it work so well when the assumption
is wrong?
147
00:11:13,520 --> 00:11:15,450
That's a good question.
148
00:11:15,450 --> 00:11:19,170
Basically, classification doesn't need accurate
probability estimates.
149
00:11:19,170 --> 00:11:25,110
We're just going to choose as the class the
outcome with the largest probability.
150
00:11:25,110 --> 00:11:29,600
As long as the greatest probability is assigned
to the correct class, it doesn't matter if
151
00:11:29,600 --> 00:11:33,540
the probability estimates aren't all that accurate.
152
00:11:33,540 --> 00:11:38,330
Having said that, if you add redundant
attributes you do get problems with Naive Bayes.
153
00:11:38,330 --> 00:11:44,630
The extreme case of dependence is where two
attributes have the same values -- identical
154
00:11:44,630 --> 00:11:46,160
attributes.
155
00:11:46,160 --> 00:11:49,780
That will cause havoc with the Naive Bayes
method.
156
00:11:49,780 --> 00:11:54,550
However, Weka contains methods for attribute
selection to allow you to select a subset
157
00:11:54,550 --> 00:12:00,100
of fairly independent attributes, after which
you can safely use Naive Bayes.
158
00:12:01,610 --> 00:12:07,100
There's quite a bit of stuff on statistical
modeling in Section 4.2 of the course text.
159
00:12:07,890 --> 00:12:12,530
Now you need to go and do that activity.
160
00:12:12,530 --> 00:12:14,070
See you soon!