1
00:00:18,230 --> 00:00:23,180
Hi! This is Lesson 4.2 on Linear Regression.
2
00:00:23,180 --> 00:00:28,840
Back in Lesson 1.3, we actually mentioned
the difference between a classification problem
3
00:00:28,840 --> 00:00:31,300
and a regression problem.
4
00:00:31,300 --> 00:00:35,809
A classification problem is when what you're
trying to predict is a nominal value, whereas
5
00:00:35,809 --> 00:00:41,400
in a regression problem what you're trying
to predict is a numeric value.
6
00:00:41,400 --> 00:00:46,400
We've seen examples of datasets with nominal
and numeric attributes before, but we've never
7
00:00:46,400 --> 00:00:51,230
looked at the problem of regression, of trying
to predict a numeric value as the output of
8
00:00:51,230 --> 00:00:53,110
a machine learning scheme.
9
00:00:53,110 --> 00:00:56,690
That's what we're doing in this [lesson],
linear regression.
10
00:00:56,690 --> 00:01:02,859
We've only had nominal classes so far, so
now we're going to look at numeric classes.
11
00:01:02,859 --> 00:01:08,340
This is a classical statistical method, dating
back more than 2 centuries.
12
00:01:08,340 --> 00:01:11,190
This is the kind of picture you see.
13
00:01:11,190 --> 00:01:15,450
You have a cloud of data points in 2 dimensions,
and we're trying to fit a straight line to
14
00:01:15,450 --> 00:01:21,560
this cloud of data points and looking for
the best straight-line fit.
15
00:01:21,560 --> 00:01:25,590
Only in our case there might be more than
2 dimensions -- there might be many.
16
00:01:25,590 --> 00:01:28,509
It's still a standard problem.
17
00:01:28,509 --> 00:01:31,649
Let's just look at the 2-dimensional case
here.
18
00:01:31,649 --> 00:01:39,560
You can write a straight line equation in
this form, with weights w0 plus w1a1 plus
19
00:01:39,560 --> 00:01:41,179
w2a2, and so on.
20
00:01:41,179 --> 00:01:44,209
Just think about this in one dimension where
there's only one "a".
21
00:01:44,209 --> 00:01:51,209
Forget about all the things at the end here,
just consider w0 plus w1a1.
22
00:01:51,770 --> 00:01:55,560
That's the equation of this line -- it's the
equation of a straight line -- where w0 and
23
00:01:55,560 --> 00:01:59,920
w1 are two constants to be determined from
the data.
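The straight-line prediction just described can be sketched in a few lines of Python (this is not Weka itself; the weights and attribute values below are made-up numbers purely for illustration):

```python
# A minimal sketch of predicting with a linear model whose weights
# w0, w1, w2, ... have already been determined from the data.

def predict(weights, attributes):
    """Compute w0 + w1*a1 + w2*a2 + ... for one instance."""
    # a0 is defined to be 1, so weights[0] (w0) is a constant offset
    return weights[0] + sum(w * a for w, a in zip(weights[1:], attributes))

weights = [2.0, 0.5, -1.0]   # w0, w1, w2 (hypothetical values)
instance = [4.0, 3.0]        # a1, a2 for one instance
print(predict(weights, instance))  # 2.0 + 0.5*4.0 - 1.0*3.0 = 1.0
```

In one dimension this reduces to exactly the w0 + w1a1 straight line discussed above.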
24
00:01:59,920 --> 00:02:06,289
This, of course, is going to work most naturally
with numeric attributes, because we're multiplying
25
00:02:06,289 --> 00:02:08,849
these attribute values by weights.
26
00:02:08,849 --> 00:02:13,129
We'll worry about nominal attributes in just
a minute.
27
00:02:13,129 --> 00:02:19,260
We're going to calculate these weights from
the training data -- w0, w1, and w2.
28
00:02:19,260 --> 00:02:22,239
Those are what we're going to calculate from
the training data.
29
00:02:22,239 --> 00:02:27,930
Then, once we've calculated the weights, we're
going to predict the value for the first training
30
00:02:27,930 --> 00:02:29,010
instance, a1.
31
00:02:29,010 --> 00:02:31,670
The notation gets really horrendous here.
32
00:02:31,670 --> 00:02:33,599
I know it looks pretty scary, but it's pretty
simple.
33
00:02:33,599 --> 00:02:38,049
We're using this linear sum with these weights
that we've calculated, using the attribute
34
00:02:38,049 --> 00:02:45,049
values of the first [training] instance in order
to get the predicted value for that instance.
35
00:02:48,239 --> 00:02:54,450
We're going to get predicted values for the
training instances using this rather horrendous
36
00:02:54,450 --> 00:02:55,749
formula here.
37
00:02:55,749 --> 00:02:58,810
I know it looks pretty scary, but it's actually
not so scary.
38
00:02:58,810 --> 00:03:04,549
These w's are just numbers that we've calculated
from the training data, and then these things
39
00:03:04,549 --> 00:03:09,680
here are the attribute values of the first
training instance a1 -- that 1 at the top
40
00:03:09,680 --> 00:03:12,409
here means it's the first training instance.
41
00:03:12,409 --> 00:03:16,840
This 1, 2, 3 means it's the first, second,
and third attribute.
42
00:03:16,840 --> 00:03:21,170
We can write this in this neat little sum
form here, which looks a little bit better.
43
00:03:21,170 --> 00:03:28,040
Notice, by the way, that we're defining a0
-- the zeroth attribute value -- to be 1.
44
00:03:28,040 --> 00:03:31,260
That just makes this formula work.
45
00:03:31,260 --> 00:03:38,510
For the first training instance, that gives
us this number x, the predicted value for
46
00:03:38,519 --> 00:03:45,519
the first training instance and this particular
value of a1.
47
00:03:47,889 --> 00:03:54,139
Then we're choosing the weights to minimize
the squared error on the training data.
48
00:03:54,139 --> 00:03:58,639
This is the actual x value for this i'th training
instance.
49
00:03:58,639 --> 00:04:02,249
This is the predicted value for the i'th training
instance.
50
00:04:02,249 --> 00:04:05,579
We're going to take the difference between
the actual and the predicted value, square
51
00:04:05,579 --> 00:04:07,410
them up, and add them all together.
52
00:04:07,410 --> 00:04:09,680
And that's what we're trying to minimize.
53
00:04:09,680 --> 00:04:15,370
We get the weights by minimizing this sum
of squared errors.
54
00:04:15,370 --> 00:04:20,190
That's a mathematical job; we don't need to
worry about the mechanics of doing that.
55
00:04:20,190 --> 00:04:23,639
It's a standard matrix problem.
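As a sketch of that "standard matrix problem": numpy's least-squares solver finds the weights that minimize the sum of squared errors directly. The four data points here are invented for illustration:

```python
# Solving for the weights that minimize the sum of squared errors.
import numpy as np

A = np.array([[1.0], [2.0], [3.0], [4.0]])   # one attribute, 4 instances
x = np.array([3.1, 4.9, 7.2, 8.8])           # actual (numeric) class values

# Prepend a column of 1s: the a0 = 1 trick that gives us the constant w0.
A1 = np.hstack([np.ones((len(A), 1)), A])
weights, *_ = np.linalg.lstsq(A1, x, rcond=None)
# weights[0] is w0, weights[1] is w1; predictions are A1 @ weights
```

Note this needs more instances than attributes (here 4 instances, 2 weights), exactly as the lecture says.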
56
00:04:23,639 --> 00:04:26,750
It works fine if there are more instances
than attributes.
57
00:04:26,750 --> 00:04:31,660
You couldn't expect this to work if you had
a huge number of attributes and not very many instances.
58
00:04:31,669 --> 00:04:35,530
But providing there are more instances than
attributes -- and usually there are, of course
59
00:04:35,530 --> 00:04:38,110
-- that's going to work ok.
60
00:04:38,110 --> 00:04:44,170
If we did have nominal values and an attribute
was just 2-valued -- binary -- we could just
61
00:04:44,170 --> 00:04:47,170
convert it to 0 and 1 and use those numbers.
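That binary conversion is simple enough to sketch (the attribute values here are hypothetical):

```python
# Converting a 2-valued nominal attribute to 0/1 so it can be
# multiplied by a weight in the linear model.
values = ["yes", "no", "no", "yes"]
numeric = [1 if v == "yes" else 0 for v in values]
# numeric == [1, 0, 0, 1]
```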
62
00:04:47,170 --> 00:04:52,210
As for multi-valued nominal attributes, you'll
have a look at those in the activity
63
00:04:52,210 --> 00:04:58,250
at the end of this lesson.
64
00:04:58,250 --> 00:05:05,250
We're going to open a regression dataset and
see what it does: cpu.arff.
65
00:05:06,100 --> 00:05:07,400
This is a regular kind of dataset.
66
00:05:07,400 --> 00:05:11,750
It's got numeric attributes, and the most
important thing here is that it's got a numeric
67
00:05:11,750 --> 00:05:15,690
class -- we're trying to predict a numeric
value.
68
00:05:15,690 --> 00:05:22,690
We can run LinearRegression; it's in the functions
category.
69
00:05:24,060 --> 00:05:28,030
We just run it, and this is the output.
70
00:05:28,030 --> 00:05:29,530
We've got the model here.
71
00:05:29,530 --> 00:05:32,580
The class has been predicted as a linear sum.
72
00:05:32,580 --> 00:05:34,320
These are the weights I was talking about.
73
00:05:34,320 --> 00:05:39,060
It's this weight times this attribute value
plus this weight times this attribute value,
74
00:05:39,060 --> 00:05:39,960
and so on.
75
00:05:39,960 --> 00:05:46,960
Minus -- and this is w0, the constant
weight, not multiplied by any attribute value.
76
00:05:48,490 --> 00:05:51,170
This is a formula for computing the class.
77
00:05:51,170 --> 00:05:55,940
When you use that formula, you can look at
how successful it is on the training data.
78
00:05:55,940 --> 00:06:01,710
The correlation coefficient, which is a standard
statistical measure, is 0.9.
79
00:06:01,710 --> 00:06:02,700
That's pretty good.
80
00:06:02,700 --> 00:06:06,720
Then there are various other error figures
here that are printed.
81
00:06:06,720 --> 00:06:11,300
On the slide, you can see the interpretation
of these error figures.
82
00:06:11,300 --> 00:06:14,630
It's really hard to know which one to use.
83
00:06:14,630 --> 00:06:19,050
They all tend to produce the same sort of
picture, but I guess the exact one you should
84
00:06:19,050 --> 00:06:21,700
use depends on the application.
85
00:06:23,420 --> 00:06:27,900
There's the mean absolute error and the root
mean squared error, which is the standard
86
00:06:27,900 --> 00:06:33,270
metric to use.
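Those two error figures are easy to compute by hand; here's a sketch with made-up actual and predicted values:

```python
# Mean absolute error and root mean squared error, computed from
# actual vs. predicted class values.
import math

actual    = [3.0, 5.0, 7.0]
predicted = [2.5, 5.5, 8.0]

mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
rmse = math.sqrt(sum((a - p) ** 2
                     for a, p in zip(actual, predicted)) / len(actual))
# mae = (0.5 + 0.5 + 1.0) / 3 ~ 0.667;  rmse = sqrt(1.5 / 3) ~ 0.707
```

RMSE squares the differences before averaging, so it penalizes large individual errors more heavily than MAE does.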
87
00:06:33,270 --> 00:06:33,920
That's linear regression.
88
00:06:33,920 --> 00:06:38,700
I'm actually going to look at nonlinear regression
here.
89
00:06:38,700 --> 00:06:45,080
A "model tree" is a tree where each leaf has
one of these linear regression models.
90
00:06:45,080 --> 00:06:50,040
We create a tree like this, and then at each
leaf we have a linear model, which has got
91
00:06:50,040 --> 00:06:51,100
those coefficients.
92
00:06:51,100 --> 00:07:00,220
It's like a patchwork of linear models, and
this set of 6 linear patches approximates
93
00:07:00,220 --> 00:07:02,290
a continuous function.
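A toy illustration of that patchwork idea (this is not Weka's M5P; the split point and coefficients are hypothetical):

```python
# A "model tree" in miniature: a tree whose leaves each hold
# a linear regression model.
def model_tree_predict(a1):
    if a1 < 5.0:
        return 1.0 + 0.5 * a1    # linear model LM1 at the left leaf
    else:
        return -2.0 + 1.2 * a1   # linear model LM2 at the right leaf
```

Each leaf's straight line only has to fit its own region, so together the patches can approximate a curve that no single straight line could.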
94
00:07:02,290 --> 00:07:12,990
There's a method under "trees" with the rather
mysterious name of M5P.
95
00:07:12,990 --> 00:07:18,440
If we just run that, that produces a model
tree.
96
00:07:19,520 --> 00:07:23,370
Maybe I should just visualize the tree.
97
00:07:25,100 --> 00:07:32,920
Now I can see the model tree, which is similar
to the one on the slide.
98
00:07:32,920 --> 00:07:38,280
You can see that each of these -- in this
case 5 -- leaves has a linear model -- LM1,
99
00:07:38,280 --> 00:07:45,730
LM2, LM3, ... And if we look back here, the
linear models are defined like this: LM1 has
100
00:07:45,730 --> 00:07:52,730
this linear formula; this linear formula for
LM2; and so on.
101
00:07:58,510 --> 00:08:03,150
We chose trees > M5P, we ran it, and we looked
at the output.
102
00:08:03,150 --> 00:08:13,070
We could compare these performance figures
-- a correlation coefficient of about 0.92-0.93, mean absolute error
103
00:08:13,070 --> 00:08:20,360
of 30, and so on -- with the ones for regular
linear regression, which got a slightly lower
104
00:08:20,360 --> 00:08:24,960
correlation, and a slightly higher absolute
error -- in fact, I think all these error
105
00:08:24,960 --> 00:08:26,930
figures are slightly higher.
106
00:08:26,930 --> 00:08:33,930
That's something we'll be asking you to do
in the activity associated with this lesson.
107
00:08:34,220 --> 00:08:40,270
Linear regression is a well-founded, venerable
mathematical technique.
108
00:08:40,270 --> 00:08:45,540
Practical problems often require non-linear
solutions.
109
00:08:45,540 --> 00:08:50,640
The M5P method builds trees of regression
models, with linear models at each leaf of
110
00:08:50,640 --> 00:08:51,320
the tree.
111
00:08:51,320 --> 00:08:56,210
You can read about this in the course text
in Section 4.6.
112
00:08:56,210 --> 00:08:59,990
Off you go now and do the activity associated
with this lesson.
113
00:08:59,990 --> 00:09:01,160
See you soon.
114
00:09:01,160 --> 00:09:02,300
Bye!