1 00:00:18,419 --> 00:00:22,310 Hi! Welcome back for another five minutes in New Zealand 2 00:00:22,310 --> 00:00:24,330 with Data Mining with Weka. 3 00:00:24,330 --> 00:00:28,499 This is Lesson 1.3, and we're going to look at exploring datasets 4 00:00:28,499 --> 00:00:32,230 in this lesson. 5 00:00:32,230 --> 00:00:36,130 We looked at this data file in the last lesson. It's the 6 00:00:36,130 --> 00:00:38,280 weather data 7 00:00:38,280 --> 00:00:42,580 toy dataset, of course. It has fourteen days, or 8 00:00:42,580 --> 00:00:46,149 instances, and each instance, each day, is described by 9 00:00:46,149 --> 00:00:47,750 five attributes, 10 00:00:47,750 --> 00:00:49,070 four to do with the weather, and 11 00:00:49,070 --> 00:00:51,570 the last attribute, 12 00:00:51,570 --> 00:00:53,809 which we called the class value, 13 00:00:53,809 --> 00:00:57,490 the thing that we're trying to predict, whether or not to play this 14 00:00:57,490 --> 00:00:59,860 unspecified game. 15 00:00:59,860 --> 00:01:03,300 This is called a classification problem. 16 00:01:03,300 --> 00:01:05,330 We're trying to predict the class value. 17 00:01:05,330 --> 00:01:07,340 Let's open up Weka. 18 00:01:07,340 --> 00:01:09,320 It's here on my desktop. 19 00:01:09,320 --> 00:01:11,340 I'm going to go into the Explorer. 20 00:01:11,340 --> 00:01:13,150 We always use the Explorer. 21 00:01:13,150 --> 00:01:15,560 I'm going to open the file. 22 00:01:15,560 --> 00:01:20,460 I put the datasets in My Documents folder, so I can see them here. 23 00:01:20,460 --> 00:01:21,400 Just open 24 00:01:21,400 --> 00:01:26,100 the Weka datasets and the nominal weather data. 25 00:01:26,100 --> 00:01:30,070 There's the weather data in Weka. 
26 00:01:30,070 --> 00:01:31,430 As we saw last time, 27 00:01:31,430 --> 00:01:33,190 28 00:01:33,190 --> 00:01:37,140 you can see the size of the dataset, the number of instances (fourteen), 29 00:01:37,140 --> 00:01:39,270 you can see the attributes, 30 00:01:39,270 --> 00:01:41,900 you can click any of these attributes 31 00:01:41,900 --> 00:01:44,370 and get the values for those attributes 32 00:01:44,370 --> 00:01:46,700 up here in this panel. 33 00:01:46,700 --> 00:01:52,280 You also get at the bottom a histogram of the attribute values 34 00:01:52,280 --> 00:01:55,390 with respect to the different class values. The different class 35 00:01:55,390 --> 00:01:56,790 values are 36 00:01:56,790 --> 00:02:00,490 blue for yes, play and 37 00:02:00,490 --> 00:02:03,480 red for no, don't play. 38 00:02:03,480 --> 00:02:04,420 By default, 39 00:02:04,420 --> 00:02:07,370 the last attribute in Weka is the class value. 40 00:02:07,370 --> 00:02:10,940 You can change this if you like. If you change it here you can decide to 41 00:02:10,940 --> 00:02:17,549 predict an attribute other than the last one. 42 00:02:17,549 --> 00:02:23,549 That's the weather dataset, and we've already explored that. 43 00:02:23,549 --> 00:02:27,540 As I said, it's a classification problem, sometimes called a supervised learning 44 00:02:27,540 --> 00:02:29,310 problem. Supervised 45 00:02:29,310 --> 00:02:30,770 because you get to know the 46 00:02:30,770 --> 00:02:34,379 class values of the training instances. 47 00:02:34,379 --> 00:02:38,639 We take as input a dataset of classified examples; 48 00:02:38,639 --> 00:02:42,199 these examples are independent examples with a class value attached. 49 00:02:42,199 --> 00:02:43,339 50 00:02:43,339 --> 00:02:47,089 The idea is to produce automatically 51 00:02:47,089 --> 00:02:48,409 some kind of model 52 00:02:48,409 --> 00:02:50,629 that can classify new examples. 53 00:02:50,629 --> 00:02:52,959 That's the classification problem. 
54 00:02:52,959 --> 00:02:57,259 Here is what the examples look like. This is an instance, with 55 00:02:57,259 --> 00:02:59,389 the different attribute values, 56 00:02:59,389 --> 00:03:01,019 a fixed set of features, 57 00:03:01,019 --> 00:03:02,290 and then we add to that 58 00:03:02,290 --> 00:03:05,589 the class to get the classified example. 59 00:03:05,589 --> 00:03:10,499 That's what we have to have in our training dataset. 60 00:03:10,499 --> 00:03:11,360 61 00:03:11,360 --> 00:03:14,920 These attributes or features can be discrete or continuous. 62 00:03:14,920 --> 00:03:15,879 What we 63 00:03:15,879 --> 00:03:18,659 looked at in the weather data were 64 00:03:18,659 --> 00:03:20,560 discrete, or as we call them, nominal, 65 00:03:20,560 --> 00:03:23,870 attribute values, where they belong to a certain fixed set, 66 00:03:23,870 --> 00:03:25,499 or they can be numeric 67 00:03:25,499 --> 00:03:27,949 or continuous values. 68 00:03:27,949 --> 00:03:32,339 Also, the class can be discrete or continuous. We're looking at a discrete class, 69 00:03:32,339 --> 00:03:36,169 yes or no, in the case of the weather data. Another kind of machine 70 00:03:36,169 --> 00:03:37,800 learning problem would involve 71 00:03:37,800 --> 00:03:41,010 continuous classes, where you're trying to predict a number. 72 00:03:41,010 --> 00:03:43,470 That's called a regression problem 73 00:03:43,470 --> 00:03:45,439 in the trade. 74 00:03:45,439 --> 00:03:48,859 I'm going to have a look at a similar 75 00:03:48,859 --> 00:03:52,509 dataset to the weather dataset: 76 00:03:52,509 --> 00:03:53,209 the numeric weather 77 00:03:53,209 --> 00:03:54,829 dataset. 78 00:03:54,829 --> 00:03:57,979 Let me just open that in Weka, 79 00:03:57,979 --> 00:04:00,739 weather.numeric.arff. 80 00:04:00,739 --> 00:04:02,840 Here it is. It's very similar, 81 00:04:02,840 --> 00:04:05,389 almost identical in fact, 82 00:04:05,389 --> 00:04:09,329 with 14 instances, 5 attributes, the same attributes. 
83 00:04:09,329 --> 00:04:12,229 Maybe I should just look at this dataset 84 00:04:12,229 --> 00:04:13,769 in the edit panel. 85 00:04:13,769 --> 00:04:17,600 You can see here that two of the attributes, temperature and humidity, 86 00:04:17,600 --> 00:04:21,739 are numeric attributes, whereas previously they were nominal 87 00:04:21,739 --> 00:04:25,660 attributes. So here there are numbers. 88 00:04:25,660 --> 00:04:29,830 When we look at the attribute values for outlook, just as 89 00:04:29,830 --> 00:04:30,719 before, we have 90 00:04:30,719 --> 00:04:32,729 sunny, overcast and rainy. 91 00:04:32,729 --> 00:04:36,159 For temperature, though, we can't enumerate the values; 92 00:04:36,159 --> 00:04:38,189 there are too many numbers to enumerate. 93 00:04:38,189 --> 00:04:42,910 We have the minimum and maximum value, mean, and standard deviation. 94 00:04:42,910 --> 00:04:44,740 That's what Weka gives you 95 00:04:44,740 --> 00:04:46,039 for the numeric values. 96 00:04:46,039 --> 00:04:49,939 97 00:04:49,939 --> 00:04:53,099 I'm going to look at a different dataset. 98 00:04:53,099 --> 00:04:57,360 I'm going to look at the glass dataset, which is a rather more extensive dataset. 99 00:04:57,360 --> 00:04:59,639 It's a real-world dataset, 100 00:04:59,639 --> 00:05:02,610 not a terribly big one. 101 00:05:02,610 --> 00:05:04,189 Let's open it. 102 00:05:04,189 --> 00:05:07,150 Here we've got 214 instances 103 00:05:07,150 --> 00:05:09,520 and 10 attributes. 104 00:05:09,520 --> 00:05:13,229 Here are the 10 attributes; it's not clear what they are. 105 00:05:13,229 --> 00:05:15,529 Let's look at the class, 106 00:05:15,529 --> 00:05:17,650 by default the last 107 00:05:17,650 --> 00:05:20,120 attribute shown. 108 00:05:20,120 --> 00:05:24,400 There are seven values for the class, and the labels of these values give 109 00:05:24,400 --> 00:05:26,819 you some indication of what this dataset is about. 
110 00:05:26,819 --> 00:05:31,469 We have headlamps, tableware, and containers. 111 00:05:31,469 --> 00:05:34,250 Then we have building and vehicle windows, 112 00:05:34,250 --> 00:05:36,080 both float and non-float. 113 00:05:36,080 --> 00:05:37,560 You may not know this, but there are 114 00:05:37,560 --> 00:05:40,349 different ways of making glass, and 115 00:05:40,349 --> 00:05:43,439 the floating process is a way of making glass. 116 00:05:43,439 --> 00:05:47,050 These are seven different kinds of glass. 117 00:05:47,050 --> 00:05:50,209 What are the attribute values? 118 00:05:50,209 --> 00:05:52,570 I don't know what you remember about physics, 119 00:05:52,570 --> 00:05:53,679 120 00:05:53,679 --> 00:05:55,850 and I guess it doesn't matter if you don't remember. 121 00:05:55,850 --> 00:05:59,369 RI stands for the refractive index. 122 00:05:59,369 --> 00:06:02,399 It's always a good idea to check for reasonableness when you're looking at 123 00:06:02,399 --> 00:06:04,739 datasets. It's really important to 124 00:06:04,739 --> 00:06:06,830 get down and dirty with your data. 125 00:06:06,830 --> 00:06:10,620 Here we're looking at the values of the refractive index—a minimum of 1.511, 126 00:06:10,620 --> 00:06:12,199 127 00:06:12,199 --> 00:06:14,650 a maximum of 1.534. 128 00:06:14,650 --> 00:06:16,580 It's good to think about whether these are 129 00:06:16,580 --> 00:06:20,310 reasonable values for refractive index. If you go to the web and have a look around, 130 00:06:20,310 --> 00:06:21,710 you'll find that these are 131 00:06:21,710 --> 00:06:22,699 good values for 132 00:06:22,699 --> 00:06:24,539 the refractive index. 133 00:06:24,539 --> 00:06:25,940 Na. 134 00:06:25,940 --> 00:06:29,429 If you did chemistry, you'll recognize Na as sodium. 135 00:06:29,429 --> 00:06:33,350 Here, it looks like these are percentages, 136 00:06:33,350 --> 00:06:36,370 the different percentages of sodium. 
137 00:06:36,370 --> 00:06:38,610 Magnesium, Mg, 138 00:06:38,610 --> 00:06:43,159 and so on. We would expect silicon (Si) 139 00:06:43,159 --> 00:06:47,669 to make up the majority of glass. It varies between 69.81% 140 00:06:47,669 --> 00:06:49,169 141 00:06:49,169 --> 00:06:51,229 and 75.41%. 142 00:06:51,229 --> 00:06:57,289 These are percentages of different elements in the glass. 143 00:06:57,289 --> 00:07:02,240 We can confirm our guesses here by looking at the data file itself. 144 00:07:02,240 --> 00:07:04,569 Let me just find the glass data. 145 00:07:04,569 --> 00:07:07,379 It's in Weka datasets, 146 00:07:07,379 --> 00:07:08,030 147 00:07:08,030 --> 00:07:09,599 148 00:07:09,599 --> 00:07:12,419 and it's glass.arff. 149 00:07:12,419 --> 00:07:14,580 150 00:07:14,580 --> 00:07:15,619 This is the ARFF 151 00:07:15,619 --> 00:07:17,419 file format. 152 00:07:17,419 --> 00:07:20,479 It starts with a bunch of comments about 153 00:07:20,479 --> 00:07:24,969 the glass database. These lines beginning with percent signs (%) are comments. 154 00:07:24,969 --> 00:07:27,689 You can read about this. We don't have time to read it now. 155 00:07:27,689 --> 00:07:31,209 You can read about the attributes, and it does say that 156 00:07:31,209 --> 00:07:32,570 the attributes are 157 00:07:32,570 --> 00:07:36,679 refractive index, sodium, magnesium, and so on. 158 00:07:36,679 --> 00:07:39,050 And the type of glass, just like I said, is about 159 00:07:39,050 --> 00:07:45,839 windows, containers, and tableware, and so on. 160 00:07:45,839 --> 00:07:48,999 We can get down to the end of the comments, 161 00:07:48,999 --> 00:07:53,249 and here we have stuff for Weka. This is the ARFF format. The relation has a 162 00:07:53,249 --> 00:07:54,479 name; 163 00:07:54,479 --> 00:07:57,219 you'll see it printed in the interface when you look. 
164 00:07:57,219 --> 00:08:01,119 The attributes are defined; they are real-valued attributes, 165 00:08:01,119 --> 00:08:03,269 numeric attributes. 166 00:08:03,269 --> 00:08:04,439 The type 167 00:08:04,439 --> 00:08:08,440 attribute is nominal, and the different values of type are 168 00:08:08,440 --> 00:08:11,599 enumerated here in quotes. 169 00:08:11,599 --> 00:08:14,979 That defines the relation and the attributes. Then we have an 170 00:08:14,979 --> 00:08:19,459 '@data' line, and following that, in the ARFF format, are simply the instances, 171 00:08:19,459 --> 00:08:24,239 one after the other, with the attribute values all on one line, ending with 172 00:08:24,239 --> 00:08:26,430 the class by default. This is the 173 00:08:26,430 --> 00:08:29,219 class value for the first instance. 174 00:08:29,219 --> 00:08:31,889 I think there are 214 175 00:08:31,889 --> 00:08:33,670 instances here. 176 00:08:33,670 --> 00:08:37,030 There's the last one. 177 00:08:37,030 --> 00:08:39,829 That's the ARFF format. It is a very simple, 178 00:08:39,829 --> 00:08:43,040 textual file format. 179 00:08:43,040 --> 00:08:46,870 Now we've confirmed our guesses about these numbers being percentages 180 00:08:46,870 --> 00:08:49,460 of different elements. 181 00:08:49,460 --> 00:08:52,420 We can think about 182 00:08:52,420 --> 00:08:56,310 this some more. It's important, then, that these numbers are 183 00:08:56,310 --> 00:09:00,520 reasonable. If they went negative, for example, 184 00:09:00,520 --> 00:09:03,670 that would indicate some kind of corrupted value. You can't have a negative 185 00:09:03,670 --> 00:09:04,820 percentage. 186 00:09:04,820 --> 00:09:08,560 We're expecting silicon to be the majority component; 187 00:09:08,560 --> 00:09:12,290 we're expecting the refractive index to be in this kind of range. 
It's always a good 188 00:09:12,290 --> 00:09:14,749 idea when you get a dataset to just 189 00:09:14,749 --> 00:09:16,870 click around in the Weka interface 190 00:09:16,870 --> 00:09:20,090 and make sure things look real. Rather small amounts 191 00:09:20,090 --> 00:09:24,220 of aluminum in glass. I guess that's not surprising; 192 00:09:24,220 --> 00:09:27,260 I don't know very much about glass myself. 193 00:09:27,260 --> 00:09:29,839 We're just kind of checking for reasonableness here— 194 00:09:29,839 --> 00:09:36,440 a very good thing to do. 195 00:09:36,440 --> 00:09:37,180 That's it then. 196 00:09:37,180 --> 00:09:40,670 In this lesson, we've looked at the classification problem. 197 00:09:40,670 --> 00:09:44,199 We've looked at the nominal weather data and the numeric weather data. 198 00:09:44,199 --> 00:09:47,400 We've talked about nominal versus numeric attributes, 199 00:09:47,400 --> 00:09:48,090 and we've 200 00:09:48,090 --> 00:09:50,820 talked about the ARFF file format. 201 00:09:50,820 --> 00:09:52,680 We've looked at the glass.arff 202 00:09:52,680 --> 00:09:54,030 dataset, 203 00:09:54,030 --> 00:09:57,970 and I've talked about sanity checking of attributes, and the importance of 204 00:09:57,970 --> 00:10:00,850 getting down and dirty with your data. 205 00:10:00,850 --> 00:10:04,410 If you'd like some further background on this, you can read Section 11.1 206 00:10:04,410 --> 00:10:08,130 of the text and read about Preparing the data and Loading the data 207 00:10:08,130 --> 00:10:10,080 into the Explorer. 208 00:10:10,080 --> 00:10:11,429 Whether or not you do that, 209 00:10:11,429 --> 00:10:16,640 please go and look at the activity associated with this lesson. 210 00:10:16,640 --> 00:10:23,000 We'll see you soon. Bye!
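
The ARFF layout and the sanity checks described in this lesson can be sketched in a few lines of Python. This is only a rough illustration, not how Weka itself reads files: the sample text, its relation name, its class labels, and every number in it are made up for the example (they are not rows from glass.arff), and the helper names `parse_arff`, `numeric_summary`, and `find_negative_values` are hypothetical. The parser handles just the simple case from the lesson: `%` comments, `@relation`, `@attribute`, `@data`, and comma-separated instances without quoted values.

```python
import statistics

# Made-up sample in the ARFF layout; values are illustrative, not real glass data.
SAMPLE_ARFF = """\
% Lines beginning with a percent sign are comments.
@relation glass_sample
@attribute RI numeric
@attribute Na numeric
@attribute Si numeric
@attribute Type {window, headlamps}
@data
1.52,13.6,71.8,window
1.51,14.1,73.0,headlamps
1.53,-2.0,72.4,headlamps
"""


def parse_arff(text):
    """Return (relation name, [(attribute name, type)], [instance rows])."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True  # everything after this line is an instance
    return relation, attributes, rows


def is_numeric(kind):
    return kind.lower() in ("numeric", "real")


def numeric_summary(attributes, rows, name):
    """Minimum, maximum, mean, and standard deviation of one numeric attribute."""
    index = [attr_name for attr_name, _ in attributes].index(name)
    values = [float(row[index]) for row in rows]
    return min(values), max(values), statistics.mean(values), statistics.stdev(values)


def find_negative_values(attributes, rows):
    """Flag negative numbers, which cannot be valid percentages."""
    problems = []
    for row_index, row in enumerate(rows):
        for (name, kind), value in zip(attributes, row):
            if is_numeric(kind) and float(value) < 0:
                problems.append((row_index, name, float(value)))
    return problems


if __name__ == "__main__":
    relation, attributes, rows = parse_arff(SAMPLE_ARFF)
    print(relation, len(rows), "instances,", len(attributes), "attributes")
    print("Si summary:", numeric_summary(attributes, rows, "Si"))
    print("Suspicious values:", find_negative_values(attributes, rows))
```

The third made-up instance carries a negative sodium value on purpose, so the check has something to report. On a real dataset the same kind of check would sit alongside clicking around in the Explorer, not replace it.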