Hi! Welcome to Lesson 5.3 of Data Mining with Weka. Before we start, I thought I'd show you where I live. I told you before that I moved to New Zealand many years ago. I live in a place called Hamilton. Let me just zoom in and see if we can find Hamilton in the North Island of New Zealand, around the center of the North Island. This is where the University of Waikato is. Here is the university; this is where I live. This is my journey to work: I cycle every morning through the countryside. As you can see, it's really nice. I live out here in the country. I'm a sheep farmer! I've got four sheep, three in the paddock and one in the freezer. I cycle in -- it takes about half an hour -- and I get to the university. I have the distinction of being able to go from one week to the next without ever seeing a traffic light, because I live out on the same edge of town as the university. When I get to the campus of the University of Waikato, it's a very beautiful campus. We've got three lakes. There are two of the lakes, and another lake down here. It's a really nice place to work! So I'm very happy here.

Let's move on to talk about data mining and ethics. In Europe, they have a lot of pretty stringent laws about information privacy. For example, if you're going to collect any personal information about anyone, a purpose must be stated. The information should not be disclosed to others without consent. Records kept on individuals must be accurate and up to date. People should be able to review data about themselves. Data should be deleted when it's no longer needed. Personal information must not be transmitted to other locations. Some data is too sensitive to be collected, except in extreme circumstances. This is true in some countries in Europe, particularly Scandinavia. It's not true, of course, in the United States.

Data mining is about collecting and utilizing recorded information, and it's good to be aware of some of these ethical issues. People often try to anonymize data so that it's safe to distribute for other people to work on, but anonymization is much harder than you think. Here's a little story for you.
When Massachusetts released medical records summarizing every state employee's hospital record in the mid-1990s, the Governor gave a public assurance that the data had been anonymized by removing all identifying information -- name, address, and social security number. He was surprised to receive his own health records, which included a lot of private information, in the mail shortly afterwards! People could be re-identified from the information that was left there.

There's been quite a bit of research done on re-identification techniques. For example, using publicly available records on the internet, 50% of Americans can be identified from their city, birth date, and sex; 85% can be identified if you include their zip code as well. There was also some interesting work done on a movie database. Netflix released a database of 100 million records of movie ratings. They got individuals to rate movies on a scale of 1 to 5, and they had a whole bunch of people doing this -- a total of 100 million records. It turned out that you could identify 99% of people in the database if you knew their ratings for 6 movies and approximately when they saw them. Even if you only knew their ratings for 2 movies, you could identify 70% of people. This means you can use the database to find out the other movies that these people watched -- and they might not want you to know that. Re-identification is remarkably powerful, and it is incredibly hard to anonymize data effectively in a way that doesn't destroy the value of the entire dataset for data mining purposes.

Of course, the purpose of data mining is to discriminate: that's what we're trying to do! We're trying to learn rules that discriminate one class from another in the data -- who gets the loan? who gets a special offer? But, of course, certain kinds of discrimination are unethical, not to mention illegal. For example, racial, sexual, and religious discrimination is certainly unethical, and in most places illegal. But it depends on the context. Sexual discrimination is usually illegal ... except for doctors. Doctors are expected to take gender into account when they make their diagnoses. They don't want to tell a man that he is pregnant, for example. Also, information that appears innocuous may not be.
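To make that concrete, here is a minimal sketch -- with entirely made-up names, zip codes, and records -- of the kind of linkage attack behind those re-identification statistics: an "anonymized" table that still carries zip code, birth date, and sex can simply be joined against a public source, such as a voter roll, to put names back on the rows.

# Minimal sketch of a quasi-identifier linkage attack (hypothetical data).
# An "anonymized" medical table keeps zip code, birth date, and sex;
# a public voter roll lists the same fields alongside names.
# Joining on those three fields re-identifies the medical records.

anonymized_visits = [  # names removed -- supposedly safe to share
    {"zip": "02138", "birth_date": "1945-07-31", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1962-01-15", "sex": "M", "diagnosis": "diabetes"},
]

public_voter_roll = [  # freely available quasi-identifiers plus names
    {"name": "Alice Example", "zip": "02138", "birth_date": "1945-07-31", "sex": "F"},
    {"name": "Bob Example",   "zip": "02139", "birth_date": "1962-01-15", "sex": "M"},
]

# Index the voter roll by the quasi-identifier triple.
by_quasi_id = {(v["zip"], v["birth_date"], v["sex"]): v["name"]
               for v in public_voter_roll}

# Re-identify each supposedly anonymous record by looking up its triple.
for visit in anonymized_visits:
    key = (visit["zip"], visit["birth_date"], visit["sex"])
    name = by_quasi_id.get(key)
    if name is not None:
        print(f"{name} -> {visit['diagnosis']}")

Nothing sophisticated is involved -- it's just a lookup on three ordinary-looking fields -- which is exactly why anonymization is so much harder than it looks.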
Area codes -- zip codes in the US -- correlate strongly with race, for example; membership of certain organizations correlates with gender. So although you might have removed the explicit racial and gender information from your database, it still might be possible to infer it from other information that's there. It's very hard to deal with data: it has a way of revealing secrets about itself in unintended ways.

Another ethical issue concerning data mining is that correlation does not imply causation. Here's a classic example: as ice cream sales increase, so does the rate of drownings. Does ice cream consumption cause drowning? Probably not: both are probably caused by warmer temperatures -- people going to beaches. What data mining reveals is simply correlation, not causation. Really, we want causation. We want to be able to predict the effects of our actions, but all we can look at using data mining techniques is correlation. To understand causation, you need a deeper model of what's going on.

I just wanted to alert you to some of the ethical issues in data mining before you go away and use what you've learned in this course on your own datasets: issues about the privacy of personal information; the fact that anonymization is harder than you think; the fact that re-identification of individuals from supposedly anonymized data is easier than you think; data mining and discrimination -- it is, after all, about discrimination; and the fact that correlation does not imply causation.

There's a section in the textbook, Data mining and ethics, which you can read for more background information, and there's a little activity associated with this lesson, which you should go and do now. I'll see you in the next lesson, which is the last lesson of the course. Bye for now!