COMP313A-2011 Assignment 4

This assignment is due on Friday May 6, 23:55.
This assignment is worth 10% of your total 313 marks.

============================================================

Here is a simple job. Implement it in two different languages:
one of Prolog or Scala, and your favourite language (as in
Assignment 1: I need to be able to run the code in your favourite
language on either Linux in the labs or on Mac OS X 10.6.6).

On the Linux machines in R block you should be able to read the
following file:

/home/ml/datasets/tweets/someTweets.txt

which contains one tweet per line (just the text, no meta info).
Altogether it is about 6 million tweets and 0.5 GB.

Your program should determine the K most frequent words for each
word size from 1 to 20 characters. K will be a user parameter
between 1 and 1000, i.e. 1 <= K <= 1000.

Your program should print out these 20 sets of words and their
respective frequencies like so (K=2 in this example):

Size  Word  Frequency
1     a     4500000
1     I     350000
2     an    235678
2     at    12398
3     the   567890
3     all   34567
4     iraq  9999
4     fair  9876
...

You may ignore ties.

Words are defined very liberally: spaces separate/define words.
Therefore in Java you could use String.split(" ") to turn a line
into "words".

Run your program 7 times for K = 25 and record the runtimes.

============================================================

What to submit:

- source code for both programming languages
- for both languages: output from one run for K=1
- a table listing:
    programming language
    time of slowest run
    median runtime
    time of fastest run
    estimate of time needed to program
    number of lines of code

e.g.:

programming language     Scala      Python
time of slowest run      10 min     9 min
median runtime           7 min      8 min
time of fastest run      5 min      7 min
time needed to program   2 hours    0.5 hours
number of lines of code  120 lines  80 lines

============================================================
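
For illustration only, here is a minimal Python sketch of the counting and
timing steps described above. It is one possible approach, not a model
solution. The function names (top_k_by_length, print_table, time_runs) and
the tab-separated output are assumptions made for this sketch; the
assignment does not prescribe them. Words are obtained by splitting on
single spaces, as suggested above, and the empty strings produced by
consecutive spaces are skipped.

```python
import statistics
import time
from collections import Counter


def top_k_by_length(lines, k, max_len=20):
    """Return {size: [(word, count), ...]} with the k most frequent words per size."""
    if not 1 <= k <= 1000:
        raise ValueError("K must satisfy 1 <= K <= 1000")
    # One counter per word size from 1 to max_len.
    counters = {size: Counter() for size in range(1, max_len + 1)}
    for line in lines:
        # Spaces alone separate words; anything else stays part of the word.
        for word in line.rstrip("\n").split(" "):
            if 1 <= len(word) <= max_len:
                counters[len(word)][word] += 1
    # most_common(k) returns at most k pairs, highest count first.
    return {size: counter.most_common(k) for size, counter in counters.items()}


def print_table(result):
    """Print the result in the Size / Word / Frequency layout (tab-separated here)."""
    print("Size\tWord\tFrequency")
    for size in sorted(result):
        for word, count in result[size]:
            print(f"{size}\t{word}\t{count}")


def time_runs(func, runs=7):
    """Call func `runs` times; return (slowest, median, fastest) in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return max(durations), statistics.median(durations), min(durations)
```

A run over the tweet file could then look like this: open
/home/ml/datasets/tweets/someTweets.txt, pass the file object to
top_k_by_length together with K, and hand the result to print_table.
Because the file is read line by line, only the 20 counters are held in
memory, not the 0.5 GB of text.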