COMP313A-2011 Assignment 5

This assignment is due on Monday May 30, at 23:55.
It is worth 10% of your total 313 marks.

============================================================

If your program takes too long, especially the char-based version,
use the smaller tweet file (see below for details).

============================================================

In the R block linux labs, you should be able to access Erlang as:

    /home/ml/313/bin/erl
    /home/ml/313/bin/erlc

or you could add /home/ml/313/bin/ to your PATH variable, e.g. in
.profile add:

    export PATH=$PATH:/home/ml/313/bin/

============================================================

This is a similar problem to Assignment 4, but with a different
language and a different algorithm: Erlang & SpaceSaving.

Again, use the tweets from:

    /home/ml/datasets/tweets/someTweets.txt   [R block labs, linux]
    /home/ml/datasets/tweets/fewerTweets.txt  [~10% of the above, R block labs, linux]

Your program should determine the K most frequent words of sizes 1 to
20 characters respectively. K will be a user parameter between 1 and
1000, i.e. 1 <= K <= 1000. Your program should print out these 20
sets of words and their respective frequencies like so (K=2 in this
example):

    Size  Word  Frequency
    1     a     4500000
    1     I     350000
    2     an    235678
    2     at    12398
    3     the   567890
    3     all   34567
    4     iraq  9999
    4     fair  9876
    ...

You may ignore ties. Words are defined very liberally: spaces
separate/define words. Therefore in Java you could use
String.split(" ") to turn a line into "words".

Here is the twist:

- You must use Erlang, and you must use 21 processes: 1 master
  reading in the file line by line and passing lines on to all 20
  clients, each of which processes one word size, i.e. one client
  does 1-char words, one does 2-char words, and so on. When the
  master is done with the file, it asks all clients for their top-K
  words, to produce the above output.

  As messages are copied, you will have to implement two versions and
  compare their efficiencies. In one version the master sends full
  lines to the clients, i.e. clients need to accept the following
  messages:

      { line, Line }
      { topK, From, K }

  In the second version the master sends single characters and an
  end-of-line marker:

      { topK, From, K }
      eol
      Char

- Secondly, the clients must use the following simple approximate
  algorithm, which is a simplification of the so-called SpaceSaving
  algorithm [http://en.scientificcommons.org/43576982]:

  - Initialize N counters.
  - Process WORD: if WORD has a current counter, increment it;
    else find the lowest counter, replace its word with WORD, and
    increment it.

  If enough counters are used, there is a high probability that the
  results are correct.

In your experiments, run your system for K = 25, and use a) 50
counters in each process, and b) 250 counters in each process.
Compare the results with your results from Assignment 4: are 50
counters enough for a fully correct result (all top 25 are correct)?
Are 250 counters enough?

PS: do not use "counter.erl". Use a list of {word,count} tuples
instead, initialised to [ {"",0}, {"",0}, ... ], or use a more
sophisticated data structure if you want. Sketches of one possible
counter representation, the two client loops, and a master follow
below.
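To make the {word,count} representation concrete, here is a minimal
sketch of the simplified SpaceSaving update over such a list. The
module and function names (spacesaving, init/update/topk) are
illustrative choices, not prescribed by the assignment:

    -module(spacesaving).
    -export([init/1, update/2, topk/2]).

    %% N counters, all initialised to the empty word with count 0.
    init(N) -> [{"", 0} || _ <- lists:seq(1, N)].

    %% Simplified SpaceSaving step: if Word already owns a counter,
    %% increment it; otherwise evict the word with the lowest count
    %% and take over its counter (incremented).
    update(Word, Counters) ->
        case lists:keyfind(Word, 1, Counters) of
            {Word, Count} ->
                lists:keyreplace(Word, 1, Counters, {Word, Count + 1});
            false ->
                {MinWord, MinCount} = lowest(Counters),
                lists:keyreplace(MinWord, 1, Counters,
                                 {Word, MinCount + 1})
        end.

    %% Counter with the smallest count; ties resolved arbitrarily.
    lowest([C | Cs]) ->
        lists:foldl(fun({W, N}, {_, M}) when N < M -> {W, N};
                       (_, Acc) -> Acc end,
                    C, Cs).

    %% The K counters with the highest counts, highest first.
    topk(K, Counters) ->
        lists:sublist(lists:reverse(lists:keysort(2, Counters)), K).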
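Building on that module, a client for the line-based protocol could
look as follows. The {line, Line} and {topK, From, K} message formats
are the ones required above; the reply format {topK, Size, TopList}
and the start/2 interface are assumptions made for this sketch. Note
that io:get_line keeps the trailing newline, hence tokenising on
" \n":

    -module(client).
    -export([start/2]).

    %% Spawn a client responsible for words of exactly Size
    %% characters, using NumCounters SpaceSaving counters.
    start(Size, NumCounters) ->
        spawn(fun() -> loop(Size, spacesaving:init(NumCounters)) end).

    loop(Size, Counters) ->
        receive
            {line, Line} ->
                %% Split on spaces (and the trailing newline), keep
                %% only words of this client's size.
                Words = [W || W <- string:tokens(Line, " \n"),
                              length(W) =:= Size],
                loop(Size, lists:foldl(fun spacesaving:update/2,
                                       Counters, Words));
            {topK, From, K} ->
                From ! {topK, Size, spacesaving:topk(K, Counters)},
                loop(Size, Counters)
        end.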
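For the char-based protocol the client receives {topK, From, K}, eol,
and bare Char messages, as specified above. One way to handle this,
sketched here, is to accumulate characters into a reversed word
buffer and flush it on a space or eol; the buffer handling and
flush/3 helper are assumptions of this sketch, not part of the
assignment:

    %% Char-based receive loop; Buf holds the current word in
    %% reverse. A start/2 analogous to the line-based client would
    %% call char_loop(Size, [], spacesaving:init(NumCounters)).
    char_loop(Size, Buf, Counters) ->
        receive
            {topK, From, K} ->
                From ! {topK, Size, spacesaving:topk(K, Counters)},
                char_loop(Size, Buf, Counters);
            eol ->
                char_loop(Size, [], flush(Size, Buf, Counters));
            $\s ->  %% a space character ends the current word
                char_loop(Size, [], flush(Size, Buf, Counters));
            Char ->
                char_loop(Size, [Char | Buf], Counters)
        end.

    %% Count the buffered word iff it has exactly Size characters.
    flush(Size, Buf, Counters) when length(Buf) =:= Size ->
        spacesaving:update(lists:reverse(Buf), Counters);
    flush(_Size, _Buf, Counters) ->
        Counters.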
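Finally, a sketch of a master for the line-based version: it spawns
the 20 clients, streams the file to all of them line by line, then
gathers and prints each client's top-K list. It relies on the
client:start/2 interface and the {topK, Size, TopList} reply format
assumed in the client sketch above; the run/3 signature and output
formatting are likewise only illustrative:

    -module(master).
    -export([run/3]).

    run(FileName, K, NumCounters) ->
        Clients = [client:start(Size, NumCounters)
                   || Size <- lists:seq(1, 20)],
        {ok, File} = file:open(FileName, [read]),
        send_lines(File, Clients),
        file:close(File),
        io:format("Size Word Frequency~n"),
        lists:foreach(
            fun(Pid) ->
                %% Ask one client at a time, wait for its reply.
                Pid ! {topK, self(), K},
                receive
                    {topK, Size, Top} ->
                        [io:format("~p ~s ~p~n", [Size, Word, Count])
                         || {Word, Count} <- Top]
                end
            end, Clients),
        ok.

    %% Forward every line of the file to every client.
    send_lines(File, Clients) ->
        case io:get_line(File, '') of
            eof  -> ok;
            Line -> [Pid ! {line, Line} || Pid <- Clients],
                    send_lines(File, Clients)
        end.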
============================================================

What to submit:

- Source code.
- Output from one run for K=2 (and 20 counters/process) for both the
  line-based and the char-based version, to verify correctness.
- 7 runs for K = 25, in each of four different settings, i.e. every
  combination of:

      50 counters/process | 250 counters/process
      line-based protocol | char-based protocol

  Report the median runtime of all 7 runs, and whether the results
  were fully correct or contained some error, e.g. (purely
  hypothetical):

           | 50           | 250
      -----+--------------+-----------
      line | 2mins errors | 10mins ok
      char | 1mins errors | 5mins ok

============================================================