COMP313A-2011 Assignment 6

This assignment is due on Monday, June 13, at 23:55. It is worth 10% of your total 313 marks.
============================================================
The first step to a Haskell-based OS :-)

Use Haskell to implement replacements for the following Unix shell utilities:

[1 mark]  - cat
[1 mark]  - head
[2 marks] - tail
[2 marks] - uniq
[4 marks] - sort

Look at my "wc" code for ideas on handling IO; note the separation of "pure" functional code from IO actions, and the use of pure code inside an IO-performing main method.

cat is the only command that reads a file; the other four commands read from stdin and write to stdout. For these four commands you can achieve a perfect separation of pure and impure code by using the "interact" function. See my wc.hs and wc1.hs examples. The "lines" and "unlines" functions might also be handy.
[ http://www.haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/Prelude.html ]

Instead of implementing the full set of options for all of the Unix commands, you only need to do the following:
============================================================
cat

cat <FILENAME>

This is a completely stripped-down version, which takes exactly one command line argument, the FILENAME, and outputs this file to standard output. Make sure it works for files of all sizes, including the tweets file and larger.
============================================================
head

head -n <count>

Reads text from stdin and outputs the first <count> lines to stdout. Make sure it works for files of all sizes, including the tweets file and larger. If your input has fewer than <count> lines, just print the full input.
============================================================
tail

tail -n <count>

Reads text from stdin and outputs the last <count> lines to stdout. Make sure that your implementation can deal with arbitrarily large input: even if the input has a billion lines, your program should not break when asked for the last 100 lines.
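As a starting-point sketch (not the required implementation), the "interact" pattern from wc.hs can be combined with a one-pass windowing trick to keep tail's memory proportional to the window size rather than the input size. The two-argument helper `go` and its name are my own illustration, not part of the handout:

```haskell
import System.Environment (getArgs)

-- Keep only the last n elements of a list in a single pass:
-- advance a second pointer that starts n elements ahead, so that
-- when it runs out, the first pointer marks the last-n window.
-- Earlier list cells can be garbage-collected as both pointers move.
lastN :: Int -> [a] -> [a]
lastN n xs = go xs (drop n xs)
  where
    go ys []         = ys          -- lead pointer exhausted: ys is the window
    go (_:ys) (_:zs) = go ys zs    -- advance both pointers
    go []     _      = []          -- unreachable: second list is a suffix of the first

main :: IO ()
main = do
  ["-n", n] <- getArgs             -- sketch only: no error handling for bad arguments
  interact (unlines . lastN (read n) . lines)
```

Note that if the input has fewer than n lines, `drop n xs` is already empty and `lastN` returns the whole input, which matches the required behaviour.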
If your input has fewer than <count> lines, just print the full input.
============================================================
uniq

uniq [-c]

This handy little utility filters out repeated lines, but only if they are consecutive. It reads from stdin and writes to stdout. So input like:

1
1
2
3
3
3

will generate

1
2
3

but

1
2
3
3
2
2
1

will generate

1
2
3
2
1

If the -c option is on, it also counts occurrences and outputs these counts as well. So

a
a
a
b
b
c

will generate:

3 a
2 b
1 c

and

a
a
b
a
a

will generate:

2 a
1 b
2 a

Again, as with "tail", make sure that your implementation can handle arbitrarily large inputs, e.g. a billion lines of "a", followed by another billion lines of "b", etc.
============================================================
sort

sort [-n] [-k <one_based_field_ID>]

sort reads its input from stdin and outputs to stdout. Its default behaviour is to take each line of input as a string and sort these strings. The options work as follows:

-k <one_based_field_ID>   if supplied, extract the given field (e.g. using the "words" function, see wc) and use that field as the key for sorting (but still output the full line)
-n                        if supplied, the strings are sorted "numerically", not "alphanumerically"

Example: assume the following three lines of input (size, date and file name):

4250 17/2/2011 fileA
5123 11/3/2011 fileB
9300 15/2/2011 fileC

"sort" yields:

4250 17/2/2011 fileA
5123 11/3/2011 fileB
9300 15/2/2011 fileC

"sort -k 2" (sorting on the date field) yields:

5123 11/3/2011 fileB
9300 15/2/2011 fileC
4250 17/2/2011 fileA
============================================================
What to submit:

- source code
- output produced by the following Linux command line

> time ( ./cat /home/ml/datasets/tweets/someTweets.txt | ./sort | ./uniq -c | ./sort -n -k 1 | ./tail -n 10 )

or, if that takes too long or needs too much memory, use

> time ( ./cat /home/ml/datasets/tweets/fewerTweets.txt | ./sort | ./uniq -c | ./sort -n -k 1 | ./tail -n 10 )

or simply use the smallest of the three files:

> time ( ./cat /home/ml/datasets/tweets/first60kTweets.txt | ./sort | ./uniq -c | ./sort -n -k 1 | ./tail -n 10 )
============================================================
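For flavour, the "uniq -c" step used in the pipeline above can be sketched with the same "interact" pattern as wc.hs. This is a sketch only: the helper name `countRuns` is my own, it hard-wires the -c behaviour, and you still need to handle the flag being optional:

```haskell
import Data.List (group)

-- "group" splits a list into runs of consecutive equal elements,
-- so counting each run gives exactly the uniq -c output format.
countRuns :: [String] -> [String]
countRuns = map (\g -> show (length g) ++ " " ++ head g) . group

main :: IO ()
main = interact (unlines . countRuns . lines)
```

Because "group", "length" and "head" all consume each run lazily, a very long run of identical lines does not need to be held in memory at once.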