COMP317-09A Assignment 3 This assignment is due: Friday 22 May, 5pm. Computing a phylogenetic tree You are given 2339 amino acid sequences (hiv.txt), which are all variants/mutations of one specifically interesting protein forming part of the hull of the HI virus. A phylogenetic tree groups/clusters together sequences according to their similarity. One way of doing this is so-called bottom-up agglomerative clustering. Given a set of clusters, in every step the two most similar clusters are merged together to form a new cluster: WHILE num_of_clusters > 1 DO merge two most similar clusters Cluster similarity can be defined in lots of different ways. Here we use so-called "single-linkage": the similarity of two clusters C1 and C2 is the maximal similarity for any pair of an element of C1 and one of C2. The elements are the amino acid sequences, so the similarity for these elements is simply the length of their longest common subsequence: sim(s1,s2) = LCS(s1,s2). [Note: Biologists usually use more complex measures.] To start clustering, every single sequence forms its own single-element cluster. So here is what you need to do: - read in sequences from a file: there is one sequence per line, each sequence starts with a unique id, and then a number of single-letter codes for the various different amino acids, single blanks are used as separating characters - implement LCS and use it to precompute all pairwise sequence similarities and store this in an appropriate data structure - initialise all clusters - merge clusters until only one is left, using the precomputed similarity info. Make sure to keep track of which clusters were merged how, so that you can print a "tree" at the end - output the final cluster tree For the 10 example file the tree output of my implementation looks like so: c8 sim = 411 c7 sim = 411 c6 sim = 412 c5 sim = 418 c2 sim = 440 D0Cd0009 D0Cd0009.1 c3 sim = 424 c1 sim = 440 c0 sim = 440 RefSeque B0Fr0003 B0Fr0003.1 D0Cd0010 c4 sim = 423 AeTh0071 AeTh0072 AdkCd003 C0In0022 Note that this output encodes tree by indentation, uses the sequence id for single element clusters, and numbers the largest clusters in the order they were formed, plus outputs the similarity value of the respective two subclusters. This takes 0.5 seconds on my desktop machine. The 100 example file needs 25 seconds, and the full file needs two hours! Remember, you are encouraged to work in pairs on this assignment! What to submit: - Java Source code, - textual output for the 10 example case - a short textual description of your clustering algorithm: - use pseudo-code notation to describe your implementation - what is the complexity of your algorithm as a function of the number N of sequences given and the average sequence length L: O( ??? ) - report runtimes for at least the 10 example input file and the 100 example input file, plus for the full 2339 example input file. Do your numbers agree with your complexity estimation? How to submit: Moodle, will be activated later this week (or email, if necessary). Make sure to properly identify both group members by Name and ID.