Draft—not finalised

Assignment Two — Uncovering Genetic Causes of Illness

Due: 18 October 2013


Many illnesses are linked to genetic disorders, which often manifest as inherited diseases (e.g. drepanocytosis, better known as sickle cell anaemia). One of the first steps in understanding these disorders is to find the exact cause. What gene or combination of genes is responsible? What variations in DNA lead to the disease arising in some people but not others?

The search for answers begins by finding differences in the DNA of people who have the disease and those who don't, then eliminating differences that do not correlate with the illness. Since an arbitrary pair of humans will often have a large number of genetic differences, the search can be made easier by looking at people who are closely related because their differences will be fewer and relevant variations simpler to find.

Genome Sequencing Centres (GSCs) can find all variations (SNPs and indels) between a reference sequence and any number of additional sequences, and this information is recorded in a VCF (Variant Call Format) file. Each line of data in the body of the VCF file reports where a variation was detected (i.e. which chromosome and what position on that chromosome), what reference call was expected, what variations were observed, and which sequences had which variation, along with other information relating to such things as the nature and quality of the data. A header section at the start of the VCF file holds metadata that explains each field and code.
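To make the layout of a data line concrete, here is a minimal Python sketch that splits one line into the pieces described above. It assumes the standard tab-delimited VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample) and that the FORMAT column includes a GT (genotype) key. The function name, the sample names and the example record are all made up for illustration and are not taken from the assignment data.

```python
def parse_vcf_line(line, sample_names):
    """Split one tab-delimited VCF data line into its main fields.

    Assumes the standard column order and a FORMAT column containing GT.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]

    # FORMAT (column 9) names the colon-separated sub-fields of each sample column.
    format_keys = fields[8].split(":")
    gt_index = format_keys.index("GT")

    genotypes = {}
    for name, column in zip(sample_names, fields[9:]):
        genotypes[name] = column.split(":")[gt_index]

    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alts": alt.split(","),  # several alternate alleles may be listed
        "genotypes": genotypes,
    }


# Hypothetical record: three samples, one SNP (G -> A) on chromosome 1.
samples = ["mother", "father", "child1"]
line = "1\t12345\t.\tG\tA\t50\tPASS\t.\tGT:DP\t0/1:30\t0/1:28\t1/1:35\n"
record = parse_vcf_line(line, samples)
print(record["chrom"], record["pos"], record["ref"], record["alts"], record["genotypes"])
```

In a real file the sample names come from the final header line (the one beginning with `#CHROM`), and lines beginning with `##` are metadata to be skipped or copied through unchanged.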

For this assignment, you will be given a VCF file with variations found in the DNA sequences for a family of seven (two parents and five children). You will be told which family members exhibit which disorders, and your goal is to determine which genetic variations are most likely responsible for each illness.

As before, this assignment is to be completed by teams, but the teams differ from those of the previous assignment. The teams are as follows:

Team name   Biologist          Computer scientists
Alpha       Michael Anderson   Yoni Villamor, Michael Barrett, Daniel McLaren, Brian Hardyment, Sam Button
Beta        Erin Doyle         Reuben Bell, Wanying Yang, Matthew Broatch, Simon Mills
Gamma       Sarah Appleby      Jack Elliott, Sara Schaare, Keith Vincent, Jason Grinter
Delta       Laura Bell         Selina Gyde, Michael Fowlie, Sam Prescott, Thomas Verstappen

There are no constraints on which team member does which job(s), nor on which programming language is used for data processing. Just solve the problem as best you can.


There are four illnesses of increasing complexity to be accounted for: 1) Sickle Cell Anaemia, 2) Retinitis Pigmentosa, 3) Severe Skeletal Dysplasia, and 4) Spastic Paraplegia.

You will be given the following information:

The ultimate aim is to determine precisely which variations cause each disease. This will require some combination of a literature search for genes and variants known to be associated with the disease and a computer-based process that sifts out candidate variants that fit the known family inheritance. The shorter your final list (assuming that it includes the actual result), the better your grades will be.
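As one illustration of the sifting step, the Python sketch below keeps only those variants whose genotypes fit a simple autosomal recessive pattern: every affected family member is homozygous for a non-reference allele, and no unaffected member is. This is only one possible inheritance model, chosen here as an assumption. The function names, sample names and genotypes are hypothetical, and dropping sites with missing calls is a simplification you may not want.

```python
import re


def split_alleles(genotype):
    """Split a GT string such as '0/1' or '1|1' into its allele codes."""
    return re.split(r"[/|]", genotype)


def is_hom_alt(genotype):
    """True if both alleles are the same non-reference allele."""
    alleles = split_alleles(genotype)
    return (
        len(alleles) == 2
        and alleles[0] == alleles[1]
        and alleles[0] not in ("0", ".")
    )


def fits_recessive(genotypes, affected):
    """Check one variant against a simple recessive model (an assumed model)."""
    for sample, genotype in genotypes.items():
        if "." in split_alleles(genotype):
            return False  # assumption: discard sites with any missing call
        if (sample in affected) != is_hom_alt(genotype):
            return False  # affected must be hom-alt, unaffected must not be
    return True


# Hypothetical data: (chromosome, position, genotype per family member).
variants = [
    ("1", 100, {"mother": "0/1", "father": "0/1", "child1": "1/1", "child2": "0/1"}),
    ("1", 200, {"mother": "0/1", "father": "0/0", "child1": "0/1", "child2": "0/0"}),
    ("2", 300, {"mother": "1/1", "father": "0/1", "child1": "1/1", "child2": "0/1"}),
]
affected = {"child1"}

candidates = [
    (chrom, pos)
    for chrom, pos, genotypes in variants
    if fits_recessive(genotypes, affected)
]
print(candidates)
```

A dominant or compound-heterozygous disorder would need a different test in place of `fits_recessive`, and the literature search remains necessary to narrow the surviving candidates further.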

For each disease, you should construct VCF files according to the following criteria:


Submit via Moodle a compressed folder that contains the final VCF files you produce for each disease, plus a well-written, well-formatted report summarising what you found for each disease. The report need not be lengthy provided it includes the following:

Any additional relevant data that you wish to include that doesn't fit in the body of the report may be added as appendices, or (if quite large) as separate files. Make sure each VCF file can be clearly associated with the disease for which it provides an account.

As with the previous assignment, each student is to also hand in an assessment of their team members, detailing how each member contributed to completing the assignment along with any comments relevant to explaining how well the team functioned and any problems that developed. This assessment is to be well-formatted, no more than one or two pages, and submitted either personally to me (hard-copy in an envelope with your name on it) or by email (if you're not concerned about plausible deniability). As before, if I don't receive this team review, then you won't get a mark for the assignment.