Site prediction in protein sequence data

"Proteins are the active elements of cells. They aid and control the chemical reactions that make the cell work. They receive signals from outside of the cell. They control the processes by which proteins are made from the instructions in the genes. They also form the scaffolding that gives cells their shape and as well as parts of the linkages that stick cells together into tissues and organs." (from University of Oxford, Laboratory of Molecular Biophysics)

A protein is a long sequence of amino acids (between about 50 and 5,000) linked together to form a peptide chain. There are 20 different standard amino acids, each with particular electrochemical properties that cause the protein to fold up into a complex three-dimensional shape. The shape of the molecule, and in particular the electrochemical properties of its surface, determines how the protein interacts with other molecules in a living organism, and thus which of the organism's functions the protein plays a role in.

Apart from the impact that the surface of the protein has on its biological role, a number of characteristics of individual amino acid residues in a protein influence the protein's overall function. For example, the leading 50 to 250 residues of a protein form what is known as the signal peptide, which plays a significant part in determining which membranes the protein can pass through, and therefore which parts of the organism the protein can influence.

As a protein passes through a membrane, the signal peptide is cleaved off. Knowing in advance where the signal peptide ends and the mature protein begins can therefore help a biochemist predict what roles the protein is likely to play in the biological processes of an organism.
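One common way to pose cleavage-site prediction as a machine learning problem is to slide a fixed-width window along the sequence and label each window according to whether the mature protein begins at its centre. The Python sketch below illustrates that framing only; it is not the method used in this research. The function names, the padding symbol, and the window half-width are all hypothetical choices made for illustration.

```python
# The 20 standard amino acids, in one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical padding symbol for window positions beyond the sequence ends.
PAD = "-"


def windows(sequence, half_width):
    """Yield (position, window) for every residue, padding at both ends."""
    padded = PAD * half_width + sequence + PAD * half_width
    for i in range(len(sequence)):
        yield i, padded[i:i + 2 * half_width + 1]


def one_hot(window):
    """Encode a window as a flat 0/1 list, 20 entries per window position.

    Padding symbols (and any non-standard letters) become all zeros.
    """
    vector = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


def labelled_examples(sequence, cleavage_index, half_width=3):
    """Build (features, label) pairs for one sequence.

    The label is 1 for the residue at cleavage_index (assumed here to be
    the first residue of the mature protein) and 0 everywhere else.
    """
    return [
        (one_hot(window), int(i == cleavage_index))
        for i, window in windows(sequence, half_width)
    ]
```

Each feature vector produced this way could be handed to any standard classifier, whether a neural network or another learner. Real data sets would also need to deal with the heavy imbalance between the single positive position and the many negative ones in each sequence.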

Another important site-prediction problem is determining where sugar molecules attach to a protein. These so-called glycosylation sites affect protein folding, antigenicity, solubility, and biological activity, among other properties, so being able to predict them is very useful for biochemistry and pharmaceutical research. (See, e.g., glycosylation site prediction using neural nets.)
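As a concrete illustration of why this is a prediction problem rather than a lookup, consider N-linked glycosylation. It generally occurs at an asparagine in the pattern Asn-X-Ser/Thr, where X is any residue except proline, but not every occurrence of the pattern is actually glycosylated. The Python sketch below lists only the candidate positions; deciding which candidates are real sites is the part that calls for a learned model. The function name is a hypothetical choice, and the sketch says nothing about O-linked sites, which lack a comparably simple motif.

```python
import re

# N-linked sequon: Asn (N), then any residue except Pro (P), then Ser (S)
# or Thr (T). The lookahead lets overlapping candidates all be reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def candidate_n_glycosylation_sites(sequence):
    """Return 0-based indices of asparagines that begin an N-X-S/T pattern.

    These are candidates only: matching the pattern does not guarantee
    that the site is glycosylated in the living organism.
    """
    return [match.start() for match in SEQUON.finditer(sequence.upper())]
```

For example, in the toy sequence "MNGSANPTNNTS" the asparagine followed by proline is skipped, while the other three asparagines are reported as candidates.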

Summer research project

Neural networks have been the dominant technology for addressing site-prediction problems in protein data. However, I have recently had notable success using other machine learning techniques to address this problem. The results are preliminary and the research needs to be verified, improved and extended. To that end, I need a summer student with an interest in machine learning and computational biology (bioinformatics) to help me complete this research.