The SIG.red file contains 1,417 peptides whose cleavage site is known. Each peptide is stored as a three-line record: the first line is a header with the name of the protein, the second line is the protein's primary sequence, and the third line is an annotation of that sequence, one character per residue, marking whether each amino acid residue is part of the signal peptide (S), is the cleavage site (C), or is part of the mature protein (M). Each record covers the entire signal peptide plus the first 30 residues of the mature protein (or the whole mature protein, if it is shorter than that).

For example:

 50 11S3_HELAN     20 11S GLOBULIN SEED STORAGE PROTEIN G3 PRECURSOR (HELIANTH
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEALEPIEVIQAEA
SSSSSSSSSSSSSSSSSSSSCMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
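
The three-line layout can be read with a few lines of Python. The following is a minimal sketch, not a reference parser: the names `Record` and `parse_records` are hypothetical, and it assumes that the file contains nothing but three-line records (possibly separated by blank lines). In the example record, the two numbers in the header appear to match the sequence length (50) and the signal peptide length (20), but the sketch keeps the header as an opaque string rather than relying on that.

```python
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    header: str    # first line: protein name and description
    sequence: str  # second line: primary sequence, one letter per residue
    labels: str    # third line: S / C / M annotation, one letter per residue


def parse_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one Record per group of three non-blank lines."""
    # Assumption: blank lines carry no meaning and can be dropped.
    stripped = [ln.strip() for ln in lines if ln.strip()]
    if len(stripped) % 3 != 0:
        raise ValueError("expected a multiple of three non-blank lines")
    for i in range(0, len(stripped), 3):
        header, sequence, labels = stripped[i:i + 3]
        # The annotation must cover the sequence residue by residue.
        if len(sequence) != len(labels):
            raise ValueError(f"length mismatch in record: {header!r}")
        yield Record(header, sequence, labels)


# The example record from the text, copied verbatim.
example = """\
 50 11S3_HELAN     20 11S GLOBULIN SEED STORAGE PROTEIN G3 PRECURSOR (HELIANTH
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEALEPIEVIQAEA
SSSSSSSSSSSSSSSSSSSSCMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
"""

records = list(parse_records(example.splitlines()))
print(records[0].header)
print(len(records[0].sequence), records[0].labels.count("S"))
```

To read the real file, an open file object can be passed in place of `example.splitlines()`, since iterating over a file also yields one line at a time.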

The goal is to predict which amino acid residue is the cleavage site (i.e., to predict where the C goes).
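
Framed as a prediction problem, the target for each record is the position of the single C in the annotation line. The sketch below shows one way to extract that target and to score a prediction against it. The function names `cleavage_index` and `is_correct` are hypothetical, 0-based indexing is a choice made here, and the check for exactly one C per record is an assumption about the data.

```python
def cleavage_index(labels: str) -> int:
    """Return the 0-based position of the single 'C' in an annotation line."""
    # Assumption: every record marks exactly one cleavage site.
    if labels.count("C") != 1:
        raise ValueError("expected exactly one 'C' in the annotation")
    return labels.index("C")


def is_correct(predicted_index: int, labels: str) -> bool:
    """True if a predicted position coincides with the annotated C."""
    return predicted_index == cleavage_index(labels)


# Sequence and annotation lines of the example record, copied verbatim.
sequence = "MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEALEPIEVIQAEA"
labels = "SSSSSSSSSSSSSSSSSSSSCMMMMMMMMMMMMMMMMMMMMMMMMMMMMM"

target = cleavage_index(labels)
# Prints the target position and the residue found at that position.
print(target, sequence[target])
print(is_correct(target, labels), is_correct(target + 1, labels))
```

Because every residue before the C is labelled S, the target index also equals the length of the signal peptide, so a predictor can be evaluated by exact match on this one integer per record.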