Burrows-Wheeler Transformation (conceptually)


List all rotations of the string:

Sort the rotations:

The transformation string is the last column of the sorted matrix:
The primary index is the row in the sorted matrix that is the original string:
	  3   (indexes start from zero)

   If we right rotate the sorted matrix then the transformation string
   becomes the first column, and the second column is all letters in the
   transformation string sorted in alphabetic order:
	   0 1
	0  n a
	1  n a
	2  b a
	3  a b
	4  a n
	5  a n

   The first occurrence of each symbol in the second column
   (proceeding downwards) is the same instance of that symbol as the
   first occurrence of that symbol in the first column (because the
   strings are sorted by the second column in the original matrix).
   More generally, the i-th instance of a symbol in column 2
   corresponds to the i-th instance of that symbol in column 1; thus
   we can work out for each symbol in column 2 which symbol will
   follow it in column 3 by finding it in column 1 and copying the
   symbol after it from column 2.  This gives the transformation

   More plainly, for the i-th column and the j-th row, we get the
   (i+1)-th symbol from the i-th column as per the corresponding index
   in the transformation vector.  Thus for row 0, the next letter in
   column 2 comes from row 3 column 1, which is "b".  The index for
   row 3 says get the next symbol from row 2 column 1, which is "a".
   The index for row 2 says get the next letter from row 5, "n", and
   the fifth index says get the next symbol from row 1, "a", and the
   next from row 4, "n" and finally from row 0, "a" .. giving us the
   reconstrcuted string "nabana".  To find the original string, we
   create the sorted matrix of all rotations of the reconstructed
   string (which gives an identical matrix to the original sorted
   matrix) and use the primary index to identify the original string:

Borrows Wheeler Transform (in practice)

Eliminating the primary index:

   If we insert a special end-of-string marker, then we can get rid of
   the primary index.  This marker is considered to precede all other
   characters in the alphabet.  Using "$" as this special marker, "banana"
   and the sorted matrix is
   and the transformation string (i.e. the last column) is
   which is not quite as sorted as the transformation string when a
   primary index is used, but for long enough strings the cost
   is negligible.

   The transformation vector is calculated the same way and used to
   reconstruct the string following the same procedure, except now the
   sorted matrix does not have to be recreated because the
   end-of-string marker gives the correct rotation to give the
   original string.

Eliminating the "sort"

   Sorting the rotations is a limiting factor because the operation is
   typically n log n for sorting strings.  But because the strings are
   not arbitrary, rather they are specifically rotations of the same
   string, there is a linear sort algorithm.  Simply insert all
   suffixes of the string (with its end-of-string marker) into a
   suffix trie, recording at each leaf the starting position of each
   corresponding suffix.  An in-order depth-first traversal of the trie gives a
   vector of numbers, each of which is the successor of the index
   where the corresponding rotation would occur in the sorted matrix.
   Thus, subtracting 1 from each index during an in-order traversal of
   the suffix trie gives the index of the order of the characters in
   the transformation string.  The suffix trie can be constructed in
   linear time (Ukkonen's Algorithm).