Burrows-Wheeler Transformation (conceptually)
E.g.
"banana"
List all rotations of the string:
banana
abanan
nabana
anaban
nanaba
ananab
Sort the rotations:
abanan
anaban
ananab
banana
nabana
nanaba
The transformation string is the last column of the sorted matrix:
"nnbaaa"
The primary index is the row in the sorted matrix that is the original string:
4 (indexes start from zero)
Detransformation
If we right rotate the sorted matrix then the transformation string
becomes the first column, and the second column is all letters in the
transformation string sorted in alphabetic order:
0 1
0 n a
1 n a
2 b a
3 a b
4 a n
5 a n
The first occurrence of each symbol in the second column
(proceeding downwards) is the same instance of that symbol as the
first occurrence of that symbol in the first column (because the
strings are sorted by the second column in the original matrix).
More generally, the i-th instance of a symbol in column 2
corresponds to the i-th instance of that symbol in column 1; thus
we can work out for each symbol in column 2 which symbol will
follow it in column 3 by finding it in column 1 and copying the
symbol after it from column 2. This gives the transformation
vector:
<3,4,5,2,0,1>
More plainly, for the i-th column and the j-th row, we get the
(i+1)-th symbol from the i-th column as per the corresponding index
in the transformation vector. Thus for row 0, the next letter in
column 2 comes from row 3 column 1, which is "b". The index for
row 3 says get the next symbol from row 2 column 1, which is "a".
The index for row 2 says get the next letter from row 5, "n", and
the fifth index says get the next symbol from row 1, "a", and the
next from row 4, "n" and finally from row 0, "a" .. giving us the
reconstrcuted string "nabana". To find the original string, we
create the sorted matrix of all rotations of the reconstructed
string (which gives an identical matrix to the original sorted
matrix) and use the primary index to identify the original string:
"banana".
Borrows Wheeler Transform (in practice)
Eliminating the primary index:
If we insert a special end-of-string marker, then we can get rid of
the primary index. This marker is considered to precede all other
characters in the alphabet. Using "$" as this special marker, "banana"
becomes
banana$
and the sorted matrix is
$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba
and the transformation string (i.e. the last column) is
"annb$aa"
which is not quite as sorted as the transformation string when a
primary index is used, but for long enough strings the cost
is negligible.
The transformation vector is calculated the same way and used to
reconstruct the string following the same procedure, except now the
sorted matrix does not have to be recreated because the
end-of-string marker gives the correct rotation to give the
original string.
Eliminating the "sort"
Sorting the rotations is a limiting factor because the operation is
typically n log n for sorting strings. But because the strings are
not arbitrary, rather they are specifically rotations of the same
string, there is a linear sort algorithm. Simply insert all
suffixes of the string (with its end-of-string marker) into a
suffix trie, recording at each leaf the starting position of each
corresponding suffix. An in-order depth-first traversal of the trie gives a
vector of numbers, each of which is the successor of the index
where the corresponding rotation would occur in the sorted matrix.
Thus, subtracting 1 from each index during an in-order traversal of
the suffix trie gives the index of the order of the characters in
the transformation string. The suffix trie can be constructed in
linear time (Ukkonen's Algorithm).