PPM

Statistical entropy coders (e.g. arithmetic coding, Huffman coding, Shannon-Fano coding) can encode higher-probability symbols with fewer bits than lower-probability symbols, and the best overall compression is achieved when the estimated probabilities supplied to the coder match the true underlying probabilities of the symbols, because then no probability mass (and hence no code space) is wasted on symbols that rarely occur.
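
As a rough illustration of why better estimates matter (a sketch, not tied to any particular coder): an ideal entropy coder spends about -log2(p) bits on a symbol estimated to have probability p, so every doubling of a symbol's estimated probability saves roughly one bit.

import math

def code_length_bits(p: float) -> float:
    # Ideal code length, in bits, for a symbol with estimated probability p.
    return -math.log2(p)

print(code_length_bits(0.50))   # 1.00 bit
print(code_length_bits(0.13))   # ~2.94 bits
print(code_length_bits(0.01))   # ~6.64 bits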

For written language, higher probabilities can often be assigned to a letter when some number of preceding symbols are used as conditioning context. That is, while the probability of the letter 'e' in English text is roughly 13% in isolation, it is higher when the previous letter is known to be 't', and higher still when the preceding pair of letters is known to be "th" (consider how often "the" occurs).
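
A sketch of how such conditional estimates could be gathered by simple counting (the sample string below is only a stand-in for a real corpus, so the numbers it prints are purely illustrative):

from collections import Counter

text = "the quick brown fox jumps over the lazy dog and then the other one"

letters = Counter(text)
p_e = letters['e'] / sum(letters.values())                        # p(e)

after_t = [b for a, b in zip(text, text[1:]) if a == 't']
p_e_after_t = after_t.count('e') / len(after_t)                   # p(e | t)

after_th = [c for a, b, c in zip(text, text[1:], text[2:]) if a + b == 'th']
p_e_after_th = after_th.count('e') / len(after_th)                # p(e | th)

print(p_e, p_e_after_t, p_e_after_th)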

PPM (Prediction by Partial Matching) attempts to exploit these higher conditional probabilities by maintaining a distribution over symbols for each context it has seen. If the context is the two preceding symbols, the model is called an order-2 model; if it is the single preceding symbol, it is an order-1 model. An order-0 model predicts a symbol from its unconditional frequency (i.e. with no preceding context), and a default model assumes every possible symbol is equally likely, so a symbol encoded with the default model is output explicitly using its fixed-length code (e.g. its ASCII or Unicode code).
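
A minimal sketch of how these stacked models might be laid out in code (the names, the table layout, and the 8-bit default alphabet are assumptions for illustration, not part of any particular PPM implementation):

from collections import Counter, defaultdict

MAX_ORDER = 2
DEFAULT_CODE_BITS = 8  # default model: every byte equally likely, fixed-length output

# counts[k][context] holds the symbol frequencies seen after that length-k context;
# the order-0 model uses the empty string as its single context.
counts = {k: defaultdict(Counter) for k in range(MAX_ORDER + 1)}

counts[2]["th"]["e"] += 1   # one observation of 'e' in the order-2 context "th"
counts[0][""]["e"] += 1     # the same observation as seen by the order-0 model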

Probabilities are maintained only for symbols that have actually been observed. If a letter has already been seen in the current context, its probability is based on its frequency of occurrence in that context, and that frequency is then incremented so that later predictions reflect the extra observation. If the letter has never been seen in the current context, a shorter context (i.e. a lower-order model) is used to predict and encode it.
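
A sketch of that search for a usable context, reusing the table layout assumed above (the helper name is illustrative; each drop to a shorter context would be signalled with an escape symbol, as described below):

from collections import Counter, defaultdict

MAX_ORDER = 2
counts = {k: defaultdict(Counter) for k in range(MAX_ORDER + 1)}

def coding_order(history: str, symbol: str) -> int:
    # Return the highest model order at which `symbol` has already been seen,
    # or -1 if it must fall all the way through to the default model.
    for k in range(min(MAX_ORDER, len(history)), -1, -1):
        context = history[len(history) - k:]
        if counts[k][context][symbol] > 0:
            return k
    return -1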

If a letter to be encoded has never been seen at all, it is coded explicitly with the default model. It is then added to the order-0 model, and recorded as a new observation for each of the preceding contexts, so that it can be predicted more cheaply next time. It will also form part of new contexts for whatever letter comes next, and those contexts are added to the models at their respective orders.
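
A sketch of that bookkeeping after each symbol is coded: the symbol is recorded under every context that precedes it, creating any context that did not exist yet (again using the layout assumed above):

from collections import Counter, defaultdict

MAX_ORDER = 2
counts = {k: defaultdict(Counter) for k in range(MAX_ORDER + 1)}

def update(history: str, symbol: str) -> None:
    # Record `symbol` in the order-0 context, the order-1 context, and so on,
    # up to MAX_ORDER (or as far as the available history allows).
    for k in range(min(MAX_ORDER, len(history)) + 1):
        counts[k][history[len(history) - k:]][symbol] += 1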

To signal a move from one model order down to a lower one (say, from order-1 to order-0) during encoding or decoding, an escape symbol is used, and every context model includes one. The escape symbol behaves as if it had been observed exactly once, so in a context where no symbol has yet been seen it has probability 1/1. When a letter is seen in a given context for the first time, it is added to that context's set of possible symbols. The probabilities are then adjusted by simple counting, in the manner of Laplace's rule of succession: each new observation raises the context's total count by one, so a symbol that had probability k/n (where n includes the escape's single count) falls to k/(n+1) if it was not the symbol observed, and rises to (k+1)/(n+1) if it was. In this way the probability of the escape symbol diminishes over time, and the probability of each letter observed in the current context gradually converges on its true conditional probability.
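
Treating the escape symbol as one permanent, extra observation gives the probabilities described above. A minimal sketch for a single context (the escape's name is illustrative):

from collections import Counter
from fractions import Fraction

ESC = '<esc>'  # illustrative name for the escape symbol

def probabilities(context_counts: Counter) -> dict:
    # The escape symbol always counts once, so the denominator is the
    # number of observations in this context plus one.
    total = sum(context_counts.values())
    probs = {s: Fraction(c, total + 1) for s, c in context_counts.items()}
    probs[ESC] = Fraction(1, total + 1)
    return probs

print(probabilities(Counter()))                   # escape only, probability 1/1
print(probabilities(Counter({'a': 2, 'b': 2})))   # a and b each 2/5, escape 1/5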

Because the decoder must be able to follow exactly the same sequence of decisions as the encoder, PPM cannot simply pick whichever order happens to give the current symbol the highest probability. Instead it always starts from the longest available context (up to the maximum order) and emits an escape whenever the symbol has not been seen there, dropping to successively shorter contexts until the symbol can be coded. Longer contexts usually give sharper estimates, so on average this keeps the number of bits required to code the output small, with the occasional escape as the price paid for trying the longer contexts first.
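
Under an ideal arithmetic coder the cost of a symbol is then the cost of any escapes plus the cost of the symbol in the context that finally codes it. A minimal sketch, with the per-order probability tables passed in as plain dictionaries (the function and variable names are illustrative, the values are taken from the example table below, and the 8-bit fall-through assumes an ASCII-style default model):

import math

ESC = '<esc>'

def symbol_cost_bits(per_order_probs: list, symbol: str) -> float:
    # per_order_probs lists the probability tables from the longest matching
    # context down to order 0; each maps symbols (and ESC) to probabilities.
    bits = 0.0
    for probs in per_order_probs:
        if symbol in probs:
            return bits - math.log2(probs[symbol])
        bits -= math.log2(probs[ESC])
    return bits + 8.0  # unseen everywhere: all the escapes plus a fixed-length code

tables = [
    {'b': 2/3, ESC: 1/3},             # order-2 context "aa"
    {'a': 2/5, 'b': 2/5, ESC: 1/5},   # order-1 context "a"
    {'a': 4/9, 'b': 4/9, ESC: 1/9},   # order-0
]
print(symbol_cost_bits(tables, 'a'))  # escape from "aa", then code 'a' at order 1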

Example

After encoding the sequence "aabaabbb", the complete set of models for PPM with a maximum context order of 2 is as follows:
order  context  count  p(a|context)  p(b|context)  p(esc|context)
2      aa       2      0             2/3           1/3
2      ab       2      1/3           1/3           1/3
2      ba       1      1/2           0             1/2
2      bb       1      0             1/2           1/2
1      a        4      2/5           2/5           1/5
1      b        3      1/4           2/4           1/4
0      (empty)  8      4/9           4/9           1/9
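
The table above can be reproduced with a short script (a sketch: it simply tallies every context in "aabaabbb" and applies the counting rule described earlier, with the escape contributing one extra count to each denominator):

from collections import Counter, defaultdict
from fractions import Fraction

MAX_ORDER = 2
text = "aabaabbb"

# counts[k][context] is the frequency table for the length-k context.
counts = {k: defaultdict(Counter) for k in range(MAX_ORDER + 1)}
for i, symbol in enumerate(text):
    for k in range(min(MAX_ORDER, i) + 1):
        counts[k][text[i - k:i]][symbol] += 1

print("order  context  count  p(a)  p(b)  p(esc)")
for k in range(MAX_ORDER, -1, -1):
    for context, freq in sorted(counts[k].items()):
        total = sum(freq.values())
        p_a, p_b = (Fraction(freq[s], total + 1) for s in "ab")
        p_esc = Fraction(1, total + 1)
        print(k, context or "(empty)", total, p_a, p_b, p_esc)

The only cosmetic difference from the table above is that fractions such as 2/4 are printed in lowest terms.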