### PPM

Statistical coding schemes (e.g. arithmetic coding, Huffman coding, Shannon–Fano coding) can encode higher-probability symbols with fewer bits than lower-probability symbols, and the best overall compression is achieved when the estimated probability for each symbol reflects its true underlying (or posterior) probability—because then no probability mass is wasted.
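As a concrete illustration (a sketch, not any particular coder's API), the ideal code length for a symbol of probability p is −log₂(p) bits—the bound an arithmetic coder approaches:

```python
import math

# Ideal (Shannon) code length, in bits, for a symbol with probability p.
# An entropy coder such as an arithmetic coder approaches this bound.
def code_length_bits(p: float) -> float:
    return -math.log2(p)

print(code_length_bits(1 / 2))   # 1.0 bit
print(code_length_bits(1 / 64))  # 6.0 bits
```

So doubling a symbol's estimated probability saves one bit each time it is coded, which is why better probability estimates mean better compression.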

For written language, higher probabilities can often be assigned to a letter when some number of preceding symbols are used as conditioning context. That is, while the probability for the letter 'e' might be about 9% in isolation (for English), it is more like 11% if the previous letter is a 't', and more like 13% if the preceding pair of letters is "th".
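One way to see this effect is to measure conditional frequencies directly from a sample of text (the tiny corpus below is made up purely for illustration):

```python
from collections import Counter

text = "the theory of the thing"

# Unconditional frequency of 'e' versus its frequency after the context "th".
overall = Counter(text)
after_th = Counter(text[i + 2] for i in range(len(text) - 2)
                   if text[i:i + 2] == "th")

p_e = overall["e"] / len(text)                         # ~0.13 in this sample
p_e_given_th = after_th["e"] / sum(after_th.values())  # 0.75 in this sample
```

Even in this toy sample, conditioning on the preceding "th" sharply raises the probability assigned to 'e', which is exactly the effect PPM exploits.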

PPM attempts to exploit these higher conditional probabilities by maintaining a distribution over symbols for each context. If the two preceding symbols form the context, this is called an order-2 model. If only the single preceding symbol is the context, this is an order-1 model. An order-0 model predicts a symbol from its unconditional probability (i.e. no preceding context), and a default model assumes every possible symbol is equally likely, so a symbol encoded with the default model is output explicitly using its fixed-length code (e.g. ASCII or Unicode).

Only probabilities for actual observations are maintained. That is, if a letter has been seen by the model in the current context, its probability is based on its frequency in that context (and that frequency is then incremented for subsequent uses, since the letter has now been seen one more time). If it has never been seen in the current context, a shorter context (i.e. a lower-order model) is used to predict/encode it.

If a letter to be encoded has never been seen before, then it is coded explicitly with the default model. It is then added to the order-0 model for subsequent encoding. It is also added as a new observation for the preceding context(s). Moreover, it will obviously form part of some new contexts to predict whatever letter comes next, so these new contexts are added at their respective levels.
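The bookkeeping described above can be sketched as one table of counts per context (the structure and names here are mine, not from any particular PPM implementation):

```python
from collections import Counter, defaultdict

MAX_ORDER = 2

# models[k] maps a length-k context string to a Counter of the symbols
# observed immediately after that context.
models = {k: defaultdict(Counter) for k in range(MAX_ORDER + 1)}

def update(history: str, symbol: str) -> None:
    # Record `symbol` under every trailing context of length 0..MAX_ORDER,
    # so the order-0, order-1, and order-2 models all learn from it.
    for k in range(min(MAX_ORDER, len(history)) + 1):
        context = history[len(history) - k:]
        models[k][context][symbol] += 1

history = ""
for ch in "aab":
    update(history, ch)
    history += ch
# models[2]["aa"] now records that 'b' has followed "aa" once.
```

A context that has never occurred simply has no entry yet, which matches the rule that only observations are maintained.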

To signal a move from one model level to a lower one (say, from order-1 to order-0) during encoding or decoding, an escape symbol is used. Every context model includes such an escape symbol, and it starts out with probability 1/1 (because no symbol has yet been seen in that context). When a letter is seen in a given context for the first time, it is added to the set of possible symbols for that context. Probabilities are then adjusted in the spirit of Laplace's rule of succession. That is, the number of times the current context has been seen is increased by one (so any symbol with probability k/n in the current context is updated to k/(n+1), because the context has been seen once more), and the frequency of the observed symbol in this context is incremented by one (so a symbol that had probability k/n is updated to (k+1)/(n+1)). In this way the probability of the escape symbol diminishes over time, and the probability of each letter observed in the current context gradually converges on its true posterior probability.
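With the escape symbol counted as one extra implicit observation, the probabilities for a context work out as follows (a sketch of the rule just described):

```python
from collections import Counter

def probabilities(counts: Counter) -> dict:
    # Each observed symbol gets count/(total + 1); the remaining
    # 1/(total + 1) of the probability mass is reserved for the escape.
    total = sum(counts.values())
    probs = {sym: c / (total + 1) for sym, c in counts.items()}
    probs["<esc>"] = 1 / (total + 1)
    return probs

# A fresh context: nothing seen yet, so the escape has probability 1/1.
print(probabilities(Counter()))           # {'<esc>': 1.0}

# After seeing 'b' twice in some context: p(b) = 2/3, p(esc) = 1/3.
print(probabilities(Counter({"b": 2})))
```

As the context's count grows, 1/(total + 1) shrinks—the diminishing escape probability described above.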

Because a letter may have a higher probability in a lower-order model, PPM works out the probability of the next symbol at every model level and then encodes using whichever context offers the highest probability, thus minimizing the number of bits required to code the output.
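That selection step might look like this (a sketch with hand-built toy counts; real PPM implementations differ, and the escape mechanics are omitted for brevity):

```python
from collections import Counter

# Toy models: for each order, a map from context to a Counter of next symbols.
models = {
    2: {"aa": Counter({"b": 2})},
    1: {"a": Counter({"a": 2, "b": 2})},
    0: {"": Counter({"a": 4, "b": 4})},
}

def best_order(history: str, symbol: str, max_order: int = 2):
    # Return (order, probability) for whichever context assigns `symbol`
    # the highest probability (a symbol unseen in a context scores 0 there).
    best = (0, 0.0)
    for k in range(min(max_order, len(history)), -1, -1):
        counts = models[k].get(history[len(history) - k:])
        if not counts:
            continue
        p = counts[symbol] / (sum(counts.values()) + 1)
        if p > best[1]:
            best = (k, p)
    return best

print(best_order("aa", "b"))  # (2, 2/3): the order-2 context wins for 'b'
print(best_order("aa", "a"))  # (0, 4/9): order-0 beats order-1's 2/5 for 'a'
```

Note that the decoder can perform exactly the same comparison, so both sides agree on which context was used without any extra signalling.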

#### Example

After encoding the sequence "aabaabbb", the complete set of models for PPM with a maximum context of order 2 is as follows:
| order | context | count | p(a\|context) | p(b\|context) | p(esc\|context) |
|:-----:|:-------:|:-----:|:-------------:|:-------------:|:---------------:|
| 2     | aa      | 2     | 0             | 2/3           | 1/3             |
| 2     | ab      | 2     | 1/3           | 1/3           | 1/3             |
| 2     | ba      | 1     | 1/2           | 0             | 1/2             |
| 2     | bb      | 1     | 0             | 1/2           | 1/2             |
| 1     | a       | 4     | 2/5           | 2/5           | 1/5             |
| 1     | b       | 3     | 1/4           | 2/4           | 1/4             |
| 0     |         | 8     | 4/9           | 4/9           | 1/9             |

- In the table above, the denominator of any probability is 1 plus the count for that context, and the numerator is the count of that symbol in that context. The escape symbol is always assumed to have been seen at least once, which is why the denominator has the "1 plus".
- Note that although "bb" appears twice in the input, it has only been used once as an order-2 context (because we haven't yet seen what comes after the second "bb"), and "b" has only been used 3 times as an order-1 context (for the same reason).
- Note also that the order-0 probabilities for "a" and "b" are each converging on one-half, but a little probability is always set aside for the escape, because there is always some non-zero chance of seeing a letter we've never seen before.
- The probabilities in the table aren't exactly those an encoder would have used at each step, because some escapes would actually have been emitted during encoding ... but a full step-by-step trace would have obscured the overall idea, so I simplified the table.
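The counts in the table can be reproduced with a short script (a sketch; the variable names are mine):

```python
from collections import Counter, defaultdict

text = "aabaabbb"
MAX_ORDER = 2

# models[k] maps a length-k context to a Counter of observed next symbols.
models = {k: defaultdict(Counter) for k in range(MAX_ORDER + 1)}
for i, sym in enumerate(text):
    for k in range(min(MAX_ORDER, i) + 1):
        models[k][text[i - k:i]][sym] += 1

def table_row(counts: Counter):
    # (count, p(a), p(b), p(esc)) using the "1 plus" escape denominator.
    total = sum(counts.values())
    return (total,
            counts["a"] / (total + 1),
            counts["b"] / (total + 1),
            1 / (total + 1))

print(table_row(models[2]["aa"]))  # (2, 0, 2/3, 1/3)
print(table_row(models[1]["a"]))   # (4, 2/5, 2/5, 1/5)
print(table_row(models[0][""]))    # (8, 4/9, 4/9, 1/9)
```

Each printed row matches the corresponding line of the table above.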