Document Recognition Technologies, Inc., Palo Alto, CA
Computer Science Seminar Room, G1.15
There is a considerable amount of technology available for processing documents in character-coded form, and considerably less for processing document images. The usual approach is to convert the document image to text using Optical Character Recognition (OCR). But there are instances where some knowledge of the document is required before OCR can perform well, and others where the computational overhead of OCR may not be justified.
We have developed a simple, robust and computationally inexpensive method of characterizing the shape of Roman characters. While not nearly as information rich as OCR output, the character shape codes and their associated word shape tokens adequately serve a number of applications. I will describe three classes of applications: language identification, information retrieval and document style characterization.
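To make the idea of character shape codes and word shape tokens concrete, here is a minimal sketch in Python. The abstract does not list the actual code set, so the classes below (ascender, descender, x-height, dotted letters) and the names `shape_code`, `word_shape_token` and `encode` are assumptions chosen for illustration. A real system would derive the codes from the character image, not from known text as is done here.

```python
# Hypothetical shape classes; the talk does not specify its actual code set.
ASCENDERS = set("bdfhklt")   # lowercase letters that rise above the x-height
DESCENDERS = set("gpqy")     # lowercase letters that fall below the baseline


def shape_code(ch: str) -> str:
    """Map one character to a coarse shape code (illustrative classes only)."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"   # reaches the ascender / cap height
    if ch in DESCENDERS:
        return "g"   # extends below the baseline
    if ch == "i":
        return "i"   # x-height stem with a dot
    if ch == "j":
        return "j"   # dotted stem with a descender
    if ch.islower():
        return "x"   # fits within the x-height band
    return ch        # punctuation and other symbols pass through unchanged


def word_shape_token(word: str) -> str:
    """Concatenate the shape codes of a word's characters."""
    return "".join(shape_code(c) for c in word)


def encode(text: str) -> list[str]:
    """Encode whitespace-separated words as word shape tokens."""
    return [word_shape_token(w) for w in text.split()]
```

Under these assumed classes, `word_shape_token("the")` gives `"AAx"` and `word_shape_token("typing")` gives `"Aggixg"`. Many characters share a code, which is why the representation carries less information than OCR output.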
Using word shape tokens, we have developed an automated means of detecting which of 24 languages is represented in a document image. This is particularly useful as a pre-process for OCR.
We have found that the concatenation of character shape codes serves as a novel index of document images, and that databases of document images can be searched for keywords rapidly and robustly.
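Keyword search over such an index might look like the following sketch. It is not the speaker's implementation: the shape classes, the inverted-index layout and the names `token`, `build_index` and `search` are all assumptions, and plain text stands in for shape codes that would really be extracted from page images.

```python
from collections import defaultdict

# Illustrative shape classes (assumed, not taken from the talk).
_ASC, _DESC = set("bdfhklt"), set("gpqy")


def token(word: str) -> str:
    """Encode a word as a string of coarse shape codes."""
    out = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in _ASC:
            out.append("A")
        elif ch in _DESC:
            out.append("g")
        elif ch in "ij":
            out.append(ch)
        elif ch.islower():
            out.append("x")
        # any other character (punctuation) is dropped for indexing
    return "".join(out)


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Build an inverted index from shape token to document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            t = token(word)
            if t:
                index[t].add(doc_id)
    return index


def search(index: dict[str, set[str]], keyword: str) -> set[str]:
    """Return ids of documents containing a word with the keyword's shape."""
    return index.get(token(keyword), set())


docs = {
    "doc1": "optical character recognition of page images",
    "doc2": "word shape tokens index the image database",
}
index = build_index(docs)
print(search(index, "shape"))
```

Because several words can share one shape token (under these classes "chops" encodes the same way as "shape"), the result is best read as a set of candidate documents rather than confirmed matches.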
Additionally, we have done some work on part-of-speech tagging of documents encoded using this technique. Since each source character maps to (almost) exactly one shape code, traditional measures of document content such as length, number of words and average word length are preserved.