Seminar Archive

Ms Elizabeth Styron
Colliding Fronts: Using XML and the TEI to build a full Text Digital Library at the New Zealand Electronic Text Centre

Mr Te Taka Keegan
Reo Maori kei te Ipurangi 2002 - Findings of a recent Maori Language Web Survey

Dr Andy Cockburn
Spatial Memory in 2D and 3D Physical and Virtual Environments

Dr Alistair Knott
Grounding Linguistic Structure in a model of Sensorimotor Cognition

Professor Ian Witten
Greenstone in Practice

Mr Shaochun Wang
Formal Object-oriented Specification in Standard Z

Mr Greg Reeve
Complete Refinement Rules for Microcharts

Mr Andreas Ribbrock
Groups Acting on Sets - A Powerful Approach to Content Based Retrieval

Mr Matthew Luckie
Measurement and Analysis of One-Way Internet Packet Dynamics

Mrs Beryl Plimmer
FreeForm: An Interface Design Environment for Novice Programmers

Dr Steve Reeves
Results on Formal Stepwise Design in Z

Dr Steve Reeves
mChart-Based Specification and Refinement

Dr Tyrone Grandison
Trust Management: The SULTAN Perspective

Dr Timothy A Budd
Multiparadigm Programming in J/MP

Dr Matt Jones
Sorting out Searching on Small Screen Devices

Mr John McPherson
Forming a corpus of voice queries for music information retrieval

Professor John Cleary
Optimising Tabling Structures for Bottom-Up Logic Programming

Mr Jarred Potter
Mercy Corps

Professor Nobuo Saito
Cyber Education Across Several Countries

Dr Richard Nelson
Improving Mobile IP Handovers

Dr Tony Smith
The Application of Unstructured Learning Techniques to Bioinformatics and Conceptual Biology

Dr Stefan Kramer
Inductive Databases for Bioinformatics and Predictive Toxicology

Dr Richard Dearden
AI on Mars: Autonomy for Planetary Rovers

Ms Colleen Shannon
Code Red: Spread and Victims of an Internet Worm

Mr David Moore
Fundamental Limits on Blocking Self Propagating Code

Dr Yong Wang
Modeling for Optimal Probability Prediction

Professor Raymond J Mooney
Text Mining with Information Extraction

TSG (Linux)
SCMS Unix Infrastructure

Mr Cameron Esslemont
A Sustainable Infrastructure in Support of Digital Libraries for Remote Communities

Geoff Holmes and Mark Utting
Waikato Visits China

Professor John Cleary
Starlog Group Research Talk: Optimization and Compilation of Data Structures

Dr David Streader
Symbolic model simplification

Ms Anette Steel
The Use of Auditory Feedback in Call Centre CHHI

Mr Thomas Olsson
Information management and software engineering research at Lund University

Dr Larry Spitz
Applications of Character Shape Coding

Dr Balachander Krishnamurthy
Web and Internet Measurement Research

Dr Steve Jones
Interactive Document Summarisation Using Automatically Extracted Keyphrases

Seminar Archive >> 2002
Dr Larry Spitz - Applications of Character Shape Coding

Document Recognition Technologies, Inc., Palo Alto, CA


Computer Science Seminar Room, G1.15

There is a considerable amount of technology available for processing documents in character-coded form, and considerably less for processing document images. The usual transformation applied to a document image is Optical Character Recognition (OCR). But there are instances where knowledge of the document is required before a good job of OCR can be performed, and others where the computational overhead of OCR may not be justified.

We have developed a simple, robust and computationally inexpensive method of characterizing the shape of Roman characters. While character shape codes are not nearly as information-rich as OCR output, a number of applications are adequately served by the codes and their associated word shape tokens. I will describe three classes of applications: language identification, information retrieval and document style characterization.
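
As a rough illustration of the idea, the sketch below maps already-decoded text to coarse shape codes. It is a simplification under stated assumptions: the real method works on character images rather than text, and the shape classes used here (`ASCENDERS`, `DESCENDERS`, the codes `A`, `x`, `g`) are illustrative choices, not the code set presented in the talk.

```python
# Illustrative shape classes (an assumption, not the published code set).
ASCENDERS = set("bdfhklt")   # lowercase letters that rise above the x-height
DESCENDERS = set("gpqy")     # lowercase letters that drop below the baseline


def shape_code(ch: str) -> str:
    """Map a single character to a coarse shape code."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"   # tall character
    if ch in DESCENDERS:
        return "g"   # character with a descender
    if ch in "ij":
        return ch    # dotted characters keep their own code
    if ch.islower():
        return "x"   # character that fits within the x-height
    return ch        # punctuation and spaces pass through unchanged


def word_shape_token(word: str) -> str:
    """Concatenate the shape codes of a word into a word shape token."""
    return "".join(shape_code(c) for c in word)


if __name__ == "__main__":
    for w in ("The", "dog", "retrieval"):
        print(w, "->", word_shape_token(w))
```

Many different characters share one code, which is why the tokens carry less information than OCR output while still being cheap to compute.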

Using word shape tokens, we have developed an automated means of detecting which of 24 languages is represented in a document image. This is particularly useful as a pre-process for OCR.

We have used the concatenation of character shape tokens as a novel index of document images, and find that we can search databases of document images for the presence of keywords rapidly and robustly.
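
A minimal sketch of keyword search over shape-coded documents follows. It assumes the documents have already been reduced to shape codes, and it simulates that step from plain text with a hypothetical `to_shape_codes` helper whose shape classes are illustrative only. The names `to_shape_codes` and `find_candidates` are not from the talk.

```python
def to_shape_codes(text: str) -> str:
    """Reduce text to coarse shape codes (illustrative classes, an assumption)."""
    out = []
    for ch in text:
        if ch.isupper() or ch.isdigit() or ch in "bdfhklt":
            out.append("A")      # tall character
        elif ch in "gpqy":
            out.append("g")      # character with a descender
        elif ch in "ij":
            out.append(ch)       # dotted characters keep their own code
        elif ch.islower():
            out.append("x")      # character within the x-height
        else:
            out.append(ch)       # spaces and punctuation are kept as-is
    return "".join(out)


def find_candidates(documents: dict[str, str], keyword: str) -> list[str]:
    """Return ids of documents containing a word whose token matches the keyword's."""
    token = to_shape_codes(keyword)
    return [doc_id for doc_id, coded in documents.items()
            if token in coded.split()]


if __name__ == "__main__":
    # Shape-coded stand-ins for document images.
    docs = {
        "a": to_shape_codes("the dog barked"),
        "b": to_shape_codes("a cat sat"),
    }
    print(find_candidates(docs, "dog"))
    print(find_candidates(docs, "cat"))
```

Because several words can share one token (here "cat" and "sat" both code to the same string), a match identifies candidate documents rather than confirming the keyword.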

Additionally, we have done some work on part-of-speech tagging of documents encoded using this technique. Since the mapping of source characters to shape codes is (almost) one-to-one, traditional measures of document content such as length, number of words, average word length, etc. are preserved.

© 2007 FCMS. The University of Waikato - Te Whare Wananga o Waikato