Ian H. Witten

CitizenshipBritish, Canadian, and New Zealand
Family StatusMarried, two children


1976Ph.D.University of Essex, England
(Electrical Engineering Science)
1973M.A. (unearned)University of Cambridge, England
1970M.Sc.University of Calgary, Canada
(Mathematics, Statistics, and Computer Science)
1969B.A. (First Class Honours)University of Cambridge, England
1976Chartered EngineerInstitute of Electrical Engineers, London


2014-presentEmeritus ProfessorDepartment of Computer Science
University of Waikato, New Zealand
1992-presentProfessorDepartment of Computer Science
University of Waikato, New Zealand
1980-1991ProfessorDepartment of Computer Science
University of Calgary, Canada
1982-1985HeadDepartment of Computer Science
University of Calgary, Canada
1979-1980Senior LecturerDepartment of Electrical Engineering Science
University of Essex, England
1970-1979LecturerDepartment of Electrical Engineering Science
University of Essex, England
1969-1970Graduate AssistantDepartment of Mathematics, Statistics, and
Computer Science, University of Calgary, Canada

Visiting appointments and scholarships

2009, JunSanta Chiara ChairUniversity of Siena, Italy
2008, Jul-SepUniversity of Siena, Italy
2008, Sep-DecEuropean Media Research Lab, Germany
2007, JunUniversity of Siena, Italy
2005, JulUniversity of Siena, Italy
2005, Aug-SepÉcole Nationale Supérieure des Télécommunications, France
2005, Oct-NovGoogle Inc., New York, USA
2005, Nov-DecUniversity of Cape Town, South Africa
2004, Jul-OctUniversity of Lethbridge, Canada
2003, Jul-OctUniversity of Lethbridge, Canada
2001, Oct-DecGoogle Inc., California, USA
1997, Jul-DecVisiting ProfessorUniversity of Calgary, Canada
1996, JunSERC Visiting FellowMiddlesex University, London
1989, Jun-JulSERC Visiting FellowUniversity of Stirling, Scotland
1986, Jan-AprErskine FellowUniversity of Canterbury, New Zealand
1985, Jul-SepVisiting ProfessorTechnical University of Graz, Austria
1977, Sep-DecVisiting ConsultantBell-Northern Research, Montreal, Canada
1977, May-SepVisiting ProfessorUniversity of Calgary, Canada
1977, Jan-MayErskine FellowUniversity of Canterbury, New Zealand
1969-1970Commonwealth ScholarUniversity of Calgary, Canada
1968-1969College ScholarGonville and Caius College, Cambridge, England
1966-1968College ExhibitionerGonville and Caius College, Cambridge, England

Membership in professional societies and editorial boards

FellowRoyal Society of New Zealand
Association for Computing Machinery (ACM)
MemberAmerican Society for Information Science (ASIS)
Institute of Electrical Engineers (IEE)
Institution of Electrical and Electronic Engineers (IEEE)
New Zealand Computer Society (NZCS)


An unusual aspect of my work over the last two decades is a strong emphasis on producing open source software that both establishes research advances in a way that is scientifically reproducible, and lets others capitalize on the techniques developed. Note that all the contributions mentioned below have had significant involvement from graduate students and other academic colleagues.

Machine learning/data mining

With colleagues, I have made many contributions to this area, including several novel algorithms. In 1993 I began the Waikato Machine Learning project, which created the Weka Machine Learning Workbench, now widely used internationally. I later devolved leadership to a colleague, but remain involved. The (co-authored) book Data Mining: Practical Machine Learning Tools and Techniques, now in its fourth edition, has sold around 50,000 copies and attracted countless academic citations.

Digital library systems

Work on text and index compression (below) led to the Waikato Digital Library project, which created the Greenstone Digital Library Software, also widely used internationally—particularly in developing countries. I began the project in 1995 and recently devolved leadership to a colleague. We partnered with UNESCO in 2000, and the interface has since been translated (by volunteers) into 60 languages. Through this, I have become heavily involved in promoting the use of digital libraries in developing countries. The (co-authored) book How to Build a Digital Library, now in its second edition, is widely used in digital library education internationally.

Computer-assisted language learning

My students and I have pioneered new ways to define and identify collocations by analyzing huge databases of text fragments from the Web (supplied by Google). This has led to several journal and conference publications. Along with other techniques, it forms the basis of the FLAX (Flexible Language Acquisition) system for second language learning, which is in regular use in China, Japan, Vietnam and Timor Leste. FLAX forms the basis of a recent Chinese “Three Brothers” project between Waikato and the universities of Yunnan and Fudan.

Wikipedia mining

In recent years I have become interested in applying this enormous online encyclopedia to conceptual document representation. With others, I have written a survey on mining meaning from Wikipedia and developed a novel technique for linking documents to relevant encyclopedia articles, which other researchers are beginning to use. This has been released as the Wikipedia Miner Workbench to make it easier for others to capitalize on this research.

Text compression

Data compression is modeling and prediction at work (see article in Abacus, 1986). In 1982/83, Dr Cleary and I discovered some fundamental results concerning adaptive modeling for data compression (published in IEEE Transactions on Information Theory, 1984), and invented a text compression method which outperforms virtually all known techniques by a large margin\(emit typically takes only 2.2 bits/character for mixed-case English modeling (IEEE Transactions on Communications, 1984). We have also published an influential study of arithmetic coding in Communications of the ACM (1987), a survey of modeling for text compression (Computing Surveys, 1989), and a major research book on adaptive text compression (Text Compression, Prentice Hall, 1990). Much of this work is with Dr Bell of the University of Canterbury, New Zealand, a former postdoctoral fellow at Calgary, with whom I have worked on models for natural language text (International Journal of Man-Machine Studies, 1990) and the zero-frequency problem (IEEE Transactions on Information Theory, 1991).

Index compression

Our work in text compression has led to advances in compressed full-text retrieval systems. My colleagues and I were the first to realize that full-text indexes can be effectively compressed, which immediately made obsolete the then-standard technique of signature tables. We developed new techniques for compressing a concordance (reported in the Journal of Information Science, 1991), which is an inverted index of all words in a text, and it transpired that a compressed concordance gives substantial advances over the method of signature files, which represent the previous state of the art in compressed information systems (Journal of the American Society for Information Science, 1993). Along with research on the compression of document images (Proceedings of the IEEE, 1994), this work has culminated in a major book, Managing Gigabytes: compressing and indexing documents and images. This is a substantial (430 pp.) research text, which, along with a public-domain software system, was influential and widely recognized. It was required reading for early Google employees, and the second edition, published in 1999, is still in demand, and still used as a graduate text.

Melody extraction and indexing

Colleagues and I undertook seminal research on melody extraction from audio waveforms and approximate matching against notated music, which has spawned a whole new research area (now with its own conference). In separate projects, I co-invented multiple-viewpoint prediction of music, and performed the first assessment of human ability to predict melody. We have also created digital libraries of music with novel facilities for searching and browsing.

Other research projects

In addition to the above projects, the results of several smaller research projects has been released as open source software, and in some cases have become widely used. Two systems for keyphrase extraction, KEA and its later relative MAUI, both represent significant advances in the state of the art, and are now well known. Our Realistic Books project is spearheading a resurrection of the codex or book form, simulated by computer—not just the computer graphics effects of turning the pages, but the actual book form itself, coupled with the advantages of electronic searching and incorporation of multimedia. The KATOA software for document similarity has recently been made available and is likely to be come similarly influential.

Theory of adaptive systems

This is the subject of my Ph.D. thesis (1976). It encompasses the so-called ``two-armed bandit'' problem, of which I published a major survey (Journal of the Franklin Institute, 1976), resulting in an invitation by Professor Cover, the world's leading authority in the area, to give a seminar at Stanford University (1977).
The two-armed bandit problem epitomizes the conflict between system identification and control, where the cost of exploratory activity must be balanced against the cost of poor system knowledge. In bald form, it concerns an experimenter who is given two biased coins, whose biases are unknown, and must make tosses of either coin with the promise of getting $1 for heads but nothing for tails. His dilemma is that on the basis of a finite sample he can guess which coin is favored, but he may be wrong, and to increase confidence in his guess he must choose the apparently inferior coin sometimes. Constraints of finite time, finite memory (in the sense of remembering the results of the last few tosses only), and finite state make the problem an interesting system-theoretic study, with potential application to any man-machine system which demands simultaneous exploration and control.
It also includes adaptive control systems, the study of which led to an important theoretical paper in Information and Control (1977), an invited paper in the Journal of Cybernetics and Systems Science (1977), the award of an Erskine Fellowship to study in New Zealand (1977), and an invitation to contribute to the International Conference on Mathematical Learning Models (Germany, 1982).

Speech synthesis

My contributions to this field have been diverse. I have worked on the design of speech synthesizers, for which I was consultant to the Essex Electronics Centre (1973-1974) and attracted an SRC research grant (1975-1977). My major contribution has been in the area of synthesis of prosodic features, which led to a key paper in Language and Speech (1977) and attracted another SRC grant (1979). Much of this work has proceeded in collaboration with the Phonetics Department at the University of Edinburgh, the Department of Computer Science at the University of Calgary, Bell-Northern Research at Montreal, the Polish Academy of Sciences in Poznan, and the Department of Experimental Psychology at the University of Sussex. A major research book, Principles of Computer Speech (Academic Press, 1982), contains the results of my own studies, as well as those of others. One of my papers was selected for inclusion in an IEEE Computer Society collection entitled End user facilities in the 1980's (1982).

Digital signal processing

Work on speech synthesis naturally brought me into contact with speech analysis in the form of digital processing of signals. As well as developing special-purpose architectures for speech analysis, for which I was awarded a grant from the Government Communications Headquarters (1979-1981), I have worked on novel algorithms for signal processing, with a major paper in IEEE Transactions on Circuits and Systems (1981); and on new techniques for picture display using space-filling curves (IEEE Computer Graphics and Applications, 1982).

Linear and non-linear identification techniques

My separate interests in adaptive systems and speech converge in the area of identification of time series. I have applied linear techniques to unusual problems such as the identification of curves used in handwriting. As well as contributing to the development and analysis of algorithms for non-linear behavior sequence identification (International Journal of General Systems, 1979, 1980), I have used the techniques in human-computer interfaces which predict user entries (see below). A survey of methods for modeling behavior sequences appears in Future Computer Systems (1987).

Adaptive user interfaces

Out of work on adaptive systems and behavior sequence identification sprang a strong interest in the potential of adaptation to simplify the user interface. A series of theoretical and experimental investigations demonstrated that (contrary to popular belief) interfaces can be designed to adapt automatically and successfully to the needs of individual users (reported in International Journal of Man-Machine Studies, 1984, and Behaviour and Information Technology, 1985). Out of this developed a general interest in user interface interaction styles (Oxford Surveys in Information Technology, 1985). A notable adaptive interface is the ``Reactive Keyboard,'' a device that accelerates typewritten communication with a computer system by predicting what the user is going to type next (reported in Interacting with Computers, 1991). Macintosh (reported in IEEE Computer, 1990), PC, and Unix versions have been distributed worldwide as a computer interface for disabled people, and, according to users, have dramatically altered the way they use computers. A monograph on the system (The Reactive Keyboard,\c 1992) has been published by Cambridge University Press.

Programming by example

Programming by example is the inference of programs from examples of their execution. We have created a novel system called ``Metamouse,'' a user interface agent that learns to perform editing tasks within a drawing program (reported in Computer Graphics, 1989). The user specifies a procedure by supplying an example execution trace, manipulating objects directly on the screen and creating graphical tools where necessary to help make constraints explicit. Metamouse induces the procedure, identifying key features of individual program steps and disregarding coincidental events, and uses it to predict upcoming actions and perform them for the user, thereby reducing the tedium of repetitive graphical editing. The scheme has attracted a good deal of interest, including, for example, unsolicited donations from Apple computer to further the research. We have extended the idea to repetitive text editing, and have developed a theoretical approach to evaluating programs formed by example.

Machine learning

A number of technical advances in machine learning have been made, including a new algorithm for similarity-based learning (reported in the 1988 Machine Learning conference); a simplified presentation of explanation-based learning that removes it from its rather specialized niche and places it in the context of expert systems, problem-solving, and planning (Journal of Experimental and Theoretical Artificial Intelligence, 1989); and a new scheme for accelerating search in function induction which gives several orders of magnitude performance increase over the standard search technique (Journal of Experimental and Theoretical Artificial Intelligence, 1990). Contributions have also been made towards the rapprochement of knowledge acquisition and machine learning (reported in International Journal of Man-Machine Studies, 1988, and IEEE Transactions on Systems, Man and Cybernetics, 1989).

Documentation systems

In 1980 my book Communicating with Microcomputers was published by Academic Press from camera-ready copy. During the production process I became intrigued with the manifest problems of existing document preparation systems and have developed a strong interest not only in batch document formatting languages and typography (papers in Software\(emPractice and Experience, 1982, and International Journal of Man-Machine Studies, 1985), but also in on-line document browsing and dynamic books (Communications of the Association for Computing Machinery, 1985). This interest has extended to interactive systems which select documentation or ``advice'' appropriate to each individual user's needs, based on long-term observations of their behavior, in the manner of an automated coach (International Journal of Man-Machine Studies, 1985).

Further research areas

These include a (serious, SRC-funded) project on positional play in chess (International Journal of Man-Machine Studies, 1975, 1976), which has excited considerable interest; Chinese-language computer systems, funded by Monotype International (1979); semiotics and computers; and text processing. An interesting accomplishment was the development of a hand-controlled speech toy (published in Wireless World, 1978, 1979) which attracted a large number of requests for circuit cards by would-be constructors and has been used by a developmental psychologist as a stimulus for autistic children. During 1982-1985 I was heavily involved in Jade, a large group project at Calgary which aimed to construct an environment for the development of distributed software (ACM Operating Systems Review, 1983). I took an early interest in computer viruses (see article in Abacus, 1987), and in 1989 co-invented a scheme called ``Liveware'' which enables the distribution of databases, and the maintenance of distributed databases, using the transmission technique of computer viruses (reported in International Journal of Man-Machine Studies, 1991). I have also worked on adaptively formed models for music prediction (Interface: Journal of New Music Research, 1990).

Publications in

ACM Operating Systems Review
ACM SIGOA Newsletter
AI and Society
Behaviour and Information Technology
Canadian Artificial Intelligence
Communications of the ACM
Computer Bulletin
Computer Graphics
Computer Journal
Computer Music Journal
Computer Science Education
Computers and Security
Computing Surveys
Digital Processes
Electronics Letters
Future Computing Systems
Human-Computer Interaction
IEE Proceedings
IEE Transactions on Computers and Digital Techniques
IEE Journal on Software and Microsystems
IEEE Computer
IEEE Computer Graphics and Applications
IEEE Transactions on Circuits and Systems
IEEE Transactions on Communications
IEEE Transactions on Computers
IEEE Transactions on Information Theory
IEEE Transactions on Systems, Man and Cybernetics
Information and Control
Information Processing and Management
Interacting with Computers
Interface: Journal of New Music Research
International Journal of General Systems
International Journal of Intelligent Systems
International Journal of Man-Machine Studies
Journal of Cybernetics and Information Science
Journal of Experimental and Theoretical Artificial Intelligence
Journal of Information Science
Journal of the Acoustical Society of America
Journal of the American Society for Information Systems
Journal of the Association for Computing Machinery
Journal of the Franklin Institute
Language and Speech
Leonardo Music Journal
Literary and Linguistic Computing
Machine Learning
New Zealand Journal of Computing
Software\(emPractice and Experience
Wireless World