All

[1]

A Tool for Model Generation and Knowledge Acquisition

S.J. Cunningham and P. Denize. A tool for model generation and knowledge acquisition. In Proc International Workshop on Artificial Intelligence and Statistics, pages 213-222, Fort Lauderdale, Florida, USA, 1993.
[ bib | .ps | .pdf ]
Tools to automatically induce domain descriptions from examples are valuable aids for the knowledge acquisition stage of expert system construction. This paper presents a description of an algorithm that induces two domain descriptions: a conceptual model, which gives a broad understanding of variable interactions and their effect on the system, and a predictive model, which determines the system value associated with variable values input to the model. This induction algorithm is based on Entropy Data Analysis (EDA), which builds linear equations to approximate the system described by the training data.

[2]

Using Data Mining to Support the Construction and Maintenance of Expert Systems

G. Holmes and S.J. Cunningham. Using data mining to support the construction and maintenance of expert systems. In Proc Artificial Neural Networks and Expert Systems, pages 156-159, Dunedin, New Zealand, 1993.
[ bib | .pdf ]
Many expert systems are constructed from and allied with a large collection of databases that are continually being updated. In this paper we address the issues of how such a knowledge base can be constructed using tools that search the databases for significant, unexpected correlations and present them to the knowledge engineer for review. Once a knowledge base has been constructed, there is the problem of keeping it consistent with changing conditions in the real world. Concepts which are embodied in the data may drift in time, and so some mechanism for updating the knowledge base must be developed. As it would prove costly to periodically revisit the knowledge acquisition phase, we propose to monitor the domain database automatically for significant concept change.

[3]

Expert Systems Development Using Data Mining

G. Holmes and S.J. Cunningham. Expert systems development using data mining. In Proc Expert Systems '93, pages 213-222, Cambridge, England, 1993.
[ bib | .ps | .pdf ]
This paper presents a model of expert system construction beginning with an initial set of data chosen by a domain expert. The domain expert creates the knowledge base both through direct knowledge transfer (or through the construction of example sets for machine learning) and through mining one or more of the databases. As construction progresses, the domain expert may alter the composition of the databases by directing that additional or different data be collected. Interactively guided by the expert, the mining tool efficiently organises the search for findings in the database. These findings are then evaluated for inclusion in the knowledge base.

[4]

Practical Machine Learning and its Potential Application to Problems in Agriculture

I.H. Witten, S. Cunningham, G. Holmes, R.J. McQueen, and L.A. Smith. Practical machine learning and its potential application to problems in agriculture. In Proc New Zealand Computer Conference, volume 1, pages 308-325, Auckland, New Zealand, 1993.
[ bib | .ps.gz | .pdf ]
One of the most exciting and potentially far-reaching developments in contemporary computer science is the invention and application of methods of machine learning. These have evolved from simple adaptive parameter-estimation techniques to ways of (a) inducing classification rules from examples, (b) using prior knowledge to guide the interpretation of new examples, (c) using this interpretation to sharpen and refine the domain knowledge, and (d) storing and indexing example cases in ways that highlight their similarities and differences. Such techniques have been applied in domains ranging from the diagnosis of plant disease to the interpretation of medical test data. This paper reviews selected methods of machine learning with an emphasis on practical applications, and suggests how they might be used to address some important problems in the primary production industries, particularly agriculture.

[5]

Complexity-based Induction

D. Conklin and I.H. Witten. Complexity-based induction. Machine Learning, 16(3):203-225, 1994.
[ bib | .ps.gz | .pdf ]
A central problem in inductive logic programming is theory evaluation. Without some sort of preference criterion, any two theories that explain the same set of examples are equally acceptable. This paper presents a scheme for evaluating alternative inductive theories based on an objective preference criterion. It strives to extract maximal redundancy from examples, transforming structure into randomness. A major strength of the method is its application to learning problems where negative examples of concepts are scarce or unavailable. A new measure called model complexity is introduced, and its use is illustrated and compared with a proof complexity measure on relational learning tasks. The complementarity of the model and proof complexity parallels that of model and proof-theoretic semantics. Model complexity, where applicable, seems to be an appropriate measure for evaluating inductive logic theories.

[6]

WEKA Machine Learning Project: Cow Culling

R.E. De War and D.L. Neal. Weka machine learning project: Cow culling. Technical report, The University of Waikato, Computer Science Department, Hamilton, New Zealand, 1994.
[ bib | .ps.gz | .pdf ]
This document describes the results of applying the WEKA machine learning workbench to a large database of dairy cows. The aim of the project was to induce usable rules for culling decisions from the data, and hence find which attributes were most important to farmers in determining whether an animal should be culled.

[7]

WEKA: A Machine Learning Workbench

G. Holmes, A. Donkin, and I.H. Witten. Weka: A machine learning workbench. In Proc Second Australia and New Zealand Conference on Intelligent Information Systems, Brisbane, Australia, 1994.
[ bib | .ps.gz | .pdf ]
WEKA is a workbench for machine learning that is intended to aid in the application of machine learning techniques to a variety of real-world problems, in particular, those arising from agricultural and horticultural domains. Unlike other machine learning projects, the emphasis is on providing a working environment for the domain specialist rather than the machine learning expert. Lessons learned include the necessity of providing a wealth of interactive tools for data manipulation, result visualization, database linkage, and cross-validation and comparison of rule sets, to complement the basic machine learning tools.

[8]

Knowledge-rich Induction of Classification Rules

B. Martin. Knowledge-rich induction of classification rules. In Proc Canadian Machine Learning Workshop, University of Calgary, Alberta, Canada, 1994.
[ bib | .ps.gz | .pdf ]
The purpose of this research was to produce a machine learning system that can take advantage of many forms of background knowledge to guide the induction of classification rules. This system will be used for knowledge discovery in databases (also known as database mining), as the use of background knowledge can considerably reduce the search space of a database knowledge search. A new system, MARVIN++, is introduced, that attempts to satisfy this aim.

[9]

The WEKA Machine Learning Workbench : Its Application to a Real World Agricultural Database

R.J. McQueen, D.L. Neal, R.E. DeWar, S.R. Garner, and C.G. Nevill-Manning. The weka machine learning workbench : Its application to a real world agricultural database. In Proc Canadian Machine Learning Workshop, Banff, Alberta, Canada, 1994.
[ bib | .ps.gz | .pdf ]
Numerous techniques have been proposed for learning rules and relationships from diverse data sets, in the hope that machines can help in the often tedious and error-prone process of knowledge acquisition. While these techniques are plausible and theoretically well-founded, they stand or fall on their ability to make sense of real-world data. This paper describes a project that aims to apply a range of learning strategies to problems in primary industry, in particular agriculture and horticulture.

[10]

Geometric Comparison of Classifications and Rule Sets

T.J. Monk, R.S. Mitchell, L.A. Smith, and G. Holmes. Geometric comparison of classifications and rule sets. In Proc AAAI Workshop on Knowledge Discovery in Databases, pages 395-406, Seattle, Washington, USA, 1994.
[ bib | .ps.gz | .pdf ]
We present a technique for evaluating classifications by geometric comparison of rules. Rules are represented as objects in an n-dimensional hyperspace. The similarity of classes is computed from the overlap of the geometric class descriptions. The system produces a correlation matrix that indicates the degree of similarity between each pair of classes. The technique can be applied to classification generated by different algorithms, with different numbers of classes and different attribute sets. Experimental results from a case study in a medical domain are included.

[11]

Modelling Sequences Using Grammars and Automata

C.G. Nevill-Manning and D.L. Maulsby. Modelling sequences using grammars and automata. In Proc Canadian Machine Learning Workshop, University of Calgary, Alberta, Canada, 1994.
[ bib | .pdf ]
This paper presents two sequence modelling techniques. The first induces a context-free deterministic grammar from a sequence. It was motivated by a specific machine learning problem, that of modelling a sequence of actions performed by a computer user, but it can also be applied to other machine learning problems, and its performance as a data compression technique is the best in its class. The second technique induces a push-down finite-state automaton from a sequence. It was designed to derive an executable program from a program execution trace expressed in high-level language statements, and is capable of recognising branches and loops, as well as recursive and non-recursive procedure calls. The inductive capabilities of these two techniques are complementary, and following their description we examine how they can be combined into a more powerful system.

[12]

Compression by Induction of Hierarchical Grammars

C.G. Nevill-Manning, I.H. Witten, and D.L. Maulsby. Compression by induction of hierarchical grammars. In J.A. Storer and M. Cohn, editors, Proc Data Compression Conference, pages 244-253, Los Alamitos, CA, 1994. IEEE Press.
[ bib | .pdf ]
This paper describes a technique that constructs models of symbol sequences in the form of small, human-readable, hierarchical grammars. The grammars are both semantically plausible and compact. The technique can induce structure from a variety of different kinds of sequence, and examples are given of models derived from English text, C source code and a sequence of terminal control codes.

[13]

Data Transformation: A Semantically-based Approach to Function Discovery

T.H. Phan and I.H. Witten. Data transformation: A semantically-based approach to function discovery. Technical report, The University of Waikato, Computer Science Department, Hamilton, New Zealand, 1994.
[ bib | .ps.gz | .pdf ]
This paper presents the method of data transformation for discovering numeric functions from their examples. Based on the idea of transformations between functions, this method can be viewed as a semantic counterpart to the more common approach of formula construction used in most previous discovery systems. Advantages of the new method include a flexible implementation through the design of transformation rules, and a sound basis for rigorous mathematical analysis to characterize what can be discovered. The method has been implemented in a discovery system called Linus , which can identify a wide range of functions: rational functions, quadratic relations, and many transcendental functions, as well as those that can be transformed to rational functions by combinations of diferentiation, logarithm and function inverse operations.

[14]

Function Discovery using Data Transformation

T.H. Phan and I.H. Witten. Function discovery using data transformation. In Proc International Symposium on Artificial Intelligence and Mathematics, Florida, USA, 1994.
[ bib | .ps.gz | .pdf ]
Function discovery is the problem of finding a symbolic formula for an unknown function from examples of the function's value on certain arguments. In most previous discovery systems, the description language has been restricted to rational functions so that symbolic descriptions can be easily enumerated. This papers shows how data transformation can be used as the basis of a far more comprehensive description language that includes all functions that can be transformed to rational functions by differentiation and logarithm operations. the main contribution of this paper is to define a transformation-based description language and characterize its representational power. We also briefly sketch a practical implementation of a function induction system that uses this approach.

[15]

The WEKA Machine Learning Workbench (Video)

WEKA Machine Learning Group. The weka machine learning workbench (video), 1994.
[ bib ]
This is a 6-minute video tape giving a brief introduction to the WEKA project and the WEKA workbench.

[16]

Trans-Pacific Machine Learning Research: the Calgary/Waikato Axis

I.H. Witten. Trans-pacific machine learning research: the calgary/waikato axis. In Proc Canadian Machine Learning Workshop, Calgary, Alberta, Canada, 1994.
[ bib | .ps.gz | .pdf ]
The following five contributions summarize selected research projects on machine learning that are being undertaken in the Computer Science Departments at the Universities of Calgary and Waikato:
(a) The WEKA machine learning workbench (Bob McQueen et al.);
(b) Knowledge-rich induction of classification rules (Brent Martin);
(c) LINUS: a transformation-based system for function discovery (Thong Phan);
(d) Modeling sequences (Craig Nevill-Manning et al.);
(e) Instructible agents (Dave Maulsby).

[17]

Experiments on the zero frequency problem

J.G. Cleary and W.J. Teahan. Experiments on the zero frequency problem. In J.A. Storer and M. Cohn, editors, Proc Data Compression Conference, page 480, Los Alamitos, CA, 1995. IEEE Press.
[ bib | .ps.gz | .pdf ]
The best algorithms for lossless compression of text are those which adapt to the text being compressed [1]. Two classes of such adaptive techniques are commonly used. One class matches the text against a dictionary of strings seen and transforms the text into a list of indices into the dictionary. These techniques are usually formulated as a variant on Ziv-Lempel (LZ) compression. While LZ compressors do not give the best compression they are widely used because of their simplicity and low execution overhead.

[18]

K*: An Instance-Based Learner Using an Entropic Distance Measure

J.G. Cleary and L.E. Trigg. K*: An instance-based learner using an entropic distance measure. In Proc Machine Learning Conference, pages 108-114, Tahoe City, CA, USA, 1995.
[ bib | .ps.gz | .pdf ]
The use of entropy as a distance measure has several benefits. Amongst other things it provides a consistent approach to handling of symbolic attributes, real valued attributes and missing values. The approach of taking all possible transformation paths is discussed. We describe K*, an instance-based learner which uses such a measure, and results are presented which compare favourably with several machine learning algorithms.

[19]

Unbounded length contexts for PPM

J.G. Cleary, W.J. Teahan, and I.H. Witten. Unbounded length contexts for ppm. In J.A. Storer and M. Cohn, editors, Proc Data Compression Conference, pages 52-61, Los Alamitos, CA, 1995. IEEE Press.
[ bib | .ps.gz | .pdf ]
The PPM data compression scheme has set the performance standard in lossless compression of text throughout the past decade. The original algorithm was first published in 1984 by Cleary and Witten [3], and a series of improvements was described by Moffat [6], culminating in a careful implementation, called PPMC, which has become the benchmark version. This still achieves results superior to virtually all other compression methods, despite many attempts to better it. Other methods such as those based on Ziv-Lempel coding [9] are more commonly used in practice, but their attractiveness lies in their relative speed rather than any superiority in compression-indeed, their compression performance generally falls distinctly below that of PPM in practical benchmark tests [1].

[20]

Machine Learning and Statistics: a Matter of Perspective

S.J. Cunningham. Machine learning and statistics: a matter of perspective. New Zealand J Computing, 6(1a):69-73, August 1995.
[ bib | .ps.gz | .pdf ]
Information has become an important commercial commodity-indeed, possibly the most important product of the future. While we have well-developed technologies to store data, the analysis to extract information is time-consuming and requires skilled human intervention. Machine learning algorithms augment statistical analysis by providing mechanisms that automate the information discovery process. These algorithms also tend to be more accessible to end-users and domain experts. The two analysis methods are converging, and the fields have much to offer each other.

[21]

Analysis of Cow Culling with a Machine Learning Workbench

R.E. DeWar and R.J. McQueen. Analysis of cow culling with a machine learning workbench. Technical Report 95/1, University of Waikato, Computer Science Department, Hamilton, New Zealand, January 1995.
[ bib | .ps.gz | .pdf ]
This report discusses the use of machine learning tools to examine datasets from a database of dairy cows. The objective of the study was to investigate whether these machine learning tools could extract meaningful rules from the datasets, and in the process, understand more about how these tools manipulate data.

[22]

WEKA: The Waikato Environment for Knowledge Analysis

S.R. Garner. Weka: The waikato environment for knowledge analysis. In Proc New Zealand Computer Science Research Students Conference, pages 57-64, University of Waikato, Hamilton, New Zealand, 1995.
[ bib | .ps.gz | .pdf ]
WEKA is a workbench designed to aid in the application of machine learning technology to real world data sets, in particular, data sets from New Zealand's agricultural sector. In order to do this a range of machine learning techniques are presented to the user in such a way as to hide the idiosyncrasies of input and output formats, as well as allow an exploratory approach in applying the technology. The system presented is a component based one that also has application in machine learning research and education.

[23]

Applying a Machine Learning Workbench: Experience with Agricultural Databases

S.R. Garner, S.J. Cunningham, G. Holmes, C.G. Nevill-Manning, and Witten I.H. Applying a machine learning workbench: Experience with agricultural databases. In Proc Machine Learning in Practice Workshop, Machine Learning Conference, pages 14-21, Tahoe City, CA, USA, 1995.
[ bib | .ps.gz | .pdf ]
This paper reviews our experience with the application of machine learning techniques to agricultural databases. We have designed and implemented a machine learning workbench, WEKA, which permits rapid experimentation on a given dataset using a variety of machine learning schemes, and has several facilities for interactive investigation of the data: preprocessing attributes, evaluating and comparing the results of different schemes, and designing comparative experiments to be run off-line. We discuss the partnership between agricultural scientist and machine learning researcher that our experience has shown to be vital to success. We review in some detail a particular agricultural application concerned with the culling of dairy herds.

[24]

Machine learning from agricultural databases: practice and experience

S.R. Garner, G. Holmes, R.J. McQueen, and I.H. Witten. Machine learning from agricultural databases: practice and experience. New Zealand J Computing, 6(1a):69-73, August 1996.
[ bib | .ps.gz | .pdf ]
Per capita, New Zealand is reputed to be one of the world's leading collectors of information in databases. The country, quite rightly, places a substantial investment in stored knowledge. As databases grow, however, interest tends to shift from techniques for retrieving individual items to large-scale analysis of the data as a whole. Data can be analysed to view trends, pick out anomalies, discover relationships, check that policy is turned into practice, and so on. With large databases, such analyse scan be costly.

At Waikato University we are examining these issues in the context of the agricultural sector of the economy. Our project centers around techniques of data mining or machine learning, which automatically analyse large bodies of data to discover hidden dependencies. Other methods of exploratory data analysis-many of them interactive-are being investigated as an adjunct to the machine learning algorithms. This paper reviews the software that has been developed to support the application of machine learning techniques, and comments on our success with particular agricultural databases. We discuss those aspects of the databases that hinder the analysis, to assist with future data collection activities.

[25]

Selection of attributes for modeling Bach chorales by a genetic algorithm

M.A. Hall. Selection of attributes for modeling bach chorales by a genetic algorithm. In Proc Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems, pages 182-185, Dunedin, NZ, 1995. IEEE Computer Society Press.
[ bib | .ps | .pdf ]
A genetic algorithm selected combinations of attributes for a machine learning system. The algorithm used 90 Bach chorale melodies to train models and randomly selected sets of 10 chorales for evaluation. Compression of pitch was used as the fitness evaluation criterion. The best models were used to compress a different test set of chorales and their performance compared to human generated models. G.A. models outperformed the human models, improving compression by 10 percent.

[26]

Feature selection via the discovery of simple classification rules

G. Holmes and C.G. Nevill-Manning. Feature selection via the discovery of simple classification rules. In Proc Symposium on Intelligent Data Analysis (IDA-95), pages 75-79, Baden-Baden, Germany, 1995.
[ bib | .ps.gz | .pdf ]
It has been our experience that in order to obtain useful results using supervised learning of real-world datasets it is necessary to perform feature subset selection and to perform many experiments using computed aggregates from the most relevant features. It is, therefore, important to look for selection algorithms that work quickly and accurately so that these experiments can be performed in a reasonable length of time, preferably interactively. This paper suggests a method to achieve this using a very simple algorithm that gives good performance across different supervised learning schemes and when compared to one of the most common methods for feature subset selection.

[27]

Document zone classification using machine learning

S. Inglis and I.H. Witten. Document zone classification using machine learning. In Proc Digital Image Computing: Techniques and Applications, pages 631-636, Brisbane, Australia, 1995.
[ bib | .pdf ]
When processing document images, an important step is classifying the zones they contain into meaningful categories such as text, halftone pictures, line drawings, and mathematical formulae. A character recognition system, for example, may confine its attention to zones that are classified as text, while in an image compressor may employ specialized techniques and models for zones such as halftone pictures. The penalty for incorrect classification may range from incorrect interpretation to reduced efficiency. In any case, the higher the classification accuracy, the better the results.

This classification problem is not new. For example, Wang and Srihari [1] described a method for zone classification that gave 100% accuracy on the images they used for testing. But these images contained just 41 zones-and one of the categories occurred only once. Previous approaches to the problem generally describe a new set of features and assign classes using some linear weighted formula or nonlinear heuristic. In contrast, our work uses predefined features and investigates the application of standard machine learning methods, using a large publicly-available document database as a source of training and test data.

[28]

Applications of machine learning in information retrieval

J.N. Littin. Applications of machine learning in information retrieval. Technical Report 0657.555, University of Waikato, Computer Science Department, Hamilton, New Zealand, 1995. Directed Research Project.
[ bib | .pdf ]
[29]

Instance-Based Learning: Nearest Neighbour with Generalisation

B.I. Martin. Instance-based learning: Nearest neighbour with generalisation. Master's thesis, University of Waikato, Computer Science Department, Hamilton, New Zealand, 1995.
[ bib | .ps.gz | .pdf ]
[30]

Interactive concept learning for end-user applications

D. Maulsby and I.H. Witten. Interactive concept learning for end-user applications. Technical Report 95/4, University of Waikato, Computer Science Department, Hamilton, New Zealand, 1995.
[ bib | .pdf ]
Personalizable software agents will learn new tasks from their users. This implies being able to learn from instructions users might give: examples, yes/no responses, and ambiguous, incomplete hints. Agents should also exploit background knowledge customized for applications such as drawing, word processing and form-filling. The task models that agents learn describe data, actions and their context. Learning about data from examples and hints is the subject of this paper.

The Cima learning system combines evidence from examples, task knowledge and user hints to form Disjunctive Normal Form (DNF) rules for classifying, generating or modifying data. Cima's dynamic bias manager generates candidate features (attribute values, functions or relations), from which its DNF learning algorithm selects relevant features and forms the rules. The algorithm is based on a classic greedy method, with two enhancements. First, the standard learning criterion, correct classification, is augmented with a set of utility and instructional criteria. Utility criteria ensures that descriptions are properly formed for use in actions, whether to classify, search for, generate or modify data. Instructional criteria ensure that descriptions include features that users suggest and avoid those that users reject. The second enhancement is to augment the usual statistical metric for selecting relevant attributes with a set of heuristics, including beliefs based on user suggestions and application-specific background knowledge. Using multiple heuristics increases the justification for selecting features; more important, it helps the learner choose among alternative interpretations of hints.

When tested on dialogues observed in a prior user study on a simulated interface agent, the learning algorithm achieves 95% of the learning efficiency standard established in that study.

[31]

Learning to describe data in actions

D. Maulsby and I.H. Witten. Learning to describe data in actions. In Proc PBD Workshop, Machine Learning Conference, pages 65-73, Tahoe City, CA, July 1995.
[ bib | .pdf ]
Traditional machine learning algorithms have failed to serve the needs of systems for Programming by Demonstration (PBD), which require interaction with a user (a teacher) and a task environment. We argue that traditional learning algorithms fail for two reasons: they do not cope with the ambiguous instructions that users provide in addition to examples; and their learning criterion requires only that concepts classify examples to some degree of accuracy, ignoring the other ways in which an active agent might use concepts. We show how a classic concept learning algorithm can be adapted for use in PBD by replacing the learning criterion with a set of instructional and utility criteria, and by replacing a statistical preference bias with a set of heuristics that exploit user hints and background knowledge to focus attention.

[32]

Applying Machine Learning to Agricultural Data

R.J. McQueen, S.R. Garner, C.G. Nevill-Manning, and I.H. Witten. Applying machine learning to agricultural data. Computing and Electronics in Agriculture, 12:275-293, 1995.
[ bib | .ps.gz | .pdf ]
Many techniques have been developed for abstracting, or learning, rules and relationships from diverse data sets, in the hope that machines can help in the often tedious and error-prone process of acquiring knowledge from empirical data. While these techniques are plausible, theoretically well-founded, and perform well on more or less artificial test data sets, they stand or fall on their ability to make sense of real-world data. This paper describes a project that is applying a range of learning strategies to problems in primary industry, in particular agriculture and horticulture. We briefly survey some of the more readily applicable techniques that are emerging from the machine learning research community, describe a software workbench that allows users to experiment with a variety of techniques on real-world data sets, and detail the problems encountered and solutions developed in a case study of dairy herd management in which culling rules were inferred from a medium-sized database of herd information.

[33]

Application of Machine Learning Techniques to Time-Series Data

R.S. Mitchell. Application of machine learning techniques to time-series data. Technical Report 95/15, University of Waikato, Computer Science Department, Hamilton, New Zealand, April 1995.
[ bib | .ps.gz | .pdf ]
This report documents new methods for discovering knowledge in real world time-series data. Two complementary approaches were investigated: 1) manipulation of the original dataset into a form that is usable by conventional similarity-based learners; and 2) using sequence identification techniques to learn the concepts embedded in the database. Experimental results obtained from applying both techniques to a large agricultural database are presented and analysed.

[34]

Detecting sequential structure

C.G. Nevill-Manning and I.H. Witten. Detecting sequential structure. In Proc Programming by Demonstration Workshop, Machine Learning Conference, pages 49-56, Tahoe City, CA, USA, 1995.
[ bib | .ps.gz | .pdf ]
Programming by demonstration requires detection and analysis of sequential patterns in a user's input, and the synthesis of an appropriate structural model that can be used for prediction. This paper describes SEQUITUR, a scheme for inducing a structural description of a sequence from a single example. SEQUITUR integrates several different inference techniques: identification of lexical subsequences or vocabulary elements, hierarchical structuring of such subsequences, identification of elements that have equivalent usage patterns, inference of programming constructs such as looping and branching, generalisation by unifying grammar rules, and the detection of procedural substructure. Although SEQUITUR operates with abstract sequences, a number of concrete illustrations are provided.

[35]

Learning from experience

C.G. Nevill-Manning. Learning from experience. In Proc New Zealand Computer Science Research Students Conference, pages 193-201, University of Waikato, Hamilton, New Zealand, 1995.
[ bib | .ps | .pdf ]
Life is a sequence of events; some of them predictable, some unexpected. We instinctively look for structure in the stream of events we observe, and learning to anticipate these events is an important part of functioning successfully in the world. Machine learning emphasises learning from unordered facts, but for machines to be helpful in our day-to-day activities, they also need to be able to recognise patterns in the sequences they observe.

The sequence of events that constitute our lives is often monotonous; we end up repeating certain sequences of actions again and again. We have built machines to relieve some of this monotony, but they often fail to be flexible enough to free us of all but the most common repetitive tasks.

SEQUITUR is a system which infers structure from a sequence, and uses the structure to explain and extrapolate the sequence. It is capable of recognising structure in a variety of real-world sequences, but is of particular utility in automating repetitive computing tasks.

[36]

The Development of Holte's 1R Classifier

C.G. Nevill-Manning, G. Holmes, and I.H. Witten. The development of holte's 1r classifier. In Proc Artificial Neural Networks and Expert Systems, pages 239-242, Dunedin, NZ, 1995.
[ bib | .ps.gz | .pdf ]
The 1R procedure for machine learning is a very simple one that proves surprisingly effective on the standard datasets commonly used for evaluation. This paper describes the method and discusses two areas that can be improved: the way that intervals are formed when discretizing continuously-valued attributes, and the way that missing values are treated. Then we show how the algorithm can be extended to avoid a problem endemic to most practical machine learning algorithms-their frequent dismissal of an attribute as irrelevant when in fact it is highly relevant when combined with other attributes.

[37]

A genetic algorithm for the induction of natural language grammars

T.C. Smith and I.H. Witten. A genetic algorithm for the induction of natural language grammars. In Proc IJCAI-95 Workshop on New Approaches to Learning for Natural Language Processing, pages 17-24, Montreal, Canada, 1995.
[ bib | .ps.gz | .pdf ]
Strict pattern-based methods of grammar induction are often frustrated by the apparently inexhaustible variety of novel word combinations in large corpora. Statistical methods offer a possible solution by allowing frequent well-formed expressions to overwhelm the infrequent ungrammatical ones. They also have the desirable property of being able to construct robust grammars from positive instances alone. Unfortunately, the zero-frequency problem entails assigning a small probability to all possible word patterns, thus ungrammatical n-grams become as probable as unseen grammatical ones. Further, such grammars are unable to take advantage of inherent lexical properties that should allow infrequent words to inherit the syntactic properties of the class to which they belong.

This paper describes a genetic algorithm (GA) that adapts a population of hypothesis grammars towards a more effective model of language structure. The GA is statistically sensitive in that the utility of frequent patterns is reinforced by the persistence of efficient substructures. It also supports the view of language learning as a bootstrapping problem, a learning domain where it appears necessary to simultaneously discover a set of categories and a set of rules defined over them. Results from a number of tests indicate that the GA is a robust, fault-tolerant method for inferring grammars from positive examples of natural language.

[38]

Probability-driven lexical classification: a corpus-based approach

T.C. Smith and I.H. Witten. Probability-driven lexical classification: a corpus-based approach. In Proc Pacific Association for Computational Linguistics Conference, pages 271-283, Brisbane, Australia, 1995.
[ bib | .ps.gz | .pdf ]
Successful grammatical inference from a corpus of linguistic material rests largely on the ability to tag the words of the corpus with appropriate lexical categories. Static tagging methods, such as dictionary lookup, often misclassify words associated with multiple categories, and adaptive statistical taggers or context-based corrective taggers may still have errorrates of 3 or 4 percent. Even a small proportion of lexical misclassification may lead a syntax induction mechanism to produce an extraordinarily large number of special-case rules.

By treating grammar induction as a bootstrapping problem in which it is necessary to simultaneously discover a set of categories and a set of rules defined over them, lexical tagging is relieved of constraints imposed by traditional notions of syntactic categories. Categories are created based on salient features in the corpus under study.

This paper describes a statistical corpus-based approach to lexical inference that uses linguistic notions of syntactic function to guide its analysis of a given text. Unlike other probability-driven inference techniques, it is simple enough to apply to the complete vocabularies of very large texts. It has been successfully incorporated into a syntactic inference system whose resulting grammars have been shown superior to two standard stochastic CFGs.

[39]

Subset selection using rough numeric dependency

T.C. Smith and G. Holmes. Subset selection using rough numeric dependency. In Proc Symposium on Intelligent Data Analysis (IDA-95), pages 191-195, Baden-Baden, Germany, 1995.
[ bib | .ps.gz | .pdf ]
A large percentage of computer systems are presently dedicated to the collection of an incomprehensibly vast amount of information. The daily world-wide accrual of data has long since exceeded the ability for human analysis, and machine processing has been largely limited to collation, summation, sorting and a variety of basic statistical analyses.

Artificial intelligence research in the form of expert systems has attempted to relieve much of the burden of analysis from the shoulders of human beings by developing automated reasoning systems that can respond quickly to the flood of incoming data. Computer users have countered with the collection of an even broader range of information for which expert systems have either not yet been built or whose domains have proven as yet too difficult to formalise.

[40]

PBD systems: when will they ever learn?

I.H. Witten. Pbd systems: when will they ever learn? In Proc Programming by Demonstration Workshop, Machine Learning Conference, pages 1-9, Tahoe City, CA, USA, 1995.
[ bib | .ps.gz | .pdf ]
We trace the tragicomic courtship between programming by demonstration and machine learning, two fields that seem made for each other but have never quite got together. A long-term historical perspective illuminates the twin roles of interaction and sequence learning in programming by demonstration, and reveals the parallel growth of machine learning as a research area in its own right. A view of the present shows a few-but only a few-intimations that these two fields are beginning to develop a meaningful relationship. The long-term prognosis is not so much wedded bliss as assimilation-ultimately PBD and ML shall, in the words of Genesis, cleave unto each other: and they shall be one flesh.

[41]

Intelligent data analysis using the WEKA workbench

I.H. Witten, S.J. Cunningham, and G. Holmes. Intelligent data analysis using the weka workbench. In Conference on Artificial Neural Networks and Expert Systems, Dunedin, NZ, 1995. Tutorial Notes.
[ bib | .ps | .pdf ]
[42]

Active Learning of Soft Rules for System Modelling

Eibe Frank and Klaus-Peter Huber. Active learning of soft rules for system modelling. In Proc 2nd European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany, pages 1430-1434, Aachen, 1996. Verlag Mainz.
[ bib | .ps.gz | .pdf ]
[43]

A Computer Model of Blues Music and its Evaluation

Mark Andrew Hall and Lloyd Smith. A computer model of blues music and its evaluation. Journal of the Acoustical Society of America, 100(2):1163-1167, 1996.
[ bib ]
[44]

An MDL estimate of the significance of rules

J.G. Cleary, S. Legg, and I.H. Witten. An mdl estimate of the significance of rules. In Proc ISIS: Information, Statistics, and Induction in Science, pages 43-53, Melbourne, Australia, August 1996.
[ bib | .ps | .pdf ]
This paper proposes a new method for measuring the performance of models-whether decision trees or sets of rules-inferred by machine learning methods. Inspired by the minimum description length (MDL) philosophy and theoretically rooted in information theory, the new method measures the complexity of test data with respect to the model. It has been evaluated on rule sets produced by several different machine learning schemes on a large number of standard data sets. When compared with the usual percentage correct measure, it is shown to agree with it in restricted cases. However, in other more general cases taken from real data sets-for example, when rule sets make multiple or no predictions-it disagrees substantially. It is argued that the MDL measure is more reasonable in these cases and represents a better way of assessing the significance of a rule set's performance. The question of the complexity of the rule set itself is not addressed in the paper.

[45]

MetaData for database mining

Cleary J.G., G. Holmes, S.J. Cunningham, and I.H. Witten. Metadata for database mining. In Proc IEEE Metadata Conference, Silver Spring, MD, April 1996.
[ bib | .ps | .pdf ]
At present, a machine learning application is accomplished by carefully crafting asingle table from an often complex, multi-table database. The metadata necessary to create this table is rarely formally recorded, and is sometimes implicit in the structure of the database or the typing of the attributes. We categorize the types of metadata that we have encountered in our work with machine learning applications in agriculture, and describe a first generation tool that we have built to aid in the recording and use of metadata in database mining.

[46]

Dataset cataloguing metadata for machine learning applications and research

S.J. Cunningham. Dataset cataloguing metadata for machine learning applications and research. Technical Report 96/26, University of Waikato, Computer Science Department, Hamilton, New Zealand, October 1996.
[ bib ]
As the field of machine learning (ML) matures, two types of data archives are developing: collections of benchmark data sets used to test the performance of new algorithms, and data stores to which machine learning/data mining algorithms are applied to create scientific or commercial applications. At present, the catalogs of these archives are ad hoc and not tailored to machine learning analysis. This paper considers the cataloging metadata required to support these two types of repositories, and discusses the organizational support necessary for archive catalog maintenance.

[47]

Understanding what Machine Learning produces Part I: Representations and their comprehensibility

S.J. Cunningham, M.C. Humphrey, and I.H. Witten. Understanding what machine learning produces part i: Representations and their comprehensibility. Technical Report 96/21, University of Waikato, Computer Science Department, Hamilton, New Zealand, October 1996.
[ bib | .ps | .pdf ]
The aim of many machine learning users is to comprehend the structures that are inferred from a dataset, and such users may be far more interested in understanding the structure of their data than in predicting the outcome of new test data. Part I of this paper surveys representations based on decision trees, production rules and decision graphs, that have been developed and used for machine learning. These representations have differing degrees of expressive power, and particular attention is paid to their comprehensibility for non-specialist users. The graphic form in which a structure is portrayed also has a strong effect on comprehensibility, and Part II of this paper develops knowledge visualization techniques that are particularly appropriate to help answer the questions that machine learning users typically ask about the structures produced.

[48]

Selecting multiway splits in decision trees

E. Frank and I.H. Witten. Selecting multiway splits in decision trees. Technical Report 96/31, University of Waikato, Department of Computer Science, Hamilton, New Zealand, December 1996.
[ bib | .ps | .pdf ]
Decision trees in which numeric attributes are split several ways are more comprehensible than the usual binary trees because attributes rarely appear more than once in any path from root to leaf. There are efficient algorithms for finding the optimal multiway split for a numeric attribute, given the number of intervals in which it is to be divided. The problem we tackle is how to choose this number in order to obtain small, accurate trees.

We view each multiway decision as a model and a decision tree as a recursive structure of such models. Standard methods of choosing between competing models include resampling techniques (such as cross-validation, holdout, or bootstrap) for estimating the classification error; and minimum description length techniques. However, the recursive situation differs from the usual one, and may call for new model selection methods.

This paper introduces a new criterion for model selection: a resampling estimate of the information gain. Empirical results are presented for building multiway decision trees using this new criterion, and compared with criteria adopted by previous authors. The new method generates multiway trees that are both smaller and more accurate than those produced previously, and their performance is comparable with standard binary decision trees.

[49]

Experimental comparison of optimisation techniques: computer optimisation of dairy farm management

R. Hart. Experimental comparison of optimisation techniques: computer optimisation of dairy farm management. Master's thesis, University of Waikato, Department of Computer Science, Hamilton, New Zealand, 1996.
[ bib ]
[50]

Understanding what Machine Learning produces Part II: Knowledge visualization techniques

M.C. Humphrey, S.J. Cunningham, and I.H. Witten. Understanding what machine learning produces part ii: Knowledge visualization techniques. Technical Report 96/22, University of Waikato, Computer Science Department, Hamilton, New Zealand, October 1996.
[ bib | .ps | .pdf ]
Researchers in machine learning use decision trees, production rules, and decision graphs for visualizing classification data. Part I of this paper surveyed these representations, paying particular attention to their comprehensibility for non-specialist users. Part II turns attention to knowledge visualization the graphic form in which a structure is portrayed and its strong influence on comprehensibility. We analyze the questions that, in our experience, end users of machine learning tend to ask of the structures inferred from their empirical data. By mapping these questions onto visualization tasks, we have created new graphical representations that show the flow of examples through a decision structure. These knowledge visualization techniques are particularly appropriate in helping to answer the questions that users typically ask, and we describe their use in discovering new properties of a data set. In the case of decision trees, an automated software tool has been developed to construct the visualizations.

[51]

Learning relational ripple-down rules

J.N. Littin. Learning relational ripple-down rules. Master's thesis, University of Waikato, Computer Science Department, Hamilton, New Zealand, 1996.
[ bib | .ps | .pdf ]
[52]

An investigation into the use of machine learning for the determining of oestrus in cows

R.S. Mitchell, R.A. Sherlock, and L.A. Smith. An investigation into the use of machine learning for the determining of oestrus in cows. Computing and Electronics in Agriculture, 15:195-215, 1996.
[ bib | .ps | .pdf ]
A preliminary investigation of the application of two well-known machine learning schemes-C4.5 and FOIL-to detection of oestrus in dairy cows has been made. This is a problem of practical economic significance as each missed opportunity for artificial insemination results in 21 days lost milk production. Classifications were made on normalised diviations of milk volume production and milking order time series data. The best learning scheme was C4.5 which was able to detect 69% of oestrus events, albeit with an unacceptably high rate of false positives (74%). Several directions for further work and improvements are identified.

[53]

Machine learning in programming by demonstration: lessons learned from CIMA

D. Maulsby and I.H. Witten. Machine learning in programming by demonstration: lessons learned from cima. In Y. Gil, editor, Acquisition, learning and demonstration: Automating tasks for users (Proc AAAI Symposium, Stanford), pages 66-72, Menlo Park, CA, 1996. AAAI Press.
[ bib | .ps | .pdf ]
Programming-by-demonstration (PBD) systems learn tasks by watching the user perform them. CIMA is an interactive learning system for modeling the data selected and modified by a user as he or she undertakes a task. Part of a PBD system, CIMA is invoked when user actions are matched to find a common description of their operands. Although the system's interfaces to users and applications are still too rough-hewn to permit field trials, its performance on recorded dialogs between users and a simulated agent meets the design goals established prior to its implementation.

The contributions of this work lie in three areas:

(a) a design methodology for PBD systems;
(b) a framework for user actions in PBD;
(c) novel methods of interaction with the user.

These are discussed in separate sections below. First, however, it is necessary to convey the flavor of what it is like to use the CIMA system, and this is done in the following section.

[54]

Detecting sequential structure

C.G. Nevill-Manning. Detecting sequential structure. PhD thesis, University of Waikato, Department of Computer Science, Hamilton, New Zealand, 1996.
[ bib | .ps | .pdf ]
[55]

Compressing semi-structured text using hierarchical phrase identification

C.G. Nevill-Manning, I.H. Witten, and D.R. Olsen. Compressing semi-structured text using hierarchical phrase identification. In J.A. Storer and M. Cohn, editors, Proc Data Compression Conference, pages 63-72, Los Alamitos, CA, 1996. IEEE Press.
[ bib | .ps | .pdf ]
Many computer files contain highly-structured, predictable information interspersed with information which has less regularity and is therefore less predictable-such as free text. Examples range from word-processing source files, which contain precisely-expressed formatting specifications enclosing tracts of natural-language text, to files containing a sequence of filled-out forms which have a predefined skeleton clothed with relatively unpredictable entries. These represent extreme ends of a spectrum. Word-processing files are dominated by free text, and respond well to general-purpose compression techniques. Forms generally contain database-style information, and are most appropriately compressed by taking into account their special structure. But one frequently encounters intermediate cases. For example, in many email messages the formal header and the informal free-text content are equally voluminous. Short SGML files often contain comparable amounts of formal structure and informal text. Although such files may be compressed quite well by general-purpose adaptive text compression algorithms, which will soon pick up the regular structure during the course of normal adaptation, better compression can often be obtained by methods that are equipped to deal with both formal and informal structure.

[56]

Automatic oestrus detection from milking data - a preliminary investigation

R.A. Sherlock, L.A. Smith, and R.S. Mitchell. Automatic oestrus detection from milking data - a preliminary investigation. In Proc NZ Society for Animal Production, volume 65, pages 228-229, Hamilton, New Zealand, February 1996.
[ bib | .ps | .pdf ]
Efficient oestrus detection is important economically, particularly with seasonal-calving herds employing artificial insemination, as every missed oestrus effectively costs 21 days milk production for that cow in that season which is about 8% of the total. Traditionally, oestrus detection relies mainly on visual observation of animal behaviour - a cow in oestrus will stand and allow herself to be mounted by herdmates. Such events are readily noted when they take place amongst assembled cows, such as at milking times, but may be missed entirely in herds of free-grazing animals which are only brought in for milking twice a day, as is usual in New Zealand style systems. A common low-cost aid to oestrus detection is the use of tail-paint (Macmillan and Curnow, 1977), and although this can be very effective when properly executed the regular visual inspection and repainting is labour-intensive.

[57]

Learning language using genetic algorithms

T.C. Smith and I.H. Witten. Learning language using genetic algorithms, volume 1040 of LNCS: Connectionist, statistical and symbolic approaches to learning for natural language processing, pages 132-145. Springer-Verlag, New York, 1996.
[ bib | .ps | .pdf ]
Strict pattern-based methods of grammar induction are often frustrated by the apparently inexhaustible variety of novel word combinations in large corpora. Statistical methods offer a possible solution by allowing frequent well-formed expressions to overwhelm the infrequent ungrammatical ones. They also have the desirable property of being able to construct robust grammars from positive instances alone. Unfortunately, the zero-frequency problem entails assigning a small probability to all possible word patterns, thus ungrammatical n-grams become as probable as unseen grammatical ones. Further, such grammars are unable to take advantage of inherent lexical properties that should allow infrequent words to inherit the syntactic properties of the class to which they belong.

This paper describes a genetic algorithm (GA) that adapts a population of hypothesis grammars towards a more effective model of language structure. The GA is statistically sensitive in that the utility of frequent patterns is reinforced by the persistence of efficient substructures. It also supports the view of language learning as a bootstrapping problem, a learning domain where it appears necessary to simultaneously discover a set of categories and a set of rules defined over them. Results from a number of tests indicate that the GA is a robust, fault-tolerant method for inferring grammars from positive examples of natural language.

[58]

The entropy of English using PPM-based models

W.J. Teahan and J.G. Cleary. The entropy of english using ppm-based models. In J.A. Storer and M. Cohn, editors, Proc Data Compression Conference, pages 53-62, Los Alamitos, CA, 1996. EEE Press.
[ bib | .ps | .pdf ]
Over 45 years ago Claude E. Shannon estimated the entropy of English to be about 1 bit per character [16]. He did this by having human subjects guess samples of text, letter by letter. From the number of guesses made by each subject he estimated upper and lower bounds of 1.3 and 0.6 bits per character (bpc) for the entropy of English. Shannon's methodology was not improved upon until 1978 when Cover and King [6] used a gambling approach to estimate the upper bound to be 1.25 bpc from the same text. In the cryptographic community n-gram analysis suggests 1.5 bpc as the asymptotic limit for 26-letter English (Tilbourg [19]).

[59]

Machine Learning applied to fourteen agricultural datasets

K. Thomson and R.J. McQueen. Machine learning applied to fourteen agricultural datasets. Technical Report 96/18, University of Waikato, Computer Science Department, Hamilton, New Zealand, September 1996.
[ bib | .ps | .pdf ]
This document reports on an investigation conducted between November, 1995 and March, 1996 into the use of machine learning on 14 sets of data supplied by agricultural researchers in New Zealand. Our purpose here is to collect together short reports on trials with these datasets using the WEKA machine learning workbench, so that some understanding of the applicability and potential application of machine learning to similar datasets may result.

[60]

Decision combination based on the characterization of predictive accuracy

K.M. Ting. Decision combination based on the characterization of predictive accuracy. In R.S. Michalski and J. Wnek, editors, Proc 3rd International Workshop on Multistrategy Learning, pages 191-202, Fairfax, VA, 1996.
[ bib ]
[61]

The characterization of predictive accuracy and decision combination

K.M. Ting. The characterization of predictive accuracy and decision combination. In Proc International Conference on Machine Learning, pages 498-506, Bari, Italy, 1996.
[ bib | .ps | .pdf ]
In this paper, we first explore an intrinsic problem that exists in the theories induced by learning algorithms. Regardless of the selected algorithm, search methodology and hypothesis representation by which the theory is induced, one would expect the theory to make better predictions in some regions of the description space than others. We term the fact that an induced theory will have some regions of relatively poor performance the problem of locally low predictive accuracy. Having characterised the problem of locally low predictive accuracy in Instance-Based and Naive Bayesian classifiers, we propose to counter this problem using a composite learner that incorporates both classifiers. The strategy is to select an estimated better performing classifier to do the final prediction during classification. Empirical results show that the strategy is capable of partially overcoming the problem and at the same time improving the overall performance of its constituent classifiers. We provide explanations of why the proposed composite learner performs better than the cross-validation method and the better of its constituent classifiers.

[62]

Theory combination: an alternative to data combination

K.M. Ting and B.T. Low. Theory combination: an alternative to data combination. Technical Report 96/19, University of Waikato, Department of Computer Science, Hamilton, New Zealand, 1996.
[ bib | .ps | .pdf ]
The approach of combining theories learned from multiple batches of data provide an alternative to the common practice of learning one theory from all the available data (i.e., the data combination approach). This paper empirically examines the base-line behaviour of the theory combination approach in classification tasks. We find that theory combination can lead to better performance even if the disjoint batches of data are drawn randomly from a larger sample, and relate the relative performance of the two approaches to the learning curve of the classifier used.

The practical implication of our results is that one should consider using theory combination rather tan data combination, especially when multiple batches of data for the same task are readily available.

Another interesting result is that we empirically show that the near-asymptotic performance of a single theory, in some classification task, can be significantly improved by combining multiple theories (of the same algorithm) if the constituent theories are substantially different and there is some regularity in the theories to be exploited by the combination method used. Comparisons with known theoretical results are also provided.

[63]

Interacting with learning agents: implications for ML from HCI

I.H. Witten, C.G. Nevill-Manning, and D.L. Maulsby. Interacting with learning agents: implications for ml from hci. In Workshop on Machine Learning meets Human-Computer Interaction, ML'96, pages 51-58, Bari, Italy, 1996.
[ bib | .pdf ]
Computers excel at repetitive tasks. But automating them usually involves programming, which is beyond the reach of most non-specialist users. One solution is for machines to learn procedures by observing users at work-and if this enhanced users' productivity and sense of achievement, they might even be persuaded to help the system by supplying some additional information. In principle, combining machine learning with instructional interaction should increase the speed with which tasks are acquired, and enhance reliability too.

[64]

Applications of Machine Learning on two agricultural datasets

S. Yeates and K. Thomson. Applications of machine learning on two agricultural datasets. In Proc New Zealand Conference of Postgraduate Students in Engineering and Technology, pages 495-496, Christchurch, New Zealand, 1996.
[ bib | .ps | .pdf ]
The induction of decision trees from tabulated data is a field of machine learning which has been demonstrated successfully in several practical applications. This paper looks at the application of this technology to two datasets in the agricultural domain, and show why it was not possible to achieve the success obtained in other domains.

[65]

Tutorial-Weka: the Waikato environment of knowledge analysis

Waikato ML group. Tutorial-weka: the waikato environment of knowledge analysis, November 1996.
[ bib ]
[66]

Dataset cataloging metadata for machine learning applications and research

S.J. Cunningham. Dataset cataloging metadata for machine learning applications and research. In Proc International Artificial Intelligence and Statistics Workshop, Ft. Lauderdale, Florida, 1997.
[ bib | .ps | .pdf ]
As the field of machine learning (ML) matures, two types of data archives are developing: collections of benchmark data sets used to test the performance of new algorithms, and data stores to which machine learning/data mining algorithms are applied to create scientific or commercial applications. At present, the catalogs of these archives are ad hoc and not tailored to machine learning analysis. This paper considers the cataloging metadata required to support these two types of repositories, and discusses the organizational support necessary for archive catalog maintenance.

[67]

Machine learning applications in anthropology: automated discovery over kinship structures

S.J. Cunningham. Machine learning applications in anthropology: automated discovery over kinship structures. Computers and the Humanities, 30:401-406, 1997.
[ bib | .ps | .pdf ]
A common problem in anthropological field work is generalizing rules governing social interactions and relations (particularly kinship) from a series of examples. One class of machine learning algorithms is particularly well-suited to this task: inductive logic programming systems, as exemplified by FOIL. A knowledge base of relationships among individuals is established, in the form of a series of single-predicate facts. Given a set of positive and negative examples of a new relationship, the machine learning programs build a Horn clause description of the target relationship. The power of these algorithms to derive complex hypotheses is demonstrated for a set of kinship relationships drawn from the anthropological literature. FOIL extends the capabilities of earlier anthropology-specific learning programs by providing a more powerful representation for induced relationships, and is better able to learn in the face of noisy or incomplete data.

[68]

Brain-Like Computing and Intelligent Information Systems

S.J. Cunningham, G. Holmes, J. Littin, R. Beale, and I.H. Witten. Brain-Like Computing and Intelligent Information Systems, chapter Applying connectionist models to information retrieval, pages 435-457. Springer-Verlag, 1997.
[ bib | .ps | .pdf ]
Adaptive information retrieval (IR) systems based on connectionist architectures have captured the attention of researchers over the past decade. This paper provides a review of connectionist IR research, including the major models for connectionist document and query representation, techniques to enhance query re-formulation, dynamic document routing (information filtering), and connectionist techniques for document clustering.

[69]

Applications of machine learning in information retrieval

S.J. Cunningham, J.N. Littin, and I.H. Witten. Applications of machine learning in information retrieval. Technical Report 97/6, University of Waikato, Department of Computer Science, Hamilton, New Zealand, February 1997.
[ bib | .ps | .pdf ]
Information retrieval systems provide access to collections of thousands, or millions, of documents, from which, by providing an appropriate description, users can recover any one. Typically, users iteratively refine the descriptions they provide to satisfy their needs, and retrieval systems can utilize user feedback on selected documents to indicate the accuracy of the description at any stage. The style of description required from the user, and the way it is employed to search the document database, are consequences of the indexing method used for the collection. The index may take different forms, from storing keywords with links to individual document, to clustering documents under related topics.

Much of the work in information retrieval can be automated. Processes such as document indexing and query refinement are usually accomplished by computer, while document classification and index term selection are more often performed manually. However, manual development and maintenance of document databases is time-consuming, tedious, and error-prone. Algorithms that mine documents for indexing information, and model user interests to help them formulate queries, reduce the workload and can ensure more consistent behavior. Such algorithms are based in machine learning, a dynamic, burgeoning area of computer science which is finding application in domains ranging from expert systems, where learning algorithms supplement-or even supplant-domain experts for generating rules and explanations (Langley and Simon, 1995), to intelligent agents, which learn to play particular, highly-specialized, support roles for individual people and are seen by some to herald a new renaissance of artificial intelligence in information technology (Hendler, 1997).

[70]

Feature subset selection: a correlation based filter approach

M.A. Hall and L.A. Smith. Feature subset selection: a correlation based filter approach. In N. Kasabov and et al., editors, Proc Fourth International Conference on Neural Information Processing and Intelligent Information Systems, pages 855-858, Dunedin, New Zealand, 1997.
[ bib | .ps | .pdf ]
Recent work has shown that feature subset selection can have a positive affect on the performance of machine learning algorithms. Some algorithms can be slowed or their performance adversely affected by too much data, some of which may be irrelevant or redundant to the learning task. Feature subset selection, then, is a method for enhancing the performance of learning algorithms, reducing the storage requirement. This paper describes a feature subset selector that uses a correlation based heuristic to determine the goodness of feature subsets, and evaluates its effectiveness with three common ML algorithms: a decision tree inducer (C4.5), a naive Bayes classifier, and an instance based learner (IB1). Experiments using a number of standard data sets drawn from real and artificial domains are presented. Feature subset selection gave significant improvement for all three algorithms; C4.5 generated smaller decision trees.

[71]

Discovering inter-attribute relationships

G. Holmes. Discovering inter-attribute relationships. In N. Kasabov and et al., editors, Proc Fourth International Conference on Neural Information Processing and Intelligent Information Systems, pages 914-917, Dunedin, New Zealand, 1997.
[ bib | .ps | .pdf ]
It is important to discover relationships between attributes being used to predict a class attribute in supervised learning situations for two reasons. First, any such relationship will be potentially interesting to the provider of a dataset in its own right. Second, it would simplify a learning algorithm's search space, and the related irrelevant feature and subset selection problem, if the relationships were removed from datasets ahead of learning. An algorithm to discover such relationships is presented in this paper. The algorithm is described and a surprising number of inter-attribute relationships are discovered in datasets from the University of California at Irvine (UCI) repository.

[72]

Mining for causes of cancer: machine learning experiments at various levels of detail

S. Kramer, B. Pfahringer, and C. Helma. Mining for causes of cancer: machine learning experiments at various levels of detail. In Proc International Conference on Knowledge Discovery and Data Mining, pages 223-226, Menlo Park, 1997. AAAI Press.
[ bib | .ps | .pdf ]
This paper presents first results of an interdisciplinary project in scientific data mining. We analyze data about the carcinogenicity of chemicals derived from the carcinogenesis bioassay program performed by the US National Institute of Environmental Health Sciences. The database contains detailed descriptions of 6823 tests performed with more than 330 compounds and animals of different species, strains and sexes. The chemical structures are described at the atom and bond level, and in terms of various relevant structural properties. The goal of this paper is to investigate the effects that various levels of detail and amounts of information have on the resulting hypotheses, both quantitatively and qualitatively. We apply relational and propositional machine learning algorithms to learning problems formulated as regression or as classification tasks. In addition, these experiments have been conducted with two learning problems which are at different levels of detail. Quantitatively, our experiments indicate that additional information not necessarily improves accuracy. Qualitatively, a number of potential discoveries have been made by the algorithm for Relational Regression because it can utilize all the information contained in the relations of the database as well as in the numerical dependent variable.

[73]

CIMA: An interactive concept learning system for end-user applications

D. Maulsby and I.H. Witten. Cima: An interactive concept learning system for end-user applications. Applied Artificial Intelligence, 11(7-8):653-671, 1997.
[ bib | .ps | .pdf ]
Personalizable software agents will learn new tasks from their users. In many cases the most appropriate way for users to teach is to demonstrate examples. Learning complex concepts from examples alone is hard, but agents can exploit other forms of instruction that users might give, ranging from yes/no responses to ambiguous, incomplete hints. Agents can also exploit background knowledge customized for applications such as drawing, word processing and form-filling.

The Cima system learns generalized rules for classifying, generating, and modifying data, given examples, hints, and background knowledge. It copes with the ambiguity of user instructions by combining evidence from these sources. A dynamic bias manager generates candidate features (attribute values, functions or relation) from which the learning algorithm selects relevant ones and forms appropriate rules. When tested on dialogues observed in a prior user study on a simulated interface agent, the system achieved 95% of the learning efficiency observed in that study.

[74]

Learning agents: from user study to implementation

D. Maulsby and I.H. Witten. Learning agents: from user study to implementation. IEEE Computer, 30(11):36-44, 1997.
[ bib | .ps | .pdf ]
This paper describes the design, implementation and evaluation of an agent that learns to automate repetitive tasks within the human-computer interface. In contrast to most Artificial Intelligence projects, the design centers on a user study, with a human-simulated agent used to discover the interactions that people find natural. This study shows the ways in which users instinctively communicate via hints, or partially-specified, ambiguous, instructions. Hints may be communicated using speech, by pointing, or by selecting from menus. We develop a taxonomy of such instructions. The implementation demonstrates that computers can learn from examples and ambiguous hints.

[75]

Compression and explanation using hierarchical grammars

C.G. Nevill-Manning and I.H. Witten. Compression and explanation using hierarchical grammars. Computer Journal, 40(2/3):103-116, 1997.
[ bib | .ps.gz | .pdf ]
This paper describes an algorithm, called SEQUITUR, that identifies hierarchical structure in sequences of discrete symbols and uses that information for compression. On many practical sequences it performs well at both compression and structural inference, producing comprehensible descriptions of sequence structure in the form of grammar rules. The algorithm can be stated concisely in the form of two constraints on a context-free grammar. Inference is performed incrementally, the structure faithfully representing the input at all times. It can be implemented efficiently and operates in a time that is approximately linear in sequence length. Despite its simplicity and efficiency, SEQUITUR succeeds in inferring a range of interesting hierarchical structures from naturally occurring sequences.

[76]

Identifying hierarchical structure in sequence: a linear-time algorithm

C.G. Nevill-Manning and I.H. Witten. Identifying hierarchical structure in sequence: a linear-time algorithm. Artificial Intelligence Research, 7:67-82, 1997.
[ bib | .ps | .pdf ]
SEQUITUR is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing repeated phrases with a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence, which offers insights into its lexical structure. The algorithm is driven by two constraints that reduce the size of the grammar, and produce structure as a by-product. SEQUITUR breaks new ground by operating incrementally. Moreover, the method's simple structure permits a proof that it operates in space and time that is linear in the size of the input. Our implementation can process 50,000 symbols per second and has been applied to an extensive range of real world sequences.

[77]

Linear-time, incremental hierarchy inference for compression

C.G. Nevill-Manning and I.H. Witten. Linear-time, incremental hierarchy inference for compression. In J.A. Storer and M. Cohn, editors, Proc Data Compression Conference, pages 3-11, Los Alamitos, CA, 1997. IEEE Press.
[ bib | .ps | .pdf ]
Data compression and learning are, in some sense, two sides of the same coin. If we paraphrase Occam's razor by saying that a small theory is better than a larger theory with the same explanatory power, we can characterize data compression as a preoccupation with small, and learning as a preoccupation with better. Nevill-Manning et al. (1994) presented an algorithm, since dubbed SEQUITUR, that presents both faces of the compression/learning coin. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences as diverse as DNA and music are intuitively compelling.

In this paper, we present three new results that characterize SEQUITUR's computational and compression performance. First, we prove that SEQUITUR operates in time linear in n, the length of the input sequence, despite its ability to build a hierarchy as deep as log(n). Second, we show that a sequence can be compressed incrementally, improving on the non-incremental algorithm that was described by Nevill-Manning et al. (1994), and making on-line compression feasible. Third, we present an intriguing result that emerged during benchmarking; whereas PPMC (Moffat, 1990) outperforms SEQUITUR on most files in the Calgary corpus, SEQUITUR regains the lead when tested on multi-megabyte sequences. We make some tentative conclusions about the underlying reasons for this phenomenon, and about the nature of current compression benchmarking.

[78]

Compression-based pruning of decision lists

B. Pfahringer. Compression-based pruning of decision lists. In Proc European Conference on Machine Learning, volume 1224 of LNAI, pages 199-212, Prague, Czech Republic, 1997. Springer-Verlag.
[ bib | .ps | .pdf ]
We define a formula for estimating the coding costs of decision lists for propositional domains. This formula allows for multiple classes and both categorical and numerical attributes. For artificial domains the formula performs quite satisfactory, whereas results are rather mixed and inconclusive for natural domains. Further experiments lead to a principled simplification of the original formula which is robust in both artificial and natural domains. Simple hill-climbing search for the most compressive decision list significantly reduces the complexity of a given decision list while not impeding and sometimes even improving its predictive accuracy.

[79]

Probabilistic unification grammars

T.C. Smith and J.G. Cleary. Probabilistic unification grammars. In Workshop notes: ACSC'97 Australasian Natural Language Processing Summer Workshop, pages 25-32, 1997.
[ bib | .ps | .pdf ]
Recent research has shown that unification grammars can be adapted to incorporate statistical information, thus preserving the processing benefits of stochastic context-free grammars while offering an efficient mechanism for handling dependencies. While complexity studies show that a probabilistic unification grammar achieves an appropriately lower entropy estimate than an equivalent PCFG, the problem of parameter estimation prevents results from reflecting the empirical distribution. This paper describes how a PUG can be implemented as a Prolog DCG annotated with weights, and how the weights can be interpreted to give accurate entropy estimates. An algorithm for learning correct weights is provided, along with results from some complexity analyses.

[80]

Models of English text

W.J. Teahan and J.G. Cleary. Models of english text. In J.A. Storer and M. Cohn, editors, Proc Data Compression Conference, pages 12-21, Los Alamitos, CA, 1997. IEEE Press.
[ bib | .ps | .pdf ]
The problem of constructing models of English text is considered. A number of applications of such models including cryptology, spelling correction and speech recognition are reviewed. The best current models for English text have been the result of research into compression. Not only is this an important application of such models but the amount of compression provides a measure of how well such models perform. Three main classes of models are considered: character based models, word based models, and models which use auxiliary information in the form of parts of speech. These models are compared in terms of their memory usage and compression.

[81]

Inducing cost-sensitive trees via instance-weighting

K.M. Ting. Inducing cost-sensitive trees via instance-weighting. Technical Report 97/22, University of Waikato, Department of Computer Science, Hamilton, New Zealand, September 1997.
[ bib ]
We introduce an instance-weighting method to induce cost-sensitive trees in this paper. It is a generalization of the standard tree induction process where only the initial instance weights determine the type of tree (i.e., minimum error trees or minimum cost trees) to be induced. We demonstrate that it can be easily adopted to an existing tree learning algorithm.

Previous research gave insufficient evidence to support the fact that the greedy divide-and-conquer algorithm can effectively induce a truly cost-sensitive tree directly from the training data. We provide this empirical evidence in this paper. The algorithm employing the instance-weighting method is found to be comparable to or better than both C4.5 and C5 in terms of total misclassification costs, tree size and the number of high cost errors. The instance-weighting method is also simpler and more effective in implementation than a method based on altered priors.

[82]

Model combination in the multiple-data-batched scenario

K.M. Ting and B.T. Low. Model combination in the multiple-data-batched scenario. In Proc European Conference on Machine Learning, volume 1224 of LNAI, pages 250-265, Prague, Czech Republic, 1997. Springer-Verlag.
[ bib | .ps | .pdf ]
The approach of combining models learned from multiple batches of data provide an alternative to the common practice of learning one model from all the available data (i.e., the data combination approach). This paper empirically examines the base-line behaviour of the model combination approach in this multiple-data-batches scenario. We find that model combination can lead to better performance even if the disjoint batches of data are drawn randomly from a larger sample, and relate the relative performance of the two approaches to the learning curve of the classifier used.

The practical implication of our results is that one should consider using model combination rather than data combination, especially when multiple batches of data for the same task are readily available.

Another interesting result is that we empirically show that the near-asymptotic performance of a single model, in some classification task, can be significantly improved by combining multiple models (derived from the same algorithm) if the constituent models are substantially different and there is some regularity in the models to be exploited by the combination method used. Comparisons with known theoretical results are also provided.

[83]

Learning from batched data: model combination vs data combination

K.M. Ting, B.T. Low, and I.H. Witten. Learning from batched data: model combination vs data combination. Technical Report 97/14, University of Waikato, Department of Computer Science, Hamilton, New Zealand, May 1997.
[ bib ]
When presented with multiple batches of data, one can either combine them into a single batch before applying a machine learning procedure or learn from each batch independently and combine the resulting models. The former procedure, data combination, is straightforward; this paper investigates the latter, model combination. Given an appropriate combination method, one might expect model combination to prove superior when the data in each batch was obtained under somewhat different conditions or when different learning algorithms were used on the batches. Empirical results show that model combination often outperforms data combination even when the batches are drawn randomly from a single source of data and the same learning method is used on each. Moreover, this is not just an artifact of one particular method of combining models: it occurs with several different combination methods.

We relate this phenomenon to the learning curve of the classifiers being used. Early in the learning process when the learning curve is steep there is much to gain from data combination, but later when it becomes shallow there is less to gain and model combination achieves a greater reduction in variance and hence a lower error rate.

The practical implication of these results is that one should consider using model combination rather than data combination, especially when multiple batches of data for the same task are readily available. It is often superior even when the batches are drawn randomly from a single sample, and we expect its advantage to increase if genuine statistical differences between the batches exist.

[84]

Stacked generalization: when does it work?

K.M. Ting and I.H. Witten. Stacked generalization: when does it work? In Proc International Joint Conference on Artificial Intelligence, pages 866-871, Japan, 1997.
[ bib | .ps | .pdf ]
Stacked generalization is a general method of using a high-level model to combine lower-level models to achieve greater predictive accuracy. In this paper we address two crucial issues which have been considered to be a 'black art' in classification tasks ever since the introduction of stacked generalization in 1992 by Wolpert: the type of generalizer that is suitable to derive the higher-level model, and the kind of attributes that should be used as its input. We demonstrate the effectiveness of stacked generalization for combining three different types of learning algorithms.

[85]

Stacking bagged and dagged models

K.M. Ting and I.H. Witten. Stacking bagged and dagged models. In Proc International Conference on Machine Learning, pages 367-375, Tennessee, USA, 1997.
[ bib | .ps | .pdf ]
In this paper, we investigate the method of stacked generalization in combining models derived from different subsets of a training dataset by a single learning algorithm, as well as different algorithms. The simplest way to combine predictions from competing models is majority vote, and the effect of the sampling regime used to generate training subsets has already been studied in this context-when bootstrap samples are used the method is called bagging, and for disjoint samples we call it dagging. This paper extends these studies to stacked generalization, where a learning algorithm is employed to combine the models. This yields new methods dubbed bag-stacking and dag-stacking.

We demonstrate that bag-stacking and dag-stacking can be effective for classification tasks even when the training samples cover just a small fraction of the full dataset. In contrast to earlier bagging results, we show that bagging and bag-stacking work for stable as well as unstable learning algorithms, as do dagging and dag-stacking. We find that bag-stacking (dag-stacking) almost always has higher predictive accuracy than bagging (dagging), and we also show that bag-stacking models derived using two different algorithms is more effective than bagging.

[86]

Induction of model trees for predicting continuous classes

Y. Wang and I.H. Witten. Induction of model trees for predicting continuous classes. In Proc European Conference on Machine Learning Poster Papers, pages 128-137, Prague, Czech Republic, 1997.
[ bib | .ps | .pdf ]
Many problems encountered when applying machine learning in practice involve predicting a class that takes on a continuous numeric value, yet few machine learning schemes are able to do this. This paper describes a rational reconstruction of M5, a method developed by Quinlan (1992) for inducing trees of regression models. In order to accommodate data typically encountered in practice it is necessary to deal effectively with enumerated attributes and with missing values, and techniques devised by Breiman et al. (1984) are adapted for this purpose. The resulting system seems to outperform M5, based on the scanty published data that is available.

[87]

User manual - Weka: the Waikato environment for knowledge analysis

Waikato ML group. User manual - weka: the waikato environment for knowledge analysis, June 1997. This is out of date and refers to the old C version of WEKA, which we no longer support or distribute.
[ bib ]
[88]

Experience with OB1: an optimal Bayes decision tree learner

J.G. Cleary and L. Trigg. Experience with ob1: an optimal bayes decision tree learner. Technical Report 98/10, University of Waikato, Department of Computer Science, Hamilton, New Zealand, May 1998.
[ bib | .ps | .pdf ]
Machine learning algorithms for inferring decision trees typically choose a single best tree to describe the training data. Recent research has shown that classification performance can be significantly improved by voting predictions of multiple, independently produced decision trees. This paper describes an algorithm, OB1, that makes a weighted sum over many possible models. We describe one instance of OB1, that includes all possible decision trees as well as naive Bayesian models. OB1 is compared with a number of other decision tree and instance based learning alogrithms on some of the data sets from the UCI repository. Both an information gain and an accuracy measure are used for the comparison. On the information gain measure OB1 performs significantly better than all the other algorithms. On the accuracy measure it is significantly better than all the algorithms except naive Bayes which performs comparably to OB1.

[89]

Generating Accurate Rule Sets Without Global Optimization

Eibe Frank and Ian H. Witten. Generating accurate rule sets without global optimization. In Proc 15th International Conference on Machine Learning, Madison, Wisconsin, pages 144-151. Morgan Kaufmann, 1998. Also available as Working Paper 98/2, Department of Computer Science, University of Waikato; January.
[ bib | .ps | .pdf ]
The two dominant schemes for rule-learning, C4.5 and RIPPER, both operate in two stages. First they induce an initial rule set and then they refine it using a rather complex optimization stage that discards (C4.5) or adjusts (RIPPER) individual rules to make them work better together. In contrast, this paper shows how good rule sets can be learned one rule at a time, without any need for global optimization. We present an algorithm for inferring rules by repeatedly generating partial decision trees, thus combining the two major paradigms for rule generation-creating rules from decision trees and the separate-and-conquer rule-learning technique. The algorithm is straightforward and elegant: despite this, experiments on standard datasets show that it produces rule sets that are as accurate as and of similar size to those generated by C4.5, and more accurate than RIPPER's. Moreover, it operates efficiently, and because it avoids postprocessing, does not suffer the extremely slow performance on pathological example sets for which the C4.5 method has been criticized.

[90]

Using a Permutation Test for Attribute Selection in Decision Trees

Eibe Frank and Ian H. Witten. Using a permutation test for attribute selection in decision trees. In Proc 15th International Conference on Machine Learning, Madison, Wisconsin, pages 152-160. Morgan Kaufmann, 1998.
[ bib | .ps | .pdf ]
Most techniques for attribute selection in decision trees are biased towards attributes with many values, and several ad hoc solutions to this problem have appeared in the machine learning literature. Statistical tests for the existence of an association with a prespecified significance level provide a well-founded basis for addressing the problem. However, many statistical tests are computed from a chi-squared distribution, which is only a valid approximation to the actual distribution in the large-sample case-and this patently does not hold near the leaves of a decision tree. An exception is the class of permutation tests. We describe how permutation tests can be applied to this problem. We choose one such test for further exploration, and give a novel two-stage method for applying it to select attributes in a decision tree. Results on practical datasets compare favorably with other methods that also adopt a pre-pruning strategy.

[91]

Using Model Trees for Classification

Eibe Frank, Yong Wang, Stuart Inglis, Geoffrey Holmes, and Ian H. Witten. Using model trees for classification. Machine Learning, 32(1):63-76, 1998. Also available as Working Paper 97/12, Department of Computer Science, University of Waikato; April.
[ bib | .ps | .pdf ]
Model trees, which are a type of decision tree with linear regression functions at the leaves, form the basis of a recent successful technique for predicting continuous numeric values. They can be applied to classification problems by employing a standard method of transforming a classification problem into a problem of function approximation. Surprisingly, using this simple transformation the model tree inducer M5', based on Quinlan's M5, generates more accurate classifiers than the state-of-the-art decision tree learner C5.0, particularly when most of the attributes are numeric.

[92]

Practical feature subset selection for machine learning

Mark Andrew Hall and Lloyd Smith. Practical feature subset selection for machine learning. In Proc 21st Australian Computer Science Conference, pages 181-191, Perth, Australia, 1998. Springer.
[ bib | .ps | .pdf ]
Machine learning algorithms automatically extract knowledge from machine readable information. Unfortunately, their success is usually dependant on the quality of the data that they operate on. If the data is inadequate, or contains extraneous and irrelevant information, machine learning algorithms may produce less accurate and less understandable results, or may fail to discover anything of use at all. Feature subset selectors are algorithms that attempt to identify and remove as much irrelevant and redundant information as possible prior to learning. Feature subset selection can result in enhanced performance, a reduced hypothesis search space, and, in some cases, reduced storage requirement. This paper describes a new feature selection algorithm that uses a correlation based heuristic to determine the goodness of feature subsets, and evaluates its effectiveness with three common machine learning algorithms. Experiments using a number of standard machine learning data sets are presented. Feature subset selection gave significant improvement for all three algorithms.

[93]

Optimisation techniques for a computer simulation of a pastoral dairy farm

Hart R., M.T. Larcombe, R.A. Sherlock, and L.A. Smith. Optimisation techniques for a computer simulation of a pastoral dairy farm. Journal Computing and Electronics in Agriculture, 19:129-153, 1998.
[ bib | .ps | .pdf ]
This paper compares different methods of optimising the management variables in UDDER, a commercially-available computer simulation model of a pastoral dairy farm. The emphasis is on identifying the best optimisation strategy for this complex multi-dimensional system, taking the simulation model as a given constant. The optimisation methods studied are based on significantly different principles, with differing strengths and weaknesses: two hill-climbing algorithms (Nelder-Mead Simplex and Powell's Direction Set), and a genetic algorithm. Rather than examine all facets of dairy farm management, a single problem is optimised-that of maximising milkfat production while maintaining the health of the herd and pasture.

The results show that while the genetic algorithm can determine good regions within the search space quickly, it is considerably slower than either hill-climber at finding the optimal point within that region. The hillclimbers, in contrast, are fast but have a tendency to get trapped on local maxima and thus fail to find the true optimum. This led to the development of a hybrid algorithm which utilises the initial global search of the genetic algorithm, followed by the more efficient local search of a hill-climber. This hybrid algorithm discovered near-optimal points much more quickly than the genetic algorithm, and with more reliability than the hill-climber.

[94]

Predicting apple bruising using machine learning

Holmes G., S.J. Cunningham, B.T. Dela Rue, and A.F. Bollen. Predicting apple bruising using machine learning. In L.M.M. Tijskens and M.L.A.T.M. Hertog, editors, Model-IT Conference, Acta Horticulturae, volume 476, pages 289-296, The Netherlands, 1998. Also available as Working Paper 98/7, Department of Computer Science, University of Waikato; April.
[ bib | .ps | .pdf ]
Many models have been used to describe the influence of internal or external factors on apple bruising. Few of these have addressed the application of derived relationships to the evaluation of commercial operations. From an industry perspective, a model must enable fruit to be rejected on the basis of a commercially significant bruise and must also accurately quantify the effects of various combinations of input features (such as cultivar, maturity, size, and so on) on bruise prediction. Input features must in turn have characteristics which are measurable commercially; for example the measure of force should be impact energy rather than energy absorbed. Further, as the commercial criteria for acceptable damage levels change, the model should be versatile enough to regenerate new bruise thresholds from existing data.

Machine learning is a burgeoning technology with a vast range of potential applications particularly in agriculture where large amounts of data can be readily collected [1]. The main advantage of using a machine learning method in an application is that the models built for prediction can be viewed and understood by the owner of the data who is in a position to determine the usefulness of the model, an essential component in a commercial environment.

Machine Learning software recently developed at Waikato University [2] offers potential as a prediction tool for the classification of bruising based on historical data. It gives the user the opportunity to select any number of measured input attributes and determine the influence of that combination on a range of bruise size categories. The user may require a high degree of accuracy in the classification and therefore prune the attributes or bruise classes accordingly, or alternatively seek to discover trends in the dataset (in which case a lower level of accuracy often clarifies implicit structures in the data).

Models such as the theory of elasticity suggest that impact energy and radius of curvature will have a significant effect on the bruise surface area. Cell structure is also thought to contribute to variation in bruise size. The experiment described in this paper uses the machine learning programs C5.4 [3] and M5' [4] to determine the influence of impact energy, radius of curvature and impact site location on bruise area.

[95]

Knowledge visualization techniques for machine learning

M. Humphrey, S.J. Cunningham, and I.H. Witten. Knowledge visualization techniques for machine learning. Intelligent Data Analysis, 2(1-4):333-347, 1998.
[ bib ]
Researchers in machine learning primarily use decision trees, production rules, and decision graphs for visualizing classification data, with the graphic form in which a structure is portrayed as having a strong influence on comprehensibility. We analyze the questions that, in our experience, end users of machine learning tend to ask of the structures inferred from their empirical data. By mapping these questions onto visualization tasks, we have created new graphical representations that show the flow of examples through a decision structure. The knowledge visualization techniques are particularly appropriate in helping to answer the questions that users typically ask, and we describe their use in discovering new properties of a data set. In the case of decision trees, an automated software tool has been developed to construct the visualizations.

[96]

Objective measurement of mushroom quality

N. Kusabs, F. Bollen, L. Trigg, G. Holmes, and S. Inglis. Objective measurement of mushroom quality. In Proc New Zealand Institute of Agricultural Science and the New Zealand Society for Horticultural Science Annual Convention, page 51, Hawke's Bay, New Zealand, 1998.
[ bib ]
This paper describes a methodology for establishing an objective measurement of mushroom quality, based on a set of measured physical attributes. These attributes were used by a machine learning tool to model quality based on the classification of professional graders (experts).

Four experts visually evaluate 300 mushrooms and graded them into three major and eight subclasses of commercial quality. Weight, firmness and images of the top and bottom of each mushroom were then measured. These physical parameters were used to construct a model of the grades assigned by the four experts. Grader consistency was also assessed by repeated classification (four repetitions) of two 100-mushroom sets. Grader repeatability ranged from 6 to 15% misclassification.

Misclassification by the model increased with grading complexity (16-32% for the three major grades and 17-63% for the eight subclasses). This depended heavily on grader classifications and the variables used for training the model (weight, firmness, estimated gill opening and red/green/blue histograms from the image). Accuracy of classification using various combinations of physical attributes, which indicate the relative weight placed on each attribute by the different graders, are detailed in the paper.

[97]

Predicting query times

McNab R., Y. Wang, I.H. Witten, and C. Gutwin. Predicting query times. In W.B. Croft and et al., editors, Proc ACM SIGIR Conference on Research and Development in Information Retrieval, pages 355-356, Melbourne, Australia, 1998. ACM Press.
[ bib ]
We outline the need for search engines to provide user feedback on the expected time for a query, describe a scheme for learning a model of query time by observing sample queries, and discuss the results obtained for a set of actual user queries on a document collection using the MG search engine.

[98]

User perceptions of machine learning

R.J. McQueen and G. Holmes. User perceptions of machine learning. In E.D. Hoadley and I. Benbasat, editors, Proc Association of Information Systems Conference, pages 180-182, Maryland, Baltimore, 1998. Association for Information Systems, Atlanta, GA.
[ bib | .pdf ]
Machine learning has potential use in the understanding of information hidden in large datasets, but little is known about user's perceptions about the use of the technology. In this study, a number of datasets were solicited from agricultural researchers and processed using a machine learning workbench. The results were reported to the researchers, and then interviews were conducted with some of them to determine their perceptions about the use of machine learning as an additional analysis technique to traditional statistical analysis. An number of themes about their satisfaction with this technique were constructed from the interview transcripts, which generally indicate that machine learning may be able to contribute to analysis and understanding of these kinds of datasets.

[99]

User satisfaction with machine learning as a data analysis method in agricultural research

R.J. McQueen, G. Holmes, and L. Hunt. User satisfaction with machine learning as a data analysis method in agricultural research. New Zealand Journal of Agricultural Research, 41(4):577-584, 1998.
[ bib | .pdf ]
Machine learning has potential use in the understanding of information hidden in large datasets, but little is know about user perceptions about the use of the technology. In this study, a number of datasets were solicited from agricultural researchers and processed using a machine learning workbench. The results were reported to the researchers, and then interviews were conducted with some of them to determine their perceptions about the use of machine learning as an additional analysis technique to traditional statistical analysis. An number of themes about their satisfaction with this technique were constructed from the interview transcripts, which generally indicate that machine learning may be able to contribute to analysis and understanding of these kinds of datasets.

[100]

Inferring lexical and grammatical structure from sequences

C.G. Nevill-Manning and I.H. Witten. Inferring lexical and grammatical structure from sequences. In B. Carpentieri and et al., editors, Proc Compression and Complexity of Sequences, pages 265-274, Los Alamitos, CA, 1997. IEEE Computer Society Press. Published as Nevill-Manning and Witten, 1998.
[ bib | .ps | .pdf ]
In a wide variety of sequences from various sources, from music and text to DNA and computer programs, two different but related kinds of structure can be discerned. First, some segments tend to be repeated exactly, such as motifs in music, words or phrases in text, identifiers and syntactic idioms in computer programs. Second, these segments interact with each other in variable but constrained ways. For example, in English text only certain syntactic word classes can appear after the word 'the'-many parts of speech (such as verbs) are necessarily excluded.

This paper shows how these kinds of structure can be inferred automatically from sequences. Let us make clear at the outset what aspects of sequence structure we are not concerned with. We take no account of numerical frequencies other than the 'more than once' that defines repetition. We do not consider any similarity metrics between the individual symbols that make up the sequence, nor between 'similar' subsequences such as transposed or transformed motifs in music. Finally, although we are certainly interested in nested repetitions, we do not analyze recursive structure in sequences-such as self-similarity in fractal sequences. All of these regularities are interesting ones that would be well worth taking into account, but lie beyond the scope of this paper.

[101]

Phrase hierarchy inference and compression in bounded space

C.G. Nevill-Manning and I.H. Witten. Phrase hierarchy inference and compression in bounded space. In J.A. Storer and M. Cohn, editors, Proc Data Compression Conference, pages 179-188, Los Alamitos, CA, 1998. IEEE Press.
[ bib | .ps | .pdf ]
Text compression by inferring a phrase hierarch from the input is a recent technique that shows promise both as a compression scheme and as a machine learning method that extracts some comprehensible account of the structure of the input text. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences have been put to such eclectic uses as phrase browsing in digital libraries, music analysis, and inferring rules for fractal images.

[102]

Fast convergence with a greedy tag-phrase dictionary

T.C. Smith and R. Peeters. Fast convergence with a greedy tag-phrase dictionary. In J.A. Storer and M. Cohn, editors, Proc Data Compression Conference, pages 33-42, Los Alamitos, CA, 1998. IEEE Press.
[ bib | .ps | .pdf ]
The best general-purpose compression schemes make their gains by estimating a probability distribution over all possible next symbols given the context established by some number of previous symbols. Such context models typically obtain good compression results for plain text by taking advantage of regularities in character sequences. Frequent words and syllables can be incorporated into the model quickly and thereafter used for reasonably accurate prediction. However, the precise context in which frequent patterns emerge is often extremely varied, and each new word or phrase immediately introduces new contexts which can adversely affect the compression rate.

A great deal of the structural regularity in a natural language is given rather more by properties of its grammar than by the orthographic transcription of its phonology. This implies that access to a grammatical abstraction might lead to good compression. While grammatical models have been used successfully for compressing computer programs [4], grammar-based compression of plain text has received little attention, primarily because of the difficulties associated with constructing a suitable natural language grammar. But even without a precise formulation of the syntax of a language, there is a linguistic abstraction which is easily accessed and which demonstrates a high degree of regularity which can be exploited for compression purposes-namely, lexical categories.

[103]

Learning feature-value grammars from plain text

T.C. Smith. Learning feature-value grammars from plain text. In David M.-W. Powers, editor, Proc Joint Conference on New Methods in Language Processing and Computational Natural Language Learning, pages 291-294, Sydney, Australia, 1998.
[ bib | .ps | .pdf ]
This paper outlines preliminary work on learning feature-value grammars from plain text. Common suffixes are gleaned from a word suffix tree and used to form a first approximation of how regular inflection is marked. Words are generalised into lexical categories according to regularities in how these suffixes appear in trigram context. The categories are expressed as a lexical feature whose value is given by the most frequent suffix for similar trigrams. The trigrams are subsequently used to infer agreement dependencies which are captured through the creation of additional feature structures. Agreement and linear precedence are preserved through the iterative creation of unification rules for pairs of terms.

[104]

Boosting trees for cost-sensitive classification

K.M. Ting and Z. Zheng. Boosting trees for cost-sensitive classification. In Proc European Conference on Machine Learning, volume 1398 of LNAI, pages 190-195, Berlin, 1998.
[ bib | .ps | .pdf ]
This paper explores two boosting techniques for cost-sensitive tree classifications in the situation where misclassification costs change very often. Ideally, one would like to have only one induction, and use the induced model for different misclassification costs. Thus, it demands robustness of the induced model against cost changes. Combining multiple trees gives robust predictions against this change. We demonstrate that the two boosting techniques are a good solution in different aspects under this situation.

[105]

An entropy gain measure of numeric prediction performance

L. Trigg. An entropy gain measure of numeric prediction performance. Technical Report 98/11, University of Waikato, Department of Computer Science, Hamilton, New Zealand, May 1998.
[ bib | .ps | .pdf ]
Categorical classifier performance is typically evaluated with respect to error rate, expressed as a percentage of test instances that were not correctly classified. When a classifier produces multiple classifications for a test instance, the prediction is counted as incorrect (even if the correct class was one of the predictions). Although commonly used in the literature, error rate is a coarse measure of classifier performance, as it is based only on a single prediction offered for a test instance. Since many classifiers can produce a class distribution as a prediction, we should use this to provide a better measure of how much information the classifier is extracting from the domain.

Numeric classifiers are a relatively new development in machine learning, and as such there is no single performance measure that has become standard. Typically these machine learning schemes predict a single real number for each test instance, and the error between the predicted and actual value is used to calculate a myriad of performance measures such as correlation coefficient, root mean squared error, mean absolute error, relative absolute error, and root relative squared error. With so many performance measures it is difficult to establish an overall performance evaluation.

The next section describes a performance measure for machine learning schemes that attempts to overcome the problems with current measures. In addition, the same evaluation measure is used for categorical and numeric classifier.

[106]

Improving Browsing in Digital Libraries with Keyphrase Indexes

Carl Gutwin, Gordon W. Paynter, Ian H. Witten, Craig G. Nevill-Manning, and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Decision Support Systems, 27(1/2):81-104, 1999.
[ bib | .ps.gz ]
[107]

Developing innovative applications of machine learning

S.J. Cunningham and G. Holmes. Developing innovative applications of machine learning. In Proc Southeast Asia Regional Computer Confederation Conference, Singapore, 1999.
[ bib | .ps | .pdf ]
The WEKA (Waikato Environment for Knowledge Analysis) system provides a comprehensive suite of facilities for applying data mining techniques to large data sets. This paper discusses a process model for analyzing data, and describes the support that WEKA provides for this model. The domain model 'learned' by the data mining algorithm can then be readily incorporated into a software application. The WEKA-based analysis and application construction process is illustrated through a case study in the agricultural domain�mushroom grading.

[108]

Market Basket Analysis of Library Circulation Data

Sally Jo Cunningham and Eibe Frank. Market basket analysis of library circulation data. In T. Gedeon, P. Wong, S. Halgamuge, N. Kasabov, D. Nauck, and K. Fukushima, editors, Proc 6th International Conference on Neural Information Processing, volume II of Perth, Australia, pages 825-830. IEEE Service Center, 1999.
[ bib | .ps | .pdf ]
Market Basket Analysis algorithms have recently seen widespread use in analyzing consumer purchasing patterns�specifically, in detecting products that are frequently purchased together. We apply the Apriori market basket analysis tool to the task of detecting subject classification categories that co-occur in transaction records of books borrowed from a university library. This information can be useful in directing users to additional portions of the collection that may contain documents relevant to their information needs, and in determining a library's physical layout. These results can also provide insight into the degree of scatter that the classification scheme induces in a particular collection of documents.

[109]

Domain-Specific Keyphrase Extraction

Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and Craig G. Nevill-Manning. Domain-specific keyphrase extraction. In Proc 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, pages 668-673. Morgan Kaufmann, 1999.
[ bib | .ps | .pdf ]
Automatic keyphrase extraction is a promising area for applied machine learning because keyphrases are an important means for document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore, it is highly desirable to automate the keyphrase extraction process.

This paper presents a simple procedure for keyphrase extraction based on the naive Bayes learning scheme, which is shown to perform comparably to the state-of-the-art. It goes on to explain how the performance of this procedure can be boosted by automatically tailoring the extraction process to the particular document collection at hand. Results on a large collection of technical reports in computer science show that the quality of the extracted keyphrases improves significantly if domain-specific information is exploited.

[110]

Making Better Use of Global Discretization

Eibe Frank and Ian H. Witten. Making better use of global discretization. In Proc 16th International Conference on Machine Learning, Bled, Slovenia, pages 115-123. Morgan Kaufmann, 1999.
[ bib | .ps | .pdf ]
Before applying learning algorithms to datasets, practitioners often globally discretize any numeric attributes. If the algorithm cannot handle numeric attributes directly, prior discretization is essential. Even if it can, prior discretization often accelerates induction, and may produce simpler and more accurate classifiers.

As it is generally done, global discretization denies the learning algorithm any chance of taking advantage of the ordering information implicit in numeric attributes. However, a simple transformation of discretized data preserves this information in a form that learners can use. We show that, compared to using the discretized data directly, this transformation significantly increases the accuracy of decision trees built by C4.5, decision lists built by PART, and decision tables built using the wrapper method, on several benchmark datasets. Moreover, it can significantly reduce the size of the resulting classifiers.

This simple technique makes global discretization an even more useful tool for data preprocessing.

[111]

Correlation-based feature selection for machine learning

M.A. Hall. Correlation-based feature selection for machine learning. PhD thesis, University of Waikato, Department of Computer Science, Hamilton, New Zealand, April 1999.
[ bib | .ps | .pdf ]
A central problem in machine learning is identifying a representative set of features from which to construct a classification model for a particular task. This thesis addresses the problems of feature selection for machine learning through a correlation based approach. The central hypothesis is that good feature sets contain features that are highly correlated with the class, yet uncorrelated with each other. A feature evaluation formula, based on ideas from test theory, provides an operational definition of this hypothesis. CFS (Correlation based Feature Selection) is an algorithm that couples this evaluation formula with an appropriate correlation measure and a heuristic search strategy.

[112]

Feature selection for discrete and numeric class machine learning

M.A. Hall. Feature selection for discrete and numeric class machine learning. Technical Report 99/4, University of Waikato, Department of Computer Science, Hamilton, New Zealand, April 1999.
[ bib | .ps | .pdf ]
Algorithms for feature selection fall into two broad categories: wrappersuse the learning algorithm itself to evaluate the usefulness of features, while filtersevaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems.

This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. Experiments using the new method as a preprocessing step for naive Bayes, instance-based learning, decision trees, locally weighted regression, and model trees show it to be an effective feature selector- it reduces the data in dimensionality by more than sixty percent in most cases without negatively affecting accuracy. Also, decision and model trees built from the pre-processed data are often significantly smaller.

[113]

Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper

Mark Andrew Hall and Lloyd Smith. Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. In A. N. Kumar and I. Russel, editors, Proc Florida Artificial Intelligence Symposium, pages 235-239, Orlando, Florida, 1999. AAAI Press.
[ bib | .ps | .pdf ]
Feature selection is often an essential data processing step prior to applying a learning algorithm. The removal of irrelevant and redundant information often improves the performance of machine learning algorithms. There are two common approaches: a wrapperuses the intended learning algorithm itself to evaluate the usefulness of features, while a filterevaluates features according to heuristics based on general characteristics of the data. The wrapper approach is generally considered to produce better feature subsets but runs much more slowly than a filter. This paper describes a new filter approach to feature selection that uses a correlation based heuristic to evaluate the worth of feature subsets. When applied as a data preprocessing step for two common machine learning algorithms, the new method compares favourably with the wrapper but requires much less computation.

[114]

A diagnostic tool for tree based supervised classification learning algorithms

G. Holmes and L. Trigg. A diagnostic tool for tree based supervised classification learning algorithms. In Proc Sixth International Conference on Neural Information Processing (ICONIP'99), volume II, pages 514-519, Perth, Western Australia, November 1999.
[ bib | .ps | .pdf ]
The process of developing applications of machine learning and data mining that employ supervised classification algorithms includes the important step of knowledge verification. Interpretable output is presented to a user so that they can verify that the knowledge contained in the output makes sense for the given application. As the development of an application is an iterative process it is quite likely that a user would wish to compare models constructed at various times or stages.

One crucial stage where comparison of models is important is when the accuracy of a model is being estimated, typically using some form of cross-validation. This stage is used to establish an estimate of how well a model will perform on unseen data. This is vital information to present to a user, but it is also important to show the degree of variation between models obtained from the entire dataset and models obtained during cross-validation. In this way it can be verified that the cross-validation models are at least structurally aligned with the model garnered from the entire dataset.

This paper presents a diagnostic tool for the comparison of tree-based supervised classification models. The method is adapted from work on approximate tree matching and applied to decision trees. The tool is described together with experimental results on standard datasets.

[115]

Generating Rule Sets from Model Trees

Geoffrey Holmes, Mark Hall, and Eibe Frank. Generating rule sets from model trees. In Proc 12th Australian Joint Conference on Artificial Intelligence, Sydney, Australia, pages 1-12. Springer, 1999.
[ bib | .ps | .pdf ]
Model trees - decision trees with linear models at the leaf nodes - have recently emerged as an accurate method for numeric prediction that produces understandable models. However, it is known that decision lists - ordered sets of If-Then rules - have the potential to be more compact and therefore more understandable than their tree counterparts.

We present an algorithm for inducing simple, accurate decision lists from model trees. Model trees are built repeatedly and the best rule is selected at each iteration. This method produces rule sets that are as accurate but smaller than the model tree constructed from the entire dataset. Experimental results for various heuristics which attempt to find a compromise between rule accuracy and rule coverage are reported. We show that our method produces comparably accurate and smaller rule sets than the commercial state-of-the-art rule learning system Cubist.

[116]

Fitting a mixture model to three-mode three-way data with categorical and continuous variables

L.A. Hunt and K.E. Basford. Fitting a mixture model to three-mode three-way data with categorical and continuous variables. Journal of Classification, 16(2):283-296, 1999.
[ bib ]
The mixture likelihood approach to clustering is most often used with two-mode two-way data to cluster one of the modes (e.g., the entities) into homogeneous groups on the basis of the other mode (e.g., the attributes). In this case, the attributes can either be continuous or categorical. When the data set consists of a three-mode three-way array (e.g., attributes measured on entities in different situations), an analogous procedure is needed to enable the clustering of the entities (i.e., one of the modes) on the basis of both of the other modes simultaneously (i.e., the attributes measured in different situations). In this paper, it is shown that the finite mixture approach to clustering can be extended to analyze three-mode three-way data where some of the attributes care continuous and some are categorical. The methodology is illustrated by clustering the genotypes in a three-way soybean data set where various attributes were measured on genotypes grown in several environments.

[117]

Mixture model clustering using the MULTIMIX program

L. Hunt and M. Jorgensen. Mixture model clustering using the multimix program. Australian and New Zealand Journal of Statistics, 41(2):153-171, 1999.
[ bib ]
Hunt (1996) implemented the finite mixture model approach to clustering in a program called MULTIMIX. The program is designed to cluster multivariate data that have categorical and continuous variables and that possibly contain missing values. This paper describes the approach taken to design MULTIMIX and how some of the statistical problems were dealt with. As an example, the program is used to cluster an large medical dataset.

[118]

Issues in stacked generalization

K.M. Ting and I.H. Witten. Issues in stacked generalization. Journal of Artificial Intelligence Research, 10:271-289, May 1999.
[ bib | .ps | .pdf ]
Stacked generalization is a general method of using a high-level model to combine lower-level models to achieve greater predictive accuracy. In this paper we address two crucial issues which have been considered to be a 'black art' in classification tasks ever since the introduction of stacked generalization in 1992 by Wolpert: the type of generalizer that is suitable to derive the higher-level model, and the kind of attributes that should be used as its input. We find that best results are obtained when the higher-level model combines the confidence (and not just the predictions) of the lower-level ones.

We demonstrate the effectiveness of stacked generalization for combining three different types of learning algorithms for classification tasks. We also compare the performance of stacked generalization with majority vote and published results of arcing and bagging.

[119]

Clustering with finite data from semi-parametric mixture distributions

Y. Wang and I.H. Witten. Clustering with finite data from semi-parametric mixture distributions. In Proc Symposium on the Interface: Models, Predictions, and Computing, Schaumburg, Illinois, 1999.
[ bib ]
Existing clustering methods for the semi-parametric mixture distribution perform well as the volume of data increases. However, they all suffer from a serious drawback in finite-data situations: small outlying groups of data points can be completely ignored in the clusters that are produced, no matter how far away they lie from the major clusters. This can result in unbounded loss if the loss function is sensitive to the distance between clusters.

This paper proposes a new distance-based clustering method that overcomes the problem by avoiding global constraints. Experimental results illustrate its superiority to existing methods when small clusters are present in finite data sets; they also suggest that it is more accurate and stable than other methods even when there are no small clusters.

[120]

Pace regression

Y. Wang and I.H. Witten. Pace regression. Technical Report 99/12, University of Waikato, Department of Computer Science, Hamilton, New Zealand, September 1999.
[ bib | .ps | .pdf ]
This paper articulates a new method of linear regression, pace regression, that addresses many drawbacks of standard regression reported in the literature-particularly the subset selection problem. Pace regression improves on classical ordinary least squares (OLS) regression by evaluating the effect of each variable and using a clustering analysis to improve the statistical basis for estimating their contribution to the overall regression. As well as outperforming OLS, it also outperforms-in a remarkably general sense-other linear modeling techniques in the literature, including subset selection procedures, which seek a reduction in dimensionality that falls out as a natural byproduct of pace regression. The paper defines six procedures that share the fundamental idea of pace regression, all of which are theoretically justified in terms of asymptotic performance. Experiments confirm the performance improvement over other techniques.

[121]

KEA: Practical Automatic Keyphrase Extraction

Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. KEA: Practical automatic keyphrase extraction. In Proc 4th ACM conference on Digital Libraries, Berkeley, CA, pages 254-255. ACM, August 1999.
[ bib | http | .ps | .pdf ]
Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea's effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available.

[122]

Weka: Practical Machine Learning Tools and Techniques with Java Implementations

Ian H. Witten, Eibe Frank, Len Trigg, Mark Hall, Geoffrey Holmes, and Sally Jo Cunningham. Weka: Practical machine learning tools and techniques with Java implementations. In Nikola Kasabov and Kitty Ko, editors, Proceedings of the ICONIP/ANZIIS/ANNES'99 Workshop on Emerging Knowledge Engineering and Connectionist-Based Information Systems, pages 192-196, 1999. Dunedin, New Zealand.
[ bib | .ps | .pdf ]
The Waikato Environment for Knowledge Analysis (Weka) is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning and data mining algorithms. Weka is freely available on the World-Wide Web and accompanies a new text on data mining [1] which documents and fully explains all the algorithms it contains. Applications written using the Weka class libraries can be run on any computer with a Web browsing capability; this allows users to apply machine learning techniques to their own data regardless of computer platform.

[123]

Text Categorization Using Compression Models

Eibe Frank, Chang Chui, and Ian H. Witten. Text categorization using compression models. In Data Compression Conference, Snowbird, Utah, page 555. IEEE Computer Society, 2000. Note: abstract only. Full paper is available as [132].
[ bib | .ps.gz | .pdf ]
[124]

Bottom-Up Propositionalization

Stefan Kramer and Eibe Frank. Bottom-up propositionalization. In J. Cussens and A. Frisch, editors, Proc Work-in-progress reports of the 10th International Conference on Inductive Logic Programming, pages 156-162. CEUR-WS.org, July 2000.
[ bib | .ps | .pdf ]
In this paper, we present a new method for propositionalization that works in a bottom-up, data-driven manner. It is tailored for biochemical databases, where the examples are 2-D descriptions of chemical compounds. The method generates all frequent fragments (i.e., linearly connected atoms) up to a user-specified length. A preliminary experiment in the domain of carcinogenicity prediction showed that bottom-up propositionalization is a promising approach to feature construction from relational data.

[125]

Pruning Decision Trees and Lists

Eibe Frank. Pruning Decision Trees and Lists. PhD thesis, Department of Computer Science, University of Waikato, 2000.
[ bib | .ps.gz | .pdf ]
[126]

Meta-Learning by Landmarking Various Learning Algorithms

B. Pfahringer, H. Bensusan, and C. Giraud-Carrier. Meta-learning by landmarking various learning algorithms. In P. Langley, editor, Proceedings of the 17th International Conference on Machine Learning (ICML-2000), July 2000.
[ bib | .ps | .pdf ]
Landmarking is a novel approach to describing tasks in meta-learning. Previous approaches to meta-learning mostly considered only statistics-inspired measures of the data as a source for the definition of meta-attributes. Contrary to such approaches, landmarking tries to determine the location of a specific learning problem in the space of all learning problems by directly measuring the performance of some simple and efficient learning algorithms themselves. In the experiments reported we show how such a use of landmark values can help to distinguish between areas of the learning space favouring different learners. Experiments, both with artificial and real-world databases, show that landmarking selects, with moderate but reasonable level of success, the best performing of a set of learning algorithms.

[127]

Learning to Use Operational Advice

J. Fuernkranz, B. Pfahringer, H. Kaindl, and S. Kramer. Learning to use operational advice. In W. Horn, editor, Proceedings of the 14th European Conference on Artificial Intelligence (ECAI 2000), pages 291-295, August 2000.
[ bib ]
[128]

A new approach to fitting linear models in high dimensional spaces

Yong Wang. A new approach to fitting linear models in high dimensional spaces. PhD thesis, University of Waikato, Department of Computer Science, Hamilton, New Zealand, 2000.
[ bib | .ps | .pdf ]
This thesis presents a new approach to fitting linear models, called 'pace regression', which also overcomes the dimensionality determination problem. Its optimality in minimizing the expected prediction loss is theoretically established, when the number of free parameters is infinitely large. In this sense, pace regression outperforms existing procedures for fitting linear models. Dimensionality determination, a special case of fitting linear models, turns out to be a natural by-product. A range of simulation studies are conducted; the results support the theoretical analysis.

Through the thesis, a deeper understanding is gained of the problem of fitting linear models. Many key issues are discussed. Existing procedures, namely OLS, AIC, BIC, RIC, CIC, CV(d), BS(m), RIDGE, NN-GAROTTE and LASSO, are reviewed and compared, both theoretically and empirically, with the new methods.

Estimating a mixing distribution is an indispensable part of pace regression. A measure-based minimum distance approach, including probability measures and nonnegative measures, is proposed, and strongly consistent estimators are produced. Of all minimum distance methods for estimating a mixing distribution, only the nonnegative-measure-based one solves the minority cluster problem, what is vital for pace regression.

Pace regression has striking advantages over existing techniques for fitting linear models. It also has more general implications for empirical modeling, which are discussed in the thesis.

[129]

Experiences with a weighted decision tree learner

J. G. Cleary, L. E. Trigg, G. Holmes, and M. A. Hall. Experiences with a weighted decision tree learner. In Proc 20th SGES International Conference on Knowledge Based Systems and Applied Artificial Intelligence, pages 35-47. Springer, 2000.
[ bib ]
[130]

Comparison of consumer and producer perceptions of mushroom quality

A.F. Bollen, N.J. Kusabs, G. Holmes, and M.A. Hall. Comparison of consumer and producer perceptions of mushroom quality. In W.J. Florkowski, S.E. Prussia, and R.L. Shewfelt, editors, Proc Integrated View of Fruit and Vegetable Quality International Multidisciplinary Conference, pages 303-311, Georgia, USA, 2000.
[ bib ]
The marketing of mushrooms in New Zealand is highly subjective. No detailed grading specifications exist and they are graded based on the experience of the growers (experts). The requirements of consumers are three or four steps removed in the value chain. The objective of this research has been to develop a quantitative set of descriptors which describe the quality grading criteria actually used by the graders, and to develop a similar set of criteria for the consumer. These two sets of descriptors were then compared to determine the difference in the two interpretations of quality.

Generally the consumers are classifying solely on visual attributes. There was disagreement between the consumer and the grower that suggested that the grower has adopted a quality profile which considerably exceeds that expected by the consumer group studied here.

The Machine Learning technique has been shown to produce a useful set of quality characterisation models for consumers. These have easily interpretable decision trees which are built on the objective attributes measured. The techniques help to provide an insight into the complex decisions made by consumers considering the purchase of mushrooms.

[131]

Technical Note: Naive Bayes for regression

E. Frank, L. Trigg, G. Holmes, and I.H. Witten. Technical note: Naive bayes for regression. Machine Learning, 41(1):5-26, 2000.
[ bib | .ps | .pdf ]
Despite its simplicity, the naive Bayes learning scheme performs well on most classification tasks, and is often significantly more accurate than more sophisticated methods. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class. This suggests that its good performance might be restricted to situations where the output is categorical. It is therefore interesting to see how it performs in domains where the predicted value is numeric, because in this case, predictions are more sensitive to inaccurate probability estimates.

This paper shows how to apply the naive Bayes methodology to numeric prediction (i.e. regression) tasks, and compares it to linear regression, instance-based learning, and a method that produces model trees-decision trees with linear regression functions at the leaves. Although we exhibit an artificial dataset for which naive Bayes is the method of choice, on real-world datasets it is almost uniformly worse than model trees. The comparison with linear regression depends on the error measure: for one measure naive Bayes performs similarly, for another it is worse. Compared to instance-based learning, it performs similarly with respect to both measures. These results indicate that the simplistic statistical assumption that naive Bayes makes is indeed more restrictive for regression than for classification.

[132]

Text Categorization Using Compression Models

Chang Chui Eibe Frank and Ian H. Witten. Text categorization using compression models. Technical Report 00/02, Department of Computer Science, University of Waikato, January 2000.
[ bib | .ps | .pdf ]
Text categorization, or the assignment of natural language texts to predefined categories based on their content, is of growing importance as the volume of information available on the internet continues to overwhelm us. The use of predefined categories implies a supervised learning approach to categorization, where already-classified articles - which effectively define the categories - are used as training data to build a model that can be used for classifying new articles that comprise the test data. This contrasts with unsupervised learning, where there is no training data and clusters of like documents are sought amongst the test articles. With supervised learning, meaningful labels (such as keyphrases) are attached to the training documents, and appropriate labels can be assigned automatically to test documents depending on which category they fall into.

[133]

Benchmarking attribute selection techniques for data mining

M.A. Hall. Benchmarking attribute selection techniques for data mining. Technical Report 00/10, University of Waikato, Department of Computer Science, Hamilton, New Zealand, July 2000.
[ bib | .ps | .pdf ]
Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building process phase can result in poor predictive performance and increased computation.

Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutations and has led to a situation where very few benchmark studies have been conducted.

This paper presents a benchmark comparison of several attribute selection methods. All the methods produce an attribute ranking, a useful devise of isolating the individual merit of an attribute. Attribute selection is achieved by cross-validating the rankings with respect to a learning scheme to find the best attributes. Results are reported for a selection of standard data sets and two learning schemes C4.5 and naive Bayes.

[134]

Correlation-based feature selection for discrete and numeric class machine learning

Mark Andrew Hall. Correlation-based feature selection for discrete and numeric class machine learning. In Proc 17th International Conference on Machine Learning, pages 359-366. Morgan Kaufmann, June 2000.
[ bib | .ps | .pdf ]
Algorithms for feature selection fall into two broad categories: wrappersthat use the learning algorithm itself to evaluate the usefulness of features and filtersthat evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems. This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. The algorithm often out-performs the well-known ReliefF attribute estimator when used as a preprocessing step for naive Bayes, instance-based learning, decision trees, locally weighted regression, and model trees. It performs more feature selection than ReliefF does-reducing the data dimensionality by fifty percent in most cases. Also, decision and model trees built from the prepocessed data are often significantly smaller.

[135]

Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations

Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 2000.
[ bib | .html ]
This book complements the Weka software. It shows how to use Weka's Java algorithms to discern meaningful patterns in your data, how to adapt them for your specialized data mining applications, and how to develop your own machine learning schemes. It offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. Inside, you'll learn all you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. If you're involved at any level in the work of extracting usable knowledge from large collections of data, this book will be a valuable resource.

[136]

Fitting a mixture model to three-mode three-way data with missing information

Lynette A Hunt and Kaye E Basford. Fitting a mixture model to three-mode three-way data with missing information. Journal of Classification, 18(2):209-226, 2001.
[ bib | http ]
When the data consist of certain attributes measured on the same set of items in different situations, they would be described as a three-mode three-way array. A mixture likelihood approach can be implemented to cluster the items (i.e., one of the modes) on the basis of both of the other modes simultaneously (i.e,, the attributes measured in different situations). In this paper, it is shown that this approach can be extended to handle three-mode three-way arrays where some of the data values are missing at random in the sense of Little and Rubin (1987). The methodology is illustrated by clustering the genotypes in a three-way soybean data set where various attributes were measured on genotypes grown in several environments.

[137]

Tag Insertion Complexity

Stuart Yeates, Ian H. Witten, and David Bainbridge. Tag insertion complexity. In Data Compression Conference, pages 243-252. IEEE Computer Society, 2001.
[ bib ]
[138]

Applications of machine learning in information retrieval

Sally Jo Cunningham, James Littin, and Ian H. Witten. Applications of machine learning in information retrieval. In M. E. Williams, editor, Annual Review of Information Science and Technology, pages 341-419. American Society for Information Science and Technology, 2001.
[ bib ]
[139]

Determining Progression in Glaucoma Using Visual Fields

Andrew Turpin, Eibe Frank, Mark Hall, Ian H. Witten, and Chris A. Johnson. Determining progression in glaucoma using visual fields. In Proc 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Hong Kong, China, pages 136-147. Springer, 2001.
[ bib | .ps | .pdf ]
The standardized visual field assessment, which measures visual function in 76 locations of the central visual area, is an important diagnostic tool in the treatment of the eye disease glaucoma. It helps determine whether the disease is stable or progressing towards blindness, with important implications for treatment. Automatic techniques to classify patients based on this assessment have had limited success, primarily due to the high variability of individual visual field measurements. The purpose of this paper is to describe the problem of visual field classification to the data mining community, and assess the success of datamining techniques on it. Preliminary results show that machine learning methods rival existing techniques for predicting whether glaucoma is progressing - though we have not yet been able to demonstrate improvements that are statistically significant. It is likely that further improvement is possible, and we encourage others to work on this important practical data mining problem.

[140]

Optimizing the Induction of Alternating Decision Trees

Bernhard Pfahringer, Geoffrey Holmes, and Richard Kirkby. Optimizing the induction of alternating decision trees. In D. Cheung, G.J. Williams, and Q. Li, editors, Proc 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 477-487. Springer, April 2001.
[ bib | .ps | .pdf ]
The alternating decision tree brings comprehensibility to the performance enhancing capabilities of boosting. A single interpretable tree is induced wherein knowledge is distributed across the nodes and multiple paths are traversed to form predictions. The complexity of the algorithm is quadratic in the number of boosting iterations and this makes it unsuitable for larger knowledge discovery in database tasks. In this paper we explore various heuristic methods for reducing this complexity while maintaining the performance characteristics of the original algorithm. In experiments using standard, artificial and knowledge discovery datasets we show that a range of heuristic methods with log linear complexity are capable of achieving similar performance to the original method. Of these methods, the random walk heuristic is seen to outperform all others as the number of boosting iterations increases. The average case complexity of this method is linear.

[141]

Prediction of Ordinal Classes Using Regression Trees

Stefan Kramer, Gerhard Widmer, Bernhard Pfahringer, and Michael de Groeve. Prediction of ordinal classes using regression trees. Fundam. Inform., 47(1-2):1-13, 2001.
[ bib | .ps | .pdf ]
This paper is devoted to the problem of learning to predict ordinal (i.e., ordered discrete) classes using classification and regression trees. We start with S-CART, a tree induction algorithm, and study various ways of transforming it into a learner for ordinal classification tasks. These algorithm variants are compared on a number of benchmark data sets to verify the relative strengths and weaknesses of the strategies and to study the trade-off between optimal categorical classification accuracy (hit rate) and minimum distance-based error. Preliminary results indicate that this is a promising avenue towards algorithms that combine aspects of classification and regression.

[142]

A Simple Approach to Ordinal Classification

Eibe Frank and Mark Hall. A simple approach to ordinal classification. Technical Report 01/05, Department of Computer Science, University of Waikato, 2001.
[ bib | .ps | .pdf ]
This is an updated version of a paper with the same title that appeared at the European Conference on Machine Learning 2001, Freiburg, Germany. Springer-Verlag, pp. 145-165.

Machine learning methods for classification problems commonly assume that the class values are unordered. However, in many practical applications the class values do exhibit a natural order - for example, when learning how to grade. The standard approach to ordinal classification converts the class value into a numeric quantity and applies a regression learner to the transformed data, translating the output back into a discrete class value in a post-processing step. A disadvantage of this method is that it can only be applied in conjunction with a regression scheme. In this paper we present a simple method that enables standard classification algorithms to make use of ordering information in class attributes. By applying it in conjunction with a decision tree learner we show that it outperforms the naive approach, which treats the class values as an unordered set. Compared to special-purpose algorithms for ordinal classification our method has the advantage that it can be applied without any modification to the underlying learning scheme.

[143]

A Simple Approach to Ordinal Classification

Eibe Frank and Mark Hall. A simple approach to ordinal classification. In Proc 12th European Conference on Machine Learning, Freiburg, Germany, pages 145-156. Springer, 2001. Note: there is a small bug in the description of the algorithm. Please consult [142] instead.
[ bib ]
[144]

Interactive machine learning: letting users build classifiers

Malcolm Ware, Eibe Frank, Geoffrey Holmes, Mark Hall, and Ian H. Witten. Interactive machine learning: letting users build classifiers. Int. J. Hum.-Comput. Stud., 55(3):281-292, 2001.
[ bib | .ps | .pdf ]
According to standard procedure, building a classifier using machine learning is a fully automated process that follows the preparation of training data by a domain expert. In contrast, interactive machine learning engages users in actually generating the classifier themselves. This offers a natural way of integrating background knowledge into the modeling stage - so long as interactive tools can be designed that support efficient and effective communication. This paper shows that appropriate techniques can empower users to create models that compete with classifiers built by state-of-the-art learning algorithms. It demonstrates that users - even users who are not domain experts - can often construct good classifiers, without any help from a learning algorithm, using a simple two-dimensional visual interface. Experiments on real data demonstrate that, not surprisingly, success hinges on the domain: if a few attributes can support good predictions, users generate accurate classifiers, whereas domains with many high-order attribute interactions favor standard machine learning techniques. We also present an artificial example where domain knowledge allows an expert user to create a much more accurate model than automatic learning algorithms. These results indicate that our system has the potential to produce highly accurate classifiers in the hands of a domain expert who has a strong interest in the domain and therefore some insights into how to partition the data. Moreover, small expert-defined models offer the additional advantage that they will generally be more intelligible than those generated by automatic techniques.

[145]

(The Futility of) Trying to Predict Carcinogenicity of Chemical Compounds

B. Pfahringer. (the futility of) trying to predict carcinogenicity of chemical compounds. In The Predictive Toxicology Challenge Workshop, Twelfth European Conference on Machine Learning (ECML2001), Freiburg, Germany, 2001.
[ bib | .ps | .pdf ]
This paper describes my submission to one of the sub-problems formulated for the Predictive Toxicology Challenge 2001. The challenge is to predict the carcinogenicity of chemicals based on structural information only. I have only tackled such predictions for bioessays involving male rats. As we currently do not know the true predictions for the testset, all we can say is that one of the models supplied by us seems to be optimal over some subrange of the ROC spectrum. The successful model uses a voting approach based on most of the sets of structural features made available by various other contestants as well as the organizers in an earlier phase of the Challenge. The WEKA Machine Learning workbench served as the core learning utility. Based on a preliminary examination of our submission we conclude that reliable prediction of carcinogenicity is still a far away goal.

[146]

Wrapping Boosters against Noise

Bernhard Pfahringer, Geoffrey Holmes, and Gabi Schmidberger. Wrapping boosters against noise. In Proc 14th Australian Joint Conference on Artificial Intelligence, pages 402-413. Springer, 2001.
[ bib | .pdf | .ps ]
Wrappers have recently been used to obtain parameter optimizations for learning algorithms. In this paper we investigate the use of a wrapper for estimating the correct number of boosting ensembles in the presence of class noise. Contrary to the naive approach that would be quadratic in the number of boosting iterations, the incremental algorithm described is linear.

Additionally, directly using the k-sized ensembles generated during k-fold cross-validation search for prediction usually results in further improvements in classification performance. This improvement can be attributed to the reduction of variance due to averaging k ensembles instead of using only one ensemble. Consequently, cross-validation in the way we use it here, termed wrapping, can be viewed as yet another ensemble learner similar in spirit to bagging but also somewhat related to stacking.

[147]

Investigation of association models to describe consumer purchase patterns

Mark Andrew Hall, N. J. Kusabs, D. Gillgren, and A. F. Bollen. Investigation of association models to describe consumer purchase patterns. In Proc International Symposium on Applications of Modeling as an Innovative Technology in the Agri-Food-Chain, pages 167-173. ISHS, 2001.
[ bib ]
[148]

Food Process Modelling

G. Holmes and T. Smith. Food Process Modelling, chapter Data mining. Woodhead Publishing Ltd, Cambridge, UK, 2001.
[ bib ]
[149]

Accuracy bounds for ensembles under 0 - 1 loss

R.R. Bouckaert. Accuracy bounds for ensembles under 0 - 1 loss. Technical Report 04/02, University of Waikato, Computer Science Department, Hamilton, New Zealand, 2002.
[ bib | .ps | .pdf ]
This paper is an attempt to increase the understanding in the behavior of ensembles for discrete variables in a quantitative way. A set of tight upper and lower bounds for the accuracy of an ensemble is presented for wide classes of ensemble algorithms, including bagging and boosting. The ensemble accuracy is expressed in terms of the accuracies of the members of the ensemble.

Since those bounds represent best and worst case behavior only, we study typical behavior as well, and discuss its properties. A parameterized bound is presented which describes ensemble behavior as a mixture of dependent base classifier and independent base classifier areas. Some empirical results are presented to support our conclusions.

[150]

Low level information extraction: a Bayesian network based approach

R.R. Bouckaert. Low level information extraction: a bayesian network based approach. In Workshop on Text Learning (TextML-2002), 2002.
[ bib | .ps | .pdf ]
In this article, a contribution is made to information extraction and Bayesian network learning motivated by two practical information extraction tasks.

It is shown that some information extraction tasks can be approached as a classification problem where the text is split in tokens and each token is assigned a class. Hidden Markov models are a popular formalism for this task, however they do not deal with tokens having a set of attributes instead of a single one. A new algorithm for this task is presented using various Bayesian networks architectures that deals with multiple attributes per token. Experiments suggest most Bayesian networks architectures perform better than Naive Bayes in our problem domain.

Hopefully, this article helps in making Bayesian networks more accessible for the information extraction community.

[151]

Learning Structure from Sequences, with Applications in a Digital Library

Ian H. Witten. Learning structure from sequences, with applications in a digital library. In ALT, pages 42-56, 2002.
[ bib ]
[152]

Optimising tabling structures for bottom-up logic programming

R. J. Clayton, J. G. Cleary, B. Pfahringer, and B. M. Utting. Optimising tabling structures for bottom-up logic programming. In Proc International Workshop on Logic Based Program Development and Transformation LOPSTR'02, pages 57-75, Madrid, Spain, 2002. Fundacion General de la Universidad Politechica de Madrid, Madrid.
[ bib ]
[153]

Racing Committees for Large Datasets

Eibe Frank, Geoffrey Holmes, Richard Kirkby, and Mark Hall. Racing committees for large datasets. In S. Lange, K. Satoh, and C. H. Smith, editors, Proc 5th International Conference on Discovery Science, volume 2534 of LNCS, pages 153-164. Springer, 2002. Also published as Working Paper 03/02, Computer Science Department, University of Waikato, Hamilton.
[ bib | .ps | .pdf ]
This paper proposes a method for generating classifiers from large datasets by building a committee of simple base classifiers using a standard boosting algorithm. It permits the processing of large datasets even if the underlying base learning algorithm cannot efficently do so. The basic idea is to split incoming data into chunks and build a committee based on classifiers built from these individual chunks. Our method extends earlier work by introducing a method for adaptively pruning the committee. This is essential when applying the algorithm in practice because it dramatically reduces the algorithm's running time and memory consumption. It also makes it possible to efficently race committees corresponding to different chunk sizes. This is important because our empirical results show that the accuracy of the resulting committee can vary significantly with the chunk size. They also show that pruning is indeed crucial to make the method practical for large datasets in terms of running time and memory requirements. Surprisingly, the results demonstrate that pruning can also improve accuracy.

[154]

An experimental speech to graphics system

A. C. Golightly and T. C. Smith. An experimental speech to graphics system. In S. Jones and M. Masoodian, editors, Proc SIGCHI-NZ Symposium on Computer-Human Interaction, pages 91-96, Hamilton, New Zealand, 2002.
[ bib ]
[155]

Benchmarking attribute selection techniques for discrete class data mining

M. Hall and G. Holmes. Benchmarking attribute selection techniques for discrete class data mining. Technical Report 02/02, The University of Waikato, Department of Computer Science, Hamilton, New Zealand, 2002.
[ bib ]
Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building process phase can result in poor predictive performance and increased computation.

Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutation and has led to a situation where very few benchmark studies have been conducted.

This paper presents a benchmark comparison of several attribute selection methods for supervised classification. All the methods produce an attribute ranking, a useful devise for isolating the individual merit of an attribute. Attribute selection is achieved by cross-validating the attribute rankings with respect to a classification learner to find the best attributes. Results are reported for a selection of standard data sets and two diverse learning schemes C4.5 and na�ve Bayes.

[156]

A development environment for predictive modelling in foods

G. Holmes and M. Hall. A development environment for predictive modelling in foods. International Journal of Food Microbiology, 73(2-3):351-362, 2002.
[ bib ]
[157]

A logistic boosting approach to inducing multiclass alternating decision trees

G. Holmes, B. Pfahringer, E. T. Frank, R. B. Kirkby, and M. A. Hall. A logistic boosting approach to inducing multiclass alternating decision trees. Technical Report 01/02, The University of Waikato, Department of Computer Science, Hamilton, New Zealand, 2002.
[ bib ]
The alternating decision tree (ADTree) is a successful classification technique that combine decision trees with the predictive accuracy of boosting into a ser to interpretable classification rules. The original formulation of the tree induction algorithm restricted attention to binary classification problems. This paper empirically evaluates several methods for extending the algorithm to the multiclass case by splitting the problem into several two-class LogitBoost procedure to induce alternating decision trees directly. Experimental results confirm that this procedure is comparable with methods that are based on the original ADTree formulation in accuracy, while inducing much smaller trees.

[158]

Multiclass Alternating Decision Trees

Geoffrey Holmes, Bernhard Pfahringer, Richard Kirkby, Eibe Frank, and Mark Hall. Multiclass alternating decision trees. In T. Elomaa, H. Mannila, and H. Toivonen, editors, Proc 13th European Conference on Machine Learning, volume 2430 of LNCS, pages 161-172. Springer, 2002.
[ bib | .ps | .pdf ]
The alternating decision tree (ADTree) is a successful classification technique that combines decision trees with the predictive accuracy of boosting into a set of interpretable classification rules. The original formulation of the tree induction algorithm restricted attention to binary classification problems. This paper empirically evaluates several wrapper methods for extending the algorithm to the multiclass case by splitting the problem into several two-class problems. Seeking a more natural solution we then adapt the multiclass LogitBoost and AdaBoost.MH procedures to induce alternating decision trees directly. Experimental results confirm that these procedures are comparable with wrapper methods that are based on the original ADTree formulation in accuracy, while inducing much smaller trees.

[159]

Fragment Generation and Support Vector Machines for Inducing SARs

Stefan Kramer, Eibe Frank, and Christoph Helma. Fragment generation and support vector machines for inducing SARs. SAR and QSAR in Environmental Research, 13(5):509-523, 2002.
[ bib | http ]
We present a new approach to the induction of SARs based on the generation of structural fragments and support vector machines (SVMs). It is tailored for bio-chemical databases, where the examples are two-dimensional descriptions of chemical compounds. The fragment generator finds all fragments (i.e. linearly connected atoms) that satisfy user-specified constraints regarding their frequency and generality. In this paper, we are querying for fragments within a minimum and a maximum frequency in the dataset. After fragment generation, we propose to apply SVMs to the problem of inducing SARs from these fragments. We conjecture that the SVMs are particularly useful in this context, as they can deal with a large number of features. Experiments in the domains of carcinogenicity and mutagenicity prediction show that the minimum and the maximum frequency queries for fragments can be answered within a reasonable time, and that the predictive accuracy obtained using these fragments is satisfactory. However, further experiments will have to confirm that this is a viable approach to inducing SARs.

[160]

Modeling for optimal probability prediction

Y. Wang and I.H. Witten. Modeling for optimal probability prediction. In C. Sammut and A. Hoffmann, editors, Proceedings of the Nineteenth International Conference on Machine Learning, San Francisco, California, pages 650-657. Morgan Kaufmann, 2002.
[ bib | .ps | .pdf ]
We present a general modeling method for optimal probability prediction over future observations, in which model dimensionality is determined as a natural by-product. This new method yields several estimators, and we establish theoretically that they are optimal (either overall or under stated restrictions) when the number of free parameters is infinite. As a case study, we investigate the problem of fitting logistic models in finite-sample situations. Simulation results on both artificial and practical datasets are supportive.

[161]

Mixture model clustering for mixed data with missing information

Lynette Hunt and Murray Jorgensen. Mixture model clustering for mixed data with missing information. Computational Statistics & Data Analysis, 41(3–4):429 - 440, 2003.
[ bib | http ]
One difficulty with classification studies is unobserved or missing observations that often occur in multivariate datasets. The mixture likelihood approach to clustering has been well developed and is much used, particularly for mixtures where the component distributions are multivariate normal. It is shown that this approach can be extended to analyse data with mixed categorical and continuous attributes and where some of the data are missing at random in the sense of Little and Rubin (Statistical Analysis with Mixing Data, Wiley, New York).

[162]

Token Identification Using HMM and PPM Models.

Yingying Wen, Ian H. Witten, and Dianhui Wang. Token identification using hmm and ppm models. In Proc 16th Australian Joint Conference on Artificial Intelligence, pages 173-185. Springer, 2003.
[ bib ]
Hidden markov models (HMMs) and prediction by partial matching models (PPM) have been successfully used in language pro- cessing tasks including learning-based token identification. Most of the existing systems are domain- and language-dependent. The power of re- targetability and applicability of these systems is limited. This paper investigates the effect of the combination of HMMs and PPM on to- ken identification. We implement a system that bridges the two well known methods through words new to the identification model. The sys- tem is fully domain- and language-independent. No changes of code are necessary when applying to other domains or languages. The only re- quired input of the system is an annotated corpus. The system has been tested on two corpora and achieved an overall F-measure of 69.02% for TCC, and 76.59% for BIB. Although the performance is not as good as that obtained from a system with language-dependent components, our proposed system has power to deal with large scope of domain- and language-independent problem. Identification of date has the best result, 73% and 92% of correct tokens are identified for two corpora respectively. The system also performs reasonably well on people’s name with correct tokens of 68% for TCC, and 76% for BIB.

[163]

A probabilistic line breaking algorithm

R.R. Bouckaert. A probabilistic line breaking algorithm. In Proc Sixteenth Australian Joint Conference on Artificial Intelligence, 2003.
[ bib | .pdf ]
We show how a probabilistic interpretation of an ill defined problem, the problem of finding line breaks in a paragraph, can lead to an efficient new algorithm that performs well. The graphical model that results from the probabilistic interpretation has the advantage that it is easy to tune due to the probabilistic approach. Furthermore, the algorithm optimizes the probability a break up is acceptable over the whole paragraph, it does not show threshold effects and it allows for easy incorporation of subtle typographical rules. Thanks to the architecture of the Bayesian network, the algorithm is linear in the number of characters in a paragraph. Empirical evidence suggests that this algorithm performs closer to results published through desk top publishing than a number of existing systems.

[164]

Choosing between two learning algorithms based on calibrated tests

R.R. Bouckaert. Choosing between two learning algorithms based on calibrated tests. In T. Fawcett and N. Mishra, editors, Proc Twentieth International Conference on Machine Learning, pages 51-58, Washington, USA, 2003. Morgan Kaufmann.
[ bib | .pdf ]
Designing a hypothesis test to determine the best of two machine learning algorithms with only a small data set available is not a simple task. Many popular tests suffer from low power (5x2 cv), or high Type I error (Weka�s 10x10 cross validation). Furthermore, many tests show a low level of replicability, so that tests performed by different scientists with the same pair of algorithms, the same data sets and the same hypothesis test still may present different results. We show that 5x2 cv, resampling and 10 fold cv suffer from low replicability. The main complication is due to the need to use the data multiple times. As a consequence, independence assumptions for most hypothesis tests are violated. In this paper, we pose the case that reuse of the same data causes the effective degrees of freedom to be much lower than theoretically expected. We show how to calibrate the effective degrees of freedom empirically for various tests. Some tests are not calibratable, indicating another flaw in the design. However the ones that are calibratable all show very similar behavior. Moreover, the Type I error of those tests is on the mark for a wide range of circumstances, while they show a power and replicability that is a considerably higher than currently popular hypothesis tests.

[165]

Choosing learning algorithms using sign tests with high replicability

R.R. Bouckaert. Choosing learning algorithms using sign tests with high replicability. In Proc Sixteenth Australian Joint Conference on Artificial Intelligence, 2003.
[ bib | .ps ]
An important task in machine learning is determining which learning algorithm works best for a given data set. When the amount of data is small the same data needs to be used repeatedly in order to get a reasonable estimate of the accuracy of the learning algorithms. This results in violations of assumptions on which standard tests are based and makes it hard to design a good test.

In this article, we investigate sign tests to address the problem of choosing the best of two learning algorithms when only a small data set is available. Sign tests are conceptually simple and no assumption about underlying distributions is required. We show that simplistic sample generation can lead to flawed test outcomes. Furthermore, we identify a test that performs well based on Type I error (showing a difference between algorithms when there is none), power (showing a difference when it indeed exists) and replicability.

Replicability is a novel measure of a quality of a test that gives an indication of how likely it is that the test outcome will be the same when the same test on the same data with the same sampling scheme and same pair of algorithms is executed, but with a difference randomization of the data. A new definition of replicability is provided and its benefits highlighted. A new definition of replicability is provided and its benefits highlighted. Empirical evidence is provided to show the test is robust under a varied range of circumstances.

[166]

Applying Propositional Learning Algorithms to Multi-instance data

Eibe Frank and Xin Xu. Applying propositional learning algorithms to multi-instance data. Technical Report 06/03, Department of Computer Science, University of Waikato, 2003.
[ bib | .ps.gz | .pdf ]
Multi-instance learning is commonly tackled using special-purpose algorithms. Development of these algorithms has started because early experiments with standard propositional learners have failed to produce satisfactory results on multi-instance data-more specifically, the Musk data. In this paper we present evidence that this is not necessarily the case. We introduce a simple wrapper for applying standard propositional learners to multi-instance problems and present empirical results for the Musk data that are competitive with genuine multi-instance algorithms. The key features of our new wrapper technique are: (1) it discards the standard multi-instance assumption that there is some inherent difference between positive and negative bags, and (2) it introduces weights to treat instances from different bags differently. We show that these two modifications are essential for producing good results on the Musk benchmark datasets.

[167]

Locally Weighted Naive Bayes

Eibe Frank, Mark Hall, and Bernhard Pfahringer. Locally weighted naive Bayes. In U. Kjaerulff and C. Meek, editors, Proc 19th Conference in Uncertainty in Artificial Intelligence, Acapulco, Mexico, pages 249-256. Morgan Kaufmann, 2003. Also published as Working Paper 04/03, Department of Computer Science, The University of Waikato, Hamilton.
[ bib | .ps.gz | .pdf ]
Despite its simplicity, the naive Bayes classifier has surprised machine learning researchers by exhibiting good performance on a variety of learning problems. Encouraged by these results, researchers have looked to overcome naive Bayes' primary weakness-attribute independence-and improve the performance of the algorithm. This paper presents a locally weighted version of naive Bayes that relaxes the independence assumption by learning local models at prediction time. Experimental results show that locally weighted naive Bayes rarely degrades accuracy compared to standard naive Bayes and, in many cases, improves accuracy dramatically. The main advantage of this method compared to other techniques for enhancing naive Bayes is its conceptual and computational simplicity.

[168]

Predicting Library of Congress Classifications from Library of Congress Subject Headings

E. T. Frank and G. W. Paynter. Predicting library of congress classifications from library of congress subject headings. Technical Report 01/03, The University of Waikato, Department of Computer Science, Hamilton, New Zealand, 2003.
[ bib | .ps | .pdf ]
This paper addresses the problem of automatically assigning a Library of Congress Classification (LCC) to work given its set of Library of Congress Subject Headings (LCSH). LCC are organized in a tree: the root node of this hierarchy comprises all possible topics, and leaf nodes correspond to the most specialized topic areas defined. We describe a procedure that, given a resource identified by its LCSH, automatically places that resource in the LCC hierarchy. The procedure uses machine learning techniques and training data from a large library catalog to learn a classification model mapping from sets of LCSH to nodes in the LCC tree. We present empirical results for our technique showing its accuracy on an independent collection of 50,000 LCSH/LCC pairs.

[169]

Visualizing Class Probability Estimators

Eibe Frank and Mark Hall. Visualizing class probability estimators. In N. Lavrac and et al., editors, Proc 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, volume 2838 of LNCS, pages 168-179. Springer, 2003. Also published as Working Paper 02/03, Department of Computer Science, The University of Waikato, Hamilton.
[ bib | .ps.gz | .pdf ]
Inducing classifiers that make accurate predictions on future data is a driving force for research in inductive learning. However, also of importance to the users is how to gain information from the models produced. Unfortunately, some of the most powerful inductive learning algorithms generate black boxes�that is, the representation of the model makes it virtually impossible to gain any insight into what has been learned. This paper presents a technique that can help the user understand why a classifier makes the predictions that it does by providing a two-dimensional visualization of its class probability estimates. It requires the classifier to generate class probabilities but most practical algorithms are able to do so (or can be modified to this end).

[170]

Benchmarking Attribute Selection Techniques for Discrete Class Data Mining

Mark Andrew Hall and Geoffrey Holmes. Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering, 15(3):1437-1447, November/December 2003.
[ bib | .pdf ]
Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building process phase can result in poor predictive performance and increased computation.

Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutations and has led to a situation where very few benchmark studies have been conducted.

This paper presents a benchmark comparison of several attribute selection methods for supervised classification. All the methods produce an attribute ranking, a useful devise for isolating the individual merit of an attribute. Attribute selection is achieved by cross-validating the attribute rankings with respect to a classification learner to find the best attributes. Results are reported for a selection of standard data sets and two diverse learning schemes C4.5 and naive Bayes.

[171]

Mining data streams using option trees

G. Holmes, B. Pfahringer, and R. B. Kirkby. Mining data streams using option trees. Technical Report 08/03, The University of Waikato, Department of Computer Science, Hamilton, New Zealand, 2003.
[ bib ]
The data stream model for data mining places harsh restrictions on a learning algorithm. A model must be induced following the briefest interrogation of the data, must use only available memory and must update itself over time within these constraints. Additionally, the model must be able to be used for data mining at any point in time. This paper describes a data stream classification algorithm using an ensemble of option trees. The ensemble of trees is induced by boosting and iteratively combined into a single interpretable model. The algorithm is evaluated using benchmark datasets for accuracy against state-of-the-art algorithms that make use of the entire dataset.

[172]

Logistic Model Trees

Niels Landwehr, Mark Hall, and Eibe Frank. Logistic model trees. In N. Lavrac and et al., editors, Proc 14th European Conference on Machine Learning, volume 2837 of LNCS, pages 241-252, Cavtat-Dubrovnik, Croatia, 2003. Springer.
[ bib | .ps | .pdf ]
Tree induction methods and linear models are popular techniques for supervised learning tasks, both for the prediction of nominal classes and continuous numeric values. For predicting numeric quantities, there has been work on combining these two schemes into �model trees�, i.e. trees that contain linear regression functions at the leaves. In this paper, we present an algorithm that adapts this idea for classification problems, using logistic regression instead of linear regression. We use a stagewise fitting process to construct the logistic regression models that can select relevant attributes in the data in a natural way, and show how this approach can be used to build the logistic regression models at the leaves by incrementally refining those constructed at higher levels in the tree. We compare the performance of our algorithm against that of decision trees and logistic regression on 32 benchmark UCI datasets, and show that it achieves a higher classification accuracy on average than the other two methods.

[173]

The ability for fourier transform infrared spectroscopy to classify persimmon genotypes by epicuticular leaf waxes

A.D. Mowat and G. Holmes. The ability for fourier transform infrared spectroscopy to classify persimmon genotypes by epicuticular leaf waxes. In Proc Proc Second International Persimmon Symposium, ISHS Acta Horticulturae 601, pages 65-69, Queensland, Australia, 2003.
[ bib ]
The ability of Fourier Transform Infrared (FTIR) spectroscopic measurements of epicuticular leaf waxes to classify five non-astringent persimmon cultivars ( Diospyros kaki L. cv. Fuyu, cv. Matsumoto wase Fuyu, cv. Hanna Fuyu, cv. Oku Gosho and cv. II-IQ-12) was investigated. Principal component analysis was used to reduce the FTIR spectral data to 20 variables. Machine learning models classified the cultivars into five groups. Three of the cultivars, Oku Gosho, Hanna Fuyu and II-IQ-12 were readily distinguished from each other and from Fuyu and Matsumoto wase Fuyu. However, the leaf wax data was unable to distinguished between two closely related persimmon genotypes, Fuyu and the early maturing L1 mutation of Fuyu, Matsumoto wase Fuyu. Here, epicuticular wax biosynthesis appears unaffected by the LI mutation found in Matsumoto wase Fuyu. When the classification was based on four groups, where Fuyu and Matsumoto wase Fuyu samples were combined, the best performing machine learning model correctly classified 98.7% of the samples.

[174]

Propositionalization through Stochastic Discrimination

B. Pfahringer and G. Holmes. Propositionalization through stochastic discrimination. In Proceedings of the Work-in-Progress Track at the 13th International Conference on Inductive Logic Programming, pages 60-68. Department of Informatics, University of Szeged, Hungary, September 2003.
[ bib | .ps | .pdf ]
A simple algorithm based on the theory of stochastic discrimination is developed for the fast extraction of sub-graphs with potential discriminative power from a given set of pre-classified graphs. A preliminary experimental evaluation indicates the potential of the approach. Limitations are discussed as well as directions for future research.

[175]

Text Categorisation Using Document Profiling

Maximilien Sauban and Bernhard Pfahringer. Text categorisation using document profiling. In N Lavrac, editor, Proc 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, volume 2838 of LNCS, pages 411-422. Springer, 2003.
[ bib | .ps | .pdf ]
This paper presents an extension of prior work by Michael D. Lee on psychologically plausible text categorisation. Our approach utilises Lee�s model as a pre-processing filter to generate a dense representation for a given text document (a document profile) and passes that on to an arbitrary standard propositional learning algorithm. Similarly to standard feature selection for text classification, the dimensionality of instances is drastically reduced this way, which in turn greatly lowers the computational load for the subsequent learning algorithm. The filter itself is very fast as well, as it basically is just an interesting variant of Naive Bayes. We present different variations of the filter and conduct an evaluation against the Reuters-21578 collection that shows performance comparable to previously published results on that collection, but at a lower computational cost.

[176]

All of MEDLINE indexed to the Gene Ontology

A. C. Smith and J. G. Cleary. All of medline indexed to the gene ontology. In Proc Sixth Annual Bio-Ontologies Meeting, pages 7-11, Brisbane, Australia, 2003.
[ bib ]
Most of what is known about genes and proteins is contained in the biomedical literature. The problem for the biologist is how to connect novel sequence data to relevant published documents. One way is to BLAST the sequence and then follow the literature links established in genomic/proteomic databases for known sequences with similar structure. Another way is to find the closely-matching genes or proteins in the Gene Ontology, then retrieve documents associated with GO terms. The advantage of this approach is that it provides a conceptual context for discovering possible genetic roles or molecular functions for a new sequence. The problems with both search strategies, however, is that they return only a small portion of the available literature. We are solving this problem by amplifying the available documents associated with GO terms to cover the entirety of the MEDLINE corpus.

[177]

A Two-Level Learning Method for Generalized Multi-instance Problems

Nils Weidmann, Eibe Frank, and Bernhard Pfahringer. A two-level learning method for generalized multi-instance problems. In N. Lavrac and et al., editors, Proc 14th European Conference on Machine Learning, Cavtat-Dubrovnik, Croatia, pages 468-479. Springer, 2003.
[ bib | .ps.gz | .pdf ]
In traditional multi-instance (MI) learning, a single positive instance in a bag produces a positive class label. Hence, the learner knows how the bag�s class label depends on the labels of the instances in the bag and can explicitly use this information to solve the learning task. In this paper we investigate a generalized view of the MI problem where this simple assumption no longer holds. We assume that an �interaction� between instances in a bag determines the class label. Our two-level learning method for this type of problem transforms an MI bag into a single meta-instance that can be learned by a standard propositional method. The meta-instance indicates which regions in the instance space are covered by instances of the bag. Results on both artificial and real-world data show that this two-level classification approach is well suited for generalized MI problems.

[178]

Statistical learning in multiple instance problem

Xin Xu. Statistical learning in multiple instance problem. Master's thesis, University of Waikato, Hamilton, NZ, 2003. 0657.594.
[ bib | http | .ps.gz ]
Multiple instance (MI) learning is a relatively new topic in machine learning. It is concerned with supervised learning but differs from normal supervised learning in two points: (1) it has multiple instances in an example (and there is only one instance in an example in standard supervised learning), and (2) only one class label is observable for all the instances in an example (whereas each instance has its own class label in normal supervised learning). In MI learning there is a common assumption regarding the relationship between the class label of an example and the unobservable class labels of the instances inside it. This assumption, which is called the MI assumption in this thesis, states that An example is positive if at least one of its instances is positive and negative otherwise.

In this thesis, we first categorize current MI methods into a new framework. According to our analysis, there are two main categories of MI methods, instancebased and metadata-based approaches. Then we propose a new assumption for MI learning, called the collective assumption. Although this assumption has been used in some previous MI methods, it has never been explicitly stated, and this is the first time that it is formally specified. Using this new assumption we develop new algorithms - more specifically two instance-based and one metadata-based methods. All of these methods build probabilistic models and thus implement statistical learning algorithms. The exact generative models underlying these methods are explicitly stated and illustrated so that one may clearly understand the situations to which they can best be applied. The empirical results presented in this thesis show that they are competitive on standard benchmark datasets. Finally, we explore some practical applications of MI learning, both existing and new ones.

This thesis makes three contributions: a new framework for MI learning, new MI methods based on this framework and experimental results for new applications of MI learning.

[179]

Unsupervised learning from incomplete data using a mixture model approach

Lynette Hunt and Murray Jorgensen. Unsupervised learning from incomplete data using a mixture model approach. In Statistical Data Mining and Knowledge Discovery, pages 173-188. Chapman & Hall/CRC, 2004.
[ bib ]
Many unsupervised learning tasks involve high dimensional data sets where some of the attributes are continuous and some are categorical. One possible approach to clustering such data is to assume that the data to be clustered arise from a finite mixture of populations. The mixture likelihood approach has been well developed and much used, especially for mixtures where the component distributions are multivariate normal. Hunt and Jorgensen [17] presented methodology that enabled the clustering of mixed categorical and continuous data using a mixture approach. However many multivariate data sets encountered also contain unobserved or missing values. In this paper we demonstrate that the mixture approach to clustering can handle mixed categorical and continuous data where data are missing at random in the sense of Little and Rubin [30]. The methodology is illustrated by clustering two data sets.

[180]

Experiments in predicting biodegradability

H. Blockeel, S. Dzeroski, B. Kompare, S. Kramer, B. Pfahringer, and W. Van Laer. Experiments in predicting biodegradability. Journal of Applied Artificial Intelligence, 18(2), Feburary 2004.
[ bib | .ps | .pdf ]
This paper is concerned with the use of AI techniques in ecology. More specifically, we present a novel application of inductive logic programming (ILP) in the area of quantitative structure-activity relationships (QSARs). The activity we want to predict is the biodegradability of chemical compounds in water. In particular, the target variable is the half-life for aerobic aqueous biodegradation. Structural descriptions of chemicals in terms of atoms and bonds are derived from the chemicals' SMILES encodings. Definition of substructures are used as background knowledge. Predicting biodegradability is essentially a regression problem, but we also consider a discretized version of the target variable. We thus employ a number of relational classification and regression methods on the relational representation and compare these to propositional methods applied to different propositionalizations of the problem. We also experiment with the prediction technique that consists of merging upper and lower bound predictions into one prediction. Some conclusions are drawn concerning the applicability of machine learning systems and the merging technique in this domain and concerning the evaluation of hypotheses.

[181]

Bayesian network classifiers in Weka

R. Bouckaert. Bayesian network classifiers in weka. Technical Report 14/2004, The University of Waikato, Department of Computer Science, Hamilton, New Zealand, 2004.
[ bib | .pdf ]
Various Bayesian network classifier learning algorithms are implemented in Weka. This note provides some user documentation and implementation details. Summary of the main capabilities:

(a) Structure learning of Bayesian networks using various hill climbing (K2, B, etc) and general purpose (simulated annealing, tabu search) algorithms.
(b) Local score metrics implemented; Bayes, BDe, MDL, entropy, AIC.
(c) Global score metrics implemented; leave one out cv, k-fold cv and cumulative cv.
(d) Conditional independence based causal recovery algorithm available.
(e) Parameter estimation using direct estimates and Bayesian model averaging.
(f) GUI for easy inspection of Bayesian networks.
(g) Part of Weka allowing systematic experiments to compare Bayes net performance with general purpose classifiers like C4.5, nearest neighbor, support vector, etc.
(h) Source code available under GPL allows for integration in other systems and makes it easy to extend.

[182]

Estimating replicability of classifier learning

R. Bouckaert. Estimating replicability of classifier learning. In Proc International Conference on Machine Learning, 2004.
[ bib | .ps | .pdf ]
Replicability of machine learning experiments measures how likely it is that the outcome of one experiment is repeated when performed with a different randomization of the data. In this paper, we present an estimator or replicability of an experiment that is efficient. More precisely, the estimator is unbiased and has lowest variance in the class of estimators formed by a linear combination of outcomes of experiments on a given data set.

We gathered empirical data for comparing experiments consisting of different sampling schemes and hypothesis tests. Both factors are shown to have an impact on replicability of experiments. The data suggests that sign tests should not be used due to low replicability. Ranked sum tests show better performance, but the combination of a sorted runs sampling scheme with a t-test gives the most desirable performance judged on Type I and II error and replicability.

[183]

Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms

Remco R. Bouckaert and Eibe Frank. Evaluating the replicability of significance tests for comparing learning algorithms. In Proc 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, volume 3056 of LNAI, pages 3-12, Sydney, Australia, 2004. Springer.
[ bib | .ps | .pdf ]
Empirical research in learning algorithms for classification tasks generally requires the use of significance tests. The quality of a test is typically judged on Type I error (how often the test indicates a difference when it should not) and Type II error (how often it indicates no difference when it should). In this paper we argue that the replicability of a test is also of importance. We say that a test has low replicability if its outcome strongly depends on the particular random partitioning of the data that is used to perform it. We present empirical measures of replicability and use them to compare the performance of several popular tests in a realistic setting involving standard learning algorithms and benchmark datasets. Based on our results we give recommendations on which test to use.

[184]

Naive Bayes classifiers that perform well with continuous variables

R. Bouckaert. Naive bayes classifiers that perform well with continuous variables. In G.I. Webb and X. Yu, editors, Proc Seventeenth Australian Joint Conference on Artificial Intelligence (AI 2004), Advances in Artificial Intelligence, volume 3339 of LNAI, pages 1089-1094, Cairns, Australia, December 2004. Springer.
[ bib ]
There are three main methods for handling continuous variables in naive Bayes classifiers, namely, the normal method (parametric approach), the kernel method (non parametric approach) and discretization. In this article, we perform a methodologically sound comparison of the three methods, which shows large mutual differences of each of the methods and no single method being universally better. This suggests that a method for selecting one of the three approaches to continuous variables could improve overall performance of the naive Bayes classifier. We present three methods that can be implemented efficiently v-fold cross validation for the normal, kernel and discretization method. Empirical evidence suggests that selection using 10 fold cross validation (especially when repeated 10 times) can largely and significantly improve over all performance of naive Bayes classifiers and consistently outperform any of the three popular methods for dealing with continuous variables on their own. This is remarkable, since selection among more classifiers does not consistently result in better accuracy.

[185]

Data mining in bioinformatics using Weka

Eibe Frank, Mark Hall, Leonard E. Trigg, Geoffrey Holmes, and Ian H. Witten. Data mining in bioinformatics using Weka. Bioinformatics, 20(15):2479-2481, April 2004.
[ bib | http | .ps | .pdf ]
The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering, and feature selection-common data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data pre-processing methods complemented by graphical user interfaces for data exploration and the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it.

[186]

Predicting Library of Congress classifications from Library of Congress subject headings

Eibe Frank and Gordon W. Paynter. Predicting library of congress classifications from library of congress subject headings. JASIST, 55(3):214-227, 2004. Also published as Working Paper 01/2003, Department of Computer Science, The University of Waikato, Hamilton.
[ bib | .ps | .pdf ]
This paper addresses the problem of automatically assigning a Library of Congress Classification (LCC) to work given its set of Library of Congress Subject Headings (LCSH). LCC are organized in a tree: the root node of this hierarchy comprises all possible topics, and leaf nodes correspond to the most specialized topic areas defined. We describe a procedure that, given a resource identified by its LCSH, automatically places that resource in the LCC hierarchy. The procedure uses machine learning techniques and training data from a large library catalog to learn a classification model mapping from sets of LCSH to nodes in the LCC tree. We present empirical results for our technique showing its accuracy on an independent collection of 50,000 LCSH/LCC pairs.

[187]

Ensembles of nested dichotomies for multi-class problems

Eibe Frank and Stefan Kramer. Ensembles of nested dichotomies for multi-class problems. In R. Greiner and D. Schuurmans, editors, Proc 21st International Conference on Machine Learning, pages 305-312, Banff, Canada, 2004. ACM Press. Also published as Working Paper 06/2004, Department of Computer Science, The University of Waikato, Hamilton.
[ bib | .ps.gz | .pdf ]
Nested dichotomies are a standard statistical technique for tackling certain polytomous classification problems with logistic regression. They can be represented as binary trees that recursively split a multi-class classification task into a system of dichotomies and provide a statistically sound way of applying two-class learning algorithms to multi-class problems (assuming these algorithms generate class probability estimates). However, there are usually many candidate trees for a given problem and in the standard approach the choice of a particular tree is based on domain knowledge that may not be available in practice. An alternative is to treat every system of nested dichotomies as equally likely and to form an ensemble classifier based on this assumption. We show that this approach produces more accurate classifications than applying C4.5 and logistic regression directly to multi-class problems. Our results also show that ensembles of nested dichotomies produce more accurate classifiers than pairwise classification if both techniques are used with C4.5, and comparable results for logistic regression. Compared to error-correcting output codes, they are preferable if logistic regression is used, and comparable in the case of C4.5. An additional benefit is that they generate class probability estimates. Consequently they appear to be a good general-purpose method for applying binary classifiers to multi-class problems.

[188]

An Instrument Control System Using Predictive Modelling

Geoffrey Holmes and Dale Fletcher. An instrument control system using predictive modelling. In H. Araujo and et al., editors, Proceedings of the First International Conference on Informatics in Control, Automation and Robotics, pages 292-295, Setubal, Portugal, 2004. INSTICC Press.
[ bib ]
We describe a system for providing early warning of possible error to an operator in control of an instrument providing results in batches from samples, for example, chemical elements found in soil or water samples. The system has the potential to be used with any form of instrument that provides multiple results for a given sample. The idea is to train models for each measurement, using historical data. The set of trained models are then capable of making predictions on new data based on the values of the other measurements. This approach has the potential to uncover previously unknown relationships between the measurements. An example application has been constructed that highlights the difference of the actual value for a measurement from its predicted value. The operator is provided with sliders to attenuate the sensitivity of the measurement perhaps based on its importance or its known sensitivity.

[189]

Mining data streams using option trees (revised edition, 2004)

G. Holmes, R. Kirkby, and B. Pfahringer. Mining data streams using option trees (revised edition, 2004). Technical Report 03/2004, The University of Waikato, Department of Computer Science, Hamilton, New Zealand, 2004.
[ bib | .pdf ]
The data stream model for data mining places harsh restrictions on a learning algorithm. A model must be induced following the briefest interrogation of the data, must use only available memory and must update itself over time within these constraints. Additionally, the model must be able to be used for data mining at any point in time. This paper describes a data stream classification algorithm using an ensemble of option trees. The ensemble of trees is induced by boosting and iteratively combined into a single interpretable model. The algorithm is evaluated using benchmark datasets for accuracy against state-of-the-art algorithms that make use of the entire dataset.

[190]

Mining data streams using option trees

G. Holmes, R. Kirkby, and B. Pfahrinnger. Mining data streams using option trees. In J. Aguilar-Ruiz and J. Gama, editors, Proc Workshop W10 on Knowledge Discovery in Data Streams, Fifthteenth Conference on Machine Learning, Eighth European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pages 75-84, Pisa, Italy, 2004.
[ bib ]
The data stream model for data mining places harsh restrictions on a learning algorithm. A model must be induced following the briefest interrogation of the data, must use only available memory and must update itself over time within these constraints. Additionally, the model must be able to be used for data mining at any point in time. This paper describes a data stream classification algorithm using an ensemble of option trees. This ensemble is iteratively combined into a single interpretable model. The algorithm is evaluated using benchmark datasets for accuracy against state-of-the-art algorithms that make use of the entire dataset.

[191]

Multinomial Naive Bayes for Text Categorization Revisited

Ashraf M. Kibriya, Eibe Frank, Bernhard Pfahringer, and Geoffrey Holmes. Multinomial naive Bayes for text categorization revisited. In G. I. Webb and X. Yu, editors, Proc 17th Australian Joint Conference on Artificial Intelligence, volume 3339 of LNAI, pages 488-499, Cairns, Australia, 2004. Springer.
[ bib | .ps.gz | .pdf ]
This paper presents empirical results for several versions of the multinomial naive Bayes classifier on four text categorization problems, and a way of improving it using locally weighted learning. More specifically, it compares standard multinomial naive Bayes to the recently proposed transformed weight-normalized complement naive Bayes classifier (TWCNB) [1], and shows that some of the modifications included in TWCNB may not be necessary to achieve optimum performance on some datasets. However, it does show that TFIDF conversion and document length normalization are important. It also shows that support vector machines can, in fact, sometimes very significantly outperform both methods. Finally, it shows how the performance of multinomial naive Bayes can be improved using locally weighted learning. However, the overall conclusion of our paper is that support vector machines are still the method of choice if the aim is to maximize accuracy.

[192]

Clustering Large Datasets Using Cobweb and K-Means in Tandem

Mi Li, Geoffrey Holmes, and Bernhard Pfahringer. Clustering large datasets using cobweb and k-means in tandem. In G. I. Webb and X. Yu, editors, Proc 17th Australian Joint Conference on Artificial Intelligence, volume 3339 of LNAI, pages 368-379, Cairns, Australia, 2004. Springer.
[ bib ]
This paper presents a single scan algorithm for clustering large datasets based on a two phase process which combines two well known clustering methods. The Cobweb algorithm is modified to produce a balanced tree with subclusters at the leaves, and then K-means is applied to the resulting subclusters. The resulting method, Scalable Cobweb, is then compared to a single pass K-means algorithm and standard K-means. The evaluation looks at error as measured by the sum of squared error and vulnerability to the order in which data points are processed.

[193]

Flow clustering using machine learning techniques

A. McGregor, M. Hall, P. Lorier, and J. Brunskill. Flow clustering using machine learning techniques. In C. Barakat and I. Pratt, editors, Proc Fifth International Workshop on Passive and Active Network Measurement (PAM 2004), volume 3015 of LNCS, pages 205-214, Antibes Juan-les-Pins, France, 2004. Springer.
[ bib ]
Packet header traces are widely used in network analysis. Header traces are the aggregate of traffic from many concurrent applications. We present a methodology, based on machine learning, that can break the trace down into clusters of traffic where each cluster has different traffic characteristics. Typical clusters include bulk transfer, single and multiple transactions and interactive traffic, amongst others. The paper includes a description of the methodology, a visualisation of the attribute statistics that aids in recognising cluster types and a discussion of the stability and effectiveness of the methodology.

[194]

Using Classification to Evaluate the Output of Confidence-Based Association Rule Mining

Stefan Mutter, Mark Hall, and Eibe Frank. Using classification to evaluate the output of confidence-based association rule mining. In G. I. Webb and X. Yu, editors, Proc 17th Australian Joint Conference on Artificial Intelligence, volume 3339 of LNAI, pages 538-549, Cairns, Australia, 2004. Springer.
[ bib | .ps.gz | .pdf ]
Association rule mining is a data mining technique that reveals interesting relationships in a database. Existing approaches employ different parameters to search for interesting rules. This fact and the large number of rules make it difficult to compare the output of confidence-based association rule miners. This paper explores the use of classification performance as a metric for evaluating their output. Previous work on forming classifiers from association rules has focussed on accurate classification, whereas we concentrate on using the properties of the resulting classifiers as a basis for comparing confidence-based association rule learners. Therefore, we present experimental results on 12 UCI datasets showing that the quality of small rule sets generated by Apriori can be improved by using the predictive Apriori algorithm. We also show that CBA, the standard method for classification using association rules, is generally inferior to standard rule learners concerning both running time and size of rule sets.

[195]

Applying machine learning to programming by demonstration

Gordon W. Paynter and Ian H. Witten. Applying machine learning to programming by demonstration. J. Exp. Theor. Artif. Intell., 16(3):161-188, 2004.
[ bib ]
'Familiar' is a tool that helps end-users automate iterative tasks in their applications by showing examples of what they want to do. It observes the user's actions, predicts what they will do next, and then offers to complete their task. Familiar learns in two ways. First, it creates a model, based on data gathered from training tasks, that selects the best prediction from among several candidates. Experiments show that decision trees outperform heuristic methods, and can be further improved by incrementally updating the classifier at task time. Second, it uses decision stumps inferred from analogous examples in the event trace to predict the parameters of conditional rules. Because data is sparse-for most users balk at giving more than a few training examples-permutation tests are used to calculate the statistical significance of each stump, successfully eliminating bias towards attributes with many different values.

[196]

A Toolbox for Learning from Relational Data with Propositional and Multi-instance Learners

Peter Reutemann, Bernhard Pfahringer, and Eibe Frank. A toolbox for learning from relational data with propositional and multi-instance learners. In G. I. Webb and X. Yu, editors, Proc 17th Australian Joint Conference on Artificial Intelligence, volume 3339 of LNAI, pages 1017-1023, Cairns, Australia, 2004. Springer.
[ bib | .ps.gz | .pdf ]
Most databases employ the relational model for data storage. To use this data in a propositional learner, a propositionalization step has to take place. Similarly, the data has to be transformed to be amenable to a multi-instance learner. The Proper Toolbox contains an extended version of RELAGGS, the Multi-Instance Learning Kit MILK, and can also combine the multi-instance data with aggregated data from RELAGGS. RELAGGS was extended to handle arbitrarily nested relations and to work with both primary keys and indices. For MILK the relational model is flattened into a single table and this data is fed into a multi-instance learner. REMILK finally combines the aggregated data produced by RELAGGS and the multi-instance data, flattened for MILK, into a single table that is once again the input for a multi-instance learner. Several well-known datasets are used for experiments which highlight the strengths and weaknesses of the different approaches.

[197]

Computer Aided Software Engineering (CASE), empowered by Natural Language Processing (NLP) for object oriented analysis and design using Unified Modeling Language (UML)

T. C. Smith and R. T. Giganto. Computer aided software engineering (case), empowered by natural language processing (nlp) for object oriented analysis and design using unified modeling language (uml). In Proc Second Mindanao Conference on Information Technology Education (MCITE'04), pages 19-23, Davao City, Philippines, 2004.
[ bib ]
Computer Aided Software Engineering (CASE) tools are often regarded as the ultimate aide for software engineers in developing software. However, studies show that CASE tools are not enough to assist developers fully in their analysis tasks. Natural Language Processing (NLP) has been empowering these CASE tools to at least aid developers in the analysis phases. Still, NLP-based CASE tools have certain weaknesses when it comes to detecting classes or objects in the system. With the advent of object oriented analysis and design (OOAD), Unified Modeling Language (UML) has come to be the preferred medium for modeling systems. Existing NLP-based CASE tools produce object models in UML; however, unnecessary classes are often created in the process, while other object models are not produced at all. This paper introduces a research motivation aimed at designing a new NLP-based CASE system that minimizes generation of unnecessary classes and expresses object models in UML.

[198]

A text-classification approach to the prediction and characterization of signal peptides

T. C. Smith. A text-classification approach to the prediction and characterization of signal peptides. In Proc X Convencion Internacional y Feria, Libro Reumen, Informatica 2004, pages 361-362, La Habana, Cuba, 2004.
[ bib ]
This paper describes an unconventional machine learning approach to predicting signal peptide cleavage points through a kind of pseudo-text classification procedure. It is based on the view that the first amino residue in the mature protein can be characterized with a textual summary of some of its specific properties. By creating a textual description of every amino residue-that is, an account of such things as the chemical properties of the residue, its context in the amino-acid chain, its relative proximity to the amino terminus, and so forth-a text classification algorithm can infer a characteristic model of those textual features that imply whether or not any one residue is or is not the start of the mature protein. More to the point, insofar as a biologist might give an account, in English, as to why some particular residue in the amino-acid chain actually marks the cleavage site, a ranked list of the textual attributes inferred by a text classification algorithm can do the same.

[199]

Data mining bread quality and process data in a plant bakery

A.J. Wilson, M.P. Morgenstern, B. Pfahringer, and C. Leschi. Data mining bread quality and process data in a plant bakery. In Proc Twelfth ICC Cereal and Bread Congress, Harrogate, UK, May 2004.
[ bib | .ps | .pdf ]
In modern automated plant bakeries a large amount of data is collected on the operation of the plant. When this data is combined with product quality data such as loaf colour, appearance, consumer complaints sales data etc, it has the potential to be used to improve processing efficiency, final product quality, and product marketability. However the huge volume of this data means it is often ignored as being too hard to analyse in any meaningful way. Data mining, which is a combination of techniques that produces information from large data sets, has the potential to be applied to this data to extract useful information. This paper describes our experience of applying data mining techniques to a plant bakery in New Zealand. The process involved setting up the systems required to extract data from the bakeries SCADA system, setting up sensors to automatically measure and record quality parameters, cleaning the data to remove faulted or anomalous results and then combining all the separate data blocks into one complete database for analysis. Data were analysed at two levels. Firstly, selected data were analysed for simple trends on an individual loaf basis which served to identify variability caused by divider pockets, tin positioning etc. Secondly data mining techniques such as various classifiers and principal components were applied to the whole data set to find relationships between process data and product quality.

[200]

Text mining in a digital library

Ian H. Witten, Katherine J. Don, Michael Dewsnip, and Valentin Tablan. Text mining in a digital library. Int. J. on Digital Libraries, 4(1):56-59, 2004.
[ bib ]
Digital librarians strive to add value to the collections they create and maintain. One way is through selectivity: a carefully chosen set of authoritative documents in a particular topic area is far more useful to those working in the area than a huge, unfocused collections (like the Web). Another is by augmenting the collection with high-quality metedata, which supports activities of searching and browsing in a uniform and useful way. A third way, and our topic here, is to enrich the documents by examining their content, extracting information, and using it to enhance the ways they can be located and presented.

[201]

Adaptive text mining: inferring structure from sequences

Ian H. Witten. Adaptive text mining: inferring structure from sequences. J. Discrete Algorithms, 2(2):137-159, 2004.
[ bib ]
Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tacked adaptively.

[202]

Logistic Regression and Boosting for Labeled Bags of Instances

Xin Xu and Eibe Frank. Logistic regression and boosting for labeled bags of instances. In H. Dai, R. Srikant, and C. Zhang, editors, Proc 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, volume 3056 of LNAI, pages 272-281, Sydney, Australia, 2004. Springer.
[ bib | .ps | .pdf ]
In this paper we upgrade linear logistic regression and boosting to multi-instance data, where each example consists of a labeled bag of instances. This is done by connecting predictions for individual instances to a bag-level probability estimate by simple averaging and maximizing the likelihood at the bag level-in other words, by assuming that all instances contribute equally and independently to a bag�s label. We present empirical results for artificial data generated according to the underlying generative model that we assume, and also show that the two algorithms produce competitive results on the Musk benchmark datasets.

[203]

A novel two stage scheme utilizing the test set for model selection in text classification

Bernhard Pfahringer, Peter Reutemann, and Mike Mayo. A novel two stage scheme utilizing the test set for model selection in text classification. In Ranadhir Ghosh, Brijesh Verma, and Xue Li, editors, Proc Workshop on Learning Algorithms for Pattern Recognition, Eighteenth Australian Joint Conference on Artificial Intelligence (AI'05), pages 60-65, Sydney, Australia, 2005. University of Technology. 5-9 December 2005.
[ bib | .pdf ]
Text classification is a natural application domain for semi- supervised learning, as labeling documents is expensive, but on the other hand usually an abundance of unlabeled documents is available. We describe a novel simple two- stage scheme based on dagging which allows for utilizing the test set in model selection. The dagging ensemble can also be used by itself instead of the original classifier. We evaluate the performance of a meta classifier choosing between various base learners and their respective dagging ensembles. The selection process seems to perform robustly especially for small percentages of available labels for training.

[204]

Tie Breaking in Hoeffding trees

Geoffrey Holmes, Richard Kirkby, and Bernhard Pfahringer. Tie breaking in hoeffding trees. In J. Gama and J. S. Aguilar-Ruiz, editors, Proc Workshop W6: Second International Workshop on Knowledge Discovery in Data Streams, pages 107-116, 2005.
[ bib ]
[205]

Cache hierarchy inspired compression: a novel architecture for data streams

Geoffrey Holmes, Bernhard Pfahringer, and Richard Kirkby. Cache hierarchy inspired compression: a novel architecture for data streams. In Narayanan Kulathuramaiyer, Alvin W. Yeo, Wang Yin Chai, and Tan Chong Eng, editors, Proc Fourth International Conference on Information Technology in Asia (CITA'05), pages 130-136, 2005. 12-15 December 2005.
[ bib | .pdf ]
We present an architecture for data streams based on structures typically found in web cache hierarchies. The main idea is to build a meta level analyser from a number of levels constructed over time from a data stream. We present the general architecture for such a system and an application to classification. This architecture is an instance of the general wrapper idea allowing us to reuse standard batch learning algorithms in an inherently incremental learning environment. By artificially generating data sources we demonstrate that a hierarchy containing a mixture of models is able to adapt over time to the source of the data. In these experiments the hierarchies use an elementary performance based replacement policy and unweighted voting for making classification decisions.

[206]

Inductive Logic Programming, 15th International Conference, ILP 2005, Bonn, Germany, August 10-13, 2005, Proceedings

Stefan Kramer and Bernhard Pfahringer, editors. Inductive Logic Programming, 15th International Conference, ILP 2005, Bonn, Germany, August 10-13, 2005, Proceedings, volume 3625 of Lecture Notes in Computer Science. Springer, 2005.
[ bib ]
[207]

Unsupervised Discretization Using Tree-Based Density Estimation

Gabi Schmidberger and Eibe Frank. Unsupervised discretization using tree-based density estimation. In Proc 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, pages 240-251. Springer, 2005.
[ bib | .ps.gz | .pdf ]
This paper presents an unsupervised discretization method that performs density estimation for univariate data. The subintervals that the discretization produces can be used as the bins of a histogram. Histograms are a very simple and broadly understood means for display- ing data, and our method automatically adapts bin widths to the data. It uses the log-likelihood as the scoring function to select cut points and the cross-validated log-likelihood to select the number of intervals. We compare this method with equal-width discretization where we also se- lect the number of bins using the cross-validated log-likelihood and with equal-frequency discretization.

[208]

Ensembles of Balanced Nested Dichotomies for Multi-class Problems

Lin Dong, Eibe Frank, and Stefan Kramer. Ensembles of balanced nested dichotomies for multi-class problems. In Proc 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, pages 84-95. Springer, 2005.
[ bib | .ps.gz | .pdf ]
A system of nested dichotomies is a hierarchical decomposition of a multi-class problem with c classes into c - 1 two-class problems and can be represented as a tree structure. Ensembles of randomly-generated nested dichotomies have proven to be an effective approach to multi-class learning problems [1]. However, sampling trees by giving each tree equal probability means that the depth of a tree is limited only by the number of classes, and very unbalanced trees can negatively affect runtime. In this paper we investigate two approaches to building balanced nested dichotomies-class-balanced nested dichotomies and data-balanced nested dichotomies-and evaluate them in the same ensemble setting. Using C4.5 decision trees as the base models, we show that both approaches can reduce runtime with little or no effect on accuracy, especially on problems with many classes. We also investigate the effect of caching models when building ensembles of nested dichotomies.

[209]

Speeding Up Logistic Model Tree Induction

Marc Sumner, Eibe Frank, and Mark A. Hall. Speeding up logistic model tree induction. In Proc 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, pages 675-683. Springer, 2005.
[ bib | .ps.gz | .pdf ]
Logistic Model Trees have been shown to be very accurate and compact classifiers [8]. Their greatest disadvantage is the computa- tional complexity of inducing the logistic regression models in the tree. We address this issue by using the AIC criterion [1] instead of cross- validation to prevent overfitting these models. In addition, a weight trim- ming heuristic is used which produces a significant speedup. We compare the training time and accuracy of the new induction process with the original one on various datasets and show that the training time often decreases while the classification accuracy diminishes only slightly.

[210]

Stress-Testing Hoeffding Trees

Geoffrey Holmes, Richard Kirkby, and Bernhard Pfahringer. Stress-testing hoeffding trees. In Proc 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, pages 495-502. Springer, 2005.
[ bib ]
[211]

Logistic Model Trees

Niels Landwehr, Mark Hall, and Eibe Frank. Logistic model trees. Machine Learning, 59(1-2):161-205, 2005.
[ bib | .ps.gz | .pdf ]
Tree induction methods and linear models are popular techniques for supervised learning tasks, both for the prediction of nominal classes and numeric values. For predicting numeric quantities, there has been work on combining these two schemes into 'model trees', i.e. trees that contain linear regression functions at the leaves. In this paper, we present an algorithm that adapts this idea for classification problems, using logistic regression instead of linear regression. We use a stagewise fitting process to construct the logisitic regression models that can select relevant attributes in the data in a natural way, and show how this approach can be used to build the logistic regression models at the leaves by incrementally refining those constructed at higher levels in the tree. We compare the performance of our algorithm to several other state-of-the-art learning schemes on 36 benchmark UCI datasets, and show that it produces accurate and compact classifiers.

[212]

Gene selection from microarray data for cancer classification - a machine learning approach

Yu Wang, Igor V. Tetko, Mark A. Hall, Eibe Frank, Axel Facius, Klaus F. X. Mayer, and Hans-Werner Mewes. Gene selection from microarray data for cancer classification - a machine learning approach. Computational Biology and Chemistry, 29(1):37-46, 2005.
[ bib | http ]
[213]

Kea: Practical automatic keyphrase extraction

Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. Kea: Practical automatic keyphrase extraction. In Y.-L. Theng and S. Foo, editors, Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pages 129-152. Information Science Publishing, London, 2005.
[ bib | .ps.gz | .pdf ]
Keyphrases provide semantic metadata that summarize and characterize documents. This chapter describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are good keyphrases. The machine-learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large text corpus to evaluate Kea's effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and available under the GNU General Public License; the chapter gives instructions for use.

[214]

Data Mining: Practical Machine Learning Tools and Techniques

Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2 edition, 2005.
[ bib | .html ]
[215]

WEKA - A Machine Learning Workbench for Data Mining

Eibe Frank, Mark A. Hall, Geoffrey Holmes, Richard Kirkby, Bernhard Pfahringer, Ian H. Witten, and Leonhard Trigg. Weka - a machine learning workbench for data mining. In Oded Maimon and Lior Rokach, editors, The Data Mining and Knowledge Discovery Handbook, pages 1305-1314. Springer, 2005.
[ bib | .ps.gz | .pdf ]
The Weka workbench is an organized collection of state-of-the-art machine learning algorithms and data preprocessing tools. The basic way of interacting with these methods is by invoking them from the com- mand line. However, convenient interactive graphical user interfaces are provided for data exploration, for setting up large-scale experiments on distributed computing platforms, and for designing configurations for streamed data processing. These interfaces constitute an advanced en- vironment for experimental data mining. The system is written in Java and distributed under the terms of the GNU General Public License.

[216]

Semantic Role-Labeling via Consensus Pattern-Matching

Chi-San Lin and Tony C. Smith. Semantic role-labeling via consensus pattern-matching. In Proceedings of the Ninth Conference on Computational Natural Language Learning, pages 185-188, Ann Arbor, Michigan, June 2005.
[ bib | .pdf ]
This paper describes a system for semantic role labeling for the CoNLL2005 Shared task. We divide the task into two sub-tasks: boundary recognition by a general tree- based predicate-argument recognition algo- rithm to convert a parse tree into a flat rep- resentation of all predicates and their related boundaries, and role labeling by a consensus model using a pattern-matching framework to find suitable roles for core constituents and adjuncts. We describe the system architecture and report results for the CoNLL2005 development dataset.

[217]

Racing for Conditional Independence Inference

Remco R. Bouckaert and Milan Studeny. Racing for conditional independence inference. In Symbolic and Quantitative Approaches to Reasoning with Uncertainty, 8th European Conference, ECSQARU 2005, volume 3571 of LNCS, pages 221-232, 2005.
[ bib | .pdf ]
In this article, we consider the computational aspects of deciding whether a conditional independence statement t is implied by a list of conditional independence statements L using the implication related to the method of structural imsets. We present two methods which have the interesting complementary properties that one method performs well to prove that t is implied by L, while the other performs well to prove that t is not implied by L. However, both methods do not perform well the opposite. This gives rise to a parallel algorithm in which both methods race against each other in order to determine effectively whether t is or is not implied.

Some empirical evidence is provided that suggest this racing algorithms method performs a lot better than an existing method based on so-called skeletal characterization of the respective implication. Furthermore, the method is able to handle more than five variables.

[218]

Low Replicability of Machine Learning Experiments is not a Small Data Set Phenomenon

R. R. Bouckaert. Low replicability of machine learning experiments is not a small data set phenomenon. In Proceedings of the ICML'05 Workshop on Meta-Learning, 2005.
[ bib | .pdf ]
This paper investigates the relation between replicability of experiments for deciding which of two algorithms performs better on a given data set. We prove that lack of replicability is not just a small data phenomenon (as was shown before), but is present in experiments on medium and large data sets as well. We establish intuition in the relation between data set size, power and replicability.

The main method for improving replicability is to increase the number of samples. For large data sets and/or inefficient learning algorithms, this implies that exerperiments may take a long time to completion. We propose a procedure for deciding which of two learning algorithms is best that has a high replicability but takes moderate computational effort.

[219]

Bayesian Sequence Learning for Predicting Protein Cleavage Points

Mike Mayo. Bayesian sequence learning for predicting protein cleavage points. In Advances in Knowledge Discovery and Data Mining: 9th Pacific-Asia Conference, PAKDD 2005, page 192, Hanoi, Vietnam, 2005. May 18-20, 2005.
[ bib | .pdf ]
A challenging problem in data mining is the application of efficient techniques to automatically annotate the vast databases of biological sequence data. This paper describes one such application in this area, to the prediction of the position of signal peptide cleavage points along protein sequences. It is shown that the method, based on Bayesian statistics, is comparable in terms of accuracy to the existing state-of-the-art neural network techniques while providing explanatory information for its predictions.

[220]

Learning Petri Net Models of Non-Linear Gene Interactions

Mike Mayo. Learning petri net models of non-linear gene interactions. BioSystems, 85(1):74-82, 2005.
[ bib | .pdf ]
Understanding how an individual's genetic make-up influences their risk of disease is a problem of paramount importance. Although machine learning techniques are able to uncover the relationships between genotype and disease, the problem of automatically building the best biochemical model or explanation of the relationship has received less attention. In this paper, I describe a method based on random hill climbing that automatically builds Petri net models of non-linear (or multi-factorial) disease-causing gene-gene interactions. Petri nets are a suitable formalism for this problem, because they are used to model concurrent, dynamic processes analogous to biochemical reaction networks. I show that this method is routinely able to identify perfect Petri net models for three disease-causing gene-gene interactions recently reported in the literature.

[221]

Generalized Unified Decomposition of Ensemble Loss

Remco R. Bouckaert, Michael Goebel, and Patricia J. Riddle. Generalized unified decomposition of ensemble loss. In Proc 19th Australian Joint Conference on Artificial Intelligence, pages 1133-1139, 2006.
[ bib | .pdf ]
Goebel et al. [4] presented a unified decomposition of ensemble loss for explaining ensemble performance. They considered democratic voting schemes with uniform weights, where the various base classifiers each can vote for a single class once only. In this article, we generalize their decomposition to cover weighted, probabilistic voting schemes and non-uniform (progressive) voting schemes. Empirical results suggest that democratic voting schemes can be outperformed by probabilistic and progressive voting schemes. This makes the generalization worth exploring and we show how to use the generalization to analyze ensemble loss.

[222]

Efficient AUC Learning Curve Calculation

Remco R. Bouckaert. Efficient auc learning curve calculation. In Proc 19th Australian Joint Conference on Artificial Intelligence, pages 181-191, 2006.
[ bib | .pdf ]
A learning curve of a performance measure provides a graphical method with many benefits for judging classifier properties. The area under the ROC curve (AUC) is a useful and increasingly popular performance measure. In this paper, we consider the computational aspects of calculating AUC learning curves. A new method is provided for incrementally updating exact AUC curves and for calculating approximate AUC curves for datasets with millions of instances. Both theoretical and empirical justifications are given for the approximation. Variants for incremental exact and approximate AUC curves are provided as well.

[223]

Voting Massive Collections of Bayesian Network Classifiers for Data Streams

Remco R. Bouckaert. Voting massive collections of bayesian network classifiers for data streams. In Proc 19th Australian Joint Conference on Artificial Intelligence, pages 243-252, 2006.
[ bib | .pdf ]
We present a new method for voting exponential (in the number of attributes) size sets of Bayesian classifiers in polynomial time with polynomial memory requirements. Training is linear in the number of instances in the dataset and can be performed incrementally. This allows the collection to learn from massive data streams. The method allows for flexibility in balancing computational complexity, memory requirements and classification performance. Unlike many other incremental Bayesian methods, all statistics kept in memory are directly used in classification. Experimental results show that the classifiers perform well on both small and very large data sets, and that classification performance can be weighed against computational and memory costs.

[224]

A tree-based algorithm for predicate-argument recognition

Chi-San Lin and Tony Smith. A tree-based algorithm for predicate-argument recognition. Association for Computing Machinery New Zealand Bulletin, 2(1):online journal, January 2006.
[ bib ]
[225]

Improving on Bagging with Input Smearing

Eibe Frank and Bernhard Pfahringer. Improving on bagging with input smearing. In Proc 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore. Springer, 2006.
[ bib | .ps.gz | .pdf ]
Bagging is an ensemble learning method that has proved to be a useful tool in the arsenal of machine learning practitioners. Commonly applied in conjunction with decision tree learners to build an ensemble of decision trees, it often leads to reduced errors in the predictions when compared to using a single tree. A single tree is built from a training set of size N . Bagging is based on the idea that, ideally, we would like to eliminate the variance due to a particular training set by combining trees built from all training sets of size N . However, in practice, only one training set is available, and bagging simulates this platonic method by sampling with replacement from the original training data to form new training sets. In this paper we pursue the idea of sampling from a kernel density estimator of the underlying distribution to form new training sets, in addition to sampling from the data itself. This can be viewed as smearing out the resampled training data to generate new datasets, and the amount of smear is controlled by a parameter. We show that the resulting method, called input smearing, can lead to improved results when compared to bagging. We present results for both classification and regression problems.

[226]

Using Weighted Nearest Neighbor to Benefit from Unlabeled Data

Kurt Driessens, Peter Reutemann, Bernhard Pfahringer, and Claire Leschi. Using weighted nearest neighbor to benefit from unlabeled data. In Wee Keong Ng, Masaru Kitsuregawa, Jianzhong Li, and Kuiyu Chang, editors, Advances in Knowledge Discovery and Data Mining, 10th Pacific-Asia Conference, PAKDD 2006, volume 3918 of LNCS, pages 60-69, 2006.
[ bib | .pdf ]
The development of data-mining applications such as textclassification and molecular profiling has shown the need for machine learning algorithms that can benefit from both labeled and unlabeled data, where often the unlabeled examples greatly outnumber the labeled examples. In this paper we present a two-stage classifier that improves its predictive accuracy by making use of the available unlabeled data. It uses a weighted nearest neighbor classification algorithm using the combined example-sets as a knowledge base. The examples from the unlabeled set are pre-labeled by an initial classifier that is build using the limited available training data. By choosing appropriate weights for this pre-labeled data, the nearest neighbor classifier consistently improves on the original classifier.

[227]

A Semi-Supervised Spam Mail Detector

Bernhard Pfahringer. A semi-supervised spam mail detector. In Steffen Bickel, editor, Proceedings of the ECML/PKDD 2006 Discovery Challenge Workshop, pages 48-53. Humboldt University Berlin, 2006.
[ bib | .pdf ]
This document describes a novel semi-supervised approach to spam classification, which was successful at the ECML/PKDD2006 spam classification challenge. A local learning method based on lazy projections was successfully combined with a variant of a standard semi-supervised learning algorithm.

[228]

Naive Bayes for Text Classification with Unbalanced Classes

Eibe Frank and Remco R. Bouckaert. Naive bayes for text classification with unbalanced classes. In Proc 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, pages 503-510. Springer, 2006.
[ bib | .pdf ]
Multinomial naive Bayes (MNB) is a popular method for document classification due to its computational efficiency and relatively good predictive performance. It has recently been established that predictive performance can be improved further by appropriate data transformations. In this paper we present another transformation that is designed to combat a potential problem with the application of MNB to unbalanced datasets. We propose an appropriate correction by adjusting attribute priors. This correction can be implemented as another data normalization step, and we show that it can significantly improve the area under the ROC curve. We also show that the modified version of MNB is very closely related to the simple centroid-based classifier and compare the two methods empirically.

[229]

Automatic Species Identification of Live Moths

Michael Mayo. Automatic species identification of live moths. In Ellis et. al, editor, Proc. of the 26th SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence, pages 195-202, 2006.
[ bib | .pdf ]
A collection consisting of the images of 774 live moth individuals, each moth belonging to one of 35 different UK species, was analysed to determine if data mining techniques could be used effectively for automatic species identification. Feature vectors were extracted from each of the moth images and the machine learning toolkit WEKA was used to classify the moths by species using the feature vectors. Whereas a previous analysis of this image dataset reported in the literature [1] required that each moth's least worn wing region be highlighted manually for each image, WEKA was able to achieve a greater level of accuracy (85%) using support vector machines without manual specification of a region of interest at all. This paper describes the features that were extracted from the images, and the various experiments using different classifiers and datasets that were performed. The results show that data mining can be usefully applied to the problem of automatic species identification of live specimens in the field.

[230]

Random Relational Rules

Grant Anderson and Bernhard Pfahringer. Random relational rules. In Stephen Muggleton and Ramon Otero, editors, Extended Abstracts for 16th International Conference on Inductive Logic Programming, 2006.
[ bib | .pdf ]
Exhaustive search in relational learning is generally infeasible, therefore some form of heuristic search is usually employed, such as in FOIL[1]. On the other hand, so-called stochastic discrimination provides a framework for combining arbitrary numbers of weak classifiers (in this case randomly generated relational rules) in a way where accuracy improves with additional rules, even after maxi- mal accuracy on the training data has been reached.[2] The weak classifiers must have a slightly higher probability of covering instances of their target class than of other classes. As the rules are also independent and identically distributed, the Central Limit theorem applies and as the number of weak classifiers/rules grows, coverages for different classes resemble well-separated normal distribu- tions. Stochastic discrimination is closely related to other ensemble methods like Bagging, Boosting, or Random forests, all of which have been tried in relational learning[3, 4, 5].

[231]

A Comparison of Multi-instance Learning Algorithms

Lin Dong. A comparison of multi-instance learning algorithms. Master's thesis, Department of Computer Science, University of Waikato, 2006.
[ bib | http ]
Motivated by various challenging real-world applications, such as drug activity prediction and image retrieval, multi-instance (MI) learning has attracted considerable interest in recent years. Compared with standard supervised learning, the MI learning task is more difficult as the label information of each training example is incomplete. Many MI algorithms have been proposed. Some of them are specifically designed for MI problems whereas others have been upgraded or adapted from standard single-instance learning algorithms. Most algorithms have been evaluated on only one or two benchmark datasets, and there is a lack of systematic comparisons of MI learning algorithms. This thesis presents a comprehensive study of MI learning algorithms that aims to compare their performance and find a suitable way to properly address different MI problems. First, it briefly reviews the history of research on MI learning. Then it discusses five general classes of MI approaches that cover a total of 16 MI algorithms. After that, it presents empirical results for these algorithms that were obtained from 15 datasets which involve five different real-world application domains. Finally, some conclusions are drawn from these results: (1) applying suitable standard single-instance learners to MI problems can often generate the best result on the datasets that were tested, (2) algorithms exploiting the standard asymmetric MI assumption do not show significant advantages over approaches using the so-called collective assumption, and (3) different MI approaches are suitable for different application domains, and no MI algorithm works best on all MI problems.

[232]

Reinforcement Learning for Racecar Control

Benjamin George Cleland. Reinforcement learning for racecar control. Master's thesis, Department of Computer Science, University of Waikato, 2006.
[ bib | http ]
This thesis investigates the use of reinforcement learning to learn to drive a racecar in the simulated environment of the Robot Automobile Racing Simulator. Real-life race driving is known to be difficult for humans, and expert human drivers use complex sequences of actions. There are a large number of variables, some of which change stochastically and all of which may affect the outcome. This makes driving a promising domain for testing and developing Machine Learning techniques that have the potential to be robust enough to work in the real world. Therefore the principles of the algorithms from this work may be applicable to a range of problems. The investigation starts by finding a suitable data structure to represent the information learnt. This is tested using supervised learning. Reinforcement learning is added and roughly tuned, and the supervised learning is then removed. A simple tabular representation is found satisfactory, and this avoids difficulties with more complex methods and allows the investigation to concentrate on the essentials of learning. Various reward sources are tested and a combination of three are found to produce the best performance. Exploration of the problem space is investigated. Results show exploration is essential but controlling how much is done is also important. It turns out the learning episodes need to be very long and because of this the task needs to be treated as continuous by using discounting to limit the size of the variables stored. Eligibility traces are used with success to make the learning more efficient. The tabular representation is made more compact by hashing and more accurate by using smaller buckets. This slows the learning but produces better driving. The improvement given by a rough form of generalisation indicates the replacement of the tabular method by a function approximator is warranted. These results show reinforcement learning can work within the Robot Automobile Racing Simulator, and lay the foundations for building a more efficient and competitive agent.

[233]

Improving Hoeffding Trees

Richard Kirkby. Improving Hoeffding Trees. PhD thesis, Department of Computer Science, University of Waikato, 2007.
[ bib | http ]
Modern information technology allows information to be collected at a far greater rate than ever before. So fast, in fact, that the main problem is making sense of it all. Machine learning offers promise of a solution, but the field mainly focusses on achieving high accuracy when data supply is limited. While this has created sophisticated classification algorithms, many do not cope with increasing data set sizes. When the data set sizes get to a point where they could be considered to represent a continuous supply, or data stream, then incremental classification algorithms are required. In this setting, the effectiveness of an algorithm cannot simply be assessed by accuracy alone. Consideration needs to be given to the memory available to the algorithm and the speed at which data is processed in terms of both the time taken to predict the class of a new data sample and the time taken to include this sample in an incrementally updated classification model. The Hoeffding tree algorithm is a state-of-the-art method for inducing decision trees from data streams. The aim of this thesis is to improve this algorithm. To measure improvement, a comprehensive framework for evaluating the performance of data stream algorithms is developed. Within the framework memory size is fixed in order to simulate realistic application scenarios. In order to simulate continuous operation, classes of synthetic data are generated providing an evaluation on a large scale. Improvements to many aspects of the Hoeffding tree algorithm are demonstrated. First, a number of methods for handling continuous numeric features are compared. Second, tree prediction strategy is investigated to evaluate the utility of various methods. Finally, the possibility of improving accuracy using ensemble methods is explored. The experimental results provide meaningful comparisons of accuracy and processing speeds between different modifications of the Hoeffding tree algorithm under various memory limits. The study on numeric attributes demonstrates that sacrificing accuracy for space at the local level often results in improved global accuracy. The prediction strategy shown to perform best adaptively chooses between standard majority class and Naive Bayes prediction in the leaves. The ensemble method investigation shows that combining trees can be worthwhile, but only when sufficient memory is available, and improvement is less likely than in traditional machine learning. In particular, issues are encountered when applying the popular boosting method to streams.

[234]

Racing algorithms for conditional independence inference

Remco R. Bouckaert and Milan Studeny. Racing algorithms for conditional independence inference. Int. J. Approx. Reasoning, 45(2):386-401, 2007.
[ bib | .pdf ]
In this article, we consider the computational aspects of deciding whether a conditional independence statement t is implied by a list of conditional independence statements L using the implication related to the method of structural imsets. We present two methods which have the interesting complementary properties that one method performs well to prove that t is implied by L, while the other performs well to prove that t is not implied by L. However, both methods do not well perform the opposite. This gives rise to a parallel algorithm in which both methods race against each other in order to determine effectively whether t is or is not implied. Some empirical evidence is provided that suggest this racing algorithms method performs considerably better than an existing method based on so-called skeletal characterization of the respective implication. Furthermore, unlike previous methods, the method is able to handle more than five variables.

[235]

Syntax-driven argument identification and multi-argument classification for semantic role labeling

Chi-San Althon Lin. Syntax-driven argument identification and multi-argument classification for semantic role labeling. PhD thesis, Department of Computer Science, University of Waikato, 2007.
[ bib | http ]
Semantic role labeling is an important stage in systems for Natural Language Understanding. The basic problem is one of identifying who did what to whom for each predicate in a sentence. Thus labeling is a two-step process: identify constituent phrases that are arguments to a predicate, then label those arguments with appropriate thematic roles. Existing systems for semantic role labeling use machine learning methods to assign roles one-at-a-time to candidate arguments. There are several drawbacks to this general approach. First, more than one candidate can be assigned the same role, which is undesirable. Second, the search for each candidate argument is exponential with respect to the number of words in the sentence. Third, single-role assignment cannot take advantage of dependencies known to exist between semantic roles of predicate arguments, such as their relative juxtaposition. And fourth, execution times for existing algorithm are excessive, making them unsuitable for real-time use. This thesis seeks to obviate these problems by approaching semantic role labeling as a multi-argument classification process. It observes that the only valid arguments to a predicate are unembedded constituent phrases that do not overlap that predicate. Given that semantic role labeling occurs after parsing, this thesis proposes an algorithm that systematically traverses the parse tree when looking for arguments, thereby eliminating the vast majority of impossible candidates. Moreover, instead of assigning semantic roles one at a time, an algorithm is proposed to assign all labels simultaneously; leveraging dependencies between roles and eliminating the problem of duplicate assignment. Experimental results are provided as evidence to show that a combination of the proposed argument identification and multi-argument classification algorithms outperforms all existing systems that use the same syntactic information.

[236]

An Empirical Comparison of Exact Nearest Neighbour Algorithms

Ashraf M. Kibriya and Eibe Frank. An empirical comparison of exact nearest neighbour algorithms. In Proc 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, Warsaw, Poland, pages 140-151. Springer, 2007.
[ bib | .pdf ]
Abstract. Nearest neighbour search (NNS) is an old problem that is of practical importance in a number of fields. It involves finding, for a given point q, called the query, one or more points from a given set of points that are nearest to the query q. Since the initial inception of the problem a great number of algorithms and techniques have been proposed for its solution. However, it remains the case that many of the proposed algorithms have not been compared against each other on a wide variety of datasets. This research attempts to fill this gap to some extent by presenting a detailed empirical comparison of three prominent data structures for exact NNS: KD-Trees, Metric Trees, and Cover Trees. Our results suggest that there is generally little gain in using Metric Trees or Cover Trees instead of KD-Trees for the standard NNS problem.

[237]

Effective Classifiers for Detecting Objects

Michael Mayo. Effective classifiers for detecting objects. In Proc. of the Fourth International Conference on Computational Intelligence, Robotics, and Autonomous Systems (CIRAS '07), 2007.
[ bib | .pdf ]
Several state-of-the-art machine learning classifiers are compared for the purposes of object detection in complex images, using global image features derived from the Ohta color space and Local Binary Patterns. Image complexity in this sense refers to the degree to which the target objects are occluded and/or non-dominant (i.e. not in the foreground) in the image, and also the degree to which the images are cluttered with non-target objects. The results indicate that a voting ensemble of Support Vector Machines, Random Forests, and Boosted Decision Trees provide the best performance with AUC values of up to 0.92 and Equal Error Rate accuracies of up to 85.7% in stratified 10-fold cross validation experiments on the GRAZ02 complex image dataset.

[238]

Random Convolution Ensembles

Michael Mayo. Random convolution ensembles. In Hi Shing Ip et. al, editor, Advances in Multimedia Information Processing - PCM 2007, 8th Pacific Rim Conference on Multimedia, Lecture Notes in Computer Science 4810, pages 216-225. Springer, 2007.
[ bib | .pdf ]
A novel method for creating diverse ensembles of image classifiers is proposed. The idea is that, for each base image classifier in the ensemble, a random image transformation is generated and applied to all of the images in the labeled training set. The base classifiers are then learned using features extracted from these randomly transformed versions of the training data, and the result is a highly diverse ensemble of image classifiers. This approach is evaluated on a benchmark pedestrian detection dataset and shown to be effective.

[239]

A discriminative approach to structured biological data

S. Mutter and B. Pfahringer. A discriminative approach to structured biological data. In Proc NZCSRSC'07, the Fifth New Zealand Computer Science Research Student Conference, Hamilton, New Zealand, April 2007.
[ bib | .pdf ]
This paper introduces the first author's PhD pro ject which has just got out of its initial stage. Biological sequence data is, on the one hand, highly structured. On the other hand there are large amounts of unlabelled data. Thus we combine probabilistic graphical models and semi-supervised learning. The former to handle structured data and the latter to deal with unlabelled data. We apply our models to genotype- phenotype modelling problems. In particular we predict the set of Single Nucleotide Polymorphisms which underlie a specific phenotypical trait.

[240]

Scaling Up Semi-supervised Learning: An Efficient and Effective LLGC Variant

Bernhard Pfahringer, Claire Leschi, and Peter Reutemann. Scaling up semi-supervised learning: An efficient and effective llgc variant. In Proc 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Nanjing, China, pages 236-247. Springer, 2007.
[ bib | http ]
Domains like text classification can easily supply large amounts of unlabeled data, but labeling itself is expensive. Semi- supervised learning tries to exploit this abundance of unlabeled training data to improve classification. Unfortunately most of the theoretically well-founded algorithms that have been described in recent years are cubic or worse in the total number of both labeled and unlabeled training examples. In this paper we apply modifications to the standard LLGC algorithm to improve efficiency to a point where we can handle datasets with hundreds of thousands of training data. The modifications are priming of the unlabeled data, and most importantly, sparsification of the similarity matrix. We report promising results on large text classification problems.

[241]

New Options for Hoeffding Trees

Bernhard Pfahringer, Geoffrey Holmes, and Richard Kirkby. New options for hoeffding trees. In Proc 20th Australian Conference on Artificial Intelligence, Gold Coast, Australia, pages 90-99. Springer, 2007.
[ bib | http ]
Hoeffding trees are state-of-the-art for processing high-speed data streams. Their ingenuity stems from updating sufficient statistics, only addressing growth when decisions can be made that are guaranteed to be almost identical to those that would be made by conventional batch learning methods. Despite this guarantee, decisions are still subject to limited lookahead and stability issues. In this paper we explore Hoeffding Option Trees, a regular Hoeffding tree containing additional option nodes that allow several tests to be applied, leading to multiple Hoeffding trees as separate paths. We show how to control tree growth in order to generate a mixture of paths, and empirically determine a reasonable number of paths. We then empirically evaluate a spectrum of Hoeffding tree variations: single trees, option trees and bagged trees. Finally, we investigate pruning. We show that on some datasets a pruned option tree can be smaller and more accurate than a single tree.

[242]

Clustering Relational Data Based on Randomized Propositionalization

Grant Anderson and Bernhard Pfahringer. Clustering relational data based on randomized propositionalization. In Proc 17th International Conference on Inductive Logic Programming, Corvallis, OR, pages 39-48. Springer, 2007.
[ bib | http ]
Clustering of relational data has so far received a lot less attention than classification of such data. In this paper we investigate a simple approach based on randomized propositionalization, which allows for applying standard clustering algorithms like KMeans to multi-relational data. We describe how random rules are generated and then turned into boolean-valued features. Clustering generally is not straightforward to evaluate, but preliminary experimental results on a number of standard ILP datasets show promising results. Clusters generated without class information usually agree well with the true class labels of cluster members, i.e. class distributions inside clusters generally differ significantly from the global class distributions. The two-tiered algorithm described shows good scalability due to the randomized nature of the first step and the availability of efficient propositional clustering algorithms for the second step.

[243]

Clustering for Classification

Reuben Evans. Clustering for classification. Master's thesis, Department of Computer Science, University of Waikato, 2007.
[ bib | http ]
Advances in technology have provided industry with an array of devices for collecting data. The frequency and scale of data collection means that there are now many large datasets being generated. To find patterns in these datasets it would be useful to be able to apply modern methods of classification such as support vector machines. Unfortunately these methods are computationally expensive, quadratic in the number of data points in fact, so cannot be applied directly. This thesis proposes a framework whereby a variety of clustering methods can be used to summarise datasets, that is, reduce them to a smaller but still representative dataset so that these advanced methods can be applied. It compares the results of using this framework against using random selection on a large number of classification and regression problems. Results show that the clustered datasets are on average fifty percent smaller than the original datasets without loss of classification accuracy which is significantly better than random selection. They also show that there is no free lunch, for each dataset it is important to choose a clustering method carefully.

[244]

Fast Algorithms for Nearest Neighbour Search

Ashraf Masood Kibriya. Fast algorithms for nearest neighbour search. Master's thesis, Department of Computer Science, University of Waikato, 2007.
[ bib | http ]
The nearest neighbour problem is of practical significance in a number of fields. Often we are interested in finding an object near to a given query object. The problem is old, and a large number of solutions have been proposed for it in the literature. However, it remains the case that even the most popular of the techniques proposed for its solution have not been compared against each other. Also, many techniques, including the old and popular ones, can be implemented in a number of ways, and often the different implementations of a technique have not been thoroughly compared either. This research presents a detailed investigation of different implementations of two popular nearest neighbour search data structures, KDTrees and Metric Trees, and compares the different implementations of each of the two structures against each other. The best implementations of these structures are then compared against each other and against two other techniques, Annulus Method and Cover Trees. Annulus Method is an old technique that was rediscovered during the research for this thesis. Cover Trees are one of the most novel and promising data structures for nearest neighbour search that have been proposed in the literature.

[245]

Effective Linear-Time Feature Selection

Nripendra Pradhananga. Effective linear-time feature selection. Master's thesis, Department of Computer Science, University of Waikato, 2007.
[ bib | http ]
The classification learning task requires selection of a subset of features to represent patterns to be classified. This is because the performance of the classifier and the cost of classification are sensitive to the choice of the features used to construct the classifier. Exhaustive search is impractical since it searches every possible combination of features. The runtime of heuristic and random searches are better but the problem still persists when dealing with high-dimensional datasets. We investigate a heuristic, forward, wrapper-based approach, called Linear Sequential Selection, which limits the search space at each iteration of the feature selection process. We introduce randomization in the search space. The algorithm is called Randomized Linear Sequential Selection. Our experiments demonstrate that both methods are faster, find smaller subsets and can even increase the classification accuracy. We also explore the idea of ensemble learning. We have proposed two ensemble creation methods, Feature Selection Ensemble and Random Feature Ensemble. Both methods apply a feature selection algorithm to create individual classifiers of the ensemble. Our experiments have shown that both methods work well with high-dimensional data.

[246]

Best-first Decision Tree Learning

Haijian Shi. Best-first decision tree learning. Master's thesis, Department of Computer Science, University of Waikato, 2007.
[ bib | http ]
In best-first top-down induction of decision trees, the best split is added in each step (e.g. the split that maximally reduces the Gini index). This is in contrast to the standard depth-first traversal of a tree. The resulting tree will be the same, just how it is built is different. The objective of this project is to investigate whether it is possible to determine an appropriate tree size on practical datasets by combining best-first decision tree growth with cross-validation-based selection of the number of expansions that are performed. Pre-pruning, post-pruning, CART-pruning can be performed this way to compare.

[247]

Prediction Intervals for Class Probabilities

Xiaofeng Yu. Prediction intervals for class probabilities. Master's thesis, Department of Computer Science, University of Waikato, 2007.
[ bib | http ]
Prediction intervals for class probabilities are of interest in machine learning because they can quantify the uncertainty about the class probability estimate for a test instance. The idea is that all likely class probability values of the test instance are included, with a pre-specified confidence level, in the calculated prediction interval. This thesis proposes a probabilistic model for calculating such prediction intervals. Given the unobservability of class probabilities, a Bayesian approach is employed to derive a complete distribution of the class probability of a test instance based on a set of class observations of training instances in the neighbourhood of the test instance. A random decision tree ensemble learning algorithm is also proposed, whose prediction output constitutes the neighbourhood that is used by the Bayesian model to produce a PI for the test instance. The Bayesian model, which is used in conjunction with the ensemble learning algorithm and the standard nearest-neighbour classifier, is evaluated on artificial datasets and modified real datasets.

[248]

Learning from the Past with Experiment Databases

Joaquin Vanschoren, Bernhard Pfahringer, and Geoffrey Holmes. Learning from the past with experiment databases. In Proc 10th Pacific Rim International Conference on Artificial Intelligence, Hanoi, Vietnam, pages 485-496. Springer, 2008.
[ bib ]
Thousands of Machine Learning research papers contain experimental comparisons that usually have been conducted with a single focus of interest, often losing detailed results after publication. Yet, when collecting all these past experiments in experiment databases, they can readily be reused for additional and possibly much broader investigation. In this paper, we make use of such a database to answer various interesting research questions about learning algorithms and to verify a number of recent studies. Alongside performing elaborate comparisons of algorithms, we also investigate the effects of algorithm parameters and data properties, and seek deeper insights into the behavior of learning algorithms by studying their learning curves and bias-variance profiles.

[249]

Multi-label Classification Using Ensembles of Pruned Sets

Jesse Read, Bernhard Pfahringer, and Geoffrey Holmes. Multi-label classification using ensembles of pruned sets. In Proc 8th IEEE International Conference on Data Mining, Pisa, Italy, pages 995-1000. IEEE Computer Society, 2008.
[ bib | http ]
This paper presents a pruned sets method (PS) for multi-label classification. It is centred on the concept of treating sets of labels as single labels. This allows the classification process to inherently take into account correlations between labels. By pruning these sets, PS focuses only on the most important correlations, which reduces complexity and improves accuracy. By combining pruned sets in an ensemble scheme (EPS), new label sets can be formed to adapt to irregular or complex data. The results from experimental evaluation on a variety of multi-label datasets show that [E]PS can achieve better performance and train much faster than other multi-label methods.

[250]

Practical Bias Variance Decomposition

Remco R. Bouckaert. Practical bias variance decomposition. In Proc 21st Australasian Joint Conference on Artificial Intelligence, Auckland, New Zealand. Springer, 2008.
[ bib | .pdf ]
Bias variance decomposition for classifiers is a useful tool in understanding classifier behavior. Unfortunately, the literature does not provide consistent guidelines on how to apply a bias variance decomposition. This paper examines the various parameters and variants of empirical bias variance decompositions through an extensive simulation study. Based on this study, we recommend to use ten fold cross validation as sampling method and take 100 samples within each fold with a test set size of at least 2000. Only if the learning algorithm is stable, fewer samples, a smaller test set size or lower number of folds may be justified.

[251]

Revisiting Multiple-Instance Learning via Embedded Instance Selection

James Foulds and Eibe Frank. Revisiting multiple-instance learning via embedded instance selection. In Proc 21st Australasian Joint Conference on Artificial Intelligence, Auckland, New Zealand. Springer, 2008.
[ bib | .pdf ]
Multiple-Instance Learning via Embedded Instance Selection (MILES) is a recently proposed multiple-instance (MI) classification al- gorithm that applies a single-instance base learner to a propositional- ized version of MI data. However, the original authors consider only one single-instance base learner for the algorithm - the 1-norm SVM. We present an empirical study investigating the efficacy of alternative base learners for MILES, and compare MILES to other MI algorithms. Our results show that boosted decision stumps can in some cases provide better classification accuracy than the 1-norm SVM as a base learner for MILES. Although MILES provides competitive performance when compared to other MI learners, we identify simpler propositionaliza- tion methods that require shorter training times while retaining MILES' strong classification performance on the datasets we tested.

[252]

Additive Regression Applied to a Large-Scale Collaborative Filtering Problem

Eibe Frank and Mark Hall. Additive regression applied to a large-scale collaborative filtering problem. In Proc 21set Australasian Joint Conference on Artificial Intelligence, Auckland, New Zealand. Springer, 2008.
[ bib | .pdf ]
The much-publicized Netflix competition has put the spot- light on the application domain of collaborative filtering and has sparked interest in machine learning algorithms that can be applied to this sort of problem. The demanding nature of the Netflix data has lead to some interesting and ingenious modifications to standard learning methods in the name of efficiency and speed. There are three basic methods that have been applied in most approaches to the Netflix problem so far: stand-alone neighborhood-based methods, latent factor models based on singular-value decomposition, and ensembles consisting of variations of these techniques. In this paper we investigate the application of forward stage-wise additive modeling to the Netflix problem, using two regression schemes as base learners: ensembles of weighted simple linear regressors and k-means clustering-the latter being interpreted as a tool for multi- variate regression in this context. Experimental results show that our methods produce competitive results.

[253]

Discriminating Against New Classes: One-Class versus Multi-Class Classification

Kathryn Hempstalk and Eibe Frank. Discriminating against new classes: One-class versus multi-class classification. In Proc 21set Australasian Joint Conference on Artificial Intelligence, Auckland, New Zealand. Springer, 2008.
[ bib | .pdf ]
Many applications require the ability to identify data that is anomalous with respect to a target group of observations, in the sense of belonging to a new, previously unseen 'attacker' class. One possible approach to this kind of verification problem is one-class classification, learning a description of the target class concerned based solely on data from this class. However, if known non-target classes are available at training time, it is also possible to use standard multi-class or two-class classification, exploiting the negative data to infer a description of the target class. In this paper we assume that this scenario holds and inves- tigate under what conditions multi-class and two-class Naive Bayes clas- sifiers are preferable to the corresponding one-class model when the aim is to identify examples from a new 'attacker' class. To this end we first identify a way of performing a fair comparison between the techniques concerned and present an adaptation of standard cross-validation. This is one of the main contributions of the paper. Based on the experimental results obtained, we then show under what conditions which group of techniques is likely to be preferable. Our main finding is that multi-class and two-class classification becomes preferable to one-class classification when a sufficiently large number of non-target classes is available.

[254]

Propositionalisation of Profile Hidden Markov Models for Biological Sequence Analysis

Stefan Mutter, Bernhard Pfahringer, and Geoffrey Holmes. Propositionalisation of profile hidden markov models for biological sequence analysis. In Proc 21st Australasian Joint Conference on Artificial Intelligence, Auckland, New Zealand. Springer, 2008.
[ bib | .pdf ]
Hidden Markov Models are a widely used generative model for analysing sequence data. A variant, Profile Hidden Markov Models are a special case used in Bioinformatics to represent, for example, protein families. In this paper we introduce a simple propositionalisation method for Profile Hidden Markov Models. The method allows the use of PHMMs discriminatively in a classification task. Previously, kernel approaches have been proposed to generate a discriminative description for an HMM, but require the explicit definition of a similarity measure for HMMs. Propositionalisation does not need such a measure and allows the use of any propositional learner including kernel-based approaches. We show empirically that using propositionalisation leads to higher accuracies in comparison with PHMMs on benchmark datasets.

[255]

Mining Arbitrarily Large Datasets Using Heuristic k-Nearest Neighbour Search

Xing Wu, Geoffrey Holmes, and Bernhard Pfahringer. Mining arbitrarily large datasets using heuristic k-nearest neighbour search. In Proc 21st Australasian Joint Conference on Artificial Intelligence, Auckland, New Zealand, pages 355-361. Springer, 2008.
[ bib | http ]
Nearest Neighbour Search (NNS) is one of the top ten data mining algorithms. It is simple and effective but has a time complexity that is the product of the number of instances and the number of dimensions. When the number of dimensions is greater than two there are no known solutions that can guarantee a sublinear retrieval time. This paper describes and evaluates two ways to make NNS efficient for datasets that are arbitrarily large in the number of instances and dimensions. The methods are best described as heuristic as they are neither exact nor approximate. Both stem from recent developments in the field of data stream classification. The first uses Hoeffding Trees, an extension of decision trees to streams and the second is a direct stream extension of NNS. The methods are evaluated in terms of their accuracy and the time taken to find the neighbours. Results show that the methods are competitive with NNS in terms of accuracy but significantly faster.

[256]

Organizing the World's Machine Learning Information

Joaquin Vanschoren, Hendrik Blockeel, Bernhard Pfahringer, and Geoffrey Holmes. Organizing the world's machine learning information. In Proc 3rd International Symposium on Leveraging Applications of Formal Methods, Verification and Validation, Porto Sani, Greece, pages 693-708. Springer, 2008.
[ bib | http ]
All around the globe, thousands of learning experiments are being executed on a daily basis, only to be discarded after interpretation. Yet, the information contained in these experiments might have uses beyond their original intent and, if properly stored, could be of great use to future research. In this paper, we hope to stimulate the development of such learning experiment repositories by providing a bird's-eye view of how they can be created and used in practice, bringing together existing approaches and new ideas. We draw parallels between how experiments are being curated in other sciences, and consecutively discuss how both the empirical and theoretical details of learning experiments can be expressed, organized and made universally accessible. Finally, we discuss a range of possible services such a resource can offer, either used directly or integrated into data mining tools.

[257]

Adaptive feature thresholding for off-line signature verification

Rober Larkins and Michael Mayo. Adaptive feature thresholding for off-line signature verification. In Proc 23rd International Conference Image and Vision Computing New Zealand, Christchurch, New Zealand, pages 1-6. IEEE, 2008.
[ bib | www: ]
This paper introduces Adaptive Feature Thresholding (AFT) which is a novel method of person-dependent off-line signature verification. AFT enhances how a simple image feature of a signature is converted to a binary feature vector by significantly improving its representation in relation to the training signatures. The similarity between signatures is then easily computed from their corresponding binary feature vectors. AFT was tested on the CEDAR and GPDS benchmark datasets, with classification using either a manual or an automatic variant. On the CEDAR dataset we achieved a classification accuracy of 92% for manual and 90% for automatic, while on the GPDS dataset we achieved over 87% and 85% respectively. For both datasets AFT is less complex and requires fewer images features than the existing state of the art methods, while achieving competitive results.

[258]

Improving face gender classification by adding deliberately misaligned faces to the training data

Michael Mayo and Edmond Zhang. Improving face gender classification by adding deliberately misaligned faces to the training data. In Proc 23rd International Conference Image and Vision Computing New Zealand, Christchurch, New Zealand, pages 1-5. IEEE, 2008.
[ bib | http ]
A novel method of face gender classifier construction is proposed and evaluated. Previously, researchers have assumed that a computationally expensive face alignment step (in which the face image is transformed so that facial landmarks such as the eyes, nose, chin, etc, are in uniform locations in the image) is required in order to maximize the accuracy of predictions on new face images. We, however, argue that this step is not necessary, and that machine learning classifiers can be made robust to face misalignments by automatically expanding the training data with examples of faces that have been deliberately misaligned (for example, translated or rotated). To test our hypothesis, we evaluate this automatic training dataset expansion method with two types of image classifier, the first based on weak features such as Local Binary Pattern histograms, and the second based on SIFT keypoints. Using a benchmark face gender classification dataset recently proposed in the literature, we obtain a state-of-the-art accuracy of 92.5%, thus validating our approach.

[259]

Pattern discovery for object categorization

Edmond Zhang and Michael Mayo. Pattern discovery for object categorization. In Proc 23rd International Conference Image and Vision Computing New Zealand, Christchurch, New Zealand, pages 1 - 6. IEEE, 2008.
[ bib | http ]
This paper presents a new approach for the object categorization problem. Our model is based on the successful `bag of words' approach. However, unlike the original model, image features (keypoints) are not seen as independent and orderless. Instead, our model attempts to discover intermediate representations for each object class. This approach works by partitioning the image into smaller regions then computing the spatial relationships between all of the informative image keypoints in the region. The results show that the inclusion of spatial relationships leads to a measurable increase in performance for two of the most challenging datasets.

[260]

Machine Learning for Adaptive Computer Game Opponents

Jonathan David Miles. Machine learning for adaptive computer game opponents. Master's thesis, Department of Computer Science, University of Waikato, 2008.
[ bib | http ]
This thesis investigates the use of machine learning techniques in computer games to create a computer player that adapts to its opponent's game-play. This includes first confirming that machine learning algorithms can be integrated into a modern computer game without have a detrimental effect on game performance, then experimenting with different machine learning techniques to maximize the computer player's performance. Experiments use three machine learning techniques; static prediction models, continuous learning, and reinforcement learning. Static models show the highest initial performance but are not able to beat a simple opponent. Continuous learning is able to improve the performance achieved with static models but the rate of improvement drops over time and the computer player is still unable to beat the opponent. Reinforcement learning methods have the highest rate of improvement but the lowest initial performance. This limits the effectiveness of reinforcement learning because a large number of episodes are required before performance becomes sufficient to match the opponent.

[261]

Random Relational Rules

Grant Anderson. Random Relational Rules. PhD thesis, Department of Computer Science, University of Waikato, 2008.
[ bib | http ]
In the field of machine learning, methods for learning from single-table data have received much more attention than those for learning from multi-table, or relational data, which are generally more computationally complex. However, a significant amount of the world's data is relational. This indicates a need for algorithms that can operate efficiently on relational data and exploit the larger body of work produced in the area of single-table techniques. This thesis presents algorithms for learning from relational data that mitigate, to some extent, the complexity normally associated with such learning. All algorithms in this thesis are based on the generation of random relational rules. The assumption is that random rules enable efficient and effective relational learning, and this thesis presents evidence that this is indeed the case. To this end, a system for generating random relational rules is described, and algorithms using these rules are evaluated. These algorithms include direct classification, classification by propositionalisation, clustering, semi-supervised learning and generating random forests. The experimental results show that these algorithms perform competitively with previously published results for the datasets used, while often exhibiting lower runtime than other tested systems. This demonstrates that sufficient information for classification and clustering is retained in the rule generation process and that learning with random rules is efficient. Further applications of random rules are investigated. Propositionalisation allows single-table algorithms for classification and clustering to be applied to the resulting data, reducing the amount of relational processing required. Further results show that techniques for utilising additional unlabeled training data improve accuracy of classification in the semi-supervised setting. The thesis also develops a novel algorithm for building random forests by making efficient use of random rules to generate trees and leaves in parallel.

[262]

Experiment Databases: Creating a New Platform for Meta-Learning Research

Joaquin Vanschoren, Hendrik Blockeel, Bernhard Pfahringer, and Geoffrey Holmes. Experiment databases: Creating a new platform for meta-learning research. In Proc ICML/COLT/UAI 2008 Planning to Learn Workshop, Helsinki, Finland. University of Porto, 2008.
[ bib | http ]
[263]

One-class Classification by Combining Density and Class Probability Estimation

Kathryn Hempstalk, Eibe Frank, and Ian H. Witten. One-class classification by combining density and class probability estimation. In Proc 12th European Conference on Principles and Practice of Knowledge Discovery in Databases and 19th European Conference on Machine Learning, Antwerp, Belgium. Springer, 2008.
[ bib | .pdf ]
One-class classification has important applications such as outlier and novelty detection. It is commonly tackled using density es- timation techniques or by adapting a standard classification algorithm to the problem of carving out a decision boundary that describes the location of the target data. In this paper we investigate a simple method for one-class classification that combines the application of a density es- timator, used to form a reference distribution, with the induction of a standard model for class probability estimation. In this method, the ref- erence distribution is used to generate artificial data that is employed to form a second, artificial class. In conjunction with the target class, this artificial class is the basis for a standard two-class learning problem. We explain how the density function of the reference distribution can be combined with the class probability estimates obtained in this way to form an adjusted estimate of the density function of the target class. Us- ing UCI datasets, and data from a typist recognition problem, we show that the combined model, consisting of both a density estimator and a class probability estimator, can improve on using either component tech- nique alone when used for one-class classification. We also compare the method to one-class classification using support vector machines.

[264]

Combining Naive Bayes and Decision Tables

Mark Hall and Eibe Frank. Combining naive Bayes and decision tables. In Proc 21st Florida Artificial Intelligence Research Society Conference, Miami, Florida. AAAI Press, 2008.
[ bib | .pdf ]
We investigate a simple semi-naive Bayesian ranking method that combines naive Bayes with induction of decision tables. Naive Bayes and decision tables can both be trained effi- ciently, and the same holds true for the combined semi-naive model. We show that the resulting ranker, compared to ei- ther component technique, frequently significantly increases AUC. For some datasets it significantly improves on both techniques. This is also the case when attribute selection is performed in naive Bayes and its semi-naive variant.

[265]

Propositionalisation of multiple sequence alignments using probabilistic models

Stefan Mutter, Bernhard Pfahringer, and Geoffrey Holmes. Propositionalisation of multiple sequence alignments using probabilistic models. In New Zealand Computer Science Research Student Conference (NZCSRSC 2008), Christchurch, New Zealand, pages 234-237, April 2008.
[ bib | .pdf ]
Multiple sequence alignments play a central role in Bioinformatics. Most alignment representations are designed to facilitate knowledge extraction by human experts. Additionally statistical models like Profile Hidden Markov Models are used as representations. They offer the advantage to provide sound, probabilistic scores. The basic idea we present in this paper is to use the structure of a Profile Hidden Markov Model for propositionalisation. This way we get a simple, extendable representation of multiple sequence alignments which facilitates further analysis by Machine Learning algorithms.

[266]

Exploiting propositionalization based on random relational rules for semi-supervised learning

Bernhard Pfahringer and Grant Anderson. Exploiting propositionalization based on random relational rules for semi-supervised learning. In Proc 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Osaka, Japan. Springer, 2008.
[ bib | http ]
In this paper we investigate an approach to semi-supervised learning based on randomized propositionalization, which allows for applying standard propositional classification algorithms like support vector machines to multi-relational data. Randomization based on random relational rules can work both with and without a class attribute and can therefore be applied simultaneously to both the labeled and the unlabeled portion of the data present in semi-supervised learning. An empirical investigation compares semi-supervised propositionalization to standard propositionalization using just the labeled data portion, as well as to a variant that also just uses the labeled data portion but includes the label information in an attempt to improve the resulting propositionalization. Preliminary experimental results indicate that propositionalization generated on the full dataset, i.e. the semi- supervised approach, tends to outperform the other two more standard approaches.

[267]

Handling Numeric Attributes in Hoeffding Trees

Bernhard Pfahringer, Geoffrey Holmes, and Richard Kirkby. Handling numeric attributes in hoeffding trees. In Proc 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Osaka, Japan, pages 296-307. Springer, 2008.
[ bib | http ]
For conventional machine learning classification algorithms handling numeric attributes is relatively straightforward. Unsupervised and supervised solutions exist that either segment the data into pre-defined bins or sort the data and search for the best split points. Unfortunately, none of these solutions carry over particularly well to a data stream environment. Solutions for data streams have been proposed by several authors but as yet none have been compared empirically. In this paper we investigate a range of methods for multi-class tree-based classification where the handling of numeric attributes takes place as the tree is constructed. To this end, we extend an existing approximation approach, based on simple Gaussian approximation. We then compare this method with four approaches from the literature arriving at eight final algorithm configurations for testing. The solutions cover a range of options from perfectly accurate and memory intensive to highly approximate. All methods are tested using the Hoeffding tree classification algorithm. Surprisingly, the experimental comparison shows that the most approximate methods produce the most accurate trees by allowing for faster tree growth.

[268]

Learning Instance Weights in Multi-Instance Learning

James Foulds. Learning instance weights in multi-instance learning. Master's thesis, Department of Computer Science, University of Waikato, 2008.
[ bib | http ]
Multi-instance (MI) learning is a variant of supervised machine learning, where each learning example contains a bag of instances instead of just a single feature vector. MI learning has applications in areas such as drug activity prediction, fruit disease management and image classification. This thesis investigates the case where each instance has a weight value determining the level of influence that it has on its s class label. This is a more general assumption than most existing approaches use, and thus is more widely applicable. The challenge is to accurately estimate these weights in order to make predictions at the bag level. An existing approach known as MILES is retroactively identified as an algorithm that uses instance weights for MI learning, and is evaluated using a variety of base learners on benchmark problems. New algorithms for learning instance weights for MI learning are also proposed and rigorously evaluated on both artificial and real-world datasets. The new algorithms are shown to achieve better root mean squared error rates than existing approaches on artificial data generated according to the underlying assumptions. Experimental results also demonstrate that the new algorithms are competitive with existing approaches on real-world problems.

[269]

The Positive Effects of Negative Information: Extending One-Class Classification Models in Binary Proteomic Sequence Classification

Stefan Mutter, Bernhard Pfahringer, and Geoffrey Holmes. The positive effects of negative information: Extending one-class classification models in binary proteomic sequence classification. In Proc 22nd Australasian Conference on Artificial Intelligence, Melbourne, Australia, pages 260-269. Springer, 2009.
[ bib | http ]
Profile Hidden Markov Models (PHMMs) have been widely used as models for Multiple Sequence Alignments. By their nature, they are generative one-class classifiers trained only on sequences belonging to the target class they represent. Nevertheless, they are often used to discriminate between classes. In this paper, we investigate the beneficial effects of information from non-target classes in discriminative tasks. Firstly, the traditional PHMM is extended to a new binary classifier. Secondly, we propose propositional representations of the original PHMM that capture information from target and non-target sequences and can be used with standard binary classifiers. Since PHMM training is time intensive, we investigate whether our approach allows the training of the PHMM to stop, before it is fully converged, without loss of predictive power.

[270]

Relational Random Forests Based on Random Relational Rules

Grant Anderson and Bernhard Pfahringer. Relational random forests based on random relational rules. In Proc 21st International Joint Conference on Artificial Intelligence, Pasadena, California, pages 986-991. AAAI Press, 2009.
[ bib | .pdf ]
Random Forests have been shown to perform very well in propositional learning. FORF is an upgrade of Random Forests for relational data. In this paper we investigate shortcomings of FORF and propose an alternative algorithm, RF, for generating Random Forests over relational data. RF employs randomly generated relational rules as fully self-contained Boolean tests inside each node in a tree and thus can be viewed as an instance of dynamic propositionalization. The implementation of RF allows for the simultaneous or parallel growth of all the branches of all the trees in the ensemble in an efficient shared, but still single-threaded way. Experiments favorably compare RF to both FORF and the combination of static propositionalization together with standard Random Forests. Various strategies for tree initialization and splitting of nodes, as well as resulting ensemble size, diversity, and computational complexity of RF are also investigated.

[271]

The WEKA data mining software: an update

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10-18, 2009.
[ bib | .pdf ]
More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been re-written entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.

[272]

3D Face Recognition Using Multiview Keypoint Matching

Michael Mayo and Edmond Zhang. 3d face recognition using multiview keypoint matching. In Proc 6th International Conference on Advanced Video and Signal Based Surveillance, Genova, Italy, pages 290-295. IEEE Computer Society, 2009.
[ bib | http ]
A novel algorithm for 3D face recognition based point cloud rotations, multiple projections, and voted keypoint matching is proposed and evaluated. The basic idea is to rotate each 3D point cloud representing an individual's face around the x, y or z axes, iteratively projecting the 3D points onto multiple 2.5D images at each step of the rotation. Labelled keypoints are then extracted from the resulting collection of 2.5D images, and this much smaller set of keypoints replaces the original face scan and its projections in the face database. Unknown test faces are recognised firstly by performing the same multiview keypoint extraction technique, and secondly, the application of a new weighted keypoint matching algorithm. In an extensive evaluation using the GavabDB 3D face recognition dataset (61 subjects, 9 scans per subject), our method achieves up to 95% recognition accuracy for faces with neutral expressions only, and over 90% accuracy for face recognition where expressions (such as a smile or a strong laugh) and random faceoccluding gestures are permitted.

[273]

SIFTing the Relevant from the Irrelevant: Automatically Detecting Objects in Training Images

Edmond Zhang and Michael Mayo. Sifting the relevant from the irrelevant: Automatically detecting objects in training images. In Proc Digital Image Computing: Techniques and Applications, Melbourne, Australia, pages 317-324. IEEE Computer Society, 2009.
[ bib | http ]
Many state-of-the-art object recognition systems rely on identifying the location of objects in images, in order to better learn its visual attributes. In this paper, we propose four simple yet powerful hybrid ROI detection methods (combining both local and global features), based on frequently occurring keypoints. We show that our methods demonstrate competitive performance in two different types of datasets, the Caltech101 dataset and the GRAZ-02 dataset, where the pairs of keypoint bounding box method achieved the best accuracies overall.

[274]

New ensemble methods for evolving data streams

Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavaldà. New ensemble methods for evolving data streams. In KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 139-148, New York, NY, USA, 2009. ACM.
[ bib | .pdf ]
Advanced analysis of data streams is quickly becoming a key area of data mining research as the number of applications demanding such processing increases. Online mining when such data streams evolve over time, that is when concepts drift or change completely, is becoming one of the core issues. When tackling non-stationary concepts, ensembles of classifiers have several advantages over single classifier methods: they are easy to scale and parallelize, they can adapt to change quickly by pruning under-performing parts of the ensemble, and they therefore usually also generate more accurate concept descriptions. This paper proposes a new experimental data stream framework for studying concept drift, and two new variants of Bagging: ADWIN Bagging and Adaptive-Size Hoeffding Tree (ASHT) Bagging. Using the new experimental framework, an evaluation study on synthetic and real-world datasets comprising up to ten million examples shows that the new ensemble methods perform very well compared to several known methods.

[275]

Improving Adaptive Bagging Methods for Evolving Data Streams

Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Ricard Gavaldà. Improving adaptive bagging methods for evolving data streams. In Proceedings of the 1st Asian Conference on Machine Learning, Nanjing, China. Springer, 2009.
[ bib | .pdf ]
We propose two new improvements for bagging methods on evolving data streams. Recently, two new variants of Bagging were proposed: ADWIN Bagging and Adaptive-Size Hoeffding Tree (ASHT) Bagging. ASHT Bagging uses trees of different sizes, and ADWIN Bagging uses ADWIN as a change detector to decide when to discard underperforming ensemble members. We improve ADWIN Bagging using Hoeffding Adaptive Trees, trees that can adaptively learn from data streams that change over time. To speed up the time for adapting to change of Adaptive-Size Hoeffding Tree (ASHT) Bagging, we add an error change detector for each classifier. We test our improvements by performing an evaluation study on synthetic and real-world datasets comprising up to ten million examples.

[276]

Conditional Density Estimation with Class Probability Estimators

Eibe Frank and Remco Bouckaert. Conditional density estimation with class probability estimators. In Proceedings of the 1st Asian Conference on Machine Learning, Nanjing, China. Springer, 2009.
[ bib | .pdf ]
Many regression schemes deliver a point estimate only, but often it is useful or even essential to quantify the uncertainty inherent in a prediction. If a conditional density estimate is available, then prediction intervals can be derived from it. In this paper we compare three techniques for computing conditional density estimates using a class probability estimator, where this estimator is applied to the discretized target variable and used to derive instance weights for an underlying univariate density estimator; this yields a conditional density estimate. The three density estimators we compare are: a histogram estimator that has been used previously in this context, a normal density estimator, and a kernel estimator. In our experiments, the latter two deliver better performance, both in terms of cross-validated log-likelihood and in terms of quality of the resulting prediction intervals. The empirical coverage of the intervals is close to the desired confidence level in most cases. We also include results for point estimation, as well as a comparison to Gaussian process regression and nonparametric quantile estimation.

[277]

Tree-based Density Estimation: Algorithms and Applications

Gabi Schmidberger. Tree-based Density Estimation: Algorithms and Applications. PhD thesis, Department of Computer Science, University of Waikato, 2009.
[ bib | http ]
Data Mining can be seen as an extension to statistics. It comprises the preparation of data and the process of gathering new knowledge from it. The extraction of new knowledge is supported by various machine learning methods. Many of the algorithms are based on probabilistic principles or use density estimations for their computations. Density estimation has been practised in the field of statistics for several centuries. In the simplest case, a histogram estimator, like the simple equalwidth histogram, can be used for this task and has been shown to be a practical tool to represent the distribution of data visually and for computation. Like other nonparametric approaches, it can provide a flexible solution. However, flexibility in existing approaches is generally restricted because the size of the bins is fixed either the width of the bins or the number of values in them. Attempts have been made to generate histograms with a variable bin width and a variable number of values per interval, but the computational approaches in these methods have proven too difficult and too slow even with modern computer technology. In this thesis new flexible histogram estimation methods are developed and tested as part of various machine learning tasks, namely discretization, naive Bayes classification, clustering and multiple-instance learning. Not only are the new density estimation methods applied to machine learning tasks, they also borrow design principles from algorithms that are ubiquitous in artificial intelligence: divide-andconquer methods are a well known way to tackle large problems by dividing them into small subproblems. Decision trees, used for machine learning classification, successfully apply this approach. This thesis presents algorithms that build density estimators using a binary split tree to cut a range of values into subranges of varying length. No class values are required for this splitting process, making it an unsupervised method. The result is a histogram estimator that adapts well even to complex density functions a novel density estimation method with flexible density estimation ability and good computational behaviour. Algorithms are presented for both univariate and multivariate data. The univariate histogram estimator is applied to discretization for density estimation and also used as density estimator inside a naive Bayes classifier. The multivariate histogram, used as the basis for a clustering method, is applied to improve the runtime behaviour of a well-known algorithm for multiple-instance classification. Performance in these applications is evaluated by comparing the new approaches with existing methods.

[278]

Classifier Chains for Multi-label Classification

Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. In Proc 13th European Conference on Principles and Practice of Knowledge Discovery in Databases and 20th European Conference on Machine Learning, Bled, Slovenia. Springer, 2009.
[ bib | .pdf ]
The widely known binary relevance method for multi-label classification, which considers each label as an independent binary problem, has been sidelined in the literature due to the perceived inadequacy of its label-independence assumption. Instead, most current methods invest considerable complexity to model interdependencies between labels. This paper shows that binary relevance-based methods have much to offer, especially in terms of scalability to large datasets. We exemplify this with a novel chaining method that can model label correlations while maintaining acceptable computational complexity. Empirical evaluation over a broad range of multi-label datasets with a variety of evaluation metrics demonstrates the competitiveness of our chaining method against related and state-of-the-art methods, both in terms of predictive performance and time complexity.

[279]

Continuous Typist Verification using Machine Learning

Kathryn Hempstalk. Continuous Typist Verification using Machine Learning. PhD thesis, Department of Computer Science, University of Waikato, 2009.
[ bib | http ]
A keyboard is a simple input device. Its function is to send keystroke information to the computer (or other device) to which it is attached. Normally this information is employed solely to produce text, but it can also be utilized as part of an authentication system. Typist verification exploits a typist's patterns to check whether they are who they say they are, even after standard authentication schemes have confirmed their identity. This thesis investigates whether typists behave in a sufficiently unique yet consistent manner to enable an effective level of verification based on their typing patterns. Typist verification depends on more than the typist's behaviour. The quality of the patterns and the algorithms used to compare them also determine how accurately verification is performed. This thesis sheds light on all technical aspects of the problem, including data collection, feature identification and extraction, and sample classification. A dataset has been collected that is comparable in size, timing accuracy and content to others in the field, with one important exception: it is derived from real emails, rather than samples collected in an artificial setting. This dataset is used to gain insight into what features distinguish typists from one another. The features and dataset are used to train learning algorithms that make judgements on the origin of previously unseen typing samples. These algorithms use 'one-class classification'; they make predictions for a particular user having been trained on only that user's patterns. This thesis examines many one-class classification algorithms, including ones designed specifically for typist verification. New algorithms and features are proposed to increase speed and accuracy. The best method proposed performs at the state of the art in terms of classification accuracy, while decreasing the time taken for a prediction from minutes to seconds, and - more importantly - without requiring any negative data from other users. Also, it is general: it applies not only to typist verification, but to any other one-class classification problem. Overall, this thesis concludes that typist verification can be considered a useful biometric technique.

[280]

Human-competitive Automatic Topic Indexing

Olena Medelyan. Human-competitive Automatic Topic Indexing. PhD thesis, Department of Computer Science, University of Waikato, 2009.
[ bib | http ]
Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document's topics helps people judge its relevance quickly. However, assigning topics manually is labor intensive. This thesis shows how to generate them automatically in a way that competes with human performance. Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples. This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers, and by amateurs. We claim that the algorithm is human-competitive because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages.

[281]

Human-competitive tagging using automatic keyphrase extraction

Olena Medelyan, Eibe Frank, and Ian H. Witten. Human-competitive tagging using automatic keyphrase extraction. In Proc Conf on Empirical Methods in Natural Language Processing. ACL, 2009.
[ bib | .pdf ]
This paper connects two research areas: automatic tagging on the web and statistical keyphrase extraction. First, we analyze the quality of tags in a collaboratively created folksonomy using traditional evaluation techniques. Next, we demonstrate how documents can be tagged automatically with a state-of-the-art keyphrase extraction algorithm, and further improve performance in this new domain using a new algorithm, ``Maui'', that utilizes semantic information extracted from Wikipedia. Maui outperforms existing approaches and extracts tags that are competitive with those assigned by the best performing human taggers.

[282]

Clustering Documents using a Wikipedia-based Concept Representation

Anna Huang, David Milne, Eibe Frank, and Ian H. Witten. Clustering documents using a wikipedia-based concept representation. In Proc 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Bangkog, Thailand, pages 628-636. Springer, 2009.
[ bib | .pdf ]
This paper shows how Wikipedia and the semantic knowledge it contains can be exploited for document clustering. We first create a concept-based document representation by mapping the terms and phrases within documents to their corresponding articles (or concepts) in Wikipedia. We also developed a similarity measure that evaluates the semantic relatedness between concept sets for two documents. We test the concept-based representation and the similarity measure on two standard text document datasets. Empirical results show that although further optimizations could be performed, our approach already improves upon related techniques.

[283]

Sampling-based Prediction of Algorithm Runtime

Quan Sun. Sampling-based prediction of algorithm runtime. Master's thesis, Department of Computer Science, University of Waikato, 2009.
[ bib | http ]
The ability to handle and analyse massive amounts of data has been progres- sively improved during the last decade with the growth of computing power and the opening up of the Internet era. Nowadays, machine learning algorithms have been widely applied in various fields of engineering sciences and in real world applications. However, currently, users of machine learning algorithms do not usually receive feedback on when a given algorithm will have finished building a model for a particular data set. While in theory such estimation can be obtained by asymptotic performance analysis, the complexity of machine learning algorithms means theoretical asymptotic performance analysis can be a very difficult task. This work has two goals. The first goal is to investigate how to use sampling-based techniques to predict the running time of a ma- chine learning algorithm training on a particular data set. The second goal is to empirically evaluate a set of sampling-based running time prediction meth- ods. Experimental results show that, with some care in the sampling stage, application of appropriate transformations on the running time observations followed by the use of suitable curve fitting algorithms makes it possible to obtain useful average-case running time predictions and an approximate time function for a given machine learning algorithm building a model on a particular data set. There are 41 WEKA (Witten Frank, 2005) machine learning algorithms are used for the experiments.

[284]

Large-scale attribute selection using wrappers

Martin Gütlein, Eibe Frank, Mark Hall, and Andreas Karwath. Large-scale attribute selection using wrappers. In Proc IEEE Symposium on Computational Intelligence and Data Mining, pages 332-339. IEEE, 2009.
[ bib | .pdf ]
Scheme-specific attribute selection with the wrapper and variants of forward selection is a popular attribute selection technique for classification that yields good results. However, it can run the risk of overfitting because of the extent of the search and the extensive use of internal cross-validation. Moreover, although wrapper evaluators tend to achieve superior accuracy compared to filters, they face a high computational cost. The problems of overfitting and high runtime occur in particular on high-dimensional datasets, like microarray data. We investigate Linear Forward Selection, a technique to reduce the number of attributes expansions in each forward selection step. Our experiments demonstrate that this approach is faster, finds smaller subsets and can even increase the accuracy compared to standard forward selection. We also investigate a variant that applies explicit subset size determination in forward selection to combat overfitting, where the search is forced to stop at a precomputed ``optimal'' subset size. We show that this technique reduces subset size while maintaining comparable accuracy.

[285]

Analysing chromatographic data using data mining to monitor petroleum content in water

G. Holmes, D. Fletcher, P. Reutemann, and E. Frank. Analysing chromatographic data using data mining to monitor petroleum content in water. In Proc 4th International ICSC Symposium, Thessaloniki, Greece, Information Technologies in Enviromental Engineering, pages 279-290, May 2009.
[ bib | .pdf ]
Chromatography is an important analytical technique that has wide- spread use in environmental applications. A typical application is the monitoring of water samples to determine if they contain petroleum. These tests are mandated in many countries to enable environmental agencies to determine if tanks used to store petrol are leaking into local water systems.
Chromatographic techniques, typically using gas or liquid chromatography coupled with mass spectrometry, allow an analyst to detect a vast array of compounds-potentially in the order of thousands. Accurate analysis relies heavily on the skills of a limited pool of experienced analysts utilising semi-automatic techniques to analyse these datasets-making the out- comes subjective.
The focus of current laboratory data analysis systems has been on refine- ments of existing approaches. The work described here represents a para- digm shift achieved through applying data mining techniques to tackle the problem. These techniques are compelling because the efficacy of pre- processing methods, which are essential in this application area, can be ob- jectively evaluated. This paper presents preliminary results using a data mining framework to predict the concentrations of petroleum compounds in water samples. Experiments demonstrate that the framework can be used to produce models of sufficient accuracy-measured in terms of root mean squared error and correlation coefficients-to offer the potential fo significantly reducing the time spent by analysts on this task.

[286]

Off-line signature verification

Robert L. Larkins. Off-line signature verification. Master's thesis, Department of Computer Science, University of Waikato, 2009.
[ bib | http ]
In today's society signatures are the most accepted form of identity verification. However, they have the unfortunate side-effect of being easily abused by those who would feign the identification or intent of an individual. This thesis implements and tests current approaches to off-line signature verification with the goal of determining the most beneficial techniques that are available. This investigation will also introduce novel techniques that are shown to significantly boost the achieved classification accuracy for both person-dependent (one-class training) and person-independent (two-class training) signature verification learning strategies. The findings presented in this thesis show that many common techniques do not always give any significant advantage and in some cases they actually detract from the classification accuracy. Using the techniques that are proven to be most beneficial, an effective approach to signature verification is constructed, which achieves approximately 90% and 91% on the standard CEDAR and GPDS signature datasets respectively. These results are significantly better than the majority of results that have been previously published. Additionally, this approach is shown to remain relatively stable when a minimal number of training signatures are used, representing feasibility for real-world situations.

[287]

Machine learning for adaptive computer game opponents

Jonathan Miles. Machine learning for adaptive computer game opponents. Master's thesis, Department of Computer Science, University of Waikato, 2009.
[ bib ]
[288]

Prediction of Oestrus in Dairy Cows: An Application of Machine Learning to Skewed Data

Adam David Lynam. Prediction of oestrus in dairy cows: An application of machine learning to skewed data. Master's thesis, Department of Computer Science, University of Waikato, 2009.
[ bib | http ]
The Dairy industry requires accurate detection of oestrus(heat) in dairy cows to maximise output of the animals. Traditionally this is a process dependant on human observation and interpretation of the various signs of heat. Many areas of the dairy industry can be automated, however the detection of oestrus is an area that still requires human experts. This thesis investigates the application of Machine Learning classification techniques, on dairy cow milking data provided by the Livestock Improvement Corporation, to predict oestrus. The usefulness of various ensemble learning algorithms such as Bagging and Boosting are explored as well as specific skewed data techniques. An empirical study into the effectiveness of classifiers designed to target skewed data is included as a significant part of the investigation. Roughly Balanced Bagging and the novel Under Bagging classifiers are explored in considerable detail and found to perform quite favourably over the SMOTE technique for the datasets selected. This study uses non-dairy, commonplace, Machine Learning datasets; many of which are found in the UCI Machine Learning Repository.

[289]

Accuracy of machine learning models versus hand crafted expert systems - A credit scoring case study

Arie Ben-David and Eibe Frank. Accuracy of machine learning models versus hand crafted expert systems - a credit scoring case study. Expert Systems with Applications, 36(3):5264-5271, 2009.
[ bib | http ]
Relatively few publications compare machine learning models with expert systems when applied to the same problem domain. Most publications emphasize those cases where the former beat the latter. Is it a realistic picture of the state of the art? Some other findings are presented here. The accuracy of a real world mind crafted credit scoring expert system is compared with dozens of machine learning models. The results show that while some machine learning models can surpass the expert system's accuracy with statistical significance, most models do not. More interestingly, this happened only when the problem was treated as regression. In contrast, no machine learning model showed any statistically significant advantage over the expert system's accuracy when the same problem was treated as classification. Since the true nature of the class data was ordinal, the latter is the more appropriate setting. It is also shown that the answer to the question is highly dependent on the meter that is being used to define accuracy.

[290]

Encyclopedia of Database Systems

Ian H. Witten. Encyclopedia of Database Systems, chapter Classification, pages 331-335. Springer, 2009.
[ bib | http ]
[291]

Efficient multi-label classification for evolving data streams

Jesse Read. Efficient multi-label classification for evolving data streams. PhD thesis, Department of Computer Science, University of Waikato, 2010.
[ bib | http ]
Multi-label classification is relevant to many domains, such as text, image and other media, and bioinformatics. Researchers have already noticed that in multi-label data, correlations exist between labels, and a variety of approaches, drawing inspiration from many spheres of machine learning, have been able to model these correlations. However, data sources from the real world are growing ever larger and the multi-label task is particularly sensitive to this due to the complexity associated with multiple labels and the correlations between them. Consequently, many methods do not scale up to large problems. This thesis deals with scalable multi-label classification: methods which exhibit high predictive performance, but are also able to scale up to larger problems. The first major contribution is the pruned sets method, which is able to model label correlations directly for high predictive performance, but reduces overfitting and complexity over related methods by pruning and subsampling label sets, and can thus scale up to larger datasets. The second major contribution is the classifier chains method, which models correlations with a chain of binary classifiers. The use of binary models allows for scalability to even larger datasets. Pruned sets and classifier chains are robust with respect to both the variety and scale of data that they can deal with, and can be incorporated into other methods. In an ensemble scheme, these methods are able to compete with state-of-the-art methods in terms of predictive performance as well as scale up to large datasets of hundreds of thousands of training examples. This thesis also puts a special emphasis on multi-label evaluation; introducing a new evaluation measure and studying threshold calibration. With one of the largest and most varied collections of multi-label datasets in the literature, extensive experimental evaluation shows the advantage of these methods, both in terms of predictive performance, and computational efficiency and scalability.

[292]

WEKA-Experiences with a Java Open-Source Project

Remco R. Bouckaert, Eibe Frank, Mark A. Hall, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. WEKA-experiences with a java open-source project. Journal of Machine Learning Research, 11:2533-2541, 2010.
[ bib | .pdf ]
WEKA is a popular machine learning workbench with a development life of nearly two decades. This article provides an overview of the factors that we believe to be important to its success. Rather than focussing on the software's functionality, we review aspects of project management and historical development decisions that likely had an impact on the uptake of the project.

[293]

Accurate Ensembles for Data Streams: Combining Restricted Hoeffding Trees using Stacking

Albert Bifet, Eibe Frank, Geoffrey Holmes, and Bernhard Pfahringer. Accurate ensembles for data streams: Combining restricted Hoeffding trees using stacking. In Proc 2nd Asian Conference on Machine Learning, Tokyo. JMLR, 2010.
[ bib | .pdf ]
The success of simple methods for classification shows that is is often not necessary to model complex attribute interactions to obtain good classification accuracy on practical problems. In this paper, we propose to exploit this phenomenon in the data stream context by building an ensemble of Hoeffding trees that are each limited to a small subset of attributes. In this way, each tree is restricted to model interactions between attributes in its corresponding subset. Because it is not known a priori which attribute subsets are relevant for prediction, we build exhaustive ensembles that consider all possible attribute subsets of a given size. As the resulting Hoeffding trees are not all equally important, we weigh them in a suitable manner to obtain accurate classifications. This is done by combining the log-odds of their probability estimates using sigmoid perceptrons, with one perceptron per class. We propose a mechanism for setting the perceptrons' learning rate using the ADWIN change detection method for data streams, and also use ADWIN to reset ensemble members (i.e. Hoeffding trees) when they no longer perform well. Our experiments show that the resulting ensemble classifier outperforms bagging for data streams in terms of accuracy when both are used in conjunction with adaptive naive Bayes Hoeffding trees, at the expense of runtime and memory consumption.

[294]

Enhanced spatial pyramid matching using log-polar-based image subdivision and representation

Edmond Zhang and Michael Mayo. Enhanced spatial pyramid matching using log-polar-based image subdivision and representation. In Proc International Conference on Digital Image Computing: Techniques and Applications, Sydney, Australia. IEEE Computer Society, 2010.
[ bib | http ]
This paper presents a new model for capturing spatial information for object categorization with bag-of-words (BOW). BOW models have recently become popular for the task of object recognition, owing to their good performance and simplicity. Much work has been proposed over the years to improve the BOW model, where the Spatial Pyramid Matching (SPM) technique is the most notable. We propose a new method to exploit spatial relationships between image features, based on binned log-polar grids. Our model works by partitioning the image into grids of different scales and orientations and computing histogram of local features within each grid. Experimental results show that our approach improves the results on three diverse datasets over the SPM technique.

[295]

Speeding Up and Boosting Diverse Density Learning

James R. Foulds and Eibe Frank. Speeding up and boosting diverse density learning. In Proc 13th International Conference on Discovery Science, Canberra, Australia, pages 102-116. Springer, 2010.
[ bib | .pdf ]
In multi-instance learning, each example is described by a bag of instances instead of a single feature vector. In this paper, we revisit the idea of performing multi-instance classification based on a point-and-scaling concept by searching for the point in instance space with the highest diverse density. This is a computationally expensive process, and we describe several heuristics designed to improve runtime. Our results show that simple variants of existing algorithms can be used to find diverse density maxima more efficiently. We also show how significant increases in accuracy can be obtained by applying a boosting algorithm with a modified version of the diverse density algorithm as the weak learner.

[296]

Sentiment Knowledge Discovery in Twitter Streaming Data

Albert Bifet and Eibe Frank. Sentiment knowledge discovery in Twitter streaming data. In Proc 13th International Conference on Discovery Science, Canberra, Australia, pages 1-15. Springer, 2010.
[ bib | .pdf ]
Micro-blogs are a challenging new source of information for data mining techniques. Twitter is a micro-blogging service built to dis- cover what is happening at any moment in time, anywhere in the world. Twitter messages are short, and generated constantly, and well suited for knowledge discovery using data stream mining. We briefly discuss the challenges that Twitter data streams pose, focusing on classification problems, and then consider these streams for opinion mining and sentiment analysis. To deal with streaming unbalanced classes, we propose a sliding window Kappa statistic for evaluation in time-changing data streams. Using this statistic we perform a study on Twitter data using learning algorithms for data streams.

[297]

A Study of Hierarchical and Flat Classification of Proteins

Arthur Zimek, Fabian Buchwald, Eibe Frank, and Stefan Kramer. A study of hierarchical and flat classification of proteins. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7:563-571, 2010.
[ bib | http ]
Automatic classification of proteins using machine learning is an important problem that has received significant attention in the literature. One feature of this problem is that expert-defined hierarchies of protein classes exist and can potentially be exploited to improve classification performance. In this article we investigate empirically whether this is the case for two such hierarchies. We compare multi-class classification techniques that exploit the information in those class hierarchies and those that do not, using logistic regression, decision trees, bagged decision trees, and support vector machines as the underlying base learners. In particular, we compare hierarchical and flat variants of ensembles of nested dichotomies. The latter have been shown to deliver strong classification performance in multi-class settings. We present experimental results for synthetic, fold recognition, enzyme classification, and remote homology detection data. Our results show that exploiting the class hierarchy improves performance on the synthetic data, but not in the case of the protein classification problems. Based on this we recommend that strong flat multi-class methods be used as a baseline to establish the benefit of exploiting class hierarchies in this area.

[298]

Evolving concurrent Petri net models of epistasis

Michael Mayo and Lorenzo Beretta. Evolving concurrent petri net models of epistasis. In Proc 2nd Asian Conference on Intelligent Information and Database Systems, Hue City, Vietnam, pages 166-175. Springer, 2010.
[ bib | http ]
A genetic algorithm is used to learn a non-deterministic Petri netbased model of non-linear gene interactions, or statistical epistasis. Petri nets are computational models of concurrent processes. However, often certain global assumptions (e.g. transition priorities) are required in order to convert a non-deterministic Petri net into a simpler deterministic model for easier analysis and evaluation. We show, by converting a Petri net into a set of state trees, that it is possible to both retain Petri net non-determinism (i.e. allowing local interactions only, thereby making the model more realistic), whilst also learning useful Petri nets with practical applications. Our Petri nets produce predictions of genetic disease risk assessments derived from clinical data that match with over 92% accuracy.

[299]

A 3-factor epistatic model predicts digital ulcers in Italian scleroderma patients

Lorenzo Beretta, Alessandro Santaniello, Michael Mayo, Francesca Cappiello, Maurizio Marchini, and Raffaella Scorza. A 3-factor epistatic model predicts digital ulcers in italian scleroderma patients. European Journal of Internal Medicine, 21(4):347-353, 2010.
[ bib | http ]
Background The genetic background may predispose systemic sclerosis (SSc) patients to the development of digital ulcers (DUs). Methods Twenty-two functional cytokine single nucleotide polymorphisms (SNPs) and 3 HLA class I and II antigens were typed at the genomic level by polymerase chain reaction in 200 Italian SSc patients. Associations with DUs were sought by parametric models and with the Multifactor Dimensionality Reduction (MDR) algorithm to depict the presence of epistasis. Biological models consistent with MDR results were built by means of Petri nets to describe the metabolic significance of our findings. Results On the exploratory analysis, the diffuse cutaneous subset (dcSSc) was the only single factor statistically associated with DUs (p = 0.045, ns after Bonferroni correction). Gene-gene analysis showed that a 3-factor model comprising the IL-6 C-174G, the IL-2 G-330T SNPs and the HLA-B*3501 allele was predictive for the occurrence of DUs in our population (testing accuracy = 66.9%; p < 0.0001, permutation testing). Conclusion Biological interpretation via Petri net showed that IL-6 is a key factor in determining DUs occurrence and that this cytokines may synergise with HLA-B*3501 to determine DUs onset. Owing to the limited number of patients included in the study, future research are needed to replicate our statistical findings as well as to better determine their functional meaning.

[300]

Predicting polycyclic aromatic hydrocarbon concentrations in soil and water samples

Geoffrey Holmes, Dale Fletcher, and Peter Reutemann. Predicting polycyclic aromatic hydrocarbon concentrations in soil and water samples. In Proc International Congress on Environmental Modelling and Software, 2010.
[ bib | http ]
Polycyclic Aromatic Hydrocarbons (PAHs) are compounds found in the environment that can be harmful to humans. They are typically formed due to incomplete combustion and as such remain after burning coal, oil, petrol, diesel, wood, household waste and so forth. Testing laboratories routinely screen soil and water samples taken from potentially contaminated sites for PAHs using Gas Chromatography Mass Spectrometry (GC-MS). A GC-MS device produces a chromatogram which is processed by an analyst to determine the concentrations of PAH compounds of interest. In this paper we investigate the application of data mining techniques to PAH chromatograms in order to provide reliable prediction of compound concentrations. A workflow engine with an easy-to-use graphical user interface is at the heart of processing the data. This engine allows a domain expert to set up workflows that can load the data, preprocess it in parallel in various ways and convert it into data suitable for data mining toolkits. The generated output can then be evaluated using different data mining techniques, to determine the impact of preprocessing steps on the performance of the generated models and for picking the best approach. Encouraging results for predicting PAH compound concentrations, in terms of correlation coefficients and root-mean-squared error are demonstrated.

[301]

Fast Conditional Density Estimation for Quantitative Structure-Activity Relationships

Fabian Buchwald, Tobias Girschick, Eibe Frank, and Stefan Kramer. Fast conditional density estimation for quantitative structure-activity relationships. In Proc 24th AAAI Conference on Artificial Intelligence, 2010.
[ bib | http ]
Many methods for quantitative structure-activity relation- ships (QSARs) deliver point estimates only, without quantifying the uncertainty inherent in the prediction. One way to quantify the uncertainy of a QSAR prediction is to pre- dict the conditional density of the activity given the structure instead of a point estimate. If a conditional density estimate is available, it is easy to derive prediction intervals of activities. In this paper, we experimentally evaluate and compare three methods for conditional density estimation for their suitability in QSAR modeling. In contrast to traditional methods for conditional density estimation, they are based on generic machine learning schemes, more specifically, class probability estimators. Our experiments show that a kernel estimator based on class probability estimates from a random forest classifier is highly competitive with Gaussian process regression, while taking only a fraction of the time for train- ing. Therefore, generic machine-learning based methods for conditional density estimation may be a good and fast option for quantifying uncertainty in QSAR modeling.

[302]

Fast Perceptron Decision Tree Learning from Evolving Data Streams

Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Eibe Frank. Fast perceptron decision tree learning from evolving data streams. In Proc 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Hyderabad, India, pages 299-310. Springer, 2010.
[ bib | .pdf ]
Mining of data streams must balance three evaluation dimensions: accuracy, time and memory. Excellent accuracy on data streams has been obtained with Naive Bayes Hoeffding Trees-Hoeffding Trees with naive Bayes models at the leaf nodes-albeit with increased runtime compared to standard Hoeffding Trees. In this paper, we show that runtime can be reduced by replacing naive Bayes with perceptron classifiers, while maintaining highly competitive accuracy. We also show that accuracy can be increased even further by combining majority vote, naive Bayes, and perceptrons. We evaluate four perceptron-based learning strategies and compare them against appropriate baselines: simple perceptrons, Perceptron Hoeffding Trees, hybrid Naive Bayes Perceptron Trees, and bagged versions thereof. We implement a perceptron that uses the sigmoid activation function instead of the threshold activation function and optimizes the squared error, with one perceptron per class value. We test our methods by performing an evaluation study on synthetic and real-world datasets comprising up to ten million examples.

[303]

MOA: Massive Online Analysis

Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. Moa: Massive online analysis. Journal of Machine Learning Research (JMLR), 2010.
[ bib | .pdf ]
Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA includes a collection of offline and online methods as well as tools for evaluation. In particular, it implements boosting, bagging, and Hoeffding Trees, all with and without Naive Bayes classifiers at the leaves. MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license.

[304]

GNUsmail: Open Framework for On-line Email Classification

José M. Carmona-Cejudo, Manuel Baena-García, José del Campo-Ávila, Rafael Morales Bueno, and Albert Bifet. Gnusmail: Open framework for on-line email classification. In ECAI 2010 - 19th European Conference on Artificial Intelligence, Lisbon, Portugal, August 16-20, 2010, Proceedings, pages 1141-1142, 2010.
[ bib | http ]
Real-time classification of massive email data is a challenging task that presents its own particular difficulties. Since email data presents an important temporal component, several problems arise: emails arrive continuously, and the criteria used to classify those emails can change, so the learning algorithms have to be able to deal with concept drift. Our problem is more general than spam detection, which has received much more attention in the literature. In this paper we present GNUsmail, an open-source extensible framework for email classification, which structure supports incremental and on-line learning. This framework enables the incorporation of algorithms developed by other researchers, such as those included in WEKA and MOA. We evaluate this framework, characterized by two overlapping phases (pre-processing and learning), using the ENRON dataset, and we compare the results achieved by WEKA and MOA algorithms.

[305]

Leveraging Bagging for Evolving Data Streams

Albert Bifet, Geoffrey Holmes, and Bernhard Pfahringer. Leveraging bagging for evolving data streams. In Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2010, Barcelona, Spain, September 20-24, 2010, Proceedings, Part I, pages 135-150, 2010.
[ bib | http ]
Bagging, boosting and Random Forests are classical ensemble methods used to improve the performance of single classifiers. They obtain superior performance by increasing the accuracy and diversity of the single classifiers. Attempts have been made to reproduce these methods in the more challenging context of evolving data streams. In this paper, we propose a new variant of bagging, called leveraging bagging. This method combines the simplicity of bagging with adding more randomization to the input, and output of the classifiers. We test our method by performing an evaluation study on synthetic and real-world datasets comprising up to ten million examples.

[306]

MOA: Massive Online Analysis, a Framework for Stream Classification and Clustering

Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Philipp Kranen, Hardy Kremer, Timm Jansen, and Thomas Seidl. Moa: Massive online analysis, a framework for stream classification and clustering. Journal of Machine Learning Research - Proceedings Track, 11:44-50, 2010.
[ bib | .html ]
Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA is designed to deal with the challenging problem of scaling up the implementation of state of the art algorithms to real world dataset sizes. It contains collection of offline and online for both classification and clustering as well as tools for evaluation. In particular, for classification it implements boosting, bagging, and Hoeffding Trees, all with and without Naive Bayes classifiers at the leaves. For clustering, it implements StreamKM++, CluStream, ClusTree, Den-Stream, D-Stream and CobWeb. Researchers benefit from MOA by getting insights into workings and problems of different approaches, practitioners can easily apply and compare several algorithms to real world data set and settings. MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license.

[307]

Clustering Performance on Evolving Data Streams: Assessing Algorithms and Evaluation Measures within MOA

Philipp Kranen, Hardy Kremer, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. Clustering performance on evolving data streams: Assessing algorithms and evaluation measures within moa. In ICDM Workshops, pages 1400-1403, 2010.
[ bib | http ]
In today's applications, evolving data streams are ubiquitous. Stream clustering algorithms were introduced to gain useful knowledge from these streams in real-time. The quality of the obtained clusterings, i.e. how good they reflect the data, can be assessed by evaluation measures. A multitude of stream clustering algorithms and evaluation measures for clusterings were introduced in the literature, however, until now there is no general tool for a direct comparison of the different algorithms or the evaluation measures. In our demo, we present a novel experimental framework for both tasks. It offers the means for extensive evaluation and visualization and is an extension of the Massive Online Analysis (MOA) software environment released under the GNU GPL License.

[308]

ADAPTIVE STREAM MINING: Pattern Learning and Mining from Evolving Data Streams

Albert Bifet. ADAPTIVE STREAM MINING: Pattern Learning and Mining from Evolving Data Streams. IOS Press, Amsterdam, 2010.
[ bib | http ]
[309]

A Review of Multi-Instance Learning Assumptions

James Foulds and Eibe Frank. A review of multi-instance learning assumptions. Knowledge Engineering Review, 25(1):1-25, 2010.
[ bib | .pdf ]
Multi-instance (MI) learning is a variant of inductive machine learning where each learning example contains a bag of instances instead of a single feature vector. The term commonly refers to the supervised setting, where each bag is associated with a label. This type of representation is a natural fit for a number of real-world learning scenarios, including drug activity prediction and image classification, hence many multi-instance learning algorithms have been proposed. Any MI learning method must relate instances to bag-level class labels, but many types of relationships between instances and class labels are possible. Although all early work in MI learning assumes a specific MI concept class known to be appropriate for a drug activity prediction domain, this standard MI assumption is not guaranteed to hold in other domains. Much of the recent work in MI learning has concentrated on a relaxed view of the MI problem, where the standard MI assumption is dropped, and alternative assumptions are considered instead. However, often it is not clearly stated what particular assumption is used and how it relates to other assumptions that have been proposed. In this paper, we aim to clarify the use of alternative MI assumptions by reviewing the work done in this area.

[310]

Clustering mixed data

Lynette Hunt and Murray Jorgensen. Clustering mixed data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(4):352-361, 2011.
[ bib | http ]
Mixture model clustering proceeds by fitting a finite mixture of multivariate distributions to data, the fitted mixture density then being used to allocate the data to one of the components. Common model formulations assume that either all the attributes are continuous or all the attributes are categorical. In this paper, we consider options for model formulation in the more practical case of mixed data: multivariate data sets that contain both continuous and categorical attributes.

[311]

Using the online cross-entropy method to learn relational policies for playing different games

Samuel Sarjant, Bernhard Pfahringer, Kurt Driessens, and Tony Smith. Using the online cross-entropy method to learn relational policies for playing different games. In Proceedings IEEE Conference on Computational Intelligence and Games, pages 182-189. IEEE, 2011.
[ bib ]
By defining a video-game environment as a collection of objects, relations, actions and rewards, the relational reinforcement learning algorithm presented in this paper generates and optimises a set of concise, human-readable relational rules for achieving maximal reward. Rule learning is achieved using a combination of incremental specialisation of rules and a modified online cross-entropy method, which dynamically adjusts the rate of learning as the agent progresses. The algorithm is tested on the Ms. Pac-Man and Mario environments, with results indicating the agent learns an effective policy for acting within each environment.

[312]

Classifier chains for multi-label classification

Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333-359, 2011.
[ bib | http ]
The widely known binary relevance method for multi-label classification, which considers each label as an independent binary problem, has often been overlooked in the literature due to the perceived inadequacy of not directly modelling label correlations. Most current methods invest considerable complexity to model interdependencies between labels. This paper shows that binary relevance-based methods have much to offer, and that high predictive performance can be obtained without impeding scalability to large datasets. We exemplify this with a novel classifier chains method that can model label correlations while maintaining acceptable computational complexity. We extend this approach further in an ensemble framework. An extensive empirical evaluation covers a broad range of multi-label datasets with a variety of evaluation metrics. The results illustrate the competitiveness of the chaining method against related and state-of-the-art methods, both in terms of predictive performance and time complexity.

[313]

Beyond Trees: Adopting MITI to Learn Rules and Ensemble Classifiers for Multi-Instance Data

Luke Bjerring and Eibe Frank. Beyond trees: Adopting MITI to learn rules and ensemble classifiers for multi-instance data. In Proc 24th Australasian Joint Conference on Artificial Intelligence, Perth, Australia, pages 41-50. Springer, 2011.
[ bib | .pdf ]
MITI is a simple and elegant decision tree learner designed for multi-instance classification problems, where examples for learning consist of bags of instances. MITI grows a tree in best-first manner by maintaining a priority queue containing the unexpanded nodes in the fringe of the tree. When the head node contains instances from positive examples only, it is made into a leaf, and any bag of data that is associated with this leaf is removed. In this paper we first revisit the basic algorithm and consider the effect of parameter settings on classification accuracy, using several benchmark datasets. We show that the chosen splitting criterion in particular can have a significant effect on accuracy. We identify a potential weakness of the algorithm—subtrees can contain structure that has been created using data that is subsequently removed—and show that a simple modification turns the algorithm into a rule learner that avoids this problem. This rule learner produces more compact classifiers with comparable accuracy on the benchmark datasets we consider. Finally, we present randomized algorithm variants that enable us to generate ensemble classifiers. We show that these can yield substantially improved classification accuracy.

[314]

Using Output Codes for Two-class Classification Problems

Fanhua Zeng. Using output codes for two-class classification problems. Master's thesis, Department of Computer Science, University of Waikato, 2011.
[ bib | http ]
Error-correcting output codes (ECOCs) have been widely used in many applications for multi-class classification problems. The problem is that ECOCs cannot be ap- plied directly on two-class datasets. The goal of this thesis is to design and evaluate an approach to solve this problem, and then investigate whether the approach can yield better classification models. To be able to use ECOCs, we turn two-class datasets into multi-class datasets first, by using clustering. With the resulting multi-class datasets in hand, we evalu- ate three different encoding methods for ECOCs: exhaustive coding, random coding and a “pre-defined” code that is found using random search. The exhaustive coding method has the highest error-correcting abilities. However, this method is limited due to the exponential growth of bit columns in the codeword matrix precluding it from being used for problems with large numbers of classes. Random coding can be used to cover situations with large numbers of classes in the data. To improve on completely random matrices, “pre-defined” codeword matrices can be generated by using random search that optimizes row separation yielding better error correction than a purely random matrix. To speed up the process of finding good matrices, GPU parallel programming is investigated in this thesis. From the empirical results, we can say that the new algorithm, which applies multi-class ECOCs on two-class data using clustering, does improve the performance for some base learners, when compared to applying them directly to the original two- class datasets.

[315]

Smoothing in Probability Estimation Trees

Zhimeng Han. Smoothing in probability estimation trees. Master's thesis, Department of Computer Science, University of Waikato, 2011.
[ bib | http ]
Classification learning is a type of supervised machine learning technique that uses a classification model (e.g. decision tree) to predict unknown class labels for previously unseen instances. In many applications it can be very useful to additionally obtain class probabilities for the different class labels. Decision trees that yield these probabilities are also called probability estimation trees (PETs). Smoothing is a technique used to improve the probability estimates. There are several existing smoothing methods, such as the Laplace correction, M-Estimate smoothing and M-Branch smoothing. Smoothing does not just apply to PETs. In the field of text compression, PPM in particular, smoothing methods play a important role. This thesis migrates smoothing methods from text compression to PETs. The newly migrated methods in PETs are compared with the best of the existing smoothing methods considered in this thesis under different experiment setups. Unpruned, pruned and bagged trees are considered in the experiments. The main finding is that the PPM-based methods yield the best probability estimate when used with bagged trees, but not when used with individual (pruned or unpruned) trees.

[316]

Concept-based text clustering

Lan Huang. Concept-based text clustering. PhD thesis, Department of Computer Science, University of Waikato, 2011.
[ bib | http ]
Thematic organization of text is a natural practice of humans and a crucial task for today’s vast repositories. Clustering automates this by assessing the similarity between texts and organizing them accordingly, grouping like ones together and separating those with different topics. Clusters provide a comprehensive logical structure that facilitates exploration, search and interpretation of current texts, as well as organization of future ones. Automatic clustering is usually based on words. Text is represented by the words it mentions, and thematic similarity is based on the proportion of words that texts have in common. The resulting bag-of-words model is semantically ambiguous and undesirably orthogonal—it ignores the connections between words. This thesis claims that using concepts as the basis of clustering can signifi- cantly improve effectiveness. Concepts are defined as units of knowledge. When organized according to the relations among them, they form a concept system. Two concept systems are used here: WordNet, which focuses on word knowledge, and Wikipedia, which encompasses world knowledge. We investigate a clustering procedure with three components: using concepts to represent text; taking the semantic relations among them into account dur- ing clustering; and learning a text similarity measure from concepts and their relations. First, we demonstrate that concepts provide a succinct and informa- tive representation of the themes in text, exemplifying this with the two concept systems. Second, we define methods for utilizing concept relations to enhance clustering by making the representation models more discriminative and extend- ing thematic similarity beyond surface overlap. Third, we present a similarity measure based on concepts and their relations that is learned from a small num- ber of examples, and show that it both predicts similarity consistently with human judgement and improves clustering. The thesis provides strong support for the use of concept-based representations instead of the classic bag-of-words model.

[317]

Sequence-based protein classification: binary Profile Hidden Markov Models and propositionalisation

Stefan Mutter. Sequence-based protein classification: binary Profile Hidden Markov Models and propositionalisation. PhD thesis, Department of Computer Science, University of Waikato, 2011.
[ bib | http ]
Detecting similarity in biological sequences is a key element to understanding the mechanisms of life. Researchers infer potential structural, functional or evolutionary relationships from similarity. However, the concept of similarity is complex in biology. Sequences consist of different molecules with different chemical properties, have short and long distance interactions, form 3D structures and change through evolutionary processes. Amino acids are one of the key molecules of life. Most importantly, a sequence of amino acids constitutes the building block for proteins which play an essential role in cellular processes. This thesis investigates similarity amongst proteins. In this area of research there are two important and closely related classification tasks – the detection of similar proteins and the discrimination amongst them. Hidden Markov Models (HMMs) have been successfully applied to the detection task as they model sequence similarity very well. From a Machine Learning point of view these HMMs are essentially one-class classifiers trained solely on a small number of similar proteins neglecting the vast number of dissimilar ones. Our basic assumption is that integrating this neglected information will be highly beneficial to the classification task. Thus, we transform the problem representation from a one-class to a binary one. Equipped with the necessary sound understanding of Machine Learning, especially concerning problem representation and statistically significant evaluation, our work pursues and combines two different avenues on this aforementioned transformation. First, we introduce a binary HMM that discriminates significantly better than the standard one, even when only a fraction of the negative information is used. Second, we interpret the HMM as a structured graph of information. This information cannot be accessed by highly optimised standard Machine Learning classifiers as they expect a fixed length feature vector representation. Propositionalisation is a technique to transform the former representation into the latter. This thesis introduces new propositionalisation techniques. The change in representation changes the learning problem from a one-class, generative to a propositional, discriminative one. It is a common assumption that discriminative techniques are better suited for classification tasks, and our results validate this assumption. We suggest a new way to significantly improve on discriminative power and runtime by means of terminating the time-intense training of HMMs early, subsequently applying propositionalisation and classifying with a discriminative, binary learner.

[318]

Experiments with Multi-view Multi-instance Learning for Supervised Image Classification

Michael Mayo and Eibe Frank. Experiments with multi-view multi-instance learning for supervised image classification. In Proc 26th International Conference Image and Vision Computing New Zealand, Auckland, New Zealand, pages 363-369, 2011.
[ bib | .pdf ]
In this paper we empirically investigate the benefits of multi-view multi-instance (MVMI) learning for supervised image classification. In multi-instance learning, examples for learning contain bags of feature vectors and thus data from different views cannot simply be concatenated as in the single-instance case. Hence, multi-view learning, where one classifier is built per view, is particularly attractive when applying multi-instance learning to image classification. We take several diverse image data sets—ranging from person detection to astronomical object classification to species recognition—and derive a set of multiple instance views from each of them. We then show via an extensive set of 10×10 stratified cross-validation experiments that MVMI, based on averaging predicted confidence scores, generally exceeds the performance of traditional single-view multi-instance learning, when using support vector machines and boosting as the underlying learning algorithms.

[319]

A comparison of methods for estimating prediction intervals in NIR spectroscopy: Size matters

Remco R. Bouckaert, Eibe Frank, Geoffrey Holmes, and Dale Fletcher. A comparison of methods for estimating prediction intervals in NIR spectroscopy: Size matters. Chemometrics and Intelligent Laboratory Systems, 109(2):139 - 145, 2011.
[ bib | http ]
In this article we demonstrate that, when evaluating a method for determining prediction intervals, interval size matters more than coverage because the latter can be fixed at a chosen confidence level with good reliability. To achieve the desired coverage, we employ a simple non-parametric method to compute prediction intervals by calibrating estimated prediction errors, and we extend the basic method with a continuum correction to deal with small data sets. In our experiments on a collection of several NIR data sets, we evaluate several existing methods of computing prediction intervals for partial least-squares (PLS) regression. Our results show that, when coverage is fixed at a chosen confidence level, and the number of PLS components is selected to minimize squared error of point estimates, interval estimation based on the classic ordinary least-squares method produces the narrowest intervals, outperforming the U-deviation method and linearization, regardless of the confidence level that is chosen.

[320]

Data Mining: Practical Machine Learning Tools and Techniques

Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA, 3 edition, 2011.
[ bib | .html ]
[321]

Bagging Ensemble Selection

Quan Sun and Bernhard Pfahringer. Bagging ensemble selection. In Australasian Conference on Artificial Intelligence, pages 251-260. Springer, 2011.
[ bib | http ]
Ensemble selection has recently appeared as a popular ensemble learning method, not only because its implementation is fairly straightforward, but also due to its excellent predictive performance on practical problems. The method has been highlighted in winning solutions of many data mining competitions, such as the Netflix competition, the KDD Cup 2009 and 2010, the UCSD FICO contest 2010, and a number of data mining competitions on the Kaggle platform. In this paper we present a novel variant: bagging ensemble selection. Three variations of the proposed algorithm are compared to the original ensemble selection algorithm and other ensemble algorithms. Experiments with ten real world problems from diverse domains demonstrate the benefit of the bagging ensemble selection algorithm.

[322]

Semi-random Model Tree Ensembles: An Effective and Scalable Regression Method

Bernhard Pfahringer. Semi-random model tree ensembles: An effective and scalable regression method. In Australasian Conference on Artificial Intelligence, pages 231-240. Springer, 2011.
[ bib | http ]
We present and investigate ensembles of semi-random model trees as a novel regression method. Such ensembles combine the scalability of tree-based methods with predictive performance rivalling the state of the art in numeric prediction. An empirical investigation shows that Semi-Random Model Trees produce predictive performance which is competitive with state-of-the-art methods like Gaussian Processes Regression or Additive Groves of Regression Trees. The training and optimization of Random Model Trees scales better than Gaussian Processes Regression to larger datasets, and enjoys a constant advantage over Additive Groves of the order of one to two orders of magnitude.

[323]

A Model for Predicting the Resolution of Type 2 Diabetes in Severely Obese Subjects Following Roux-en Y Gastric Bypass Surgery

Mark Thomas Hayes, Lynette Anne Hunt, Jonathan Foo, Yulia Tychinskaya, and Richard Strawson Stubbs. A model for predicting the resolution of type 2 diabetes in severely obese subjects following roux-en y gastric bypass surgery. Obesity Surgery, 21(7):910-916, 2011.
[ bib | http ]
Severely obese type 2 diabetics who undergo Roux-en Y gastric bypass surgery have significant improvements in glycaemic control. Little work has been undertaken to establish the independent predictors of such resolution or to develop a predictive model. The aim of this study was to develop a mathematical model and establish independent predictors for the resolution of diabetes. METHODS: A consecutive sample of 130 severely obese type 2 diabetics who underwent gastric bypass surgery for weight loss from November 1997 to May 2007 with prospective pre-operative documentation of biochemical and clinical measurements was followed up over 12 months. Logistic discrimination analysis was undertaken to identify those variables with independent predictive value and to develop a predictive model for resolution of type 2 diabetes. Consecutive samples of 130 patients with body mass index (BMI) ≥ 35 with type 2 diabetes were selected. One hundred and twenty-seven patients completed the study with a sufficient data set. Patients were deemed unresolved if (1) diabetic medication was still required after surgery; (2) if fasting plasma glucose (FPG) remained >7 mmol/L; or (3) HbA1c remained >7%. RESULTS: Resolution of diabetes was seen in 84%, while diabetes remained but was improved in 16% of patients. Resolution was rapid and sustained with 74% of those on medication before surgery being able to discontinue this by the time of discharge 6 days following surgery. Five pre-operative variables were found to have independent predictive value for resolution of diabetes, including BMI, HbA1c, FPG, hypertension and requirement for insulin. Two models have been proposed for prediction of diabetes resolution, each with 86% correct classification in this cohort of patients. CONCLUSIONS: Type 2 diabetes resolves in a very high percentage of patients undergoing gastric bypass surgery for severe obesity. The key predictive variables include pre-operative BMI, HbA1c, FPG, the presence of hypertension and diabetic status.

[324]

Hybridizing Data Stream Mining and Technical Indicators in Automated Trading Systems

Michael Mayo. Hybridizing data stream mining and technical indicators in automated trading systems. In Modeling Decision for Artificial Intelligence - 8th International Conference, MDAI 2011, Changsha, Hunan, China, July 28-30, 2011, Proceedings, pages 79-90. Springer, 2011.
[ bib | http ]
Automated trading systems for financial markets can use data mining techniques for future price movement prediction. However, classifier accuracy is only one important component in such a system: the other is a decision procedure utilizing the prediction in order to be long, short or out of the market. In this paper, we investigate the use of technical indicators as a means of deciding when to trade in the direction of a classifier’s prediction. We compare this “hybrid” technical/data stream mining-based system with a naive system that always trades in the direction of predicted price movement. We are able to show via evaluations across five financial market datasets that our novel hybrid technique frequently outperforms the naive system. To strengthen our conclusions, we also include in our evaluation several “simple” trading strategies without any data mining component that provide a much stronger baseline for comparison than traditional buy-and-hold or sell-and-hold strategies.

[325]

Modelling epistasis in genetic disease using Petri nets, evolutionary computation and frequent itemset mining

Michael Mayo and Lorenzo Beretta. Modelling epistasis in genetic disease using petri nets, evolutionary computation and frequent itemset mining. Expert Syst. Appl., 38(4):4006-4013, 2011.
[ bib | http ]
Petri nets are useful for mathematically modelling disease-causing genetic epistasis. A Petri net model of an interaction has the potential to lead to biological insight into the cause of a genetic disease. However, defining a Petri net by hand for a particular interaction is extremely difficult because of the sheer complexity of the problem and degrees of freedom inherent in a Petri net’s architecture. We propose therefore a novel method, based on evolutionary computation and data mining, for automatically constructing Petri net models of non-linear gene interactions. The method comprises two main steps. Firstly, an initial partial Petri net is set up with several repeated sub-nets that model individual genes and a set of constraints, comprising relevant common sense and biological knowledge, is also defined. These constraints characterise the class of Petri nets that are desired. Secondly, this initial Petri net structure and the constraints are used as the input to a genetic algorithm. The genetic algorithm searches for a Petri net architecture that is both a superset of the initial net, and also conforms to all of the given constraints. The genetic algorithm evaluation function that we employ gives equal weighting to both the accuracy of the net and also its parsimony. We demonstrate our method using an epistatic model related to the presence of digital ulcers in systemic sclerosis patients that was recently reported in the literature. Our results show that although individual “perfect” Petri nets can frequently be discovered for this interaction, the true value of this approach lies in generating many different perfect nets, and applying data mining techniques to them in order to elucidate common and statistically significant patterns of interaction.

[326]

MOA-TweetReader: Real-Time Analysis in Twitter Streaming Data

Albert Bifet, Geoffrey Holmes, and Bernhard Pfahringer. Moa-tweetreader: Real-time analysis in twitter streaming data. In Discovery Science - 14th International Conference, DS 2011, Espoo, Finland, October 5-7, 2011. Proceedings, pages 46-60. Springer, 2011.
[ bib | http ]
Twitter is a micro-blogging service built to discover what is happening at any moment in time, anywhere in the world. Twitter messages are short, generated constantly, and well suited for knowledge discovery using data stream mining. We introduce MOA-TweetReader, a system for processing tweets in real time. We show two main applications of the new system for studying Twitter data: detecting changes in term frequencies and performing real-time sentiment analysis.

[327]

Active Learning with Evolving Streaming Data

Indre Zliobaite, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes. Active learning with evolving streaming data. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part III, pages 597-612. Springer, 2011.
[ bib | http ]
In learning to classify streaming data, obtaining the true labels may require major effort and may incur excessive cost. Active learning focuses on learning an accurate model with as few labels as possible. Streaming data poses additional challenges for active learning, since the data distribution may change over time (concept drift) and classifiers need to adapt. Conventional active learning strategies concentrate on querying the most uncertain instances, which are typically concentrated around the decision boundary. If changes do not occur close to the boundary, they will be missed and classifiers will fail to adapt. In this paper we develop two active learning strategies for streaming data that explicitly handle concept drift. They are based on uncertainty, dynamic allocation of labeling efforts over time and randomization of the search space. We empirically demonstrate that these strategies react well to changes that can occur anywhere in the instance space and unexpectedly.

[328]

MOA: A Real-Time Analytics Open Source Framework

Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Jesse Read, Philipp Kranen, Hardy Kremer, Timm Jansen, and Thomas Seidl. Moa: A real-time analytics open source framework. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part III, pages 617-620. Springer, 2011.
[ bib | http ]
Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA is designed to deal with the challenging problems of scaling up the implementation of state of the art algorithms to real world dataset sizes and of making algorithms comparable in benchmark streaming settings. It contains a collection of offline and online algorithms for classification, clustering and graph mining as well as tools for evaluation. For researchers the framework yields insights into advantages and disadvantages of different approaches and allows for the creation of benchmark streaming data sets through stored, shared and repeatable settings for the data feeds. Practitioners can use the framework to easily compare algorithms and apply them to real world data sets and settings. MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis. Besides providing algorithms and measures for evaluation and comparison, MOA is easily extensible with new contributions and allows for the creation of benchmark scenarios.

[329]

Online Evaluation of Email Streaming Classifiers Using GNUsmail

José M. Carmona-Cejudo, Manuel Baena-García, José del Campo-Ávila, Albert Bifet, Joao Gama, and Rafael Morales Bueno. Online evaluation of email streaming classifiers using gnusmail. In Advances in Intelligent Data Analysis X - 10th International Symposium, IDA 2011, Porto, Portugal, October 29-31, 2011. Proceedings, pages 90-100. Springer, 2011.
[ bib | http ]
Real-time email classification is a challenging task because of its online nature, subject to concept-drift. Identifying spam, where only two labels exist, has received great attention in the literature. We are nevertheless interested in classification involving multiple folders, which is an additional source of complexity. Moreover, neither cross-validation nor other sampling procedures are suitable for data streams evaluation. Therefore, other metrics, like the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using mechanisms such as fading factors. In this paper we present GNUsmail, an open-source extensible framework for email classification, and focus on its ability to perform online evaluation. GNUsmail’s architecture supports incremental and online learning, and it can be used to compare different online mining methods, using state-of-art evaluation metrics. We show how GNUsmail can be used to compare different algorithms, including a tool for launching replicable experiments.

[330]

Mining frequent closed graphs on evolving data streams

Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Ricard Gavaldà. Mining frequent closed graphs on evolving data streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, pages 591-599. ACM, 2011.
[ bib | http ]
Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this paper we present a framework for studying graph pattern mining on time-varying streams. Three new methods for mining frequent closed subgraphs are presented. All methods work on coresets of closed subgraphs, compressed representations of graph sets, and maintain these sets in a batch-incremental manner, but use different approaches to address potential concept drift. An evaluation study on datasets comprising up to four million graphs explores the strength and limitations of the proposed methods. To the best of our knowledge this is the first work on mining frequent closed subgraphs in non-stationary data streams.

[331]

An effective evaluation measure for clustering on evolving data streams

Hardy Kremer, Philipp Kranen, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. An effective evaluation measure for clustering on evolving data streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, pages 868-876. ACM, 2011.
[ bib | http ]
Due to the ever growing presence of data streams, there has been a considerable amount of research on stream mining algorithms. While many algorithms have been introduced that tackle the problem of clustering on evolving data streams, hardly any attention has been paid to appropriate evaluation measures. Measures developed for static scenarios, namely structural measures and ground-truth-based measures, cannot correctly reflect errors attributable to emerging, splitting, or moving clusters. These situations are inherent to the streaming context due to the dynamic changes in the data distribution. In this paper we develop a novel evaluation measure for stream clustering called Cluster Mapping Measure (CMM). CMM effectively indicates different types of errors by taking the important properties of evolving data streams into account. We show in extensive experiments on real and synthetic data that CMM is a robust measure for stream clustering evaluation.

[332]

Mining frequent closed trees in evolving data streams

Albert Bifet and Ricard Gavaldà. Mining frequent closed trees in evolving data streams. Intell. Data Anal., 15(1):29-48, 2011.
[ bib | http ]
We propose new algorithms for adaptively mining closed rooted trees, both labeled and unlabeled, from data streams that change over time. Closed patterns are powerful representatives of frequent patterns, since they eliminate redundant information. Our approach is based on an advantageous representation of trees and a low-complexity notion of relaxed closed trees, as well as ideas from Galois Lattice Theory. More precisely, we present three closed tree mining algorithms in sequence: an incremental one, IncTreeMiner, a sliding-window based one, WinTreeMiner, and finally one that mines closed trees adaptively from data streams, AdaTreeMiner. By adaptive we mean here that it presents at all times the closed trees that are frequent in the current state of the data stream. To the best of our knowledge this is the first work on mining closed frequent trees in streaming data varying with time. We give a first experimental evaluation of the proposed algorithms.

[333]

Using GNUsmail to Compare Data Stream Mining Methods for On-line Email Classification

José M. Carmona-Cejudo, Manuel Baena-García, Rafael Morales Bueno, Joao Gama, and Albert Bifet. Using GNUsmail to compare data stream mining methods for on-line email classification. Journal of Machine Learning Research - Proceedings Track, 17:12-18, 2011.
[ bib | .pdf ]
Real-time classification of emails is a challenging task because of its online nature, and also because email streams are subject to concept drift. Identifying email spam, where only two different labels or classes are defined (spam or not spam), has received great attention in the literature. We are nevertheless interested in a more specific classification where multiple folders exist, which is an additional source of complexity: the class can have a very large number of different values. Moreover, neither cross-validation nor other sampling procedures are suitable for evaluation in data stream contexts, which is why other metrics, like the prequential error, have been proposed. In this paper, we present GNUsmail, an open-source extensible framework for email classification, and we focus on its ability to perform online evaluation. GNUsmails architecture supports incremental and online learning, and it can be used to compare different data stream mining methods, using state-of-art online evaluation metrics. Besides describing the framework, characterized by two overlapping phases, we show how it can be used to compare different algorithms in order to find the most appropriate one. The GNUsmail source code includes a tool for launching replicable experiments.

[334]

Streaming Multi-label Classification

Jesse Read, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. Streaming multi-label classification. Journal of Machine Learning Research - Proceedings Track, 17:19-25, 2011.
[ bib | .pdf ]
This paper presents a new experimental framework for studying multi-label evolving stream classification, with efficient methods that combine the best practices in streaming scenarios with the best practices in multi-label classification. Many real world problems involve data which can be considered as multi-label data streams. Efficient methods exist for multi-label classification in non streaming scenarios. However, learning in evolving streaming scenarios is more challenging, as the learners must be able to adapt to change using limited time and memory. We present a new experimental software that extends the MOA framework. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. It is released under the GNU GPL license.

[335]

MOA Concept Drift Active Learning Strategies for Streaming Data

Indre Zliobaite, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. MOA concept drift active learning strategies for streaming data. Journal of Machine Learning Research - Proceedings Track, 17:48-55, 2011.
[ bib | .pdf ]
We present a framework for active learning on evolving data streams, as an extension to the MOA system. In learning to classify streaming data, obtaining the true labels may require major effort and may incur excessive cost. Active learning focuses on learning an accurate model with as few labels as possible. Streaming data poses additional challenges for active learning, since the data distribution may change over time (concept drift) and classifiers need to adapt. Conventional active learning strategies concentrate on querying the most uncertain instances, which are typically concentrated around the decision boundary. If changes do not occur close to the boundary, they will be missed and classifiers will fail to adapt. We propose a software system that implements active learning strategies, extending the MOA framework. This software is released under the GNU GPL license.

[336]

Detecting Sentiment Change in Twitter Streaming Data

Albert Bifet, Geoffrey Holmes, Bernhard Pfahringer, and Ricard Gavaldà. Detecting sentiment change in Twitter streaming data. Journal of Machine Learning Research - Proceedings Track, 17:5-11, 2011.
[ bib | .pdf ]
MOA-TweetReader is a real-time system to read tweets in real time, to detect changes, and to find the terms whose frequency changed. Twitter is a micro-blogging service built to discover what is happening at any moment in time, anywhere in the world. Twitter messages are short, and generated constantly, and well suited for knowledge discovery using data stream mining. MOA-TweetReader is a software extension to the MOA framework. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams.

[337]

Combining Modifications to Multinomial Naive Bayes for Text Classification

Antti Puurula. Combining modifications to multinomial naive bayes for text classification. In Proc 8th Asia Information Retrieval Societies Conference, Tianjin, China, pages 114-125, 2012.
[ bib | http ]
Multinomial Naive Bayes (MNB) is a preferred classifier for many text classification tasks, due to simplicity and trivial scaling to large scale tasks. However, in terms of classification accuracy it has a performance gap to modern discriminative classifiers, due to strong data assumptions. This paper explores the optimized combination of popular modifications to generative models in the context of MNB text classification. In order to optimize the introduced classifier metaparameters, we explore direct search optimization using random search algorithms. We evaluate 7 basic modifications and 4 search algorithms across 5 publicly availably available datasets, and give comparisons to similarly optimized Multiclass Support Vector Machine (SVM) classifiers. The use of optimized modifications results in over 20% mean reduction in classification errors compared to baseline MNB models, reducing the gap between SVM and MNB mean performance by over 60%. Some of the individual modifications are shown to have substantial and significant effects, while differences between the random search algorithms are smaller and not statistically significant. The evaluated modifications are potentially applicable to many applications of generative text modeling, where similar performance gains can be achieved.

[338]

Scalable Text Classification with Sparse Generative Modeling

Antti Puurula. Scalable text classification with sparse generative modeling. In Proc 12th Pacific Rim International Conference on Artificial Intelligence, Kuching, Malaysia, pages 458-469, 2012.
[ bib | http ]
Machine learning technology faces challenges in handling Big Data: vast volumes of online data such as web pages, news stories and articles. A dominant solution has been parallelization, but this does not make the tasks less challenging. An alternative solution is using sparse computation methods to fundamentally change the complexity of the processing tasks themselves. This can be done by using both the sparsity found in natural data and sparsified models. In this paper we show that sparse representations can be used to reduce the time complexity of generative classifiers to build fundamentally more scalable classifiers. We reduce the time complexity of Multinomial Naive Bayes classification with sparsity and show how to extend these findings into three multi-label extensions: Binary Relevance, Label Powerset and Multi-label Mixture Models. To provide competitive performance we provide the methods with smoothing and pruning modifications and optimize model meta-parameters using direct search optimization. We report on classification experiments on 5 publicly available datasets for large-scale multi-label classification. All three methods scale easily to the largest available tasks, with training times measured in seconds and classification times in milliseconds, even with millions of training documents, features and classes. The presented sparse modeling techniques should be applicable to many other classifiers, providing the same types of fundamental complexity reductions when applied to large scale tasks.

[339]

Ensembles of Sparse Multinomial Classifiers for Scalable Text Classification

Antti Puurula and Albert Bifet. Ensembles of sparse multinomial classifiers for scalable text classification. In Proc ECML PKDD 2012 Workshop on Large-Scale Hierarchical Classification, Bristol, UK, 2012.
[ bib | .pdf ]
Machine learning techniques face new challenges in scalability to large-scale tasks. Many of the existing algorithms are unable to scale to potentially millions of features and structured classes encountered in web-scale datasets such as Wikipedia. The third Large Scale Hierarchical Text Classification evaluation (LSHTC3) evaluated systems for multi-label hierarchical categorization of Wikipedia articles. In this paper we present a broad overview of our system in the evaluation, performing among the top systems in the challenge. We describe the several new modeling ideas we used that make text classification systems both more effective and scalable. These are: reduction of inference time complexity for probabilistic classifiers using inverted indices, classifier modifications optimized with direct search algorithms, ensembles of diverse multi-label classifiers and a novel feature-regression based method for scalable ensemble combination.

[340]

Online Estimation of Discrete Densities using Classifier Chains

Michael Geilke, Eibe Frank, and Stefan Kramer. Online estimation of discrete densities using classifier chains. In Proc ECML PKDD 2012 Workshop on Instant Interactive Data Mining, Bristol, UK, 2012.
[ bib | .pdf ]
We propose an approach to estimate a discrete joint density online, that is, the algorithm is only provided the current example, its current estimate, and a limited amount of memory. To design an on-line estimator for discrete densities, we use classifier chains to model dependencies among features. Each classifier in the chain estimates the probability of one particular feature. Because a single chain may not pro- vide a reliable estimate, we also consider ensembles of classifier chains. Our experiments on synthetic data show that the approach is feasible and the estimated densities approach the true, known distribution with increasing amounts of data.

[341]

Learning a Concept-based Document Similarity Measure

Lan Huang, David Milne, Eibe Frank, and Ian H. Witten. Learning a concept-based document similarity measure. Journal of the American Society for Information Science and Technology, 2012.
[ bib | .pdf ]
Document similarity measures are crucial components of many text analysis tasks, including information retrieval, document classification, and document clustering. Conventional measures are brittle: they estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. We propose a new measure that assesses similarity at both the lexical and semantic levels, and learns from human judgments how to combine them by using machine learning techniques. Experiments show that the new measure produces values for documents that are more consistent with people’s judgments than people are with each other. We also use it to classify and cluster large document sets covering different genres and topics, and find that it improves both classification and clustering performance.

[342]

Ensembles of Restricted Hoeffding Trees

Albert Bifet, Eibe Frank, Geoffrey Holmes, and Bernhard Pfahringer. Ensembles of restricted hoeffding trees. ACM Transactions on Intelligent Systems and Technology, 3(2), 2012.
[ bib | .pdf ]
The success of simple methods for classification shows that is is often not necessary to model complex attribute interactions to obtain good classification accuracy on practical problems. In this paper, we propose to exploit this phenomenon in the data stream context by building an ensemble of Hoeffding trees that are each limited to a small subset of attributes. In this way, each tree is restricted to model interactions between attributes in its corresponding subset. Because it is not known a priori which attribute subsets are relevant for prediction, we build exhaustive ensembles that consider all possible attribute subsets of a given size. As the resulting Hoeffding trees are not all equally important, we weigh them in a suitable manner to obtain accurate classifications. This is done by combining the log-odds of their probability estimates using sigmoid perceptrons, with one perceptron per class. We propose a mechanism for setting the perceptrons' learning rate using the ADWIN change detection method for data streams, and also use ADWIN to reset ensemble members (i.e. Hoeffding trees) when they no longer perform well. Our experiments show that the resulting ensemble classifier outperforms bagging for data streams in terms of accuracy when both are used in conjunction with adaptive naive Bayes Hoeffding trees, at the expense of runtime and memory consumption. We also show that our stacking method can improve the performance of a bagged ensemble.

[343]

Experiment databases - A new way to share, organize and learn from experiments

Joaquin Vanschoren, Hendrik Blockeel, Bernhard Pfahringer, and Geoffrey Holmes. Experiment databases - a new way to share, organize and learn from experiments. Machine Learning, 87(2):127-158, 2012.
[ bib ]
Thousands of machine learning research papers contain extensive experimental comparisons. However, the details of those experiments are often lost after publication, making it impossible to reuse these experiments in further research, or reproduce them to verify the claims made. In this paper, we present a collaboration framework designed to easily share machine learning experiments with the community, and automatically organize them in public databases. This enables immediate reuse of experiments for subsequent, possibly much broader investigation and offers faster and more thorough analysis based on a large set of varied results. We describe how we designed such an experiment database, currently holding over 650,000 classification experiments, and demonstrate its use by answering a wide range of interesting research questions and by verifying a number of recent studies.

[344]

Scalable and efficient multi-label classification for evolving data streams

Jesse Read, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. Scalable and efficient multi-label classification for evolving data streams. Machine Learning, 88(1-2):243-272, 2012.
[ bib ]
Many challenging real world problems involve multi-label data streams. Efficient methods exist for multi-label classification in non-streaming scenarios. However, learning in evolving streaming scenarios is more challenging, as classifiers must be able to deal with huge numbers of examples and to adapt to change using limited time and memory while being ready to predict at any point. This paper proposes a new experimental framework for learning and evaluating on multi-label data streams, and uses it to study the performance of various methods. From this study, we develop a multi-label Hoeffding tree with multi-label classifiers at the leaves. We show empirically that this method is well suited to this challenging task. Using our new framework, which allows us to generate realistic multi-label data streams with concept drift (as well as real data), we compare with a selection of baseline methods, as well as new learning methods from the literature, and show that our Hoeffding tree method achieves fast and more accurate performance.

[345]

Bagging Ensemble Selection for Regression

Quan Sun and Bernhard Pfahringer. Bagging ensemble selection for regression. In Australasian Conference on Artificial Intelligence, pages 695-706. Springer, 2012.
[ bib ]
Bagging ensemble selection (BES) is a relatively new ensemble learning strategy. The strategy can be seen as an ensemble of the ensemble selection from libraries of models (ES) strategy. Previous experimental results on binary classification problems have shown that using random trees as base classifiers, BES-OOB (the most successful variant of BES) is competitive with (and in many cases, superior to) other ensemble learning strategies, for instance, the original ES algorithm, stacking with linear regression, random forests or boosting. Motivated by the promising results in classification, this paper examines the predictive performance of the BES-OOB strategy for regression problems. Our results show that the BES-OOB strategy outperforms Stochastic Gradient Boosting and Bagging when using regression trees as the base learners. Our results also suggest that the advantage of using a diverse model library becomes clear when the model library size is relatively large. We also present encouraging results indicating that the non-negative least squares algorithm is a viable approach for pruning an ensemble of ensembles.

[346]

Stream Data Mining Using the MOA Framework

Philipp Kranen, Hardy Kremer, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Jesse Read. Stream data mining using the moa framework. In Proceedings Database Systems for Advanced Applications, pages 309-313. Springer, 2012.
[ bib ]
Massive Online Analysis (MOA) is a software framework that provides algorithms and evaluation methods for mining tasks on evolving data streams. In addition to supervised and unsupervised learning, MOA has recently been extended to support multi-label classification and graph mining. In this demonstrator we describe the main features of MOA and present the newly added methods for outlier detection on streaming data. Algorithms can be compared to established baseline methods such as LOF and ABOD using standard ranking measures including Spearman rank coefficient and the AUC measure. MOA is an open source project and videos as well as tutorials are publicly available on the MOA homepage.

[347]

Full model selection in the space of data mining operators

Quan Sun, Bernhard Pfahringer, and Michael Mayo. Full model selection in the space of data mining operators. In Proceedings Genetic and Evolutionary Computation Conference, pages 1503-1504. ACM, 2012.
[ bib ]
We propose a framework and a novel algorithm for the full model selection (FMS) problem. The proposed algorithm, combining both genetic algorithms (GA) and particle swarm optimization (PSO), is named GPS (which stands for GA-PSO-FMS), in which a GA is used for searching the optimal structure of a data mining solution, and PSO is used for searching the optimal parameter set for a particular structure instance. Given a classification or regression problem, GPS outputs a FMS solution as a directed acyclic graph consisting of diverse data mining operators that are applicable to the problem, including data cleansing, data sampling, feature transformation/selection and algorithm operators. The solution can also be represented graphically in a human readable form. Experimental results demonstrate the benefit of the algorithm.

[348]

Batch-Incremental versus Instance-Incremental Learning in Dynamic and Evolving Data

Jesse Read, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes. Batch-incremental versus instance-incremental learning in dynamic and evolving data. In Proceedings Symposium on Advances in Intelligent Data Analysis, pages 313-323, 2012.
[ bib ]
Many real world problems involve the challenging context of data streams, where classifiers must be incremental: able to learn from a theoretically-infinite stream of examples using limited time and memory, while being able to predict at any point. Two approaches dominate the literature: batch-incremental methods that gather examples in batches to train models; and instance-incremental methods that learn from each example as it arrives. Typically, papers in the literature choose one of these approaches, but provide insufficient evidence or references to justify their choice. We provide a first in-depth analysis comparing both approaches, including how they adapt to concept drift, and an extensive empirical study to compare several different versions of each approach. Our results reveal the respective advantages and disadvantages of the methods, which we discuss in detail.

[349]

Maximum Common Subgraph based locally weighted regression

Madeleine Seeland, Fabian Buchwald, Stefan Kramer, and Bernhard Pfahringer. Maximum common subgraph based locally weighted regression. In Proceedings of the ACM Symposium on Applied Computing, pages 165-172. ACM, 2012.
[ bib ]
This paper investigates a simple, yet effective method for regression on graphs, in particular for applications in chem-informatics and for quantitative structure-activity relationships (QSARs). The method combines Locally Weighted Learning (LWL) with Maximum Common Subgraph (MCS) based graph distances. More specifically, we investigate a variant of locally weighted regression on graphs (structures) that uses the maximum common subgraph for determining and weighting the neighborhood of a graph and feature vectors for the actual regression model. We show that this combination, LWL-MCS, outperforms other methods that use the local neighborhood of graphs for regression. The performance of this method on graphs suggests it might be useful for other types of structured data as well.

[350]

Multi-label classification using boolean matrix decomposition

Jörg Wicker, Bernhard Pfahringer, and Stefan Kramer. Multi-label classification using boolean matrix decomposition. In Proceedings of the ACM Symposium on Applied Computing, pages 179-186. ACM, 2012.
[ bib ]
This paper introduces a new multi-label classifier based on Boolean matrix decomposition. Boolean matrix decomposition is used to extract, from the full label matrix, latent labels representing useful Boolean combinations of the original labels. Base level models predict latent labels, which are subsequently transformed into the actual labels by Boolean matrix multiplication with the second matrix from the decomposition. The new method is tested on six publicly available datasets with varying numbers of labels. The experimental evaluation shows that the new method works particularly well on datasets with a large number of labels and strong dependencies among them.

[351]

Parameter Tuning Using Gaussian Processes

Jinjin Ma. Parameter tuning using gaussian processes. Master's thesis, Department of Computer Science, University of Waikato, 2012.
[ bib | http ]
Most machine learning algorithms require us to set up their parameter values before applying these algorithms to solve problems. Appropriate parameter settings will bring good performance while inappropriate parameter settings generally result in poor modelling. Hence, it is necessary to acquire the “best” parameter values for a particular algorithm before building the model. The “best” model not only reflects the “real” function and is well fitted to existing points, but also gives good performance when making predictions for new points with previously unseen values. A number of methods exist that have been proposed to optimize parameter values. The basic idea of all such methods is a trial-and-error process whereas the work presented in this thesis employs Gaussian process (GP) regression to optimize the parameter values of a given machine learning algorithm. In this thesis, we consider the optimization of only two-parameter learning algorithms. All the possible parameter values are specified in a 2-dimensional grid in this work. To avoid brute-force search, Gaussian Process Optimization (GPO) makes use of “expected improvement” to pick useful points rather than validating every point of the grid step by step. The point with the highest expected improvement is evaluated using cross-validation and the resulting data point is added to the training set for the Gaussian process model. This process is repeated until a stopping criterion is met. The final model is built using the learning algorithm based on the best parameter values identified in this process. In order to test the effectiveness of this optimization method on regression and classification problems, we use it to optimize parameters of some well-known machine learning algorithms, such as decision tree learning, support vector machines and boosting with trees. Through the analysis of experimental results obtained on datasets from the UCI repository, we find that the GPO algorithm yields competitive performance compared with a brute-force approach, while exhibiting a distinct advantage in terms of training time and number of cross-validation runs. Overall, the GPO method is a promising method for the optimization of parameter values in machine learning.

[352]

An application of data mining to fruit and vegetable sample identification using Gas Chromatography-Mass Spectrometry

Geoffrey Holmes, Dale Fletcher, and Peter Reutemann. An application of data mining to fruit and vegetable sample identification using gas chromatography-mass spectrometry. In Proceedings of the International Congress on Environmental Modelling and Software (IEMSS), Leipzig, Germany, 2012.
[ bib | .pdf ]
One of the uses of Gas Chromatography-Mass Spectrometry (GC-MS) is in the detection of pesticide residues in fruit and vegetables. In a high throughput laboratory there is the potential for sample swaps or mislabelling, as once a sample has been pre-processed to be injected into the GC-MS analyser, it is no longer distinguishable by eye. Possible consequences of such mistakes can be the destruction of large amounts of actually safe produce or pesticide-contaminated produce reaching the consumer. For the purposes of food safety and traceability, it can also be extremely valuable to know the source (country of origin) of a food product. This can help uncover fraudulent attempts of trying to sell food originating from countries deemed unsafe. In this study, we use the workflow environment ADAMS to examine whether we can determine the fruit/vegetable, and the country of origin of a sample from a GC-MS chromatogram. A workflow is used to generate data sets using different data pre-processing methods, and data representations from a database of over 8000 GC-MS chromatograms, consisting of more than 100 types of fruit and vegetables from more than 120 countries. A variety of classification algorithms are evaluated using the WEKA data mining workbench. We demonstrate excellent results, both for the determination of fruit/vegetable type and for the country of origin, using a histogram of ion counts, and Classification by Regression using Random Regression Forest with PLS-transformed data.

[353]

Evolutionary Data Selection for Enhancing Models of Intraday Forex Time Series

Michael Mayo. Evolutionary data selection for enhancing models of intraday forex time series. In Proceedings Applications of Evolutionary Computation, pages 184-193. Springer, 2012.
[ bib ]
The hypothesis in this paper is that a significant amount of intraday market data is either noise or redundant, and that if it is eliminated, then predictive models built using the remaining intraday data will be more accurate. To test this hypothesis, we use an evolutionary method (called Evolutionary Data Selection, EDS) to selectively remove out portions of training data that is to be made available to an intraday market predictor. After performing experiments in which data-selected and non-data-selected versions of the same predictive models are compared, it is shown that EDS is effective and does indeed boost predictor accuracy. It is also shown in the paper that building multiple models using EDS and placing them into an ensemble further increases performance. The datasets for evaluation are large intraday forex time series, specifically series from the EUR/USD, the USD/JPY and the EUR/JPY markets, and predictive models for two primary tasks per market are built: intraday return prediction and intraday volatility prediction.

[354]

Developing data mining applications

Geoff Holmes. Developing data mining applications. In International Conference on Knowledge Discovery and Data Mining, page 225. ACM, 2012.
[ bib ]
In this talk I will review several real-world applications developed at the University of Waikato over the past 15 years. These include the use of near infrared spectroscopy coupled with data mining as an alternate laboratory technique for predicting compound concentrations in soil and plant samples, and the analysis of gas chromatography mass spectrometry (GCMS) data, a technique used to determine in environmental applications, for example, the petroleum content in soil and water samples. I will then briefly discuss how experience with these applications has led to the development of an open-source framework for application development.

[355]

Scientific Workflow Management with ADAMS

Peter Reutemann and Joaquin Vanschoren. Scientific workflow management with adams. In PeterA. Flach, Tijl Bie, and Nello Cristianini, editors, Machine Learning and Knowledge Discovery in Databases, volume 7524 of Lecture Notes in Computer Science, pages 833-837. Springer Berlin Heidelberg, 2012.
[ bib | http ]
We demonstrate the Advanced Data mining And Machine learning System (ADAMS), a novel workflow engine designed for rapid prototyping and maintenance of complex knowledge workflows. ADAMS does not require the user to manually connect inputs to outputs on a large canvas. It uses a compact workflow representation, control operators, and a simple interface between operators, allowing them to be auto-connected. It contains an extensive library of operators for various types of analysis, and a convenient plug-in architecture to easily add new ones.
Keywords: scientific workflows; machine learning; data mining

[356]

Integrated Instance- and Class-based Generative Modeling for Text Classification

Antti Puurula and Sung-Hyon Myaeng. Integrated instance- and class-based generative modeling for text classification. In Proc 18th Australasian Document Computing Symposium, pages 66-73. ACM, 2013.
[ bib | .pdf ]
Statistical methods for text classification are predominantly based on the paradigm of class-based learning that associates class variables with features, discarding the instances of data after model training. This results in efficient models, but neglects the fine-grained information present in individual documents. Instance-based learning uses this information, but suffers from data sparsity with text data. In this paper, we propose a generative model called Tied Document Mixture (TDM) for extending Multinomial Naive Bayes (MNB) with mixtures of hierarchically smoothed models for documents. Alternatively, TDM can be viewed as a Kernel Density Classifier using class-smoothed Multinomial kernels. TDM is evaluated for classification accuracy on 14 different datasets for multi-label, multi-class and binary-class text classification tasks and compared to instance- and class-based learning baselines. The comparisons to MNB demonstrate a substantial improvement in accuracy as a function of available training documents per class, ranging up to average error reductions of over 26% in sentiment clas- sification and 65% in spam classification. On average TDM is as accurate as the best discriminative classifiers, but retains the linear time complexities of instance-based learning methods, with exact algorithms for both model estimation and inference.

[357]

Cumulative Progress in Language Models for Information Retrieval

Antti Puurula. Cumulative progress in language models for information retrieval. In Proc 11th Australasian Language Technology Workshop, Brisbane, Australia, pages 96-100. ACL, 2013.
[ bib | .pdf ]
The improvements to ad-hoc IR systems over the last decades have been recently criticized as illusionary and based on incorrect baseline comparisons. In this paper several improvements to the LM approach to IR are combined and evaluated: Pitman-Yor Process smoothing, TF-IDF feature weighting and modelbased feedback. The increases in ranking quality are significant and cumulative over the standard baselines of Dirichlet Prior and 2-stage Smoothing, when evaluated across 13 standard ad-hoc retrieval datasets. The combination of the improvements is shown to improve the Mean Average Precision over the datasets by 17.1% relative. Furthermore, the considered improvements can be easily implemented with little additional computation to existing LM retrieval systems. On the basis of the results it is suggested that LM research for IR should move towards using stronger baseline models.

[358]

Online Estimation of Discrete Densities

Michael Geilke, Eibe Frank, Andreas Karwath, and Stefan Kramer. Online estimation of discrete densities. In Proc 13th IEEE International Conference on Data Mining, Dallas, Texas. IEEE, 2013.
[ bib | .pdf ]
We address the problem of estimating a discrete joint density online, that is, the algorithm is only provided the current example and its current estimate. The proposed online estimator of discrete densities, EDDO (Estimation of Discrete Densities Online), uses classifier chains to model dependencies among features. Each classifier in the chain estimates the probability of one particular feature. Because a single chain may not provide a reliable estimate, we also consider ensembles of classifier chains and ensembles of weighted classifier chains. For all density estimators, we provide consistency proofs and propose algorithms to perform certain inference tasks. The empirical evaluation of the estimators is conducted in several experiments and on data sets of up to several million instances: We compare them to density estimates computed from Bayesian structure learners, evaluate them under the influence of noise, measure their ability to deal with concept drift, and measure the run-time performance. Our experiments demonstrate that, even though designed to work online, EDDO delivers estimators of competitive accuracy compared to batch Bayesian structure learners and batch variants of EDDO.

[359]

Propositionalisation of Multi-instance Data using Random Forests

Eibe Frank and Bernhard Pfahringer. Propositionalisation of multi-instance data using random forests. In Proc 26th Australasian Conference on Artificial Intelligence, Otago, New Zealand, pages 362-373. Springer, 2013.
[ bib | .pdf ]
Multi-instance learning is a generalisation of attribute-value learning where examples for learning consist of labeled bags (i.e. multi-sets) of instances. This learning setting is more computationally challenging than attribute-value learning and a natural fit for important application areas of machine learning such as classification of molecules and image classification. One approach to solve multi-instance learning problems is to apply propositionalisation, where bags of data are converted into vectors of attribute-value pairs so that a standard propositional (i.e. attribute-value) learning algorithm can be applied. This approach is attractive because of the large number of propositional learning algorithms that have been developed and can thus be applied to the propositionalised data. In this paper, we empirically investigate a variant of an existing propositionalisation method called TLC. TLC uses a single decision tree to obtain propositionalised data. Our variant applies a random forest instead and is motivated by the potential increase in robustness that this may yield. We present results on synthetic and real-world data from the above two application domains showing that it indeed yields increased classification accuracy when applying boosting and support vector machines to classify the propositionalised data.

[360]

Random Projections as Regularizers: Learning a Linear Discriminant Ensemble from Fewer Observations than Dimensions

Robert J. Durrant and Ata Kabán. Random projections as regularizers: Learning a linear discriminant ensemble from fewer observations than dimensions. In Proc 5th Asian Conference on Machine Learning, Canberra, Australia. JMLR, 2013.
[ bib | .pdf ]
We examine the performance of an ensemble of randomly-projected Fisher Linear Discriminant classifiers, focusing on the case when there are fewer training observations than data dimensions. Our ensemble is learned from a sequence of randomly-projected representations of the original high dimensional data and therefore for this approach data can be collected, stored and processed in such a compressed form. The specific form and simplicity of this ensemble permits a direct and much more detailed analysis than existing generic tools in previous works. In particular, we are able to derive the exact form of the generalization error of our ensemble, conditional on the training set, and based on this we give theoretical guarantees which directly link the performance of the ensemble to that of the corresponding linear discriminant learned in the full data space. To the best of our knowledge these are the first theoretical results to prove such an explicit link for any classifier and classifier ensemble pair. Furthermore we show that the randomly-projected ensemble is equivalent to implementing a sophisticated regularization scheme to the linear discriminant learned in the original data space and this prevents overfitting in conditions of small sample size where pseudo-inverse FLD learned in the data space is provably poor. We confirm theoretical findings with experiments, and demonstrate the utility of our approach on several datasets from the bioinformatics domain where fewer observations than dimensions are the norm.

[361]

Dimension-Adaptive Bounds on Compressive FLD Classification

Ata Kabán and Robert J. Durrant. Dimension-adaptive bounds on compressive fld classification. In Proc 24th International Conference on Algorithmic Learning Theory, Singapore, pages 294-308, 2013.
[ bib | http | .pdf ]
Efficient dimensionality reduction by random projections (RP) gains popularity, hence the learning guarantees achievable in RP spaces are of great interest. In finite dimensional setting, it has been shown for the compressive Fisher Linear Discriminant (FLD) classifier that for good generalisation the required target dimension grows only as the log of the number of classes and is not adversely affected by the number of projected data points. However these bounds depend on the dimensionality d of the original data space. In this paper we give further guarantees that remove d from the bounds under certain conditions of regularity on the data density structure. In particular, if the data density does not fill the ambient space then the error of compressive FLD is independent of the ambient dimension and depends only on a notion of ‘intrinsic dimension’.

[362]

Clustering Based Active Learning for Evolving Data Streams

Dino Ienco, Albert Bifet, Indre Zliobaite, and Bernhard Pfahringer. Clustering based active learning for evolving data streams. In Proc 16th International Conference on Discovery Science, Singapore, pages 79-93, 2013.
[ bib | http ]
Data labeling is an expensive and time-consuming task. Choosing which labels to use is increasingly becoming important. In the active learning setting, a classifier is trained by asking for labels for only a small fraction of all instances. While many works exist that deal with this issue in non-streaming scenarios, few works exist in the data stream setting. In this paper we propose a new active learning approach for evolving data streams based on a pre-clustering step, for selecting the most informative instances for labeling. We consider a batch incremental setting: when a new batch arrives, first we cluster the examples, and then, we select the best instances to train the learner. The clustering approach allows to cover the whole data space avoiding to oversample examples from only few areas. We compare our method w.r.t. state of the art active learning strategies over real datasets. The results highlight the improvement in performance of our proposal. Experiments on parameter sensitivity are also reported.

[363]

Pairwise meta-rules for better meta-learning-based algorithm ranking

Quan Sun and Bernhard Pfahringer. Pairwise meta-rules for better meta-learning-based algorithm ranking. Machine Learning, 93(1):141-161, 2013.
[ bib | http ]
In this paper, we present a novel meta-feature generation method in the context of meta-learning, which is based on rules that compare the performance of individual base learners in a one-against-one manner. In addition to these new meta-features, we also introduce a new meta-learner called Approximate Ranking Tree Forests (ART Forests) that performs very competitively when compared with several state-of-the-art meta-learners. Our experimental results are based on a large collection of datasets and show that the proposed new techniques can improve the overall performance of meta-learning for algorithm ranking significantly. A key point in our approach is that each performance figure of any base learner for any specific dataset is generated by optimising the parameters of the base learner separately for each dataset.

[364]

Pitfalls in Benchmarking Data Stream Classification and How to Avoid Them

Albert Bifet, Jesse Read, Indre Zliobaite, Bernhard Pfahringer, and Geoff Holmes. Pitfalls in benchmarking data stream classification and how to avoid them. In Proc European Conference on Machine Learning and Knowledge Discovery in Databases, Prague, Czech Republic, pages 465-479, 2013.
[ bib | http ]
Data stream classification plays an important role in modern data analysis, where data arrives in a stream and needs to be mined in real time. In the data stream setting the underlying distribution from which this data comes may be changing and evolving, and so classifiers that can update themselves during operation are becoming the state-of-the-art. In this paper we show that data streams may have an important temporal component, which currently is not considered in the evaluation and benchmarking of data stream classifiers. We demonstrate how a naive classifier considering the temporal component only outperforms a lot of current state-of-the-art classifiers on real data streams that have temporal dependence, i.e. data is autocorrelated. We propose to evaluate data stream classifiers taking into account temporal dependence, and introduce a new evaluation measure, which provides a more accurate gauge of data stream classifier performance. In response to the temporal dependence issue we propose a generic wrapper for data stream classifiers, which incorporates the temporal component into the attribute space.

[365]

SMOTE for Regression

Luís Torgo, Rita P. Ribeiro, Bernhard Pfahringer, and Paula Branco. Smote for regression. In Proc 16th Portuguese Conference on Artificial Intelligence, Angra do Heroísmo, Azores, Portugal, pages 378-389, 2013.
[ bib | http ]
Several real world prediction problems involve forecasting rare values of a target variable. When this variable is nominal we have a problem of class imbalance that was already studied thoroughly within machine learning. For regression tasks, where the target variable is continuous, few works exist addressing this type of problem. Still, important application areas involve forecasting rare extreme values of a continuous target variable. This paper describes a contribution to this type of tasks. Namely, we propose to address such tasks by sampling approaches. These approaches change the distribution of the given training data set to decrease the problem of imbalance between the rare target cases and the most frequent ones. We present a modification of the well-known Smote algorithm that allows its use on these regression tasks. In an extensive set of experiments we provide empirical evidence for the superiority of our proposals for these particular regression tasks. The proposed SmoteR method can be used with any existing regression algorithm turning it into a general tool for addressing problems of forecasting rare extreme values of a continuous target variable

[366]

Applying additive logistic regression to data derived from sensors monitoring behavioral and physiological characteristics of dairy cows to detect lameness

Claudia Kamphuis, Eibe Frank, Jennie K. Burke, Gwyn Verkerk, and Jenny Jago. Applying additive logistic regression to data derived from sensors monitoring behavioral and physiological characteristics of dairy cows to detect lameness. Dairy Science, 2013.
[ bib | http ]
The hypothesis was that sensors currently available on farm that monitor behavioral and physiological characteristics have potential for the detection of lameness in dairy cows. This was tested by applying additive logistic regression to variables derived from sensor data. Data were collected between November 2010 and June 2012 on 5 commercial pasture-based dairy farms. Sensor data from weigh scales (liveweight), pedometers (activity), and milk meters (milking order, unadjusted and adjusted milk yield in the first 2 min of milking, total milk yield, and milking duration) were collected at every milking from 4,904 cows. Lameness events were recorded by farmers who were trained in detecting lameness before the study commenced. A total of 318 lameness events affecting 292 cows were available for statistical analyses. For each lameness event, the lame cow's sensor data for a time period of 14 d before observation date were randomly matched by farm and date to 10 healthy cows (i.e., cows that were not lame and had no other health event recorded for the matched time period). Sensor data relating to the 14-d time periods were used for developing univariable (using one source of sensor data) and multivariable (using multiple sources of sensor data) models. Model development involved the use of additive logistic regression by applying the LogitBoost algorithm with a regression tree as base learner. The model's output was a probability estimate for lameness, given the sensor data collected during the 14-d time period. Models were validated using leave-one-farm-out cross-validation and, as a result of this validation, each cow in the data set (318 lame and 3,180 nonlame cows) received a probability estimate for lameness. Based on the area under the curve (AUC), results indicated that univariable models had low predictive potential, with the highest AUC values found for liveweight (AUC = 0.66), activity (AUC = 0.60), and milking order (AUC = 0.65). Combining these 3 sensors improved AUC to 0.74. Detection performance of this combined model varied between farms but it consistently and significantly outperformed univariable models across farms at a fixed specificity of 80%. Still, detection performance was not high enough to be implemented in practice on large, pasture-based dairy farms. Future research may improve performance by developing variables based on sensor data of liveweight, activity, and milking order, but that better describe changes in sensor data patterns when cows go lame.

[367]

Towards a Framework for Designing Full Model Selection and Optimization Systems

Quan Sun, Bernhard Pfahringer, and Michael Mayo. Towards a framework for designing full model selection and optimization systems. In Proc 11th International Workshop on Multiple Classifier Systems, Nanjing, China, pages 259-270. Springer, 2013.
[ bib | http ]
People from a variety of industrial domains are beginning to realise that appropriate use of machine learning techniques for their data mining projects could bring great benefits. End-users now have to face the new problem of how to choose a combination of data processing tools and algorithms for a given dataset. This problem is usually termed the Full Model Selection (FMS) problem. Extended from our previous work [10], in this paper, we introduce a framework for designing FMS algorithms. Under this framework, we propose a novel algorithm combining both genetic algorithms (GA) and particle swarm optimization (PSO) named GPS (which stands for GA-PSO-FMS), in which a GA is used for searching the optimal structure for a data mining solution, and PSO is used for searching optimal parameters for a particular structure instance. Given a classification dataset, GPS outputs a FMS solution as a directed acyclic graph consisting of diverse data mining operators that are available to the problem. Experimental results demonstrate the benefit of the algorithm. We also present, with detailed analysis, two model-tree-based variants for speeding up the GPS algorithm.

[368]

Predicting Regression Test Failures Using Genetic Algorithm-Selected Dynamic Performance Analysis Metrics

Michael Mayo and Simon A. Spacey. Predicting regression test failures using genetic algorithm-selected dynamic performance analysis metrics. In Proc 5th International Symposium on Search Based Software Engineering, St. Petersburg, Russia, pages 158-171, 2013.
[ bib | http ]
A novel framework for predicting regression test failures is proposed. The basic principle embodied in the framework is to use performance analysis tools to capture the runtime behaviour of a program as it executes each test in a regression suite. The performance information is then used to build a dynamically predictive model of test outcomes. Our framework is evaluated using a genetic algorithm for dynamic metric selection in combination with state-of-the-art machine learning classifiers. We show that if a program is modified and some tests subsequently fail, then it is possible to predict with considerable accuracy which of the remaining tests will also fail which can be used to help prioritise tests in time constrained testing environments.

[369]

Automatic construction of lexicons, taxonomies, ontologies, and other knowledge structures

Olena Medelyan, Ian H. Witten, Anna Divoli, and Jeen Broekstra. Automatic construction of lexicons, taxonomies, ontologies, and other knowledge structures. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 3(4):257-279, 2013.
[ bib | http ]
Abstract, structured, representations of knowledge such as lexicons, taxonomies, and ontologies have proven to be powerful resources not only for the systematization of knowledge in general, but to support practical technologies of document organization, information retrieval, natural language understanding, and question-answering systems. These resources are extremely time consuming for people to create and maintain, yet demand for them is growing, particularly in specialized areas ranging from legacy documents of large enterprises to rapidly changing domains such as current affairs and celebrity news. Consequently, researchers are investigating methods of creating such structures automatically from document collections, calling on the proliferation of interlinked resources already available on the web for background knowledge and general information about the world. This review surveys what is possible, and also outlines current research directions.

[370]

Towards large scale continuous EDA: a random matrix theory perspective

Ata Kabán, Jakramate Bootkrajang, and Robert John Durrant. Towards large scale continuous eda: a random matrix theory perspective. In Proc Genetic and Evolutionary Computation Conference, Amsterdan, The Netherlands, pages 383-390, 2013.
[ bib | http | .pdf ]
Estimation of distribution algorithms (EDA) are a major branch of evolutionary algorithms (EA) with some unique advantages in principle. They are able to take advantage of correlation structure to drive the search more efficiently, and they are able to provide insights about the structure of the search space. However, model building in high dimensions is extremely challenging and as a result existing EDAs lose their strengths in large scale problems. Large scale continuous global optimisation is key to many real world problems of modern days. Scaling up EAs to large scale problems has become one of the biggest challenges of the field. This paper pins down some fundamental roots of the problem and makes a start at developing a new and generic framework to yield effective EDA-type algorithms for large scale continuous global optimisation problems. Our concept is to introduce an ensemble of random projections of the set of fittest search points to low dimensions as a basis for developing a new and generic divide-and-conquer methodology. This is rooted in the theory of random projections developed in theoretical computer science, and will exploit recent advances of non-asymptotic random matrix theory.

[371]

Sharp Generalization Error Bounds for Randomly-projected Classifiers

Robert J. Durrant and Ata Kabán. Sharp generalization error bounds for randomly-projected classifiers. In Proc 30th International Conference on Machine Learning, Atlanta, Georgia, pages 693-701. JMLR, 2013.
[ bib | .pdf ]
We derive sharp bounds on the generalization error of a generic linear classifier trained by empirical risk minimization on randomly-projected data. We make no restrictive assumptions (such as sparsity or separability) on the data: Instead we use the fact that, in a classification setting, the question of interest is really ‘what is the effect of random projection on the predicted class labels?’ and we therefore derive the exact probability of ‘label flipping’ under Gaussian random projection in order to quantify this effect precisely in our bounds.

[372]

Constructing a Focused Taxonomy from a Document Collection

Olena Medelyan, Steve Manion, Jeen Broekstra, Anna Divoli, Anna-Lan Huang, and Ian H. Witten. Constructing a focused taxonomy from a document collection. In Proc 10th European Semantic Web Conference, Montpellier, France, pages 367-381. Springer, 2013.
[ bib | http ]
We describe a new method for constructing custom taxonomies from document collections. It involves identifying relevant concepts and entities in text; linking them to knowledge sources like Wikipedia, DBpedia, Freebase, and any supplied taxonomies from related domains; disambiguating conflicting concept mappings; and selecting semantic relations that best group them hierarchically. An RDF model supports interoperability of these steps, and also provides a flexible way of including existing NLP tools and further knowledge sources. From 2000 news articles we construct a custom taxonomy with 10,000 concepts and 12,700 relations, similar in structure to manually created counterparts. Evaluation by 15 human judges shows the precision to be 89% and 90% for concepts and relations respectively; recall was 75% with respect to a manually generated taxonomy for the same domain.

[373]

Artificial neural network is highly predictive of outcome in paediatric acute liver failure

Jeremy Rajanayagam, Eibe Frank, Ross W. Shepherd, and Peter J. Lewindon. Artificial neural network is highly predictive of outcome in paediatric acute liver failure. Pediatric Transplantation, 2013.
[ bib | http ]
Current prognostic models in PALF are unreliable, failing to account for complex, non-linear relationships existing between multiple prognostic factors. A computational approach using ANN should provide superior modelling to PELD-MELD scores. We assessed the prognostic accuracy of PELD-MELD scores and ANN in PALF in children presenting to the QLTS, Australia. A comprehensive registry-based data set was evaluated in 54 children (32M, 22F, median age 17 month) with PALF. PELD-MELD scores calculated at (i) meeting PALF criteria and (ii) peak. ANN was evaluated using stratified 10-fold cross-validation. Outcomes were classified as good (transplant-free survival) or poor (death or LT) and predictive accuracy compared using AUROC curves. Mean PELD-MELD scores were significantly higher in non-transplanted non-survivors (i) 37 and (ii) 46 and transplant recipients (i) 32 and (ii) 43 compared to transplant-free survivors (i) 26 and (ii) 30. Threshold PELD-MELD scores ≥27 and ≥42, at meeting PALF criteria and peak, gave AUROC 0.71 and 0.86, respectively, for poor outcome. ANN showed superior prediction for poor outcome with AUROC 0.96, sensitivity 82.6%, specificity 96%, PPV 96.2% and NPV 85.7% (cut-off 0.5). ANN is superior to PELD-MELD for predicting poor outcome in PALF.

[374]

Identifying Market Price Levels Using Differential Evolution

Michael Mayo. Identifying market price levels using differential evolution. In Proc 16th European Conference on Applications of Evolutionary Computation, Vienna, Austria, pages 203-212, 2013.
[ bib | http ]
Evolutionary data mining is used in this paper to investigate the concept of support and resistance levels in financial markets. Specifically, Differential Evolution is used to learn support/resistance levels from price data. The presence of these levels is then tested in out-of-sample data. Our results from a set of experiments covering five years worth of daily data across nine different US markets show that there is statistical evidence for price levels in certain markets, and that Differential Evolution can uncover them.

[375]

A direct policy-search algorithm for relational reinforcement learning

Samuel Sarjant. A direct policy-search algorithm for relational reinforcement learning. PhD thesis, Department of Computer Science, University of Waikato, 2013.
[ bib | http ]
Relational Reinforcement Learning (RRL) is a subfield of machine learning in which a learning agent seeks to maximise a numerical reward within an environment, represented as collections of objects and relations, by performing actions that interact with the environment. The relational representation allows more dynamic environment states than an attribute-based representation of reinforcement learning, but this flexibility also creates new problems such as a potentially infinite number of states. This thesis describes an RRL algorithm named Cerrla that creates policies directly from a set of learned relational condition-action rules using the Cross-Entropy Method (CEM) to control policy creation. The CEM assigns each rule a sampling probability and gradually modifies these probabilities such that the randomly sampled policies consist of ‘better’ rules, resulting in larger rewards received. Rule creation is guided by an inferred partial model of the environment that defines: the minimal conditions needed to take an action, the possible specialisation conditions per rule, and a set of simplification rules to remove redundant and illegal rule conditions, resulting in compact, efficient, and comprehensible policies. Cerrla is evaluated on four separate environments, where each environment has several different goals. Results show that compared to existing RRL algorithms, Cerrla is able to learn equal or better behaviour in less time on the standard RRL environment. On other larger, more complex environments, it can learn behaviour that is competitive to specialised approaches. The simplified rules and CEM’s bias towards compact policies result in comprehensive and effective relational policies created in a relatively short amount of time.

[376]

Model selection based product kernel learning for regression on graphs

Madeleine Seeland, Stefan Kramer, and Bernhard Pfahringer. Model selection based product kernel learning for regression on graphs. In Proc 28th Annual ACM Symposium on Applied Computing, Coimbra, Portugal, pages 136-143. ACM, 2013.
[ bib | http ]
The choice of a suitable graph kernel is intrinsically hard and often cannot be made in an informed manner for a given dataset. Methods for multiple kernel learning offer a possible remedy, as they combine and weight kernels on the basis of a labeled training set of molecules to define a new kernel. Whereas most methods for multiple kernel learning focus on learning convex linear combinations of kernels, we propose to combine kernels in products, which theoretically enables higher expressiveness. In experiments on ten publicly available chemical QSAR datasets we show that product kernel learning is on no dataset significantly worse than any of the competing kernel methods and on average the best method available. A qualitative analysis of the resulting product kernels shows how the results vary from dataset to dataset.

[377]

Efficient data stream classification via probabilistic adaptive windows

Albert Bifet, Bernhard Pfahringer, Jesse Read, and Geoff Holmes. Efficient data stream classification via probabilistic adaptive windows. In Proc 28th Annual ACM Symposium on Applied Computing, Coimbra, Portugal, pages 801-806. ACM, 2013.
[ bib | http ]
In the context of a data stream, a classifier must be able to learn from a theoretically-infinite stream of examples using limited time and memory, while being able to predict at any point. Many methods deal with this problem by basing their model on a window of examples. We introduce a probabilistic adaptive window (PAW) for data-stream learning, which improves this windowing technique with a mechanism to include older examples as well as the most recent ones, thus maintaining information on past concept drifts while being able to adapt quickly to new ones. We exemplify PAW with lazy learning methods in two variations: one to handle concept drift explicitly, and the other to add classifier diversity using an ensemble. Along with the standard measures of accuracy and time and memory use, we compare classifiers against state-of-the-art classifiers from the data-stream literature.

[378]

An open-source toolkit for mining Wikipedia

David N. Milne and Ian H. Witten. An open-source toolkit for mining wikipedia. Artificial Intelligence, 194:222-239, 2013.
[ bib | http ]
The online encyclopedia Wikipedia is a vast, constantly evolving tapestry of interlinked articles. For developers and researchers it represents a giant multilingual database of concepts and semantic relations, a potential resource for natural language processing and many other research areas. This paper introduces the Wikipedia Miner toolkit, an open-source software system that allows researchers and developers to integrate Wikipedia's rich semantics into their own applications. The toolkit creates databases that contain summarized versions of Wikipedia's content and structure, and includes a Java API to provide access to them. Wikipedia's articles, categories and redirects are represented as classes, and can be efficiently searched, browsed, and iterated over. Advanced features include parallelized processing of Wikipedia dumps, machine-learned semantic relatedness measures and annotation features, and XML-based web services. Wikipedia Miner is intended to be a platform for sharing data mining techniques.