[1]

A Tool for Model Generation and Knowledge Acquisition
S.J. Cunningham and P. Denize.
A tool for model generation and knowledge acquisition.
In Proc International Workshop on Artificial Intelligence and
Statistics, pages 213222, Fort Lauderdale, Florida, USA, 1993.
[ bib 
.ps 
.pdf ]
Tools to automatically induce domain descriptions from examples are valuable aids for the knowledge acquisition stage of expert system construction. This paper presents a description of an algorithm that induces two domain descriptions: a conceptual model, which gives a broad understanding of variable interactions and their effect on the system, and a predictive model, which determines the system value associated with variable values input to the model. This induction algorithm is based on Entropy Data Analysis (EDA), which builds linear equations to approximate the system described by the training data.


[2]

Using Data Mining to Support the Construction and Maintenance of Expert Systems
G. Holmes and S.J. Cunningham.
Using data mining to support the construction and maintenance of
expert systems.
In Proc Artificial Neural Networks and Expert Systems, pages
156159, Dunedin, New Zealand, 1993.
[ bib 
.pdf ]
Many expert systems are constructed from and allied with a large collection of databases that are continually being updated. In this paper we address the issues of how such a knowledge base can be constructed using tools that search the databases for significant, unexpected correlations and present them to the knowledge engineer for review. Once a knowledge base has been constructed, there is the problem of keeping it consistent with changing conditions in the real world. Concepts which are embodied in the data may drift in time, and so some mechanism for updating the knowledge base must be developed. As it would prove costly to periodically revisit the knowledge acquisition phase, we propose to monitor the domain database automatically for significant concept change.


[3]

Expert Systems Development Using Data Mining
G. Holmes and S.J. Cunningham.
Expert systems development using data mining.
In Proc Expert Systems '93, pages 213222, Cambridge, England,
1993.
[ bib 
.ps 
.pdf ]
This paper presents a model of expert system construction beginning with an initial set of data chosen by a domain expert. The domain expert creates the knowledge base both through direct knowledge transfer (or through the construction of example sets for machine learning) and through mining one or more of the databases. As construction progresses, the domain expert may alter the composition of the databases by directing that additional or different data be collected. Interactively guided by the expert, the mining tool efficiently organises the search for findings in the database. These findings are then evaluated for inclusion in the knowledge base.


[4]

Practical Machine Learning and its Potential Application to Problems in Agriculture
I.H. Witten, S. Cunningham, G. Holmes, R.J. McQueen, and L.A. Smith.
Practical machine learning and its potential application to problems
in agriculture.
In Proc New Zealand Computer Conference, volume 1, pages
308325, Auckland, New Zealand, 1993.
[ bib 
.ps.gz 
.pdf ]
One of the most exciting and potentially farreaching developments in contemporary computer science is the invention and application of methods of machine learning. These have evolved from simple adaptive parameterestimation techniques to ways of (a) inducing classification rules from examples, (b) using prior knowledge to guide the interpretation of new examples, (c) using this interpretation to sharpen and refine the domain knowledge, and (d) storing and indexing example cases in ways that highlight their similarities and differences. Such techniques have been applied in domains ranging from the diagnosis of plant disease to the interpretation of medical test data. This paper reviews selected methods of machine learning with an emphasis on practical applications, and suggests how they might be used to address some important problems in the primary production industries, particularly agriculture.


[5]

Complexitybased Induction
D. Conklin and I.H. Witten.
Complexitybased induction.
Machine Learning, 16(3):203225, 1994.
[ bib 
.ps.gz 
.pdf ]
A central problem in inductive logic programming is theory evaluation. Without some sort of preference criterion, any two theories that explain the same set of examples are equally acceptable. This paper presents a scheme for evaluating alternative inductive theories based on an objective preference criterion. It strives to extract maximal redundancy from examples, transforming structure into randomness. A major strength of the method is its application to learning problems where negative examples of concepts are scarce or unavailable. A new measure called model complexity is introduced, and its use is illustrated and compared with a proof complexity measure on relational learning tasks. The complementarity of the model and proof complexity parallels that of model and prooftheoretic semantics. Model complexity, where applicable, seems to be an appropriate measure for evaluating inductive logic theories.


[6]

WEKA Machine Learning Project: Cow Culling
R.E. De War and D.L. Neal.
Weka machine learning project: Cow culling.
Technical report, The University of Waikato, Computer Science
Department, Hamilton, New Zealand, 1994.
[ bib 
.ps.gz 
.pdf ]
This document describes the results of applying the WEKA machine learning workbench to a large database of dairy cows. The aim of the project was to induce usable rules for culling decisions from the data, and hence find which attributes were most important to farmers in determining whether an animal should be culled.


[7]

WEKA: A Machine Learning Workbench
G. Holmes, A. Donkin, and I.H. Witten.
Weka: A machine learning workbench.
In Proc Second Australia and New Zealand Conference on
Intelligent Information Systems, Brisbane, Australia, 1994.
[ bib 
.ps.gz 
.pdf ]
WEKA is a workbench for machine learning that is intended to aid in the application of machine learning techniques to a variety of realworld problems, in particular, those arising from agricultural and horticultural domains. Unlike other machine learning projects, the emphasis is on providing a working environment for the domain specialist rather than the machine learning expert. Lessons learned include the necessity of providing a wealth of interactive tools for data manipulation, result visualization, database linkage, and crossvalidation and comparison of rule sets, to complement the basic machine learning tools.


[8]

Knowledgerich Induction of Classification Rules
B. Martin.
Knowledgerich induction of classification rules.
In Proc Canadian Machine Learning Workshop, University of
Calgary, Alberta, Canada, 1994.
[ bib 
.ps.gz 
.pdf ]
The purpose of this research was to produce a machine learning system that can take advantage of many forms of background knowledge to guide the induction of classification rules. This system will be used for knowledge discovery in databases (also known as database mining), as the use of background knowledge can considerably reduce the search space of a database knowledge search. A new system, MARVIN++, is introduced, that attempts to satisfy this aim.


[9]

The WEKA Machine Learning Workbench : Its Application to a Real World Agricultural Database
R.J. McQueen, D.L. Neal, R.E. DeWar, S.R. Garner, and C.G. NevillManning.
The weka machine learning workbench : Its application to a real world
agricultural database.
In Proc Canadian Machine Learning Workshop, Banff, Alberta,
Canada, 1994.
[ bib 
.ps.gz 
.pdf ]
Numerous techniques have been proposed for learning rules and relationships from diverse data sets, in the hope that machines can help in the often tedious and errorprone process of knowledge acquisition. While these techniques are plausible and theoretically wellfounded, they stand or fall on their ability to make sense of realworld data. This paper describes a project that aims to apply a range of learning strategies to problems in primary industry, in particular agriculture and horticulture.


[10]

Geometric Comparison of Classifications and Rule Sets
T.J. Monk, R.S. Mitchell, L.A. Smith, and G. Holmes.
Geometric comparison of classifications and rule sets.
In Proc AAAI Workshop on Knowledge Discovery in Databases,
pages 395406, Seattle, Washington, USA, 1994.
[ bib 
.ps.gz 
.pdf ]
We present a technique for evaluating classifications by geometric comparison of rules. Rules are represented as objects in an ndimensional hyperspace. The similarity of classes is computed from the overlap of the geometric class descriptions. The system produces a correlation matrix that indicates the degree of similarity between each pair of classes. The technique can be applied to classification generated by different algorithms, with different numbers of classes and different attribute sets. Experimental results from a case study in a medical domain are included.


[11]

Modelling Sequences Using Grammars and Automata
C.G. NevillManning and D.L. Maulsby.
Modelling sequences using grammars and automata.
In Proc Canadian Machine Learning Workshop, University of
Calgary, Alberta, Canada, 1994.
[ bib 
.pdf ]
This paper presents two sequence modelling techniques. The first induces a contextfree deterministic grammar from a sequence. It was motivated by a specific machine learning problem, that of modelling a sequence of actions performed by a computer user, but it can also be applied to other machine learning problems, and its performance as a data compression technique is the best in its class. The second technique induces a pushdown finitestate automaton from a sequence. It was designed to derive an executable program from a program execution trace expressed in highlevel language statements, and is capable of recognising branches and loops, as well as recursive and nonrecursive procedure calls. The inductive capabilities of these two techniques are complementary, and following their description we examine how they can be combined into a more powerful system.


[12]

Compression by Induction of Hierarchical Grammars
C.G. NevillManning, I.H. Witten, and D.L. Maulsby.
Compression by induction of hierarchical grammars.
In J.A. Storer and M. Cohn, editors, Proc Data Compression
Conference, pages 244253, Los Alamitos, CA, 1994. IEEE Press.
[ bib 
.pdf ]
This paper describes a technique that constructs models of symbol sequences in the form of small, humanreadable, hierarchical grammars. The grammars are both semantically plausible and compact. The technique can induce structure from a variety of different kinds of sequence, and examples are given of models derived from English text, C source code and a sequence of terminal control codes.


[13]

Data Transformation: A Semanticallybased Approach to Function Discovery
T.H. Phan and I.H. Witten.
Data transformation: A semanticallybased approach to function
discovery.
Technical report, The University of Waikato, Computer Science
Department, Hamilton, New Zealand, 1994.
[ bib 
.ps.gz 
.pdf ]
This paper presents the method of data transformation for discovering numeric functions from their examples. Based on the idea of transformations between functions, this method can be viewed as a semantic counterpart to the more common approach of formula construction used in most previous discovery systems. Advantages of the new method include a flexible implementation through the design of transformation rules, and a sound basis for rigorous mathematical analysis to characterize what can be discovered. The method has been implemented in a discovery system called Linus , which can identify a wide range of functions: rational functions, quadratic relations, and many transcendental functions, as well as those that can be transformed to rational functions by combinations of diferentiation, logarithm and function inverse operations.


[14]

Function Discovery using Data Transformation
T.H. Phan and I.H. Witten.
Function discovery using data transformation.
In Proc International Symposium on Artificial Intelligence and
Mathematics, Florida, USA, 1994.
[ bib 
.ps.gz 
.pdf ]
Function discovery is the problem of finding a symbolic formula for an unknown function from examples of the function's value on certain arguments. In most previous discovery systems, the description language has been restricted to rational functions so that symbolic descriptions can be easily enumerated. This papers shows how data transformation can be used as the basis of a far more comprehensive description language that includes all functions that can be transformed to rational functions by differentiation and logarithm operations. the main contribution of this paper is to define a transformationbased description language and characterize its representational power. We also briefly sketch a practical implementation of a function induction system that uses this approach.


[15]

The WEKA Machine Learning Workbench (Video)
WEKA Machine Learning Group.
The weka machine learning workbench (video), 1994.
[ bib ]
This is a 6minute video tape giving a brief introduction to the WEKA project and the WEKA workbench.


[16]

TransPacific Machine Learning Research: the Calgary/Waikato Axis
I.H. Witten.
Transpacific machine learning research: the calgary/waikato axis.
In Proc Canadian Machine Learning Workshop, Calgary, Alberta,
Canada, 1994.
[ bib 
.ps.gz 
.pdf ]
The following five contributions summarize selected research projects on machine learning that are being undertaken in the Computer Science Departments at the Universities of Calgary and Waikato:
(a) The WEKA machine learning workbench (Bob McQueen et al.);
(b) Knowledgerich induction of classification rules (Brent Martin);
(c) LINUS: a transformationbased system for function discovery (Thong Phan);
(d) Modeling sequences (Craig NevillManning et al.);
(e) Instructible agents (Dave Maulsby).


[17]

Experiments on the zero frequency problem
J.G. Cleary and W.J. Teahan.
Experiments on the zero frequency problem.
In J.A. Storer and M. Cohn, editors, Proc Data Compression
Conference, page 480, Los Alamitos, CA, 1995. IEEE Press.
[ bib 
.ps.gz 
.pdf ]
The best algorithms for lossless compression of text are those which adapt to the text being compressed [1]. Two classes of such adaptive techniques are commonly used. One class matches the text against a dictionary of strings seen and transforms the text into a list of indices into the dictionary. These techniques are usually formulated as a variant on ZivLempel (LZ) compression. While LZ compressors do not give the best compression they are widely used because of their simplicity and low execution overhead.


[18]

K*: An InstanceBased Learner Using an Entropic Distance Measure
J.G. Cleary and L.E. Trigg.
K*: An instancebased learner using an entropic distance measure.
In Proc Machine Learning Conference, pages 108114, Tahoe
City, CA, USA, 1995.
[ bib 
.ps.gz 
.pdf ]
The use of entropy as a distance measure has several benefits. Amongst other things it provides a consistent approach to handling of symbolic attributes, real valued attributes and missing values. The approach of taking all possible transformation paths is discussed. We describe K*, an instancebased learner which uses such a measure, and results are presented which compare favourably with several machine learning algorithms.


[19]

Unbounded length contexts for PPM
J.G. Cleary, W.J. Teahan, and I.H. Witten.
Unbounded length contexts for ppm.
In J.A. Storer and M. Cohn, editors, Proc Data Compression
Conference, pages 5261, Los Alamitos, CA, 1995. IEEE Press.
[ bib 
.ps.gz 
.pdf ]
The PPM data compression scheme has set the performance standard in lossless compression of text throughout the past decade. The original algorithm was first published in 1984 by Cleary and Witten [3], and a series of improvements was described by Moffat [6], culminating in a careful implementation, called PPMC, which has become the benchmark version. This still achieves results superior to virtually all other compression methods, despite many attempts to better it. Other methods such as those based on ZivLempel coding [9] are more commonly used in practice, but their attractiveness lies in their relative speed rather than any superiority in compressionindeed, their compression performance generally falls distinctly below that of PPM in practical benchmark tests [1].


[20]

Machine Learning and Statistics: a Matter of Perspective
S.J. Cunningham.
Machine learning and statistics: a matter of perspective.
New Zealand J Computing, 6(1a):6973, August 1995.
[ bib 
.ps.gz 
.pdf ]
Information has become an important commercial commodityindeed, possibly the most important product of the future. While we have welldeveloped technologies to store data, the analysis to extract information is timeconsuming and requires skilled human intervention. Machine learning algorithms augment statistical analysis by providing mechanisms that automate the information discovery process. These algorithms also tend to be more accessible to endusers and domain experts. The two analysis methods are converging, and the fields have much to offer each other.


[21]

Analysis of Cow Culling with a Machine Learning Workbench
R.E. DeWar and R.J. McQueen.
Analysis of cow culling with a machine learning workbench.
Technical Report 95/1, University of Waikato, Computer Science
Department, Hamilton, New Zealand, January 1995.
[ bib 
.ps.gz 
.pdf ]
This report discusses the use of machine learning tools to examine datasets from a database of dairy cows. The objective of the study was to investigate whether these machine learning tools could extract meaningful rules from the datasets, and in the process, understand more about how these tools manipulate data.


[22]

WEKA: The Waikato Environment for Knowledge Analysis
S.R. Garner.
Weka: The waikato environment for knowledge analysis.
In Proc New Zealand Computer Science Research Students
Conference, pages 5764, University of Waikato, Hamilton, New Zealand,
1995.
[ bib 
.ps.gz 
.pdf ]
WEKA is a workbench designed to aid in the application of machine learning technology to real world data sets, in particular, data sets from New Zealand's agricultural sector. In order to do this a range of machine learning techniques are presented to the user in such a way as to hide the idiosyncrasies of input and output formats, as well as allow an exploratory approach in applying the technology. The system presented is a component based one that also has application in machine learning research and education.


[23]

Applying a Machine Learning Workbench: Experience with Agricultural Databases
S.R. Garner, S.J. Cunningham, G. Holmes, C.G. NevillManning, and Witten I.H.
Applying a machine learning workbench: Experience with agricultural
databases.
In Proc Machine Learning in Practice Workshop, Machine Learning
Conference, pages 1421, Tahoe City, CA, USA, 1995.
[ bib 
.ps.gz 
.pdf ]
This paper reviews our experience with the application of machine learning techniques to agricultural databases. We have designed and implemented a machine learning workbench, WEKA, which permits rapid experimentation on a given dataset using a variety of machine learning schemes, and has several facilities for interactive investigation of the data: preprocessing attributes, evaluating and comparing the results of different schemes, and designing comparative experiments to be run offline. We discuss the partnership between agricultural scientist and machine learning researcher that our experience has shown to be vital to success. We review in some detail a particular agricultural application concerned with the culling of dairy herds.


[24]

Machine learning from agricultural databases: practice and experience
S.R. Garner, G. Holmes, R.J. McQueen, and I.H. Witten.
Machine learning from agricultural databases: practice and
experience.
New Zealand J Computing, 6(1a):6973, August 1996.
[ bib 
.ps.gz 
.pdf ]
Per capita, New Zealand is reputed to be one of the world's leading collectors of information in databases. The country, quite rightly, places a substantial investment in stored knowledge. As databases grow, however, interest tends to shift from techniques for retrieving individual items to largescale analysis of the data as a whole. Data can be analysed to view trends, pick out anomalies, discover relationships, check that policy is turned into practice, and so on. With large databases, such analyse scan be costly.
At Waikato University we are examining these issues in the context of the agricultural sector of the economy. Our project centers around techniques of data mining or machine learning, which automatically analyse large bodies of data to discover hidden dependencies. Other methods of exploratory data analysismany of them interactiveare being investigated as an adjunct to the machine learning algorithms. This paper reviews the software that has been developed to support the application of machine learning techniques, and comments on our success with particular agricultural databases. We discuss those aspects of the databases that hinder the analysis, to assist with future data collection activities.


[25]

Selection of attributes for modeling Bach chorales by a genetic algorithm
M.A. Hall.
Selection of attributes for modeling bach chorales by a genetic
algorithm.
In Proc Second New Zealand International TwoStream Conference
on Artificial Neural Networks and Expert Systems, pages 182185, Dunedin,
NZ, 1995. IEEE Computer Society Press.
[ bib 
.ps 
.pdf ]
A genetic algorithm selected combinations of attributes for a machine learning system. The algorithm used 90 Bach chorale melodies to train models and randomly selected sets of 10 chorales for evaluation. Compression of pitch was used as the fitness evaluation criterion. The best models were used to compress a different test set of chorales and their performance compared to human generated models. G.A. models outperformed the human models, improving compression by 10 percent.


[26]

Feature selection via the discovery of simple classification rules
G. Holmes and C.G. NevillManning.
Feature selection via the discovery of simple classification rules.
In Proc Symposium on Intelligent Data Analysis (IDA95), pages
7579, BadenBaden, Germany, 1995.
[ bib 
.ps.gz 
.pdf ]
It has been our experience that in order to obtain useful results using supervised learning of realworld datasets it is necessary to perform feature subset selection and to perform many experiments using computed aggregates from the most relevant features. It is, therefore, important to look for selection algorithms that work quickly and accurately so that these experiments can be performed in a reasonable length of time, preferably interactively. This paper suggests a method to achieve this using a very simple algorithm that gives good performance across different supervised learning schemes and when compared to one of the most common methods for feature subset selection.


[27]

Document zone classification using machine learning
S. Inglis and I.H. Witten.
Document zone classification using machine learning.
In Proc Digital Image Computing: Techniques and Applications,
pages 631636, Brisbane, Australia, 1995.
[ bib 
.pdf ]
When processing document images, an important step is classifying the zones they contain into meaningful categories such as text, halftone pictures, line drawings, and mathematical formulae. A character recognition system, for example, may confine its attention to zones that are classified as text, while in an image compressor may employ specialized techniques and models for zones such as halftone pictures. The penalty for incorrect classification may range from incorrect interpretation to reduced efficiency. In any case, the higher the classification accuracy, the better the results.
This classification problem is not new. For example, Wang and Srihari [1] described a method for zone classification that gave 100% accuracy on the images they used for testing. But these images contained just 41 zonesand one of the categories occurred only once. Previous approaches to the problem generally describe a new set of features and assign classes using some linear weighted formula or nonlinear heuristic. In contrast, our work uses predefined features and investigates the application of standard machine learning methods, using a large publiclyavailable document database as a source of training and test data.


[28]

Applications of machine learning in information retrieval
J.N. Littin.
Applications of machine learning in information retrieval.
Technical Report 0657.555, University of Waikato, Computer Science
Department, Hamilton, New Zealand, 1995.
Directed Research Project.
[ bib 
.pdf ]


[29]

InstanceBased Learning: Nearest Neighbour with Generalisation
B.I. Martin.
Instancebased learning: Nearest neighbour with generalisation.
Master's thesis, University of Waikato, Computer Science Department,
Hamilton, New Zealand, 1995.
[ bib 
.ps.gz 
.pdf ]


[30]

Interactive concept learning for enduser applications
D. Maulsby and I.H. Witten.
Interactive concept learning for enduser applications.
Technical Report 95/4, University of Waikato, Computer Science
Department, Hamilton, New Zealand, 1995.
[ bib 
.pdf ]
Personalizable software agents will learn new tasks from their users. This implies being able to learn from instructions users might give: examples, yes/no responses, and ambiguous, incomplete hints. Agents should also exploit background knowledge customized for applications such as drawing, word processing and formfilling. The task models that agents learn describe data, actions and their context. Learning about data from examples and hints is the subject of this paper.
The Cima learning system combines evidence from examples, task knowledge and user hints to form Disjunctive Normal Form (DNF) rules for classifying, generating or modifying data. Cima's dynamic bias manager generates candidate features (attribute values, functions or relations), from which its DNF learning algorithm selects relevant features and forms the rules. The algorithm is based on a classic greedy method, with two enhancements. First, the standard learning criterion, correct classification, is augmented with a set of utility and instructional criteria. Utility criteria ensures that descriptions are properly formed for use in actions, whether to classify, search for, generate or modify data. Instructional criteria ensure that descriptions include features that users suggest and avoid those that users reject. The second enhancement is to augment the usual statistical metric for selecting relevant attributes with a set of heuristics, including beliefs based on user suggestions and applicationspecific background knowledge. Using multiple heuristics increases the justification for selecting features; more important, it helps the learner choose among alternative interpretations of hints.
When tested on dialogues observed in a prior user study on a simulated interface agent, the learning algorithm achieves 95% of the learning efficiency standard established in that study.


[31]

Learning to describe data in actions
D. Maulsby and I.H. Witten.
Learning to describe data in actions.
In Proc PBD Workshop, Machine Learning Conference, pages
6573, Tahoe City, CA, July 1995.
[ bib 
.pdf ]
Traditional machine learning algorithms have failed to serve the needs of systems for Programming by Demonstration (PBD), which require interaction with a user (a teacher) and a task environment. We argue that traditional learning algorithms fail for two reasons: they do not cope with the ambiguous instructions that users provide in addition to examples; and their learning criterion requires only that concepts classify examples to some degree of accuracy, ignoring the other ways in which an active agent might use concepts. We show how a classic concept learning algorithm can be adapted for use in PBD by replacing the learning criterion with a set of instructional and utility criteria, and by replacing a statistical preference bias with a set of heuristics that exploit user hints and background knowledge to focus attention.


[32]

Applying Machine Learning to Agricultural Data
R.J. McQueen, S.R. Garner, C.G. NevillManning, and I.H. Witten.
Applying machine learning to agricultural data.
Computing and Electronics in Agriculture, 12:275293, 1995.
[ bib 
.ps.gz 
.pdf ]
Many techniques have been developed for abstracting, or learning, rules and relationships from diverse data sets, in the hope that machines can help in the often tedious and errorprone process of acquiring knowledge from empirical data. While these techniques are plausible, theoretically wellfounded, and perform well on more or less artificial test data sets, they stand or fall on their ability to make sense of realworld data. This paper describes a project that is applying a range of learning strategies to problems in primary industry, in particular agriculture and horticulture. We briefly survey some of the more readily applicable techniques that are emerging from the machine learning research community, describe a software workbench that allows users to experiment with a variety of techniques on realworld data sets, and detail the problems encountered and solutions developed in a case study of dairy herd management in which culling rules were inferred from a mediumsized database of herd information.


[33]

Application of Machine Learning Techniques to TimeSeries Data
R.S. Mitchell.
Application of machine learning techniques to timeseries data.
Technical Report 95/15, University of Waikato, Computer Science
Department, Hamilton, New Zealand, April 1995.
[ bib 
.ps.gz 
.pdf ]
This report documents new methods for discovering knowledge in real world timeseries data. Two complementary approaches were investigated: 1) manipulation of the original dataset into a form that is usable by conventional similaritybased learners; and 2) using sequence identification techniques to learn the concepts embedded in the database. Experimental results obtained from applying both techniques to a large agricultural database are presented and analysed.


[34]

Detecting sequential structure
C.G. NevillManning and I.H. Witten.
Detecting sequential structure.
In Proc Programming by Demonstration Workshop, Machine Learning
Conference, pages 4956, Tahoe City, CA, USA, 1995.
[ bib 
.ps.gz 
.pdf ]
Programming by demonstration requires detection and analysis of sequential patterns in a user's input, and the synthesis of an appropriate structural model that can be used for prediction. This paper describes SEQUITUR, a scheme for inducing a structural description of a sequence from a single example. SEQUITUR integrates several different inference techniques: identification of lexical subsequences or vocabulary elements, hierarchical structuring of such subsequences, identification of elements that have equivalent usage patterns, inference of programming constructs such as looping and branching, generalisation by unifying grammar rules, and the detection of procedural substructure. Although SEQUITUR operates with abstract sequences, a number of concrete illustrations are provided.


[35]

Learning from experience
C.G. NevillManning.
Learning from experience.
In Proc New Zealand Computer Science Research Students
Conference, pages 193201, University of Waikato, Hamilton, New Zealand,
1995.
[ bib 
.ps 
.pdf ]
Life is a sequence of events; some of them predictable, some unexpected. We instinctively look for structure in the stream of events we observe, and learning to anticipate these events is an important part of functioning successfully in the world. Machine learning emphasises learning from unordered facts, but for machines to be helpful in our daytoday activities, they also need to be able to recognise patterns in the sequences they observe.
The sequence of events that constitute our lives is often monotonous; we end up repeating certain sequences of actions again and again. We have built machines to relieve some of this monotony, but they often fail to be flexible enough to free us of all but the most common repetitive tasks.
SEQUITUR is a system which infers structure from a sequence, and uses the structure to explain and extrapolate the sequence. It is capable of recognising structure in a variety of realworld sequences, but is of particular utility in automating repetitive computing tasks.


[36]

The Development of Holte's 1R Classifier
C.G. NevillManning, G. Holmes, and I.H. Witten.
The development of holte's 1r classifier.
In Proc Artificial Neural Networks and Expert Systems, pages
239242, Dunedin, NZ, 1995.
[ bib 
.ps.gz 
.pdf ]
The 1R procedure for machine learning is a very simple one that proves surprisingly effective on the standard datasets commonly used for evaluation. This paper describes the method and discusses two areas that can be improved: the way that intervals are formed when discretizing continuouslyvalued attributes, and the way that missing values are treated. Then we show how the algorithm can be extended to avoid a problem endemic to most practical machine learning algorithmstheir frequent dismissal of an attribute as irrelevant when in fact it is highly relevant when combined with other attributes.


[37]

A genetic algorithm for the induction of natural language grammars
T.C. Smith and I.H. Witten.
A genetic algorithm for the induction of natural language grammars.
In Proc IJCAI95 Workshop on New Approaches to Learning for
Natural Language Processing, pages 1724, Montreal, Canada, 1995.
[ bib 
.ps.gz 
.pdf ]
Strict patternbased methods of grammar induction are often frustrated by the apparently inexhaustible variety of novel word combinations in large corpora. Statistical methods offer a possible solution by allowing frequent wellformed expressions to overwhelm the infrequent ungrammatical ones. They also have the desirable property of being able to construct robust grammars from positive instances alone. Unfortunately, the zerofrequency problem entails assigning a small probability to all possible word patterns, thus ungrammatical ngrams become as probable as unseen grammatical ones. Further, such grammars are unable to take advantage of inherent lexical properties that should allow infrequent words to inherit the syntactic properties of the class to which they belong.
This paper describes a genetic algorithm (GA) that adapts a population of hypothesis grammars towards a more effective model of language structure. The GA is statistically sensitive in that the utility of frequent patterns is reinforced by the persistence of efficient substructures. It also supports the view of language learning as a bootstrapping problem, a learning domain where it appears necessary to simultaneously discover a set of categories and a set of rules defined over them. Results from a number of tests indicate that the GA is a robust, faulttolerant method for inferring grammars from positive examples of natural language.


[38]

Probabilitydriven lexical classification: a corpusbased approach
T.C. Smith and I.H. Witten.
Probabilitydriven lexical classification: a corpusbased approach.
In Proc Pacific Association for Computational Linguistics
Conference, pages 271283, Brisbane, Australia, 1995.
[ bib 
.ps.gz 
.pdf ]
Successful grammatical inference from a corpus of linguistic material rests largely on the ability to tag the words of the corpus with appropriate lexical categories. Static tagging methods, such as dictionary lookup, often misclassify words associated with multiple categories, and adaptive statistical taggers or contextbased corrective taggers may still have errorrates of 3 or 4 percent. Even a small proportion of lexical misclassification may lead a syntax induction mechanism to produce an extraordinarily large number of specialcase rules.
By treating grammar induction as a bootstrapping problem in which it is necessary to simultaneously discover a set of categories and a set of rules defined over them, lexical tagging is relieved of constraints imposed by traditional notions of syntactic categories. Categories are created based on salient features in the corpus under study.
This paper describes a statistical corpusbased approach to lexical inference that uses linguistic notions of syntactic function to guide its analysis of a given text. Unlike other probabilitydriven inference techniques, it is simple enough to apply to the complete vocabularies of very large texts. It has been successfully incorporated into a syntactic inference system whose resulting grammars have been shown superior to two standard stochastic CFGs.


[39]

Subset selection using rough numeric dependency
T.C. Smith and G. Holmes.
Subset selection using rough numeric dependency.
In Proc Symposium on Intelligent Data Analysis (IDA95), pages
191195, BadenBaden, Germany, 1995.
[ bib 
.ps.gz 
.pdf ]
A large percentage of computer systems are presently dedicated to the collection of an incomprehensibly vast amount of information. The daily worldwide accrual of data has long since exceeded the ability for human analysis, and machine processing has been largely limited to collation, summation, sorting and a variety of basic statistical analyses.
Artificial intelligence research in the form of expert systems has attempted to relieve much of the burden of analysis from the shoulders of human beings by developing automated reasoning systems that can respond quickly to the flood of incoming data. Computer users have countered with the collection of an even broader range of information for which expert systems have either not yet been built or whose domains have proven as yet too difficult to formalise.


[40]

PBD systems: when will they ever learn?
I.H. Witten.
Pbd systems: when will they ever learn?
In Proc Programming by Demonstration Workshop, Machine Learning
Conference, pages 19, Tahoe City, CA, USA, 1995.
[ bib 
.ps.gz 
.pdf ]
We trace the tragicomic courtship between programming by demonstration and machine learning, two fields that seem made for each other but have never quite got together. A longterm historical perspective illuminates the twin roles of interaction and sequence learning in programming by demonstration, and reveals the parallel growth of machine learning as a research area in its own right. A view of the present shows a fewbut only a fewintimations that these two fields are beginning to develop a meaningful relationship. The longterm prognosis is not so much wedded bliss as assimilationultimately PBD and ML shall, in the words of Genesis, cleave unto each other: and they shall be one flesh.


[41]

Intelligent data analysis using the WEKA workbench
I.H. Witten, S.J. Cunningham, and G. Holmes.
Intelligent data analysis using the weka workbench.
In Conference on Artificial Neural Networks and Expert Systems,
Dunedin, NZ, 1995.
Tutorial Notes.
[ bib 
.ps 
.pdf ]


[42]

Active Learning of Soft Rules for System Modelling
Eibe Frank and KlausPeter Huber.
Active learning of soft rules for system modelling.
In Proc 2nd European Congress on Intelligent Techniques and Soft
Computing, Aachen, Germany, pages 14301434, Aachen, 1996. Verlag Mainz.
[ bib 
.ps.gz 
.pdf ]


[43]

A Computer Model of Blues Music and its Evaluation
Mark Andrew Hall and Lloyd Smith.
A computer model of blues music and its evaluation.
Journal of the Acoustical Society of America,
100(2):11631167, 1996.
[ bib ]


[44]

An MDL estimate of the significance of rules
J.G. Cleary, S. Legg, and I.H. Witten.
An mdl estimate of the significance of rules.
In Proc ISIS: Information, Statistics, and Induction in
Science, pages 4353, Melbourne, Australia, August 1996.
[ bib 
.ps 
.pdf ]
This paper proposes a new method for measuring the performance of modelswhether decision trees or sets of rulesinferred by machine learning methods. Inspired by the minimum description length (MDL) philosophy and theoretically rooted in information theory, the new method measures the complexity of test data with respect to the model. It has been evaluated on rule sets produced by several different machine learning schemes on a large number of standard data sets. When compared with the usual percentage correct measure, it is shown to agree with it in restricted cases. However, in other more general cases taken from real data setsfor example, when rule sets make multiple or no predictionsit disagrees substantially. It is argued that the MDL measure is more reasonable in these cases and represents a better way of assessing the significance of a rule set's performance. The question of the complexity of the rule set itself is not addressed in the paper.


[45]

MetaData for database mining
Cleary J.G., G. Holmes, S.J. Cunningham, and I.H. Witten.
Metadata for database mining.
In Proc IEEE Metadata Conference, Silver Spring, MD, April
1996.
[ bib 
.ps 
.pdf ]
At present, a machine learning application is accomplished by carefully crafting asingle table from an often complex, multitable database. The metadata necessary to create this table is rarely formally recorded, and is sometimes implicit in the structure of the database or the typing of the attributes. We categorize the types of metadata that we have encountered in our work with machine learning applications in agriculture, and describe a first generation tool that we have built to aid in the recording and use of metadata in database mining.


[46]

Dataset cataloguing metadata for machine learning applications and research
S.J. Cunningham.
Dataset cataloguing metadata for machine learning applications and
research.
Technical Report 96/26, University of Waikato, Computer Science
Department, Hamilton, New Zealand, October 1996.
[ bib ]
As the field of machine learning (ML) matures, two types of data archives are developing: collections of benchmark data sets used to test the performance of new algorithms, and data stores to which machine learning/data mining algorithms are applied to create scientific or commercial applications. At present, the catalogs of these archives are ad hoc and not tailored to machine learning analysis. This paper considers the cataloging metadata required to support these two types of repositories, and discusses the organizational support necessary for archive catalog maintenance.


[47]

Understanding what Machine Learning produces Part I: Representations and their comprehensibility
S.J. Cunningham, M.C. Humphrey, and I.H. Witten.
Understanding what machine learning produces part i: Representations
and their comprehensibility.
Technical Report 96/21, University of Waikato, Computer Science
Department, Hamilton, New Zealand, October 1996.
[ bib 
.ps 
.pdf ]
The aim of many machine learning users is to comprehend the structures that are inferred from a dataset, and such users may be far more interested in understanding the structure of their data than in predicting the outcome of new test data. Part I of this paper surveys representations based on decision trees, production rules and decision graphs, that have been developed and used for machine learning. These representations have differing degrees of expressive power, and particular attention is paid to their comprehensibility for nonspecialist users. The graphic form in which a structure is portrayed also has a strong effect on comprehensibility, and Part II of this paper develops knowledge visualization techniques that are particularly appropriate to help answer the questions that machine learning users typically ask about the structures produced.


[48]

Selecting multiway splits in decision trees
E. Frank and I.H. Witten.
Selecting multiway splits in decision trees.
Technical Report 96/31, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, December 1996.
[ bib 
.ps 
.pdf ]
Decision trees in which numeric attributes are split several ways are more comprehensible than the usual binary trees because attributes rarely appear more than once in any path from root to leaf. There are efficient algorithms for finding the optimal multiway split for a numeric attribute, given the number of intervals in which it is to be divided. The problem we tackle is how to choose this number in order to obtain small, accurate trees.
We view each multiway decision as a model and a decision tree as a recursive structure of such models. Standard methods of choosing between competing models include resampling techniques (such as crossvalidation, holdout, or bootstrap) for estimating the classification error; and minimum description length techniques. However, the recursive situation differs from the usual one, and may call for new model selection methods.
This paper introduces a new criterion for model selection: a resampling estimate of the information gain. Empirical results are presented for building multiway decision trees using this new criterion, and compared with criteria adopted by previous authors. The new method generates multiway trees that are both smaller and more accurate than those produced previously, and their performance is comparable with standard binary decision trees.


[49]

Experimental comparison of optimisation techniques: computer optimisation of dairy farm management
R. Hart.
Experimental comparison of optimisation techniques: computer
optimisation of dairy farm management.
Master's thesis, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, 1996.
[ bib ]


[50]

Understanding what Machine Learning produces Part II: Knowledge visualization techniques
M.C. Humphrey, S.J. Cunningham, and I.H. Witten.
Understanding what machine learning produces part ii: Knowledge
visualization techniques.
Technical Report 96/22, University of Waikato, Computer Science
Department, Hamilton, New Zealand, October 1996.
[ bib 
.ps 
.pdf ]
Researchers in machine learning use decision trees, production rules, and decision graphs for visualizing classification data. Part I of this paper surveyed these representations, paying particular attention to their comprehensibility for nonspecialist users. Part II turns attention to knowledge visualization the graphic form in which a structure is portrayed and its strong influence on comprehensibility. We analyze the questions that, in our experience, end users of machine learning tend to ask of the structures inferred from their empirical data. By mapping these questions onto visualization tasks, we have created new graphical representations that show the flow of examples through a decision structure. These knowledge visualization techniques are particularly appropriate in helping to answer the questions that users typically ask, and we describe their use in discovering new properties of a data set. In the case of decision trees, an automated software tool has been developed to construct the visualizations.


[51]

Learning relational rippledown rules
J.N. Littin.
Learning relational rippledown rules.
Master's thesis, University of Waikato, Computer Science Department,
Hamilton, New Zealand, 1996.
[ bib 
.ps 
.pdf ]


[52]

An investigation into the use of machine learning for the determining of oestrus in cows
R.S. Mitchell, R.A. Sherlock, and L.A. Smith.
An investigation into the use of machine learning for the determining
of oestrus in cows.
Computing and Electronics in Agriculture, 15:195215, 1996.
[ bib 
.ps 
.pdf ]
A preliminary investigation of the application of two wellknown machine learning schemesC4.5 and FOILto detection of oestrus in dairy cows has been made. This is a problem of practical economic significance as each missed opportunity for artificial insemination results in 21 days lost milk production. Classifications were made on normalised diviations of milk volume production and milking order time series data. The best learning scheme was C4.5 which was able to detect 69% of oestrus events, albeit with an unacceptably high rate of false positives (74%). Several directions for further work and improvements are identified.


[53]

Machine learning in programming by demonstration: lessons learned from CIMA
D. Maulsby and I.H. Witten.
Machine learning in programming by demonstration: lessons learned
from cima.
In Y. Gil, editor, Acquisition, learning and demonstration:
Automating tasks for users (Proc AAAI Symposium, Stanford), pages 6672,
Menlo Park, CA, 1996. AAAI Press.
[ bib 
.ps 
.pdf ]
Programmingbydemonstration (PBD) systems learn tasks by watching the user perform them. CIMA is an interactive learning system for modeling the data selected and modified by a user as he or she undertakes a task. Part of a PBD system, CIMA is invoked when user actions are matched to find a common description of their operands. Although the system's interfaces to users and applications are still too roughhewn to permit field trials, its performance on recorded dialogs between users and a simulated agent meets the design goals established prior to its implementation.
The contributions of this work lie in three areas:
(a) a design methodology for PBD systems;
(b) a framework for user actions in PBD;
(c) novel methods of interaction with the user.
These are discussed in separate sections below. First, however, it is necessary to convey the flavor of what it is like to use the CIMA system, and this is done in the following section.


[54]

Detecting sequential structure
C.G. NevillManning.
Detecting sequential structure.
PhD thesis, University of Waikato, Department of Computer Science,
Hamilton, New Zealand, 1996.
[ bib 
.ps 
.pdf ]


[55]

Compressing semistructured text using hierarchical phrase identification
C.G. NevillManning, I.H. Witten, and D.R. Olsen.
Compressing semistructured text using hierarchical phrase
identification.
In J.A. Storer and M. Cohn, editors, Proc Data Compression
Conference, pages 6372, Los Alamitos, CA, 1996. IEEE Press.
[ bib 
.ps 
.pdf ]
Many computer files contain highlystructured, predictable information interspersed with information which has less regularity and is therefore less predictablesuch as free text. Examples range from wordprocessing source files, which contain preciselyexpressed formatting specifications enclosing tracts of naturallanguage text, to files containing a sequence of filledout forms which have a predefined skeleton clothed with relatively unpredictable entries. These represent extreme ends of a spectrum. Wordprocessing files are dominated by free text, and respond well to generalpurpose compression techniques. Forms generally contain databasestyle information, and are most appropriately compressed by taking into account their special structure. But one frequently encounters intermediate cases. For example, in many email messages the formal header and the informal freetext content are equally voluminous. Short SGML files often contain comparable amounts of formal structure and informal text. Although such files may be compressed quite well by generalpurpose adaptive text compression algorithms, which will soon pick up the regular structure during the course of normal adaptation, better compression can often be obtained by methods that are equipped to deal with both formal and informal structure.


[56]

Automatic oestrus detection from milking data  a preliminary investigation
R.A. Sherlock, L.A. Smith, and R.S. Mitchell.
Automatic oestrus detection from milking data  a preliminary
investigation.
In Proc NZ Society for Animal Production, volume 65, pages
228229, Hamilton, New Zealand, February 1996.
[ bib 
.ps 
.pdf ]
Efficient oestrus detection is important economically, particularly with seasonalcalving herds employing artificial insemination, as every missed oestrus effectively costs 21 days milk production for that cow in that season which is about 8% of the total. Traditionally, oestrus detection relies mainly on visual observation of animal behaviour  a cow in oestrus will stand and allow herself to be mounted by herdmates. Such events are readily noted when they take place amongst assembled cows, such as at milking times, but may be missed entirely in herds of freegrazing animals which are only brought in for milking twice a day, as is usual in New Zealand style systems. A common lowcost aid to oestrus detection is the use of tailpaint (Macmillan and Curnow, 1977), and although this can be very effective when properly executed the regular visual inspection and repainting is labourintensive.


[57]

Learning language using genetic algorithms
T.C. Smith and I.H. Witten.
Learning language using genetic algorithms, volume 1040 of
LNCS: Connectionist, statistical and symbolic approaches to learning for
natural language processing, pages 132145.
SpringerVerlag, New York, 1996.
[ bib 
.ps 
.pdf ]
Strict patternbased methods of grammar induction are often frustrated by the apparently inexhaustible variety of novel word combinations in large corpora. Statistical methods offer a possible solution by allowing frequent wellformed expressions to overwhelm the infrequent ungrammatical ones. They also have the desirable property of being able to construct robust grammars from positive instances alone. Unfortunately, the zerofrequency problem entails assigning a small probability to all possible word patterns, thus ungrammatical ngrams become as probable as unseen grammatical ones. Further, such grammars are unable to take advantage of inherent lexical properties that should allow infrequent words to inherit the syntactic properties of the class to which they belong.
This paper describes a genetic algorithm (GA) that adapts a population of hypothesis grammars towards a more effective model of language structure. The GA is statistically sensitive in that the utility of frequent patterns is reinforced by the persistence of efficient substructures. It also supports the view of language learning as a bootstrapping problem, a learning domain where it appears necessary to simultaneously discover a set of categories and a set of rules defined over them. Results from a number of tests indicate that the GA is a robust, faulttolerant method for inferring grammars from positive examples of natural language.


[58]

The entropy of English using PPMbased models
W.J. Teahan and J.G. Cleary.
The entropy of english using ppmbased models.
In J.A. Storer and M. Cohn, editors, Proc Data Compression
Conference, pages 5362, Los Alamitos, CA, 1996. EEE Press.
[ bib 
.ps 
.pdf ]
Over 45 years ago Claude E. Shannon estimated the entropy of English to be about 1 bit per character [16]. He did this by having human subjects guess samples of text, letter by letter. From the number of guesses made by each subject he estimated upper and lower bounds of 1.3 and 0.6 bits per character (bpc) for the entropy of English. Shannon's methodology was not improved upon until 1978 when Cover and King [6] used a gambling approach to estimate the upper bound to be 1.25 bpc from the same text. In the cryptographic community ngram analysis suggests 1.5 bpc as the asymptotic limit for 26letter English (Tilbourg [19]).


[59]

Machine Learning applied to fourteen agricultural datasets
K. Thomson and R.J. McQueen.
Machine learning applied to fourteen agricultural datasets.
Technical Report 96/18, University of Waikato, Computer Science
Department, Hamilton, New Zealand, September 1996.
[ bib 
.ps 
.pdf ]
This document reports on an investigation conducted between November, 1995 and March, 1996 into the use of machine learning on 14 sets of data supplied by agricultural researchers in New Zealand. Our purpose here is to collect together short reports on trials with these datasets using the WEKA machine learning workbench, so that some understanding of the applicability and potential application of machine learning to similar datasets may result.


[60]

Decision combination based on the characterization of predictive accuracy
K.M. Ting.
Decision combination based on the characterization of predictive
accuracy.
In R.S. Michalski and J. Wnek, editors, Proc 3rd International
Workshop on Multistrategy Learning, pages 191202, Fairfax, VA, 1996.
[ bib ]


[61]

The characterization of predictive accuracy and decision combination
K.M. Ting.
The characterization of predictive accuracy and decision combination.
In Proc International Conference on Machine Learning, pages
498506, Bari, Italy, 1996.
[ bib 
.ps 
.pdf ]
In this paper, we first explore an intrinsic problem that exists in the theories induced by learning algorithms. Regardless of the selected algorithm, search methodology and hypothesis representation by which the theory is induced, one would expect the theory to make better predictions in some regions of the description space than others. We term the fact that an induced theory will have some regions of relatively poor performance the problem of locally low predictive accuracy. Having characterised the problem of locally low predictive accuracy in InstanceBased and Naive Bayesian classifiers, we propose to counter this problem using a composite learner that incorporates both classifiers. The strategy is to select an estimated better performing classifier to do the final prediction during classification. Empirical results show that the strategy is capable of partially overcoming the problem and at the same time improving the overall performance of its constituent classifiers. We provide explanations of why the proposed composite learner performs better than the crossvalidation method and the better of its constituent classifiers.


[62]

Theory combination: an alternative to data combination
K.M. Ting and B.T. Low.
Theory combination: an alternative to data combination.
Technical Report 96/19, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, 1996.
[ bib 
.ps 
.pdf ]
The approach of combining theories learned from multiple batches of data provide an alternative to the common practice of learning one theory from all the available data (i.e., the data combination approach). This paper empirically examines the baseline behaviour of the theory combination approach in classification tasks. We find that theory combination can lead to better performance even if the disjoint batches of data are drawn randomly from a larger sample, and relate the relative performance of the two approaches to the learning curve of the classifier used.
The practical implication of our results is that one should consider using theory combination rather tan data combination, especially when multiple batches of data for the same task are readily available.
Another interesting result is that we empirically show that the nearasymptotic performance of a single theory, in some classification task, can be significantly improved by combining multiple theories (of the same algorithm) if the constituent theories are substantially different and there is some regularity in the theories to be exploited by the combination method used. Comparisons with known theoretical results are also provided.


[63]

Interacting with learning agents: implications for ML from HCI
I.H. Witten, C.G. NevillManning, and D.L. Maulsby.
Interacting with learning agents: implications for ml from hci.
In Workshop on Machine Learning meets HumanComputer
Interaction, ML'96, pages 5158, Bari, Italy, 1996.
[ bib 
.pdf ]
Computers excel at repetitive tasks. But automating them usually involves programming, which is beyond the reach of most nonspecialist users. One solution is for machines to learn procedures by observing users at workand if this enhanced users' productivity and sense of achievement, they might even be persuaded to help the system by supplying some additional information. In principle, combining machine learning with instructional interaction should increase the speed with which tasks are acquired, and enhance reliability too.


[64]

Applications of Machine Learning on two agricultural datasets
S. Yeates and K. Thomson.
Applications of machine learning on two agricultural datasets.
In Proc New Zealand Conference of Postgraduate Students in
Engineering and Technology, pages 495496, Christchurch, New Zealand, 1996.
[ bib 
.ps 
.pdf ]
The induction of decision trees from tabulated data is a field of machine learning which has been demonstrated successfully in several practical applications. This paper looks at the application of this technology to two datasets in the agricultural domain, and show why it was not possible to achieve the success obtained in other domains.


[65]

TutorialWeka: the Waikato environment of knowledge analysis
Waikato ML group.
Tutorialweka: the waikato environment of knowledge analysis,
November 1996.
[ bib ]


[66]

Dataset cataloging metadata for machine learning applications and research
S.J. Cunningham.
Dataset cataloging metadata for machine learning applications and
research.
In Proc International Artificial Intelligence and Statistics
Workshop, Ft. Lauderdale, Florida, 1997.
[ bib 
.ps 
.pdf ]
As the field of machine learning (ML) matures, two types of data archives are developing: collections of benchmark data sets used to test the performance of new algorithms, and data stores to which machine learning/data mining algorithms are applied to create scientific or commercial applications. At present, the catalogs of these archives are ad hoc and not tailored to machine learning analysis. This paper considers the cataloging metadata required to support these two types of repositories, and discusses the organizational support necessary for archive catalog maintenance.


[67]

Machine learning applications in anthropology: automated discovery over kinship structures
S.J. Cunningham.
Machine learning applications in anthropology: automated discovery
over kinship structures.
Computers and the Humanities, 30:401406, 1997.
[ bib 
.ps 
.pdf ]
A common problem in anthropological field work is generalizing rules governing social interactions and relations (particularly kinship) from a series of examples. One class of machine learning algorithms is particularly wellsuited to this task: inductive logic programming systems, as exemplified by FOIL. A knowledge base of relationships among individuals is established, in the form of a series of singlepredicate facts. Given a set of positive and negative examples of a new relationship, the machine learning programs build a Horn clause description of the target relationship. The power of these algorithms to derive complex hypotheses is demonstrated for a set of kinship relationships drawn from the anthropological literature. FOIL extends the capabilities of earlier anthropologyspecific learning programs by providing a more powerful representation for induced relationships, and is better able to learn in the face of noisy or incomplete data.


[68]

BrainLike Computing and Intelligent Information Systems
S.J. Cunningham, G. Holmes, J. Littin, R. Beale, and I.H. Witten.
BrainLike Computing and Intelligent Information Systems,
chapter Applying connectionist models to information retrieval, pages
435457.
SpringerVerlag, 1997.
[ bib 
.ps 
.pdf ]
Adaptive information retrieval (IR) systems based on connectionist architectures have captured the attention of researchers over the past decade. This paper provides a review of connectionist IR research, including the major models for connectionist document and query representation, techniques to enhance query reformulation, dynamic document routing (information filtering), and connectionist techniques for document clustering.


[69]

Applications of machine learning in information retrieval
S.J. Cunningham, J.N. Littin, and I.H. Witten.
Applications of machine learning in information retrieval.
Technical Report 97/6, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, February 1997.
[ bib 
.ps 
.pdf ]
Information retrieval systems provide access to collections of thousands, or millions, of documents, from which, by providing an appropriate description, users can recover any one. Typically, users iteratively refine the descriptions they provide to satisfy their needs, and retrieval systems can utilize user feedback on selected documents to indicate the accuracy of the description at any stage. The style of description required from the user, and the way it is employed to search the document database, are consequences of the indexing method used for the collection. The index may take different forms, from storing keywords with links to individual document, to clustering documents under related topics.
Much of the work in information retrieval can be automated. Processes such as document indexing and query refinement are usually accomplished by computer, while document classification and index term selection are more often performed manually. However, manual development and maintenance of document databases is timeconsuming, tedious, and errorprone. Algorithms that mine documents for indexing information, and model user interests to help them formulate queries, reduce the workload and can ensure more consistent behavior. Such algorithms are based in machine learning, a dynamic, burgeoning area of computer science which is finding application in domains ranging from expert systems, where learning algorithms supplementor even supplantdomain experts for generating rules and explanations (Langley and Simon, 1995), to intelligent agents, which learn to play particular, highlyspecialized, support roles for individual people and are seen by some to herald a new renaissance of artificial intelligence in information technology (Hendler, 1997).


[70]

Feature subset selection: a correlation based filter approach
M.A. Hall and L.A. Smith.
Feature subset selection: a correlation based filter approach.
In N. Kasabov and et al., editors, Proc Fourth International
Conference on Neural Information Processing and Intelligent Information
Systems, pages 855858, Dunedin, New Zealand, 1997.
[ bib 
.ps 
.pdf ]
Recent work has shown that feature subset selection can have a positive affect on the performance of machine learning algorithms. Some algorithms can be slowed or their performance adversely affected by too much data, some of which may be irrelevant or redundant to the learning task. Feature subset selection, then, is a method for enhancing the performance of learning algorithms, reducing the storage requirement. This paper describes a feature subset selector that uses a correlation based heuristic to determine the goodness of feature subsets, and evaluates its effectiveness with three common ML algorithms: a decision tree inducer (C4.5), a naive Bayes classifier, and an instance based learner (IB1). Experiments using a number of standard data sets drawn from real and artificial domains are presented. Feature subset selection gave significant improvement for all three algorithms; C4.5 generated smaller decision trees.


[71]

Discovering interattribute relationships
G. Holmes.
Discovering interattribute relationships.
In N. Kasabov and et al., editors, Proc Fourth International
Conference on Neural Information Processing and Intelligent Information
Systems, pages 914917, Dunedin, New Zealand, 1997.
[ bib 
.ps 
.pdf ]
It is important to discover relationships between attributes being used to predict a class attribute in supervised learning situations for two reasons. First, any such relationship will be potentially interesting to the provider of a dataset in its own right. Second, it would simplify a learning algorithm's search space, and the related irrelevant feature and subset selection problem, if the relationships were removed from datasets ahead of learning. An algorithm to discover such relationships is presented in this paper. The algorithm is described and a surprising number of interattribute relationships are discovered in datasets from the University of California at Irvine (UCI) repository.


[72]

Mining for causes of cancer: machine learning experiments at various levels of detail
S. Kramer, B. Pfahringer, and C. Helma.
Mining for causes of cancer: machine learning experiments at various
levels of detail.
In Proc International Conference on Knowledge Discovery and Data
Mining, pages 223226, Menlo Park, 1997. AAAI Press.
[ bib 
.ps 
.pdf ]
This paper presents first results of an interdisciplinary project in scientific data mining. We analyze data about the carcinogenicity of chemicals derived from the carcinogenesis bioassay program performed by the US National Institute of Environmental Health Sciences. The database contains detailed descriptions of 6823 tests performed with more than 330 compounds and animals of different species, strains and sexes. The chemical structures are described at the atom and bond level, and in terms of various relevant structural properties. The goal of this paper is to investigate the effects that various levels of detail and amounts of information have on the resulting hypotheses, both quantitatively and qualitatively. We apply relational and propositional machine learning algorithms to learning problems formulated as regression or as classification tasks. In addition, these experiments have been conducted with two learning problems which are at different levels of detail. Quantitatively, our experiments indicate that additional information not necessarily improves accuracy. Qualitatively, a number of potential discoveries have been made by the algorithm for Relational Regression because it can utilize all the information contained in the relations of the database as well as in the numerical dependent variable.


[73]

CIMA: An interactive concept learning system for enduser applications
D. Maulsby and I.H. Witten.
Cima: An interactive concept learning system for enduser
applications.
Applied Artificial Intelligence, 11(78):653671, 1997.
[ bib 
.ps 
.pdf ]
Personalizable software agents will learn new tasks from their users. In many cases the most appropriate way for users to teach is to demonstrate examples. Learning complex concepts from examples alone is hard, but agents can exploit other forms of instruction that users might give, ranging from yes/no responses to ambiguous, incomplete hints. Agents can also exploit background knowledge customized for applications such as drawing, word processing and formfilling.
The Cima system learns generalized rules for classifying, generating, and modifying data, given examples, hints, and background knowledge. It copes with the ambiguity of user instructions by combining evidence from these sources. A dynamic bias manager generates candidate features (attribute values, functions or relation) from which the learning algorithm selects relevant ones and forms appropriate rules. When tested on dialogues observed in a prior user study on a simulated interface agent, the system achieved 95% of the learning efficiency observed in that study.


[74]

Learning agents: from user study to implementation
D. Maulsby and I.H. Witten.
Learning agents: from user study to implementation.
IEEE Computer, 30(11):3644, 1997.
[ bib 
.ps 
.pdf ]
This paper describes the design, implementation and evaluation of an agent that learns to automate repetitive tasks within the humancomputer interface. In contrast to most Artificial Intelligence projects, the design centers on a user study, with a humansimulated agent used to discover the interactions that people find natural. This study shows the ways in which users instinctively communicate via hints, or partiallyspecified, ambiguous, instructions. Hints may be communicated using speech, by pointing, or by selecting from menus. We develop a taxonomy of such instructions. The implementation demonstrates that computers can learn from examples and ambiguous hints.


[75]

Compression and explanation using hierarchical grammars
C.G. NevillManning and I.H. Witten.
Compression and explanation using hierarchical grammars.
Computer Journal, 40(2/3):103116, 1997.
[ bib 
.ps.gz 
.pdf ]
This paper describes an algorithm, called SEQUITUR, that identifies hierarchical structure in sequences of discrete symbols and uses that information for compression. On many practical sequences it performs well at both compression and structural inference, producing comprehensible descriptions of sequence structure in the form of grammar rules. The algorithm can be stated concisely in the form of two constraints on a contextfree grammar. Inference is performed incrementally, the structure faithfully representing the input at all times. It can be implemented efficiently and operates in a time that is approximately linear in sequence length. Despite its simplicity and efficiency, SEQUITUR succeeds in inferring a range of interesting hierarchical structures from naturally occurring sequences.


[76]

Identifying hierarchical structure in sequence: a lineartime algorithm
C.G. NevillManning and I.H. Witten.
Identifying hierarchical structure in sequence: a lineartime
algorithm.
Artificial Intelligence Research, 7:6782, 1997.
[ bib 
.ps 
.pdf ]
SEQUITUR is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing repeated phrases with a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence, which offers insights into its lexical structure. The algorithm is driven by two constraints that reduce the size of the grammar, and produce structure as a byproduct. SEQUITUR breaks new ground by operating incrementally. Moreover, the method's simple structure permits a proof that it operates in space and time that is linear in the size of the input. Our implementation can process 50,000 symbols per second and has been applied to an extensive range of real world sequences.


[77]

Lineartime, incremental hierarchy inference for compression
C.G. NevillManning and I.H. Witten.
Lineartime, incremental hierarchy inference for compression.
In J.A. Storer and M. Cohn, editors, Proc Data Compression
Conference, pages 311, Los Alamitos, CA, 1997. IEEE Press.
[ bib 
.ps 
.pdf ]
Data compression and learning are, in some sense, two sides of the same coin. If we paraphrase Occam's razor by saying that a small theory is better than a larger theory with the same explanatory power, we can characterize data compression as a preoccupation with small, and learning as a preoccupation with better. NevillManning et al. (1994) presented an algorithm, since dubbed SEQUITUR, that presents both faces of the compression/learning coin. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences as diverse as DNA and music are intuitively compelling.
In this paper, we present three new results that characterize SEQUITUR's computational and compression performance. First, we prove that SEQUITUR operates in time linear in n, the length of the input sequence, despite its ability to build a hierarchy as deep as log(n). Second, we show that a sequence can be compressed incrementally, improving on the nonincremental algorithm that was described by NevillManning et al. (1994), and making online compression feasible. Third, we present an intriguing result that emerged during benchmarking; whereas PPMC (Moffat, 1990) outperforms SEQUITUR on most files in the Calgary corpus, SEQUITUR regains the lead when tested on multimegabyte sequences. We make some tentative conclusions about the underlying reasons for this phenomenon, and about the nature of current compression benchmarking.


[78]

Compressionbased pruning of decision lists
B. Pfahringer.
Compressionbased pruning of decision lists.
In Proc European Conference on Machine Learning, volume 1224 of
LNAI, pages 199212, Prague, Czech Republic, 1997. SpringerVerlag.
[ bib 
.ps 
.pdf ]
We define a formula for estimating the coding costs of decision lists for propositional domains. This formula allows for multiple classes and both categorical and numerical attributes. For artificial domains the formula performs quite satisfactory, whereas results are rather mixed and inconclusive for natural domains. Further experiments lead to a principled simplification of the original formula which is robust in both artificial and natural domains. Simple hillclimbing search for the most compressive decision list significantly reduces the complexity of a given decision list while not impeding and sometimes even improving its predictive accuracy.


[79]

Probabilistic unification grammars
T.C. Smith and J.G. Cleary.
Probabilistic unification grammars.
In Workshop notes: ACSC'97 Australasian Natural Language
Processing Summer Workshop, pages 2532, 1997.
[ bib 
.ps 
.pdf ]
Recent research has shown that unification grammars can be adapted to incorporate statistical information, thus preserving the processing benefits of stochastic contextfree grammars while offering an efficient mechanism for handling dependencies. While complexity studies show that a probabilistic unification grammar achieves an appropriately lower entropy estimate than an equivalent PCFG, the problem of parameter estimation prevents results from reflecting the empirical distribution. This paper describes how a PUG can be implemented as a Prolog DCG annotated with weights, and how the weights can be interpreted to give accurate entropy estimates. An algorithm for learning correct weights is provided, along with results from some complexity analyses.


[80]

Models of English text
W.J. Teahan and J.G. Cleary.
Models of english text.
In J.A. Storer and M. Cohn, editors, Proc Data Compression
Conference, pages 1221, Los Alamitos, CA, 1997. IEEE Press.
[ bib 
.ps 
.pdf ]
The problem of constructing models of English text is considered. A number of applications of such models including cryptology, spelling correction and speech recognition are reviewed. The best current models for English text have been the result of research into compression. Not only is this an important application of such models but the amount of compression provides a measure of how well such models perform. Three main classes of models are considered: character based models, word based models, and models which use auxiliary information in the form of parts of speech. These models are compared in terms of their memory usage and compression.


[81]

Inducing costsensitive trees via instanceweighting
K.M. Ting.
Inducing costsensitive trees via instanceweighting.
Technical Report 97/22, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, September 1997.
[ bib ]
We introduce an instanceweighting method to induce costsensitive trees in this paper. It is a generalization of the standard tree induction process where only the initial instance weights determine the type of tree (i.e., minimum error trees or minimum cost trees) to be induced. We demonstrate that it can be easily adopted to an existing tree learning algorithm.
Previous research gave insufficient evidence to support the fact that the greedy divideandconquer algorithm can effectively induce a truly costsensitive tree directly from the training data. We provide this empirical evidence in this paper. The algorithm employing the instanceweighting method is found to be comparable to or better than both C4.5 and C5 in terms of total misclassification costs, tree size and the number of high cost errors. The instanceweighting method is also simpler and more effective in implementation than a method based on altered priors.


[82]

Model combination in the multipledatabatched scenario
K.M. Ting and B.T. Low.
Model combination in the multipledatabatched scenario.
In Proc European Conference on Machine Learning, volume 1224 of
LNAI, pages 250265, Prague, Czech Republic, 1997. SpringerVerlag.
[ bib 
.ps 
.pdf ]
The approach of combining models learned from multiple batches of data provide an alternative to the common practice of learning one model from all the available data (i.e., the data combination approach). This paper empirically examines the baseline behaviour of the model combination approach in this multipledatabatches scenario. We find that model combination can lead to better performance even if the disjoint batches of data are drawn randomly from a larger sample, and relate the relative performance of the two approaches to the learning curve of the classifier used.
The practical implication of our results is that one should consider using model combination rather than data combination, especially when multiple batches of data for the same task are readily available.
Another interesting result is that we empirically show that the nearasymptotic performance of a single model, in some classification task, can be significantly improved by combining multiple models (derived from the same algorithm) if the constituent models are substantially different and there is some regularity in the models to be exploited by the combination method used. Comparisons with known theoretical results are also provided.


[83]

Learning from batched data: model combination vs data combination
K.M. Ting, B.T. Low, and I.H. Witten.
Learning from batched data: model combination vs data combination.
Technical Report 97/14, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, May 1997.
[ bib ]
When presented with multiple batches of data, one can either combine them into a single batch before applying a machine learning procedure or learn from each batch independently and combine the resulting models. The former procedure, data combination, is straightforward; this paper investigates the latter, model combination. Given an appropriate combination method, one might expect model combination to prove superior when the data in each batch was obtained under somewhat different conditions or when different learning algorithms were used on the batches. Empirical results show that model combination often outperforms data combination even when the batches are drawn randomly from a single source of data and the same learning method is used on each. Moreover, this is not just an artifact of one particular method of combining models: it occurs with several different combination methods.
We relate this phenomenon to the learning curve of the classifiers being used. Early in the learning process when the learning curve is steep there is much to gain from data combination, but later when it becomes shallow there is less to gain and model combination achieves a greater reduction in variance and hence a lower error rate.
The practical implication of these results is that one should consider using model combination rather than data combination, especially when multiple batches of data for the same task are readily available. It is often superior even when the batches are drawn randomly from a single sample, and we expect its advantage to increase if genuine statistical differences between the batches exist.


[84]

Stacked generalization: when does it work?
K.M. Ting and I.H. Witten.
Stacked generalization: when does it work?
In Proc International Joint Conference on Artificial
Intelligence, pages 866871, Japan, 1997.
[ bib 
.ps 
.pdf ]
Stacked generalization is a general method of using a highlevel model to combine lowerlevel models to achieve greater predictive accuracy. In this paper we address two crucial issues which have been considered to be a 'black art' in classification tasks ever since the introduction of stacked generalization in 1992 by Wolpert: the type of generalizer that is suitable to derive the higherlevel model, and the kind of attributes that should be used as its input. We demonstrate the effectiveness of stacked generalization for combining three different types of learning algorithms.


[85]

Stacking bagged and dagged models
K.M. Ting and I.H. Witten.
Stacking bagged and dagged models.
In Proc International Conference on Machine Learning, pages
367375, Tennessee, USA, 1997.
[ bib 
.ps 
.pdf ]
In this paper, we investigate the method of stacked generalization in combining models derived from different subsets of a training dataset by a single learning algorithm, as well as different algorithms. The simplest way to combine predictions from competing models is majority vote, and the effect of the sampling regime used to generate training subsets has already been studied in this contextwhen bootstrap samples are used the method is called bagging, and for disjoint samples we call it dagging. This paper extends these studies to stacked generalization, where a learning algorithm is employed to combine the models. This yields new methods dubbed bagstacking and dagstacking.
We demonstrate that bagstacking and dagstacking can be effective for classification tasks even when the training samples cover just a small fraction of the full dataset. In contrast to earlier bagging results, we show that bagging and bagstacking work for stable as well as unstable learning algorithms, as do dagging and dagstacking. We find that bagstacking (dagstacking) almost always has higher predictive accuracy than bagging (dagging), and we also show that bagstacking models derived using two different algorithms is more effective than bagging.


[86]

Induction of model trees for predicting continuous classes
Y. Wang and I.H. Witten.
Induction of model trees for predicting continuous classes.
In Proc European Conference on Machine Learning Poster Papers,
pages 128137, Prague, Czech Republic, 1997.
[ bib 
.ps 
.pdf ]
Many problems encountered when applying machine learning in practice involve predicting a class that takes on a continuous numeric value, yet few machine learning schemes are able to do this. This paper describes a rational reconstruction of M5, a method developed by Quinlan (1992) for inducing trees of regression models. In order to accommodate data typically encountered in practice it is necessary to deal effectively with enumerated attributes and with missing values, and techniques devised by Breiman et al. (1984) are adapted for this purpose. The resulting system seems to outperform M5, based on the scanty published data that is available.


[87]

User manual  Weka: the Waikato environment for knowledge analysis
Waikato ML group.
User manual  weka: the waikato environment for knowledge analysis,
June 1997.
This is out of date and refers to the old C version of WEKA, which we
no longer support or distribute.
[ bib ]


[88]

Experience with OB1: an optimal Bayes decision tree learner
J.G. Cleary and L. Trigg.
Experience with ob1: an optimal bayes decision tree learner.
Technical Report 98/10, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, May 1998.
[ bib 
.ps 
.pdf ]
Machine learning algorithms for inferring decision trees typically choose a single best tree to describe the training data. Recent research has shown that classification performance can be significantly improved by voting predictions of multiple, independently produced decision trees. This paper describes an algorithm, OB1, that makes a weighted sum over many possible models. We describe one instance of OB1, that includes all possible decision trees as well as naive Bayesian models. OB1 is compared with a number of other decision tree and instance based learning alogrithms on some of the data sets from the UCI repository. Both an information gain and an accuracy measure are used for the comparison. On the information gain measure OB1 performs significantly better than all the other algorithms. On the accuracy measure it is significantly better than all the algorithms except naive Bayes which performs comparably to OB1.


[89]

Generating Accurate Rule Sets Without Global Optimization
Eibe Frank and Ian H. Witten.
Generating accurate rule sets without global optimization.
In Proc 15th International Conference on Machine Learning,
Madison, Wisconsin, pages 144151. Morgan Kaufmann, 1998.
Also available as Working Paper 98/2, Department of Computer Science,
University of Waikato; January.
[ bib 
.ps 
.pdf ]
The two dominant schemes for rulelearning, C4.5 and RIPPER, both operate in two stages. First they induce an initial rule set and then they refine it using a rather complex optimization stage that discards (C4.5) or adjusts (RIPPER) individual rules to make them work better together. In contrast, this paper shows how good rule sets can be learned one rule at a time, without any need for global optimization. We present an algorithm for inferring rules by repeatedly generating partial decision trees, thus combining the two major paradigms for rule generationcreating rules from decision trees and the separateandconquer rulelearning technique. The algorithm is straightforward and elegant: despite this, experiments on standard datasets show that it produces rule sets that are as accurate as and of similar size to those generated by C4.5, and more accurate than RIPPER's. Moreover, it operates efficiently, and because it avoids postprocessing, does not suffer the extremely slow performance on pathological example sets for which the C4.5 method has been criticized.


[90]

Using a Permutation Test for Attribute Selection in Decision
Trees
Eibe Frank and Ian H. Witten.
Using a permutation test for attribute selection in decision trees.
In Proc 15th International Conference on Machine Learning,
Madison, Wisconsin, pages 152160. Morgan Kaufmann, 1998.
[ bib 
.ps 
.pdf ]
Most techniques for attribute selection in decision trees are biased towards attributes with many values, and several ad hoc solutions to this problem have appeared in the machine learning literature. Statistical tests for the existence of an association with a prespecified significance level provide a wellfounded basis for addressing the problem. However, many statistical tests are computed from a chisquared distribution, which is only a valid approximation to the actual distribution in the largesample caseand this patently does not hold near the leaves of a decision tree. An exception is the class of permutation tests. We describe how permutation tests can be applied to this problem. We choose one such test for further exploration, and give a novel twostage method for applying it to select attributes in a decision tree. Results on practical datasets compare favorably with other methods that also adopt a prepruning strategy.


[91]

Using Model Trees for Classification
Eibe Frank, Yong Wang, Stuart Inglis, Geoffrey Holmes, and Ian H. Witten.
Using model trees for classification.
Machine Learning, 32(1):6376, 1998.
Also available as Working Paper 97/12, Department of Computer
Science, University of Waikato; April.
[ bib 
.ps 
.pdf ]
Model trees, which are a type of decision tree with linear regression functions at the leaves, form the basis of a recent successful technique for predicting continuous numeric values. They can be applied to classification problems by employing a standard method of transforming a classification problem into a problem of function approximation. Surprisingly, using this simple transformation the model tree inducer M5', based on Quinlan's M5, generates more accurate classifiers than the stateoftheart decision tree learner C5.0, particularly when most of the attributes are numeric.


[92]

Practical feature subset selection for machine learning
Mark Andrew Hall and Lloyd Smith.
Practical feature subset selection for machine learning.
In Proc 21st Australian Computer Science Conference, pages
181191, Perth, Australia, 1998. Springer.
[ bib 
.ps 
.pdf ]
Machine learning algorithms automatically extract knowledge from machine readable information. Unfortunately, their success is usually dependant on the quality of the data that they operate on. If the data is inadequate, or contains extraneous and irrelevant information, machine learning algorithms may produce less accurate and less understandable results, or may fail to discover anything of use at all. Feature subset selectors are algorithms that attempt to identify and remove as much irrelevant and redundant information as possible prior to learning. Feature subset selection can result in enhanced performance, a reduced hypothesis search space, and, in some cases, reduced storage requirement. This paper describes a new feature selection algorithm that uses a correlation based heuristic to determine the goodness of feature subsets, and evaluates its effectiveness with three common machine learning algorithms. Experiments using a number of standard machine learning data sets are presented. Feature subset selection gave significant improvement for all three algorithms.


[93]

Optimisation techniques for a computer simulation of a pastoral dairy farm
Hart R., M.T. Larcombe, R.A. Sherlock, and L.A. Smith.
Optimisation techniques for a computer simulation of a pastoral dairy
farm.
Journal Computing and Electronics in Agriculture, 19:129153,
1998.
[ bib 
.ps 
.pdf ]
This paper compares different methods of optimising the management variables in UDDER, a commerciallyavailable computer simulation model of a pastoral dairy farm. The emphasis is on identifying the best optimisation strategy for this complex multidimensional system, taking the simulation model as a given constant. The optimisation methods studied are based on significantly different principles, with differing strengths and weaknesses: two hillclimbing algorithms (NelderMead Simplex and Powell's Direction Set), and a genetic algorithm. Rather than examine all facets of dairy farm management, a single problem is optimisedthat of maximising milkfat production while maintaining the health of the herd and pasture.
The results show that while the genetic algorithm can determine good regions within the search space quickly, it is considerably slower than either hillclimber at finding the optimal point within that region. The hillclimbers, in contrast, are fast but have a tendency to get trapped on local maxima and thus fail to find the true optimum. This led to the development of a hybrid algorithm which utilises the initial global search of the genetic algorithm, followed by the more efficient local search of a hillclimber. This hybrid algorithm discovered nearoptimal points much more quickly than the genetic algorithm, and with more reliability than the hillclimber.


[94]

Predicting apple bruising using machine learning
Holmes G., S.J. Cunningham, B.T. Dela Rue, and A.F. Bollen.
Predicting apple bruising using machine learning.
In L.M.M. Tijskens and M.L.A.T.M. Hertog, editors, ModelIT
Conference, Acta Horticulturae, volume 476, pages 289296, The Netherlands,
1998.
Also available as Working Paper 98/7, Department of Computer Science,
University of Waikato; April.
[ bib 
.ps 
.pdf ]
Many models have been used to describe the influence of internal or external factors on apple bruising. Few of these have addressed the application of derived relationships to the evaluation of commercial operations. From an industry perspective, a model must enable fruit to be rejected on the basis of a commercially significant bruise and must also accurately quantify the effects of various combinations of input features (such as cultivar, maturity, size, and so on) on bruise prediction. Input features must in turn have characteristics which are measurable commercially; for example the measure of force should be impact energy rather than energy absorbed. Further, as the commercial criteria for acceptable damage levels change, the model should be versatile enough to regenerate new bruise thresholds from existing data.
Machine learning is a burgeoning technology with a vast range of potential applications particularly in agriculture where large amounts of data can be readily collected [1]. The main advantage of using a machine learning method in an application is that the models built for prediction can be viewed and understood by the owner of the data who is in a position to determine the usefulness of the model, an essential component in a commercial environment.
Machine Learning software recently developed at Waikato University [2] offers potential as a prediction tool for the classification of bruising based on historical data. It gives the user the opportunity to select any number of measured input attributes and determine the influence of that combination on a range of bruise size categories. The user may require a high degree of accuracy in the classification and therefore prune the attributes or bruise classes accordingly, or alternatively seek to discover trends in the dataset (in which case a lower level of accuracy often clarifies implicit structures in the data).
Models such as the theory of elasticity suggest that impact energy and radius of curvature will have a significant effect on the bruise surface area. Cell structure is also thought to contribute to variation in bruise size. The experiment described in this paper uses the machine learning programs C5.4 [3] and M5' [4] to determine the influence of impact energy, radius of curvature and impact site location on bruise area.


[95]

Knowledge visualization techniques for machine learning
M. Humphrey, S.J. Cunningham, and I.H. Witten.
Knowledge visualization techniques for machine learning.
Intelligent Data Analysis, 2(14):333347, 1998.
[ bib ]
Researchers in machine learning primarily use decision trees, production rules, and decision graphs for visualizing classification data, with the graphic form in which a structure is portrayed as having a strong influence on comprehensibility. We analyze the questions that, in our experience, end users of machine learning tend to ask of the structures inferred from their empirical data. By mapping these questions onto visualization tasks, we have created new graphical representations that show the flow of examples through a decision structure. The knowledge visualization techniques are particularly appropriate in helping to answer the questions that users typically ask, and we describe their use in discovering new properties of a data set. In the case of decision trees, an automated software tool has been developed to construct the visualizations.


[96]

Objective measurement of mushroom quality
N. Kusabs, F. Bollen, L. Trigg, G. Holmes, and S. Inglis.
Objective measurement of mushroom quality.
In Proc New Zealand Institute of Agricultural Science and the
New Zealand Society for Horticultural Science Annual Convention, page 51,
Hawke's Bay, New Zealand, 1998.
[ bib ]
This paper describes a methodology for establishing an objective measurement of mushroom quality, based on a set of measured physical attributes. These attributes were used by a machine learning tool to model quality based on the classification of professional graders (experts).
Four experts visually evaluate 300 mushrooms and graded them into three major and eight subclasses of commercial quality. Weight, firmness and images of the top and bottom of each mushroom were then measured. These physical parameters were used to construct a model of the grades assigned by the four experts. Grader consistency was also assessed by repeated classification (four repetitions) of two 100mushroom sets. Grader repeatability ranged from 6 to 15% misclassification.
Misclassification by the model increased with grading complexity (1632% for the three major grades and 1763% for the eight subclasses). This depended heavily on grader classifications and the variables used for training the model (weight, firmness, estimated gill opening and red/green/blue histograms from the image). Accuracy of classification using various combinations of physical attributes, which indicate the relative weight placed on each attribute by the different graders, are detailed in the paper.


[97]

Predicting query times
McNab R., Y. Wang, I.H. Witten, and C. Gutwin.
Predicting query times.
In W.B. Croft and et al., editors, Proc ACM SIGIR Conference on
Research and Development in Information Retrieval, pages 355356,
Melbourne, Australia, 1998. ACM Press.
[ bib ]
We outline the need for search engines to provide user feedback on the expected time for a query, describe a scheme for learning a model of query time by observing sample queries, and discuss the results obtained for a set of actual user queries on a document collection using the MG search engine.


[98]

User perceptions of machine learning
R.J. McQueen and G. Holmes.
User perceptions of machine learning.
In E.D. Hoadley and I. Benbasat, editors, Proc Association of
Information Systems Conference, pages 180182, Maryland, Baltimore, 1998.
Association for Information Systems, Atlanta, GA.
[ bib 
.pdf ]
Machine learning has potential use in the understanding of information hidden in large datasets, but little is known about user's perceptions about the use of the technology. In this study, a number of datasets were solicited from agricultural researchers and processed using a machine learning workbench. The results were reported to the researchers, and then interviews were conducted with some of them to determine their perceptions about the use of machine learning as an additional analysis technique to traditional statistical analysis. An number of themes about their satisfaction with this technique were constructed from the interview transcripts, which generally indicate that machine learning may be able to contribute to analysis and understanding of these kinds of datasets.


[99]

User satisfaction with machine learning as a data analysis method in agricultural research
R.J. McQueen, G. Holmes, and L. Hunt.
User satisfaction with machine learning as a data analysis method in
agricultural research.
New Zealand Journal of Agricultural Research, 41(4):577584,
1998.
[ bib 
.pdf ]
Machine learning has potential use in the understanding of information hidden in large datasets, but little is know about user perceptions about the use of the technology. In this study, a number of datasets were solicited from agricultural researchers and processed using a machine learning workbench. The results were reported to the researchers, and then interviews were conducted with some of them to determine their perceptions about the use of machine learning as an additional analysis technique to traditional statistical analysis. An number of themes about their satisfaction with this technique were constructed from the interview transcripts, which generally indicate that machine learning may be able to contribute to analysis and understanding of these kinds of datasets.


[100]

Inferring lexical and grammatical structure from sequences
C.G. NevillManning and I.H. Witten.
Inferring lexical and grammatical structure from sequences.
In B. Carpentieri and et al., editors, Proc Compression and
Complexity of Sequences, pages 265274, Los Alamitos, CA, 1997. IEEE
Computer Society Press.
Published as NevillManning and Witten, 1998.
[ bib 
.ps 
.pdf ]
In a wide variety of sequences from various sources, from music and text to DNA and computer programs, two different but related kinds of structure can be discerned. First, some segments tend to be repeated exactly, such as motifs in music, words or phrases in text, identifiers and syntactic idioms in computer programs. Second, these segments interact with each other in variable but constrained ways. For example, in English text only certain syntactic word classes can appear after the word 'the'many parts of speech (such as verbs) are necessarily excluded.
This paper shows how these kinds of structure can be inferred automatically from sequences. Let us make clear at the outset what aspects of sequence structure we are not concerned with. We take no account of numerical frequencies other than the 'more than once' that defines repetition. We do not consider any similarity metrics between the individual symbols that make up the sequence, nor between 'similar' subsequences such as transposed or transformed motifs in music. Finally, although we are certainly interested in nested repetitions, we do not analyze recursive structure in sequencessuch as selfsimilarity in fractal sequences. All of these regularities are interesting ones that would be well worth taking into account, but lie beyond the scope of this paper.


[101]

Phrase hierarchy inference and compression in bounded space
C.G. NevillManning and I.H. Witten.
Phrase hierarchy inference and compression in bounded space.
In J.A. Storer and M. Cohn, editors, Proc Data Compression
Conference, pages 179188, Los Alamitos, CA, 1998. IEEE Press.
[ bib 
.ps 
.pdf ]
Text compression by inferring a phrase hierarch from the input is a recent technique that shows promise both as a compression scheme and as a machine learning method that extracts some comprehensible account of the structure of the input text. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences have been put to such eclectic uses as phrase browsing in digital libraries, music analysis, and inferring rules for fractal images.


[102]

Fast convergence with a greedy tagphrase dictionary
T.C. Smith and R. Peeters.
Fast convergence with a greedy tagphrase dictionary.
In J.A. Storer and M. Cohn, editors, Proc Data Compression
Conference, pages 3342, Los Alamitos, CA, 1998. IEEE Press.
[ bib 
.ps 
.pdf ]
The best generalpurpose compression schemes make their gains by estimating a probability distribution over all possible next symbols given the context established by some number of previous symbols. Such context models typically obtain good compression results for plain text by taking advantage of regularities in character sequences. Frequent words and syllables can be incorporated into the model quickly and thereafter used for reasonably accurate prediction. However, the precise context in which frequent patterns emerge is often extremely varied, and each new word or phrase immediately introduces new contexts which can adversely affect the compression rate.
A great deal of the structural regularity in a natural language is given rather more by properties of its grammar than by the orthographic transcription of its phonology. This implies that access to a grammatical abstraction might lead to good compression. While grammatical models have been used successfully for compressing computer programs [4], grammarbased compression of plain text has received little attention, primarily because of the difficulties associated with constructing a suitable natural language grammar. But even without a precise formulation of the syntax of a language, there is a linguistic abstraction which is easily accessed and which demonstrates a high degree of regularity which can be exploited for compression purposesnamely, lexical categories.


[103]

Learning featurevalue grammars from plain text
T.C. Smith.
Learning featurevalue grammars from plain text.
In David M.W. Powers, editor, Proc Joint Conference on New
Methods in Language Processing and Computational Natural Language Learning,
pages 291294, Sydney, Australia, 1998.
[ bib 
.ps 
.pdf ]
This paper outlines preliminary work on learning featurevalue grammars from plain text. Common suffixes are gleaned from a word suffix tree and used to form a first approximation of how regular inflection is marked. Words are generalised into lexical categories according to regularities in how these suffixes appear in trigram context. The categories are expressed as a lexical feature whose value is given by the most frequent suffix for similar trigrams. The trigrams are subsequently used to infer agreement dependencies which are captured through the creation of additional feature structures. Agreement and linear precedence are preserved through the iterative creation of unification rules for pairs of terms.


[104]

Boosting trees for costsensitive classification
K.M. Ting and Z. Zheng.
Boosting trees for costsensitive classification.
In Proc European Conference on Machine Learning, volume 1398 of
LNAI, pages 190195, Berlin, 1998.
[ bib 
.ps 
.pdf ]
This paper explores two boosting techniques for costsensitive tree classifications in the situation where misclassification costs change very often. Ideally, one would like to have only one induction, and use the induced model for different misclassification costs. Thus, it demands robustness of the induced model against cost changes. Combining multiple trees gives robust predictions against this change. We demonstrate that the two boosting techniques are a good solution in different aspects under this situation.


[105]

An entropy gain measure of numeric prediction performance
L. Trigg.
An entropy gain measure of numeric prediction performance.
Technical Report 98/11, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, May 1998.
[ bib 
.ps 
.pdf ]
Categorical classifier performance is typically evaluated with respect to error rate, expressed as a percentage of test instances that were not correctly classified. When a classifier produces multiple classifications for a test instance, the prediction is counted as incorrect (even if the correct class was one of the predictions). Although commonly used in the literature, error rate is a coarse measure of classifier performance, as it is based only on a single prediction offered for a test instance. Since many classifiers can produce a class distribution as a prediction, we should use this to provide a better measure of how much information the classifier is extracting from the domain.
Numeric classifiers are a relatively new development in machine learning, and as such there is no single performance measure that has become standard. Typically these machine learning schemes predict a single real number for each test instance, and the error between the predicted and actual value is used to calculate a myriad of performance measures such as correlation coefficient, root mean squared error, mean absolute error, relative absolute error, and root relative squared error. With so many performance measures it is difficult to establish an overall performance evaluation.
The next section describes a performance measure for machine learning schemes that attempts to overcome the problems with current measures. In addition, the same evaluation measure is used for categorical and numeric classifier.


[106]

Improving Browsing in Digital Libraries with Keyphrase Indexes
Carl Gutwin, Gordon W. Paynter, Ian H. Witten, Craig G. NevillManning, and
Eibe Frank.
Improving browsing in digital libraries with keyphrase indexes.
Decision Support Systems, 27(1/2):81104, 1999.
[ bib 
.ps.gz ]


[107]

Developing innovative applications of machine learning
S.J. Cunningham and G. Holmes.
Developing innovative applications of machine learning.
In Proc Southeast Asia Regional Computer Confederation
Conference, Singapore, 1999.
[ bib 
.ps 
.pdf ]
The WEKA (Waikato Environment for Knowledge Analysis) system provides a comprehensive suite of facilities for applying data mining techniques to large data sets. This paper discusses a process model for analyzing data, and describes the support that WEKA provides for this model. The domain model 'learned' by the data mining algorithm can then be readily incorporated into a software application. The WEKAbased analysis and application construction process is illustrated through a case study in the agricultural domain�mushroom grading.


[108]

Market Basket Analysis of Library Circulation Data
Sally Jo Cunningham and Eibe Frank.
Market basket analysis of library circulation data.
In T. Gedeon, P. Wong, S. Halgamuge, N. Kasabov, D. Nauck, and
K. Fukushima, editors, Proc 6th International Conference on Neural
Information Processing, volume II of Perth, Australia, pages 825830.
IEEE Service Center, 1999.
[ bib 
.ps 
.pdf ]
Market Basket Analysis algorithms have recently seen widespread use in analyzing consumer purchasing patterns�specifically, in detecting products that are frequently purchased together. We apply the Apriori market basket analysis tool to the task of detecting subject classification categories that cooccur in transaction records of books borrowed from a university library. This information can be useful in directing users to additional portions of the collection that may contain documents relevant to their information needs, and in determining a library's physical layout. These results can also provide insight into the degree of scatter that the classification scheme induces in a particular collection of documents.


[109]

DomainSpecific Keyphrase Extraction
Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and Craig G.
NevillManning.
Domainspecific keyphrase extraction.
In Proc 16th International Joint Conference on Artificial
Intelligence, Stockholm, Sweden, pages 668673. Morgan Kaufmann, 1999.
[ bib 
.ps 
.pdf ]
Automatic keyphrase extraction is a promising area for applied machine learning because keyphrases are an important means for document summarization, clustering, and topic search. Only a small minority of documents have authorassigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore, it is highly desirable to automate the keyphrase extraction process.
This paper presents a simple procedure for keyphrase extraction based on the naive Bayes learning scheme, which is shown to perform comparably to the stateoftheart. It goes on to explain how the performance of this procedure can be boosted by automatically tailoring the extraction process to the particular document collection at hand. Results on a large collection of technical reports in computer science show that the quality of the extracted keyphrases improves significantly if domainspecific information is exploited.


[110]

Making Better Use of Global Discretization
Eibe Frank and Ian H. Witten.
Making better use of global discretization.
In Proc 16th International Conference on Machine Learning,
Bled, Slovenia, pages 115123. Morgan Kaufmann, 1999.
[ bib 
.ps 
.pdf ]
Before applying learning algorithms to datasets, practitioners often globally discretize any numeric attributes. If the algorithm cannot handle numeric attributes directly, prior discretization is essential. Even if it can, prior discretization often accelerates induction, and may produce simpler and more accurate classifiers.
As it is generally done, global discretization denies the learning algorithm any chance of taking advantage of the ordering information implicit in numeric attributes. However, a simple transformation of discretized data preserves this information in a form that learners can use. We show that, compared to using the discretized data directly, this transformation significantly increases the accuracy of decision trees built by C4.5, decision lists built by PART, and decision tables built using the wrapper method, on several benchmark datasets. Moreover, it can significantly reduce the size of the resulting classifiers.
This simple technique makes global discretization an even more useful tool for data preprocessing.


[111]

Correlationbased feature selection for machine learning
M.A. Hall.
Correlationbased feature selection for machine learning.
PhD thesis, University of Waikato, Department of Computer Science,
Hamilton, New Zealand, April 1999.
[ bib 
.ps 
.pdf ]
A central problem in machine learning is identifying a representative set of features from which to construct a classification model for a particular task. This thesis addresses the problems of feature selection for machine learning through a correlation based approach. The central hypothesis is that good feature sets contain features that are highly correlated with the class, yet uncorrelated with each other. A feature evaluation formula, based on ideas from test theory, provides an operational definition of this hypothesis. CFS (Correlation based Feature Selection) is an algorithm that couples this evaluation formula with an appropriate correlation measure and a heuristic search strategy.


[112]

Feature selection for discrete and numeric class machine learning
M.A. Hall.
Feature selection for discrete and numeric class machine learning.
Technical Report 99/4, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, April 1999.
[ bib 
.ps 
.pdf ]
Algorithms for feature selection fall into two broad categories: wrappersuse the learning algorithm itself to evaluate the usefulness of features, while filtersevaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems.
This paper describes a fast, correlationbased filter algorithm that can be applied to continuous and discrete problems. Experiments using the new method as a preprocessing step for naive Bayes, instancebased learning, decision trees, locally weighted regression, and model trees show it to be an effective feature selector it reduces the data in dimensionality by more than sixty percent in most cases without negatively affecting accuracy. Also, decision and model trees built from the preprocessed data are often significantly smaller.


[113]

Feature selection for machine learning: comparing a correlationbased filter approach to the wrapper
Mark Andrew Hall and Lloyd Smith.
Feature selection for machine learning: comparing a correlationbased
filter approach to the wrapper.
In A. N. Kumar and I. Russel, editors, Proc Florida Artificial
Intelligence Symposium, pages 235239, Orlando, Florida, 1999. AAAI Press.
[ bib 
.ps 
.pdf ]
Feature selection is often an essential data processing step prior to applying a learning algorithm. The removal of irrelevant and redundant information often improves the performance of machine learning algorithms. There are two common approaches: a wrapperuses the intended learning algorithm itself to evaluate the usefulness of features, while a filterevaluates features according to heuristics based on general characteristics of the data. The wrapper approach is generally considered to produce better feature subsets but runs much more slowly than a filter. This paper describes a new filter approach to feature selection that uses a correlation based heuristic to evaluate the worth of feature subsets. When applied as a data preprocessing step for two common machine learning algorithms, the new method compares favourably with the wrapper but requires much less computation.


[114]

A diagnostic tool for tree based supervised classification learning algorithms
G. Holmes and L. Trigg.
A diagnostic tool for tree based supervised classification learning
algorithms.
In Proc Sixth International Conference on Neural Information
Processing (ICONIP'99), volume II, pages 514519, Perth, Western Australia,
November 1999.
[ bib 
.ps 
.pdf ]
The process of developing applications of machine learning and data mining that employ supervised classification algorithms includes the important step of knowledge verification. Interpretable output is presented to a user so that they can verify that the knowledge contained in the output makes sense for the given application. As the development of an application is an iterative process it is quite likely that a user would wish to compare models constructed at various times or stages.
One crucial stage where comparison of models is important is when the accuracy of a model is being estimated, typically using some form of crossvalidation. This stage is used to establish an estimate of how well a model will perform on unseen data. This is vital information to present to a user, but it is also important to show the degree of variation between models obtained from the entire dataset and models obtained during crossvalidation. In this way it can be verified that the crossvalidation models are at least structurally aligned with the model garnered from the entire dataset.
This paper presents a diagnostic tool for the comparison of treebased supervised classification models. The method is adapted from work on approximate tree matching and applied to decision trees. The tool is described together with experimental results on standard datasets.


[115]

Generating Rule Sets from Model Trees
Geoffrey Holmes, Mark Hall, and Eibe Frank.
Generating rule sets from model trees.
In Proc 12th Australian Joint Conference on Artificial
Intelligence, Sydney, Australia, pages 112. Springer, 1999.
[ bib 
.ps 
.pdf ]
Model trees  decision trees with linear models at the leaf nodes  have recently emerged as an accurate method for numeric prediction that produces understandable models. However, it is known that decision lists  ordered sets of IfThen rules  have the potential to be more compact and therefore more understandable than their tree counterparts.
We present an algorithm for inducing simple, accurate decision lists from model trees. Model trees are built repeatedly and the best rule is selected at each iteration. This method produces rule sets that are as accurate but smaller than the model tree constructed from the entire dataset. Experimental results for various heuristics which attempt to find a compromise between rule accuracy and rule coverage are reported. We show that our method produces comparably accurate and smaller rule sets than the commercial stateoftheart rule learning system Cubist.


[116]

Fitting a mixture model to threemode threeway data with categorical and continuous variables
L.A. Hunt and K.E. Basford.
Fitting a mixture model to threemode threeway data with categorical
and continuous variables.
Journal of Classification, 16(2):283296, 1999.
[ bib ]
The mixture likelihood approach to clustering is most often used with twomode twoway data to cluster one of the modes (e.g., the entities) into homogeneous groups on the basis of the other mode (e.g., the attributes). In this case, the attributes can either be continuous or categorical. When the data set consists of a threemode threeway array (e.g., attributes measured on entities in different situations), an analogous procedure is needed to enable the clustering of the entities (i.e., one of the modes) on the basis of both of the other modes simultaneously (i.e., the attributes measured in different situations). In this paper, it is shown that the finite mixture approach to clustering can be extended to analyze threemode threeway data where some of the attributes care continuous and some are categorical. The methodology is illustrated by clustering the genotypes in a threeway soybean data set where various attributes were measured on genotypes grown in several environments.


[117]

Mixture model clustering using the MULTIMIX program
L. Hunt and M. Jorgensen.
Mixture model clustering using the multimix program.
Australian and New Zealand Journal of Statistics,
41(2):153171, 1999.
[ bib ]
Hunt (1996) implemented the finite mixture model approach to clustering in a program called MULTIMIX. The program is designed to cluster multivariate data that have categorical and continuous variables and that possibly contain missing values. This paper describes the approach taken to design MULTIMIX and how some of the statistical problems were dealt with. As an example, the program is used to cluster an large medical dataset.


[118]

Issues in stacked generalization
K.M. Ting and I.H. Witten.
Issues in stacked generalization.
Journal of Artificial Intelligence Research, 10:271289, May
1999.
[ bib 
.ps 
.pdf ]
Stacked generalization is a general method of using a highlevel model to combine lowerlevel models to achieve greater predictive accuracy. In this paper we address two crucial issues which have been considered to be a 'black art' in classification tasks ever since the introduction of stacked generalization in 1992 by Wolpert: the type of generalizer that is suitable to derive the higherlevel model, and the kind of attributes that should be used as its input. We find that best results are obtained when the higherlevel model combines the confidence (and not just the predictions) of the lowerlevel ones.
We demonstrate the effectiveness of stacked generalization for combining three different types of learning algorithms for classification tasks. We also compare the performance of stacked generalization with majority vote and published results of arcing and bagging.


[119]

Clustering with finite data from semiparametric mixture distributions
Y. Wang and I.H. Witten.
Clustering with finite data from semiparametric mixture
distributions.
In Proc Symposium on the Interface: Models, Predictions, and
Computing, Schaumburg, Illinois, 1999.
[ bib ]
Existing clustering methods for the semiparametric mixture distribution perform well as the volume of data increases. However, they all suffer from a serious drawback in finitedata situations: small outlying groups of data points can be completely ignored in the clusters that are produced, no matter how far away they lie from the major clusters. This can result in unbounded loss if the loss function is sensitive to the distance between clusters.
This paper proposes a new distancebased clustering method that overcomes the problem by avoiding global constraints. Experimental results illustrate its superiority to existing methods when small clusters are present in finite data sets; they also suggest that it is more accurate and stable than other methods even when there are no small clusters.


[120]

Pace regression
Y. Wang and I.H. Witten.
Pace regression.
Technical Report 99/12, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, September 1999.
[ bib 
.ps 
.pdf ]
This paper articulates a new method of linear regression, pace regression, that addresses many drawbacks of standard regression reported in the literatureparticularly the subset selection problem. Pace regression improves on classical ordinary least squares (OLS) regression by evaluating the effect of each variable and using a clustering analysis to improve the statistical basis for estimating their contribution to the overall regression. As well as outperforming OLS, it also outperformsin a remarkably general senseother linear modeling techniques in the literature, including subset selection procedures, which seek a reduction in dimensionality that falls out as a natural byproduct of pace regression. The paper defines six procedures that share the fundamental idea of pace regression, all of which are theoretically justified in terms of asymptotic performance. Experiments confirm the performance improvement over other techniques.


[121]

KEA: Practical Automatic Keyphrase Extraction
Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G.
NevillManning.
KEA: Practical automatic keyphrase extraction.
In Proc 4th ACM conference on Digital Libraries, Berkeley, CA,
pages 254255. ACM, August 1999.
[ bib 
http 
.ps 
.pdf ]
Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea's effectiveness in terms of how many authorassigned keyphrases are correctly identified. The system is simple, robust, and publicly available.


[122]

Weka: Practical Machine Learning Tools and Techniques with Java Implementations
Ian H. Witten, Eibe Frank, Len Trigg, Mark Hall, Geoffrey Holmes, and Sally Jo
Cunningham.
Weka: Practical machine learning tools and techniques with Java
implementations.
In Nikola Kasabov and Kitty Ko, editors, Proceedings of the
ICONIP/ANZIIS/ANNES'99 Workshop on Emerging Knowledge Engineering and
ConnectionistBased Information Systems, pages 192196, 1999.
Dunedin, New Zealand.
[ bib 
.ps 
.pdf ]
The Waikato Environment for Knowledge Analysis (Weka) is a comprehensive suite of Java class libraries that implement many stateoftheart machine learning and data mining algorithms. Weka is freely available on the WorldWide Web and accompanies a new text on data mining [1] which documents and fully explains all the algorithms it contains. Applications written using the Weka class libraries can be run on any computer with a Web browsing capability; this allows users to apply machine learning techniques to their own data regardless of computer platform.


[123]

Text Categorization Using Compression Models
Eibe Frank, Chang Chui, and Ian H. Witten.
Text categorization using compression models.
In Data Compression Conference, Snowbird, Utah, page 555. IEEE
Computer Society, 2000.
Note: abstract only. Full paper is available as
[132].
[ bib 
.ps.gz 
.pdf ]


[124]

BottomUp Propositionalization
Stefan Kramer and Eibe Frank.
Bottomup propositionalization.
In J. Cussens and A. Frisch, editors, Proc Workinprogress
reports of the 10th International Conference on Inductive Logic Programming,
pages 156162. CEURWS.org, July 2000.
[ bib 
.ps 
.pdf ]
In this paper, we present a new method for propositionalization that works in a bottomup, datadriven manner. It is tailored for biochemical databases, where the examples are 2D descriptions of chemical compounds. The method generates all frequent fragments (i.e., linearly connected atoms) up to a userspecified length. A preliminary experiment in the domain of carcinogenicity prediction showed that bottomup propositionalization is a promising approach to feature construction from relational data.


[125]

Pruning Decision Trees and Lists
Eibe Frank.
Pruning Decision Trees and Lists.
PhD thesis, Department of Computer Science, University of Waikato,
2000.
[ bib 
.ps.gz 
.pdf ]


[126]

MetaLearning by Landmarking Various Learning Algorithms
B. Pfahringer, H. Bensusan, and C. GiraudCarrier.
Metalearning by landmarking various learning algorithms.
In P. Langley, editor, Proceedings of the 17th International
Conference on Machine Learning (ICML2000), July 2000.
[ bib 
.ps 
.pdf ]
Landmarking is a novel approach to describing tasks in metalearning. Previous approaches to metalearning mostly considered only statisticsinspired measures of the data as a source for the definition of metaattributes. Contrary to such approaches, landmarking tries to determine the location of a specific learning problem in the space of all learning problems by directly measuring the performance of some simple and efficient learning algorithms themselves. In the experiments reported we show how such a use of landmark values can help to distinguish between areas of the learning space favouring different learners. Experiments, both with artificial and realworld databases, show that landmarking selects, with moderate but reasonable level of success, the best performing of a set of learning algorithms.


[127]

Learning to Use Operational Advice
J. Fuernkranz, B. Pfahringer, H. Kaindl, and S. Kramer.
Learning to use operational advice.
In W. Horn, editor, Proceedings of the 14th European Conference
on Artificial Intelligence (ECAI 2000), pages 291295, August 2000.
[ bib ]


[128]

A new approach to fitting linear models in high dimensional spaces
Yong Wang.
A new approach to fitting linear models in high dimensional
spaces.
PhD thesis, University of Waikato, Department of Computer Science,
Hamilton, New Zealand, 2000.
[ bib 
.ps 
.pdf ]
This thesis presents a new approach to fitting linear models, called 'pace regression', which also overcomes the dimensionality determination problem. Its optimality in minimizing the expected prediction loss is theoretically established, when the number of free parameters is infinitely large. In this sense, pace regression outperforms existing procedures for fitting linear models. Dimensionality determination, a special case of fitting linear models, turns out to be a natural byproduct. A range of simulation studies are conducted; the results support the theoretical analysis.
Through the thesis, a deeper understanding is gained of the problem of fitting linear models. Many key issues are discussed. Existing procedures, namely OLS, AIC, BIC, RIC, CIC, CV(d), BS(m), RIDGE, NNGAROTTE and LASSO, are reviewed and compared, both theoretically and empirically, with the new methods.
Estimating a mixing distribution is an indispensable part of pace regression. A measurebased minimum distance approach, including probability measures and nonnegative measures, is proposed, and strongly consistent estimators are produced. Of all minimum distance methods for estimating a mixing distribution, only the nonnegativemeasurebased one solves the minority cluster problem, what is vital for pace regression.
Pace regression has striking advantages over existing techniques for fitting linear models. It also has more general implications for empirical modeling, which are discussed in the thesis.


[129]

Experiences with a weighted decision tree learner
J. G. Cleary, L. E. Trigg, G. Holmes, and M. A. Hall.
Experiences with a weighted decision tree learner.
In Proc 20th SGES International Conference on Knowledge Based
Systems and Applied Artificial Intelligence, pages 3547. Springer, 2000.
[ bib ]


[130]

Comparison of consumer and producer perceptions of mushroom quality
A.F. Bollen, N.J. Kusabs, G. Holmes, and M.A. Hall.
Comparison of consumer and producer perceptions of mushroom quality.
In W.J. Florkowski, S.E. Prussia, and R.L. Shewfelt, editors,
Proc Integrated View of Fruit and Vegetable Quality International
Multidisciplinary Conference, pages 303311, Georgia, USA, 2000.
[ bib ]
The marketing of mushrooms in New Zealand is highly subjective. No detailed grading specifications exist and they are graded based on the experience of the growers (experts). The requirements of consumers are three or four steps removed in the value chain. The objective of this research has been to develop a quantitative set of descriptors which describe the quality grading criteria actually used by the graders, and to develop a similar set of criteria for the consumer. These two sets of descriptors were then compared to determine the difference in the two interpretations of quality.
Generally the consumers are classifying solely on visual attributes. There was disagreement between the consumer and the grower that suggested that the grower has adopted a quality profile which considerably exceeds that expected by the consumer group studied here.
The Machine Learning technique has been shown to produce a useful set of quality characterisation models for consumers. These have easily interpretable decision trees which are built on the objective attributes measured. The techniques help to provide an insight into the complex decisions made by consumers considering the purchase of mushrooms.


[131]

Technical Note: Naive Bayes for regression
E. Frank, L. Trigg, G. Holmes, and I.H. Witten.
Technical note: Naive bayes for regression.
Machine Learning, 41(1):526, 2000.
[ bib 
.ps 
.pdf ]
Despite its simplicity, the naive Bayes learning scheme performs well on most classification tasks, and is often significantly more accurate than more sophisticated methods. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class. This suggests that its good performance might be restricted to situations where the output is categorical. It is therefore interesting to see how it performs in domains where the predicted value is numeric, because in this case, predictions are more sensitive to inaccurate probability estimates.
This paper shows how to apply the naive Bayes methodology to numeric prediction (i.e. regression) tasks, and compares it to linear regression, instancebased learning, and a method that produces model treesdecision trees with linear regression functions at the leaves. Although we exhibit an artificial dataset for which naive Bayes is the method of choice, on realworld datasets it is almost uniformly worse than model trees. The comparison with linear regression depends on the error measure: for one measure naive Bayes performs similarly, for another it is worse. Compared to instancebased learning, it performs similarly with respect to both measures. These results indicate that the simplistic statistical assumption that naive Bayes makes is indeed more restrictive for regression than for classification.


[132]

Text Categorization Using Compression Models
Chang Chui Eibe Frank and Ian H. Witten.
Text categorization using compression models.
Technical Report 00/02, Department of Computer Science, University of
Waikato, January 2000.
[ bib 
.ps 
.pdf ]
Text categorization, or the assignment of natural language texts to predefined categories based on their content, is of growing importance as the volume of information available on the internet continues to overwhelm us. The use of predefined categories implies a supervised learning approach to categorization, where alreadyclassified articles  which effectively define the categories  are used as training data to build a model that can be used for classifying new articles that comprise the test data. This contrasts with unsupervised learning, where there is no training data and clusters of like documents are sought amongst the test articles. With supervised learning, meaningful labels (such as keyphrases) are attached to the training documents, and appropriate labels can be assigned automatically to test documents depending on which category they fall into.


[133]

Benchmarking attribute selection techniques for data mining
M.A. Hall.
Benchmarking attribute selection techniques for data mining.
Technical Report 00/10, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, July 2000.
[ bib 
.ps 
.pdf ]
Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building process phase can result in poor predictive performance and increased computation.
Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutations and has led to a situation where very few benchmark studies have been conducted.
This paper presents a benchmark comparison of several attribute selection methods. All the methods produce an attribute ranking, a useful devise of isolating the individual merit of an attribute. Attribute selection is achieved by crossvalidating the rankings with respect to a learning scheme to find the best attributes. Results are reported for a selection of standard data sets and two learning schemes C4.5 and naive Bayes.


[134]

Correlationbased feature selection for discrete and numeric class machine learning
Mark Andrew Hall.
Correlationbased feature selection for discrete and numeric class
machine learning.
In Proc 17th International Conference on Machine Learning,
pages 359366. Morgan Kaufmann, June 2000.
[ bib 
.ps 
.pdf ]
Algorithms for feature selection fall into two broad categories: wrappersthat use the learning algorithm itself to evaluate the usefulness of features and filtersthat evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems. This paper describes a fast, correlationbased filter algorithm that can be applied to continuous and discrete problems. The algorithm often outperforms the wellknown ReliefF attribute estimator when used as a preprocessing step for naive Bayes, instancebased learning, decision trees, locally weighted regression, and model trees. It performs more feature selection than ReliefF doesreducing the data dimensionality by fifty percent in most cases. Also, decision and model trees built from the prepocessed data are often significantly smaller.


[135]

Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
Ian H. Witten and Eibe Frank.
Data Mining: Practical Machine Learning Tools and Techniques
with Java Implementations.
Morgan Kaufmann, San Francisco, 2000.
[ bib 
.html ]
This book complements the Weka software. It shows how to use Weka's Java algorithms to discern meaningful patterns in your data, how to adapt them for your specialized data mining applications, and how to develop your own machine learning schemes. It offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in realworld data mining situations. Inside, you'll learn all you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. If you're involved at any level in the work of extracting usable knowledge from large collections of data, this book will be a valuable resource.


[136]

Fitting a mixture model to threemode threeway data with missing information
Lynette A Hunt and Kaye E Basford.
Fitting a mixture model to threemode threeway data with missing
information.
Journal of Classification, 18(2):209226, 2001.
[ bib 
http ]
When the data consist of certain attributes measured on the same set of items in different situations, they would be described as a threemode threeway array. A mixture likelihood approach can be implemented to cluster the items (i.e., one of the modes) on the basis of both of the other modes simultaneously (i.e,, the attributes measured in different situations). In this paper, it is shown that this approach can be extended to handle threemode threeway arrays where some of the data values are missing at random in the sense of Little and Rubin (1987). The methodology is illustrated by clustering the genotypes in a threeway soybean data set where various attributes were measured on genotypes grown in several environments.


[137]

Tag Insertion Complexity
Stuart Yeates, Ian H. Witten, and David Bainbridge.
Tag insertion complexity.
In Data Compression Conference, pages 243252. IEEE Computer
Society, 2001.
[ bib ]


[138]

Applications of machine learning in information retrieval
Sally Jo Cunningham, James Littin, and Ian H. Witten.
Applications of machine learning in information retrieval.
In M. E. Williams, editor, Annual Review of Information Science
and Technology, pages 341419. American Society for Information Science and
Technology, 2001.
[ bib ]


[139]

Determining Progression in Glaucoma Using Visual Fields
Andrew Turpin, Eibe Frank, Mark Hall, Ian H. Witten, and Chris A. Johnson.
Determining progression in glaucoma using visual fields.
In Proc 5th PacificAsia Conference on Knowledge Discovery and
Data Mining, Hong Kong, China, pages 136147. Springer, 2001.
[ bib 
.ps 
.pdf ]
The standardized visual field assessment, which measures visual function in 76 locations of the central visual area, is an important diagnostic tool in the treatment of the eye disease glaucoma. It helps determine whether the disease is stable or progressing towards blindness, with important implications for treatment. Automatic techniques to classify patients based on this assessment have had limited success, primarily due to the high variability of individual visual field measurements. The purpose of this paper is to describe the problem of visual field classification to the data mining community, and assess the success of datamining techniques on it. Preliminary results show that machine learning methods rival existing techniques for predicting whether glaucoma is progressing  though we have not yet been able to demonstrate improvements that are statistically significant. It is likely that further improvement is possible, and we encourage others to work on this important practical data mining problem.


[140]

Optimizing the Induction of Alternating Decision Trees
Bernhard Pfahringer, Geoffrey Holmes, and Richard Kirkby.
Optimizing the induction of alternating decision trees.
In D. Cheung, G.J. Williams, and Q. Li, editors, Proc 5th
PacificAsia Conference on Knowledge Discovery and Data Mining, pages
477487. Springer, April 2001.
[ bib 
.ps 
.pdf ]
The alternating decision tree brings comprehensibility to the performance enhancing capabilities of boosting. A single interpretable tree is induced wherein knowledge is distributed across the nodes and multiple paths are traversed to form predictions. The complexity of the algorithm is quadratic in the number of boosting iterations and this makes it unsuitable for larger knowledge discovery in database tasks. In this paper we explore various heuristic methods for reducing this complexity while maintaining the performance characteristics of the original algorithm. In experiments using standard, artificial and knowledge discovery datasets we show that a range of heuristic methods with log linear complexity are capable of achieving similar performance to the original method. Of these methods, the random walk heuristic is seen to outperform all others as the number of boosting iterations increases. The average case complexity of this method is linear.


[141]

Prediction of Ordinal Classes Using Regression Trees
Stefan Kramer, Gerhard Widmer, Bernhard Pfahringer, and Michael de Groeve.
Prediction of ordinal classes using regression trees.
Fundam. Inform., 47(12):113, 2001.
[ bib 
.ps 
.pdf ]
This paper is devoted to the problem of learning to predict ordinal (i.e., ordered discrete) classes using classification and regression trees. We start with SCART, a tree induction algorithm, and study various ways of transforming it into a learner for ordinal classification tasks. These algorithm variants are compared on a number of benchmark data sets to verify the relative strengths and weaknesses of the strategies and to study the tradeoff between optimal categorical classification accuracy (hit rate) and minimum distancebased error. Preliminary results indicate that this is a promising avenue towards algorithms that combine aspects of classification and regression.


[142]

A Simple Approach to Ordinal Classification
Eibe Frank and Mark Hall.
A simple approach to ordinal classification.
Technical Report 01/05, Department of Computer Science, University of
Waikato, 2001.
[ bib 
.ps 
.pdf ]
This is an updated version of a paper with the same title that appeared at the European Conference on Machine Learning 2001, Freiburg, Germany. SpringerVerlag, pp. 145165.
Machine learning methods for classification problems commonly assume that the class values are unordered. However, in many practical applications the class values do exhibit a natural order  for example, when learning how to grade. The standard approach to ordinal classification converts the class value into a numeric quantity and applies a regression learner to the transformed data, translating the output back into a discrete class value in a postprocessing step. A disadvantage of this method is that it can only be applied in conjunction with a regression scheme. In this paper we present a simple method that enables standard classification algorithms to make use of ordering information in class attributes. By applying it in conjunction with a decision tree learner we show that it outperforms the naive approach, which treats the class values as an unordered set. Compared to specialpurpose algorithms for ordinal classification our method has the advantage that it can be applied without any modification to the underlying learning scheme.


[143]

A Simple Approach to Ordinal Classification
Eibe Frank and Mark Hall.
A simple approach to ordinal classification.
In Proc 12th European Conference on Machine Learning, Freiburg,
Germany, pages 145156. Springer, 2001.
Note: there is a small bug in the description of the algorithm.
Please consult [142] instead.
[ bib ]


[144]

Interactive machine learning: letting users build classifiers
Malcolm Ware, Eibe Frank, Geoffrey Holmes, Mark Hall, and Ian H. Witten.
Interactive machine learning: letting users build classifiers.
Int. J. Hum.Comput. Stud., 55(3):281292, 2001.
[ bib 
.ps 
.pdf ]
According to standard procedure, building a classifier using machine learning is a fully automated process that follows the preparation of training data by a domain expert. In contrast, interactive machine learning engages users in actually generating the classifier themselves. This offers a natural way of integrating background knowledge into the modeling stage  so long as interactive tools can be designed that support efficient and effective communication. This paper shows that appropriate techniques can empower users to create models that compete with classifiers built by stateoftheart learning algorithms. It demonstrates that users  even users who are not domain experts  can often construct good classifiers, without any help from a learning algorithm, using a simple twodimensional visual interface. Experiments on real data demonstrate that, not surprisingly, success hinges on the domain: if a few attributes can support good predictions, users generate accurate classifiers, whereas domains with many highorder attribute interactions favor standard machine learning techniques. We also present an artificial example where domain knowledge allows an expert user to create a much more accurate model than automatic learning algorithms. These results indicate that our system has the potential to produce highly accurate classifiers in the hands of a domain expert who has a strong interest in the domain and therefore some insights into how to partition the data. Moreover, small expertdefined models offer the additional advantage that they will generally be more intelligible than those generated by automatic techniques.


[145]

(The Futility of) Trying to Predict Carcinogenicity of Chemical Compounds
B. Pfahringer.
(the futility of) trying to predict carcinogenicity of chemical
compounds.
In The Predictive Toxicology Challenge Workshop, Twelfth
European Conference on Machine Learning (ECML2001), Freiburg, Germany, 2001.
[ bib 
.ps 
.pdf ]
This paper describes my submission to one of the subproblems formulated for the Predictive Toxicology Challenge 2001. The challenge is to predict the carcinogenicity of chemicals based on structural information only. I have only tackled such predictions for bioessays involving male rats. As we currently do not know the true predictions for the testset, all we can say is that one of the models supplied by us seems to be optimal over some subrange of the ROC spectrum. The successful model uses a voting approach based on most of the sets of structural features made available by various other contestants as well as the organizers in an earlier phase of the Challenge. The WEKA Machine Learning workbench served as the core learning utility. Based on a preliminary examination of our submission we conclude that reliable prediction of carcinogenicity is still a far away goal.


[146]

Wrapping Boosters against Noise
Bernhard Pfahringer, Geoffrey Holmes, and Gabi Schmidberger.
Wrapping boosters against noise.
In Proc 14th Australian Joint Conference on Artificial
Intelligence, pages 402413. Springer, 2001.
[ bib 
.pdf 
.ps ]
Wrappers have recently been used to obtain parameter optimizations for learning algorithms. In this paper we investigate the use of a wrapper for estimating the correct number of boosting ensembles in the presence of class noise. Contrary to the naive approach that would be quadratic in the number of boosting iterations, the incremental algorithm described is linear.
Additionally, directly using the ksized ensembles generated during kfold crossvalidation search for prediction usually results in further improvements in classification performance. This improvement can be attributed to the reduction of variance due to averaging k ensembles instead of using only one ensemble. Consequently, crossvalidation in the way we use it here, termed wrapping, can be viewed as yet another ensemble learner similar in spirit to bagging but also somewhat related to stacking.


[147]

Investigation of association models to describe consumer purchase patterns
Mark Andrew Hall, N. J. Kusabs, D. Gillgren, and A. F. Bollen.
Investigation of association models to describe consumer purchase
patterns.
In Proc International Symposium on Applications of Modeling as
an Innovative Technology in the AgriFoodChain, pages 167173. ISHS, 2001.
[ bib ]


[148]

Food Process Modelling
G. Holmes and T. Smith.
Food Process Modelling, chapter Data mining.
Woodhead Publishing Ltd, Cambridge, UK, 2001.
[ bib ]


[149]

Accuracy bounds for ensembles under 0  1 loss
R.R. Bouckaert.
Accuracy bounds for ensembles under 0  1 loss.
Technical Report 04/02, University of Waikato, Computer Science
Department, Hamilton, New Zealand, 2002.
[ bib 
.ps 
.pdf ]
This paper is an attempt to increase the understanding in the behavior of ensembles for discrete variables in a quantitative way. A set of tight upper and lower bounds for the accuracy of an ensemble is presented for wide classes of ensemble algorithms, including bagging and boosting. The ensemble accuracy is expressed in terms of the accuracies of the members of the ensemble.
Since those bounds represent best and worst case behavior only, we study typical behavior as well, and discuss its properties. A parameterized bound is presented which describes ensemble behavior as a mixture of dependent base classifier and independent base classifier areas. Some empirical results are presented to support our conclusions.


[150]

Low level information extraction: a Bayesian network based approach
R.R. Bouckaert.
Low level information extraction: a bayesian network based approach.
In Workshop on Text Learning (TextML2002), 2002.
[ bib 
.ps 
.pdf ]
In this article, a contribution is made to information extraction and Bayesian network learning motivated by two practical information extraction tasks.
It is shown that some information extraction tasks can be approached as a classification problem where the text is split in tokens and each token is assigned a class. Hidden Markov models are a popular formalism for this task, however they do not deal with tokens having a set of attributes instead of a single one. A new algorithm for this task is presented using various Bayesian networks architectures that deals with multiple attributes per token. Experiments suggest most Bayesian networks architectures perform better than Naive Bayes in our problem domain.
Hopefully, this article helps in making Bayesian networks more accessible for the information extraction community.


[151]

Learning Structure from Sequences, with Applications in
a Digital Library
Ian H. Witten.
Learning structure from sequences, with applications in a digital
library.
In ALT, pages 4256, 2002.
[ bib ]


[152]

Optimising tabling structures for bottomup logic programming
R. J. Clayton, J. G. Cleary, B. Pfahringer, and B. M. Utting.
Optimising tabling structures for bottomup logic programming.
In Proc International Workshop on Logic Based Program
Development and Transformation LOPSTR'02, pages 5775, Madrid, Spain, 2002.
Fundacion General de la Universidad Politechica de Madrid, Madrid.
[ bib ]


[153]

Racing Committees for Large Datasets
Eibe Frank, Geoffrey Holmes, Richard Kirkby, and Mark Hall.
Racing committees for large datasets.
In S. Lange, K. Satoh, and C. H. Smith, editors, Proc 5th
International Conference on Discovery Science, volume 2534 of LNCS,
pages 153164. Springer, 2002.
Also published as Working Paper 03/02, Computer Science Department,
University of Waikato, Hamilton.
[ bib 
.ps 
.pdf ]
This paper proposes a method for generating classifiers from large datasets by building a committee of simple base classifiers using a standard boosting algorithm. It permits the processing of large datasets even if the underlying base learning algorithm cannot efficently do so. The basic idea is to split incoming data into chunks and build a committee based on classifiers built from these individual chunks. Our method extends earlier work by introducing a method for adaptively pruning the committee. This is essential when applying the algorithm in practice because it dramatically reduces the algorithm's running time and memory consumption. It also makes it possible to efficently race committees corresponding to different chunk sizes. This is important because our empirical results show that the accuracy of the resulting committee can vary significantly with the chunk size. They also show that pruning is indeed crucial to make the method practical for large datasets in terms of running time and memory requirements. Surprisingly, the results demonstrate that pruning can also improve accuracy.


[154]

An experimental speech to graphics system
A. C. Golightly and T. C. Smith.
An experimental speech to graphics system.
In S. Jones and M. Masoodian, editors, Proc SIGCHINZ Symposium
on ComputerHuman Interaction, pages 9196, Hamilton, New Zealand, 2002.
[ bib ]


[155]

Benchmarking attribute selection techniques for discrete class data mining
M. Hall and G. Holmes.
Benchmarking attribute selection techniques for discrete class data
mining.
Technical Report 02/02, The University of Waikato, Department of
Computer Science, Hamilton, New Zealand, 2002.
[ bib ]
Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building process phase can result in poor predictive performance and increased computation.
Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutation and has led to a situation where very few benchmark studies have been conducted.
This paper presents a benchmark comparison of several attribute selection methods for supervised classification. All the methods produce an attribute ranking, a useful devise for isolating the individual merit of an attribute. Attribute selection is achieved by crossvalidating the attribute rankings with respect to a classification learner to find the best attributes. Results are reported for a selection of standard data sets and two diverse learning schemes C4.5 and na�ve Bayes.


[156]

A development environment for predictive modelling in foods
G. Holmes and M. Hall.
A development environment for predictive modelling in foods.
International Journal of Food Microbiology, 73(23):351362,
2002.
[ bib ]


[157]

A logistic boosting approach to inducing multiclass alternating decision trees
G. Holmes, B. Pfahringer, E. T. Frank, R. B. Kirkby, and M. A. Hall.
A logistic boosting approach to inducing multiclass alternating
decision trees.
Technical Report 01/02, The University of Waikato, Department of
Computer Science, Hamilton, New Zealand, 2002.
[ bib ]
The alternating decision tree (ADTree) is a successful classification technique that combine decision trees with the predictive accuracy of boosting into a ser to interpretable classification rules. The original formulation of the tree induction algorithm restricted attention to binary classification problems. This paper empirically evaluates several methods for extending the algorithm to the multiclass case by splitting the problem into several twoclass LogitBoost procedure to induce alternating decision trees directly. Experimental results confirm that this procedure is comparable with methods that are based on the original ADTree formulation in accuracy, while inducing much smaller trees.


[158]

Multiclass Alternating Decision Trees
Geoffrey Holmes, Bernhard Pfahringer, Richard Kirkby, Eibe Frank, and Mark
Hall.
Multiclass alternating decision trees.
In T. Elomaa, H. Mannila, and H. Toivonen, editors, Proc 13th
European Conference on Machine Learning, volume 2430 of LNCS, pages
161172. Springer, 2002.
[ bib 
.ps 
.pdf ]
The alternating decision tree (ADTree) is a successful classification technique that combines decision trees with the predictive accuracy of boosting into a set of interpretable classification rules. The original formulation of the tree induction algorithm restricted attention to binary classification problems. This paper empirically evaluates several wrapper methods for extending the algorithm to the multiclass case by splitting the problem into several twoclass problems. Seeking a more natural solution we then adapt the multiclass LogitBoost and AdaBoost.MH procedures to induce alternating decision trees directly. Experimental results confirm that these procedures are comparable with wrapper methods that are based on the original ADTree formulation in accuracy, while inducing much smaller trees.


[159]

Fragment Generation and Support Vector Machines for Inducing SARs
Stefan Kramer, Eibe Frank, and Christoph Helma.
Fragment generation and support vector machines for inducing SARs.
SAR and QSAR in Environmental Research, 13(5):509523, 2002.
[ bib 
http ]
We present a new approach to the induction of SARs based on the generation of structural fragments and support vector machines (SVMs). It is tailored for biochemical databases, where the examples are twodimensional descriptions of chemical compounds. The fragment generator finds all fragments (i.e. linearly connected atoms) that satisfy userspecified constraints regarding their frequency and generality. In this paper, we are querying for fragments within a minimum and a maximum frequency in the dataset. After fragment generation, we propose to apply SVMs to the problem of inducing SARs from these fragments. We conjecture that the SVMs are particularly useful in this context, as they can deal with a large number of features. Experiments in the domains of carcinogenicity and mutagenicity prediction show that the minimum and the maximum frequency queries for fragments can be answered within a reasonable time, and that the predictive accuracy obtained using these fragments is satisfactory. However, further experiments will have to confirm that this is a viable approach to inducing SARs.


[160]

Modeling for optimal probability prediction
Y. Wang and I.H. Witten.
Modeling for optimal probability prediction.
In C. Sammut and A. Hoffmann, editors, Proceedings of the
Nineteenth International Conference on Machine Learning, San Francisco,
California, pages 650657. Morgan Kaufmann, 2002.
[ bib 
.ps 
.pdf ]
We present a general modeling method for optimal probability prediction over future observations, in which model dimensionality is determined as a natural byproduct. This new method yields several estimators, and we establish theoretically that they are optimal (either overall or under stated restrictions) when the number of free parameters is infinite. As a case study, we investigate the problem of fitting logistic models in finitesample situations. Simulation results on both artificial and practical datasets are supportive.


[161]

Mixture model clustering for mixed data with missing information
Lynette Hunt and Murray Jorgensen.
Mixture model clustering for mixed data with missing information.
Computational Statistics & Data Analysis, 41(3–4):429 
440, 2003.
[ bib 
http ]
One difficulty with classification studies is unobserved or missing observations that often occur in multivariate datasets. The mixture likelihood approach to clustering has been well developed and is much used, particularly for mixtures where the component distributions are multivariate normal. It is shown that this approach can be extended to analyse data with mixed categorical and continuous attributes and where some of the data are missing at random in the sense of Little and Rubin (Statistical Analysis with Mixing Data, Wiley, New York).


[162]

Token Identification Using HMM and PPM Models.
Yingying Wen, Ian H. Witten, and Dianhui Wang.
Token identification using hmm and ppm models.
In Proc 16th Australian Joint Conference on Artificial
Intelligence, pages 173185. Springer, 2003.
[ bib ]
Hidden markov models (HMMs) and prediction by partial matching models (PPM) have been successfully used in language pro cessing tasks including learningbased token identification. Most of the existing systems are domain and languagedependent. The power of re targetability and applicability of these systems is limited. This paper investigates the effect of the combination of HMMs and PPM on to ken identification. We implement a system that bridges the two well known methods through words new to the identification model. The sys tem is fully domain and languageindependent. No changes of code are necessary when applying to other domains or languages. The only re quired input of the system is an annotated corpus. The system has been tested on two corpora and achieved an overall Fmeasure of 69.02% for TCC, and 76.59% for BIB. Although the performance is not as good as that obtained from a system with languagedependent components, our proposed system has power to deal with large scope of domain and languageindependent problem. Identification of date has the best result, 73% and 92% of correct tokens are identified for two corpora respectively. The system also performs reasonably well on people’s name with correct tokens of 68% for TCC, and 76% for BIB.


[163]

A probabilistic line breaking algorithm
R.R. Bouckaert.
A probabilistic line breaking algorithm.
In Proc Sixteenth Australian Joint Conference on Artificial
Intelligence, 2003.
[ bib 
.pdf ]
We show how a probabilistic interpretation of an ill defined problem, the problem of finding line breaks in a paragraph, can lead to an efficient new algorithm that performs well. The graphical model that results from the probabilistic interpretation has the advantage that it is easy to tune due to the probabilistic approach. Furthermore, the algorithm optimizes the probability a break up is acceptable over the whole paragraph, it does not show threshold effects and it allows for easy incorporation of subtle typographical rules. Thanks to the architecture of the Bayesian network, the algorithm is linear in the number of characters in a paragraph. Empirical evidence suggests that this algorithm performs closer to results published through desk top publishing than a number of existing systems.


[164]

Choosing between two learning algorithms based on calibrated tests
R.R. Bouckaert.
Choosing between two learning algorithms based on calibrated tests.
In T. Fawcett and N. Mishra, editors, Proc Twentieth
International Conference on Machine Learning, pages 5158, Washington, USA,
2003. Morgan Kaufmann.
[ bib 
.pdf ]
Designing a hypothesis test to determine the best of two machine learning algorithms with only a small data set available is not a simple task. Many popular tests suffer from low power (5x2 cv), or high Type I error (Weka�s 10x10 cross validation). Furthermore, many tests show a low level of replicability, so that tests performed by different scientists with the same pair of algorithms, the same data sets and the same hypothesis test still may present different results. We show that 5x2 cv, resampling and 10 fold cv suffer from low replicability. The main complication is due to the need to use the data multiple times. As a consequence, independence assumptions for most hypothesis tests are violated. In this paper, we pose the case that reuse of the same data causes the effective degrees of freedom to be much lower than theoretically expected. We show how to calibrate the effective degrees of freedom empirically for various tests. Some tests are not calibratable, indicating another flaw in the design. However the ones that are calibratable all show very similar behavior. Moreover, the Type I error of those tests is on the mark for a wide range of circumstances, while they show a power and replicability that is a considerably higher than currently popular hypothesis tests.


[165]

Choosing learning algorithms using sign tests with high replicability
R.R. Bouckaert.
Choosing learning algorithms using sign tests with high
replicability.
In Proc Sixteenth Australian Joint Conference on Artificial
Intelligence, 2003.
[ bib 
.ps ]
An important task in machine learning is determining which learning algorithm works best for a given data set. When the amount of data is small the same data needs to be used repeatedly in order to get a reasonable estimate of the accuracy of the learning algorithms. This results in violations of assumptions on which standard tests are based and makes it hard to design a good test.
In this article, we investigate sign tests to address the problem of choosing the best of two learning algorithms when only a small data set is available. Sign tests are conceptually simple and no assumption about underlying distributions is required. We show that simplistic sample generation can lead to flawed test outcomes. Furthermore, we identify a test that performs well based on Type I error (showing a difference between algorithms when there is none), power (showing a difference when it indeed exists) and replicability.
Replicability is a novel measure of a quality of a test that gives an indication of how likely it is that the test outcome will be the same when the same test on the same data with the same sampling scheme and same pair of algorithms is executed, but with a difference randomization of the data. A new definition of replicability is provided and its benefits highlighted. A new definition of replicability is provided and its benefits highlighted. Empirical evidence is provided to show the test is robust under a varied range of circumstances.


[166]

Applying Propositional Learning Algorithms to Multiinstance data
Eibe Frank and Xin Xu.
Applying propositional learning algorithms to multiinstance data.
Technical Report 06/03, Department of Computer Science, University of
Waikato, 2003.
[ bib 
.ps.gz 
.pdf ]
Multiinstance learning is commonly tackled using specialpurpose algorithms. Development of these algorithms has started because early experiments with standard propositional learners have failed to produce satisfactory results on multiinstance datamore specifically, the Musk data. In this paper we present evidence that this is not necessarily the case. We introduce a simple wrapper for applying standard propositional learners to multiinstance problems and present empirical results for the Musk data that are competitive with genuine multiinstance algorithms. The key features of our new wrapper technique are: (1) it discards the standard multiinstance assumption that there is some inherent difference between positive and negative bags, and (2) it introduces weights to treat instances from different bags differently. We show that these two modifications are essential for producing good results on the Musk benchmark datasets.


[167]

Locally Weighted Naive Bayes
Eibe Frank, Mark Hall, and Bernhard Pfahringer.
Locally weighted naive Bayes.
In U. Kjaerulff and C. Meek, editors, Proc 19th Conference in
Uncertainty in Artificial Intelligence, Acapulco, Mexico, pages 249256.
Morgan Kaufmann, 2003.
Also published as Working Paper 04/03, Department of Computer
Science, The University of Waikato, Hamilton.
[ bib 
.ps.gz 
.pdf ]
Despite its simplicity, the naive Bayes classifier has surprised machine learning researchers by exhibiting good performance on a variety of learning problems. Encouraged by these results, researchers have looked to overcome naive Bayes' primary weaknessattribute independenceand improve the performance of the algorithm. This paper presents a locally weighted version of naive Bayes that relaxes the independence assumption by learning local models at prediction time. Experimental results show that locally weighted naive Bayes rarely degrades accuracy compared to standard naive Bayes and, in many cases, improves accuracy dramatically. The main advantage of this method compared to other techniques for enhancing naive Bayes is its conceptual and computational simplicity.


[168]

Predicting Library of Congress Classifications from Library of Congress Subject Headings
E. T. Frank and G. W. Paynter.
Predicting library of congress classifications from library of
congress subject headings.
Technical Report 01/03, The University of Waikato, Department of
Computer Science, Hamilton, New Zealand, 2003.
[ bib 
.ps 
.pdf ]
This paper addresses the problem of automatically assigning a Library of Congress Classification (LCC) to work given its set of Library of Congress Subject Headings (LCSH). LCC are organized in a tree: the root node of this hierarchy comprises all possible topics, and leaf nodes correspond to the most specialized topic areas defined. We describe a procedure that, given a resource identified by its LCSH, automatically places that resource in the LCC hierarchy. The procedure uses machine learning techniques and training data from a large library catalog to learn a classification model mapping from sets of LCSH to nodes in the LCC tree. We present empirical results for our technique showing its accuracy on an independent collection of 50,000 LCSH/LCC pairs.


[169]

Visualizing Class Probability Estimators
Eibe Frank and Mark Hall.
Visualizing class probability estimators.
In N. Lavrac and et al., editors, Proc 7th European Conference
on Principles and Practice of Knowledge Discovery in Databases, volume 2838
of LNCS, pages 168179. Springer, 2003.
Also published as Working Paper 02/03, Department of Computer
Science, The University of Waikato, Hamilton.
[ bib 
.ps.gz 
.pdf ]
Inducing classifiers that make accurate predictions on future data is a driving force for research in inductive learning. However, also of importance to the users is how to gain information from the models produced. Unfortunately, some of the most powerful inductive learning algorithms generate black boxes�that is, the representation of the model makes it virtually impossible to gain any insight into what has been learned. This paper presents a technique that can help the user understand why a classifier makes the predictions that it does by providing a twodimensional visualization of its class probability estimates. It requires the classifier to generate class probabilities but most practical algorithms are able to do so (or can be modified to this end).


[170]

Benchmarking Attribute Selection Techniques for Discrete Class Data Mining
Mark Andrew Hall and Geoffrey Holmes.
Benchmarking attribute selection techniques for discrete class data
mining.
IEEE Transactions on Knowledge and Data Engineering,
15(3):14371447, November/December 2003.
[ bib 
.pdf ]
Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building process phase can result in poor predictive performance and increased computation.
Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutations and has led to a situation where very few benchmark studies have been conducted.
This paper presents a benchmark comparison of several attribute selection methods for supervised classification. All the methods produce an attribute ranking, a useful devise for isolating the individual merit of an attribute. Attribute selection is achieved by crossvalidating the attribute rankings with respect to a classification learner to find the best attributes. Results are reported for a selection of standard data sets and two diverse learning schemes C4.5 and naive Bayes.


[171]

Mining data streams using option trees
G. Holmes, B. Pfahringer, and R. B. Kirkby.
Mining data streams using option trees.
Technical Report 08/03, The University of Waikato, Department of
Computer Science, Hamilton, New Zealand, 2003.
[ bib ]
The data stream model for data mining places harsh restrictions on a learning algorithm. A model must be induced following the briefest interrogation of the data, must use only available memory and must update itself over time within these constraints. Additionally, the model must be able to be used for data mining at any point in time. This paper describes a data stream classification algorithm using an ensemble of option trees. The ensemble of trees is induced by boosting and iteratively combined into a single interpretable model. The algorithm is evaluated using benchmark datasets for accuracy against stateoftheart algorithms that make use of the entire dataset.


[172]

Logistic Model Trees
Niels Landwehr, Mark Hall, and Eibe Frank.
Logistic model trees.
In N. Lavrac and et al., editors, Proc 14th European Conference
on Machine Learning, volume 2837 of LNCS, pages 241252,
CavtatDubrovnik, Croatia, 2003. Springer.
[ bib 
.ps 
.pdf ]
Tree induction methods and linear models are popular techniques for supervised learning tasks, both for the prediction of nominal classes and continuous numeric values. For predicting numeric quantities, there has been work on combining these two schemes into �model trees�, i.e. trees that contain linear regression functions at the leaves. In this paper, we present an algorithm that adapts this idea for classification problems, using logistic regression instead of linear regression. We use a stagewise fitting process to construct the logistic regression models that can select relevant attributes in the data in a natural way, and show how this approach can be used to build the logistic regression models at the leaves by incrementally refining those constructed at higher levels in the tree. We compare the performance of our algorithm against that of decision trees and logistic regression on 32 benchmark UCI datasets, and show that it achieves a higher classification accuracy on average than the other two methods.


[173]

The ability for fourier transform infrared spectroscopy to classify persimmon genotypes by epicuticular leaf waxes
A.D. Mowat and G. Holmes.
The ability for fourier transform infrared spectroscopy to classify
persimmon genotypes by epicuticular leaf waxes.
In Proc Proc Second International Persimmon Symposium, ISHS Acta
Horticulturae 601, pages 6569, Queensland, Australia, 2003.
[ bib ]
The ability of Fourier Transform Infrared (FTIR) spectroscopic measurements of epicuticular leaf waxes to classify five nonastringent persimmon cultivars ( Diospyros kaki L. cv. Fuyu, cv. Matsumoto wase Fuyu, cv. Hanna Fuyu, cv. Oku Gosho and cv. IIIQ12) was investigated. Principal component analysis was used to reduce the FTIR spectral data to 20 variables. Machine learning models classified the cultivars into five groups. Three of the cultivars, Oku Gosho, Hanna Fuyu and IIIQ12 were readily distinguished from each other and from Fuyu and Matsumoto wase Fuyu. However, the leaf wax data was unable to distinguished between two closely related persimmon genotypes, Fuyu and the early maturing L1 mutation of Fuyu, Matsumoto wase Fuyu. Here, epicuticular wax biosynthesis appears unaffected by the LI mutation found in Matsumoto wase Fuyu. When the classification was based on four groups, where Fuyu and Matsumoto wase Fuyu samples were combined, the best performing machine learning model correctly classified 98.7% of the samples.


[174]

Propositionalization through Stochastic Discrimination
B. Pfahringer and G. Holmes.
Propositionalization through stochastic discrimination.
In Proceedings of the WorkinProgress Track at the 13th
International Conference on Inductive Logic Programming, pages 6068.
Department of Informatics, University of Szeged, Hungary, September 2003.
[ bib 
.ps 
.pdf ]
A simple algorithm based on the theory of stochastic discrimination is developed for the fast extraction of subgraphs with potential discriminative power from a given set of preclassified graphs. A preliminary experimental evaluation indicates the potential of the approach. Limitations are discussed as well as directions for future research.


[175]

Text Categorisation Using Document Profiling
Maximilien Sauban and Bernhard Pfahringer.
Text categorisation using document profiling.
In N Lavrac, editor, Proc 7th European Conference on Principles
and Practice of Knowledge Discovery in Databases, volume 2838 of LNCS,
pages 411422. Springer, 2003.
[ bib 
.ps 
.pdf ]
This paper presents an extension of prior work by Michael D. Lee on psychologically plausible text categorisation. Our approach utilises Lee�s model as a preprocessing filter to generate a dense representation for a given text document (a document profile) and passes that on to an arbitrary standard propositional learning algorithm. Similarly to standard feature selection for text classification, the dimensionality of instances is drastically reduced this way, which in turn greatly lowers the computational load for the subsequent learning algorithm. The filter itself is very fast as well, as it basically is just an interesting variant of Naive Bayes. We present different variations of the filter and conduct an evaluation against the Reuters21578 collection that shows performance comparable to previously published results on that collection, but at a lower computational cost.


[176]

All of MEDLINE indexed to the Gene Ontology
A. C. Smith and J. G. Cleary.
All of medline indexed to the gene ontology.
In Proc Sixth Annual BioOntologies Meeting, pages 711,
Brisbane, Australia, 2003.
[ bib ]
Most of what is known about genes and proteins is contained in the biomedical literature. The problem for the biologist is how to connect novel sequence data to relevant published documents. One way is to BLAST the sequence and then follow the literature links established in genomic/proteomic databases for known sequences with similar structure. Another way is to find the closelymatching genes or proteins in the Gene Ontology, then retrieve documents associated with GO terms. The advantage of this approach is that it provides a conceptual context for discovering possible genetic roles or molecular functions for a new sequence. The problems with both search strategies, however, is that they return only a small portion of the available literature. We are solving this problem by amplifying the available documents associated with GO terms to cover the entirety of the MEDLINE corpus.


[177]

A TwoLevel Learning Method for Generalized Multiinstance
Problems
Nils Weidmann, Eibe Frank, and Bernhard Pfahringer.
A twolevel learning method for generalized multiinstance problems.
In N. Lavrac and et al., editors, Proc 14th European Conference
on Machine Learning, CavtatDubrovnik, Croatia, pages 468479. Springer,
2003.
[ bib 
.ps.gz 
.pdf ]
In traditional multiinstance (MI) learning, a single positive instance in a bag produces a positive class label. Hence, the learner knows how the bag�s class label depends on the labels of the instances in the bag and can explicitly use this information to solve the learning task. In this paper we investigate a generalized view of the MI problem where this simple assumption no longer holds. We assume that an �interaction� between instances in a bag determines the class label. Our twolevel learning method for this type of problem transforms an MI bag into a single metainstance that can be learned by a standard propositional method. The metainstance indicates which regions in the instance space are covered by instances of the bag. Results on both artificial and realworld data show that this twolevel classification approach is well suited for generalized MI problems.


[178]

Statistical learning in multiple instance problem
Xin Xu.
Statistical learning in multiple instance problem.
Master's thesis, University of Waikato, Hamilton, NZ, 2003.
0657.594.
[ bib 
http 
.ps.gz ]
Multiple instance (MI) learning is a relatively new topic in machine learning. It is concerned with supervised learning but differs from normal supervised learning in two points: (1) it has multiple instances in an example (and there is only one instance in an example in standard supervised learning), and (2) only one class label is observable for all the instances in an example (whereas each instance has its own class label in normal supervised learning). In MI learning there is a common assumption regarding the relationship between the class label of an example and the unobservable class labels of the instances inside it. This assumption, which is called the MI assumption in this thesis, states that An example is positive if at least one of its instances is positive and negative otherwise.
In this thesis, we first categorize current MI methods into a new framework. According to our analysis, there are two main categories of MI methods, instancebased and metadatabased approaches. Then we propose a new assumption for MI learning, called the collective assumption. Although this assumption has been used in some previous MI methods, it has never been explicitly stated, and this is the first time that it is formally specified. Using this new assumption we develop new algorithms  more specifically two instancebased and one metadatabased methods. All of these methods build probabilistic models and thus implement statistical learning algorithms. The exact generative models underlying these methods are explicitly stated and illustrated so that one may clearly understand the situations to which they can best be applied. The empirical results presented in this thesis show that they are competitive on standard benchmark datasets. Finally, we explore some practical applications of MI learning, both existing and new ones.
This thesis makes three contributions: a new framework for MI learning, new MI methods based on this framework and experimental results for new applications of MI learning.


[179]

Unsupervised learning from incomplete data using a mixture model approach
Lynette Hunt and Murray Jorgensen.
Unsupervised learning from incomplete data using a mixture model
approach.
In Statistical Data Mining and Knowledge Discovery, pages
173188. Chapman & Hall/CRC, 2004.
[ bib ]
Many unsupervised learning tasks involve high dimensional data sets where some of the attributes are continuous and some are categorical. One possible approach to clustering such data is to assume that the data to be clustered arise from a finite mixture of populations. The mixture likelihood approach has been well developed and much used, especially for mixtures where the component distributions are multivariate normal. Hunt and Jorgensen [17] presented methodology that enabled the clustering of mixed categorical and continuous data using a mixture approach. However many multivariate data sets encountered also contain unobserved or missing values. In this paper we demonstrate that the mixture approach to clustering can handle mixed categorical and continuous data where data are missing at random in the sense of Little and Rubin [30]. The methodology is illustrated by clustering two data sets.


[180]

Experiments in predicting biodegradability
H. Blockeel, S. Dzeroski, B. Kompare, S. Kramer, B. Pfahringer, and
W. Van Laer.
Experiments in predicting biodegradability.
Journal of Applied Artificial Intelligence, 18(2), Feburary
2004.
[ bib 
.ps 
.pdf ]
This paper is concerned with the use of AI techniques in ecology. More specifically, we present a novel application of inductive logic programming (ILP) in the area of quantitative structureactivity relationships (QSARs). The activity we want to predict is the biodegradability of chemical compounds in water. In particular, the target variable is the halflife for aerobic aqueous biodegradation. Structural descriptions of chemicals in terms of atoms and bonds are derived from the chemicals' SMILES encodings. Definition of substructures are used as background knowledge. Predicting biodegradability is essentially a regression problem, but we also consider a discretized version of the target variable. We thus employ a number of relational classification and regression methods on the relational representation and compare these to propositional methods applied to different propositionalizations of the problem. We also experiment with the prediction technique that consists of merging upper and lower bound predictions into one prediction. Some conclusions are drawn concerning the applicability of machine learning systems and the merging technique in this domain and concerning the evaluation of hypotheses.


[181]

Bayesian network classifiers in Weka
R. Bouckaert.
Bayesian network classifiers in weka.
Technical Report 14/2004, The University of Waikato, Department of
Computer Science, Hamilton, New Zealand, 2004.
[ bib 
.pdf ]
Various Bayesian network classifier learning algorithms are implemented in Weka. This note provides some user documentation and implementation details. Summary of the main capabilities:
(a) Structure learning of Bayesian networks using various hill climbing (K2, B, etc) and general purpose (simulated annealing, tabu search) algorithms.
(b) Local score metrics implemented; Bayes, BDe, MDL, entropy, AIC.
(c) Global score metrics implemented; leave one out cv, kfold cv and cumulative cv.
(d) Conditional independence based causal recovery algorithm available.
(e) Parameter estimation using direct estimates and Bayesian model averaging.
(f) GUI for easy inspection of Bayesian networks.
(g) Part of Weka allowing systematic experiments to compare Bayes net performance with general purpose classifiers like C4.5, nearest neighbor, support vector, etc.
(h) Source code available under GPL allows for integration in other systems and makes it easy to extend.


[182]

Estimating replicability of classifier learning
R. Bouckaert.
Estimating replicability of classifier learning.
In Proc International Conference on Machine Learning, 2004.
[ bib 
.ps 
.pdf ]
Replicability of machine learning experiments measures how likely it is that the outcome of one experiment is repeated when performed with a different randomization of the data. In this paper, we present an estimator or replicability of an experiment that is efficient. More precisely, the estimator is unbiased and has lowest variance in the class of estimators formed by a linear combination of outcomes of experiments on a given data set.
We gathered empirical data for comparing experiments consisting of different sampling schemes and hypothesis tests. Both factors are shown to have an impact on replicability of experiments. The data suggests that sign tests should not be used due to low replicability. Ranked sum tests show better performance, but the combination of a sorted runs sampling scheme with a ttest gives the most desirable performance judged on Type I and II error and replicability.


[183]

Evaluating the Replicability of Significance Tests for Comparing
Learning Algorithms
Remco R. Bouckaert and Eibe Frank.
Evaluating the replicability of significance tests for comparing
learning algorithms.
In Proc 8th PacificAsia Conference on Knowledge Discovery and
Data Mining, volume 3056 of LNAI, pages 312, Sydney, Australia,
2004. Springer.
[ bib 
.ps 
.pdf ]
Empirical research in learning algorithms for classification tasks generally requires the use of significance tests. The quality of a test is typically judged on Type I error (how often the test indicates a difference when it should not) and Type II error (how often it indicates no difference when it should). In this paper we argue that the replicability of a test is also of importance. We say that a test has low replicability if its outcome strongly depends on the particular random partitioning of the data that is used to perform it. We present empirical measures of replicability and use them to compare the performance of several popular tests in a realistic setting involving standard learning algorithms and benchmark datasets. Based on our results we give recommendations on which test to use.


[184]

Naive Bayes classifiers that perform well with continuous variables
R. Bouckaert.
Naive bayes classifiers that perform well with continuous variables.
In G.I. Webb and X. Yu, editors, Proc Seventeenth Australian
Joint Conference on Artificial Intelligence (AI 2004), Advances in Artificial
Intelligence, volume 3339 of LNAI, pages 10891094, Cairns,
Australia, December 2004. Springer.
[ bib ]
There are three main methods for handling continuous variables in naive Bayes classifiers, namely, the normal method (parametric approach), the kernel method (non parametric approach) and discretization. In this article, we perform a methodologically sound comparison of the three methods, which shows large mutual differences of each of the methods and no single method being universally better. This suggests that a method for selecting one of the three approaches to continuous variables could improve overall performance of the naive Bayes classifier. We present three methods that can be implemented efficiently vfold cross validation for the normal, kernel and discretization method. Empirical evidence suggests that selection using 10 fold cross validation (especially when repeated 10 times) can largely and significantly improve over all performance of naive Bayes classifiers and consistently outperform any of the three popular methods for dealing with continuous variables on their own. This is remarkable, since selection among more classifiers does not consistently result in better accuracy.


[185]

Data mining in bioinformatics using Weka
Eibe Frank, Mark Hall, Leonard E. Trigg, Geoffrey Holmes, and Ian H. Witten.
Data mining in bioinformatics using Weka.
Bioinformatics, 20(15):24792481, April 2004.
[ bib 
http 
.ps 
.pdf ]
The Weka machine learning workbench provides a generalpurpose environment for automatic classification, regression, clustering, and feature selectioncommon data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data preprocessing methods complemented by graphical user interfaces for data exploration and the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it.


[186]

Predicting Library of Congress classifications from Library
of Congress subject headings
Eibe Frank and Gordon W. Paynter.
Predicting library of congress classifications from library of
congress subject headings.
JASIST, 55(3):214227, 2004.
Also published as Working Paper 01/2003, Department of Computer
Science, The University of Waikato, Hamilton.
[ bib 
.ps 
.pdf ]
This paper addresses the problem of automatically assigning a Library of Congress Classification (LCC) to work given its set of Library of Congress Subject Headings (LCSH). LCC are organized in a tree: the root node of this hierarchy comprises all possible topics, and leaf nodes correspond to the most specialized topic areas defined. We describe a procedure that, given a resource identified by its LCSH, automatically places that resource in the LCC hierarchy. The procedure uses machine learning techniques and training data from a large library catalog to learn a classification model mapping from sets of LCSH to nodes in the LCC tree. We present empirical results for our technique showing its accuracy on an independent collection of 50,000 LCSH/LCC pairs.


[187]

Ensembles of nested dichotomies for multiclass problems
Eibe Frank and Stefan Kramer.
Ensembles of nested dichotomies for multiclass problems.
In R. Greiner and D. Schuurmans, editors, Proc 21st
International Conference on Machine Learning, pages 305312, Banff, Canada,
2004. ACM Press.
Also published as Working Paper 06/2004, Department of Computer
Science, The University of Waikato, Hamilton.
[ bib 
.ps.gz 
.pdf ]
Nested dichotomies are a standard statistical technique for tackling certain polytomous classification problems with logistic regression. They can be represented as binary trees that recursively split a multiclass classification task into a system of dichotomies and provide a statistically sound way of applying twoclass learning algorithms to multiclass problems (assuming these algorithms generate class probability estimates). However, there are usually many candidate trees for a given problem and in the standard approach the choice of a particular tree is based on domain knowledge that may not be available in practice. An alternative is to treat every system of nested dichotomies as equally likely and to form an ensemble classifier based on this assumption. We show that this approach produces more accurate classifications than applying C4.5 and logistic regression directly to multiclass problems. Our results also show that ensembles of nested dichotomies produce more accurate classifiers than pairwise classification if both techniques are used with C4.5, and comparable results for logistic regression. Compared to errorcorrecting output codes, they are preferable if logistic regression is used, and comparable in the case of C4.5. An additional benefit is that they generate class probability estimates. Consequently they appear to be a good generalpurpose method for applying binary classifiers to multiclass problems.


[188]

An Instrument Control System Using Predictive Modelling
Geoffrey Holmes and Dale Fletcher.
An instrument control system using predictive modelling.
In H. Araujo and et al., editors, Proceedings of the First
International Conference on Informatics in Control, Automation and Robotics,
pages 292295, Setubal, Portugal, 2004. INSTICC Press.
[ bib ]
We describe a system for providing early warning of possible error to an operator in control of an instrument providing results in batches from samples, for example, chemical elements found in soil or water samples. The system has the potential to be used with any form of instrument that provides multiple results for a given sample. The idea is to train models for each measurement, using historical data. The set of trained models are then capable of making predictions on new data based on the values of the other measurements. This approach has the potential to uncover previously unknown relationships between the measurements. An example application has been constructed that highlights the difference of the actual value for a measurement from its predicted value. The operator is provided with sliders to attenuate the sensitivity of the measurement perhaps based on its importance or its known sensitivity.


[189]

Mining data streams using option trees (revised edition, 2004)
G. Holmes, R. Kirkby, and B. Pfahringer.
Mining data streams using option trees (revised edition, 2004).
Technical Report 03/2004, The University of Waikato, Department of
Computer Science, Hamilton, New Zealand, 2004.
[ bib 
.pdf ]
The data stream model for data mining places harsh restrictions on a learning algorithm. A model must be induced following the briefest interrogation of the data, must use only available memory and must update itself over time within these constraints. Additionally, the model must be able to be used for data mining at any point in time. This paper describes a data stream classification algorithm using an ensemble of option trees. The ensemble of trees is induced by boosting and iteratively combined into a single interpretable model. The algorithm is evaluated using benchmark datasets for accuracy against stateoftheart algorithms that make use of the entire dataset.


[190]

Mining data streams using option trees
G. Holmes, R. Kirkby, and B. Pfahrinnger.
Mining data streams using option trees.
In J. AguilarRuiz and J. Gama, editors, Proc Workshop W10 on
Knowledge Discovery in Data Streams, Fifthteenth Conference on Machine
Learning, Eighth European Conference on Principles and Practice of Knowledge
Discovery in Databases (ECML/PKDD), pages 7584, Pisa, Italy, 2004.
[ bib ]
The data stream model for data mining places harsh restrictions on a learning algorithm. A model must be induced following the briefest interrogation of the data, must use only available memory and must update itself over time within these constraints. Additionally, the model must be able to be used for data mining at any point in time. This paper describes a data stream classification algorithm using an ensemble of option trees. This ensemble is iteratively combined into a single interpretable model. The algorithm is evaluated using benchmark datasets for accuracy against stateoftheart algorithms that make use of the entire dataset.


[191]

Multinomial Naive Bayes for Text Categorization Revisited
Ashraf M. Kibriya, Eibe Frank, Bernhard Pfahringer, and Geoffrey Holmes.
Multinomial naive Bayes for text categorization revisited.
In G. I. Webb and X. Yu, editors, Proc 17th Australian Joint
Conference on Artificial Intelligence, volume 3339 of LNAI, pages
488499, Cairns, Australia, 2004. Springer.
[ bib 
.ps.gz 
.pdf ]
This paper presents empirical results for several versions of the multinomial naive Bayes classifier on four text categorization problems, and a way of improving it using locally weighted learning. More specifically, it compares standard multinomial naive Bayes to the recently proposed transformed weightnormalized complement naive Bayes classifier (TWCNB) [1], and shows that some of the modifications included in TWCNB may not be necessary to achieve optimum performance on some datasets. However, it does show that TFIDF conversion and document length normalization are important. It also shows that support vector machines can, in fact, sometimes very significantly outperform both methods. Finally, it shows how the performance of multinomial naive Bayes can be improved using locally weighted learning. However, the overall conclusion of our paper is that support vector machines are still the method of choice if the aim is to maximize accuracy.


[192]

Clustering Large Datasets Using Cobweb and KMeans in Tandem
Mi Li, Geoffrey Holmes, and Bernhard Pfahringer.
Clustering large datasets using cobweb and kmeans in tandem.
In G. I. Webb and X. Yu, editors, Proc 17th Australian Joint
Conference on Artificial Intelligence, volume 3339 of LNAI, pages
368379, Cairns, Australia, 2004. Springer.
[ bib ]
This paper presents a single scan algorithm for clustering large datasets based on a two phase process which combines two well known clustering methods. The Cobweb algorithm is modified to produce a balanced tree with subclusters at the leaves, and then Kmeans is applied to the resulting subclusters. The resulting method, Scalable Cobweb, is then compared to a single pass Kmeans algorithm and standard Kmeans. The evaluation looks at error as measured by the sum of squared error and vulnerability to the order in which data points are processed.


[193]

Flow clustering using machine learning techniques
A. McGregor, M. Hall, P. Lorier, and J. Brunskill.
Flow clustering using machine learning techniques.
In C. Barakat and I. Pratt, editors, Proc Fifth International
Workshop on Passive and Active Network Measurement (PAM 2004), volume 3015
of LNCS, pages 205214, Antibes JuanlesPins, France, 2004. Springer.
[ bib ]
Packet header traces are widely used in network analysis. Header traces are the aggregate of traffic from many concurrent applications. We present a methodology, based on machine learning, that can break the trace down into clusters of traffic where each cluster has different traffic characteristics. Typical clusters include bulk transfer, single and multiple transactions and interactive traffic, amongst others. The paper includes a description of the methodology, a visualisation of the attribute statistics that aids in recognising cluster types and a discussion of the stability and effectiveness of the methodology.


[194]

Using Classification to Evaluate the Output of ConfidenceBased
Association Rule Mining
Stefan Mutter, Mark Hall, and Eibe Frank.
Using classification to evaluate the output of confidencebased
association rule mining.
In G. I. Webb and X. Yu, editors, Proc 17th Australian Joint
Conference on Artificial Intelligence, volume 3339 of LNAI, pages
538549, Cairns, Australia, 2004. Springer.
[ bib 
.ps.gz 
.pdf ]
Association rule mining is a data mining technique that reveals interesting relationships in a database. Existing approaches employ different parameters to search for interesting rules. This fact and the large number of rules make it difficult to compare the output of confidencebased association rule miners. This paper explores the use of classification performance as a metric for evaluating their output. Previous work on forming classifiers from association rules has focussed on accurate classification, whereas we concentrate on using the properties of the resulting classifiers as a basis for comparing confidencebased association rule learners. Therefore, we present experimental results on 12 UCI datasets showing that the quality of small rule sets generated by Apriori can be improved by using the predictive Apriori algorithm. We also show that CBA, the standard method for classification using association rules, is generally inferior to standard rule learners concerning both running time and size of rule sets.


[195]

Applying machine learning to programming by demonstration
Gordon W. Paynter and Ian H. Witten.
Applying machine learning to programming by demonstration.
J. Exp. Theor. Artif. Intell., 16(3):161188, 2004.
[ bib ]
'Familiar' is a tool that helps endusers automate iterative tasks in their applications by showing examples of what they want to do. It observes the user's actions, predicts what they will do next, and then offers to complete their task. Familiar learns in two ways. First, it creates a model, based on data gathered from training tasks, that selects the best prediction from among several candidates. Experiments show that decision trees outperform heuristic methods, and can be further improved by incrementally updating the classifier at task time. Second, it uses decision stumps inferred from analogous examples in the event trace to predict the parameters of conditional rules. Because data is sparsefor most users balk at giving more than a few training examplespermutation tests are used to calculate the statistical significance of each stump, successfully eliminating bias towards attributes with many different values.


[196]

A Toolbox for Learning from Relational Data with Propositional
and Multiinstance Learners
Peter Reutemann, Bernhard Pfahringer, and Eibe Frank.
A toolbox for learning from relational data with propositional and
multiinstance learners.
In G. I. Webb and X. Yu, editors, Proc 17th Australian Joint
Conference on Artificial Intelligence, volume 3339 of LNAI, pages
10171023, Cairns, Australia, 2004. Springer.
[ bib 
.ps.gz 
.pdf ]
Most databases employ the relational model for data storage. To use this data in a propositional learner, a propositionalization step has to take place. Similarly, the data has to be transformed to be amenable to a multiinstance learner. The Proper Toolbox contains an extended version of RELAGGS, the MultiInstance Learning Kit MILK, and can also combine the multiinstance data with aggregated data from RELAGGS. RELAGGS was extended to handle arbitrarily nested relations and to work with both primary keys and indices. For MILK the relational model is flattened into a single table and this data is fed into a multiinstance learner. REMILK finally combines the aggregated data produced by RELAGGS and the multiinstance data, flattened for MILK, into a single table that is once again the input for a multiinstance learner. Several wellknown datasets are used for experiments which highlight the strengths and weaknesses of the different approaches.


[197]

Computer Aided Software Engineering (CASE), empowered by Natural Language Processing (NLP) for object oriented analysis and design using Unified Modeling Language (UML)
T. C. Smith and R. T. Giganto.
Computer aided software engineering (case), empowered by natural
language processing (nlp) for object oriented analysis and design using
unified modeling language (uml).
In Proc Second Mindanao Conference on Information Technology
Education (MCITE'04), pages 1923, Davao City, Philippines, 2004.
[ bib ]
Computer Aided Software Engineering (CASE) tools are often regarded as the ultimate aide for software engineers in developing software. However, studies show that CASE tools are not enough to assist developers fully in their analysis tasks. Natural Language Processing (NLP) has been empowering these CASE tools to at least aid developers in the analysis phases. Still, NLPbased CASE tools have certain weaknesses when it comes to detecting classes or objects in the system. With the advent of object oriented analysis and design (OOAD), Unified Modeling Language (UML) has come to be the preferred medium for modeling systems. Existing NLPbased CASE tools produce object models in UML; however, unnecessary classes are often created in the process, while other object models are not produced at all. This paper introduces a research motivation aimed at designing a new NLPbased CASE system that minimizes generation of unnecessary classes and expresses object models in UML.


[198]

A textclassification approach to the prediction and characterization of signal peptides
T. C. Smith.
A textclassification approach to the prediction and characterization
of signal peptides.
In Proc X Convencion Internacional y Feria, Libro Reumen,
Informatica 2004, pages 361362, La Habana, Cuba, 2004.
[ bib ]
This paper describes an unconventional machine learning approach to predicting signal peptide cleavage points through a kind of pseudotext classification procedure. It is based on the view that the first amino residue in the mature protein can be characterized with a textual summary of some of its specific properties. By creating a textual description of every amino residuethat is, an account of such things as the chemical properties of the residue, its context in the aminoacid chain, its relative proximity to the amino terminus, and so fortha text classification algorithm can infer a characteristic model of those textual features that imply whether or not any one residue is or is not the start of the mature protein. More to the point, insofar as a biologist might give an account, in English, as to why some particular residue in the aminoacid chain actually marks the cleavage site, a ranked list of the textual attributes inferred by a text classification algorithm can do the same.


[199]

Data mining bread quality and process data in a plant bakery
A.J. Wilson, M.P. Morgenstern, B. Pfahringer, and C. Leschi.
Data mining bread quality and process data in a plant bakery.
In Proc Twelfth ICC Cereal and Bread Congress, Harrogate, UK,
May 2004.
[ bib 
.ps 
.pdf ]
In modern automated plant bakeries a large amount of data is collected on the operation of the plant. When this data is combined with product quality data such as loaf colour, appearance, consumer complaints sales data etc, it has the potential to be used to improve processing efficiency, final product quality, and product marketability. However the huge volume of this data means it is often ignored as being too hard to analyse in any meaningful way. Data mining, which is a combination of techniques that produces information from large data sets, has the potential to be applied to this data to extract useful information.
This paper describes our experience of applying data mining techniques to a plant bakery in New Zealand. The process involved setting up the systems required to extract data from the bakeries SCADA system, setting up sensors to automatically measure and record quality parameters, cleaning the data to remove faulted or anomalous results and then combining all the separate data blocks into one complete database for analysis.
Data were analysed at two levels. Firstly, selected data were analysed for simple trends on an individual loaf basis which served to identify variability caused by divider pockets, tin positioning etc. Secondly data mining techniques such as various classifiers and principal components were applied to the whole data set to find relationships between process data and product quality.


[200]

Text mining in a digital library
Ian H. Witten, Katherine J. Don, Michael Dewsnip, and Valentin Tablan.
Text mining in a digital library.
Int. J. on Digital Libraries, 4(1):5659, 2004.
[ bib ]
Digital librarians strive to add value to the collections they create and maintain. One way is through selectivity: a carefully chosen set of authoritative documents in a particular topic area is far more useful to those working in the area than a huge, unfocused collections (like the Web). Another is by augmenting the collection with highquality metedata, which supports activities of searching and browsing in a uniform and useful way. A third way, and our topic here, is to enrich the documents by examining their content, extracting information, and using it to enhance the ways they can be located and presented.


[201]

Adaptive text mining: inferring structure from sequences
Ian H. Witten.
Adaptive text mining: inferring structure from sequences.
J. Discrete Algorithms, 2(2):137159, 2004.
[ bib ]
Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although handcrafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tacked adaptively.


[202]

Logistic Regression and Boosting for Labeled Bags of Instances
Xin Xu and Eibe Frank.
Logistic regression and boosting for labeled bags of instances.
In H. Dai, R. Srikant, and C. Zhang, editors, Proc 8th
PacificAsia Conference on Knowledge Discovery and Data Mining, volume 3056
of LNAI, pages 272281, Sydney, Australia, 2004. Springer.
[ bib 
.ps 
.pdf ]
In this paper we upgrade linear logistic regression and boosting to multiinstance data, where each example consists of a labeled bag of instances. This is done by connecting predictions for individual instances to a baglevel probability estimate by simple averaging and maximizing the likelihood at the bag levelin other words, by assuming that all instances contribute equally and independently to a bag�s label. We present empirical results for artificial data generated according to the underlying generative model that we assume, and also show that the two algorithms produce competitive results on the Musk benchmark datasets.


[203]

A novel two stage scheme utilizing the test set for model selection in text classification
Bernhard Pfahringer, Peter Reutemann, and Mike Mayo.
A novel two stage scheme utilizing the test set for model selection
in text classification.
In Ranadhir Ghosh, Brijesh Verma, and Xue Li, editors, Proc
Workshop on Learning Algorithms for Pattern Recognition, Eighteenth
Australian Joint Conference on Artificial Intelligence (AI'05), pages
6065, Sydney, Australia, 2005. University of Technology.
59 December 2005.
[ bib 
.pdf ]
Text classification is a natural application domain for semi
supervised learning, as labeling documents is expensive, but
on the other hand usually an abundance of unlabeled
documents is available. We describe a novel simple two
stage scheme based on dagging which allows for utilizing
the test set in model selection. The dagging ensemble can
also be used by itself instead of the original classifier. We
evaluate the performance of a meta classifier choosing
between various base learners and their respective dagging
ensembles. The selection process seems to perform robustly
especially for small percentages of available labels for
training.


[204]

Tie Breaking in Hoeffding trees
Geoffrey Holmes, Richard Kirkby, and Bernhard Pfahringer.
Tie breaking in hoeffding trees.
In J. Gama and J. S. AguilarRuiz, editors, Proc Workshop W6:
Second International Workshop on Knowledge Discovery in Data Streams, pages
107116, 2005.
[ bib ]


[205]

Cache hierarchy inspired compression: a novel architecture for data streams
Geoffrey Holmes, Bernhard Pfahringer, and Richard Kirkby.
Cache hierarchy inspired compression: a novel architecture for data
streams.
In Narayanan Kulathuramaiyer, Alvin W. Yeo, Wang Yin Chai, and
Tan Chong Eng, editors, Proc Fourth International Conference on
Information Technology in Asia (CITA'05), pages 130136, 2005.
1215 December 2005.
[ bib 
.pdf ]
We present an architecture for data streams
based on structures typically found in web cache
hierarchies. The main idea is to build a meta level
analyser from a number of levels constructed over time
from a data stream. We present the general architecture
for such a system and an application to classification.
This architecture is an instance of the general wrapper
idea allowing us to reuse standard batch learning
algorithms in an inherently incremental learning
environment. By artificially generating data sources we
demonstrate that a hierarchy containing a mixture of
models is able to adapt over time to the source of the
data. In these experiments the hierarchies use an
elementary performance based replacement policy and
unweighted voting for making classification decisions.


[206]

Inductive Logic Programming, 15th International Conference,
ILP 2005, Bonn, Germany, August 1013, 2005, Proceedings
Stefan Kramer and Bernhard Pfahringer, editors.
Inductive Logic Programming, 15th International Conference, ILP
2005, Bonn, Germany, August 1013, 2005, Proceedings, volume 3625 of
Lecture Notes in Computer Science. Springer, 2005.
[ bib ]


[207]

Unsupervised Discretization Using TreeBased Density Estimation
Gabi Schmidberger and Eibe Frank.
Unsupervised discretization using treebased density estimation.
In Proc 9th European Conference on Principles and Practice of
Knowledge Discovery in Databases, Porto, Portugal, pages 240251. Springer,
2005.
[ bib 
.ps.gz 
.pdf ]
This paper presents an unsupervised discretization method
that performs density estimation for univariate data. The subintervals
that the discretization produces can be used as the bins of a histogram.
Histograms are a very simple and broadly understood means for display
ing data, and our method automatically adapts bin widths to the data.
It uses the loglikelihood as the scoring function to select cut points and
the crossvalidated loglikelihood to select the number of intervals. We
compare this method with equalwidth discretization where we also se
lect the number of bins using the crossvalidated loglikelihood and with
equalfrequency discretization.


[208]

Ensembles of Balanced Nested Dichotomies for Multiclass
Problems
Lin Dong, Eibe Frank, and Stefan Kramer.
Ensembles of balanced nested dichotomies for multiclass problems.
In Proc 9th European Conference on Principles and Practice of
Knowledge Discovery in Databases, Porto, Portugal, pages 8495. Springer,
2005.
[ bib 
.ps.gz 
.pdf ]
A system of nested dichotomies is a hierarchical
decomposition of a multiclass problem with c classes into c  1
twoclass problems and can be represented as a tree
structure. Ensembles of randomlygenerated nested dichotomies have
proven to be an effective approach to multiclass learning problems
[1]. However, sampling trees by giving each tree equal probability
means that the depth of a tree is limited only by the number of
classes, and very unbalanced trees can negatively affect runtime. In
this paper we investigate two approaches to building balanced nested
dichotomiesclassbalanced nested dichotomies and databalanced nested
dichotomiesand evaluate them in the same ensemble setting. Using
C4.5 decision trees as the base models, we show that both approaches
can reduce runtime with little or no effect on accuracy, especially
on problems with many classes. We also investigate the effect of
caching models when building ensembles of nested dichotomies.


[209]

Speeding Up Logistic Model Tree Induction
Marc Sumner, Eibe Frank, and Mark A. Hall.
Speeding up logistic model tree induction.
In Proc 9th European Conference on Principles and Practice of
Knowledge Discovery in Databases, Porto, Portugal, pages 675683. Springer,
2005.
[ bib 
.ps.gz 
.pdf ]
Logistic Model Trees have been shown to be very accurate
and compact classifiers [8]. Their greatest disadvantage is the computa
tional complexity of inducing the logistic regression models in the tree.
We address this issue by using the AIC criterion [1] instead of cross
validation to prevent overfitting these models. In addition, a weight trim
ming heuristic is used which produces a significant speedup. We compare
the training time and accuracy of the new induction process with the
original one on various datasets and show that the training time often
decreases while the classification accuracy diminishes only slightly.


[210]

StressTesting Hoeffding Trees
Geoffrey Holmes, Richard Kirkby, and Bernhard Pfahringer.
Stresstesting hoeffding trees.
In Proc 9th European Conference on Principles and Practice of
Knowledge Discovery in Databases, Porto, Portugal, pages 495502. Springer,
2005.
[ bib ]


[211]

Logistic Model Trees
Niels Landwehr, Mark Hall, and Eibe Frank.
Logistic model trees.
Machine Learning, 59(12):161205, 2005.
[ bib 
.ps.gz 
.pdf ]
Tree induction methods and linear models are popular techniques for supervised learning tasks, both for the prediction of nominal classes and numeric values. For predicting numeric quantities, there has been work on combining these two schemes into 'model trees', i.e. trees that contain linear regression functions at the leaves. In this paper, we present an algorithm that adapts this idea for classification problems, using logistic regression instead of linear regression. We use a stagewise fitting process to construct the logisitic regression models that can select relevant attributes in the data in a natural way, and show how this approach can be used to build the logistic regression models at the leaves by incrementally refining those constructed at higher levels in the tree. We compare the performance of our algorithm to several other stateoftheart learning schemes on 36 benchmark UCI datasets, and show that it produces accurate and compact classifiers.


[212]

Gene selection from microarray data for cancer classification
 a machine learning approach
Yu Wang, Igor V. Tetko, Mark A. Hall, Eibe Frank, Axel Facius, Klaus F. X.
Mayer, and HansWerner Mewes.
Gene selection from microarray data for cancer classification  a
machine learning approach.
Computational Biology and Chemistry, 29(1):3746, 2005.
[ bib 
http ]


[213]

Kea: Practical automatic keyphrase extraction
Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G.
NevillManning.
Kea: Practical automatic keyphrase extraction.
In Y.L. Theng and S. Foo, editors, Design and Usability of
Digital Libraries: Case Studies in the Asia Pacific, pages 129152.
Information Science Publishing, London, 2005.
[ bib 
.ps.gz 
.pdf ]
Keyphrases provide semantic metadata that summarize and characterize documents. This chapter describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machinelearning algorithm to predict which candidates are good keyphrases. The machinelearning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large text corpus to evaluate Kea's effectiveness in terms of how many authorassigned keyphrases are correctly identified. The system is simple, robust, and available under the GNU General Public License; the chapter gives instructions for use.


[214]

Data Mining: Practical Machine Learning Tools and Techniques
Ian H. Witten and Eibe Frank.
Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann, San Francisco, 2 edition, 2005.
[ bib 
.html ]


[215]

WEKA  A Machine Learning Workbench for Data Mining
Eibe Frank, Mark A. Hall, Geoffrey Holmes, Richard Kirkby, Bernhard Pfahringer,
Ian H. Witten, and Leonhard Trigg.
Weka  a machine learning workbench for data mining.
In Oded Maimon and Lior Rokach, editors, The Data Mining and
Knowledge Discovery Handbook, pages 13051314. Springer, 2005.
[ bib 
.ps.gz 
.pdf ]
The Weka workbench is an organized collection of stateoftheart machine learning algorithms and data preprocessing tools. The basic way
of interacting with these methods is by invoking them from the com
mand line. However, convenient interactive graphical user interfaces are
provided for data exploration, for setting up largescale experiments on
distributed computing platforms, and for designing configurations for
streamed data processing. These interfaces constitute an advanced en
vironment for experimental data mining. The system is written in Java
and distributed under the terms of the GNU General Public License.


[216]

Semantic RoleLabeling via Consensus PatternMatching
ChiSan Lin and Tony C. Smith.
Semantic rolelabeling via consensus patternmatching.
In Proceedings of the Ninth Conference on Computational Natural
Language Learning, pages 185188, Ann Arbor, Michigan, June 2005.
[ bib 
.pdf ]
This paper describes a system for semantic
role labeling for the CoNLL2005 Shared
task. We divide the task into two subtasks:
boundary recognition by a general tree
based predicateargument recognition algo
rithm to convert a parse tree into a flat rep
resentation of all predicates and their
related boundaries, and role labeling by a
consensus model using a patternmatching
framework to find suitable roles for core
constituents and adjuncts. We describe the
system architecture and report results for
the CoNLL2005 development dataset.


[217]

Racing for Conditional Independence Inference
Remco R. Bouckaert and Milan Studeny.
Racing for conditional independence inference.
In Symbolic and Quantitative Approaches to Reasoning with
Uncertainty, 8th European Conference, ECSQARU 2005, volume 3571 of
LNCS, pages 221232, 2005.
[ bib 
.pdf ]
In this article, we consider the computational aspects of deciding
whether a conditional independence statement t is implied by a list
of conditional independence statements L using the implication related
to the method of structural imsets.
We present two methods which have the interesting complementary properties
that one method performs well to prove that t is implied by L, while the
other performs well to prove that t is not implied by L. However, both
methods do not perform well the opposite. This gives rise to a parallel
algorithm in which both methods race against each other in order to
determine effectively whether t is or is not implied.
Some empirical evidence is provided that suggest this racing algorithms
method performs a lot better than an existing method based on socalled
skeletal characterization of the respective implication.
Furthermore, the method is able to handle more than five variables.


[218]

Low Replicability of Machine Learning Experiments is not a Small Data Set Phenomenon
R. R. Bouckaert.
Low replicability of machine learning experiments is not a small data
set phenomenon.
In Proceedings of the ICML'05 Workshop on MetaLearning, 2005.
[ bib 
.pdf ]
This paper investigates the relation between replicability of
experiments
for deciding which of two algorithms performs better on a given data set.
We prove that lack of replicability is not just a small data phenomenon
(as was shown before),
but is present in experiments on medium and large data sets as well.
We establish intuition in the relation between data set size, power and
replicability.
The main method for improving replicability is to increase the number of
samples. For large data sets and/or inefficient learning algorithms, this
implies that exerperiments may take a long time to completion.
We propose a procedure for deciding which of two learning algorithms is
best that has a high replicability but takes moderate computational effort.


[219]

Bayesian Sequence Learning for Predicting Protein Cleavage Points
Mike Mayo.
Bayesian sequence learning for predicting protein cleavage points.
In Advances in Knowledge Discovery and Data Mining: 9th
PacificAsia Conference, PAKDD 2005, page 192, Hanoi, Vietnam, 2005.
May 1820, 2005.
[ bib 
.pdf ]
A challenging problem in data mining is the application of efficient techniques to automatically annotate the vast databases of biological sequence data. This paper describes one such application in this area, to the prediction of the position of signal peptide cleavage points along protein sequences. It is shown that the method, based on Bayesian statistics, is comparable in terms of accuracy to the existing stateoftheart neural network techniques while providing explanatory information for its predictions.


[220]

Learning Petri Net Models of NonLinear Gene Interactions
Mike Mayo.
Learning petri net models of nonlinear gene interactions.
BioSystems, 85(1):7482, 2005.
[ bib 
.pdf ]
Understanding how an individual's genetic makeup influences their risk of disease is a problem of paramount importance. Although machine learning techniques are able to uncover the relationships between genotype and disease, the problem of automatically building the best biochemical model or explanation of the relationship has received less attention. In this paper, I describe a method based on random hill climbing that automatically builds Petri net models of nonlinear (or multifactorial) diseasecausing genegene interactions. Petri nets are a suitable formalism for this problem, because they are used to model concurrent, dynamic processes analogous to biochemical reaction networks. I show that this method is routinely able to identify perfect Petri net models for three diseasecausing genegene interactions recently reported in the literature.


[221]

Generalized Unified Decomposition of Ensemble Loss
Remco R. Bouckaert, Michael Goebel, and Patricia J. Riddle.
Generalized unified decomposition of ensemble loss.
In Proc 19th Australian Joint Conference on Artificial
Intelligence, pages 11331139, 2006.
[ bib 
.pdf ]
Goebel et al. [4] presented a unified
decomposition of ensemble loss for explaining ensemble
performance. They considered democratic voting schemes with uniform
weights, where the various base classifiers each can vote for a single
class once only. In this article, we generalize their decomposition to
cover weighted, probabilistic voting schemes and nonuniform
(progressive) voting schemes. Empirical results suggest that
democratic voting schemes can be outperformed by probabilistic and
progressive voting schemes. This makes the generalization worth
exploring and we show how to use the generalization to analyze
ensemble loss.


[222]

Efficient AUC Learning Curve Calculation
Remco R. Bouckaert.
Efficient auc learning curve calculation.
In Proc 19th Australian Joint Conference on Artificial
Intelligence, pages 181191, 2006.
[ bib 
.pdf ]
A learning curve of a performance measure provides a graphical method
with many benefits for judging classifier properties. The area under
the ROC curve (AUC) is a useful and increasingly popular performance
measure. In this paper, we consider the computational aspects of
calculating AUC learning curves. A new method is provided for
incrementally updating exact AUC curves and for calculating approximate
AUC curves for datasets with millions of instances. Both theoretical
and empirical justifications are given for the approximation. Variants
for incremental exact and approximate AUC curves are provided as well.


[223]

Voting Massive Collections of Bayesian Network Classifiers
for Data Streams
Remco R. Bouckaert.
Voting massive collections of bayesian network classifiers for data
streams.
In Proc 19th Australian Joint Conference on Artificial
Intelligence, pages 243252, 2006.
[ bib 
.pdf ]
We present a new method for voting exponential (in the
number of attributes) size sets of Bayesian classifiers in polynomial
time with polynomial memory requirements. Training is linear in the
number of instances in the dataset and can be performed
incrementally. This allows the collection to learn from massive data
streams. The method allows for flexibility in balancing computational
complexity, memory requirements and classification performance. Unlike
many other incremental Bayesian methods, all statistics kept in memory
are directly used in classification.
Experimental results show that the classifiers perform well on both
small and very large data sets, and that classification performance
can be weighed against computational and memory costs.


[224]

A treebased algorithm for predicateargument recognition
ChiSan Lin and Tony Smith.
A treebased algorithm for predicateargument recognition.
Association for Computing Machinery New Zealand Bulletin,
2(1):online journal, January 2006.
[ bib ]


[225]

Improving on Bagging with Input Smearing
Eibe Frank and Bernhard Pfahringer.
Improving on bagging with input smearing.
In Proc 10th PacificAsia Conference on Knowledge Discovery and
Data Mining, Singapore. Springer, 2006.
[ bib 
.ps.gz 
.pdf ]
Bagging is an ensemble learning method that has proved to
be a useful tool in the arsenal of machine learning
practitioners. Commonly applied in conjunction with decision tree
learners to build an ensemble of decision trees, it often leads to
reduced errors in the predictions when compared to using a single
tree. A single tree is built from a training set of size N . Bagging
is based on the idea that, ideally, we would like to eliminate the
variance due to a particular training set by combining trees built
from all training sets of size N . However, in practice, only one
training set is available, and bagging simulates this platonic method
by sampling with replacement from the original training data to form
new training sets. In this paper we pursue the idea of sampling from a
kernel density estimator of the underlying distribution to form new
training sets, in addition to sampling from the data itself. This can
be viewed as smearing out the resampled training data to generate
new datasets, and the amount of smear is controlled by a parameter.
We show that the resulting method, called input smearing, can lead
to improved results when compared to bagging. We present results for
both classification and regression problems.


[226]

Using Weighted Nearest Neighbor to Benefit from Unlabeled Data
Kurt Driessens, Peter Reutemann, Bernhard Pfahringer, and Claire Leschi.
Using weighted nearest neighbor to benefit from unlabeled data.
In Wee Keong Ng, Masaru Kitsuregawa, Jianzhong Li, and Kuiyu Chang,
editors, Advances in Knowledge Discovery and Data Mining, 10th
PacificAsia Conference, PAKDD 2006, volume 3918 of LNCS, pages
6069, 2006.
[ bib 
.pdf ]
The development of datamining applications such as textclassification and molecular profiling has shown the need for machine learning algorithms that can benefit from both labeled and unlabeled data, where often the unlabeled examples greatly outnumber the labeled examples. In this paper we present a twostage classifier that improves its predictive accuracy by making use of the available unlabeled data. It uses a weighted nearest neighbor classification algorithm using the combined examplesets as a knowledge base. The examples from the unlabeled set are prelabeled by an initial classifier that is build using the limited available training data. By choosing appropriate weights for this prelabeled data, the nearest neighbor classifier consistently improves on the original classifier.


[227]

A SemiSupervised Spam Mail Detector
Bernhard Pfahringer.
A semisupervised spam mail detector.
In Steffen Bickel, editor, Proceedings of the ECML/PKDD 2006
Discovery Challenge Workshop, pages 4853. Humboldt University Berlin,
2006.
[ bib 
.pdf ]
This document describes a novel semisupervised approach to spam classification, which was successful at the ECML/PKDD2006 spam classification challenge. A local learning method based on lazy projections was successfully combined with a variant of a standard semisupervised learning algorithm.


[228]

Naive Bayes for Text Classification with Unbalanced Classes
Eibe Frank and Remco R. Bouckaert.
Naive bayes for text classification with unbalanced classes.
In Proc 10th European Conference on Principles and Practice of
Knowledge Discovery in Databases, Berlin, Germany, pages 503510. Springer,
2006.
[ bib 
.pdf ]
Multinomial naive Bayes (MNB) is a popular method for document classification due to its computational efficiency and relatively good predictive performance. It has recently been established that predictive performance can be improved further by appropriate data transformations. In this paper we present another transformation that is designed to combat a potential problem with the application of MNB to unbalanced datasets. We propose an appropriate correction by adjusting attribute priors. This correction can be implemented as another data normalization step, and we show that it can significantly improve the area under the ROC curve. We also show that the modified version of MNB is very closely related to the simple centroidbased classifier and compare the two methods empirically.


[229]

Automatic Species Identification of Live Moths
Michael Mayo.
Automatic species identification of live moths.
In Ellis et. al, editor, Proc. of the 26th SGAI International
Conference on Innovative Techniques and Applications of Artificial
Intelligence, pages 195202, 2006.
[ bib 
.pdf ]
A collection consisting of the images of 774 live moth individuals, each moth belonging to one of 35 different UK species, was analysed to determine if data mining techniques could be used effectively for automatic species identification. Feature vectors were extracted from each of the moth images and the machine learning toolkit WEKA was used to classify the moths by species using the feature vectors. Whereas a previous analysis of this image dataset reported in the literature [1] required that each moth's least worn wing region be highlighted manually for each image, WEKA was able to achieve a greater level of accuracy (85%) using support vector machines without manual specification of a region of interest at all. This paper describes the features that were extracted from the images, and the various experiments using different classifiers and datasets that were performed. The results show that data mining can be usefully applied to the problem of automatic species identification of live specimens in the field.


[230]

Random Relational Rules
Grant Anderson and Bernhard Pfahringer.
Random relational rules.
In Stephen Muggleton and Ramon Otero, editors, Extended
Abstracts for 16th International Conference on Inductive Logic Programming,
2006.
[ bib 
.pdf ]
Exhaustive search in relational learning is generally
infeasible, therefore some form of heuristic search is usually
employed, such as in FOIL[1]. On the other hand, socalled stochastic
discrimination provides a framework for combining arbitrary numbers of
weak classifiers (in this case randomly generated relational rules) in
a way where accuracy improves with additional rules, even after maxi
mal accuracy on the training data has been reached.[2] The weak
classifiers must have a slightly higher probability of covering
instances of their target class than of other classes. As the rules
are also independent and identically distributed, the Central Limit
theorem applies and as the number of weak classifiers/rules grows,
coverages for different classes resemble wellseparated normal
distribu tions. Stochastic discrimination is closely related to other
ensemble methods like Bagging, Boosting, or Random forests, all of
which have been tried in relational learning[3, 4, 5].


[231]

A Comparison of Multiinstance Learning Algorithms
Lin Dong.
A comparison of multiinstance learning algorithms.
Master's thesis, Department of Computer Science, University of
Waikato, 2006.
[ bib 
http ]
Motivated by various challenging realworld applications,
such as drug activity prediction and image retrieval, multiinstance
(MI) learning has attracted considerable interest in recent
years. Compared with standard supervised learning, the MI learning
task is more difficult as the label information of each training
example is incomplete. Many MI algorithms have been proposed. Some of
them are specifically designed for MI problems whereas others have
been upgraded or adapted from standard singleinstance learning
algorithms. Most algorithms have been evaluated on only one or two
benchmark datasets, and there is a lack of systematic comparisons of
MI learning algorithms. This thesis presents a comprehensive study of
MI learning algorithms that aims to compare their performance and find
a suitable way to properly address different MI problems. First, it
briefly reviews the history of research on MI learning. Then it
discusses five general classes of MI approaches that cover a total of
16 MI algorithms. After that, it presents empirical results for these
algorithms that were obtained from 15 datasets which involve five
different realworld application domains. Finally, some conclusions
are drawn from these results: (1) applying suitable standard
singleinstance learners to MI problems can often generate the best
result on the datasets that were tested, (2) algorithms exploiting the
standard asymmetric MI assumption do not show significant advantages
over approaches using the socalled collective assumption, and (3)
different MI approaches are suitable for different application
domains, and no MI algorithm works best on all MI problems.


[232]

Reinforcement Learning for Racecar Control
Benjamin George Cleland.
Reinforcement learning for racecar control.
Master's thesis, Department of Computer Science, University of
Waikato, 2006.
[ bib 
http ]
This thesis investigates the use of reinforcement learning to learn to drive a racecar in the simulated environment of the Robot Automobile Racing Simulator. Reallife race driving is known to be difficult for humans, and expert human drivers use complex sequences of actions. There are a large number of variables, some of which change stochastically and all of which may affect the outcome. This makes driving a promising domain for testing and developing Machine Learning techniques that have the potential to be robust enough to work in the real world. Therefore the principles of the algorithms from this work may be applicable to a range of problems. The investigation starts by finding a suitable data structure to represent the information learnt. This is tested using supervised learning. Reinforcement learning is added and roughly tuned, and the supervised learning is then removed. A simple tabular representation is found satisfactory, and this avoids difficulties with more complex methods and allows the investigation to concentrate on the essentials of learning. Various reward sources are tested and a combination of three are found to produce the best performance. Exploration of the problem space is investigated. Results show exploration is essential but controlling how much is done is also important. It turns out the learning episodes need to be very long and because of this the task needs to be treated as continuous by using discounting to limit the size of the variables stored. Eligibility traces are used with success to make the learning more efficient. The tabular representation is made more compact by hashing and more accurate by using smaller buckets. This slows the learning but produces better driving. The improvement given by a rough form of generalisation indicates the replacement of the tabular method by a function approximator is warranted. These results show reinforcement learning can work within the Robot Automobile Racing Simulator, and lay the foundations for building a more efficient and competitive agent.


[233]

Improving Hoeffding Trees
Richard Kirkby.
Improving Hoeffding Trees.
PhD thesis, Department of Computer Science, University of Waikato,
2007.
[ bib 
http ]
Modern information technology allows information to be
collected at a far greater rate than ever before. So fast, in fact,
that the main problem is making sense of it all. Machine learning
offers promise of a solution, but the field mainly focusses on
achieving high accuracy when data supply is limited. While this has
created sophisticated classification algorithms, many do not cope
with increasing data set sizes. When the data set sizes get to a
point where they could be considered to represent a continuous
supply, or data stream, then incremental classification algorithms
are required. In this setting, the effectiveness of an algorithm
cannot simply be assessed by accuracy alone. Consideration needs to
be given to the memory available to the algorithm and the speed at
which data is processed in terms of both the time taken to predict
the class of a new data sample and the time taken to include this
sample in an incrementally updated classification model. The
Hoeffding tree algorithm is a stateoftheart method for inducing
decision trees from data streams. The aim of this thesis is to
improve this algorithm. To measure improvement, a comprehensive
framework for evaluating the performance of data stream algorithms
is developed. Within the framework memory size is fixed in order to
simulate realistic application scenarios. In order to simulate
continuous operation, classes of synthetic data are generated
providing an evaluation on a large scale. Improvements to many
aspects of the Hoeffding tree algorithm are demonstrated. First, a
number of methods for handling continuous numeric features are
compared. Second, tree prediction strategy is investigated to
evaluate the utility of various methods. Finally, the possibility of
improving accuracy using ensemble methods is explored. The
experimental results provide meaningful comparisons of accuracy and
processing speeds between different modifications of the Hoeffding
tree algorithm under various memory limits. The study on numeric
attributes demonstrates that sacrificing accuracy for space at the
local level often results in improved global accuracy. The
prediction strategy shown to perform best adaptively chooses between
standard majority class and Naive Bayes prediction in the
leaves. The ensemble method investigation shows that combining trees
can be worthwhile, but only when sufficient memory is available, and
improvement is less likely than in traditional machine learning. In
particular, issues are encountered when applying the popular
boosting method to streams.


[234]

Racing algorithms for conditional independence inference
Remco R. Bouckaert and Milan Studeny.
Racing algorithms for conditional independence inference.
Int. J. Approx. Reasoning, 45(2):386401, 2007.
[ bib 
.pdf ]
In this article, we consider the computational aspects of
deciding whether a conditional independence statement t is implied
by a list of conditional independence statements L using the
implication related to the method of structural imsets. We present
two methods which have the interesting complementary properties that
one method performs well to prove that t is implied by L, while
the other performs well to prove that t is not implied by
L. However, both methods do not well perform the opposite. This
gives rise to a parallel algorithm in which both methods race against
each other in order to determine effectively whether t is or is not
implied.
Some empirical evidence is provided that suggest this racing
algorithms method performs considerably better than an existing method
based on socalled skeletal characterization of the respective
implication. Furthermore, unlike previous methods, the method is able
to handle more than five variables.


[235]

Syntaxdriven argument identification and multiargument classification for semantic role labeling
ChiSan Althon Lin.
Syntaxdriven argument identification and multiargument
classification for semantic role labeling.
PhD thesis, Department of Computer Science, University of Waikato,
2007.
[ bib 
http ]
Semantic role labeling is an important stage in systems for Natural Language Understanding. The basic problem is one of identifying who did what to whom for each predicate in a sentence. Thus labeling is a twostep process: identify constituent phrases that are arguments to a predicate, then label those arguments with appropriate thematic roles. Existing systems for semantic role labeling use machine learning methods to assign roles oneatatime to candidate arguments. There are several drawbacks to this general approach. First, more than one candidate can be assigned the same role, which is undesirable. Second, the search for each candidate argument is exponential with respect to the number of words in the sentence. Third, singlerole assignment cannot take advantage of dependencies known to exist between semantic roles of predicate arguments, such as their relative juxtaposition. And fourth, execution times for existing algorithm are excessive, making them unsuitable for realtime use. This thesis seeks to obviate these problems by approaching semantic role labeling as a multiargument classification process. It observes that the only valid arguments to a predicate are unembedded constituent phrases that do not overlap that predicate. Given that semantic role labeling occurs after parsing, this thesis proposes an algorithm that systematically traverses the parse tree when looking for arguments, thereby eliminating the vast majority of impossible candidates. Moreover, instead of assigning semantic roles one at a time, an algorithm is proposed to assign all labels simultaneously; leveraging dependencies between roles and eliminating the problem of duplicate assignment. Experimental results are provided as evidence to show that a combination of the proposed argument identification and multiargument classification algorithms outperforms all existing systems that use the same syntactic information.


[236]

An Empirical Comparison of Exact Nearest Neighbour Algorithms
Ashraf M. Kibriya and Eibe Frank.
An empirical comparison of exact nearest neighbour algorithms.
In Proc 11th European Conference on Principles and Practice of
Knowledge Discovery in Databases, Warsaw, Poland, pages 140151. Springer,
2007.
[ bib 
.pdf ]
Abstract. Nearest neighbour search (NNS) is an old problem that is
of practical importance in a number of fields. It involves finding, for
a given point q, called the query, one or more points from a given set
of points that are nearest to the query q. Since the initial inception of
the problem a great number of algorithms and techniques have been
proposed for its solution. However, it remains the case that many of the
proposed algorithms have not been compared against each other on a
wide variety of datasets. This research attempts to fill this gap to some
extent by presenting a detailed empirical comparison of three prominent
data structures for exact NNS: KDTrees, Metric Trees, and Cover Trees.
Our results suggest that there is generally little gain in using Metric Trees
or Cover Trees instead of KDTrees for the standard NNS problem.


[237]

Effective Classifiers for Detecting Objects
Michael Mayo.
Effective classifiers for detecting objects.
In Proc. of the Fourth International Conference on Computational
Intelligence, Robotics, and Autonomous Systems (CIRAS '07), 2007.
[ bib 
.pdf ]
Several stateoftheart machine learning classifiers are compared for the purposes of object detection in complex images, using global image features derived from the Ohta color space and Local Binary Patterns. Image complexity in this sense refers to the degree to which the target objects are occluded and/or nondominant (i.e. not in the foreground) in the image, and also the degree to which the images are cluttered with nontarget objects. The results indicate that a voting ensemble of Support Vector Machines, Random Forests, and Boosted Decision Trees provide the best performance with AUC values of up to 0.92 and Equal Error Rate accuracies of up to 85.7% in stratified 10fold cross validation experiments on the GRAZ02 complex image dataset.


[238]

Random Convolution Ensembles
Michael Mayo.
Random convolution ensembles.
In Hi Shing Ip et. al, editor, Advances in Multimedia
Information Processing  PCM 2007, 8th Pacific Rim Conference on Multimedia,
Lecture Notes in Computer Science 4810, pages 216225. Springer, 2007.
[ bib 
.pdf ]
A novel method for creating diverse ensembles of image classifiers is proposed. The idea is that, for each base image classifier in the ensemble, a random image transformation is generated and applied to all of the images in the labeled training set. The base classifiers are then learned using features extracted from these randomly transformed versions of the training data, and the result is a highly diverse ensemble of image classifiers. This approach is evaluated on a benchmark pedestrian detection dataset and shown to be effective.


[239]

A discriminative approach to structured biological data
S. Mutter and B. Pfahringer.
A discriminative approach to structured biological data.
In Proc NZCSRSC'07, the Fifth New Zealand Computer Science
Research Student Conference, Hamilton, New Zealand, April 2007.
[ bib 
.pdf ]
This paper introduces the first author's PhD pro ject which
has just got out of its initial stage. Biological sequence data is, on the
one hand, highly structured. On the other hand there are large amounts
of unlabelled data. Thus we combine probabilistic graphical models and
semisupervised learning. The former to handle structured data and the
latter to deal with unlabelled data. We apply our models to genotype
phenotype modelling problems. In particular we predict the set of Single
Nucleotide Polymorphisms which underlie a specific phenotypical trait.


[240]

Scaling Up Semisupervised Learning: An Efficient and Effective LLGC Variant
Bernhard Pfahringer, Claire Leschi, and Peter Reutemann.
Scaling up semisupervised learning: An efficient and effective llgc
variant.
In Proc 11th PacificAsia Conference on Knowledge Discovery and
Data Mining, Nanjing, China, pages 236247. Springer, 2007.
[ bib 
http ]
Domains like text classification can easily supply large amounts of unlabeled data, but labeling itself is expensive. Semi supervised learning tries to exploit this abundance of unlabeled training data to improve classification. Unfortunately most of the theoretically wellfounded algorithms that have been described in recent years are cubic or worse in the total number of both labeled and unlabeled training examples. In this paper we apply modifications to the standard LLGC algorithm to improve efficiency to a point where we can handle datasets with hundreds of thousands of training data. The modifications are priming of the unlabeled data, and most importantly, sparsification of the similarity matrix. We report promising results on large text classification problems.


[241]

New Options for Hoeffding Trees
Bernhard Pfahringer, Geoffrey Holmes, and Richard Kirkby.
New options for hoeffding trees.
In Proc 20th Australian Conference on Artificial Intelligence,
Gold Coast, Australia, pages 9099. Springer, 2007.
[ bib 
http ]
Hoeffding trees are stateoftheart for processing highspeed data streams. Their ingenuity stems from updating sufficient statistics, only addressing growth when decisions can be made that are guaranteed to be almost identical to those that would be made by conventional batch learning methods. Despite this guarantee, decisions are still subject to limited lookahead and stability issues. In this paper we explore Hoeffding Option Trees, a regular Hoeffding tree containing additional option nodes that allow several tests to be applied, leading to multiple Hoeffding trees as separate paths. We show how to control tree growth in order to generate a mixture of paths, and empirically determine a reasonable number of paths. We then empirically evaluate a spectrum of Hoeffding tree variations: single trees, option trees and bagged trees. Finally, we investigate pruning. We show that on some datasets a pruned option tree can be smaller and more accurate than a single tree.


[242]

Clustering Relational Data Based on Randomized Propositionalization
Grant Anderson and Bernhard Pfahringer.
Clustering relational data based on randomized propositionalization.
In Proc 17th International Conference on Inductive Logic
Programming, Corvallis, OR, pages 3948. Springer, 2007.
[ bib 
http ]
Clustering of relational data has so far received a lot less attention than classification of such data. In this paper we investigate a simple approach based on randomized propositionalization, which allows for applying standard clustering algorithms like KMeans to multirelational data. We describe how random rules are generated and then turned into booleanvalued features. Clustering generally is not straightforward to evaluate, but preliminary experimental results on a number of standard ILP datasets show promising results. Clusters generated without class information usually agree well with the true class labels of cluster members, i.e. class distributions inside clusters generally differ significantly from the global class distributions. The twotiered algorithm described shows good scalability due to the randomized nature of the first step and the availability of efficient propositional clustering algorithms for the second step.


[243]

Clustering for Classification
Reuben Evans.
Clustering for classification.
Master's thesis, Department of Computer Science, University of
Waikato, 2007.
[ bib 
http ]
Advances in technology have provided industry with an array of devices
for collecting data. The frequency and scale of data collection means
that there are now many large datasets being generated. To find
patterns in these datasets it would be useful to be able to apply
modern methods of classification such as support vector
machines. Unfortunately these methods are computationally expensive,
quadratic in the number of data points in fact, so cannot be applied
directly. This thesis proposes a framework whereby a variety of
clustering methods can be used to summarise datasets, that is, reduce
them to a smaller but still representative dataset so that these
advanced methods can be applied. It compares the results of using this
framework against using random selection on a large number of
classification and regression problems. Results show that the
clustered datasets are on average fifty percent smaller than the
original datasets without loss of classification accuracy which is
significantly better than random selection. They also show that there
is no free lunch, for each dataset it is important to choose a
clustering method carefully.


[244]

Fast Algorithms for Nearest Neighbour Search
Ashraf Masood Kibriya.
Fast algorithms for nearest neighbour search.
Master's thesis, Department of Computer Science, University of
Waikato, 2007.
[ bib 
http ]
The nearest neighbour problem is of practical significance
in a number of fields. Often we are interested in finding an object
near to a given query object. The problem is old, and a large number
of solutions have been proposed for it in the literature. However, it
remains the case that even the most popular of the techniques proposed
for its solution have not been compared against each other. Also, many
techniques, including the old and popular ones, can be implemented in
a number of ways, and often the different implementations of a
technique have not been thoroughly compared either. This research
presents a detailed investigation of different implementations of two
popular nearest neighbour search data structures, KDTrees and Metric
Trees, and compares the different implementations of each of the two
structures against each other. The best implementations of these
structures are then compared against each other and against two other
techniques, Annulus Method and Cover Trees. Annulus Method is an old
technique that was rediscovered during the research for this
thesis. Cover Trees are one of the most novel and promising data
structures for nearest neighbour search that have been proposed in the
literature.


[245]

Effective LinearTime Feature Selection
Nripendra Pradhananga.
Effective lineartime feature selection.
Master's thesis, Department of Computer Science, University of
Waikato, 2007.
[ bib 
http ]
The classification learning task requires selection of a
subset of features to represent patterns to be classified. This is
because the performance of the classifier and the cost of
classification are sensitive to the choice of the features used to
construct the classifier. Exhaustive search is impractical since it
searches every possible combination of features. The runtime of
heuristic and random searches are better but the problem still
persists when dealing with highdimensional datasets. We investigate a
heuristic, forward, wrapperbased approach, called Linear Sequential
Selection, which limits the search space at each iteration of the
feature selection process. We introduce randomization in the search
space. The algorithm is called Randomized Linear Sequential
Selection. Our experiments demonstrate that both methods are faster,
find smaller subsets and can even increase the classification
accuracy. We also explore the idea of ensemble learning. We have
proposed two ensemble creation methods, Feature Selection Ensemble and
Random Feature Ensemble. Both methods apply a feature selection
algorithm to create individual classifiers of the ensemble. Our
experiments have shown that both methods work well with
highdimensional data.


[246]

Bestfirst Decision Tree Learning
Haijian Shi.
Bestfirst decision tree learning.
Master's thesis, Department of Computer Science, University of
Waikato, 2007.
[ bib 
http ]
In bestfirst topdown induction of decision trees, the
best split is added in each step (e.g. the split that maximally
reduces the Gini index). This is in contrast to the standard
depthfirst traversal of a tree. The resulting tree will be the same,
just how it is built is different. The objective of this project is to
investigate whether it is possible to determine an appropriate tree
size on practical datasets by combining bestfirst decision tree
growth with crossvalidationbased selection of the number of
expansions that are performed. Prepruning, postpruning, CARTpruning
can be performed this way to compare.


[247]

Prediction Intervals for Class Probabilities
Xiaofeng Yu.
Prediction intervals for class probabilities.
Master's thesis, Department of Computer Science, University of
Waikato, 2007.
[ bib 
http ]
Prediction intervals for class probabilities are of
interest in machine learning because they can quantify the uncertainty
about the class probability estimate for a test instance. The idea is
that all likely class probability values of the test instance are
included, with a prespecified confidence level, in the calculated
prediction interval. This thesis proposes a probabilistic model for
calculating such prediction intervals. Given the unobservability of
class probabilities, a Bayesian approach is employed to derive a
complete distribution of the class probability of a test instance
based on a set of class observations of training instances in the
neighbourhood of the test instance. A random decision tree ensemble
learning algorithm is also proposed, whose prediction output
constitutes the neighbourhood that is used by the Bayesian model to
produce a PI for the test instance. The Bayesian model, which is used
in conjunction with the ensemble learning algorithm and the standard
nearestneighbour classifier, is evaluated on artificial datasets and
modified real datasets.


[248]

Learning from the Past with Experiment Databases
Joaquin Vanschoren, Bernhard Pfahringer, and Geoffrey Holmes.
Learning from the past with experiment databases.
In Proc 10th Pacific Rim International Conference on Artificial
Intelligence, Hanoi, Vietnam, pages 485496. Springer, 2008.
[ bib ]
Thousands of Machine Learning research papers contain experimental comparisons that usually have been conducted with a single focus of interest, often losing detailed results after publication. Yet, when collecting all these past experiments in experiment databases, they can readily be reused for additional and possibly much broader investigation. In this paper, we make use of such a database to answer various interesting research questions about learning algorithms and to verify a number of recent studies. Alongside performing elaborate comparisons of algorithms, we also investigate the effects of algorithm parameters and data properties, and seek deeper insights into the behavior of learning algorithms by studying their learning curves and biasvariance profiles.


[249]

Multilabel Classification Using Ensembles of Pruned Sets
Jesse Read, Bernhard Pfahringer, and Geoffrey Holmes.
Multilabel classification using ensembles of pruned sets.
In Proc 8th IEEE International Conference on Data Mining, Pisa,
Italy, pages 9951000. IEEE Computer Society, 2008.
[ bib 
http ]
This paper presents a pruned sets method (PS) for multilabel classification. It is centred on the concept of treating sets of labels as single labels. This allows the classification process to inherently take into account correlations between labels. By pruning these sets, PS focuses only on the most important correlations, which reduces complexity and improves accuracy. By combining pruned sets in an ensemble scheme (EPS), new label sets can be formed to adapt to irregular or complex data. The results from experimental evaluation on a variety of multilabel datasets show that [E]PS can achieve better performance and train much faster than other multilabel methods.


[250]

Practical Bias Variance Decomposition
Remco R. Bouckaert.
Practical bias variance decomposition.
In Proc 21st Australasian Joint Conference on Artificial
Intelligence, Auckland, New Zealand. Springer, 2008.
[ bib 
.pdf ]
Bias variance decomposition for classifiers is a useful tool in
understanding classifier behavior. Unfortunately, the literature
does not provide consistent guidelines on how to apply a bias
variance decomposition. This paper examines the various parameters
and variants of empirical bias variance decompositions through an
extensive simulation study. Based on this study, we recommend to use
ten fold cross validation as sampling method and take 100 samples
within each fold with a test set size of at least 2000. Only if the
learning algorithm is stable, fewer samples, a smaller test set
size or lower number of folds may be justified.


[251]

Revisiting MultipleInstance Learning via Embedded Instance Selection
James Foulds and Eibe Frank.
Revisiting multipleinstance learning via embedded instance
selection.
In Proc 21st Australasian Joint Conference on Artificial
Intelligence, Auckland, New Zealand. Springer, 2008.
[ bib 
.pdf ]
MultipleInstance Learning via Embedded Instance Selection
(MILES) is a recently proposed multipleinstance (MI) classification al
gorithm that applies a singleinstance base learner to a propositional
ized version of MI data. However, the original authors consider only one
singleinstance base learner for the algorithm  the 1norm SVM. We
present an empirical study investigating the efficacy of alternative base
learners for MILES, and compare MILES to other MI algorithms. Our
results show that boosted decision stumps can in some cases provide
better classification accuracy than the 1norm SVM as a base learner
for MILES. Although MILES provides competitive performance when
compared to other MI learners, we identify simpler propositionaliza
tion methods that require shorter training times while retaining MILES'
strong classification performance on the datasets we tested.


[252]

Additive Regression Applied to a LargeScale Collaborative Filtering Problem
Eibe Frank and Mark Hall.
Additive regression applied to a largescale collaborative filtering
problem.
In Proc 21set Australasian Joint Conference on Artificial
Intelligence, Auckland, New Zealand. Springer, 2008.
[ bib 
.pdf ]
The muchpublicized Netflix competition has put the spot
light on the application domain of collaborative filtering and has sparked
interest in machine learning algorithms that can be applied to this sort
of problem. The demanding nature of the Netflix data has lead to some
interesting and ingenious modifications to standard learning methods in
the name of efficiency and speed. There are three basic methods that
have been applied in most approaches to the Netflix problem so far:
standalone neighborhoodbased methods, latent factor models based on
singularvalue decomposition, and ensembles consisting of variations of
these techniques. In this paper we investigate the application of forward
stagewise additive modeling to the Netflix problem, using two regression
schemes as base learners: ensembles of weighted simple linear regressors
and kmeans clusteringthe latter being interpreted as a tool for multi
variate regression in this context. Experimental results show that our
methods produce competitive results.


[253]

Discriminating Against New Classes: OneClass versus MultiClass Classification
Kathryn Hempstalk and Eibe Frank.
Discriminating against new classes: Oneclass versus multiclass
classification.
In Proc 21set Australasian Joint Conference on Artificial
Intelligence, Auckland, New Zealand. Springer, 2008.
[ bib 
.pdf ]
Many applications require the ability to identify data that
is anomalous with respect to a target group of observations, in the sense
of belonging to a new, previously unseen 'attacker' class. One possible
approach to this kind of verification problem is oneclass classification,
learning a description of the target class concerned based solely on data
from this class. However, if known nontarget classes are available at
training time, it is also possible to use standard multiclass or twoclass
classification, exploiting the negative data to infer a description of the
target class. In this paper we assume that this scenario holds and inves
tigate under what conditions multiclass and twoclass Naive Bayes clas
sifiers are preferable to the corresponding oneclass model when the aim
is to identify examples from a new 'attacker' class. To this end we first
identify a way of performing a fair comparison between the techniques
concerned and present an adaptation of standard crossvalidation. This
is one of the main contributions of the paper. Based on the experimental
results obtained, we then show under what conditions which group of
techniques is likely to be preferable. Our main finding is that multiclass
and twoclass classification becomes preferable to oneclass classification
when a sufficiently large number of nontarget classes is available.


[254]

Propositionalisation of Profile Hidden Markov Models for Biological Sequence Analysis
Stefan Mutter, Bernhard Pfahringer, and Geoffrey Holmes.
Propositionalisation of profile hidden markov models for biological
sequence analysis.
In Proc 21st Australasian Joint Conference on Artificial
Intelligence, Auckland, New Zealand. Springer, 2008.
[ bib 
.pdf ]
Hidden Markov Models are a widely used generative model
for analysing sequence data. A variant, Profile Hidden Markov Models
are a special case used in Bioinformatics to represent, for example,
protein families. In this paper we introduce a simple
propositionalisation method for Profile Hidden Markov Models. The
method allows the use of PHMMs discriminatively in a classification
task. Previously, kernel approaches have been proposed to generate a
discriminative description for an HMM, but require the explicit
definition of a similarity measure for HMMs. Propositionalisation does
not need such a measure and allows the use of any propositional
learner including kernelbased approaches. We show empirically that
using propositionalisation leads to higher accuracies in comparison
with PHMMs on benchmark datasets.


[255]

Mining Arbitrarily Large Datasets Using Heuristic kNearest
Neighbour Search
Xing Wu, Geoffrey Holmes, and Bernhard Pfahringer.
Mining arbitrarily large datasets using heuristic knearest neighbour
search.
In Proc 21st Australasian Joint Conference on Artificial
Intelligence, Auckland, New Zealand, pages 355361. Springer, 2008.
[ bib 
http ]
Nearest Neighbour Search (NNS) is one of the top ten data mining algorithms. It is simple and effective but has a time complexity that is the product of the number of instances and the number of dimensions. When the number of dimensions is greater than two there are no known solutions that can guarantee a sublinear retrieval time. This paper describes and evaluates two ways to make NNS efficient for datasets that are arbitrarily large in the number of instances and dimensions. The methods are best described as heuristic as they are neither exact nor approximate. Both stem from recent developments in the field of data stream classification. The first uses Hoeffding Trees, an extension of decision trees to streams and the second is a direct stream extension of NNS. The methods are evaluated in terms of their accuracy and the time taken to find the neighbours. Results show that the methods are competitive with NNS in terms of accuracy but significantly faster.


[256]

Organizing the World's Machine Learning Information
Joaquin Vanschoren, Hendrik Blockeel, Bernhard Pfahringer, and Geoffrey Holmes.
Organizing the world's machine learning information.
In Proc 3rd International Symposium on Leveraging Applications
of Formal Methods, Verification and Validation, Porto Sani, Greece, pages
693708. Springer, 2008.
[ bib 
http ]
All around the globe, thousands of learning experiments are being executed on a daily basis, only to be discarded after interpretation. Yet, the information contained in these experiments might have uses beyond their original intent and, if properly stored, could be of great use to future research. In this paper, we hope to stimulate the development of such learning experiment repositories by providing a bird'seye view of how they can be created and used in practice, bringing together existing approaches and new ideas. We draw parallels between how experiments are being curated in other sciences, and consecutively discuss how both the empirical and theoretical details of learning experiments can be expressed, organized and made universally accessible. Finally, we discuss a range of possible services such a resource can offer, either used directly or integrated into data mining tools.


[257]

Adaptive feature thresholding for offline signature verification
Rober Larkins and Michael Mayo.
Adaptive feature thresholding for offline signature verification.
In Proc 23rd International Conference Image and Vision Computing
New Zealand, Christchurch, New Zealand, pages 16. IEEE, 2008.
[ bib 
www: ]
This paper introduces Adaptive Feature Thresholding (AFT) which is a novel method of persondependent offline signature verification. AFT enhances how a simple image feature of a signature is converted to a binary feature vector by significantly improving its representation in relation to the training signatures. The similarity between signatures is then easily computed from their corresponding binary feature vectors. AFT was tested on the CEDAR and GPDS benchmark datasets, with classification using either a manual or an automatic variant. On the CEDAR dataset we achieved a classification accuracy of 92% for manual and 90% for automatic, while on the GPDS dataset we achieved over 87% and 85% respectively. For both datasets AFT is less complex and requires fewer images features than the existing state of the art methods, while achieving competitive results.


[258]

Improving face gender classification by adding deliberately misaligned faces to the training data
Michael Mayo and Edmond Zhang.
Improving face gender classification by adding deliberately
misaligned faces to the training data.
In Proc 23rd International Conference Image and Vision Computing
New Zealand, Christchurch, New Zealand, pages 15. IEEE, 2008.
[ bib 
http ]
A novel method of face gender classifier construction is proposed and evaluated. Previously, researchers have assumed that a computationally expensive face alignment step (in which the face image is transformed so that facial landmarks such as the eyes, nose, chin, etc, are in uniform locations in the image) is required in order to maximize the accuracy of predictions on new face images. We, however, argue that this step is not necessary, and that machine learning classifiers can be made robust to face misalignments by automatically expanding the training data with examples of faces that have been deliberately misaligned (for example, translated or rotated). To test our hypothesis, we evaluate this automatic training dataset expansion method with two types of image classifier, the first based on weak features such as Local Binary Pattern histograms, and the second based on SIFT keypoints. Using a benchmark face gender classification dataset recently proposed in the literature, we obtain a stateoftheart accuracy of 92.5%, thus validating our approach.


[259]

Pattern discovery for object categorization
Edmond Zhang and Michael Mayo.
Pattern discovery for object categorization.
In Proc 23rd International Conference Image and Vision Computing
New Zealand, Christchurch, New Zealand, pages 1  6. IEEE, 2008.
[ bib 
http ]
This paper presents a new approach for the object categorization problem. Our model is based on the successful `bag of words' approach. However, unlike the original model, image features (keypoints) are not seen as independent and orderless. Instead, our model attempts to discover intermediate representations for each object class. This approach works by partitioning the image into smaller regions then computing the spatial relationships between all of the informative image keypoints in the region. The results show that the inclusion of spatial relationships leads to a measurable increase in performance for two of the most challenging datasets.


[260]

Machine Learning for Adaptive Computer Game Opponents
Jonathan David Miles.
Machine learning for adaptive computer game opponents.
Master's thesis, Department of Computer Science, University of
Waikato, 2008.
[ bib 
http ]
This thesis investigates the use of machine learning techniques in computer games to create a computer player that adapts to its opponent's gameplay. This includes first confirming that machine learning algorithms can be integrated into a modern computer game without have a detrimental effect on game performance, then experimenting with different machine learning techniques to maximize the computer player's performance. Experiments use three machine learning techniques; static prediction models, continuous learning, and reinforcement learning. Static models show the highest initial performance but are not able to beat a simple opponent. Continuous learning is able to improve the performance achieved with static models but the rate of improvement drops over time and the computer player is still unable to beat the opponent. Reinforcement learning methods have the highest rate of improvement but the lowest initial performance. This limits the effectiveness of reinforcement learning because a large number of episodes are required before performance becomes sufficient to match the opponent.


[261]

Random Relational Rules
Grant Anderson.
Random Relational Rules.
PhD thesis, Department of Computer Science, University of Waikato,
2008.
[ bib 
http ]
In the field of machine learning, methods for learning from singletable data have received much more attention than those for learning from multitable, or relational data, which are generally more computationally complex. However, a significant amount of the world's data is relational. This indicates a need for algorithms that can operate efficiently on relational data and exploit the larger body of work produced in the area of singletable techniques. This thesis presents algorithms for learning from relational data that mitigate, to some extent, the complexity normally associated with such learning. All algorithms in this thesis are based on the generation of random relational rules. The assumption is that random rules enable efficient and effective relational learning, and this thesis presents evidence that this is indeed the case. To this end, a system for generating random relational rules is described, and algorithms using these rules are evaluated. These algorithms include direct classification, classification by propositionalisation, clustering, semisupervised learning and generating random forests. The experimental results show that these algorithms perform competitively with previously published results for the datasets used, while often exhibiting lower runtime than other tested systems. This demonstrates that sufficient information for classification and clustering is retained in the rule generation process and that learning with random rules is efficient. Further applications of random rules are investigated. Propositionalisation allows singletable algorithms for classification and clustering to be applied to the resulting data, reducing the amount of relational processing required. Further results show that techniques for utilising additional unlabeled training data improve accuracy of classification in the semisupervised setting. The thesis also develops a novel algorithm for building random forests by making efficient use of random rules to generate trees and leaves in parallel.


[262]

Experiment Databases: Creating a New Platform for MetaLearning Research
Joaquin Vanschoren, Hendrik Blockeel, Bernhard Pfahringer, and Geoffrey Holmes.
Experiment databases: Creating a new platform for metalearning
research.
In Proc ICML/COLT/UAI 2008 Planning to Learn Workshop,
Helsinki, Finland. University of Porto, 2008.
[ bib 
http ]


[263]

Oneclass Classification by Combining Density and Class Probability Estimation
Kathryn Hempstalk, Eibe Frank, and Ian H. Witten.
Oneclass classification by combining density and class probability
estimation.
In Proc 12th European Conference on Principles and Practice of
Knowledge Discovery in Databases and 19th European Conference on Machine
Learning, Antwerp, Belgium. Springer, 2008.
[ bib 
.pdf ]
Oneclass classification has important applications such as
outlier and novelty detection. It is commonly tackled using density es
timation techniques or by adapting a standard classification algorithm
to the problem of carving out a decision boundary that describes the
location of the target data. In this paper we investigate a simple method
for oneclass classification that combines the application of a density es
timator, used to form a reference distribution, with the induction of a
standard model for class probability estimation. In this method, the ref
erence distribution is used to generate artificial data that is employed
to form a second, artificial class. In conjunction with the target class,
this artificial class is the basis for a standard twoclass learning problem.
We explain how the density function of the reference distribution can be
combined with the class probability estimates obtained in this way to
form an adjusted estimate of the density function of the target class. Us
ing UCI datasets, and data from a typist recognition problem, we show
that the combined model, consisting of both a density estimator and a
class probability estimator, can improve on using either component tech
nique alone when used for oneclass classification. We also compare the
method to oneclass classification using support vector machines.


[264]

Combining Naive Bayes and Decision Tables
Mark Hall and Eibe Frank.
Combining naive Bayes and decision tables.
In Proc 21st Florida Artificial Intelligence Research Society
Conference, Miami, Florida. AAAI Press, 2008.
[ bib 
.pdf ]
We investigate a simple seminaive Bayesian ranking method
that combines naive Bayes with induction of decision tables.
Naive Bayes and decision tables can both be trained effi
ciently, and the same holds true for the combined seminaive
model. We show that the resulting ranker, compared to ei
ther component technique, frequently significantly increases
AUC. For some datasets it significantly improves on both
techniques. This is also the case when attribute selection is
performed in naive Bayes and its seminaive variant.


[265]

Propositionalisation of multiple sequence alignments using probabilistic models
Stefan Mutter, Bernhard Pfahringer, and Geoffrey Holmes.
Propositionalisation of multiple sequence alignments using
probabilistic models.
In New Zealand Computer Science Research Student Conference
(NZCSRSC 2008), Christchurch, New Zealand, pages 234237, April 2008.
[ bib 
.pdf ]
Multiple sequence alignments play a central role in
Bioinformatics. Most alignment representations are designed to
facilitate knowledge extraction by human experts. Additionally
statistical models like Profile Hidden Markov Models are used as
representations. They offer the advantage to provide sound,
probabilistic scores. The basic idea we present in this paper is to
use the structure of a Profile Hidden Markov Model for
propositionalisation. This way we get a simple, extendable
representation of multiple sequence alignments which facilitates
further analysis by Machine Learning algorithms.


[266]

Exploiting propositionalization based on random relational rules for semisupervised learning
Bernhard Pfahringer and Grant Anderson.
Exploiting propositionalization based on random relational rules for
semisupervised learning.
In Proc 12th PacificAsia Conference on Knowledge Discovery and
Data Mining, Osaka, Japan. Springer, 2008.
[ bib 
http ]
In this paper we investigate an approach to semisupervised learning based on randomized propositionalization, which allows for applying standard propositional classification algorithms like support vector machines to multirelational data. Randomization based on random relational rules can work both with and without a class attribute and can therefore be applied simultaneously to both the labeled and the unlabeled portion of the data present in semisupervised learning. An empirical investigation compares semisupervised propositionalization to standard propositionalization using just the labeled data portion, as well as to a variant that also just uses the labeled data portion but includes the label information in an attempt to improve the resulting propositionalization. Preliminary experimental results indicate that propositionalization generated on the full dataset, i.e. the semi supervised approach, tends to outperform the other two more standard approaches.


[267]

Handling Numeric Attributes in Hoeffding Trees
Bernhard Pfahringer, Geoffrey Holmes, and Richard Kirkby.
Handling numeric attributes in hoeffding trees.
In Proc 12th PacificAsia Conference on Knowledge Discovery and
Data Mining, Osaka, Japan, pages 296307. Springer, 2008.
[ bib 
http ]
For conventional machine learning classification algorithms handling numeric attributes is relatively straightforward. Unsupervised and supervised solutions exist that either segment the data into predefined bins or sort the data and search for the best split points. Unfortunately, none of these solutions carry over particularly well to a data stream environment. Solutions for data streams have been proposed by several authors but as yet none have been compared empirically. In this paper we investigate a range of methods for multiclass treebased classification where the handling of numeric attributes takes place as the tree is constructed. To this end, we extend an existing approximation approach, based on simple Gaussian approximation. We then compare this method with four approaches from the literature arriving at eight final algorithm configurations for testing. The solutions cover a range of options from perfectly accurate and memory intensive to highly approximate. All methods are tested using the Hoeffding tree classification algorithm. Surprisingly, the experimental comparison shows that the most approximate methods produce the most accurate trees by allowing for faster tree growth.


[268]

Learning Instance Weights in MultiInstance Learning
James Foulds.
Learning instance weights in multiinstance learning.
Master's thesis, Department of Computer Science, University of
Waikato, 2008.
[ bib 
http ]
Multiinstance (MI) learning is a variant of supervised
machine learning, where each learning example contains a bag of
instances instead of just a single feature vector. MI learning has
applications in areas such as drug activity prediction, fruit disease
management and image classification. This thesis investigates the case
where each instance has a weight value determining the level of
influence that it has on its s class label. This is a more general
assumption than most existing approaches use, and thus is more widely
applicable. The challenge is to accurately estimate these weights in
order to make predictions at the bag level. An existing approach known
as MILES is retroactively identified as an algorithm that uses
instance weights for MI learning, and is evaluated using a variety of
base learners on benchmark problems. New algorithms for learning
instance weights for MI learning are also proposed and rigorously
evaluated on both artificial and realworld datasets. The new
algorithms are shown to achieve better root mean squared error rates
than existing approaches on artificial data generated according to the
underlying assumptions. Experimental results also demonstrate that the
new algorithms are competitive with existing approaches on realworld
problems.


[269]

The Positive Effects of Negative Information: Extending
OneClass Classification Models in Binary Proteomic Sequence
Classification
Stefan Mutter, Bernhard Pfahringer, and Geoffrey Holmes.
The positive effects of negative information: Extending oneclass
classification models in binary proteomic sequence classification.
In Proc 22nd Australasian Conference on Artificial
Intelligence, Melbourne, Australia, pages 260269. Springer, 2009.
[ bib 
http ]
Profile Hidden Markov Models (PHMMs) have been widely used as models for Multiple Sequence Alignments. By their nature, they are generative oneclass classifiers trained only on sequences belonging to the target class they represent. Nevertheless, they are often used to discriminate between classes. In this paper, we investigate the beneficial effects of information from nontarget classes in discriminative tasks. Firstly, the traditional PHMM is extended to a new binary classifier. Secondly, we propose propositional representations of the original PHMM that capture information from target and nontarget sequences and can be used with standard binary classifiers. Since PHMM training is time intensive, we investigate whether our approach allows the training of the PHMM to stop, before it is fully converged, without loss of predictive power.


[270]

Relational Random Forests Based on Random Relational Rules
Grant Anderson and Bernhard Pfahringer.
Relational random forests based on random relational rules.
In Proc 21st International Joint Conference on Artificial
Intelligence, Pasadena, California, pages 986991. AAAI Press, 2009.
[ bib 
.pdf ]
Random Forests have been shown to perform very well in propositional learning. FORF is an upgrade of Random Forests for relational data. In this paper we investigate shortcomings of FORF and propose an alternative algorithm, RF, for generating Random Forests over relational data. RF employs randomly generated relational rules as fully selfcontained Boolean tests inside each node in a tree and thus can be viewed as an instance of dynamic propositionalization. The implementation of RF allows for the simultaneous or parallel growth of all the branches of all the trees in the ensemble in an efficient shared, but still singlethreaded way. Experiments favorably compare RF to both FORF and the combination of static propositionalization together with standard Random Forests. Various strategies for tree initialization and splitting of nodes, as well as resulting ensemble size, diversity, and computational complexity of RF are also investigated.


[271]

The WEKA data mining software: an update
Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann,
and Ian H. Witten.
The WEKA data mining software: an update.
SIGKDD Explorations, 11(1):1018, 2009.
[ bib 
.pdf ]
More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.


[272]

3D Face Recognition Using Multiview Keypoint Matching
Michael Mayo and Edmond Zhang.
3d face recognition using multiview keypoint matching.
In Proc 6th International Conference on Advanced Video and
Signal Based Surveillance, Genova, Italy, pages 290295. IEEE Computer
Society, 2009.
[ bib 
http ]
A novel algorithm for 3D face recognition based point cloud rotations, multiple projections, and voted keypoint matching is proposed and evaluated. The basic idea is to rotate each 3D point cloud representing an individual's face around the x, y or z axes, iteratively projecting the 3D points onto multiple 2.5D images at each step of the rotation. Labelled keypoints are then extracted from the resulting collection of 2.5D images, and this much smaller set of keypoints replaces the original face scan and its projections in the face database. Unknown test faces are recognised firstly by performing the same multiview keypoint extraction technique, and secondly, the application of a new weighted keypoint matching algorithm. In an extensive evaluation using the GavabDB 3D face recognition dataset (61 subjects, 9 scans per subject), our method achieves up to 95% recognition accuracy for faces with neutral expressions only, and over 90% accuracy for face recognition where expressions (such as a smile or a strong laugh) and random faceoccluding gestures are permitted.


[273]

SIFTing the Relevant from the Irrelevant: Automatically
Detecting Objects in Training Images
Edmond Zhang and Michael Mayo.
Sifting the relevant from the irrelevant: Automatically detecting
objects in training images.
In Proc Digital Image Computing: Techniques and Applications,
Melbourne, Australia, pages 317324. IEEE Computer Society, 2009.
[ bib 
http ]
Many stateoftheart object recognition systems rely on identifying the location of objects in images, in order to better learn its visual attributes. In this paper, we propose four simple yet powerful hybrid ROI detection methods (combining both local and global features), based on frequently occurring keypoints. We show that our methods demonstrate competitive performance in two different types of datasets, the Caltech101 dataset and the GRAZ02 dataset, where the pairs of keypoint bounding box method achieved the best accuracies overall.


[274]

New ensemble methods for evolving data streams
Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard
Gavaldà.
New ensemble methods for evolving data streams.
In KDD '09: Proceedings of the 15th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 139148, New York,
NY, USA, 2009. ACM.
[ bib 
.pdf ]
Advanced analysis of data streams is quickly becoming a key area of data mining research as the number of applications demanding such processing increases. Online mining when such data streams evolve over time, that is when concepts drift or change completely, is becoming one of the core issues. When tackling nonstationary concepts, ensembles of classifiers have several advantages over single classifier methods: they are easy to scale and parallelize, they can adapt to change quickly by pruning underperforming parts of the ensemble, and they therefore usually also generate more accurate concept descriptions. This paper proposes a new experimental data stream framework for studying concept drift, and two new variants of Bagging: ADWIN Bagging and AdaptiveSize Hoeffding Tree (ASHT) Bagging. Using the new experimental framework, an evaluation study on synthetic and realworld datasets comprising up to ten million examples shows that the new ensemble methods perform very well compared to several known methods.


[275]

Improving Adaptive Bagging Methods for Evolving Data Streams
Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Ricard Gavaldà.
Improving adaptive bagging methods for evolving data streams.
In Proceedings of the 1st Asian Conference on Machine Learning,
Nanjing, China. Springer, 2009.
[ bib 
.pdf ]
We propose two new improvements for bagging methods on evolving data streams. Recently, two new variants of Bagging were proposed: ADWIN Bagging and AdaptiveSize Hoeffding Tree (ASHT) Bagging. ASHT Bagging uses trees of different sizes, and ADWIN Bagging uses ADWIN as a change detector to decide when to discard underperforming ensemble members. We improve ADWIN Bagging using Hoeffding Adaptive Trees, trees that can adaptively learn from data streams that change over time. To speed up the time for adapting to change of AdaptiveSize Hoeffding Tree (ASHT) Bagging, we add an error change detector for each classifier. We test our improvements by performing an evaluation study on synthetic and realworld datasets comprising up to ten million examples.


[276]

Conditional Density Estimation with Class Probability Estimators
Eibe Frank and Remco Bouckaert.
Conditional density estimation with class probability estimators.
In Proceedings of the 1st Asian Conference on Machine Learning,
Nanjing, China. Springer, 2009.
[ bib 
.pdf ]
Many regression schemes deliver a point estimate only, but often it is useful or even essential to quantify the uncertainty inherent in a prediction. If a conditional density estimate is available, then prediction intervals can be derived from it. In this paper we compare three techniques for computing conditional density estimates using a class probability estimator, where this estimator is applied to the discretized target variable and used to derive instance weights for an underlying univariate density estimator; this yields a conditional density estimate. The three density estimators we compare are: a histogram estimator that has been used previously in this context, a normal density estimator, and a kernel estimator. In our experiments, the latter two deliver better performance, both in terms of crossvalidated loglikelihood and in terms of quality of the resulting prediction intervals. The empirical coverage of the intervals is close to the desired confidence level in most cases. We also include results for point estimation, as well as a comparison to Gaussian process regression and nonparametric quantile estimation.


[277]

Treebased Density Estimation: Algorithms and Applications
Gabi Schmidberger.
Treebased Density Estimation: Algorithms and Applications.
PhD thesis, Department of Computer Science, University of Waikato,
2009.
[ bib 
http ]
Data Mining can be seen as an extension to statistics. It comprises the preparation of data and the process of gathering new knowledge from it. The extraction of new knowledge is supported by various machine learning methods. Many of the algorithms are based on probabilistic principles or use density estimations for their computations. Density estimation has been practised in the field of statistics for several centuries. In the simplest case, a histogram estimator, like the simple equalwidth histogram, can be used for this task and has been shown to be a practical tool to represent the distribution of data visually and for computation. Like other nonparametric approaches, it can provide a flexible solution. However, flexibility in existing approaches is generally restricted because the size of the bins is fixed either the width of the bins or the number of values in them. Attempts have been made to generate histograms with a variable bin width and a variable number of values per interval, but the computational approaches in these methods have proven too difficult and too slow even with modern computer technology. In this thesis new flexible histogram estimation methods are developed and tested as part of various machine learning tasks, namely discretization, naive Bayes classification, clustering and multipleinstance learning. Not only are the new density estimation methods applied to machine learning tasks, they also borrow design principles from algorithms that are ubiquitous in artificial intelligence: divideandconquer methods are a well known way to tackle large problems by dividing them into small subproblems. Decision trees, used for machine learning classification, successfully apply this approach. This thesis presents algorithms that build density estimators using a binary split tree to cut a range of values into subranges of varying length. No class values are required for this splitting process, making it an unsupervised method. The result is a histogram estimator that adapts well even to complex density functions a novel density estimation method with flexible density estimation ability and good computational behaviour. Algorithms are presented for both univariate and multivariate data. The univariate histogram estimator is applied to discretization for density estimation and also used as density estimator inside a naive Bayes classifier. The multivariate histogram, used as the basis for a clustering method, is applied to improve the runtime behaviour of a wellknown algorithm for multipleinstance classification. Performance in these applications is evaluated by comparing the new approaches with existing methods.


[278]

Classifier Chains for Multilabel Classification
Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank.
Classifier chains for multilabel classification.
In Proc 13th European Conference on Principles and Practice of
Knowledge Discovery in Databases and 20th European Conference on Machine
Learning, Bled, Slovenia. Springer, 2009.
[ bib 
.pdf ]
The widely known binary relevance method for multilabel classification, which considers each label as an independent binary problem, has been sidelined in the literature due to the perceived inadequacy of its labelindependence assumption. Instead, most current methods invest considerable complexity to model interdependencies between labels. This paper shows that binary relevancebased methods have much to offer, especially in terms of scalability to large datasets. We exemplify this with a novel chaining method that can model label correlations while maintaining acceptable computational complexity. Empirical evaluation over a broad range of multilabel datasets with a variety of evaluation metrics demonstrates the competitiveness of our chaining method against related and stateoftheart methods, both in terms of predictive performance and time complexity.


[279]

Continuous Typist Verification using Machine Learning
Kathryn Hempstalk.
Continuous Typist Verification using Machine Learning.
PhD thesis, Department of Computer Science, University of Waikato,
2009.
[ bib 
http ]
A keyboard is a simple input device. Its function is to send keystroke information to the computer (or other device) to which it is attached. Normally this information is employed solely to produce text, but it can also be utilized as part of an authentication system. Typist verification exploits a typist's patterns to check whether they are who they say they are, even after standard authentication schemes have confirmed their identity. This thesis investigates whether typists behave in a sufficiently unique yet consistent manner to enable an effective level of verification based on their typing patterns. Typist verification depends on more than the typist's behaviour. The quality of the patterns and the algorithms used to compare them also determine how accurately verification is performed. This thesis sheds light on all technical aspects of the problem, including data collection, feature identification and extraction, and sample classification. A dataset has been collected that is comparable in size, timing accuracy and content to others in the field, with one important exception: it is derived from real emails, rather than samples collected in an artificial setting. This dataset is used to gain insight into what features distinguish typists from one another. The features and dataset are used to train learning algorithms that make judgements on the origin of previously unseen typing samples. These algorithms use 'oneclass classification'; they make predictions for a particular user having been trained on only that user's patterns. This thesis examines many oneclass classification algorithms, including ones designed specifically for typist verification. New algorithms and features are proposed to increase speed and accuracy. The best method proposed performs at the state of the art in terms of classification accuracy, while decreasing the time taken for a prediction from minutes to seconds, and  more importantly  without requiring any negative data from other users. Also, it is general: it applies not only to typist verification, but to any other oneclass classification problem. Overall, this thesis concludes that typist verification can be considered a useful biometric technique.


[280]

Humancompetitive Automatic Topic Indexing
Olena Medelyan.
Humancompetitive Automatic Topic Indexing.
PhD thesis, Department of Computer Science, University of Waikato,
2009.
[ bib 
http ]
Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document's topics helps people judge its relevance quickly. However, assigning topics manually is labor intensive. This thesis shows how to generate them automatically in a way that competes with human performance. Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general twostage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domainspecific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples. This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers, and by amateurs. We claim that the algorithm is humancompetitive because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages.


[281]

Humancompetitive tagging using automatic keyphrase extraction
Olena Medelyan, Eibe Frank, and Ian H. Witten.
Humancompetitive tagging using automatic keyphrase extraction.
In Proc Conf on Empirical Methods in Natural Language
Processing. ACL, 2009.
[ bib 
.pdf ]
This paper connects two research areas: automatic tagging on the web and statistical keyphrase extraction. First, we analyze the quality of tags in a collaboratively created folksonomy using traditional evaluation techniques. Next, we demonstrate how documents can be tagged automatically with a stateoftheart keyphrase extraction algorithm, and further improve performance in this new domain using a new algorithm, ``Maui'', that utilizes semantic information extracted from Wikipedia. Maui outperforms existing approaches and extracts tags that are competitive with those assigned by the best performing human taggers.


[282]

Clustering Documents using a Wikipediabased Concept Representation
Anna Huang, David Milne, Eibe Frank, and Ian H. Witten.
Clustering documents using a wikipediabased concept representation.
In Proc 13th PacificAsia Conference on Knowledge Discovery and
Data Mining, Bangkog, Thailand, pages 628636. Springer, 2009.
[ bib 
.pdf ]
This paper shows how Wikipedia and the semantic knowledge it contains can be exploited for document clustering. We first create a conceptbased document representation by mapping the terms and phrases within documents to their corresponding articles (or concepts) in Wikipedia. We also developed a similarity measure that evaluates the semantic relatedness between concept sets for two documents. We test the conceptbased representation and the similarity measure on two standard text document datasets. Empirical results show that although further optimizations could be performed, our approach already improves upon related techniques.


[283]

Samplingbased Prediction of Algorithm Runtime
Quan Sun.
Samplingbased prediction of algorithm runtime.
Master's thesis, Department of Computer Science, University of
Waikato, 2009.
[ bib 
http ]
The ability to handle and analyse massive amounts of data has been progres sively improved during the last decade with the growth of computing power and the opening up of the Internet era. Nowadays, machine learning algorithms have been widely applied in various fields of engineering sciences and in real world applications. However, currently, users of machine learning algorithms do not usually receive feedback on when a given algorithm will have finished building a model for a particular data set. While in theory such estimation can be obtained by asymptotic performance analysis, the complexity of machine learning algorithms means theoretical asymptotic performance analysis can be a very difficult task. This work has two goals. The first goal is to investigate how to use samplingbased techniques to predict the running time of a ma chine learning algorithm training on a particular data set. The second goal is to empirically evaluate a set of samplingbased running time prediction meth ods. Experimental results show that, with some care in the sampling stage, application of appropriate transformations on the running time observations followed by the use of suitable curve fitting algorithms makes it possible to obtain useful averagecase running time predictions and an approximate time function for a given machine learning algorithm building a model on a particular data set. There are 41 WEKA (Witten Frank, 2005) machine learning algorithms are used for the experiments.


[284]

Largescale attribute selection using wrappers
Martin Gütlein, Eibe Frank, Mark Hall, and Andreas Karwath.
Largescale attribute selection using wrappers.
In Proc IEEE Symposium on Computational Intelligence and Data
Mining, pages 332339. IEEE, 2009.
[ bib 
.pdf ]
Schemespecific attribute selection with the wrapper and variants of forward selection is a popular attribute selection technique for classification that yields good results. However, it can run the risk of overfitting because of the extent of the search and the extensive use of internal crossvalidation. Moreover, although wrapper evaluators tend to achieve superior accuracy compared to filters, they face a high computational cost. The problems of overfitting and high runtime occur in particular on highdimensional datasets, like microarray data.
We investigate Linear Forward Selection, a technique to reduce the number of attributes expansions in each forward selection step. Our experiments demonstrate that this approach is faster, finds smaller subsets and can even increase the accuracy compared to standard forward selection. We also investigate a variant that applies explicit subset size determination in forward selection to combat overfitting, where the search is forced to stop at a precomputed ``optimal'' subset size. We show that this technique reduces subset size while maintaining comparable accuracy.


[285]

Analysing chromatographic data using data mining to monitor petroleum content in water
G. Holmes, D. Fletcher, P. Reutemann, and E. Frank.
Analysing chromatographic data using data mining to monitor petroleum
content in water.
In Proc 4th International ICSC Symposium, Thessaloniki, Greece,
Information Technologies in Enviromental Engineering, pages 279290, May
2009.
[ bib 
.pdf ]
Chromatography is an important analytical technique that has wide
spread use in environmental applications. A typical application is the
monitoring of water samples to determine if they contain petroleum. These
tests are mandated in many countries to enable environmental agencies to
determine if tanks used to store petrol are leaking into local water systems.
Chromatographic techniques, typically using gas or liquid chromatography
coupled with mass spectrometry, allow an analyst to detect a vast array of
compoundspotentially in the order of thousands. Accurate analysis relies
heavily on the skills of a limited pool of experienced analysts utilising
semiautomatic techniques to analyse these datasetsmaking the out
comes subjective.
The focus of current laboratory data analysis systems has been on refine
ments of existing approaches. The work described here represents a para
digm shift achieved through applying data mining techniques to tackle the
problem. These techniques are compelling because the efficacy of pre
processing methods, which are essential in this application area, can be ob
jectively evaluated. This paper presents preliminary results using a data
mining framework to predict the concentrations of petroleum compounds
in water samples. Experiments demonstrate that the framework can be
used to produce models of sufficient accuracymeasured in terms of root
mean squared error and correlation coefficientsto offer the potential fo
significantly reducing the time spent by analysts on this task.


[286]

Offline signature verification
Robert L. Larkins.
Offline signature verification.
Master's thesis, Department of Computer Science, University of
Waikato, 2009.
[ bib 
http ]
In today's society signatures are the most accepted form of identity verification. However, they have the unfortunate sideeffect of being easily abused by those who would feign the identification or intent of an individual. This thesis implements and tests current approaches to offline signature verification with the goal of determining the most beneficial techniques that are available. This investigation will also introduce novel techniques that are shown to significantly boost the achieved classification accuracy for both persondependent (oneclass training) and personindependent (twoclass training) signature verification learning strategies. The findings presented in this thesis show that many common techniques do not always give any significant advantage and in some cases they actually detract from the classification accuracy. Using the techniques that are proven to be most beneficial, an effective approach to signature verification is constructed, which achieves approximately 90% and 91% on the standard CEDAR and GPDS signature datasets respectively. These results are significantly better than the majority of results that have been previously published. Additionally, this approach is shown to remain relatively stable when a minimal number of training signatures are used, representing feasibility for realworld situations.


[287]

Machine learning for adaptive computer game opponents
Jonathan Miles.
Machine learning for adaptive computer game opponents.
Master's thesis, Department of Computer Science, University of
Waikato, 2009.
[ bib ]


[288]

Prediction of Oestrus in Dairy Cows: An Application of Machine Learning to Skewed Data
Adam David Lynam.
Prediction of oestrus in dairy cows: An application of machine
learning to skewed data.
Master's thesis, Department of Computer Science, University of
Waikato, 2009.
[ bib 
http ]
The Dairy industry requires accurate detection of oestrus(heat) in dairy cows to maximise output of the animals. Traditionally this is a process dependant on human observation and interpretation of the various signs of heat. Many areas of the dairy industry can be automated, however the detection of oestrus is an area that still requires human experts. This thesis investigates the application of Machine Learning classification techniques, on dairy cow milking data provided by the Livestock Improvement Corporation, to predict oestrus. The usefulness of various ensemble learning algorithms such as Bagging and Boosting are explored as well as specific skewed data techniques. An empirical study into the effectiveness of classifiers designed to target skewed data is included as a significant part of the investigation. Roughly Balanced Bagging and the novel Under Bagging classifiers are explored in considerable detail and found to perform quite favourably over the SMOTE technique for the datasets selected. This study uses nondairy, commonplace, Machine Learning datasets; many of which are found in the UCI Machine Learning Repository.


[289]

Accuracy of machine learning models versus hand crafted expert systems  A credit scoring case study
Arie BenDavid and Eibe Frank.
Accuracy of machine learning models versus hand crafted expert
systems  a credit scoring case study.
Expert Systems with Applications, 36(3):52645271, 2009.
[ bib 
http ]
Relatively few publications compare machine learning models with expert systems when applied to the same problem domain. Most publications emphasize those cases where the former beat the latter. Is it a realistic picture of the state of the art?
Some other findings are presented here. The accuracy of a real world mind crafted credit scoring expert system is compared with dozens of machine learning models. The results show that while some machine learning models can surpass the expert system's accuracy with statistical significance, most models do not. More interestingly, this happened only when the problem was treated as regression. In contrast, no machine learning model showed any statistically significant advantage over the expert system's accuracy when the same problem was treated as classification. Since the true nature of the class data was ordinal, the latter is the more appropriate setting. It is also shown that the answer to the question is highly dependent on the meter that is being used to define accuracy.


[290]

Encyclopedia of Database Systems
Ian H. Witten.
Encyclopedia of Database Systems, chapter Classification, pages
331335.
Springer, 2009.
[ bib 
http ]


[291]

Efficient multilabel classification for evolving data streams
Jesse Read.
Efficient multilabel classification for evolving data streams.
PhD thesis, Department of Computer Science, University of Waikato,
2010.
[ bib 
http ]
Multilabel classiﬁcation is relevant to many domains, such as text,
image and other media, and bioinformatics. Researchers have already
noticed that in multilabel data, correlations exist between labels,
and a variety of approaches, drawing inspiration from many spheres of
machine learning, have been able to model these correlations. However,
data sources from the real world are growing ever larger and the
multilabel task is particularly sensitive to this due to the
complexity associated with multiple labels and the correlations
between them. Consequently, many methods do not scale up to large
problems.
This thesis deals with scalable multilabel classiﬁcation: methods
which exhibit high predictive performance, but are also able to scale
up to larger problems. The first major contribution is the pruned sets
method, which is able to model label correlations directly for high
predictive performance, but reduces overﬁtting and complexity over
related methods by pruning and subsampling label sets, and can thus
scale up to larger datasets. The second major contribution is the
classifier chains method, which models correlations with a chain of
binary classiﬁers. The use of binary models allows for scalability to
even larger datasets. Pruned sets and classiﬁer chains are robust with
respect to both the variety and scale of data that they can deal with,
and can be incorporated into other methods. In an ensemble scheme,
these methods are able to compete with stateoftheart methods in
terms of predictive performance as well as scale up to large datasets
of hundreds of thousands of training examples.
This thesis also puts a special emphasis on multilabel evaluation;
introducing a new evaluation measure and studying threshold
calibration. With one of the largest and most varied collections of
multilabel datasets in the literature, extensive experimental
evaluation shows the advantage of these methods, both in terms of
predictive performance, and computational efficiency and scalability.


[292]

WEKAExperiences with a Java OpenSource Project
Remco R. Bouckaert, Eibe Frank, Mark A. Hall, Geoffrey Holmes, Bernhard
Pfahringer, Peter Reutemann, and Ian H. Witten.
WEKAexperiences with a java opensource project.
Journal of Machine Learning Research, 11:25332541, 2010.
[ bib 
.pdf ]
WEKA is a popular machine learning workbench with a development life of nearly two decades. This article provides an overview of the factors that we believe to be important to its success. Rather than focussing on the software's functionality, we review aspects of project management and historical development decisions that likely had an impact on the uptake of the project.


[293]

Accurate Ensembles for Data Streams: Combining Restricted Hoeffding Trees using Stacking
Albert Bifet, Eibe Frank, Geoffrey Holmes, and Bernhard Pfahringer.
Accurate ensembles for data streams: Combining restricted Hoeffding
trees using stacking.
In Proc 2nd Asian Conference on Machine Learning, Tokyo. JMLR,
2010.
[ bib 
.pdf ]
The success of simple methods for classification shows that is is often not necessary to model complex attribute interactions to obtain good classification accuracy on practical problems. In this paper, we propose to exploit this phenomenon in the data stream context by building an ensemble of Hoeffding trees that are each limited to a small subset of attributes. In this way, each tree is restricted to model interactions between attributes in its corresponding subset. Because it is not known a priori which attribute subsets are relevant for prediction, we build exhaustive ensembles that consider all possible attribute subsets of a given size. As the resulting Hoeffding trees are not all equally important, we weigh them in a suitable manner to obtain accurate classifications. This is done by combining the logodds of their probability estimates using sigmoid perceptrons, with one perceptron per class. We propose a mechanism for setting the perceptrons' learning rate using the ADWIN change detection method for data streams, and also use ADWIN to reset ensemble members (i.e. Hoeffding trees) when they no longer perform well. Our experiments show that the resulting ensemble classifier outperforms bagging for data streams in terms of accuracy when both are used in conjunction with adaptive naive Bayes Hoeffding trees, at the expense of runtime and memory consumption.


[294]

Enhanced spatial pyramid matching using logpolarbased image subdivision and representation
Edmond Zhang and Michael Mayo.
Enhanced spatial pyramid matching using logpolarbased image
subdivision and representation.
In Proc International Conference on Digital Image Computing:
Techniques and Applications, Sydney, Australia. IEEE Computer Society, 2010.
[ bib 
http ]
This paper presents a new model for capturing spatial information for object categorization with bagofwords (BOW). BOW models have recently become popular for the task of object recognition, owing to their good performance and simplicity. Much work has been proposed over the years to improve the BOW model, where the Spatial Pyramid Matching (SPM) technique is the most notable. We propose a new method to exploit spatial relationships between image features, based on binned logpolar grids. Our model works by partitioning the image into grids of different scales and orientations and computing histogram of local features within each grid. Experimental results show that our approach improves the results on three diverse datasets over the SPM technique.


[295]

Speeding Up and Boosting Diverse Density Learning
James R. Foulds and Eibe Frank.
Speeding up and boosting diverse density learning.
In Proc 13th International Conference on Discovery Science,
Canberra, Australia, pages 102116. Springer, 2010.
[ bib 
.pdf ]
In multiinstance learning, each example is described by a bag of instances instead of a single feature vector. In this paper, we revisit the idea of performing multiinstance classification based on a pointandscaling concept by searching for the point in instance space with the highest diverse density. This is a computationally expensive process, and we describe several heuristics designed to improve runtime. Our results show that simple variants of existing algorithms can be used to find diverse density maxima more efficiently. We also show how significant increases in accuracy can be obtained by applying a boosting algorithm with a modified version of the diverse density algorithm as the weak learner.


[296]

Sentiment Knowledge Discovery in Twitter Streaming Data
Albert Bifet and Eibe Frank.
Sentiment knowledge discovery in Twitter streaming data.
In Proc 13th International Conference on Discovery Science,
Canberra, Australia, pages 115. Springer, 2010.
[ bib 
.pdf ]
Microblogs are a challenging new source of information for data mining techniques. Twitter is a microblogging service built to dis cover what is happening at any moment in time, anywhere in the world. Twitter messages are short, and generated constantly, and well suited for knowledge discovery using data stream mining. We briefly discuss the challenges that Twitter data streams pose, focusing on classification problems, and then consider these streams for opinion mining and sentiment analysis. To deal with streaming unbalanced classes, we propose a sliding window Kappa statistic for evaluation in timechanging data streams. Using this statistic we perform a study on Twitter data using learning algorithms for data streams.


[297]

A Study of Hierarchical and Flat Classification of Proteins
Arthur Zimek, Fabian Buchwald, Eibe Frank, and Stefan Kramer.
A study of hierarchical and flat classification of proteins.
IEEE/ACM Transactions on Computational Biology and
Bioinformatics, 7:563571, 2010.
[ bib 
http ]
Automatic classification of proteins using machine learning is an important problem that has received significant attention in the literature. One feature of this problem is that expertdefined hierarchies of protein classes exist and can potentially be exploited to improve classification performance. In this article we investigate empirically whether this is the case for two such hierarchies. We compare multiclass classification techniques that exploit the information in those class hierarchies and those that do not, using logistic regression, decision trees, bagged decision trees, and support vector machines as the underlying base learners. In particular, we compare hierarchical and flat variants of ensembles of nested dichotomies. The latter have been shown to deliver strong classification performance in multiclass settings. We present experimental results for synthetic, fold recognition, enzyme classification, and remote homology detection data. Our results show that exploiting the class hierarchy improves performance on the synthetic data, but not in the case of the protein classification problems. Based on this we recommend that strong flat multiclass methods be used as a baseline to establish the benefit of exploiting class hierarchies in this area.


[298]

Evolving concurrent Petri net models of epistasis
Michael Mayo and Lorenzo Beretta.
Evolving concurrent petri net models of epistasis.
In Proc 2nd Asian Conference on Intelligent Information and
Database Systems, Hue City, Vietnam, pages 166175. Springer, 2010.
[ bib 
http ]
A genetic algorithm is used to learn a nondeterministic Petri netbased model of nonlinear gene interactions, or statistical epistasis. Petri nets are computational models of concurrent processes. However, often certain global assumptions (e.g. transition priorities) are required in order to convert a nondeterministic Petri net into a simpler deterministic model for easier analysis and evaluation. We show, by converting a Petri net into a set of state trees, that it is possible to both retain Petri net nondeterminism (i.e. allowing local interactions only, thereby making the model more realistic), whilst also learning useful Petri nets with practical applications. Our Petri nets produce predictions of genetic disease risk assessments derived from clinical data that match with over 92% accuracy.


[299]

A 3factor epistatic model predicts digital ulcers in Italian scleroderma patients
Lorenzo Beretta, Alessandro Santaniello, Michael Mayo, Francesca Cappiello,
Maurizio Marchini, and Raffaella Scorza.
A 3factor epistatic model predicts digital ulcers in italian
scleroderma patients.
European Journal of Internal Medicine, 21(4):347353, 2010.
[ bib 
http ]
Background The genetic background may predispose systemic sclerosis (SSc) patients to the development of digital ulcers (DUs). Methods Twentytwo functional cytokine single nucleotide polymorphisms (SNPs) and 3 HLA class I and II antigens were typed at the genomic level by polymerase chain reaction in 200 Italian SSc patients. Associations with DUs were sought by parametric models and with the Multifactor Dimensionality Reduction (MDR) algorithm to depict the presence of epistasis. Biological models consistent with MDR results were built by means of Petri nets to describe the metabolic significance of our findings. Results On the exploratory analysis, the diffuse cutaneous subset (dcSSc) was the only single factor statistically associated with DUs (p = 0.045, ns after Bonferroni correction). Genegene analysis showed that a 3factor model comprising the IL6 C174G, the IL2 G330T SNPs and the HLAB*3501 allele was predictive for the occurrence of DUs in our population (testing accuracy = 66.9%; p < 0.0001, permutation testing). Conclusion Biological interpretation via Petri net showed that IL6 is a key factor in determining DUs occurrence and that this cytokines may synergise with HLAB*3501 to determine DUs onset. Owing to the limited number of patients included in the study, future research are needed to replicate our statistical findings as well as to better determine their functional meaning.


[300]

Predicting polycyclic aromatic hydrocarbon concentrations in soil and water samples
Geoffrey Holmes, Dale Fletcher, and Peter Reutemann.
Predicting polycyclic aromatic hydrocarbon concentrations in soil and
water samples.
In Proc International Congress on Environmental Modelling and
Software, 2010.
[ bib 
http ]
Polycyclic Aromatic Hydrocarbons (PAHs) are compounds found in the environment that can be harmful to humans. They are typically formed due to incomplete combustion and as such remain after burning coal, oil, petrol, diesel, wood, household waste and so forth. Testing laboratories routinely screen soil and water samples taken from potentially contaminated sites for PAHs using Gas Chromatography Mass Spectrometry (GCMS). A GCMS device produces a chromatogram which is processed by an analyst to determine the concentrations of PAH compounds of interest. In this paper we investigate the application of data mining techniques to PAH chromatograms in order to provide reliable prediction of compound concentrations. A workflow engine with an easytouse graphical user interface is at the heart of processing the data. This engine allows a domain expert to set up workflows that can load the data, preprocess it in parallel in various ways and convert it into data suitable for data mining toolkits. The generated output can then be evaluated using different data mining techniques, to determine the impact of preprocessing steps on the performance of the generated models and for picking the best approach. Encouraging results for predicting PAH compound concentrations, in terms of correlation coefficients and rootmeansquared error are demonstrated.


[301]

Fast Conditional Density Estimation for Quantitative StructureActivity Relationships
Fabian Buchwald, Tobias Girschick, Eibe Frank, and Stefan Kramer.
Fast conditional density estimation for quantitative
structureactivity relationships.
In Proc 24th AAAI Conference on Artificial Intelligence, 2010.
[ bib 
http ]
Many methods for quantitative structureactivity relation ships (QSARs) deliver point estimates only, without quantifying the uncertainty inherent in the prediction. One way to quantify the uncertainy of a QSAR prediction is to pre dict the conditional density of the activity given the structure instead of a point estimate. If a conditional density estimate is available, it is easy to derive prediction intervals of activities. In this paper, we experimentally evaluate and compare three methods for conditional density estimation for their suitability in QSAR modeling. In contrast to traditional methods for conditional density estimation, they are based on generic machine learning schemes, more specifically, class probability estimators. Our experiments show that a kernel estimator based on class probability estimates from a random forest classifier is highly competitive with Gaussian process regression, while taking only a fraction of the time for train ing. Therefore, generic machinelearning based methods for conditional density estimation may be a good and fast option for quantifying uncertainty in QSAR modeling.


[302]

Fast Perceptron Decision Tree Learning from Evolving Data
Streams
Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Eibe Frank.
Fast perceptron decision tree learning from evolving data streams.
In Proc 14th PacificAsia Conference on Knowledge Discovery and
Data Mining, Hyderabad, India, pages 299310. Springer, 2010.
[ bib 
.pdf ]
Mining of data streams must balance three evaluation dimensions: accuracy, time and memory. Excellent accuracy on data streams has been obtained with Naive Bayes Hoeffding TreesHoeffding Trees with naive Bayes models at the leaf nodesalbeit with increased runtime compared to standard Hoeffding Trees. In this paper, we show that runtime can be reduced by replacing naive Bayes with perceptron classifiers, while maintaining highly competitive accuracy. We also show that accuracy can be increased even further by combining majority vote, naive Bayes, and perceptrons. We evaluate four perceptronbased learning strategies and compare them against appropriate baselines: simple perceptrons, Perceptron Hoeffding Trees, hybrid Naive Bayes Perceptron Trees, and bagged versions thereof. We implement a perceptron that uses the sigmoid activation function instead of the threshold activation function and optimizes the squared error, with one perceptron per class value. We test our methods by performing an evaluation study on synthetic and realworld datasets comprising up to ten million examples.


[303]

MOA: Massive Online Analysis
Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer.
Moa: Massive online analysis.
Journal of Machine Learning Research (JMLR), 2010.
[ bib 
.pdf ]
Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA includes a collection of offline and online methods as well as tools for evaluation. In particular, it implements boosting, bagging, and Hoeffding Trees, all with and without Naive Bayes classifiers at the leaves. MOA supports bidirectional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license.


[304]

GNUsmail: Open Framework for Online Email Classification
José M. CarmonaCejudo, Manuel BaenaGarcía, José del
CampoÁvila, Rafael Morales Bueno, and Albert Bifet.
Gnusmail: Open framework for online email classification.
In ECAI 2010  19th European Conference on Artificial
Intelligence, Lisbon, Portugal, August 1620, 2010, Proceedings, pages
11411142, 2010.
[ bib 
http ]
Realtime classification of massive email data is a challenging task that presents its own particular difficulties. Since email data presents an important temporal component, several problems arise: emails arrive continuously, and the criteria used to classify those emails can change, so the learning algorithms have to be able to deal with concept drift. Our problem is more general than spam detection, which has received much more attention in the literature. In this paper we present GNUsmail, an opensource extensible framework for email classification, which structure supports incremental and online learning. This framework enables the incorporation of algorithms developed by other researchers, such as those included in WEKA and MOA. We evaluate this framework, characterized by two overlapping phases (preprocessing and learning), using the ENRON dataset, and we compare the results achieved by WEKA and MOA algorithms.


[305]

Leveraging Bagging for Evolving Data Streams
Albert Bifet, Geoffrey Holmes, and Bernhard Pfahringer.
Leveraging bagging for evolving data streams.
In Machine Learning and Knowledge Discovery in Databases,
European Conference, ECML PKDD 2010, Barcelona, Spain, September 2024, 2010,
Proceedings, Part I, pages 135150, 2010.
[ bib 
http ]
Bagging, boosting and Random Forests are classical ensemble methods used to improve the performance of single classifiers. They obtain superior performance by increasing the accuracy and diversity of the single classifiers. Attempts have been made to reproduce these methods in the more challenging context of evolving data streams. In this paper, we propose a new variant of bagging, called leveraging bagging. This method combines the simplicity of bagging with adding more randomization to the input, and output of the classifiers. We test our method by performing an evaluation study on synthetic and realworld datasets comprising up to ten million examples.


[306]

MOA: Massive Online Analysis, a Framework for Stream Classification
and Clustering
Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Philipp Kranen, Hardy Kremer,
Timm Jansen, and Thomas Seidl.
Moa: Massive online analysis, a framework for stream classification
and clustering.
Journal of Machine Learning Research  Proceedings Track,
11:4450, 2010.
[ bib 
.html ]
Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA is designed to deal with the challenging problem of scaling up the implementation of state of the art algorithms to real world dataset sizes. It contains collection of offline and online for both classification and clustering as well as tools for evaluation. In particular, for classification it implements boosting, bagging, and Hoeffding Trees, all with and without Naive Bayes classifiers at the leaves. For clustering, it implements StreamKM++, CluStream, ClusTree, DenStream, DStream and CobWeb. Researchers benefit from MOA by getting insights into workings and problems of different approaches, practitioners can easily apply and compare several algorithms to real world data set and settings. MOA supports bidirectional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license.


[307]

Clustering Performance on Evolving Data Streams: Assessing
Algorithms and Evaluation Measures within MOA
Philipp Kranen, Hardy Kremer, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff
Holmes, and Bernhard Pfahringer.
Clustering performance on evolving data streams: Assessing algorithms
and evaluation measures within moa.
In ICDM Workshops, pages 14001403, 2010.
[ bib 
http ]
In today's applications, evolving data streams are ubiquitous. Stream clustering algorithms were introduced to gain useful knowledge from these streams in realtime. The quality of the obtained clusterings, i.e. how good they reflect the data, can be assessed by evaluation measures. A multitude of stream clustering algorithms and evaluation measures for clusterings were introduced in the literature, however, until now there is no general tool for a direct comparison of the different algorithms or the evaluation measures. In our demo, we present a novel experimental framework for both tasks. It offers the means for extensive evaluation and visualization and is an extension of the Massive Online Analysis (MOA) software environment released under the GNU GPL License.


[308]

ADAPTIVE STREAM MINING: Pattern Learning and Mining from Evolving Data Streams
Albert Bifet.
ADAPTIVE STREAM MINING: Pattern Learning and Mining from
Evolving Data Streams.
IOS Press, Amsterdam, 2010.
[ bib 
http ]


[309]

A Review of MultiInstance Learning Assumptions
James Foulds and Eibe Frank.
A review of multiinstance learning assumptions.
Knowledge Engineering Review, 25(1):125, 2010.
[ bib 
.pdf ]
Multiinstance (MI) learning is a variant of inductive machine learning where each learning example contains a bag of instances instead of a single feature vector. The term commonly refers to the supervised setting, where each bag is associated with a label. This type of representation is a natural fit for a number of realworld learning scenarios, including drug activity prediction and image classification, hence many multiinstance learning algorithms have been proposed. Any MI learning method must relate instances to baglevel class labels, but many types of relationships between instances and class labels are possible. Although all early work in MI learning assumes a specific MI concept class known to be appropriate for a drug activity prediction domain, this standard MI assumption is not guaranteed to hold in other domains. Much of the recent work in MI learning has concentrated on a relaxed view of the MI problem, where the standard MI assumption is dropped, and alternative assumptions are considered instead. However, often it is not clearly stated what particular assumption is used and how it relates to other assumptions that have been proposed. In this paper, we aim to clarify the use of alternative MI assumptions by reviewing the work done in this area.


[310]

Clustering mixed data
Lynette Hunt and Murray Jorgensen.
Clustering mixed data.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery, 1(4):352361, 2011.
[ bib 
http ]
Mixture model clustering proceeds by fitting a finite mixture of multivariate distributions to data, the fitted mixture density then being used to allocate the data to one of the components. Common model formulations assume that either all the attributes are continuous or all the attributes are categorical. In this paper, we consider options for model formulation in the more practical case of mixed data: multivariate data sets that contain both continuous and categorical attributes.


[311]

Using the online crossentropy method to learn relational
policies for playing different games
Samuel Sarjant, Bernhard Pfahringer, Kurt Driessens, and Tony Smith.
Using the online crossentropy method to learn relational policies
for playing different games.
In Proceedings IEEE Conference on Computational Intelligence and
Games, pages 182189. IEEE, 2011.
[ bib ]
By defining a videogame environment as a collection of objects, relations, actions and rewards, the relational reinforcement learning algorithm presented in this paper generates and optimises a set of concise, humanreadable relational rules for achieving maximal reward. Rule learning is achieved using a combination of incremental specialisation of rules and a modified online crossentropy method, which dynamically adjusts the rate of learning as the agent progresses. The algorithm is tested on the Ms. PacMan and Mario environments, with results indicating the agent learns an effective policy for acting within each environment.


[312]

Classifier chains for multilabel classification
Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank.
Classifier chains for multilabel classification.
Machine Learning, 85(3):333359, 2011.
[ bib 
http ]
The widely known binary relevance method for multilabel classification, which considers each label as an independent binary problem, has often been overlooked in the literature due to the perceived inadequacy of not directly modelling label correlations. Most current methods invest considerable complexity to model interdependencies between labels. This paper shows that binary relevancebased methods have much to offer, and that high predictive performance can be obtained without impeding scalability to large datasets. We exemplify this with a novel classifier chains method that can model label correlations while maintaining acceptable computational complexity. We extend this approach further in an ensemble framework. An extensive empirical evaluation covers a broad range of multilabel datasets with a variety of evaluation metrics. The results illustrate the competitiveness of the chaining method against related and stateoftheart methods, both in terms of predictive performance and time complexity.


[313]

Beyond Trees: Adopting MITI to Learn Rules and Ensemble
Classifiers for MultiInstance Data
Luke Bjerring and Eibe Frank.
Beyond trees: Adopting MITI to learn rules and ensemble classifiers
for multiinstance data.
In Proc 24th Australasian Joint Conference on Artificial
Intelligence, Perth, Australia, pages 4150. Springer, 2011.
[ bib 
.pdf ]
MITI is a simple and elegant decision tree learner designed for multiinstance classification problems, where examples for learning consist of bags of instances. MITI grows a tree in bestfirst manner by maintaining a priority queue containing the unexpanded nodes in the fringe of the tree. When the head node contains instances from positive examples only, it is made into a leaf, and any bag of data that is associated with this leaf is removed. In this paper we first revisit the basic algorithm and consider the effect of parameter settings on classification accuracy, using several benchmark datasets. We show that the chosen splitting criterion in particular can have a significant effect on accuracy. We identify a potential weakness of the algorithm—subtrees can contain structure that has been created using data that is subsequently removed—and show that a simple modification turns the algorithm into a rule learner that avoids this problem. This rule learner produces more compact classifiers with comparable accuracy on the benchmark datasets we consider. Finally, we present randomized algorithm variants that enable us to generate ensemble classifiers. We show that these can yield substantially improved classification accuracy.


[314]

Using Output Codes for Twoclass Classification Problems
Fanhua Zeng.
Using output codes for twoclass classification problems.
Master's thesis, Department of Computer Science, University of
Waikato, 2011.
[ bib 
http ]
Errorcorrecting output codes (ECOCs) have been widely used in many applications for multiclass classification problems. The problem is that ECOCs cannot be ap plied directly on twoclass datasets. The goal of this thesis is to design and evaluate an approach to solve this problem, and then investigate whether the approach can yield better classification models. To be able to use ECOCs, we turn twoclass datasets into multiclass datasets first, by using clustering. With the resulting multiclass datasets in hand, we evalu ate three different encoding methods for ECOCs: exhaustive coding, random coding and a “predefined” code that is found using random search. The exhaustive coding method has the highest errorcorrecting abilities. However, this method is limited due to the exponential growth of bit columns in the codeword matrix precluding it from being used for problems with large numbers of classes. Random coding can be used to cover situations with large numbers of classes in the data. To improve on completely random matrices, “predefined” codeword matrices can be generated by using random search that optimizes row separation yielding better error correction than a purely random matrix. To speed up the process of finding good matrices, GPU parallel programming is investigated in this thesis. From the empirical results, we can say that the new algorithm, which applies multiclass ECOCs on twoclass data using clustering, does improve the performance for some base learners, when compared to applying them directly to the original two class datasets.


[315]

Smoothing in Probability Estimation Trees
Zhimeng Han.
Smoothing in probability estimation trees.
Master's thesis, Department of Computer Science, University of
Waikato, 2011.
[ bib 
http ]
Classification learning is a type of supervised machine learning technique that uses a classification model (e.g. decision tree) to predict unknown class labels for previously unseen instances. In many applications it can be very useful to additionally obtain class probabilities for the different class labels. Decision trees that yield these probabilities are also called probability estimation trees (PETs). Smoothing is a technique used to improve the probability estimates. There are several existing smoothing methods, such as the Laplace correction, MEstimate smoothing and MBranch smoothing. Smoothing does not just apply to PETs. In the field of text compression, PPM in particular, smoothing methods play a important role. This thesis migrates smoothing methods from text compression to PETs. The newly migrated methods in PETs are compared with the best of the existing smoothing methods considered in this thesis under different experiment setups. Unpruned, pruned and bagged trees are considered in the experiments. The main finding is that the PPMbased methods yield the best probability estimate when used with bagged trees, but not when used with individual (pruned or unpruned) trees.


[316]

Conceptbased text clustering
Lan Huang.
Conceptbased text clustering.
PhD thesis, Department of Computer Science, University of Waikato,
2011.
[ bib 
http ]
Thematic organization of text is a natural practice of humans and a crucial task
for today’s vast repositories. Clustering automates this by assessing the similarity
between texts and organizing them accordingly, grouping like ones together and
separating those with diﬀerent topics. Clusters provide a comprehensive logical
structure that facilitates exploration, search and interpretation of current texts,
as well as organization of future ones.
Automatic clustering is usually based on words. Text is represented by the
words it mentions, and thematic similarity is based on the proportion of words
that texts have in common. The resulting bagofwords model is semantically
ambiguous and undesirably orthogonal—it ignores the connections between words.
This thesis claims that using concepts as the basis of clustering can signiﬁ
cantly improve eﬀectiveness. Concepts are deﬁned as units of knowledge. When
organized according to the relations among them, they form a concept system.
Two concept systems are used here: WordNet, which focuses on word knowledge,
and Wikipedia, which encompasses world knowledge.
We investigate a clustering procedure with three components: using concepts
to represent text; taking the semantic relations among them into account dur
ing clustering; and learning a text similarity measure from concepts and their
relations. First, we demonstrate that concepts provide a succinct and informa
tive representation of the themes in text, exemplifying this with the two concept
systems. Second, we deﬁne methods for utilizing concept relations to enhance
clustering by making the representation models more discriminative and extend
ing thematic similarity beyond surface overlap. Third, we present a similarity
measure based on concepts and their relations that is learned from a small num
ber of examples, and show that it both predicts similarity consistently with human
judgement and improves clustering. The thesis provides strong support for the
use of conceptbased representations instead of the classic bagofwords model.


[317]

Sequencebased protein classification: binary Profile Hidden Markov Models and propositionalisation
Stefan Mutter.
Sequencebased protein classification: binary Profile Hidden
Markov Models and propositionalisation.
PhD thesis, Department of Computer Science, University of Waikato,
2011.
[ bib 
http ]
Detecting similarity in biological sequences is a key element to understanding the mechanisms of life. Researchers infer potential structural, functional or evolutionary relationships from similarity. However, the concept of similarity is complex in biology. Sequences consist of different molecules with different chemical properties, have short and long distance interactions, form 3D structures and change through evolutionary processes. Amino acids are one of the key molecules of life. Most importantly, a sequence of amino acids constitutes the building block for proteins which play an essential role in cellular processes. This thesis investigates similarity amongst proteins. In this area of research there are two important and closely related classification tasks – the detection of similar proteins and the discrimination amongst them. Hidden Markov Models (HMMs) have been successfully applied to the detection task as they model sequence similarity very well. From a Machine Learning point of view these HMMs are essentially oneclass classifiers trained solely on a small number of similar proteins neglecting the vast number of dissimilar ones. Our basic assumption is that integrating this neglected information will be highly beneficial to the classification task. Thus, we transform the problem representation from a oneclass to a binary one. Equipped with the necessary sound understanding of Machine Learning, especially concerning problem representation and statistically significant evaluation, our work pursues and combines two different avenues on this aforementioned transformation. First, we introduce a binary HMM that discriminates significantly better than the standard one, even when only a fraction of the negative information is used. Second, we interpret the HMM as a structured graph of information. This information cannot be accessed by highly optimised standard Machine Learning classifiers as they expect a fixed length feature vector representation. Propositionalisation is a technique to transform the former representation into the latter. This thesis introduces new propositionalisation techniques. The change in representation changes the learning problem from a oneclass, generative to a propositional, discriminative one. It is a common assumption that discriminative techniques are better suited for classification tasks, and our results validate this assumption. We suggest a new way to significantly improve on discriminative power and runtime by means of terminating the timeintense training of HMMs early, subsequently applying propositionalisation and classifying with a discriminative, binary learner.


[318]

Experiments with Multiview Multiinstance Learning for Supervised Image Classification
Michael Mayo and Eibe Frank.
Experiments with multiview multiinstance learning for supervised
image classification.
In Proc 26th International Conference Image and Vision Computing
New Zealand, Auckland, New Zealand, pages 363369, 2011.
[ bib 
.pdf ]
In this paper we empirically investigate the benefits of multiview multiinstance (MVMI) learning for supervised
image classification. In multiinstance learning, examples for learning contain bags of feature vectors and thus data from
different views cannot simply be concatenated as in the singleinstance case. Hence, multiview learning, where one classifier
is built per view, is particularly attractive when applying multiinstance learning to image classification. We take several diverse
image data sets—ranging from person detection to astronomical object classification to species recognition—and derive a set of
multiple instance views from each of them. We then show via an extensive set of 10×10 stratified crossvalidation experiments that
MVMI, based on averaging predicted confidence scores, generally exceeds the performance of traditional singleview multiinstance
learning, when using support vector machines and boosting as the underlying learning algorithms.


[319]

A comparison of methods for estimating prediction intervals in NIR spectroscopy: Size matters
Remco R. Bouckaert, Eibe Frank, Geoffrey Holmes, and Dale Fletcher.
A comparison of methods for estimating prediction intervals in NIR
spectroscopy: Size matters.
Chemometrics and Intelligent Laboratory Systems, 109(2):139 
145, 2011.
[ bib 
http ]
In this article we demonstrate that, when evaluating a method for determining prediction intervals, interval size matters more than coverage because the latter can be fixed at a chosen confidence level with good reliability. To achieve the desired coverage, we employ a simple nonparametric method to compute prediction intervals by calibrating estimated prediction errors, and we extend the basic method with a continuum correction to deal with small data sets. In our experiments on a collection of several NIR data sets, we evaluate several existing methods of computing prediction intervals for partial leastsquares (PLS) regression. Our results show that, when coverage is fixed at a chosen confidence level, and the number of PLS components is selected to minimize squared error of point estimates, interval estimation based on the classic ordinary leastsquares method produces the narrowest intervals, outperforming the Udeviation method and linearization, regardless of the confidence level that is chosen.


[320]

Data Mining: Practical Machine Learning Tools and Techniques
Ian H. Witten, Eibe Frank, and Mark A. Hall.
Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann, Burlington, MA, 3 edition, 2011.
[ bib 
.html ]


[321]

Bagging Ensemble Selection
Quan Sun and Bernhard Pfahringer.
Bagging ensemble selection.
In Australasian Conference on Artificial Intelligence, pages
251260. Springer, 2011.
[ bib 
http ]
Ensemble selection has recently appeared as a popular ensemble learning method, not only because its implementation is fairly straightforward, but also due to its excellent predictive performance on practical problems. The method has been highlighted in winning solutions of many data mining competitions, such as the Netflix competition, the KDD Cup 2009 and 2010, the UCSD FICO contest 2010, and a number of data mining competitions on the Kaggle platform. In this paper we present a novel variant: bagging ensemble selection. Three variations of the proposed algorithm are compared to the original ensemble selection algorithm and other ensemble algorithms. Experiments with ten real world problems from diverse domains demonstrate the benefit of the bagging ensemble selection algorithm.


[322]

Semirandom Model Tree Ensembles: An Effective and Scalable
Regression Method
Bernhard Pfahringer.
Semirandom model tree ensembles: An effective and scalable
regression method.
In Australasian Conference on Artificial Intelligence, pages
231240. Springer, 2011.
[ bib 
http ]
We present and investigate ensembles of semirandom model trees as a novel regression method. Such ensembles combine the scalability of treebased methods with predictive performance rivalling the state of the art in numeric prediction. An empirical investigation shows that SemiRandom Model Trees produce predictive performance which is competitive with stateoftheart methods like Gaussian Processes Regression or Additive Groves of Regression Trees. The training and optimization of Random Model Trees scales better than Gaussian Processes Regression to larger datasets, and enjoys a constant advantage over Additive Groves of the order of one to two orders of magnitude.


[323]

A Model for Predicting the Resolution of Type 2 Diabetes in Severely Obese Subjects Following Rouxen Y Gastric Bypass Surgery
Mark Thomas Hayes, Lynette Anne Hunt, Jonathan Foo, Yulia Tychinskaya, and
Richard Strawson Stubbs.
A model for predicting the resolution of type 2 diabetes in severely
obese subjects following rouxen y gastric bypass surgery.
Obesity Surgery, 21(7):910916, 2011.
[ bib 
http ]
Severely obese type 2 diabetics who undergo Rouxen Y gastric bypass surgery have significant improvements in glycaemic control. Little work has been undertaken to establish the independent predictors of such resolution or to develop a predictive model. The aim of this study was to develop a mathematical model and establish independent predictors for the resolution of diabetes.
METHODS:
A consecutive sample of 130 severely obese type 2 diabetics who underwent gastric bypass surgery for weight loss from November 1997 to May 2007 with prospective preoperative documentation of biochemical and clinical measurements was followed up over 12 months. Logistic discrimination analysis was undertaken to identify those variables with independent predictive value and to develop a predictive model for resolution of type 2 diabetes. Consecutive samples of 130 patients with body mass index (BMI) ≥ 35 with type 2 diabetes were selected. One hundred and twentyseven patients completed the study with a sufficient data set. Patients were deemed unresolved if (1) diabetic medication was still required after surgery; (2) if fasting plasma glucose (FPG) remained >7 mmol/L; or (3) HbA1c remained >7%.
RESULTS:
Resolution of diabetes was seen in 84%, while diabetes remained but was improved in 16% of patients. Resolution was rapid and sustained with 74% of those on medication before surgery being able to discontinue this by the time of discharge 6 days following surgery. Five preoperative variables were found to have independent predictive value for resolution of diabetes, including BMI, HbA1c, FPG, hypertension and requirement for insulin. Two models have been proposed for prediction of diabetes resolution, each with 86% correct classification in this cohort of patients.
CONCLUSIONS:
Type 2 diabetes resolves in a very high percentage of patients undergoing gastric bypass surgery for severe obesity. The key predictive variables include preoperative BMI, HbA1c, FPG, the presence of hypertension and diabetic status.


[324]

Hybridizing Data Stream Mining and Technical Indicators
in Automated Trading Systems
Michael Mayo.
Hybridizing data stream mining and technical indicators in automated
trading systems.
In Modeling Decision for Artificial Intelligence  8th
International Conference, MDAI 2011, Changsha, Hunan, China, July 2830,
2011, Proceedings, pages 7990. Springer, 2011.
[ bib 
http ]
Automated trading systems for financial markets can use data mining techniques for future price movement prediction. However, classifier accuracy is only one important component in such a system: the other is a decision procedure utilizing the prediction in order to be long, short or out of the market. In this paper, we investigate the use of technical indicators as a means of deciding when to trade in the direction of a classifier’s prediction. We compare this “hybrid” technical/data stream miningbased system with a naive system that always trades in the direction of predicted price movement. We are able to show via evaluations across five financial market datasets that our novel hybrid technique frequently outperforms the naive system. To strengthen our conclusions, we also include in our evaluation several “simple” trading strategies without any data mining component that provide a much stronger baseline for comparison than traditional buyandhold or sellandhold strategies.


[325]

Modelling epistasis in genetic disease using Petri nets,
evolutionary computation and frequent itemset mining
Michael Mayo and Lorenzo Beretta.
Modelling epistasis in genetic disease using petri nets, evolutionary
computation and frequent itemset mining.
Expert Syst. Appl., 38(4):40064013, 2011.
[ bib 
http ]
Petri nets are useful for mathematically modelling diseasecausing genetic epistasis. A Petri net model of an interaction has the potential to lead to biological insight into the cause of a genetic disease. However, defining a Petri net by hand for a particular interaction is extremely difficult because of the sheer complexity of the problem and degrees of freedom inherent in a Petri net’s architecture.
We propose therefore a novel method, based on evolutionary computation and data mining, for automatically constructing Petri net models of nonlinear gene interactions. The method comprises two main steps. Firstly, an initial partial Petri net is set up with several repeated subnets that model individual genes and a set of constraints, comprising relevant common sense and biological knowledge, is also defined. These constraints characterise the class of Petri nets that are desired. Secondly, this initial Petri net structure and the constraints are used as the input to a genetic algorithm. The genetic algorithm searches for a Petri net architecture that is both a superset of the initial net, and also conforms to all of the given constraints. The genetic algorithm evaluation function that we employ gives equal weighting to both the accuracy of the net and also its parsimony.
We demonstrate our method using an epistatic model related to the presence of digital ulcers in systemic sclerosis patients that was recently reported in the literature. Our results show that although individual “perfect” Petri nets can frequently be discovered for this interaction, the true value of this approach lies in generating many different perfect nets, and applying data mining techniques to them in order to elucidate common and statistically significant patterns of interaction.


[326]

MOATweetReader: RealTime Analysis in Twitter Streaming
Data
Albert Bifet, Geoffrey Holmes, and Bernhard Pfahringer.
Moatweetreader: Realtime analysis in twitter streaming data.
In Discovery Science  14th International Conference, DS 2011,
Espoo, Finland, October 57, 2011. Proceedings, pages 4660. Springer,
2011.
[ bib 
http ]
Twitter is a microblogging service built to discover what is happening at any moment in time, anywhere in the world. Twitter messages are short, generated constantly, and well suited for knowledge discovery using data stream mining. We introduce MOATweetReader, a system for processing tweets in real time. We show two main applications of the new system for studying Twitter data: detecting changes in term frequencies and performing realtime sentiment analysis.


[327]

Active Learning with Evolving Streaming Data
Indre Zliobaite, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes.
Active learning with evolving streaming data.
In Machine Learning and Knowledge Discovery in Databases 
European Conference, ECML PKDD 2011, Athens, Greece, September 59, 2011,
Proceedings, Part III, pages 597612. Springer, 2011.
[ bib 
http ]
In learning to classify streaming data, obtaining the true labels may require major effort and may incur excessive cost. Active learning focuses on learning an accurate model with as few labels as possible. Streaming data poses additional challenges for active learning, since the data distribution may change over time (concept drift) and classifiers need to adapt. Conventional active learning strategies concentrate on querying the most uncertain instances, which are typically concentrated around the decision boundary. If changes do not occur close to the boundary, they will be missed and classifiers will fail to adapt. In this paper we develop two active learning strategies for streaming data that explicitly handle concept drift. They are based on uncertainty, dynamic allocation of labeling efforts over time and randomization of the search space. We empirically demonstrate that these strategies react well to changes that can occur anywhere in the instance space and unexpectedly.


[328]

MOA: A RealTime Analytics Open Source Framework
Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Jesse Read, Philipp Kranen,
Hardy Kremer, Timm Jansen, and Thomas Seidl.
Moa: A realtime analytics open source framework.
In Machine Learning and Knowledge Discovery in Databases 
European Conference, ECML PKDD 2011, Athens, Greece, September 59, 2011,
Proceedings, Part III, pages 617620. Springer, 2011.
[ bib 
http ]
Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA is designed to deal with the challenging problems of scaling up the implementation of state of the art algorithms to real world dataset sizes and of making algorithms comparable in benchmark streaming settings. It contains a collection of offline and online algorithms for classification, clustering and graph mining as well as tools for evaluation. For researchers the framework yields insights into advantages and disadvantages of different approaches and allows for the creation of benchmark streaming data sets through stored, shared and repeatable settings for the data feeds. Practitioners can use the framework to easily compare algorithms and apply them to real world data sets and settings. MOA supports bidirectional interaction with WEKA, the Waikato Environment for Knowledge Analysis. Besides providing algorithms and measures for evaluation and comparison, MOA is easily extensible with new contributions and allows for the creation of benchmark scenarios.


[329]

Online Evaluation of Email Streaming Classifiers Using GNUsmail
José M. CarmonaCejudo, Manuel BaenaGarcía, José del
CampoÁvila, Albert Bifet, Joao Gama, and Rafael Morales Bueno.
Online evaluation of email streaming classifiers using gnusmail.
In Advances in Intelligent Data Analysis X  10th International
Symposium, IDA 2011, Porto, Portugal, October 2931, 2011. Proceedings,
pages 90100. Springer, 2011.
[ bib 
http ]
Realtime email classification is a challenging task because of its online nature, subject to conceptdrift. Identifying spam, where only two labels exist, has received great attention in the literature. We are nevertheless interested in classification involving multiple folders, which is an additional source of complexity. Moreover, neither crossvalidation nor other sampling procedures are suitable for data streams evaluation. Therefore, other metrics, like the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using mechanisms such as fading factors. In this paper we present GNUsmail, an opensource extensible framework for email classification, and focus on its ability to perform online evaluation. GNUsmail’s architecture supports incremental and online learning, and it can be used to compare different online mining methods, using stateofart evaluation metrics. We show how GNUsmail can be used to compare different algorithms, including a tool for launching replicable experiments.


[330]

Mining frequent closed graphs on evolving data streams
Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Ricard Gavaldà.
Mining frequent closed graphs on evolving data streams.
In Proceedings of the 17th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 2124,
2011, pages 591599. ACM, 2011.
[ bib 
http ]
Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in realtime. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this paper we present a framework for studying graph pattern mining on timevarying streams. Three new methods for mining frequent closed subgraphs are presented. All methods work on coresets of closed subgraphs, compressed representations of graph sets, and maintain these sets in a batchincremental manner, but use different approaches to address potential concept drift. An evaluation study on datasets comprising up to four million graphs explores the strength and limitations of the proposed methods. To the best of our knowledge this is the first work on mining frequent closed subgraphs in nonstationary data streams.


[331]

An effective evaluation measure for clustering on evolving
data streams
Hardy Kremer, Philipp Kranen, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff
Holmes, and Bernhard Pfahringer.
An effective evaluation measure for clustering on evolving data
streams.
In Proceedings of the 17th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 2124,
2011, pages 868876. ACM, 2011.
[ bib 
http ]
Due to the ever growing presence of data streams, there has been a considerable amount of research on stream mining algorithms. While many algorithms have been introduced that tackle the problem of clustering on evolving data streams, hardly any attention has been paid to appropriate evaluation measures. Measures developed for static scenarios, namely structural measures and groundtruthbased measures, cannot correctly reflect errors attributable to emerging, splitting, or moving clusters. These situations are inherent to the streaming context due to the dynamic changes in the data distribution. In this paper we develop a novel evaluation measure for stream clustering called Cluster Mapping Measure (CMM). CMM effectively indicates different types of errors by taking the important properties of evolving data streams into account. We show in extensive experiments on real and synthetic data that CMM is a robust measure for stream clustering evaluation.


[332]

Mining frequent closed trees in evolving data streams
Albert Bifet and Ricard Gavaldà.
Mining frequent closed trees in evolving data streams.
Intell. Data Anal., 15(1):2948, 2011.
[ bib 
http ]
We propose new algorithms for adaptively mining closed rooted trees, both labeled and unlabeled, from data streams that change over time. Closed patterns are powerful representatives of frequent patterns, since they eliminate redundant information. Our approach is based on an advantageous representation of trees and a lowcomplexity notion of relaxed closed trees, as well as ideas from Galois Lattice Theory. More precisely, we present three closed tree mining algorithms in sequence: an incremental one, IncTreeMiner, a slidingwindow based one, WinTreeMiner, and finally one that mines closed trees adaptively from data streams, AdaTreeMiner. By adaptive we mean here that it presents at all times the closed trees that are frequent in the current state of the data stream. To the best of our knowledge this is the first work on mining closed frequent trees in streaming data varying with time. We give a first experimental evaluation of the proposed algorithms.


[333]

Using GNUsmail to Compare Data Stream Mining Methods for
Online Email Classification
José M. CarmonaCejudo, Manuel BaenaGarcía, Rafael Morales Bueno,
Joao Gama, and Albert Bifet.
Using GNUsmail to compare data stream mining methods for online
email classification.
Journal of Machine Learning Research  Proceedings Track,
17:1218, 2011.
[ bib 
.pdf ]
Realtime classification of emails is a challenging task because of its online nature, and also because email streams are subject to concept drift. Identifying email spam, where only two different labels or classes are defined (spam or not spam), has received great attention in the literature. We are nevertheless interested in a more specific classification where multiple folders exist, which is an additional source of complexity: the class can have a very large number of different values. Moreover, neither crossvalidation nor other sampling procedures are suitable for evaluation in data stream contexts, which is why other metrics, like the prequential error, have been proposed. In this paper, we present GNUsmail, an opensource extensible framework for email classification, and we focus on its ability to perform online evaluation. GNUsmails architecture supports incremental and online learning, and it can be used to compare different data stream mining methods, using stateofart online evaluation metrics. Besides describing the framework, characterized by two overlapping phases, we show how it can be used to compare different algorithms in order to find the most appropriate one. The GNUsmail source code includes a tool for launching replicable experiments.


[334]

Streaming Multilabel Classification
Jesse Read, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer.
Streaming multilabel classification.
Journal of Machine Learning Research  Proceedings Track,
17:1925, 2011.
[ bib 
.pdf ]
This paper presents a new experimental framework for studying multilabel evolving stream classification, with efficient methods that combine the best practices in streaming scenarios with the best practices in multilabel classification. Many real world problems involve data which can be considered as multilabel data streams. Efficient methods exist for multilabel classification in non streaming scenarios. However, learning in evolving streaming scenarios is more challenging, as the learners must be able to adapt to change using limited time and memory. We present a new experimental software that extends the MOA framework. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. It is released under the GNU GPL license.


[335]

MOA Concept Drift Active Learning Strategies for Streaming
Data
Indre Zliobaite, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer.
MOA concept drift active learning strategies for streaming data.
Journal of Machine Learning Research  Proceedings Track,
17:4855, 2011.
[ bib 
.pdf ]
We present a framework for active learning on evolving data streams, as an extension to the MOA system. In learning to classify streaming data, obtaining the true labels may require major effort and may incur excessive cost. Active learning focuses on learning an accurate model with as few labels as possible. Streaming data poses additional challenges for active learning, since the data distribution may change over time (concept drift) and classifiers need to adapt. Conventional active learning strategies concentrate on querying the most uncertain instances, which are typically concentrated around the decision boundary. If changes do not occur close to the boundary, they will be missed and classifiers will fail to adapt. We propose a software system that implements active learning strategies, extending the MOA framework. This software is released under the GNU GPL license.


[336]

Detecting Sentiment Change in Twitter Streaming Data
Albert Bifet, Geoffrey Holmes, Bernhard Pfahringer, and Ricard Gavaldà.
Detecting sentiment change in Twitter streaming data.
Journal of Machine Learning Research  Proceedings Track,
17:511, 2011.
[ bib 
.pdf ]
MOATweetReader is a realtime system to read tweets in real time, to detect changes, and to find the terms whose frequency changed. Twitter is a microblogging service built to discover what is happening at any moment in time, anywhere in the world. Twitter messages are short, and generated constantly, and well suited for knowledge discovery using data stream mining. MOATweetReader is a software extension to the MOA framework. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams.


[337]

Combining Modifications to Multinomial Naive Bayes for Text
Classification
Antti Puurula.
Combining modifications to multinomial naive bayes for text
classification.
In Proc 8th Asia Information Retrieval Societies Conference,
Tianjin, China, pages 114125, 2012.
[ bib 
http ]
Multinomial Naive Bayes (MNB) is a preferred classifier for many text classification tasks, due to simplicity and trivial scaling to large scale tasks. However, in terms of classification accuracy it has a performance gap to modern discriminative classifiers, due to strong data assumptions. This paper explores the optimized combination of popular modifications to generative models in the context of MNB text classification. In order to optimize the introduced classifier metaparameters, we explore direct search optimization using random search algorithms. We evaluate 7 basic modifications and 4 search algorithms across 5 publicly availably available datasets, and give comparisons to similarly optimized Multiclass Support Vector Machine (SVM) classifiers. The use of optimized modifications results in over 20% mean reduction in classification errors compared to baseline MNB models, reducing the gap between SVM and MNB mean performance by over 60%. Some of the individual modifications are shown to have substantial and significant effects, while differences between the random search algorithms are smaller and not statistically significant. The evaluated modifications are potentially applicable to many applications of generative text modeling, where similar performance gains can be achieved.


[338]

Scalable Text Classification with Sparse Generative Modeling
Antti Puurula.
Scalable text classification with sparse generative modeling.
In Proc 12th Pacific Rim International Conference on Artificial
Intelligence, Kuching, Malaysia, pages 458469, 2012.
[ bib 
http ]
Machine learning technology faces challenges in handling Big Data: vast volumes of online data such as web pages, news stories and articles. A dominant solution has been parallelization, but this does not make the tasks less challenging. An alternative solution is using sparse computation methods to fundamentally change the complexity of the processing tasks themselves. This can be done by using both the sparsity found in natural data and sparsified models. In this paper we show that sparse representations can be used to reduce the time complexity of generative classifiers to build fundamentally more scalable classifiers. We reduce the time complexity of Multinomial Naive Bayes classification with sparsity and show how to extend these findings into three multilabel extensions: Binary Relevance, Label Powerset and Multilabel Mixture Models. To provide competitive performance we provide the methods with smoothing and pruning modifications and optimize model metaparameters using direct search optimization. We report on classification experiments on 5 publicly available datasets for largescale multilabel classification. All three methods scale easily to the largest available tasks, with training times measured in seconds and classification times in milliseconds, even with millions of training documents, features and classes. The presented sparse modeling techniques should be applicable to many other classifiers, providing the same types of fundamental complexity reductions when applied to large scale tasks.


[339]

Ensembles of Sparse Multinomial Classifiers for Scalable Text Classification
Antti Puurula and Albert Bifet.
Ensembles of sparse multinomial classifiers for scalable text
classification.
In Proc ECML PKDD 2012 Workshop on LargeScale Hierarchical
Classification, Bristol, UK, 2012.
[ bib 
.pdf ]
Machine learning techniques face new challenges in scalability to largescale tasks. Many of the existing algorithms are unable to
scale to potentially millions of features and structured classes encountered in webscale datasets such as Wikipedia. The third Large Scale
Hierarchical Text Classification evaluation (LSHTC3) evaluated systems
for multilabel hierarchical categorization of Wikipedia articles. In this
paper we present a broad overview of our system in the evaluation, performing among the top systems in the challenge. We describe the several
new modeling ideas we used that make text classification systems both
more effective and scalable. These are: reduction of inference time complexity for probabilistic classifiers using inverted indices, classifier modifications optimized with direct search algorithms, ensembles of diverse
multilabel classifiers and a novel featureregression based method for
scalable ensemble combination.


[340]

Online Estimation of Discrete Densities using Classifier Chains
Michael Geilke, Eibe Frank, and Stefan Kramer.
Online estimation of discrete densities using classifier chains.
In Proc ECML PKDD 2012 Workshop on Instant Interactive Data
Mining, Bristol, UK, 2012.
[ bib 
.pdf ]
We propose an approach to estimate a discrete joint density
online, that is, the algorithm is only provided the current example, its
current estimate, and a limited amount of memory. To design an online estimator for discrete densities, we use classifier chains to model
dependencies among features. Each classifier in the chain estimates the
probability of one particular feature. Because a single chain may not pro
vide a reliable estimate, we also consider ensembles of classifier chains.
Our experiments on synthetic data show that the approach is feasible
and the estimated densities approach the true, known distribution with
increasing amounts of data.


[341]

Learning a Conceptbased Document Similarity Measure
Lan Huang, David Milne, Eibe Frank, and Ian H. Witten.
Learning a conceptbased document similarity measure.
Journal of the American Society for Information Science and
Technology, 2012.
[ bib 
.pdf ]
Document similarity measures are crucial components of many text analysis tasks, including information retrieval, document classification, and document clustering. Conventional measures are brittle: they estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. We propose a new measure that assesses similarity at both the lexical and semantic levels, and learns from human judgments how to combine them by using machine learning techniques. Experiments show that the new measure produces values for documents that are more consistent with people’s judgments than people are with each other. We also use it to classify and cluster large document sets covering different genres and topics, and find that it improves both classification and clustering performance.


[342]

Ensembles of Restricted Hoeffding Trees
Albert Bifet, Eibe Frank, Geoffrey Holmes, and Bernhard Pfahringer.
Ensembles of restricted hoeffding trees.
ACM Transactions on Intelligent Systems and Technology, 3(2),
2012.
[ bib 
.pdf ]
The success of simple methods for classification shows that is is often not necessary to model complex attribute interactions to obtain good classification accuracy on practical problems. In this paper, we propose to exploit this phenomenon in the data stream context by building an ensemble of Hoeffding trees that are each limited to a small subset of attributes. In this way, each tree is restricted to model interactions between attributes in its corresponding subset. Because it is not known a priori which attribute subsets are relevant for prediction, we build exhaustive ensembles that consider all possible attribute subsets of a given size. As the resulting Hoeffding trees are not all equally important, we weigh them in a suitable manner to obtain accurate classifications. This is done by combining the logodds of their probability estimates using sigmoid perceptrons, with one perceptron per class. We propose a mechanism for setting the perceptrons' learning rate using the ADWIN change detection method for data streams, and also use ADWIN to reset ensemble members (i.e. Hoeffding trees) when they no longer perform well. Our experiments show that the resulting ensemble classifier outperforms bagging for data streams in terms of accuracy when both are used in conjunction with adaptive naive Bayes Hoeffding trees, at the expense of runtime and memory consumption. We also show that our stacking method can improve the performance of a bagged ensemble.


[343]

Experiment databases  A new way to share, organize and
learn from experiments
Joaquin Vanschoren, Hendrik Blockeel, Bernhard Pfahringer, and Geoffrey Holmes.
Experiment databases  a new way to share, organize and learn from
experiments.
Machine Learning, 87(2):127158, 2012.
[ bib ]
Thousands of machine learning research papers contain extensive experimental comparisons. However, the details of those experiments are often lost after publication, making it impossible to reuse these experiments in further research, or reproduce them to verify the claims made. In this paper, we present a collaboration framework designed to easily share machine learning experiments with the community, and automatically organize them in public databases. This enables immediate reuse of experiments for subsequent, possibly much broader investigation and offers faster and more thorough analysis based on a large set of varied results. We describe how we designed such an experiment database, currently holding over 650,000 classification experiments, and demonstrate its use by answering a wide range of interesting research questions and by verifying a number of recent studies.


[344]

Scalable and efficient multilabel classification for evolving
data streams
Jesse Read, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer.
Scalable and efficient multilabel classification for evolving data
streams.
Machine Learning, 88(12):243272, 2012.
[ bib ]
Many challenging real world problems involve multilabel data streams. Efficient methods exist for multilabel classification in nonstreaming scenarios. However, learning in evolving streaming scenarios is more challenging, as classifiers must be able to deal with huge numbers of examples and to adapt to change using limited time and memory while being ready to predict at any point.
This paper proposes a new experimental framework for learning and evaluating on multilabel data streams, and uses it to study the performance of various methods. From this study, we develop a multilabel Hoeffding tree with multilabel classifiers at the leaves. We show empirically that this method is well suited to this challenging task. Using our new framework, which allows us to generate realistic multilabel data streams with concept drift (as well as real data), we compare with a selection of baseline methods, as well as new learning methods from the literature, and show that our Hoeffding tree method achieves fast and more accurate performance.


[345]

Bagging Ensemble Selection for Regression
Quan Sun and Bernhard Pfahringer.
Bagging ensemble selection for regression.
In Australasian Conference on Artificial Intelligence, pages
695706. Springer, 2012.
[ bib ]
Bagging ensemble selection (BES) is a relatively new ensemble learning strategy. The strategy can be seen as an ensemble of the ensemble selection from libraries of models (ES) strategy. Previous experimental results on binary classification problems have shown that using random trees as base classifiers, BESOOB (the most successful variant of BES) is competitive with (and in many cases, superior to) other ensemble learning strategies, for instance, the original ES algorithm, stacking with linear regression, random forests or boosting. Motivated by the promising results in classification, this paper examines the predictive performance of the BESOOB strategy for regression problems. Our results show that the BESOOB strategy outperforms Stochastic Gradient Boosting and Bagging when using regression trees as the base learners. Our results also suggest that the advantage of using a diverse model library becomes clear when the model library size is relatively large. We also present encouraging results indicating that the nonnegative least squares algorithm is a viable approach for pruning an ensemble of ensembles.


[346]

Stream Data Mining Using the MOA Framework
Philipp Kranen, Hardy Kremer, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff
Holmes, Bernhard Pfahringer, and Jesse Read.
Stream data mining using the moa framework.
In Proceedings Database Systems for Advanced Applications,
pages 309313. Springer, 2012.
[ bib ]
Massive Online Analysis (MOA) is a software framework that provides algorithms and evaluation methods for mining tasks on evolving data streams. In addition to supervised and unsupervised learning, MOA has recently been extended to support multilabel classification and graph mining. In this demonstrator we describe the main features of MOA and present the newly added methods for outlier detection on streaming data. Algorithms can be compared to established baseline methods such as LOF and ABOD using standard ranking measures including Spearman rank coefficient and the AUC measure. MOA is an open source project and videos as well as tutorials are publicly available on the MOA homepage.


[347]

Full model selection in the space of data mining operators
Quan Sun, Bernhard Pfahringer, and Michael Mayo.
Full model selection in the space of data mining operators.
In Proceedings Genetic and Evolutionary Computation Conference,
pages 15031504. ACM, 2012.
[ bib ]
We propose a framework and a novel algorithm for the full model selection (FMS) problem. The proposed algorithm, combining both genetic algorithms (GA) and particle swarm optimization (PSO), is named GPS (which stands for GAPSOFMS), in which a GA is used for searching the optimal structure of a data mining solution, and PSO is used for searching the optimal parameter set for a particular structure instance. Given a classification or regression problem, GPS outputs a FMS solution as a directed acyclic graph consisting of diverse data mining operators that are applicable to the problem, including data cleansing, data sampling, feature transformation/selection and algorithm operators. The solution can also be represented graphically in a human readable form. Experimental results demonstrate the benefit of the algorithm.


[348]

BatchIncremental versus InstanceIncremental Learning in
Dynamic and Evolving Data
Jesse Read, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes.
Batchincremental versus instanceincremental learning in dynamic and
evolving data.
In Proceedings Symposium on Advances in Intelligent Data
Analysis, pages 313323, 2012.
[ bib ]
Many real world problems involve the challenging context of data streams, where classifiers must be incremental: able to learn from a theoreticallyinfinite stream of examples using limited time and memory, while being able to predict at any point. Two approaches dominate the literature: batchincremental methods that gather examples in batches to train models; and instanceincremental methods that learn from each example as it arrives. Typically, papers in the literature choose one of these approaches, but provide insufficient evidence or references to justify their choice. We provide a first indepth analysis comparing both approaches, including how they adapt to concept drift, and an extensive empirical study to compare several different versions of each approach. Our results reveal the respective advantages and disadvantages of the methods, which we discuss in detail.


[349]

Maximum Common Subgraph based locally weighted regression
Madeleine Seeland, Fabian Buchwald, Stefan Kramer, and Bernhard Pfahringer.
Maximum common subgraph based locally weighted regression.
In Proceedings of the ACM Symposium on Applied Computing, pages
165172. ACM, 2012.
[ bib ]
This paper investigates a simple, yet effective method for regression on graphs, in particular for applications in cheminformatics and for quantitative structureactivity relationships (QSARs). The method combines Locally Weighted Learning (LWL) with Maximum Common Subgraph (MCS) based graph distances. More specifically, we investigate a variant of locally weighted regression on graphs (structures) that uses the maximum common subgraph for determining and weighting the neighborhood of a graph and feature vectors for the actual regression model. We show that this combination, LWLMCS, outperforms other methods that use the local neighborhood of graphs for regression. The performance of this method on graphs suggests it might be useful for other types of structured data as well.


[350]

Multilabel classification using boolean matrix decomposition
Jörg Wicker, Bernhard Pfahringer, and Stefan Kramer.
Multilabel classification using boolean matrix decomposition.
In Proceedings of the ACM Symposium on Applied Computing, pages
179186. ACM, 2012.
[ bib ]
This paper introduces a new multilabel classifier based on Boolean matrix decomposition. Boolean matrix decomposition is used to extract, from the full label matrix, latent labels representing useful Boolean combinations of the original labels. Base level models predict latent labels, which are subsequently transformed into the actual labels by Boolean matrix multiplication with the second matrix from the decomposition. The new method is tested on six publicly available datasets with varying numbers of labels. The experimental evaluation shows that the new method works particularly well on datasets with a large number of labels and strong dependencies among them.


[351]

Parameter Tuning Using Gaussian Processes
Jinjin Ma.
Parameter tuning using gaussian processes.
Master's thesis, Department of Computer Science, University of
Waikato, 2012.
[ bib 
http ]
Most machine learning algorithms require us to set up their parameter values before applying these algorithms to solve problems. Appropriate parameter settings will bring good performance while inappropriate parameter settings generally result in poor modelling. Hence, it is necessary to acquire the “best” parameter values for a particular algorithm before building the model. The “best” model not only reflects the “real” function and is well fitted to existing points, but also gives good performance when making predictions for new points with previously unseen values.
A number of methods exist that have been proposed to optimize parameter values. The basic idea of all such methods is a trialanderror process whereas the work presented in this thesis employs Gaussian process (GP) regression to optimize the parameter values of a given machine learning algorithm. In this thesis, we consider the optimization of only twoparameter learning algorithms. All the possible parameter values are specified in a 2dimensional grid in this work. To avoid bruteforce search, Gaussian Process Optimization (GPO) makes use of “expected improvement” to pick useful points rather than validating every point of the grid step by step. The point with the highest expected improvement is evaluated using crossvalidation and the resulting data point is added to the training set for the Gaussian process model. This process is repeated until a stopping criterion is met. The final model is built using the learning algorithm based on the best parameter values identified in this process.
In order to test the effectiveness of this optimization method on regression and classification problems, we use it to optimize parameters of some wellknown machine learning algorithms, such as decision tree learning, support vector machines and boosting with trees. Through the analysis of experimental results obtained on datasets from the UCI repository, we find that the GPO algorithm yields competitive performance compared with a bruteforce approach, while exhibiting a distinct advantage in terms of training time and number of crossvalidation runs. Overall, the GPO method is a promising method for the optimization of parameter values in machine learning.


[352]

An application of data mining to fruit and vegetable sample identification using Gas ChromatographyMass Spectrometry
Geoffrey Holmes, Dale Fletcher, and Peter Reutemann.
An application of data mining to fruit and vegetable sample
identification using gas chromatographymass spectrometry.
In Proceedings of the International Congress on Environmental
Modelling and Software (IEMSS), Leipzig, Germany, 2012.
[ bib 
.pdf ]
One of the uses of Gas ChromatographyMass Spectrometry (GCMS) is in the detection of pesticide residues in fruit and vegetables. In a high throughput laboratory there is the potential for sample swaps or mislabelling, as once a sample has been preprocessed to be injected into the GCMS analyser, it is no longer distinguishable by eye. Possible consequences of such mistakes can be the destruction of large amounts of actually safe produce or pesticidecontaminated produce reaching the consumer. For the purposes of food safety and traceability, it can also be extremely valuable to know the source (country of origin) of a food product. This can help uncover fraudulent attempts of trying to sell food originating from countries deemed unsafe. In this study, we use the workﬂow environment ADAMS to examine whether we can determine the fruit/vegetable, and the country of origin of a sample from a GCMS chromatogram. A workﬂow is used to generate data sets using different data preprocessing methods, and data representations from a database of over 8000 GCMS chromatograms, consisting of more than 100 types of fruit and vegetables from more than 120 countries. A variety of classiﬁcation algorithms are evaluated using the WEKA data mining workbench. We demonstrate excellent results, both for the determination of fruit/vegetable type and for the country of origin, using a histogram of ion counts, and Classiﬁcation by Regression using Random Regression Forest with PLStransformed data.


[353]

Evolutionary Data Selection for Enhancing Models of Intraday
Forex Time Series
Michael Mayo.
Evolutionary data selection for enhancing models of intraday forex
time series.
In Proceedings Applications of Evolutionary Computation, pages
184193. Springer, 2012.
[ bib ]
The hypothesis in this paper is that a significant amount of intraday market data is either noise or redundant, and that if it is eliminated, then predictive models built using the remaining intraday data will be more accurate. To test this hypothesis, we use an evolutionary method (called Evolutionary Data Selection, EDS) to selectively remove out portions of training data that is to be made available to an intraday market predictor. After performing experiments in which dataselected and nondataselected versions of the same predictive models are compared, it is shown that EDS is effective and does indeed boost predictor accuracy. It is also shown in the paper that building multiple models using EDS and placing them into an ensemble further increases performance. The datasets for evaluation are large intraday forex time series, specifically series from the EUR/USD, the USD/JPY and the EUR/JPY markets, and predictive models for two primary tasks per market are built: intraday return prediction and intraday volatility prediction.


[354]

Developing data mining applications
Geoff Holmes.
Developing data mining applications.
In International Conference on Knowledge Discovery and Data
Mining, page 225. ACM, 2012.
[ bib ]
In this talk I will review several realworld applications developed at the University of Waikato over the past 15 years. These include the use of near infrared spectroscopy coupled with data mining as an alternate laboratory technique for predicting compound concentrations in soil and plant samples, and the analysis of gas chromatography mass spectrometry (GCMS) data, a technique used to determine in environmental applications, for example, the petroleum content in soil and water samples. I will then briefly discuss how experience with these applications has led to the development of an opensource framework for application development.


[355]

Scientific Workflow Management with ADAMS
Peter Reutemann and Joaquin Vanschoren.
Scientific workflow management with adams.
In PeterA. Flach, Tijl Bie, and Nello Cristianini, editors,
Machine Learning and Knowledge Discovery in Databases, volume 7524 of
Lecture Notes in Computer Science, pages 833837. Springer Berlin
Heidelberg, 2012.
[ bib 
http ]
We demonstrate the Advanced Data mining And Machine learning System (ADAMS), a novel workflow engine designed for rapid prototyping and maintenance of complex knowledge workflows. ADAMS does not require the user to manually connect inputs to outputs on a large canvas. It uses a compact workflow representation, control operators, and a simple interface between operators, allowing them to be autoconnected. It contains an extensive library of operators for various types of analysis, and a convenient plugin architecture to easily add new ones.
Keywords: scientific workflows; machine learning; data mining


[356]

Integrated Instance and Classbased Generative Modeling for Text Classification
Antti Puurula and SungHyon Myaeng.
Integrated instance and classbased generative modeling for text
classification.
In Proc 18th Australasian Document Computing Symposium, pages
6673. ACM, 2013.
[ bib 
.pdf ]
Statistical methods for text classification are predominantly based on the paradigm of classbased learning that associates class variables with features, discarding the instances of data after model training. This results in efficient models, but neglects the finegrained information present in individual documents. Instancebased learning uses this information, but suffers from data sparsity with text data. In this paper, we propose a generative model called Tied Document Mixture (TDM) for extending Multinomial Naive Bayes (MNB) with mixtures of hierarchically smoothed models for documents. Alternatively, TDM can be viewed as a Kernel Density Classifier using classsmoothed Multinomial kernels. TDM is evaluated for classification accuracy on 14 different datasets for multilabel, multiclass and binaryclass text classification tasks and compared to instance and classbased learning baselines. The comparisons to MNB demonstrate a substantial improvement in accuracy as a function of available training documents per class, ranging up to average error reductions of over 26% in sentiment clas sification and 65% in spam classification. On average TDM is as accurate as the best discriminative classifiers, but retains the linear time complexities of instancebased learning methods, with exact algorithms for both model estimation and inference.


[357]

Cumulative Progress in Language Models for Information Retrieval
Antti Puurula.
Cumulative progress in language models for information retrieval.
In Proc 11th Australasian Language Technology Workshop,
Brisbane, Australia, pages 96100. ACL, 2013.
[ bib 
.pdf ]
The improvements to adhoc IR systems over the last decades have been recently criticized as illusionary and based on incorrect baseline comparisons. In this paper several improvements to the LM approach to IR are combined and evaluated: PitmanYor Process smoothing, TFIDF feature weighting and modelbased feedback. The increases in ranking quality are significant and cumulative over the standard baselines of Dirichlet Prior and 2stage Smoothing, when evaluated across 13 standard adhoc retrieval datasets. The combination of the improvements is shown to improve the Mean Average Precision over the datasets by 17.1% relative. Furthermore, the considered improvements can be easily implemented with little additional computation to existing LM retrieval systems. On the basis of the results it is suggested that LM research for IR should move towards using stronger baseline models.


[358]

Online Estimation of Discrete Densities
Michael Geilke, Eibe Frank, Andreas Karwath, and Stefan Kramer.
Online estimation of discrete densities.
In Proc 13th IEEE International Conference on Data Mining,
Dallas, Texas. IEEE, 2013.
[ bib 
.pdf ]
We address the problem of estimating a discrete joint density online, that is, the algorithm is only provided the current example and its current estimate. The proposed online estimator of discrete densities, EDDO (Estimation of Discrete Densities Online), uses classifier chains to model dependencies among features. Each classifier in the chain estimates the probability of one particular feature. Because a single chain may not provide a reliable estimate, we also consider ensembles of classifier chains and ensembles of weighted classifier chains. For all density estimators, we provide consistency proofs and propose algorithms to perform certain inference tasks. The empirical evaluation of the estimators is conducted in several experiments and on data sets of up to several million instances: We compare them to density estimates computed from Bayesian structure learners, evaluate them under the influence of noise, measure their ability to deal with concept drift, and measure the runtime performance. Our experiments demonstrate that, even though designed to work online, EDDO delivers estimators of competitive accuracy compared to batch Bayesian structure learners and batch variants of EDDO.


[359]

Propositionalisation of Multiinstance Data using Random Forests
Eibe Frank and Bernhard Pfahringer.
Propositionalisation of multiinstance data using random forests.
In Proc 26th Australasian Conference on Artificial
Intelligence, Otago, New Zealand, pages 362373. Springer, 2013.
[ bib 
.pdf ]
Multiinstance learning is a generalisation of attributevalue learning where examples for learning consist of labeled bags (i.e. multisets) of instances. This learning setting is more computationally challenging than attributevalue learning and a natural fit for important application areas of machine learning such as classification of molecules and image classification. One approach to solve multiinstance learning problems is to apply propositionalisation, where bags of data are converted into vectors of attributevalue pairs so that a standard propositional (i.e. attributevalue) learning algorithm can be applied. This approach is attractive because of the large number of propositional learning algorithms that have been developed and can thus be applied to the propositionalised data. In this paper, we empirically investigate a variant of an existing propositionalisation method called TLC. TLC uses a single decision tree to obtain propositionalised data. Our variant applies a random forest instead and is motivated by the potential increase in robustness that this may yield. We present results on synthetic and realworld data from the above two application domains showing that it indeed yields increased classification accuracy when applying boosting and support vector machines to classify the propositionalised data.


[360]

Random Projections as Regularizers: Learning a Linear Discriminant Ensemble from Fewer Observations than Dimensions
Robert J. Durrant and Ata Kabán.
Random projections as regularizers: Learning a linear discriminant
ensemble from fewer observations than dimensions.
In Proc 5th Asian Conference on Machine Learning, Canberra,
Australia. JMLR, 2013.
[ bib 
.pdf ]
We examine the performance of an ensemble of randomlyprojected Fisher Linear Discriminant classifiers, focusing on the case when there are fewer training observations than data
dimensions. Our ensemble is learned from a sequence of randomlyprojected representations of the original high dimensional data and therefore for this approach data can be
collected, stored and processed in such a compressed form.
The specific form and simplicity of this ensemble permits a direct and much more detailed
analysis than existing generic tools in previous works. In particular, we are able to derive the exact form of the generalization error of our ensemble, conditional on the training
set, and based on this we give theoretical guarantees which directly link the performance
of the ensemble to that of the corresponding linear discriminant learned in the full data
space. To the best of our knowledge these are the first theoretical results to prove such
an explicit link for any classifier and classifier ensemble pair. Furthermore we show that
the randomlyprojected ensemble is equivalent to implementing a sophisticated regularization scheme to the linear discriminant learned in the original data space and this prevents
overfitting in conditions of small sample size where pseudoinverse FLD learned in the data
space is provably poor.
We confirm theoretical findings with experiments, and demonstrate the utility of our approach on several datasets from the bioinformatics domain where fewer observations than
dimensions are the norm.


[361]

DimensionAdaptive Bounds on Compressive FLD Classification
Ata Kabán and Robert J. Durrant.
Dimensionadaptive bounds on compressive fld classification.
In Proc 24th International Conference on Algorithmic Learning
Theory, Singapore, pages 294308, 2013.
[ bib 
http 
.pdf ]
Efficient dimensionality reduction by random projections (RP) gains popularity, hence the learning guarantees achievable in RP spaces are of great interest. In finite dimensional setting, it has been shown for the compressive Fisher Linear Discriminant (FLD) classifier that for good generalisation the required target dimension grows only as the log of the number of classes and is not adversely affected by the number of projected data points. However these bounds depend on the dimensionality d of the original data space. In this paper we give further guarantees that remove d from the bounds under certain conditions of regularity on the data density structure. In particular, if the data density does not fill the ambient space then the error of compressive FLD is independent of the ambient dimension and depends only on a notion of ‘intrinsic dimension’.


[362]

Clustering Based Active Learning for Evolving Data Streams
Dino Ienco, Albert Bifet, Indre Zliobaite, and Bernhard Pfahringer.
Clustering based active learning for evolving data streams.
In Proc 16th International Conference on Discovery Science,
Singapore, pages 7993, 2013.
[ bib 
http ]
Data labeling is an expensive and timeconsuming task. Choosing which labels to use is increasingly becoming important. In the active learning setting, a classifier is trained by asking for labels for only a small fraction of all instances. While many works exist that deal with this issue in nonstreaming scenarios, few works exist in the data stream setting. In this paper we propose a new active learning approach for evolving data streams based on a preclustering step, for selecting the most informative instances for labeling. We consider a batch incremental setting: when a new batch arrives, first we cluster the examples, and then, we select the best instances to train the learner. The clustering approach allows to cover the whole data space avoiding to oversample examples from only few areas. We compare our method w.r.t. state of the art active learning strategies over real datasets. The results highlight the improvement in performance of our proposal. Experiments on parameter sensitivity are also reported.


[363]

Pairwise metarules for better metalearningbased algorithm
ranking
Quan Sun and Bernhard Pfahringer.
Pairwise metarules for better metalearningbased algorithm ranking.
Machine Learning, 93(1):141161, 2013.
[ bib 
http ]
In this paper, we present a novel metafeature generation method in the context of metalearning, which is based on rules that compare the performance of individual base learners in a oneagainstone manner. In addition to these new metafeatures, we also introduce a new metalearner called Approximate Ranking Tree Forests (ART Forests) that performs very competitively when compared with several stateoftheart metalearners. Our experimental results are based on a large collection of datasets and show that the proposed new techniques can improve the overall performance of metalearning for algorithm ranking significantly. A key point in our approach is that each performance figure of any base learner for any specific dataset is generated by optimising the parameters of the base learner separately for each dataset.


[364]

Pitfalls in Benchmarking Data Stream Classification and
How to Avoid Them
Albert Bifet, Jesse Read, Indre Zliobaite, Bernhard Pfahringer, and Geoff
Holmes.
Pitfalls in benchmarking data stream classification and how to avoid
them.
In Proc European Conference on Machine Learning and Knowledge
Discovery in Databases, Prague, Czech Republic, pages 465479, 2013.
[ bib 
http ]
Data stream classification plays an important role in modern data analysis, where data arrives in a stream and needs to be mined in real time. In the data stream setting the underlying distribution from which this data comes may be changing and evolving, and so classifiers that can update themselves during operation are becoming the stateoftheart. In this paper we show that data streams may have an important temporal component, which currently is not considered in the evaluation and benchmarking of data stream classifiers. We demonstrate how a naive classifier considering the temporal component only outperforms a lot of current stateoftheart classifiers on real data streams that have temporal dependence, i.e. data is autocorrelated. We propose to evaluate data stream classifiers taking into account temporal dependence, and introduce a new evaluation measure, which provides a more accurate gauge of data stream classifier performance. In response to the temporal dependence issue we propose a generic wrapper for data stream classifiers, which incorporates the temporal component into the attribute space.


[365]

SMOTE for Regression
Luís Torgo, Rita P. Ribeiro, Bernhard Pfahringer, and Paula Branco.
Smote for regression.
In Proc 16th Portuguese Conference on Artificial Intelligence,
Angra do Heroísmo, Azores, Portugal, pages 378389, 2013.
[ bib 
http ]
Several real world prediction problems involve forecasting rare values of a target variable. When this variable is nominal we have a problem of class imbalance that was already studied thoroughly within machine learning. For regression tasks, where the target variable is continuous, few works exist addressing this type of problem. Still, important application areas involve forecasting rare extreme values of a continuous target variable. This paper describes a contribution to this type of tasks. Namely, we propose to address such tasks by sampling approaches. These approaches change the distribution of the given training data set to decrease the problem of imbalance between the rare target cases and the most frequent ones. We present a modification of the wellknown Smote algorithm that allows its use on these regression tasks. In an extensive set of experiments we provide empirical evidence for the superiority of our proposals for these particular regression tasks. The proposed SmoteR method can be used with any existing regression algorithm turning it into a general tool for addressing problems of forecasting rare extreme values of a continuous target variable


[366]

Applying additive logistic regression to data derived from sensors monitoring behavioral and physiological characteristics of dairy cows to detect lameness
Claudia Kamphuis, Eibe Frank, Jennie K. Burke, Gwyn Verkerk, and Jenny Jago.
Applying additive logistic regression to data derived from sensors
monitoring behavioral and physiological characteristics of dairy cows to
detect lameness.
Dairy Science, 2013.
[ bib 
http ]
The hypothesis was that sensors currently available on farm that monitor behavioral and physiological characteristics have potential for the detection of lameness in dairy cows. This was tested by applying additive logistic regression to variables derived from sensor data. Data were collected between November 2010 and June 2012 on 5 commercial pasturebased dairy farms. Sensor data from weigh scales (liveweight), pedometers (activity), and milk meters (milking order, unadjusted and adjusted milk yield in the first 2 min of milking, total milk yield, and milking duration) were collected at every milking from 4,904 cows. Lameness events were recorded by farmers who were trained in detecting lameness before the study commenced. A total of 318 lameness events affecting 292 cows were available for statistical analyses. For each lameness event, the lame cow's sensor data for a time period of 14 d before observation date were randomly matched by farm and date to 10 healthy cows (i.e., cows that were not lame and had no other health event recorded for the matched time period). Sensor data relating to the 14d time periods were used for developing univariable (using one source of sensor data) and multivariable (using multiple sources of sensor data) models. Model development involved the use of additive logistic regression by applying the LogitBoost algorithm with a regression tree as base learner. The model's output was a probability estimate for lameness, given the sensor data collected during the 14d time period. Models were validated using leaveonefarmout crossvalidation and, as a result of this validation, each cow in the data set (318 lame and 3,180 nonlame cows) received a probability estimate for lameness. Based on the area under the curve (AUC), results indicated that univariable models had low predictive potential, with the highest AUC values found for liveweight (AUC = 0.66), activity (AUC = 0.60), and milking order (AUC = 0.65). Combining these 3 sensors improved AUC to 0.74. Detection performance of this combined model varied between farms but it consistently and significantly outperformed univariable models across farms at a fixed specificity of 80%. Still, detection performance was not high enough to be implemented in practice on large, pasturebased dairy farms. Future research may improve performance by developing variables based on sensor data of liveweight, activity, and milking order, but that better describe changes in sensor data patterns when cows go lame.


[367]

Towards a Framework for Designing Full Model Selection and
Optimization Systems
Quan Sun, Bernhard Pfahringer, and Michael Mayo.
Towards a framework for designing full model selection and
optimization systems.
In Proc 11th International Workshop on Multiple Classifier
Systems, Nanjing, China, pages 259270. Springer, 2013.
[ bib 
http ]
People from a variety of industrial domains are beginning to realise that appropriate use of machine learning techniques for their data mining projects could bring great benefits. Endusers now have to face the new problem of how to choose a combination of data processing tools and algorithms for a given dataset. This problem is usually termed the Full Model Selection (FMS) problem. Extended from our previous work [10], in this paper, we introduce a framework for designing FMS algorithms. Under this framework, we propose a novel algorithm combining both genetic algorithms (GA) and particle swarm optimization (PSO) named GPS (which stands for GAPSOFMS), in which a GA is used for searching the optimal structure for a data mining solution, and PSO is used for searching optimal parameters for a particular structure instance. Given a classification dataset, GPS outputs a FMS solution as a directed acyclic graph consisting of diverse data mining operators that are available to the problem. Experimental results demonstrate the benefit of the algorithm. We also present, with detailed analysis, two modeltreebased variants for speeding up the GPS algorithm.


[368]

Predicting Regression Test Failures Using Genetic AlgorithmSelected
Dynamic Performance Analysis Metrics
Michael Mayo and Simon A. Spacey.
Predicting regression test failures using genetic algorithmselected
dynamic performance analysis metrics.
In Proc 5th International Symposium on Search Based Software
Engineering, St. Petersburg, Russia, pages 158171, 2013.
[ bib 
http ]
A novel framework for predicting regression test failures is proposed. The basic principle embodied in the framework is to use performance analysis tools to capture the runtime behaviour of a program as it executes each test in a regression suite. The performance information is then used to build a dynamically predictive model of test outcomes. Our framework is evaluated using a genetic algorithm for dynamic metric selection in combination with stateoftheart machine learning classifiers. We show that if a program is modified and some tests subsequently fail, then it is possible to predict with considerable accuracy which of the remaining tests will also fail which can be used to help prioritise tests in time constrained testing environments.


[369]

Automatic construction of lexicons, taxonomies, ontologies,
and other knowledge structures
Olena Medelyan, Ian H. Witten, Anna Divoli, and Jeen Broekstra.
Automatic construction of lexicons, taxonomies, ontologies, and other
knowledge structures.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery, 3(4):257279, 2013.
[ bib 
http ]
Abstract, structured, representations of knowledge such as lexicons, taxonomies, and ontologies have proven to be powerful resources not only for the systematization of knowledge in general, but to support practical technologies of document organization, information retrieval, natural language understanding, and questionanswering systems. These resources are extremely time consuming for people to create and maintain, yet demand for them is growing, particularly in specialized areas ranging from legacy documents of large enterprises to rapidly changing domains such as current affairs and celebrity news. Consequently, researchers are investigating methods of creating such structures automatically from document collections, calling on the proliferation of interlinked resources already available on the web for background knowledge and general information about the world. This review surveys what is possible, and also outlines current research directions.


[370]

Towards large scale continuous EDA: a random matrix theory
perspective
Ata Kabán, Jakramate Bootkrajang, and Robert John Durrant.
Towards large scale continuous eda: a random matrix theory
perspective.
In Proc Genetic and Evolutionary Computation Conference,
Amsterdan, The Netherlands, pages 383390, 2013.
[ bib 
http 
.pdf ]
Estimation of distribution algorithms (EDA) are a major branch of evolutionary algorithms (EA) with some unique advantages in principle. They are able to take advantage of correlation structure to drive the search more efficiently, and they are able to provide insights about the structure of the search space. However, model building in high dimensions is extremely challenging and as a result existing EDAs lose their strengths in large scale problems.
Large scale continuous global optimisation is key to many real world problems of modern days. Scaling up EAs to large scale problems has become one of the biggest challenges of the field.
This paper pins down some fundamental roots of the problem and makes a start at developing a new and generic framework to yield effective EDAtype algorithms for large scale continuous global optimisation problems. Our concept is to introduce an ensemble of random projections of the set of fittest search points to low dimensions as a basis for developing a new and generic divideandconquer methodology. This is rooted in the theory of random projections developed in theoretical computer science, and will exploit recent advances of nonasymptotic random matrix theory.


[371]

Sharp Generalization Error Bounds for Randomlyprojected Classifiers
Robert J. Durrant and Ata Kabán.
Sharp generalization error bounds for randomlyprojected classifiers.
In Proc 30th International Conference on Machine Learning,
Atlanta, Georgia, pages 693701. JMLR, 2013.
[ bib 
.pdf ]
We derive sharp bounds on the generalization error of a generic linear classifier trained by empirical risk minimization on randomlyprojected data. We make no restrictive assumptions (such as sparsity or separability) on the data: Instead we use the fact that, in a classification setting, the question of interest is really ‘what is the effect of random projection on the predicted class labels?’ and we therefore derive the exact probability of ‘label flipping’ under Gaussian random projection in order to quantify this effect precisely in our bounds.


[372]

Constructing a Focused Taxonomy from a Document Collection
Olena Medelyan, Steve Manion, Jeen Broekstra, Anna Divoli, AnnaLan Huang, and
Ian H. Witten.
Constructing a focused taxonomy from a document collection.
In Proc 10th European Semantic Web Conference, Montpellier,
France, pages 367381. Springer, 2013.
[ bib 
http ]
We describe a new method for constructing custom taxonomies from document collections. It involves identifying relevant concepts and entities in text; linking them to knowledge sources like Wikipedia, DBpedia, Freebase, and any supplied taxonomies from related domains; disambiguating conflicting concept mappings; and selecting semantic relations that best group them hierarchically. An RDF model supports interoperability of these steps, and also provides a flexible way of including existing NLP tools and further knowledge sources. From 2000 news articles we construct a custom taxonomy with 10,000 concepts and 12,700 relations, similar in structure to manually created counterparts. Evaluation by 15 human judges shows the precision to be 89% and 90% for concepts and relations respectively; recall was 75% with respect to a manually generated taxonomy for the same domain.


[373]

Artificial neural network is highly predictive of outcome in paediatric acute liver failure
Jeremy Rajanayagam, Eibe Frank, Ross W. Shepherd, and Peter J. Lewindon.
Artificial neural network is highly predictive of outcome in
paediatric acute liver failure.
Pediatric Transplantation, 2013.
[ bib 
http ]
Current prognostic models in PALF are unreliable, failing to account for complex, nonlinear relationships existing between multiple prognostic factors. A computational approach using ANN should provide superior modelling to PELDMELD scores. We assessed the prognostic accuracy of PELDMELD scores and ANN in PALF in children presenting to the QLTS, Australia. A comprehensive registrybased data set was evaluated in 54 children (32M, 22F, median age 17 month) with PALF. PELDMELD scores calculated at (i) meeting PALF criteria and (ii) peak. ANN was evaluated using stratified 10fold crossvalidation. Outcomes were classified as good (transplantfree survival) or poor (death or LT) and predictive accuracy compared using AUROC curves. Mean PELDMELD scores were significantly higher in nontransplanted nonsurvivors (i) 37 and (ii) 46 and transplant recipients (i) 32 and (ii) 43 compared to transplantfree survivors (i) 26 and (ii) 30. Threshold PELDMELD scores ≥27 and ≥42, at meeting PALF criteria and peak, gave AUROC 0.71 and 0.86, respectively, for poor outcome. ANN showed superior prediction for poor outcome with AUROC 0.96, sensitivity 82.6%, specificity 96%, PPV 96.2% and NPV 85.7% (cutoff 0.5). ANN is superior to PELDMELD for predicting poor outcome in PALF.


[374]

Identifying Market Price Levels Using Differential Evolution
Michael Mayo.
Identifying market price levels using differential evolution.
In Proc 16th European Conference on Applications of Evolutionary
Computation, Vienna, Austria, pages 203212, 2013.
[ bib 
http ]
Evolutionary data mining is used in this paper to investigate the concept of support and resistance levels in financial markets. Specifically, Differential Evolution is used to learn support/resistance levels from price data. The presence of these levels is then tested in outofsample data. Our results from a set of experiments covering five years worth of daily data across nine different US markets show that there is statistical evidence for price levels in certain markets, and that Differential Evolution can uncover them.


[375]

A direct policysearch algorithm for relational reinforcement learning
Samuel Sarjant.
A direct policysearch algorithm for relational reinforcement
learning.
PhD thesis, Department of Computer Science, University of Waikato,
2013.
[ bib 
http ]
Relational Reinforcement Learning (RRL) is a subfield of machine learning in which a learning agent seeks to maximise a numerical reward within an environment, represented as collections of objects and relations, by performing actions that interact with the environment. The relational representation allows more dynamic environment states than an attributebased representation of reinforcement learning, but this flexibility also creates new problems such as a potentially infinite number of states.
This thesis describes an RRL algorithm named Cerrla that creates policies directly from a set of learned relational conditionaction rules using the CrossEntropy Method (CEM) to control policy creation. The CEM assigns each rule a sampling probability and gradually modifies these probabilities such that the randomly sampled policies consist of ‘better’ rules, resulting in larger rewards received. Rule creation is guided by an inferred partial model of the environment that defines: the minimal conditions needed to take an action, the possible specialisation conditions per rule, and a set of simplification rules to remove redundant and illegal rule conditions, resulting in compact, efficient, and comprehensible policies.
Cerrla is evaluated on four separate environments, where each environment has several different goals. Results show that compared to existing RRL algorithms, Cerrla is able to learn equal or better behaviour in less time on the standard RRL environment. On other larger, more complex environments, it can learn behaviour that is competitive to specialised approaches. The simplified rules and CEM’s bias towards compact policies result in comprehensive and effective relational policies created in a relatively short amount of time.


[376]

Model selection based product kernel learning for regression
on graphs
Madeleine Seeland, Stefan Kramer, and Bernhard Pfahringer.
Model selection based product kernel learning for regression on
graphs.
In Proc 28th Annual ACM Symposium on Applied Computing,
Coimbra, Portugal, pages 136143. ACM, 2013.
[ bib 
http ]
The choice of a suitable graph kernel is intrinsically hard and often cannot be made in an informed manner for a given dataset. Methods for multiple kernel learning offer a possible remedy, as they combine and weight kernels on the basis of a labeled training set of molecules to define a new kernel. Whereas most methods for multiple kernel learning focus on learning convex linear combinations of kernels, we propose to combine kernels in products, which theoretically enables higher expressiveness. In experiments on ten publicly available chemical QSAR datasets we show that product kernel learning is on no dataset significantly worse than any of the competing kernel methods and on average the best method available. A qualitative analysis of the resulting product kernels shows how the results vary from dataset to dataset.


[377]

Efficient data stream classification via probabilistic adaptive
windows
Albert Bifet, Bernhard Pfahringer, Jesse Read, and Geoff Holmes.
Efficient data stream classification via probabilistic adaptive
windows.
In Proc 28th Annual ACM Symposium on Applied Computing,
Coimbra, Portugal, pages 801806. ACM, 2013.
[ bib 
http ]
In the context of a data stream, a classifier must be able to learn from a theoreticallyinfinite stream of examples using limited time and memory, while being able to predict at any point. Many methods deal with this problem by basing their model on a window of examples. We introduce a probabilistic adaptive window (PAW) for datastream learning, which improves this windowing technique with a mechanism to include older examples as well as the most recent ones, thus maintaining information on past concept drifts while being able to adapt quickly to new ones. We exemplify PAW with lazy learning methods in two variations: one to handle concept drift explicitly, and the other to add classifier diversity using an ensemble. Along with the standard measures of accuracy and time and memory use, we compare classifiers against stateoftheart classifiers from the datastream literature.


[378]

An opensource toolkit for mining Wikipedia
David N. Milne and Ian H. Witten.
An opensource toolkit for mining wikipedia.
Artificial Intelligence, 194:222239, 2013.
[ bib 
http ]
The online encyclopedia Wikipedia is a vast, constantly evolving tapestry of interlinked articles. For developers and researchers it represents a giant multilingual database of concepts and semantic relations, a potential resource for natural language processing and many other research areas. This paper introduces the Wikipedia Miner toolkit, an opensource software system that allows researchers and developers to integrate Wikipedia's rich semantics into their own applications. The toolkit creates databases that contain summarized versions of Wikipedia's content and structure, and includes a Java API to provide access to them. Wikipedia's articles, categories and redirects are represented as classes, and can be efficiently searched, browsed, and iterated over. Advanced features include parallelized processing of Wikipedia dumps, machinelearned semantic relatedness measures and annotation features, and XMLbased web services. Wikipedia Miner is intended to be a platform for sharing data mining techniques.


[379]

Improving BagsofWords model for object categorization
Edmond Yiwen Zhang.
Improving BagsofWords model for object categorization.
PhD thesis, Department of Computer Science, University of Waikato,
2013.
[ bib 
http ]
In the past decade, BagsofWords (BOW) models have become popular for the task of object recognition, owing to their good performance and simplicity. Some of the most effective recent methods for computerbased object recognition work by detecting and extracting local image features, before quantizing them according to a codebook rule such as kmeans clustering, and classifying these with conventional classifiers such as Support Vector Machines and Naive Bayes.
In this thesis, a Spatial Object Recognition Framework is presented that consists of the four main contributions of the research.
The first contribution, frequent keypoint pattern discovery, works by combining pairs and triplets of frequent keypoints in order to discover intermediate representations for object classes. Based on the same frequent keypoints principle, algorithms for locating the regionofinterest in training images is then discussed.
Extensions to the successful Spatial Pyramid Matching scheme, in order to better capture spatial relationships, are then proposed. The pairs frequency histogram and shapes frequency histogram work by capturing more redefined spatial information between local image features.
Finally, alternative techniques to Spatial Pyramid Matching for capturing spatial information are presented. The proposed techniques, variations of binned logpolar histograms, divides the image into grids of different scale and different orientation. Thus captures the distribution of image features both in distance and orientation explicitly.
Evaluations on the framework are focused on several recent and popular datasets, including image retrieval, object recognition, and object categorization. Overall, while the effectiveness of the framework is limited in some of the datasets, the proposed contributions are nevertheless powerful improvements of the BOW model.


[380]

Random projections as regularizers: learning a linear discriminant from fewer observations than dimensions
RJ Durrant and A Kaban.
Random projections as regularizers: learning a linear discriminant
from fewer observations than dimensions.
Machine Learning, 99(2):257286, 2014.
[ bib 
http ]
We prove theoretical guarantees for an averagingensemble of randomly projected Fisher linear discriminant classifiers, focusing on the case when there are fewer training observations than data dimensions. The specific form and simplicity of this ensemble permits a direct and much more detailed analysis than existing generic tools in previous works. In particular, we are able to derive the exact form of the generalization error of our ensemble, conditional on the training set, and based on this we give theoretical guarantees which directly link the performance of the ensemble to that of the corresponding linear discriminant learned in the full data space. To the best of our knowledge these are the first theoretical results to prove such an explicit link for any classifier and classifier ensemble pair. Furthermore we show that the randomly projected ensemble is equivalent to implementing a sophisticated regularization scheme to the linear discriminant learned in the original data space and this prevents overfitting in conditions of small sample size where pseudoinverse FLD learned in the data space is provably poor. Our ensemble is learned from a set of randomly projected representations of the original high dimensional data and therefore for this approach data can be collected, stored and processed in such a compressed form. We confirm our theoretical findings with experiments, and demonstrate the utility of our approach on several
datasets from the bioinformatics domain and one very high dimensional dataset from the drug discovery domain, both settings in which fewer observations than dimensions are the norm.


[381]

Fully Supervised Training of Gaussian Radial Basis Function Networks in WEKA
Eibe Frank.
Fully supervised training of Gaussian radial basis function
networks in WEKA.
Technical Report 04/14, Department of Computer Science, University of
Waikato, 2014.
[ bib 
.pdf ]
Radial basis function networks are a type of feedforward network with a long history in machine learning. In spite of this, there is relatively little literature on how to train them so that accurate predictions are obtained. A common strategy is to train the hidden layer of the network using kmeans clustering and the output layer using supervised learning. However, Wettschereck and Dietterich [2] found that supervised training of hidden layer parameters can improve predictive performance. They investigated learning center locations, local variances of the basis functions, and attribute weights, in a supervised manner. This document discusses supervised training of Gaussian radial basis function networks in the WEKA machine learning software. More specifically, we discuss the RBFClassifier and RBFRegressor classes available as part of the RBFNetwork package for WEKA 3.7 and consider (a) learning of center locations and one global variance parameter, (b) learning of center locations and one local variance parameter per basis function, and (c) learning center locations with perattribute local variance parameters. We also consider learning attribute weights jointly with other parameters.


[382]

DNA methylationassociated colonic mucosal immune and defense responses in treatmentnaïve pediatric ulcerative colitis
R Alan Harris, Dorottya NagySzakal, Sabina AV Mir, Eibe Frank, Reka Szigeti,
Jess L Kaplan, Jiri Bronsky, Antone Opekun, George D Ferry, Harland Winter,
and Richard Kellermayer.
DNA methylationassociated colonic mucosal immune and defense
responses in treatmentnaïve pediatric ulcerative colitis.
Epigenetics, 9(8):11311137, 2014.
[ bib 
http ]
Inflammatory bowel diseases (IBD) are emerging globally, indicating that environmental factors may be important in their pathogenesis. Colonic mucosal epigenetic changes, such as DNa methylation, can occur in response to the environment and have been implicated in IBD pathology. However, mucosal DNa methylation has not been examined in treatmentnaive patients. We studied DNa methylation in untreated, left sided colonic biopsy specimens using the Infinium humanMethylation450 Beadchip array. We analyzed 22 control (c) patients, 15 untreated crohn’s disease (cD) patients, and 9 untreated ulcerative colitis (Uc) patients from two cohorts. Samples obtained at the time of clinical remission from two of the treatmentnaive Uc patients were also included into the analysis. Ucspecific gene expression was interrogated in a subset of adjacent samples (5 c and 5 Uc) using the affymetrix Genechip PrimeView human Gene expression arrays. Only treatmentnaive Uc separated from control. Onehundredand20 genes with significant expression change in Uc (> 2fold, P < 0.05) were associated with differentially methylated regions (DMRs). epigenetically associated gene expression changes (including gene expression changes in the IFITM1, ITGB2, S100A9, SLPI, SAA1, and STAT3 genes) were linked to colonic mucosal immune and defense responses. These findings underscore the relationship between epigenetic changes and inflammation in pediatric treatmentnaive Uc and may have potential etiologic, diagnostic, and therapeutic relevance for IBD.


[383]

Change detection in categorical evolving data streams
Dino Ienco, Albert Bifet, Bernhard Pfahringer, and Pascal Poncelet.
Change detection in categorical evolving data streams.
In Proc 29th Annual ACM Symposium on Applied Computing, pages
792797. ACM, 2014.
[ bib ]
Detecting change in evolving data streams is a central issue for accurate adaptive learning. In real world applications, data streams have categorical features, and changes induced in the data distribution of these categorical features have not been considered extensively so far. Previous work on change detection focused on detecting changes in the accuracy of the learners, but without considering changes in the data distribution.
To cope with these issues, we propose a new unsupervised change detection method, called CDCStream (Change Detection in Categorical Data Streams), well suited for categorical data streams. The proposed method is able to detect changes in a batch incremental scenario. It is based on the two following characteristics: (i) a summarization strategy is proposed to compress the actual batch by extracting a descriptive summary and (ii) a new segmentation algorithm is proposed to highlight changes and issue warnings for a data stream. To evaluate our proposal we employ it in a learning task over real world data and we compare its results with state of the art methods. We also report qualitative evaluation in order to show the behavior of CDCStream.


[384]

Détection de changements dans des flots de données qualitatives
Dino Ienco, Albert Bifet, Bernhard Pfahringer, and Pascal Poncelet.
Détection de changements dans des flots de données qualitatives.
In Proc 14èmes Journées Francophones Extraction et Gestion des
Connaissances, pages 517520. HermannÉditions, 2014.
[ bib ]
Pour mieux analyser et extraire de la connaissance de flots de données,
des approches spécifiques ont été proposées ces dernières années. L’un des
challenges auquel elles doivent faire face est la détection de changement dans
les données. Alors que de plus en plus de données qualitatives sont générées,
peu de travaux de recherche se sont intéressés à la détection de changement dans
ce contexte et les travaux existants se sont principalement focalisés sur la qualité
d’un modèle appris plutôt qu’au réel changement dans les données. Dans
cet article nous proposons une nouvelle méthode de détection de changement
non supervisée, appelée CDCStream (Change Detection in Categorical Data
Streams), adaptée aux flux de données qualitatives.


[385]

Evolving artificial datasets to improve interpretable classifiers
Michael Mayo and Quan Sun.
Evolving artificial datasets to improve interpretable classifiers.
In Proc 2014 IEEE Congress on Evolutionary Computation, pages
23672374. IEEE, 2014.
[ bib ]
Differential Evolution can be used to construct effective and compact artificial training datasets for machine learning algorithms. In this paper, a series of comparative experiments are performed in which two simple interpretable supervised classifiers (specifically, Naive Bayes and linear Support Vector Machines) are trained (i) directly on “real” data, as would be the normal case, and (ii) indirectly, using special artificial datasets derived from real data via evolutionary optimization. The results across several challenging test problems show that supervised classifiers trained indirectly using our novel evolutionbased approach produce models with superior predictive classification performance. Besides presenting the accuracy of the learned models, we also analyze the sensitivity of our artificial data optimization process to Differential Evolution's parameters, and then we examine the statistical characteristics of the artificial data that is evolved.


[386]

Kaggle LSHTC4 Winning Solution
Antti Puurula, Jesse Read, and Albert Bifet.
Kaggle LSHTC4 winning solution.
CoRR, abs/1405.0546, 2014.
[ bib 
http ]
Our winning submission to the 2014 Kaggle competition for Large Scale Hierarchical Text Classification (LSHTC) consists mostly of an ensemble of sparse generative models extending Multinomial Naive Bayes. The baseclassifiers consist of hierarchically smoothed models combining document, label, and hierarchy level Multinomials, with feature preprocessing using variants of TFIDF and BM25. Additional diversification is introduced by different types of folds and random search optimization for different measures. The ensemble algorithm optimizes macroFscore by predicting the documents for each label, instead of the usual prediction of labels per document. Scores for documents are predicted by weighted voting of baseclassifier outputs with a variant of FeatureWeighted Linear Stacking. The number of documents per label is chosen using label priors and thresholding of vote scores. This document describes the models and software used to build our solution. Reproducing the results for our solution can be done by running the scripts included in the Kaggle package. A package omitting precomputed result files is also distributed. All code is open source, released under GNU GPL 2.0, and GPL 3.0 for Weka and Meka dependencies.


[387]

Multilabel Classification with MetaLabels
Jesse Read, Antti Puurula, and Albert Bifet.
Multilabel classification with metalabels.
In Proc 2014 IEEE International Conference on Data Mining,
pages 941946. IEEE, 2014.
[ bib ]
The area of multilabel classification has rapidly developed in recent years. It has become widely known that the baseline binary relevance approach can easily be outperformed by methods which learn labels together. A number of methods have grown around the label powerset approach, which models label combinations together as class values in a multiclass problem. We describe the labelpowersetbased solutions under a general framework of metalabels and provide some theoretical justification for this framework which has been lacking; explaining how metalabels essentially allow a random projection into a space where nonlinearities can easily be tackled with established linear learning algorithms. The proposed framework enables comparison and combination of related approaches to different multilabel problems. We present a novel model in the framework and evaluate it empirically against several highperforming methods, with respect to predictive performance and scalability, on a number of datasets and evaluation metrics. This deployment obtains competitive accuracy for a fraction of the computation required by the current metalabel methods for multilabel classification.


[388]

CrowdSourcing Ontology Content and Curation: The Massive Ontology
Interface
Samuel Sarjant, Catherine Legg, Matt Stannett, and Duncan Willcock.
Crowdsourcing ontology content and curation: The massive ontology
interface.
In Proc 8th International Conference on Formal Ontology in
Information Systems, pages 251260. IOS Press, 2014.
[ bib 
http ]
Crowdsourcing is an increasingly popular approach to building large, complex publicinterest projects. The ontology infrastructure that is required to scaffold the goals of the Semantic Web is such a project. We have been thinking hard about what ‘crowdsourced ontology’ might look like, and are currently advancing on two fronts: usersupplied content and usersupplied curation. We achieve the former by mining 90% of the concepts and relations in our ontology from Wikipedia. However other research groups are also pursuing this strategy (e.g. DBpedia, YAGO). Our claim to be on the cutting edge is in our latter goal. We are building a web portal: The Massive Ontology Interface, for users to interact with our ontology in a clean, syntaxlight format. The interface is designed to enable users to identify errors and add new concepts and assertions, and to discuss the knowledge in the openended way that fosters real collaboration in Wikipedia. We here present our system, discuss the design decisions that have shaped it and the motivation we offer users to interact with it.


[389]

Metalearning and the full model selection problem
Quan Sun.
Metalearning and the full model selection problem.
PhD thesis, Department of Computer Science, University of Waikato,
2014.
[ bib 
http ]
When working as a data analyst, one of my daily tasks is to select appropriate tools from a set of existing data analysis techniques in my toolbox, including data preprocessing, outlier detection, feature selection, learning algorithm and evaluation techniques, for a given data project. This indeed was an enjoyable job at the beginning, because to me finding patterns and valuable information from data is always fun. Things become tricky when several projects needed to be done in a relatively short time.
Naturally, as a computer science graduate, I started to ask myself, What can be automated here?; because, intuitively, part of my work is more or less a loop that can be programmed. Literally, the loop is choose, run, test and choose again... until some criterion/goals are met.
In other words, I use my experience or knowledge about machine learning and data mining to guide and speed up the process of selecting and applying techniques in order to build a relatively good predictive model for a given dataset for some purpose. So the following questions arise:
Is it possible to design and implement a system that helps a data analyst to choose from a set of data mining tools? Or at least that provides a useful recommendation about tools that potentially save some time for a human analyst.
To answer these questions, I decided to undertake a longterm study on this topic, to think, define, research, and simulate this problem before coding my dream system. This thesis presents research results, including new methods, algorithms, and theoretical and empirical analysis from two directions, both of which try to propose systematic and efficient solutions to the questions above, using different resource requirements, namely, the metalearningbased algorithm/parameter ranking approach and the metaheuristic searchbased fullmodel selection approach.
Some of the results have been published in research papers; thus, this thesis also serves as a coherent collection of results in a single volume.


[390]

Hierarchical metarules for scalable metalearning
Quan Sun and Bernhard Pfahringer.
Hierarchical metarules for scalable metalearning.
In DucNghia Pham and SeongBae Park, editors, Proc 13th Pacific
Rim International Conference on Artificial intelligence, pages 383395.
Springer, 2014.
[ bib ]
The Pairwise MetaRules (PMR) method proposed in [18] has been shown to improve the predictive performances of several metalearning algorithms for the algorithm ranking problem. Given m target objects (e.g., algorithms), the training complexity of the PMR method with respect to m is quadratic: (m over 2)=m×(m−1)/2. This is usually not a problem when m is moderate, such as when ranking 20 different learning algorithms. However, for problems with a much larger m, such as the metalearningbased parameter ranking problem, where m can be 100+, the PMR method is less efficient. In this paper, we propose a novel method named Hierarchical MetaRules (HMR), which is based on the theory of orthogonal contrasts. The proposed HMR method has a linear training complexity with respect to m, providing a way of dealing with a large number of objects that the PMR method cannot handle efficiently. Our experimental results demonstrate the benefit of the new method in the context of metalearning.


[391]

Resampling strategies for regression
Luis Torgo, Paula Branco, Rita P. Ribeiro, and Bernhard Pfahringer.
Resampling strategies for regression.
Expert Systems, 32(3), 2014.
[ bib ]
Several real world prediction problems involve forecasting rare values of a target variable. When this variable is nominal, we have a problem of class imbalance that was thoroughly studied within machine learning. For regression tasks, where the target variable is continuous, few works exist addressing this type of problem. Still, important applications involve forecasting rare extreme values of a continuous target variable. This paper describes a contribution to this type of tasks. Namely, we propose to address such tasks by resampling approaches that change the distribution of the given data set to decrease the problem of imbalance between the rare target cases and the most frequent ones. We present two modifications of wellknown resampling strategies for classification tasks: the undersampling and the synthetic minority oversampling technique (SMOTE) methods. These modifications allow the use of these strategies on regression tasks where the goal is to forecast rare extreme values of the target variable. In an extensive set of experiments, we provide empirical evidence for the superiority of our proposals for these particular regression tasks. The proposed resampling methods can be used with any existing regression algorithm, which means that they are general tools for addressing problems of forecasting rare extreme values of a continuous target variable.


[392]

Improvements to BM25 and Language Models Examined
Andrew Trotman, Antti Puurula, and Blake Burgess.
Improvements to BM25 and language models examined.
In Proc 2014 Australasian Document Computing Symposium,
page 58. ACM, 2014.
[ bib ]
Recent work on search engine ranking functions report improvements on BM25 and Language Models with Dirichlet Smoothing. In this investigation 9 recent ranking functions (BM25, BM25+, BM25T, BM25adpt, BM25L, TFlID, LMDS, LMPYP, and LMPYPTFIDF) are compared by training on the INEX 2009 Wikipedia collection and testing on INEX 2010 and 9 TREC collections. We find that once trained (using particle swarm optimi zation) there is very little difference in performance between these functions, that relevance feedback is effective, that stemming is effective, and that it remains unclear which function is best overall.


[393]

WISE 2014 Challenge: Multilabel Classification of Print Media Articles
to Topics
Grigorios Tsoumakas, Apostolos Papadopoulos, Weining Qian, Stavros
Vologiannidis, Alexander D'yakonov, Antti Puurula, Jesse Read, Jan Svec, and
Stanislav Semenov.
WISE 2014 challenge: Multilabel classification of print media
articles to topics.
In Proc 15th International Conference on Web Information Systems
Engineering, pages 541548, 2014.
[ bib 
http ]
The WISE 2014 challenge was concerned with the task of multilabel classification of articles coming from Greek print media. Raw data comes from the scanning of print media, article segmentation, and optical character segmentation, and therefore is quite noisy. Each article is examined by a human annotator and categorized to one or more of the topics being monitored. Topics range from specific persons, products, and companies that can be easily categorized based on keywords, to more general semantic concepts, such as environment or economy. Building multilabel classifiers for the automated annotation of articles into topics can support the work of human annotators by suggesting a list of all topics by order of relevance, or even automate the annotation process for media and/or categories that are easier to predict. This saves valuable time and allows a media monitoring company to expand the portfolio of media being monitored. This paper summarizes the approaches of the top 4 among the 121 teams that participated in the competition.


[394]

Algorithm selection on data streams
Jan N van Rijn, Geoffrey Holmes, Bernhard Pfahringer, and Joaquin Vanschoren.
Algorithm selection on data streams.
In Proc 17th International Conference on Discovery Science,
pages 325336. Springer, 2014.
[ bib ]
We explore the possibilities of metalearning on data streams, in particular algorithm selection. In a first experiment we calculate the characteristics of a small sample of a data stream, and try to predict which classifier performs best on the entire stream. This yields promising results and interesting patterns. In a second experiment, we build a metaclassifier that predicts, based on measurable data characteristics in a window of the data stream, the best classifier for the next window. The results show that this metaalgorithm is very competitive with state of the art ensembles, such as OzaBag, OzaBoost and Leveraged Bagging. The results of all experiments are made publicly available in an online experiment database, for the purpose of verifiability, reproducibility and generalizability.


[395]

Towards metalearning over data streams
Jan N van Rijn, Geoffrey Holmes, Bernhard Pfahringer, and Joaquin Vanschoren.
Towards metalearning over data streams.
In Proc International Workshop on Metalearning and Algorithm
Selection, volume Vol1201, pages 3738. ceurws.org, 2014.
[ bib 
.pdf ]
Modern society produces vast streams of data. Many stream mining algorithms have been developed to capture general trends in these streams, and make predictions for future observations, but relatively little is known about which algorithms perform particularly well on which kinds of data. Moreover, it is possible that the characteristics of the data change over time, and thus that a different algorithm should be recommended at various points in time. Figure 1 illustrates this. As such, we are dealing with the Algorithm Selection Problem [9] in a data stream setting. Based on measurable metafeatures from a window of observations from a data stream, a metaalgorithm is built that predicts the best classifier for the next window. Our results show that this metaalgorithm is competitive with stateoftheart data streaming ensembles, such as OzaBag [6], OzaBoost [6] andLeveraged Bagging[3].


[396]

Active learning with drifting streaming data
Indrė Žliobaitė, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes.
Active learning with drifting streaming data.
IEEE Transactions on Neural Networks and Learning Systems,
25(1):2739, 2014.
[ bib ]
In learning to classify streaming data, obtaining true labels may require major effort and may incur excessive cost. Active learning focuses on carefully selecting as few labeled instances as possible for learning an accurate predictive model. Streaming data poses additional challenges for active learning, since the data distribution may change over time (concept drift) and models need to adapt. Conventional active learning strategies concentrate on querying the most uncertain instances, which are typically concentrated around the decision boundary. Changes occurring further from the boundary may be missed, and models may fail to adapt. This paper presents a theoretically supported framework for active learning from drifting data streams and develops three active learning strategies for streaming data that explicitly handle concept drift. They are based on uncertainty, dynamic allocation of labeling efforts over time, and randomization of the search space. We empirically demonstrate that these strategies react well to changes that can occur anywhere in the instance space and unexpectedly.


[397]

From unlabelled tweets to Twitterspecific opinion words on Research & Development in Information Retrieval
Felipe BravoMarquez, Eibe Frank, and Bernhard Pfahringer.
From unlabelled tweets to twitterspecific opinion words on research
& development in information retrieval.
In Proc 38th Annual ACM SIGIR Conference, Santiago de Chile,
Chile. ACM Press, 2015.
[ bib 
.pdf ]
In this article, we propose a wordlevel classification model for automatically generating a Twitterspecific opinion lexicon from a corpus of unlabelled tweets. The tweets from the corpus are represented by two vectors: a bagofwords vector and a semantic vector based on wordclusters. We propose a distributional representation for words by treating them as the centroids of the tweet vectors in which they appear. The lexicon generation is conducted by training a wordlevel classifier using these centroids to form the instance space and a seed lexicon to label the training instances. Experimental results show that the two types of tweet vectors complement each other in a statistically significant manner and that our generated lexicon produces significant improvements for tweetlevel polarity classification.


[398]

Positive, Negative, or Neutral: Learning an Expanded Opinion Lexicon from Emoticonannotated Tweets
Felipe BravoMarquez, Eibe Frank, and Bernhard Pfahringer.
Positive, negative, or neutral: Learning an expanded opinion lexicon
from emoticonannotated tweets.
In Proc 23rd International Joint Conference on Artificial
Intelligence, Buenos Aires, Argentina, pages 12291235. IJCAI/AAAI, 2015.
[ bib 
.pdf ]
We present a supervised framework for expanding an opinion lexicon for tweets. The lexicon contains partofspeech (POS) disambiguated entries with a threedimensional probability distribution for pos itive, negative, and neutral polarities. To obtain this distribution using machine learning, we propose wordlevel attributes based on POS tags and information calculated from streams of emoticon annotated tweets. Our experimental results show that our method outperforms the threedimensional wordlevel polarity classification performance obtained by semantic orientation, a stateoftheart measure for establishing worldlevel sentiment.


[399]

Alternating Model Trees
Eibe Frank, Michael Mayo, and Stefan Kramer.
Alternating model trees.
In Proc 30th ACM Symposium on Applied Computing, Data Mining
Track, pages 871878. ACM Press, 2015.
[ bib 
.pdf ]
Model tree induction is a popular method for tackling regression problems requiring interpretable models. Model trees are decision trees with multiple linear regression models at the leaf nodes. In this paper, we propose a method for growing alternating model trees, a form of option tree for regression problems. The motivation is that alternating decision trees achieve high accuracy in classification problems because they represent an ensemble classifier as a single tree structure. As in alternating decision trees for classification, our alternating model trees for regression contain splitter and prediction nodes, but we use simple linear regression functions as opposed to constant predictors at the prediction nodes. Moreover, additive regression using forward stagewise modeling is applied to grow the tree rather than a boosting algorithm. The size of the tree is determined using crossvalidation. Our empirical results show that alternating model trees achieve significantly lower squared error than standard model trees on several regression datasets.


[400]

Scalable Text Mining with Sparse Generative Models
Antti Puurula.
Scalable Text Mining with Sparse Generative Models.
PhD thesis, Department of Computer Science, University of Waikato,
2015.
[ bib 
http ]
The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods.
This thesis proposes a solution to scalable text mining: generative models combined with sparse computation. A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models, but ignored parallel developments. This framework allows the use of methods developed in different processing tasks such as retrieval and classification, yielding effective solutions across different text mining tasks. Sparse computation using inverted indices is proposed for inference on probabilistic models. This reduces the computational complexity of the common text mining operations according to sparsity, yielding probabilistic models with the scalability of modern search engines.
The proposed combination provides sparse generative models: a solution for text mining that is general, effective, and scalable. Extensive experimentation on text classification and ranked retrieval datasets are conducted, showing that the proposed solution matches or outperforms the leading taskspecific methods in effectiveness, with a order of magnitude decrease in classification times for Wikipedia article categorization with a million classes. The developed methods were further applied in two 2014 Kaggle data mining prize competitions with over a hundred competing teams, earning first and second places.


[401]

Evaluation methods and decision theory for classification of streaming data with temporal dependence
Indrė Žliobaitė, Albert Bifet, Jesse Read, Bernhard Pfahringer, and Geoff
Holmes.
Evaluation methods and decision theory for classification of
streaming data with temporal dependence.
Machine Learning, 98(3):455482, 2015.
[ bib ]
Predictive modeling on data streams plays an important role in modern data analysis, where data arrives continuously and needs to be mined in real time. In the stream setting the data distribution is often evolving over time, and models that update themselves during operation are becoming the stateoftheart. This paper formalizes a learning and evaluation scheme of such predictive models. We theoretically analyze evaluation of classifiers on streaming data with temporal dependence. Our findings suggest that the commonly accepted data stream classification measures, such as classification accuracy and Kappa statistic, fail to diagnose cases of poor performance when temporal dependence is present, therefore they should not be used as sole performance indicators. Moreover, classification accuracy can be misleading if used as a proxy for evaluating change detectors with datasets that have temporal dependence. We formulate the decision theory for streaming data classification with temporal dependence and develop a new evaluation methodology for data stream classification that takes temporal dependence into account. We propose a combined measure for classification performance, that takes into account temporal dependence, and we recommend using it as the main performance measure in classification of streaming data.

