[1]

Mixture model clustering for mixed data with missing information
Lynette Hunt and Murray Jorgensen.
Mixture model clustering for mixed data with missing information.
Computational Statistics & Data Analysis, 41(3???4):429 
440, 2003.
[ bib 
http ]
One difficulty with classification studies is unobserved or missing observations that often occur in multivariate datasets. The mixture likelihood approach to clustering has been well developed and is much used, particularly for mixtures where the component distributions are multivariate normal. It is shown that this approach can be extended to analyse data with mixed categorical and continuous attributes and where some of the data are missing at random in the sense of Little and Rubin (Statistical Analysis with Mixing Data, Wiley, New York).


[2]

Token Identification Using HMM and PPM Models.
Yingying Wen, Ian H. Witten, and Dianhui Wang.
Token identification using hmm and ppm models.
In Proc 16th Australian Joint Conference on Artificial
Intelligence, pages 173185. Springer, 2003.
[ bib ]
Hidden markov models (HMMs) and prediction by partial matching models (PPM) have been successfully used in language pro cessing tasks including learningbased token identification. Most of the existing systems are domain and languagedependent. The power of re targetability and applicability of these systems is limited. This paper investigates the effect of the combination of HMMs and PPM on to ken identification. We implement a system that bridges the two well known methods through words new to the identification model. The sys tem is fully domain and languageindependent. No changes of code are necessary when applying to other domains or languages. The only re quired input of the system is an annotated corpus. The system has been tested on two corpora and achieved an overall Fmeasure of 69.02% for TCC, and 76.59% for BIB. Although the performance is not as good as that obtained from a system with languagedependent components, our proposed system has power to deal with large scope of domain and languageindependent problem. Identification of date has the best result, 73% and 92% of correct tokens are identified for two corpora respectively. The system also performs reasonably well on people???s name with correct tokens of 68% for TCC, and 76% for BIB.


[3]

A probabilistic line breaking algorithm
R.R. Bouckaert.
A probabilistic line breaking algorithm.
In Proc Sixteenth Australian Joint Conference on Artificial
Intelligence, 2003.
[ bib 
.pdf ]
We show how a probabilistic interpretation of an ill defined problem, the problem of finding line breaks in a paragraph, can lead to an efficient new algorithm that performs well. The graphical model that results from the probabilistic interpretation has the advantage that it is easy to tune due to the probabilistic approach. Furthermore, the algorithm optimizes the probability a break up is acceptable over the whole paragraph, it does not show threshold effects and it allows for easy incorporation of subtle typographical rules. Thanks to the architecture of the Bayesian network, the algorithm is linear in the number of characters in a paragraph. Empirical evidence suggests that this algorithm performs closer to results published through desk top publishing than a number of existing systems.


[4]

Choosing between two learning algorithms based on calibrated tests
R.R. Bouckaert.
Choosing between two learning algorithms based on calibrated tests.
In T. Fawcett and N. Mishra, editors, Proc Twentieth
International Conference on Machine Learning, pages 5158, Washington, USA,
2003. Morgan Kaufmann.
[ bib 
.pdf ]
Designing a hypothesis test to determine the best of two machine learning algorithms with only a small data set available is not a simple task. Many popular tests suffer from low power (5x2 cv), or high Type I error (Weka?s 10x10 cross validation). Furthermore, many tests show a low level of replicability, so that tests performed by different scientists with the same pair of algorithms, the same data sets and the same hypothesis test still may present different results. We show that 5x2 cv, resampling and 10 fold cv suffer from low replicability. The main complication is due to the need to use the data multiple times. As a consequence, independence assumptions for most hypothesis tests are violated. In this paper, we pose the case that reuse of the same data causes the effective degrees of freedom to be much lower than theoretically expected. We show how to calibrate the effective degrees of freedom empirically for various tests. Some tests are not calibratable, indicating another flaw in the design. However the ones that are calibratable all show very similar behavior. Moreover, the Type I error of those tests is on the mark for a wide range of circumstances, while they show a power and replicability that is a considerably higher than currently popular hypothesis tests.


[5]

Choosing learning algorithms using sign tests with high replicability
R.R. Bouckaert.
Choosing learning algorithms using sign tests with high
replicability.
In Proc Sixteenth Australian Joint Conference on Artificial
Intelligence, 2003.
[ bib 
.ps ]
An important task in machine learning is determining which learning algorithm works best for a given data set. When the amount of data is small the same data needs to be used repeatedly in order to get a reasonable estimate of the accuracy of the learning algorithms. This results in violations of assumptions on which standard tests are based and makes it hard to design a good test.
In this article, we investigate sign tests to address the problem of choosing the best of two learning algorithms when only a small data set is available. Sign tests are conceptually simple and no assumption about underlying distributions is required. We show that simplistic sample generation can lead to flawed test outcomes. Furthermore, we identify a test that performs well based on Type I error (showing a difference between algorithms when there is none), power (showing a difference when it indeed exists) and replicability.
Replicability is a novel measure of a quality of a test that gives an indication of how likely it is that the test outcome will be the same when the same test on the same data with the same sampling scheme and same pair of algorithms is executed, but with a difference randomization of the data. A new definition of replicability is provided and its benefits highlighted. A new definition of replicability is provided and its benefits highlighted. Empirical evidence is provided to show the test is robust under a varied range of circumstances.


[6]

Applying Propositional Learning Algorithms to Multiinstance data
Eibe Frank and Xin Xu.
Applying propositional learning algorithms to multiinstance data.
Technical Report 06/03, Department of Computer Science, University of
Waikato, 2003.
[ bib 
.ps.gz 
.pdf ]
Multiinstance learning is commonly tackled using specialpurpose algorithms. Development of these algorithms has started because early experiments with standard propositional learners have failed to produce satisfactory results on multiinstance datamore specifically, the Musk data. In this paper we present evidence that this is not necessarily the case. We introduce a simple wrapper for applying standard propositional learners to multiinstance problems and present empirical results for the Musk data that are competitive with genuine multiinstance algorithms. The key features of our new wrapper technique are: (1) it discards the standard multiinstance assumption that there is some inherent difference between positive and negative bags, and (2) it introduces weights to treat instances from different bags differently. We show that these two modifications are essential for producing good results on the Musk benchmark datasets.


[7]

Locally Weighted Naive Bayes
Eibe Frank, Mark Hall, and Bernhard Pfahringer.
Locally weighted naive Bayes.
In U. Kjaerulff and C. Meek, editors, Proc 19th Conference in
Uncertainty in Artificial Intelligence, Acapulco, Mexico, pages 249256.
Morgan Kaufmann, 2003.
Also published as Working Paper 04/03, Department of Computer
Science, The University of Waikato, Hamilton.
[ bib 
.ps.gz 
.pdf ]
Despite its simplicity, the naive Bayes classifier has surprised machine learning researchers by exhibiting good performance on a variety of learning problems. Encouraged by these results, researchers have looked to overcome naive Bayes' primary weaknessattribute independenceand improve the performance of the algorithm. This paper presents a locally weighted version of naive Bayes that relaxes the independence assumption by learning local models at prediction time. Experimental results show that locally weighted naive Bayes rarely degrades accuracy compared to standard naive Bayes and, in many cases, improves accuracy dramatically. The main advantage of this method compared to other techniques for enhancing naive Bayes is its conceptual and computational simplicity.


[8]

Predicting Library of Congress Classifications from Library of Congress Subject Headings
E. T. Frank and G. W. Paynter.
Predicting library of congress classifications from library of
congress subject headings.
Technical Report 01/03, The University of Waikato, Department of
Computer Science, Hamilton, New Zealand, 2003.
[ bib 
.ps 
.pdf ]
This paper addresses the problem of automatically assigning a Library of Congress Classification (LCC) to work given its set of Library of Congress Subject Headings (LCSH). LCC are organized in a tree: the root node of this hierarchy comprises all possible topics, and leaf nodes correspond to the most specialized topic areas defined. We describe a procedure that, given a resource identified by its LCSH, automatically places that resource in the LCC hierarchy. The procedure uses machine learning techniques and training data from a large library catalog to learn a classification model mapping from sets of LCSH to nodes in the LCC tree. We present empirical results for our technique showing its accuracy on an independent collection of 50,000 LCSH/LCC pairs.


[9]

Visualizing Class Probability Estimators
Eibe Frank and Mark Hall.
Visualizing class probability estimators.
In N. Lavrac and et al., editors, Proc 7th European Conference
on Principles and Practice of Knowledge Discovery in Databases, volume 2838
of LNCS, pages 168179. Springer, 2003.
Also published as Working Paper 02/03, Department of Computer
Science, The University of Waikato, Hamilton.
[ bib 
.ps.gz 
.pdf ]
Inducing classifiers that make accurate predictions on future data is a driving force for research in inductive learning. However, also of importance to the users is how to gain information from the models produced. Unfortunately, some of the most powerful inductive learning algorithms generate black boxes?that is, the representation of the model makes it virtually impossible to gain any insight into what has been learned. This paper presents a technique that can help the user understand why a classifier makes the predictions that it does by providing a twodimensional visualization of its class probability estimates. It requires the classifier to generate class probabilities but most practical algorithms are able to do so (or can be modified to this end).


[10]

Benchmarking Attribute Selection Techniques for Discrete Class Data Mining
Mark Andrew Hall and Geoffrey Holmes.
Benchmarking attribute selection techniques for discrete class data
mining.
IEEE Transactions on Knowledge and Data Engineering,
15(3):14371447, November/December 2003.
[ bib 
.pdf ]
Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building process phase can result in poor predictive performance and increased computation.
Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutations and has led to a situation where very few benchmark studies have been conducted.
This paper presents a benchmark comparison of several attribute selection methods for supervised classification. All the methods produce an attribute ranking, a useful devise for isolating the individual merit of an attribute. Attribute selection is achieved by crossvalidating the attribute rankings with respect to a classification learner to find the best attributes. Results are reported for a selection of standard data sets and two diverse learning schemes C4.5 and naive Bayes.


[11]

Mining data streams using option trees
G. Holmes, B. Pfahringer, and R. B. Kirkby.
Mining data streams using option trees.
Technical Report 08/03, The University of Waikato, Department of
Computer Science, Hamilton, New Zealand, 2003.
[ bib ]
The data stream model for data mining places harsh restrictions on a learning algorithm. A model must be induced following the briefest interrogation of the data, must use only available memory and must update itself over time within these constraints. Additionally, the model must be able to be used for data mining at any point in time. This paper describes a data stream classification algorithm using an ensemble of option trees. The ensemble of trees is induced by boosting and iteratively combined into a single interpretable model. The algorithm is evaluated using benchmark datasets for accuracy against stateoftheart algorithms that make use of the entire dataset.


[12]

Logistic Model Trees
Niels Landwehr, Mark Hall, and Eibe Frank.
Logistic model trees.
In N. Lavrac and et al., editors, Proc 14th European Conference
on Machine Learning, volume 2837 of LNCS, pages 241252,
CavtatDubrovnik, Croatia, 2003. Springer.
[ bib 
.ps 
.pdf ]
Tree induction methods and linear models are popular techniques for supervised learning tasks, both for the prediction of nominal classes and continuous numeric values. For predicting numeric quantities, there has been work on combining these two schemes into ?model trees?, i.e. trees that contain linear regression functions at the leaves. In this paper, we present an algorithm that adapts this idea for classification problems, using logistic regression instead of linear regression. We use a stagewise fitting process to construct the logistic regression models that can select relevant attributes in the data in a natural way, and show how this approach can be used to build the logistic regression models at the leaves by incrementally refining those constructed at higher levels in the tree. We compare the performance of our algorithm against that of decision trees and logistic regression on 32 benchmark UCI datasets, and show that it achieves a higher classification accuracy on average than the other two methods.


[13]

The ability for fourier transform infrared spectroscopy to classify persimmon genotypes by epicuticular leaf waxes
A.D. Mowat and G. Holmes.
The ability for fourier transform infrared spectroscopy to classify
persimmon genotypes by epicuticular leaf waxes.
In Proc Proc Second International Persimmon Symposium, ISHS Acta
Horticulturae 601, pages 6569, Queensland, Australia, 2003.
[ bib ]
The ability of Fourier Transform Infrared (FTIR) spectroscopic measurements of epicuticular leaf waxes to classify five nonastringent persimmon cultivars ( Diospyros kaki L. cv. Fuyu, cv. Matsumoto wase Fuyu, cv. Hanna Fuyu, cv. Oku Gosho and cv. IIIQ12) was investigated. Principal component analysis was used to reduce the FTIR spectral data to 20 variables. Machine learning models classified the cultivars into five groups. Three of the cultivars, Oku Gosho, Hanna Fuyu and IIIQ12 were readily distinguished from each other and from Fuyu and Matsumoto wase Fuyu. However, the leaf wax data was unable to distinguished between two closely related persimmon genotypes, Fuyu and the early maturing L1 mutation of Fuyu, Matsumoto wase Fuyu. Here, epicuticular wax biosynthesis appears unaffected by the LI mutation found in Matsumoto wase Fuyu. When the classification was based on four groups, where Fuyu and Matsumoto wase Fuyu samples were combined, the best performing machine learning model correctly classified 98.7% of the samples.


[14]

Propositionalization through Stochastic Discrimination
B. Pfahringer and G. Holmes.
Propositionalization through stochastic discrimination.
In Proceedings of the WorkinProgress Track at the 13th
International Conference on Inductive Logic Programming, pages 6068.
Department of Informatics, University of Szeged, Hungary, September 2003.
[ bib 
.ps 
.pdf ]
A simple algorithm based on the theory of stochastic discrimination is developed for the fast extraction of subgraphs with potential discriminative power from a given set of preclassified graphs. A preliminary experimental evaluation indicates the potential of the approach. Limitations are discussed as well as directions for future research.


[15]

Text Categorisation Using Document Profiling
Maximilien Sauban and Bernhard Pfahringer.
Text categorisation using document profiling.
In N Lavrac, editor, Proc 7th European Conference on Principles
and Practice of Knowledge Discovery in Databases, volume 2838 of LNCS,
pages 411422. Springer, 2003.
[ bib 
.ps 
.pdf ]
This paper presents an extension of prior work by Michael D. Lee on psychologically plausible text categorisation. Our approach utilises Lee?s model as a preprocessing filter to generate a dense representation for a given text document (a document profile) and passes that on to an arbitrary standard propositional learning algorithm. Similarly to standard feature selection for text classification, the dimensionality of instances is drastically reduced this way, which in turn greatly lowers the computational load for the subsequent learning algorithm. The filter itself is very fast as well, as it basically is just an interesting variant of Naive Bayes. We present different variations of the filter and conduct an evaluation against the Reuters21578 collection that shows performance comparable to previously published results on that collection, but at a lower computational cost.


[16]

All of MEDLINE indexed to the Gene Ontology
A. C. Smith and J. G. Cleary.
All of medline indexed to the gene ontology.
In Proc Sixth Annual BioOntologies Meeting, pages 711,
Brisbane, Australia, 2003.
[ bib ]
Most of what is known about genes and proteins is contained in the biomedical literature. The problem for the biologist is how to connect novel sequence data to relevant published documents. One way is to BLAST the sequence and then follow the literature links established in genomic/proteomic databases for known sequences with similar structure. Another way is to find the closelymatching genes or proteins in the Gene Ontology, then retrieve documents associated with GO terms. The advantage of this approach is that it provides a conceptual context for discovering possible genetic roles or molecular functions for a new sequence. The problems with both search strategies, however, is that they return only a small portion of the available literature. We are solving this problem by amplifying the available documents associated with GO terms to cover the entirety of the MEDLINE corpus.


[17]

A TwoLevel Learning Method for Generalized Multiinstance
Problems
Nils Weidmann, Eibe Frank, and Bernhard Pfahringer.
A twolevel learning method for generalized multiinstance problems.
In N. Lavrac and et al., editors, Proc 14th European Conference
on Machine Learning, CavtatDubrovnik, Croatia, pages 468479. Springer,
2003.
[ bib 
.ps.gz 
.pdf ]
In traditional multiinstance (MI) learning, a single positive instance in a bag produces a positive class label. Hence, the learner knows how the bag?s class label depends on the labels of the instances in the bag and can explicitly use this information to solve the learning task. In this paper we investigate a generalized view of the MI problem where this simple assumption no longer holds. We assume that an ?interaction? between instances in a bag determines the class label. Our twolevel learning method for this type of problem transforms an MI bag into a single metainstance that can be learned by a standard propositional method. The metainstance indicates which regions in the instance space are covered by instances of the bag. Results on both artificial and realworld data show that this twolevel classification approach is well suited for generalized MI problems.


[18]

Statistical learning in multiple instance problem
Xin Xu.
Statistical learning in multiple instance problem.
Master's thesis, University of Waikato, Hamilton, NZ, 2003.
0657.594.
[ bib 
http 
.ps.gz ]
Multiple instance (MI) learning is a relatively new topic in machine learning. It is concerned with supervised learning but differs from normal supervised learning in two points: (1) it has multiple instances in an example (and there is only one instance in an example in standard supervised learning), and (2) only one class label is observable for all the instances in an example (whereas each instance has its own class label in normal supervised learning). In MI learning there is a common assumption regarding the relationship between the class label of an example and the unobservable class labels of the instances inside it. This assumption, which is called the MI assumption in this thesis, states that An example is positive if at least one of its instances is positive and negative otherwise.
In this thesis, we first categorize current MI methods into a new framework. According to our analysis, there are two main categories of MI methods, instancebased and metadatabased approaches. Then we propose a new assumption for MI learning, called the collective assumption. Although this assumption has been used in some previous MI methods, it has never been explicitly stated, and this is the first time that it is formally specified. Using this new assumption we develop new algorithms  more specifically two instancebased and one metadatabased methods. All of these methods build probabilistic models and thus implement statistical learning algorithms. The exact generative models underlying these methods are explicitly stated and illustrated so that one may clearly understand the situations to which they can best be applied. The empirical results presented in this thesis show that they are competitive on standard benchmark datasets. Finally, we explore some practical applications of MI learning, both existing and new ones.
This thesis makes three contributions: a new framework for MI learning, new MI methods based on this framework and experimental results for new applications of MI learning.

