Mixture model clustering for mixed data with missing information

Lynette Hunt and Murray Jorgensen. Mixture model clustering for mixed data with missing information. Computational Statistics & Data Analysis, 41(3???4):429 - 440, 2003.
[ bib | http ]
One difficulty with classification studies is unobserved or missing observations that often occur in multivariate datasets. The mixture likelihood approach to clustering has been well developed and is much used, particularly for mixtures where the component distributions are multivariate normal. It is shown that this approach can be extended to analyse data with mixed categorical and continuous attributes and where some of the data are missing at random in the sense of Little and Rubin (Statistical Analysis with Mixing Data, Wiley, New York).


Token Identification Using HMM and PPM Models.

Yingying Wen, Ian H. Witten, and Dianhui Wang. Token identification using hmm and ppm models. In Proc 16th Australian Joint Conference on Artificial Intelligence, pages 173-185. Springer, 2003.
[ bib ]
Hidden markov models (HMMs) and prediction by partial matching models (PPM) have been successfully used in language pro- cessing tasks including learning-based token identification. Most of the existing systems are domain- and language-dependent. The power of re- targetability and applicability of these systems is limited. This paper investigates the effect of the combination of HMMs and PPM on to- ken identification. We implement a system that bridges the two well known methods through words new to the identification model. The sys- tem is fully domain- and language-independent. No changes of code are necessary when applying to other domains or languages. The only re- quired input of the system is an annotated corpus. The system has been tested on two corpora and achieved an overall F-measure of 69.02% for TCC, and 76.59% for BIB. Although the performance is not as good as that obtained from a system with language-dependent components, our proposed system has power to deal with large scope of domain- and language-independent problem. Identification of date has the best result, 73% and 92% of correct tokens are identified for two corpora respectively. The system also performs reasonably well on people???s name with correct tokens of 68% for TCC, and 76% for BIB.


A probabilistic line breaking algorithm

R.R. Bouckaert. A probabilistic line breaking algorithm. In Proc Sixteenth Australian Joint Conference on Artificial Intelligence, 2003.
[ bib | .pdf ]
We show how a probabilistic interpretation of an ill defined problem, the problem of finding line breaks in a paragraph, can lead to an efficient new algorithm that performs well. The graphical model that results from the probabilistic interpretation has the advantage that it is easy to tune due to the probabilistic approach. Furthermore, the algorithm optimizes the probability a break up is acceptable over the whole paragraph, it does not show threshold effects and it allows for easy incorporation of subtle typographical rules. Thanks to the architecture of the Bayesian network, the algorithm is linear in the number of characters in a paragraph. Empirical evidence suggests that this algorithm performs closer to results published through desk top publishing than a number of existing systems.


Choosing between two learning algorithms based on calibrated tests

R.R. Bouckaert. Choosing between two learning algorithms based on calibrated tests. In T. Fawcett and N. Mishra, editors, Proc Twentieth International Conference on Machine Learning, pages 51-58, Washington, USA, 2003. Morgan Kaufmann.
[ bib | .pdf ]
Designing a hypothesis test to determine the best of two machine learning algorithms with only a small data set available is not a simple task. Many popular tests suffer from low power (5x2 cv), or high Type I error (Weka?s 10x10 cross validation). Furthermore, many tests show a low level of replicability, so that tests performed by different scientists with the same pair of algorithms, the same data sets and the same hypothesis test still may present different results. We show that 5x2 cv, resampling and 10 fold cv suffer from low replicability. The main complication is due to the need to use the data multiple times. As a consequence, independence assumptions for most hypothesis tests are violated. In this paper, we pose the case that reuse of the same data causes the effective degrees of freedom to be much lower than theoretically expected. We show how to calibrate the effective degrees of freedom empirically for various tests. Some tests are not calibratable, indicating another flaw in the design. However the ones that are calibratable all show very similar behavior. Moreover, the Type I error of those tests is on the mark for a wide range of circumstances, while they show a power and replicability that is a considerably higher than currently popular hypothesis tests.


Choosing learning algorithms using sign tests with high replicability

R.R. Bouckaert. Choosing learning algorithms using sign tests with high replicability. In Proc Sixteenth Australian Joint Conference on Artificial Intelligence, 2003.
[ bib | .ps ]
An important task in machine learning is determining which learning algorithm works best for a given data set. When the amount of data is small the same data needs to be used repeatedly in order to get a reasonable estimate of the accuracy of the learning algorithms. This results in violations of assumptions on which standard tests are based and makes it hard to design a good test.

In this article, we investigate sign tests to address the problem of choosing the best of two learning algorithms when only a small data set is available. Sign tests are conceptually simple and no assumption about underlying distributions is required. We show that simplistic sample generation can lead to flawed test outcomes. Furthermore, we identify a test that performs well based on Type I error (showing a difference between algorithms when there is none), power (showing a difference when it indeed exists) and replicability.

Replicability is a novel measure of a quality of a test that gives an indication of how likely it is that the test outcome will be the same when the same test on the same data with the same sampling scheme and same pair of algorithms is executed, but with a difference randomization of the data. A new definition of replicability is provided and its benefits highlighted. A new definition of replicability is provided and its benefits highlighted. Empirical evidence is provided to show the test is robust under a varied range of circumstances.


Applying Propositional Learning Algorithms to Multi-instance data

Eibe Frank and Xin Xu. Applying propositional learning algorithms to multi-instance data. Technical Report 06/03, Department of Computer Science, University of Waikato, 2003.
[ bib | .ps.gz | .pdf ]
Multi-instance learning is commonly tackled using special-purpose algorithms. Development of these algorithms has started because early experiments with standard propositional learners have failed to produce satisfactory results on multi-instance data-more specifically, the Musk data. In this paper we present evidence that this is not necessarily the case. We introduce a simple wrapper for applying standard propositional learners to multi-instance problems and present empirical results for the Musk data that are competitive with genuine multi-instance algorithms. The key features of our new wrapper technique are: (1) it discards the standard multi-instance assumption that there is some inherent difference between positive and negative bags, and (2) it introduces weights to treat instances from different bags differently. We show that these two modifications are essential for producing good results on the Musk benchmark datasets.


Locally Weighted Naive Bayes

Eibe Frank, Mark Hall, and Bernhard Pfahringer. Locally weighted naive Bayes. In U. Kjaerulff and C. Meek, editors, Proc 19th Conference in Uncertainty in Artificial Intelligence, Acapulco, Mexico, pages 249-256. Morgan Kaufmann, 2003. Also published as Working Paper 04/03, Department of Computer Science, The University of Waikato, Hamilton.
[ bib | .ps.gz | .pdf ]
Despite its simplicity, the naive Bayes classifier has surprised machine learning researchers by exhibiting good performance on a variety of learning problems. Encouraged by these results, researchers have looked to overcome naive Bayes' primary weakness-attribute independence-and improve the performance of the algorithm. This paper presents a locally weighted version of naive Bayes that relaxes the independence assumption by learning local models at prediction time. Experimental results show that locally weighted naive Bayes rarely degrades accuracy compared to standard naive Bayes and, in many cases, improves accuracy dramatically. The main advantage of this method compared to other techniques for enhancing naive Bayes is its conceptual and computational simplicity.


Predicting Library of Congress Classifications from Library of Congress Subject Headings

E. T. Frank and G. W. Paynter. Predicting library of congress classifications from library of congress subject headings. Technical Report 01/03, The University of Waikato, Department of Computer Science, Hamilton, New Zealand, 2003.
[ bib | .ps | .pdf ]
This paper addresses the problem of automatically assigning a Library of Congress Classification (LCC) to work given its set of Library of Congress Subject Headings (LCSH). LCC are organized in a tree: the root node of this hierarchy comprises all possible topics, and leaf nodes correspond to the most specialized topic areas defined. We describe a procedure that, given a resource identified by its LCSH, automatically places that resource in the LCC hierarchy. The procedure uses machine learning techniques and training data from a large library catalog to learn a classification model mapping from sets of LCSH to nodes in the LCC tree. We present empirical results for our technique showing its accuracy on an independent collection of 50,000 LCSH/LCC pairs.


Visualizing Class Probability Estimators

Eibe Frank and Mark Hall. Visualizing class probability estimators. In N. Lavrac and et al., editors, Proc 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, volume 2838 of LNCS, pages 168-179. Springer, 2003. Also published as Working Paper 02/03, Department of Computer Science, The University of Waikato, Hamilton.
[ bib | .ps.gz | .pdf ]
Inducing classifiers that make accurate predictions on future data is a driving force for research in inductive learning. However, also of importance to the users is how to gain information from the models produced. Unfortunately, some of the most powerful inductive learning algorithms generate black boxes?that is, the representation of the model makes it virtually impossible to gain any insight into what has been learned. This paper presents a technique that can help the user understand why a classifier makes the predictions that it does by providing a two-dimensional visualization of its class probability estimates. It requires the classifier to generate class probabilities but most practical algorithms are able to do so (or can be modified to this end).


Benchmarking Attribute Selection Techniques for Discrete Class Data Mining

Mark Andrew Hall and Geoffrey Holmes. Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering, 15(3):1437-1447, November/December 2003.
[ bib | .pdf ]
Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building process phase can result in poor predictive performance and increased computation.

Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutations and has led to a situation where very few benchmark studies have been conducted.

This paper presents a benchmark comparison of several attribute selection methods for supervised classification. All the methods produce an attribute ranking, a useful devise for isolating the individual merit of an attribute. Attribute selection is achieved by cross-validating the attribute rankings with respect to a classification learner to find the best attributes. Results are reported for a selection of standard data sets and two diverse learning schemes C4.5 and naive Bayes.


Mining data streams using option trees

G. Holmes, B. Pfahringer, and R. B. Kirkby. Mining data streams using option trees. Technical Report 08/03, The University of Waikato, Department of Computer Science, Hamilton, New Zealand, 2003.
[ bib ]
The data stream model for data mining places harsh restrictions on a learning algorithm. A model must be induced following the briefest interrogation of the data, must use only available memory and must update itself over time within these constraints. Additionally, the model must be able to be used for data mining at any point in time. This paper describes a data stream classification algorithm using an ensemble of option trees. The ensemble of trees is induced by boosting and iteratively combined into a single interpretable model. The algorithm is evaluated using benchmark datasets for accuracy against state-of-the-art algorithms that make use of the entire dataset.


Logistic Model Trees

Niels Landwehr, Mark Hall, and Eibe Frank. Logistic model trees. In N. Lavrac and et al., editors, Proc 14th European Conference on Machine Learning, volume 2837 of LNCS, pages 241-252, Cavtat-Dubrovnik, Croatia, 2003. Springer.
[ bib | .ps | .pdf ]
Tree induction methods and linear models are popular techniques for supervised learning tasks, both for the prediction of nominal classes and continuous numeric values. For predicting numeric quantities, there has been work on combining these two schemes into ?model trees?, i.e. trees that contain linear regression functions at the leaves. In this paper, we present an algorithm that adapts this idea for classification problems, using logistic regression instead of linear regression. We use a stagewise fitting process to construct the logistic regression models that can select relevant attributes in the data in a natural way, and show how this approach can be used to build the logistic regression models at the leaves by incrementally refining those constructed at higher levels in the tree. We compare the performance of our algorithm against that of decision trees and logistic regression on 32 benchmark UCI datasets, and show that it achieves a higher classification accuracy on average than the other two methods.


The ability for fourier transform infrared spectroscopy to classify persimmon genotypes by epicuticular leaf waxes

A.D. Mowat and G. Holmes. The ability for fourier transform infrared spectroscopy to classify persimmon genotypes by epicuticular leaf waxes. In Proc Proc Second International Persimmon Symposium, ISHS Acta Horticulturae 601, pages 65-69, Queensland, Australia, 2003.
[ bib ]
The ability of Fourier Transform Infrared (FTIR) spectroscopic measurements of epicuticular leaf waxes to classify five non-astringent persimmon cultivars ( Diospyros kaki L. cv. Fuyu, cv. Matsumoto wase Fuyu, cv. Hanna Fuyu, cv. Oku Gosho and cv. II-IQ-12) was investigated. Principal component analysis was used to reduce the FTIR spectral data to 20 variables. Machine learning models classified the cultivars into five groups. Three of the cultivars, Oku Gosho, Hanna Fuyu and II-IQ-12 were readily distinguished from each other and from Fuyu and Matsumoto wase Fuyu. However, the leaf wax data was unable to distinguished between two closely related persimmon genotypes, Fuyu and the early maturing L1 mutation of Fuyu, Matsumoto wase Fuyu. Here, epicuticular wax biosynthesis appears unaffected by the LI mutation found in Matsumoto wase Fuyu. When the classification was based on four groups, where Fuyu and Matsumoto wase Fuyu samples were combined, the best performing machine learning model correctly classified 98.7% of the samples.


Propositionalization through Stochastic Discrimination

B. Pfahringer and G. Holmes. Propositionalization through stochastic discrimination. In Proceedings of the Work-in-Progress Track at the 13th International Conference on Inductive Logic Programming, pages 60-68. Department of Informatics, University of Szeged, Hungary, September 2003.
[ bib | .ps | .pdf ]
A simple algorithm based on the theory of stochastic discrimination is developed for the fast extraction of sub-graphs with potential discriminative power from a given set of pre-classified graphs. A preliminary experimental evaluation indicates the potential of the approach. Limitations are discussed as well as directions for future research.


Text Categorisation Using Document Profiling

Maximilien Sauban and Bernhard Pfahringer. Text categorisation using document profiling. In N Lavrac, editor, Proc 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, volume 2838 of LNCS, pages 411-422. Springer, 2003.
[ bib | .ps | .pdf ]
This paper presents an extension of prior work by Michael D. Lee on psychologically plausible text categorisation. Our approach utilises Lee?s model as a pre-processing filter to generate a dense representation for a given text document (a document profile) and passes that on to an arbitrary standard propositional learning algorithm. Similarly to standard feature selection for text classification, the dimensionality of instances is drastically reduced this way, which in turn greatly lowers the computational load for the subsequent learning algorithm. The filter itself is very fast as well, as it basically is just an interesting variant of Naive Bayes. We present different variations of the filter and conduct an evaluation against the Reuters-21578 collection that shows performance comparable to previously published results on that collection, but at a lower computational cost.


All of MEDLINE indexed to the Gene Ontology

A. C. Smith and J. G. Cleary. All of medline indexed to the gene ontology. In Proc Sixth Annual Bio-Ontologies Meeting, pages 7-11, Brisbane, Australia, 2003.
[ bib ]
Most of what is known about genes and proteins is contained in the biomedical literature. The problem for the biologist is how to connect novel sequence data to relevant published documents. One way is to BLAST the sequence and then follow the literature links established in genomic/proteomic databases for known sequences with similar structure. Another way is to find the closely-matching genes or proteins in the Gene Ontology, then retrieve documents associated with GO terms. The advantage of this approach is that it provides a conceptual context for discovering possible genetic roles or molecular functions for a new sequence. The problems with both search strategies, however, is that they return only a small portion of the available literature. We are solving this problem by amplifying the available documents associated with GO terms to cover the entirety of the MEDLINE corpus.


A Two-Level Learning Method for Generalized Multi-instance Problems

Nils Weidmann, Eibe Frank, and Bernhard Pfahringer. A two-level learning method for generalized multi-instance problems. In N. Lavrac and et al., editors, Proc 14th European Conference on Machine Learning, Cavtat-Dubrovnik, Croatia, pages 468-479. Springer, 2003.
[ bib | .ps.gz | .pdf ]
In traditional multi-instance (MI) learning, a single positive instance in a bag produces a positive class label. Hence, the learner knows how the bag?s class label depends on the labels of the instances in the bag and can explicitly use this information to solve the learning task. In this paper we investigate a generalized view of the MI problem where this simple assumption no longer holds. We assume that an ?interaction? between instances in a bag determines the class label. Our two-level learning method for this type of problem transforms an MI bag into a single meta-instance that can be learned by a standard propositional method. The meta-instance indicates which regions in the instance space are covered by instances of the bag. Results on both artificial and real-world data show that this two-level classification approach is well suited for generalized MI problems.


Statistical learning in multiple instance problem

Xin Xu. Statistical learning in multiple instance problem. Master's thesis, University of Waikato, Hamilton, NZ, 2003. 0657.594.
[ bib | http | .ps.gz ]
Multiple instance (MI) learning is a relatively new topic in machine learning. It is concerned with supervised learning but differs from normal supervised learning in two points: (1) it has multiple instances in an example (and there is only one instance in an example in standard supervised learning), and (2) only one class label is observable for all the instances in an example (whereas each instance has its own class label in normal supervised learning). In MI learning there is a common assumption regarding the relationship between the class label of an example and the unobservable class labels of the instances inside it. This assumption, which is called the MI assumption in this thesis, states that An example is positive if at least one of its instances is positive and negative otherwise.

In this thesis, we first categorize current MI methods into a new framework. According to our analysis, there are two main categories of MI methods, instancebased and metadata-based approaches. Then we propose a new assumption for MI learning, called the collective assumption. Although this assumption has been used in some previous MI methods, it has never been explicitly stated, and this is the first time that it is formally specified. Using this new assumption we develop new algorithms - more specifically two instance-based and one metadata-based methods. All of these methods build probabilistic models and thus implement statistical learning algorithms. The exact generative models underlying these methods are explicitly stated and illustrated so that one may clearly understand the situations to which they can best be applied. The empirical results presented in this thesis show that they are competitive on standard benchmark datasets. Finally, we explore some practical applications of MI learning, both existing and new ones.

This thesis makes three contributions: a new framework for MI learning, new MI methods based on this framework and experimental results for new applications of MI learning.