Improving Hoeffding Trees

Richard Kirkby. Improving Hoeffding Trees. PhD thesis, Department of Computer Science, University of Waikato, 2007.
[ bib | http ]
Modern information technology allows information to be collected at a far greater rate than ever before. So fast, in fact, that the main problem is making sense of it all. Machine learning offers promise of a solution, but the field mainly focusses on achieving high accuracy when data supply is limited. While this has created sophisticated classification algorithms, many do not cope with increasing data set sizes. When the data set sizes get to a point where they could be considered to represent a continuous supply, or data stream, then incremental classification algorithms are required. In this setting, the effectiveness of an algorithm cannot simply be assessed by accuracy alone. Consideration needs to be given to the memory available to the algorithm and the speed at which data is processed in terms of both the time taken to predict the class of a new data sample and the time taken to include this sample in an incrementally updated classification model. The Hoeffding tree algorithm is a state-of-the-art method for inducing decision trees from data streams. The aim of this thesis is to improve this algorithm. To measure improvement, a comprehensive framework for evaluating the performance of data stream algorithms is developed. Within the framework memory size is fixed in order to simulate realistic application scenarios. In order to simulate continuous operation, classes of synthetic data are generated providing an evaluation on a large scale. Improvements to many aspects of the Hoeffding tree algorithm are demonstrated. First, a number of methods for handling continuous numeric features are compared. Second, tree prediction strategy is investigated to evaluate the utility of various methods. Finally, the possibility of improving accuracy using ensemble methods is explored. The experimental results provide meaningful comparisons of accuracy and processing speeds between different modifications of the Hoeffding tree algorithm under various memory limits. The study on numeric attributes demonstrates that sacrificing accuracy for space at the local level often results in improved global accuracy. The prediction strategy shown to perform best adaptively chooses between standard majority class and Naive Bayes prediction in the leaves. The ensemble method investigation shows that combining trees can be worthwhile, but only when sufficient memory is available, and improvement is less likely than in traditional machine learning. In particular, issues are encountered when applying the popular boosting method to streams.


Racing algorithms for conditional independence inference

Remco R. Bouckaert and Milan Studeny. Racing algorithms for conditional independence inference. Int. J. Approx. Reasoning, 45(2):386-401, 2007.
[ bib | .pdf ]
In this article, we consider the computational aspects of deciding whether a conditional independence statement t is implied by a list of conditional independence statements L using the implication related to the method of structural imsets. We present two methods which have the interesting complementary properties that one method performs well to prove that t is implied by L, while the other performs well to prove that t is not implied by L. However, both methods do not well perform the opposite. This gives rise to a parallel algorithm in which both methods race against each other in order to determine effectively whether t is or is not implied. Some empirical evidence is provided that suggest this racing algorithms method performs considerably better than an existing method based on so-called skeletal characterization of the respective implication. Furthermore, unlike previous methods, the method is able to handle more than five variables.


Syntax-driven argument identification and multi-argument classification for semantic role labeling

Chi-San Althon Lin. Syntax-driven argument identification and multi-argument classification for semantic role labeling. PhD thesis, Department of Computer Science, University of Waikato, 2007.
[ bib | http ]
Semantic role labeling is an important stage in systems for Natural Language Understanding. The basic problem is one of identifying who did what to whom for each predicate in a sentence. Thus labeling is a two-step process: identify constituent phrases that are arguments to a predicate, then label those arguments with appropriate thematic roles. Existing systems for semantic role labeling use machine learning methods to assign roles one-at-a-time to candidate arguments. There are several drawbacks to this general approach. First, more than one candidate can be assigned the same role, which is undesirable. Second, the search for each candidate argument is exponential with respect to the number of words in the sentence. Third, single-role assignment cannot take advantage of dependencies known to exist between semantic roles of predicate arguments, such as their relative juxtaposition. And fourth, execution times for existing algorithm are excessive, making them unsuitable for real-time use. This thesis seeks to obviate these problems by approaching semantic role labeling as a multi-argument classification process. It observes that the only valid arguments to a predicate are unembedded constituent phrases that do not overlap that predicate. Given that semantic role labeling occurs after parsing, this thesis proposes an algorithm that systematically traverses the parse tree when looking for arguments, thereby eliminating the vast majority of impossible candidates. Moreover, instead of assigning semantic roles one at a time, an algorithm is proposed to assign all labels simultaneously; leveraging dependencies between roles and eliminating the problem of duplicate assignment. Experimental results are provided as evidence to show that a combination of the proposed argument identification and multi-argument classification algorithms outperforms all existing systems that use the same syntactic information.


An Empirical Comparison of Exact Nearest Neighbour Algorithms

Ashraf M. Kibriya and Eibe Frank. An empirical comparison of exact nearest neighbour algorithms. In Proc 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, Warsaw, Poland, pages 140-151. Springer, 2007.
[ bib | .pdf ]
Abstract. Nearest neighbour search (NNS) is an old problem that is of practical importance in a number of fields. It involves finding, for a given point q, called the query, one or more points from a given set of points that are nearest to the query q. Since the initial inception of the problem a great number of algorithms and techniques have been proposed for its solution. However, it remains the case that many of the proposed algorithms have not been compared against each other on a wide variety of datasets. This research attempts to fill this gap to some extent by presenting a detailed empirical comparison of three prominent data structures for exact NNS: KD-Trees, Metric Trees, and Cover Trees. Our results suggest that there is generally little gain in using Metric Trees or Cover Trees instead of KD-Trees for the standard NNS problem.


Effective Classifiers for Detecting Objects

Michael Mayo. Effective classifiers for detecting objects. In Proc. of the Fourth International Conference on Computational Intelligence, Robotics, and Autonomous Systems (CIRAS '07), 2007.
[ bib | .pdf ]
Several state-of-the-art machine learning classifiers are compared for the purposes of object detection in complex images, using global image features derived from the Ohta color space and Local Binary Patterns. Image complexity in this sense refers to the degree to which the target objects are occluded and/or non-dominant (i.e. not in the foreground) in the image, and also the degree to which the images are cluttered with non-target objects. The results indicate that a voting ensemble of Support Vector Machines, Random Forests, and Boosted Decision Trees provide the best performance with AUC values of up to 0.92 and Equal Error Rate accuracies of up to 85.7% in stratified 10-fold cross validation experiments on the GRAZ02 complex image dataset.


Random Convolution Ensembles

Michael Mayo. Random convolution ensembles. In Hi Shing Ip et. al, editor, Advances in Multimedia Information Processing - PCM 2007, 8th Pacific Rim Conference on Multimedia, Lecture Notes in Computer Science 4810, pages 216-225. Springer, 2007.
[ bib | .pdf ]
A novel method for creating diverse ensembles of image classifiers is proposed. The idea is that, for each base image classifier in the ensemble, a random image transformation is generated and applied to all of the images in the labeled training set. The base classifiers are then learned using features extracted from these randomly transformed versions of the training data, and the result is a highly diverse ensemble of image classifiers. This approach is evaluated on a benchmark pedestrian detection dataset and shown to be effective.


A discriminative approach to structured biological data

S. Mutter and B. Pfahringer. A discriminative approach to structured biological data. In Proc NZCSRSC'07, the Fifth New Zealand Computer Science Research Student Conference, Hamilton, New Zealand, April 2007.
[ bib | .pdf ]
This paper introduces the first author's PhD pro ject which has just got out of its initial stage. Biological sequence data is, on the one hand, highly structured. On the other hand there are large amounts of unlabelled data. Thus we combine probabilistic graphical models and semi-supervised learning. The former to handle structured data and the latter to deal with unlabelled data. We apply our models to genotype- phenotype modelling problems. In particular we predict the set of Single Nucleotide Polymorphisms which underlie a specific phenotypical trait.


Scaling Up Semi-supervised Learning: An Efficient and Effective LLGC Variant

Bernhard Pfahringer, Claire Leschi, and Peter Reutemann. Scaling up semi-supervised learning: An efficient and effective llgc variant. In Proc 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Nanjing, China, pages 236-247. Springer, 2007.
[ bib | http ]
Domains like text classification can easily supply large amounts of unlabeled data, but labeling itself is expensive. Semi- supervised learning tries to exploit this abundance of unlabeled training data to improve classification. Unfortunately most of the theoretically well-founded algorithms that have been described in recent years are cubic or worse in the total number of both labeled and unlabeled training examples. In this paper we apply modifications to the standard LLGC algorithm to improve efficiency to a point where we can handle datasets with hundreds of thousands of training data. The modifications are priming of the unlabeled data, and most importantly, sparsification of the similarity matrix. We report promising results on large text classification problems.


New Options for Hoeffding Trees

Bernhard Pfahringer, Geoffrey Holmes, and Richard Kirkby. New options for hoeffding trees. In Proc 20th Australian Conference on Artificial Intelligence, Gold Coast, Australia, pages 90-99. Springer, 2007.
[ bib | http ]
Hoeffding trees are state-of-the-art for processing high-speed data streams. Their ingenuity stems from updating sufficient statistics, only addressing growth when decisions can be made that are guaranteed to be almost identical to those that would be made by conventional batch learning methods. Despite this guarantee, decisions are still subject to limited lookahead and stability issues. In this paper we explore Hoeffding Option Trees, a regular Hoeffding tree containing additional option nodes that allow several tests to be applied, leading to multiple Hoeffding trees as separate paths. We show how to control tree growth in order to generate a mixture of paths, and empirically determine a reasonable number of paths. We then empirically evaluate a spectrum of Hoeffding tree variations: single trees, option trees and bagged trees. Finally, we investigate pruning. We show that on some datasets a pruned option tree can be smaller and more accurate than a single tree.


Clustering Relational Data Based on Randomized Propositionalization

Grant Anderson and Bernhard Pfahringer. Clustering relational data based on randomized propositionalization. In Proc 17th International Conference on Inductive Logic Programming, Corvallis, OR, pages 39-48. Springer, 2007.
[ bib | http ]
Clustering of relational data has so far received a lot less attention than classification of such data. In this paper we investigate a simple approach based on randomized propositionalization, which allows for applying standard clustering algorithms like KMeans to multi-relational data. We describe how random rules are generated and then turned into boolean-valued features. Clustering generally is not straightforward to evaluate, but preliminary experimental results on a number of standard ILP datasets show promising results. Clusters generated without class information usually agree well with the true class labels of cluster members, i.e. class distributions inside clusters generally differ significantly from the global class distributions. The two-tiered algorithm described shows good scalability due to the randomized nature of the first step and the availability of efficient propositional clustering algorithms for the second step.


Clustering for Classification

Reuben Evans. Clustering for classification. Master's thesis, Department of Computer Science, University of Waikato, 2007.
[ bib | http ]
Advances in technology have provided industry with an array of devices for collecting data. The frequency and scale of data collection means that there are now many large datasets being generated. To find patterns in these datasets it would be useful to be able to apply modern methods of classification such as support vector machines. Unfortunately these methods are computationally expensive, quadratic in the number of data points in fact, so cannot be applied directly. This thesis proposes a framework whereby a variety of clustering methods can be used to summarise datasets, that is, reduce them to a smaller but still representative dataset so that these advanced methods can be applied. It compares the results of using this framework against using random selection on a large number of classification and regression problems. Results show that the clustered datasets are on average fifty percent smaller than the original datasets without loss of classification accuracy which is significantly better than random selection. They also show that there is no free lunch, for each dataset it is important to choose a clustering method carefully.


Fast Algorithms for Nearest Neighbour Search

Ashraf Masood Kibriya. Fast algorithms for nearest neighbour search. Master's thesis, Department of Computer Science, University of Waikato, 2007.
[ bib | http ]
The nearest neighbour problem is of practical significance in a number of fields. Often we are interested in finding an object near to a given query object. The problem is old, and a large number of solutions have been proposed for it in the literature. However, it remains the case that even the most popular of the techniques proposed for its solution have not been compared against each other. Also, many techniques, including the old and popular ones, can be implemented in a number of ways, and often the different implementations of a technique have not been thoroughly compared either. This research presents a detailed investigation of different implementations of two popular nearest neighbour search data structures, KDTrees and Metric Trees, and compares the different implementations of each of the two structures against each other. The best implementations of these structures are then compared against each other and against two other techniques, Annulus Method and Cover Trees. Annulus Method is an old technique that was rediscovered during the research for this thesis. Cover Trees are one of the most novel and promising data structures for nearest neighbour search that have been proposed in the literature.


Effective Linear-Time Feature Selection

Nripendra Pradhananga. Effective linear-time feature selection. Master's thesis, Department of Computer Science, University of Waikato, 2007.
[ bib | http ]
The classification learning task requires selection of a subset of features to represent patterns to be classified. This is because the performance of the classifier and the cost of classification are sensitive to the choice of the features used to construct the classifier. Exhaustive search is impractical since it searches every possible combination of features. The runtime of heuristic and random searches are better but the problem still persists when dealing with high-dimensional datasets. We investigate a heuristic, forward, wrapper-based approach, called Linear Sequential Selection, which limits the search space at each iteration of the feature selection process. We introduce randomization in the search space. The algorithm is called Randomized Linear Sequential Selection. Our experiments demonstrate that both methods are faster, find smaller subsets and can even increase the classification accuracy. We also explore the idea of ensemble learning. We have proposed two ensemble creation methods, Feature Selection Ensemble and Random Feature Ensemble. Both methods apply a feature selection algorithm to create individual classifiers of the ensemble. Our experiments have shown that both methods work well with high-dimensional data.


Best-first Decision Tree Learning

Haijian Shi. Best-first decision tree learning. Master's thesis, Department of Computer Science, University of Waikato, 2007.
[ bib | http ]
In best-first top-down induction of decision trees, the best split is added in each step (e.g. the split that maximally reduces the Gini index). This is in contrast to the standard depth-first traversal of a tree. The resulting tree will be the same, just how it is built is different. The objective of this project is to investigate whether it is possible to determine an appropriate tree size on practical datasets by combining best-first decision tree growth with cross-validation-based selection of the number of expansions that are performed. Pre-pruning, post-pruning, CART-pruning can be performed this way to compare.


Prediction Intervals for Class Probabilities

Xiaofeng Yu. Prediction intervals for class probabilities. Master's thesis, Department of Computer Science, University of Waikato, 2007.
[ bib | http ]
Prediction intervals for class probabilities are of interest in machine learning because they can quantify the uncertainty about the class probability estimate for a test instance. The idea is that all likely class probability values of the test instance are included, with a pre-specified confidence level, in the calculated prediction interval. This thesis proposes a probabilistic model for calculating such prediction intervals. Given the unobservability of class probabilities, a Bayesian approach is employed to derive a complete distribution of the class probability of a test instance based on a set of class observations of training instances in the neighbourhood of the test instance. A random decision tree ensemble learning algorithm is also proposed, whose prediction output constitutes the neighbourhood that is used by the Bayesian model to produce a PI for the test instance. The Bayesian model, which is used in conjunction with the ensemble learning algorithm and the standard nearest-neighbour classifier, is evaluated on artificial datasets and modified real datasets.