[1]
|
Improving Hoeffding Trees
Richard Kirkby.
Improving Hoeffding Trees.
PhD thesis, Department of Computer Science, University of Waikato,
2007.
[ bib |
http ]
Modern information technology allows information to be
collected at a far greater rate than ever before. So fast, in fact,
that the main problem is making sense of it all. Machine learning
offers promise of a solution, but the field mainly focusses on
achieving high accuracy when data supply is limited. While this has
created sophisticated classification algorithms, many do not cope
with increasing data set sizes. When the data set sizes get to a
point where they could be considered to represent a continuous
supply, or data stream, then incremental classification algorithms
are required. In this setting, the effectiveness of an algorithm
cannot simply be assessed by accuracy alone. Consideration needs to
be given to the memory available to the algorithm and the speed at
which data is processed in terms of both the time taken to predict
the class of a new data sample and the time taken to include this
sample in an incrementally updated classification model. The
Hoeffding tree algorithm is a state-of-the-art method for inducing
decision trees from data streams. The aim of this thesis is to
improve this algorithm. To measure improvement, a comprehensive
framework for evaluating the performance of data stream algorithms
is developed. Within the framework, memory size is fixed in order to
simulate realistic application scenarios. To simulate continuous
operation, classes of synthetic data are generated, providing an
evaluation on a large scale. Improvements to many
aspects of the Hoeffding tree algorithm are demonstrated. First, a
number of methods for handling continuous numeric features are
compared. Second, tree prediction strategy is investigated to
evaluate the utility of various methods. Finally, the possibility of
improving accuracy using ensemble methods is explored. The
experimental results provide meaningful comparisons of accuracy and
processing speeds between different modifications of the Hoeffding
tree algorithm under various memory limits. The study on numeric
attributes demonstrates that sacrificing accuracy for space at the
local level often results in improved global accuracy. The
prediction strategy shown to perform best adaptively chooses between
standard majority class and Naive Bayes prediction in the
leaves. The ensemble method investigation shows that combining trees
can be worthwhile, but only when sufficient memory is available, and
improvement is less likely than in traditional machine learning. In
particular, issues are encountered when applying the popular
boosting method to streams.
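The adaptive leaf-prediction strategy mentioned above can be illustrated
with a short sketch (a simplified stdlib-Python illustration of one way to
implement the idea, not the thesis's code): each leaf keeps track of how
often majority-class and naive Bayes prediction would have been correct on
the training instances it has already seen, and answers queries with
whichever predictor has the better record.

    # Sketch: a Hoeffding-tree leaf that adaptively chooses between majority-class
    # and naive Bayes prediction, based on their observed accuracy at this leaf.
    # Simplified illustration only; class and method names are hypothetical.
    from collections import Counter, defaultdict

    class AdaptiveLeaf:
        def __init__(self):
            self.class_counts = Counter()            # class frequencies seen at this leaf
            self.feat_counts = defaultdict(Counter)  # per-feature (value, class) counts
            self.mc_correct = 0                      # majority-class hits so far
            self.nb_correct = 0                      # naive Bayes hits so far

        def _majority(self):
            return self.class_counts.most_common(1)[0][0]

        def _naive_bayes(self, x):
            total = sum(self.class_counts.values())
            best, best_score = None, float("-inf")
            for c, n in self.class_counts.items():
                score = n / total
                for i, v in enumerate(x):
                    score *= (self.feat_counts[i][(v, c)] + 1) / (n + 1)  # smoothed
                if score > best_score:
                    best, best_score = c, score
            return best

        def learn(self, x, y):
            # Score both predictors on the instance before updating the statistics.
            if self.class_counts:
                self.mc_correct += (self._majority() == y)
                self.nb_correct += (self._naive_bayes(x) == y)
            self.class_counts[y] += 1
            for i, v in enumerate(x):
                self.feat_counts[i][(v, y)] += 1

        def predict(self, x):
            # Use whichever predictor has been more accurate on this leaf's data.
            return self._naive_bayes(x) if self.nb_correct > self.mc_correct else self._majority()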
|
|
[2]
|
Racing algorithms for conditional independence inference
Remco R. Bouckaert and Milan Studeny.
Racing algorithms for conditional independence inference.
Int. J. Approx. Reasoning, 45(2):386-401, 2007.
[ bib |
.pdf ]
In this article, we consider the computational aspects of
deciding whether a conditional independence statement t is implied
by a list of conditional independence statements L using the
implication related to the method of structural imsets. We present
two methods with interesting complementary properties: one performs
well at proving that t is implied by L, while the other performs
well at proving that t is not implied by L. However, neither method
performs well at the opposite task. This
gives rise to a parallel algorithm in which both methods race against
each other in order to determine effectively whether t is or is not
implied.
Some empirical evidence is provided suggesting that this
racing-algorithms method performs considerably better than an existing method
based on so-called skeletal characterization of the respective
implication. Furthermore, unlike previous methods, the method is able
to handle more than five variables.
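The racing scheme itself is easy to sketch: run the two complementary
procedures concurrently and accept the answer of whichever one finishes
first. Below is a minimal Python illustration; prove_implied and
prove_not_implied are hypothetical stand-ins for the two methods, each
assumed to return only once it has established its claim.

    # Race two complementary decision procedures and take the first conclusive answer.
    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def prove_implied(t, L):
        ...   # hypothetical: terminates once it can show that t is implied by L

    def prove_not_implied(t, L):
        ...   # hypothetical: terminates once it can show that t is NOT implied by L

    def is_implied(t, L):
        pool = ThreadPoolExecutor(max_workers=2)
        futures = {pool.submit(prove_implied, t, L): True,
                   pool.submit(prove_not_implied, t, L): False}
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        answer = futures[done.pop()]   # the winner determines the answer
        pool.shutdown(wait=False)      # in practice the losing procedure would be stopped
        return answer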
|
|
[3]
|
Syntax-driven argument identification and multi-argument classification for semantic role labeling
Chi-San Althon Lin.
Syntax-driven argument identification and multi-argument
classification for semantic role labeling.
PhD thesis, Department of Computer Science, University of Waikato,
2007.
[ bib |
http ]
Semantic role labeling is an important stage in systems for Natural Language Understanding. The basic problem is one of identifying who did what to whom for each predicate in a sentence. Thus labeling is a two-step process: identify constituent phrases that are arguments to a predicate, then label those arguments with appropriate thematic roles. Existing systems for semantic role labeling use machine learning methods to assign roles one-at-a-time to candidate arguments. There are several drawbacks to this general approach. First, more than one candidate can be assigned the same role, which is undesirable. Second, the search for each candidate argument is exponential with respect to the number of words in the sentence. Third, single-role assignment cannot take advantage of dependencies known to exist between semantic roles of predicate arguments, such as their relative juxtaposition. And fourth, execution times for existing algorithms are excessive, making them unsuitable for real-time use. This thesis seeks to obviate these problems by approaching semantic role labeling as a multi-argument classification process. It observes that the only valid arguments to a predicate are unembedded constituent phrases that do not overlap that predicate. Given that semantic role labeling occurs after parsing, this thesis proposes an algorithm that systematically traverses the parse tree when looking for arguments, thereby eliminating the vast majority of impossible candidates. Moreover, instead of assigning semantic roles one at a time, an algorithm is proposed to assign all labels simultaneously, leveraging dependencies between roles and eliminating the problem of duplicate assignment. Experimental results are provided as evidence to show that a combination of the proposed argument identification and multi-argument classification algorithms outperforms all existing systems that use the same syntactic information.
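The pruning idea behind this style of argument identification can be
conveyed with a toy sketch (a simplified illustration using a home-made
tuple-based parse tree, not the thesis's algorithm): walking up from the
predicate node towards the root, the siblings of the nodes on that path are
the maximal constituents that neither contain nor are contained in the
predicate, so only they need to be considered as candidates.

    # Toy candidate-argument identification: collect the siblings of every node on
    # the path from the predicate up to the root.  Nodes are (label, children)
    # tuples, leaves are (label, word) -- a hypothetical representation.

    def path_to(node, target, path=()):
        if node is target:
            return path + (node,)
        label, children = node
        if isinstance(children, str):            # leaf node
            return None
        for child in children:
            found = path_to(child, target, path + (node,))
            if found:
                return found
        return None

    def candidate_arguments(root, predicate):
        path = path_to(root, predicate)
        candidates = []
        for parent, on_path in zip(path, path[1:]):
            _, children = parent
            candidates.extend(c for c in children if c is not on_path)
        return candidates

    # "The cat chased a mouse", predicate "chased":
    verb = ("VBD", "chased")
    tree = ("S", [("NP", [("DT", "The"), ("NN", "cat")]),
                  ("VP", [verb, ("NP", [("DT", "a"), ("NN", "mouse")])])])
    print([label for label, _ in candidate_arguments(tree, verb)])   # ['NP', 'NP']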
|
|
[4]
|
An Empirical Comparison of Exact Nearest Neighbour Algorithms
Ashraf M. Kibriya and Eibe Frank.
An empirical comparison of exact nearest neighbour algorithms.
In Proc 11th European Conference on Principles and Practice of
Knowledge Discovery in Databases, Warsaw, Poland, pages 140-151. Springer,
2007.
[ bib |
.pdf ]
Nearest neighbour search (NNS) is an old problem that is of
practical importance in a number of fields. It involves finding, for
a given point q, called the query, one or more points from a given set
of points that are nearest to the query q. Since the inception of
the problem, a great number of algorithms and techniques have been
proposed for its solution. However, it remains the case that many of the
proposed algorithms have not been compared against each other on a
wide variety of datasets. This research attempts to fill this gap to some
extent by presenting a detailed empirical comparison of three prominent
data structures for exact NNS: KD-Trees, Metric Trees, and Cover Trees.
Our results suggest that there is generally little gain in using Metric Trees
or Cover Trees instead of KD-Trees for the standard NNS problem.
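As a concrete illustration of the problem setting (not of the
implementations compared in the paper), a minimal exact nearest-neighbour
query with a k-d tree, checked against brute force, looks like this:

    # Exact 1-NN query with a k-d tree, verified against a brute-force scan.
    import numpy as np
    from scipy.spatial import KDTree

    rng = np.random.default_rng(0)
    points = rng.random((10_000, 3))     # reference set
    q = rng.random(3)                    # query point

    tree = KDTree(points)
    dist, idx = tree.query(q, k=1)       # exact nearest neighbour via the tree

    brute_idx = np.argmin(np.linalg.norm(points - q, axis=1))
    assert idx == brute_idx
    print(idx, dist)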
|
|
[5]
|
Effective Classifiers for Detecting Objects
Michael Mayo.
Effective classifiers for detecting objects.
In Proc. of the Fourth International Conference on Computational
Intelligence, Robotics, and Autonomous Systems (CIRAS '07), 2007.
[ bib |
.pdf ]
Several state-of-the-art machine learning classifiers are compared for the purposes of object detection in complex images, using global image features derived from the Ohta color space and Local Binary Patterns. Image complexity in this sense refers to the degree to which the target objects are occluded and/or non-dominant (i.e. not in the foreground) in the image, and also the degree to which the images are cluttered with non-target objects. The results indicate that a voting ensemble of Support Vector Machines, Random Forests, and Boosted Decision Trees provides the best performance, with AUC values of up to 0.92 and Equal Error Rate accuracies of up to 85.7% in stratified 10-fold cross-validation experiments on the GRAZ02 complex image dataset.
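Such a voting ensemble is straightforward to assemble; the sketch below
uses scikit-learn with random placeholder feature vectors rather than the
Ohta/LBP features used in the paper, and the parameter choices are
assumptions.

    # Soft-voting ensemble of an SVM, a random forest and boosted decision trees.
    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.random((200, 64))            # placeholder global image features
    y = rng.integers(0, 2, 200)          # placeholder binary labels (object present or not)

    ensemble = VotingClassifier(
        estimators=[("svm", SVC(probability=True)),
                    ("rf", RandomForestClassifier(n_estimators=100)),
                    ("boost", AdaBoostClassifier())],
        voting="soft")                   # average the predicted class probabilities

    print(cross_val_score(ensemble, X, y, cv=10, scoring="roc_auc").mean())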
|
|
[6]
|
Random Convolution Ensembles
Michael Mayo.
Random convolution ensembles.
In Hi Shing Ip et al., editors, Advances in Multimedia
Information Processing - PCM 2007, 8th Pacific Rim Conference on Multimedia,
Lecture Notes in Computer Science 4810, pages 216-225. Springer, 2007.
[ bib |
.pdf ]
A novel method for creating diverse ensembles of image classifiers is proposed. The idea is that, for each base image classifier in the ensemble, a random image transformation is generated and applied to all of the images in the labeled training set. The base classifiers are then learned using features extracted from these randomly transformed versions of the training data, and the result is a highly diverse ensemble of image classifiers. This approach is evaluated on a benchmark pedestrian detection dataset and shown to be effective.
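A minimal sketch of the idea (random 2-D convolution filters as the
per-member transformation, a simple scikit-learn base classifier, and
placeholder images; the filter size and base learner are assumptions rather
than the paper's choices):

    # Random-convolution ensemble: each member is trained on the images filtered
    # by its own random convolution kernel; prediction is by majority vote.
    import numpy as np
    from scipy.ndimage import convolve
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    images = rng.random((300, 16, 16))       # placeholder grey-level images
    y = rng.integers(0, 2, 300)              # placeholder labels

    def features(imgs, kernel):
        # Apply the member's random filter to every image, then flatten.
        return np.stack([convolve(im, kernel) for im in imgs]).reshape(len(imgs), -1)

    members = []
    for _ in range(10):                      # ten ensemble members
        kernel = rng.normal(size=(3, 3))     # this member's random transformation
        clf = LogisticRegression(max_iter=1000).fit(features(images, kernel), y)
        members.append((kernel, clf))

    def predict(img):
        votes = [clf.predict(features(img[None], k))[0] for k, clf in members]
        return max(set(votes), key=votes.count)

    print(predict(images[0]))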
|
|
[7]
|
A discriminative approach to structured biological data
S. Mutter and B. Pfahringer.
A discriminative approach to structured biological data.
In Proc NZCSRSC'07, the Fifth New Zealand Computer Science
Research Student Conference, Hamilton, New Zealand, April 2007.
[ bib |
.pdf ]
This paper introduces the first author's PhD project, which
has just moved beyond its initial stage. Biological sequence data is, on
the one hand, highly structured; on the other hand, there are large amounts
of unlabelled data. Thus we combine probabilistic graphical models and
semi-supervised learning: the former to handle structured data and the
latter to deal with unlabelled data. We apply our models to
genotype-phenotype modelling problems. In particular, we predict the set of
Single Nucleotide Polymorphisms which underlie a specific phenotypical trait.
|
|
[8]
|
Scaling Up Semi-supervised Learning: An Efficient and Effective LLGC Variant
Bernhard Pfahringer, Claire Leschi, and Peter Reutemann.
Scaling up semi-supervised learning: An efficient and effective LLGC
variant.
In Proc 11th Pacific-Asia Conference on Knowledge Discovery and
Data Mining, Nanjing, China, pages 236-247. Springer, 2007.
[ bib |
http ]
Domains like text classification can easily supply large amounts of unlabeled data, but labeling itself is expensive. Semi-supervised learning tries to exploit this abundance of unlabeled training data to improve classification. Unfortunately most of the theoretically well-founded algorithms that have been described in recent years are cubic or worse in the total number of both labeled and unlabeled training examples. In this paper we apply modifications to the standard LLGC algorithm to improve efficiency to a point where we can handle datasets with hundreds of thousands of training examples. The modifications are priming of the unlabeled data and, most importantly, sparsification of the similarity matrix. We report promising results on large text classification problems.
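The central modification, sparsifying the similarity matrix, can be
sketched as follows. This is a compact numpy illustration of the standard
LLGC label-propagation iteration over a symmetric k-nearest-neighbour
similarity graph; the parameter values are assumptions, and the dense
pairwise-distance computation is for readability only (avoiding exactly
that cost is the point of the paper).

    # LLGC with a sparsified (kNN) similarity matrix: F <- alpha*S*F + (1-alpha)*Y.
    import numpy as np

    def llgc(X, Y, k=10, sigma=1.0, alpha=0.99, iters=50):
        """X: (n,d) data; Y: (n,c) one-hot labels with all-zero rows for unlabeled points."""
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
        W = np.exp(-d2 / (2 * sigma ** 2))
        np.fill_diagonal(W, 0)
        # Sparsification: keep only each point's k most similar neighbours.
        mask = np.zeros_like(W, dtype=bool)
        np.put_along_axis(mask, np.argsort(-W, axis=1)[:, :k], True, axis=1)
        W = np.where(mask | mask.T, W, 0)
        d = W.sum(1)
        S = W / np.sqrt(np.outer(d, d) + 1e-12)               # D^{-1/2} W D^{-1/2}
        F = Y.astype(float).copy()
        for _ in range(iters):
            F = alpha * S @ F + (1 - alpha) * Y
        return F.argmax(1)

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
    Y = np.zeros((100, 2)); Y[0, 0] = 1; Y[99, 1] = 1         # one labelled point per class
    pred = llgc(X, Y)
    print(pred[:5], pred[-5:])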
|
|
[9]
|
New Options for Hoeffding Trees
Bernhard Pfahringer, Geoffrey Holmes, and Richard Kirkby.
New options for Hoeffding trees.
In Proc 20th Australian Conference on Artificial Intelligence,
Gold Coast, Australia, pages 90-99. Springer, 2007.
[ bib |
http ]
Hoeffding trees are state-of-the-art for processing high-speed data streams. Their ingenuity stems from updating sufficient statistics incrementally and only making growth decisions when those decisions are guaranteed to be almost identical to the ones that would be made by conventional batch learning methods. Despite this guarantee, decisions are still subject to limited lookahead and stability issues. In this paper we explore Hoeffding Option Trees: regular Hoeffding trees containing additional option nodes that allow several tests to be applied, leading to multiple Hoeffding trees as separate paths. We show how to control tree growth in order to generate a mixture of paths, and empirically determine a reasonable number of paths. We then empirically evaluate a spectrum of Hoeffding tree variations: single trees, option trees and bagged trees. Finally, we investigate pruning. We show that on some datasets a pruned option tree can be smaller and more accurate than a single tree.
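For reference, the guarantee referred to above rests on the Hoeffding
bound: after n observations of a quantity with range R, the observed mean
is within ε of the true mean with probability at least 1 - δ, where

    \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}

A leaf is only split once the observed advantage of the best split
candidate over the second best exceeds ε.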
|
|
[10]
|
Clustering Relational Data Based on Randomized Propositionalization
Grant Anderson and Bernhard Pfahringer.
Clustering relational data based on randomized propositionalization.
In Proc 17th International Conference on Inductive Logic
Programming, Corvallis, OR, pages 39-48. Springer, 2007.
[ bib |
http ]
Clustering of relational data has so far received a lot less attention than classification of such data. In this paper we investigate a simple approach based on randomized propositionalization, which allows for applying standard clustering algorithms like KMeans to multi-relational data. We describe how random rules are generated and then turned into boolean-valued features. Clustering generally is not straightforward to evaluate, but preliminary experimental results on a number of standard ILP datasets are promising. Clusters generated without class information usually agree well with the true class labels of cluster members, i.e. class distributions inside clusters generally differ significantly from the global class distributions. The two-tiered algorithm described shows good scalability due to the randomized nature of the first step and the availability of efficient propositional clustering algorithms for the second step.
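The two-tiered scheme is easy to sketch once the random rules exist; the
substantive part, generating random first-order rules over relational data,
is only stubbed below, and the clustering step uses scikit-learn's KMeans
as the propositional clusterer.

    # Randomized propositionalization: random rules become boolean features,
    # then an ordinary propositional clusterer is run on those features.
    import numpy as np
    from sklearn.cluster import KMeans

    def random_rules(n_rules, rng):
        # Stub: each "rule" is a hypothetical boolean test on an example.
        # In the paper these are randomly generated relational (first-order) rules.
        return [lambda ex, t=t: ex["score"] > t for t in rng.random(n_rules)]

    def propositionalize(examples, rules):
        # One boolean feature per rule: does the rule cover the example?
        return np.array([[rule(ex) for rule in rules] for ex in examples], dtype=float)

    rng = np.random.default_rng(0)
    examples = [{"score": rng.random()} for _ in range(100)]   # placeholder data
    X = propositionalize(examples, random_rules(50, rng))
    clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)
    print(np.bincount(clusters))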
|
|
[11]
|
Clustering for Classification
Reuben Evans.
Clustering for classification.
Master's thesis, Department of Computer Science, University of
Waikato, 2007.
[ bib |
http ]
Advances in technology have provided industry with an array of devices
for collecting data. The frequency and scale of data collection means
that there are now many large datasets being generated. To find
patterns in these datasets it would be useful to be able to apply
modern methods of classification such as support vector
machines. Unfortunately these methods are computationally expensive,
quadratic in the number of data points in fact, so cannot be applied
directly. This thesis proposes a framework whereby a variety of
clustering methods can be used to summarise datasets, that is, reduce
them to a smaller but still representative dataset so that these
advanced methods can be applied. It compares the results of using this
framework against using random selection on a large number of
classification and regression problems. Results show that the
clustered datasets are on average fifty percent smaller than the
original datasets without loss of classification accuracy, which is
significantly better than random selection. They also show that there
is no free lunch: for each dataset it is important to choose a
clustering method carefully.
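One possible instantiation of the framework can be sketched as follows
(k-means centroids labelled by the majority class of their members stand in
for the reduced dataset, and an SVM is trained on that summary; the
clusterer, cluster count and learner are assumptions, not the thesis's
choices).

    # Summarise a large training set by clustering, then train an expensive
    # learner (an SVM) on the cluster centres labelled by majority vote.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def reduce_by_clustering(X, y, n_clusters):
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
        centres = km.cluster_centers_
        labels = np.array([np.bincount(y[km.labels_ == c]).argmax()
                           for c in range(n_clusters)])        # majority class per cluster
        return centres, labels

    rng = np.random.default_rng(0)
    X = rng.random((10_000, 10))
    y = (X[:, 0] + X[:, 1] > 1).astype(int)                    # placeholder labels

    Xs, ys = reduce_by_clustering(X, y, n_clusters=250)        # 2.5% of the original size
    model = SVC().fit(Xs, ys)                                  # the SVM sees only 250 points
    print(model.score(X, y))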
|
|
[12]
|
Fast Algorithms for Nearest Neighbour Search
Ashraf Masood Kibriya.
Fast algorithms for nearest neighbour search.
Master's thesis, Department of Computer Science, University of
Waikato, 2007.
[ bib |
http ]
The nearest neighbour problem is of practical significance
in a number of fields. Often we are interested in finding an object
near to a given query object. The problem is old, and a large number
of solutions have been proposed for it in the literature. However, it
remains the case that even the most popular of the techniques proposed
for its solution have not been compared against each other. Also, many
techniques, including the old and popular ones, can be implemented in
a number of ways, and often the different implementations of a
technique have not been thoroughly compared either. This research
presents a detailed investigation of different implementations of two
popular nearest neighbour search data structures, KDTrees and Metric
Trees, and compares the different implementations of each of the two
structures against each other. The best implementations of these
structures are then compared against each other and against two other
techniques, Annulus Method and Cover Trees. Annulus Method is an old
technique that was rediscovered during the research for this
thesis. Cover Trees are one of the most novel and promising data
structures for nearest neighbour search that have been proposed in the
literature.
|
|
[13]
|
Effective Linear-Time Feature Selection
Nripendra Pradhananga.
Effective linear-time feature selection.
Master's thesis, Department of Computer Science, University of
Waikato, 2007.
[ bib |
http ]
The classification learning task requires selection of a
subset of features to represent patterns to be classified. This is
because the performance of the classifier and the cost of
classification are sensitive to the choice of the features used to
construct the classifier. Exhaustive search is impractical since it
searches every possible combination of features. The runtimes of
heuristic and random searches are better, but the problem still
persists when dealing with high-dimensional datasets. We investigate a
heuristic, forward, wrapper-based approach, called Linear Sequential
Selection, which limits the search space at each iteration of the
feature selection process. We introduce randomization in the search
space. The algorithm is called Randomized Linear Sequential
Selection. Our experiments demonstrate that both methods are faster,
find smaller subsets and can even increase the classification
accuracy. We also explore the idea of ensemble learning and propose
two ensemble creation methods, Feature Selection Ensemble and
Random Feature Ensemble. Both methods apply a feature selection
algorithm to create individual classifiers of the ensemble. Our
experiments have shown that both methods work well with
high-dimensional data.
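For contrast, a plain forward wrapper search, the unrestricted baseline
that Linear Sequential Selection limits, can be sketched as follows;
cross-validated accuracy serves as the wrapper evaluation, and the base
learner and dataset are placeholders rather than those used in the thesis.

    # Forward wrapper-based feature selection: repeatedly add the single feature
    # whose inclusion yields the best cross-validated accuracy.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    def forward_select(X, y, estimator, max_features):
        selected, best_score = [], -np.inf
        while len(selected) < max_features:
            scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=5).mean()
                      for f in range(X.shape[1]) if f not in selected}
            f_best = max(scores, key=scores.get)
            if scores[f_best] <= best_score:       # stop when no feature improves the score
                break
            selected.append(f_best)
            best_score = scores[f_best]
        return selected, best_score

    rng = np.random.default_rng(0)
    X = rng.random((300, 20))
    y = (X[:, 3] + X[:, 7] > 1).astype(int)        # only features 3 and 7 matter
    print(forward_select(X, y, GaussianNB(), max_features=5))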
|
|
[14]
|
Best-first Decision Tree Learning
Haijian Shi.
Best-first decision tree learning.
Master's thesis, Department of Computer Science, University of
Waikato, 2007.
[ bib |
http ]
In best-first top-down induction of decision trees, the
best split is added in each step (e.g. the split that maximally
reduces the Gini index). This is in contrast to the standard
depth-first traversal of a tree. The resulting tree is the same; only
the order in which it is built differs. The objective of this project is to
investigate whether it is possible to determine an appropriate tree
size on practical datasets by combining best-first decision tree
growth with cross-validation-based selection of the number of
expansions that are performed. Pre-pruning, post-pruning, and
CART-style pruning can be performed in this way and compared.
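The selection criterion can be made concrete with a short sketch: Gini
impurity, the impurity reduction of a candidate binary split, and a
max-heap that yields the best-first expansion order over open leaves (a toy
illustration, not the thesis's implementation).

    # Gini impurity, split gain, and best-first ordering of candidate expansions.
    import heapq
    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def split_gain(labels, left, right):
        n = len(labels)
        return gini(labels) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

    # Two open leaves, each with its best candidate split (labels, left, right):
    leaf_a = (["+"] * 8 + ["-"] * 2, ["+"] * 8, ["-"] * 2)                 # pure split
    leaf_b = (["+"] * 5 + ["-"] * 5, ["+"] * 4 + ["-"], ["+"] + ["-"] * 4)
    heap = []
    for name, (labels, left, right) in [("A", leaf_a), ("B", leaf_b)]:
        heapq.heappush(heap, (-split_gain(labels, left, right), name))
    print(heapq.heappop(heap))   # the leaf with the largest Gini reduction expands first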
|
|
[15]
|
Prediction Intervals for Class Probabilities
Xiaofeng Yu.
Prediction intervals for class probabilities.
Master's thesis, Department of Computer Science, University of
Waikato, 2007.
[ bib |
http ]
Prediction intervals for class probabilities are of
interest in machine learning because they can quantify the uncertainty
about the class probability estimate for a test instance. The idea is
that all likely class probability values of the test instance are
included, with a pre-specified confidence level, in the calculated
prediction interval. This thesis proposes a probabilistic model for
calculating such prediction intervals. Given the unobservability of
class probabilities, a Bayesian approach is employed to derive a
complete distribution of the class probability of a test instance
based on a set of class observations of training instances in the
neighbourhood of the test instance. A random decision tree ensemble
learning algorithm is also proposed, whose prediction output
constitutes the neighbourhood that is used by the Bayesian model to
produce a prediction interval for the test instance. The Bayesian model, which is used
in conjunction with the ensemble learning algorithm and the standard
nearest-neighbour classifier, is evaluated on artificial datasets and
modified real datasets.
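The flavour of such an interval can be conveyed with a simple special case
(a Beta posterior over the class probability, computed from the class
observations among a test instance's neighbours under a uniform prior; the
thesis's actual model is more elaborate than this sketch).

    # Credible interval for a class probability from k neighbourhood observations,
    # using a uniform Beta(1, 1) prior over the unknown probability.
    from scipy.stats import beta

    def class_probability_interval(n_positive, n_total, confidence=0.95):
        a = 1 + n_positive                    # posterior: Beta(1 + successes, 1 + failures)
        b = 1 + (n_total - n_positive)
        lo, hi = beta.ppf([(1 - confidence) / 2, (1 + confidence) / 2], a, b)
        return lo, hi

    # 7 of 10 neighbours carry the positive class:
    print(class_probability_interval(7, 10))  # lower and upper bound of the 95% interval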
|
|