[1]
|
A novel two stage scheme utilizing the test set for model selection in text classification
Bernhard Pfahringer, Peter Reutemann, and Mike Mayo.
A novel two stage scheme utilizing the test set for model selection
in text classification.
In Ranadhir Ghosh, Brijesh Verma, and Xue Li, editors, Proc
Workshop on Learning Algorithms for Pattern Recognition, Eighteenth
Australian Joint Conference on Artificial Intelligence (AI'05), pages
60-65, Sydney, Australia, 2005. University of Technology.
5-9 December 2005.
[ bib |
.pdf ]
Text classification is a natural application domain for semi-
supervised learning, as labeling documents is expensive, but
on the other hand usually an abundance of unlabeled
documents is available. We describe a novel simple two-
stage scheme based on dagging which allows for utilizing
the test set in model selection. The dagging ensemble can
also be used by itself instead of the original classifier. We
evaluate the performance of a meta classifier choosing
between various base learners and their respective dagging
ensembles. The selection process seems to perform robustly
especially for small percentages of available labels for
training.
|
|
[2]
|
Tie Breaking in Hoeffding trees
Geoffrey Holmes, Richard Kirkby, and Bernhard Pfahringer.
Tie breaking in hoeffding trees.
In J. Gama and J. S. Aguilar-Ruiz, editors, Proc Workshop W6:
Second International Workshop on Knowledge Discovery in Data Streams, pages
107-116, 2005.
[ bib ]
|
|
[3]
|
Cache hierarchy inspired compression: a novel architecture for data streams
Geoffrey Holmes, Bernhard Pfahringer, and Richard Kirkby.
Cache hierarchy inspired compression: a novel architecture for data
streams.
In Narayanan Kulathuramaiyer, Alvin W. Yeo, Wang Yin Chai, and
Tan Chong Eng, editors, Proc Fourth International Conference on
Information Technology in Asia (CITA'05), pages 130-136, 2005.
12-15 December 2005.
[ bib |
.pdf ]
We present an architecture for data streams
based on structures typically found in web cache
hierarchies. The main idea is to build a meta level
analyser from a number of levels constructed over time
from a data stream. We present the general architecture
for such a system and an application to classification.
This architecture is an instance of the general wrapper
idea allowing us to reuse standard batch learning
algorithms in an inherently incremental learning
environment. By artificially generating data sources we
demonstrate that a hierarchy containing a mixture of
models is able to adapt over time to the source of the
data. In these experiments the hierarchies use an
elementary performance based replacement policy and
unweighted voting for making classification decisions.
|
|
[4]
|
Inductive Logic Programming, 15th International Conference,
ILP 2005, Bonn, Germany, August 10-13, 2005, Proceedings
Stefan Kramer and Bernhard Pfahringer, editors.
Inductive Logic Programming, 15th International Conference, ILP
2005, Bonn, Germany, August 10-13, 2005, Proceedings, volume 3625 of
Lecture Notes in Computer Science. Springer, 2005.
[ bib ]
|
|
[5]
|
Unsupervised Discretization Using Tree-Based Density Estimation
Gabi Schmidberger and Eibe Frank.
Unsupervised discretization using tree-based density estimation.
In Proc 9th European Conference on Principles and Practice of
Knowledge Discovery in Databases, Porto, Portugal, pages 240-251. Springer,
2005.
[ bib |
.ps.gz |
.pdf ]
This paper presents an unsupervised discretization method
that performs density estimation for univariate data. The subintervals
that the discretization produces can be used as the bins of a histogram.
Histograms are a very simple and broadly understood means for display-
ing data, and our method automatically adapts bin widths to the data.
It uses the log-likelihood as the scoring function to select cut points and
the cross-validated log-likelihood to select the number of intervals. We
compare this method with equal-width discretization where we also se-
lect the number of bins using the cross-validated log-likelihood and with
equal-frequency discretization.
|
|
[6]
|
Ensembles of Balanced Nested Dichotomies for Multi-class
Problems
Lin Dong, Eibe Frank, and Stefan Kramer.
Ensembles of balanced nested dichotomies for multi-class problems.
In Proc 9th European Conference on Principles and Practice of
Knowledge Discovery in Databases, Porto, Portugal, pages 84-95. Springer,
2005.
[ bib |
.ps.gz |
.pdf ]
A system of nested dichotomies is a hierarchical
decomposition of a multi-class problem with c classes into c - 1
two-class problems and can be represented as a tree
structure. Ensembles of randomly-generated nested dichotomies have
proven to be an effective approach to multi-class learning problems
[1]. However, sampling trees by giving each tree equal probability
means that the depth of a tree is limited only by the number of
classes, and very unbalanced trees can negatively affect runtime. In
this paper we investigate two approaches to building balanced nested
dichotomies-class-balanced nested dichotomies and data-balanced nested
dichotomies-and evaluate them in the same ensemble setting. Using
C4.5 decision trees as the base models, we show that both approaches
can reduce runtime with little or no effect on accuracy, especially
on problems with many classes. We also investigate the effect of
caching models when building ensembles of nested dichotomies.
|
|
[7]
|
Speeding Up Logistic Model Tree Induction
Marc Sumner, Eibe Frank, and Mark A. Hall.
Speeding up logistic model tree induction.
In Proc 9th European Conference on Principles and Practice of
Knowledge Discovery in Databases, Porto, Portugal, pages 675-683. Springer,
2005.
[ bib |
.ps.gz |
.pdf ]
Logistic Model Trees have been shown to be very accurate
and compact classifiers [8]. Their greatest disadvantage is the computa-
tional complexity of inducing the logistic regression models in the tree.
We address this issue by using the AIC criterion [1] instead of cross-
validation to prevent overfitting these models. In addition, a weight trim-
ming heuristic is used which produces a significant speedup. We compare
the training time and accuracy of the new induction process with the
original one on various datasets and show that the training time often
decreases while the classification accuracy diminishes only slightly.
|
|
[8]
|
Stress-Testing Hoeffding Trees
Geoffrey Holmes, Richard Kirkby, and Bernhard Pfahringer.
Stress-testing hoeffding trees.
In Proc 9th European Conference on Principles and Practice of
Knowledge Discovery in Databases, Porto, Portugal, pages 495-502. Springer,
2005.
[ bib ]
|
|
[9]
|
Logistic Model Trees
Niels Landwehr, Mark Hall, and Eibe Frank.
Logistic model trees.
Machine Learning, 59(1-2):161-205, 2005.
[ bib |
.ps.gz |
.pdf ]
Tree induction methods and linear models are popular techniques for supervised learning tasks, both for the prediction of nominal classes and numeric values. For predicting numeric quantities, there has been work on combining these two schemes into 'model trees', i.e. trees that contain linear regression functions at the leaves. In this paper, we present an algorithm that adapts this idea for classification problems, using logistic regression instead of linear regression. We use a stagewise fitting process to construct the logisitic regression models that can select relevant attributes in the data in a natural way, and show how this approach can be used to build the logistic regression models at the leaves by incrementally refining those constructed at higher levels in the tree. We compare the performance of our algorithm to several other state-of-the-art learning schemes on 36 benchmark UCI datasets, and show that it produces accurate and compact classifiers.
|
|
[10]
|
Gene selection from microarray data for cancer classification
- a machine learning approach
Yu Wang, Igor V. Tetko, Mark A. Hall, Eibe Frank, Axel Facius, Klaus F. X.
Mayer, and Hans-Werner Mewes.
Gene selection from microarray data for cancer classification - a
machine learning approach.
Computational Biology and Chemistry, 29(1):37-46, 2005.
[ bib |
http ]
|
|
[11]
|
Kea: Practical automatic keyphrase extraction
Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G.
Nevill-Manning.
Kea: Practical automatic keyphrase extraction.
In Y.-L. Theng and S. Foo, editors, Design and Usability of
Digital Libraries: Case Studies in the Asia Pacific, pages 129-152.
Information Science Publishing, London, 2005.
[ bib |
.ps.gz |
.pdf ]
Keyphrases provide semantic metadata that summarize and characterize documents. This chapter describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are good keyphrases. The machine-learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large text corpus to evaluate Kea's effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and available under the GNU General Public License; the chapter gives instructions for use.
|
|
[12]
|
Data Mining: Practical Machine Learning Tools and Techniques
Ian H. Witten and Eibe Frank.
Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann, San Francisco, 2 edition, 2005.
[ bib |
.html ]
|
|
[13]
|
WEKA - A Machine Learning Workbench for Data Mining
Eibe Frank, Mark A. Hall, Geoffrey Holmes, Richard Kirkby, Bernhard Pfahringer,
Ian H. Witten, and Leonhard Trigg.
Weka - a machine learning workbench for data mining.
In Oded Maimon and Lior Rokach, editors, The Data Mining and
Knowledge Discovery Handbook, pages 1305-1314. Springer, 2005.
[ bib |
.ps.gz |
.pdf ]
The Weka workbench is an organized collection of state-of-the-art machine learning algorithms and data preprocessing tools. The basic way
of interacting with these methods is by invoking them from the com-
mand line. However, convenient interactive graphical user interfaces are
provided for data exploration, for setting up large-scale experiments on
distributed computing platforms, and for designing configurations for
streamed data processing. These interfaces constitute an advanced en-
vironment for experimental data mining. The system is written in Java
and distributed under the terms of the GNU General Public License.
|
|
[14]
|
Semantic Role-Labeling via Consensus Pattern-Matching
Chi-San Lin and Tony C. Smith.
Semantic role-labeling via consensus pattern-matching.
In Proceedings of the Ninth Conference on Computational Natural
Language Learning, pages 185-188, Ann Arbor, Michigan, June 2005.
[ bib |
.pdf ]
This paper describes a system for semantic
role labeling for the CoNLL2005 Shared
task. We divide the task into two sub-tasks:
boundary recognition by a general tree-
based predicate-argument recognition algo-
rithm to convert a parse tree into a flat rep-
resentation of all predicates and their
related boundaries, and role labeling by a
consensus model using a pattern-matching
framework to find suitable roles for core
constituents and adjuncts. We describe the
system architecture and report results for
the CoNLL2005 development dataset.
|
|
[15]
|
Racing for Conditional Independence Inference
Remco R. Bouckaert and Milan Studeny.
Racing for conditional independence inference.
In Symbolic and Quantitative Approaches to Reasoning with
Uncertainty, 8th European Conference, ECSQARU 2005, volume 3571 of
LNCS, pages 221-232, 2005.
[ bib |
.pdf ]
In this article, we consider the computational aspects of deciding
whether a conditional independence statement t is implied by a list
of conditional independence statements L using the implication related
to the method of structural imsets.
We present two methods which have the interesting complementary properties
that one method performs well to prove that t is implied by L, while the
other performs well to prove that t is not implied by L. However, both
methods do not perform well the opposite. This gives rise to a parallel
algorithm in which both methods race against each other in order to
determine effectively whether t is or is not implied.
Some empirical evidence is provided that suggest this racing algorithms
method performs a lot better than an existing method based on so-called
skeletal characterization of the respective implication.
Furthermore, the method is able to handle more than five variables.
|
|
[16]
|
Low Replicability of Machine Learning Experiments is not a Small Data Set Phenomenon
R. R. Bouckaert.
Low replicability of machine learning experiments is not a small data
set phenomenon.
In Proceedings of the ICML'05 Workshop on Meta-Learning, 2005.
[ bib |
.pdf ]
This paper investigates the relation between replicability of
experiments
for deciding which of two algorithms performs better on a given data set.
We prove that lack of replicability is not just a small data phenomenon
(as was shown before),
but is present in experiments on medium and large data sets as well.
We establish intuition in the relation between data set size, power and
replicability.
The main method for improving replicability is to increase the number of
samples. For large data sets and/or inefficient learning algorithms, this
implies that exerperiments may take a long time to completion.
We propose a procedure for deciding which of two learning algorithms is
best that has a high replicability but takes moderate computational effort.
|
|
[17]
|
Bayesian Sequence Learning for Predicting Protein Cleavage Points
Mike Mayo.
Bayesian sequence learning for predicting protein cleavage points.
In Advances in Knowledge Discovery and Data Mining: 9th
Pacific-Asia Conference, PAKDD 2005, page 192, Hanoi, Vietnam, 2005.
May 18-20, 2005.
[ bib |
.pdf ]
A challenging problem in data mining is the application of efficient techniques to automatically annotate the vast databases of biological sequence data. This paper describes one such application in this area, to the prediction of the position of signal peptide cleavage points along protein sequences. It is shown that the method, based on Bayesian statistics, is comparable in terms of accuracy to the existing state-of-the-art neural network techniques while providing explanatory information for its predictions.
|
|
[18]
|
Learning Petri Net Models of Non-Linear Gene Interactions
Mike Mayo.
Learning petri net models of non-linear gene interactions.
BioSystems, 85(1):74-82, 2005.
[ bib |
.pdf ]
Understanding how an individual's genetic make-up influences their risk of disease is a problem of paramount importance. Although machine learning techniques are able to uncover the relationships between genotype and disease, the problem of automatically building the best biochemical model or explanation of the relationship has received less attention. In this paper, I describe a method based on random hill climbing that automatically builds Petri net models of non-linear (or multi-factorial) disease-causing gene-gene interactions. Petri nets are a suitable formalism for this problem, because they are used to model concurrent, dynamic processes analogous to biochemical reaction networks. I show that this method is routinely able to identify perfect Petri net models for three disease-causing gene-gene interactions recently reported in the literature.
|
|