A novel two stage scheme utilizing the test set for model selection in text classification

Bernhard Pfahringer, Peter Reutemann, and Mike Mayo. A novel two stage scheme utilizing the test set for model selection in text classification. In Ranadhir Ghosh, Brijesh Verma, and Xue Li, editors, Proc Workshop on Learning Algorithms for Pattern Recognition, Eighteenth Australian Joint Conference on Artificial Intelligence (AI'05), pages 60-65, Sydney, Australia, 2005. University of Technology. 5-9 December 2005.
[ bib | .pdf ]
Text classification is a natural application domain for semi- supervised learning, as labeling documents is expensive, but on the other hand usually an abundance of unlabeled documents is available. We describe a novel simple two- stage scheme based on dagging which allows for utilizing the test set in model selection. The dagging ensemble can also be used by itself instead of the original classifier. We evaluate the performance of a meta classifier choosing between various base learners and their respective dagging ensembles. The selection process seems to perform robustly especially for small percentages of available labels for training.


Tie Breaking in Hoeffding trees

Geoffrey Holmes, Richard Kirkby, and Bernhard Pfahringer. Tie breaking in hoeffding trees. In J. Gama and J. S. Aguilar-Ruiz, editors, Proc Workshop W6: Second International Workshop on Knowledge Discovery in Data Streams, pages 107-116, 2005.
[ bib ]

Cache hierarchy inspired compression: a novel architecture for data streams

Geoffrey Holmes, Bernhard Pfahringer, and Richard Kirkby. Cache hierarchy inspired compression: a novel architecture for data streams. In Narayanan Kulathuramaiyer, Alvin W. Yeo, Wang Yin Chai, and Tan Chong Eng, editors, Proc Fourth International Conference on Information Technology in Asia (CITA'05), pages 130-136, 2005. 12-15 December 2005.
[ bib | .pdf ]
We present an architecture for data streams based on structures typically found in web cache hierarchies. The main idea is to build a meta level analyser from a number of levels constructed over time from a data stream. We present the general architecture for such a system and an application to classification. This architecture is an instance of the general wrapper idea allowing us to reuse standard batch learning algorithms in an inherently incremental learning environment. By artificially generating data sources we demonstrate that a hierarchy containing a mixture of models is able to adapt over time to the source of the data. In these experiments the hierarchies use an elementary performance based replacement policy and unweighted voting for making classification decisions.


Inductive Logic Programming, 15th International Conference, ILP 2005, Bonn, Germany, August 10-13, 2005, Proceedings

Stefan Kramer and Bernhard Pfahringer, editors. Inductive Logic Programming, 15th International Conference, ILP 2005, Bonn, Germany, August 10-13, 2005, Proceedings, volume 3625 of Lecture Notes in Computer Science. Springer, 2005.
[ bib ]

Unsupervised Discretization Using Tree-Based Density Estimation

Gabi Schmidberger and Eibe Frank. Unsupervised discretization using tree-based density estimation. In Proc 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, pages 240-251. Springer, 2005.
[ bib | .ps.gz | .pdf ]
This paper presents an unsupervised discretization method that performs density estimation for univariate data. The subintervals that the discretization produces can be used as the bins of a histogram. Histograms are a very simple and broadly understood means for display- ing data, and our method automatically adapts bin widths to the data. It uses the log-likelihood as the scoring function to select cut points and the cross-validated log-likelihood to select the number of intervals. We compare this method with equal-width discretization where we also se- lect the number of bins using the cross-validated log-likelihood and with equal-frequency discretization.


Ensembles of Balanced Nested Dichotomies for Multi-class Problems

Lin Dong, Eibe Frank, and Stefan Kramer. Ensembles of balanced nested dichotomies for multi-class problems. In Proc 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, pages 84-95. Springer, 2005.
[ bib | .ps.gz | .pdf ]
A system of nested dichotomies is a hierarchical decomposition of a multi-class problem with c classes into c - 1 two-class problems and can be represented as a tree structure. Ensembles of randomly-generated nested dichotomies have proven to be an effective approach to multi-class learning problems [1]. However, sampling trees by giving each tree equal probability means that the depth of a tree is limited only by the number of classes, and very unbalanced trees can negatively affect runtime. In this paper we investigate two approaches to building balanced nested dichotomies-class-balanced nested dichotomies and data-balanced nested dichotomies-and evaluate them in the same ensemble setting. Using C4.5 decision trees as the base models, we show that both approaches can reduce runtime with little or no effect on accuracy, especially on problems with many classes. We also investigate the effect of caching models when building ensembles of nested dichotomies.


Speeding Up Logistic Model Tree Induction

Marc Sumner, Eibe Frank, and Mark A. Hall. Speeding up logistic model tree induction. In Proc 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, pages 675-683. Springer, 2005.
[ bib | .ps.gz | .pdf ]
Logistic Model Trees have been shown to be very accurate and compact classifiers [8]. Their greatest disadvantage is the computa- tional complexity of inducing the logistic regression models in the tree. We address this issue by using the AIC criterion [1] instead of cross- validation to prevent overfitting these models. In addition, a weight trim- ming heuristic is used which produces a significant speedup. We compare the training time and accuracy of the new induction process with the original one on various datasets and show that the training time often decreases while the classification accuracy diminishes only slightly.


Stress-Testing Hoeffding Trees

Geoffrey Holmes, Richard Kirkby, and Bernhard Pfahringer. Stress-testing hoeffding trees. In Proc 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, pages 495-502. Springer, 2005.
[ bib ]

Logistic Model Trees

Niels Landwehr, Mark Hall, and Eibe Frank. Logistic model trees. Machine Learning, 59(1-2):161-205, 2005.
[ bib | .ps.gz | .pdf ]
Tree induction methods and linear models are popular techniques for supervised learning tasks, both for the prediction of nominal classes and numeric values. For predicting numeric quantities, there has been work on combining these two schemes into 'model trees', i.e. trees that contain linear regression functions at the leaves. In this paper, we present an algorithm that adapts this idea for classification problems, using logistic regression instead of linear regression. We use a stagewise fitting process to construct the logisitic regression models that can select relevant attributes in the data in a natural way, and show how this approach can be used to build the logistic regression models at the leaves by incrementally refining those constructed at higher levels in the tree. We compare the performance of our algorithm to several other state-of-the-art learning schemes on 36 benchmark UCI datasets, and show that it produces accurate and compact classifiers.


Gene selection from microarray data for cancer classification - a machine learning approach

Yu Wang, Igor V. Tetko, Mark A. Hall, Eibe Frank, Axel Facius, Klaus F. X. Mayer, and Hans-Werner Mewes. Gene selection from microarray data for cancer classification - a machine learning approach. Computational Biology and Chemistry, 29(1):37-46, 2005.
[ bib | http ]

Kea: Practical automatic keyphrase extraction

Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. Kea: Practical automatic keyphrase extraction. In Y.-L. Theng and S. Foo, editors, Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pages 129-152. Information Science Publishing, London, 2005.
[ bib | .ps.gz | .pdf ]
Keyphrases provide semantic metadata that summarize and characterize documents. This chapter describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are good keyphrases. The machine-learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large text corpus to evaluate Kea's effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and available under the GNU General Public License; the chapter gives instructions for use.


Data Mining: Practical Machine Learning Tools and Techniques

Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2 edition, 2005.
[ bib | .html ]

WEKA - A Machine Learning Workbench for Data Mining

Eibe Frank, Mark A. Hall, Geoffrey Holmes, Richard Kirkby, Bernhard Pfahringer, Ian H. Witten, and Leonhard Trigg. Weka - a machine learning workbench for data mining. In Oded Maimon and Lior Rokach, editors, The Data Mining and Knowledge Discovery Handbook, pages 1305-1314. Springer, 2005.
[ bib | .ps.gz | .pdf ]
The Weka workbench is an organized collection of state-of-the-art machine learning algorithms and data preprocessing tools. The basic way of interacting with these methods is by invoking them from the com- mand line. However, convenient interactive graphical user interfaces are provided for data exploration, for setting up large-scale experiments on distributed computing platforms, and for designing configurations for streamed data processing. These interfaces constitute an advanced en- vironment for experimental data mining. The system is written in Java and distributed under the terms of the GNU General Public License.


Semantic Role-Labeling via Consensus Pattern-Matching

Chi-San Lin and Tony C. Smith. Semantic role-labeling via consensus pattern-matching. In Proceedings of the Ninth Conference on Computational Natural Language Learning, pages 185-188, Ann Arbor, Michigan, June 2005.
[ bib | .pdf ]
This paper describes a system for semantic role labeling for the CoNLL2005 Shared task. We divide the task into two sub-tasks: boundary recognition by a general tree- based predicate-argument recognition algo- rithm to convert a parse tree into a flat rep- resentation of all predicates and their related boundaries, and role labeling by a consensus model using a pattern-matching framework to find suitable roles for core constituents and adjuncts. We describe the system architecture and report results for the CoNLL2005 development dataset.


Racing for Conditional Independence Inference

Remco R. Bouckaert and Milan Studeny. Racing for conditional independence inference. In Symbolic and Quantitative Approaches to Reasoning with Uncertainty, 8th European Conference, ECSQARU 2005, volume 3571 of LNCS, pages 221-232, 2005.
[ bib | .pdf ]
In this article, we consider the computational aspects of deciding whether a conditional independence statement t is implied by a list of conditional independence statements L using the implication related to the method of structural imsets. We present two methods which have the interesting complementary properties that one method performs well to prove that t is implied by L, while the other performs well to prove that t is not implied by L. However, both methods do not perform well the opposite. This gives rise to a parallel algorithm in which both methods race against each other in order to determine effectively whether t is or is not implied.

Some empirical evidence is provided that suggest this racing algorithms method performs a lot better than an existing method based on so-called skeletal characterization of the respective implication. Furthermore, the method is able to handle more than five variables.


Low Replicability of Machine Learning Experiments is not a Small Data Set Phenomenon

R. R. Bouckaert. Low replicability of machine learning experiments is not a small data set phenomenon. In Proceedings of the ICML'05 Workshop on Meta-Learning, 2005.
[ bib | .pdf ]
This paper investigates the relation between replicability of experiments for deciding which of two algorithms performs better on a given data set. We prove that lack of replicability is not just a small data phenomenon (as was shown before), but is present in experiments on medium and large data sets as well. We establish intuition in the relation between data set size, power and replicability.

The main method for improving replicability is to increase the number of samples. For large data sets and/or inefficient learning algorithms, this implies that exerperiments may take a long time to completion. We propose a procedure for deciding which of two learning algorithms is best that has a high replicability but takes moderate computational effort.


Bayesian Sequence Learning for Predicting Protein Cleavage Points

Mike Mayo. Bayesian sequence learning for predicting protein cleavage points. In Advances in Knowledge Discovery and Data Mining: 9th Pacific-Asia Conference, PAKDD 2005, page 192, Hanoi, Vietnam, 2005. May 18-20, 2005.
[ bib | .pdf ]
A challenging problem in data mining is the application of efficient techniques to automatically annotate the vast databases of biological sequence data. This paper describes one such application in this area, to the prediction of the position of signal peptide cleavage points along protein sequences. It is shown that the method, based on Bayesian statistics, is comparable in terms of accuracy to the existing state-of-the-art neural network techniques while providing explanatory information for its predictions.


Learning Petri Net Models of Non-Linear Gene Interactions

Mike Mayo. Learning petri net models of non-linear gene interactions. BioSystems, 85(1):74-82, 2005.
[ bib | .pdf ]
Understanding how an individual's genetic make-up influences their risk of disease is a problem of paramount importance. Although machine learning techniques are able to uncover the relationships between genotype and disease, the problem of automatically building the best biochemical model or explanation of the relationship has received less attention. In this paper, I describe a method based on random hill climbing that automatically builds Petri net models of non-linear (or multi-factorial) disease-causing gene-gene interactions. Petri nets are a suitable formalism for this problem, because they are used to model concurrent, dynamic processes analogous to biochemical reaction networks. I show that this method is routinely able to identify perfect Petri net models for three disease-causing gene-gene interactions recently reported in the literature.