Publications

Scaling Up Semi-Supervised Learning: an Efficient and Effective LLGC Variant
Bernhard Pfahringer, Claire Leschi, and Peter Reutemann. Scaling Up Semi-Supervised Learning: an Efficient and Effective LLGC Variant. In Zhi-Hua Zhou, Hang Li, and Qiang Yang, editors, The 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2007), 2007. [paper]
Domains like text classification can easily supply large amounts of unlabeled data, but labeling itself is expensive. Semi-supervised learning tries to exploit this abundance of unlabeled training data to improve classification. Unfortunately, most of the theoretically well-founded algorithms described in recent years are cubic or worse in the total number of labeled and unlabeled training examples. In this paper we apply modifications to the standard LLGC algorithm that improve its efficiency to a point where we can handle datasets with hundreds of thousands of training examples. The modifications are priming of the unlabeled data and, most importantly, sparsification of the similarity matrix. We report promising results on large text classification problems.
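
The core of standard LLGC (Zhou et al.'s "learning with local and global consistency") is the label-propagation iteration F ← αSF + (1−α)Y over a normalized similarity matrix S; the key efficiency modification described above is to make that matrix sparse. Below is a minimal sketch of the idea, assuming scikit-learn and SciPy; the parameters (k, alpha, n_iter) and the unit-bandwidth Gaussian similarity are illustrative choices, not the paper's exact settings.

```python
# Sketch: LLGC label propagation over a kNN-sparsified similarity matrix.
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import kneighbors_graph

def llgc_sparse(X, y, k=10, alpha=0.99, n_iter=50):
    """X: (n, d) feature matrix; y: (n,) labels, -1 marks unlabeled."""
    classes = np.unique(y[y >= 0])
    # Sparsification: keep only each point's k nearest neighbours
    # instead of the dense n x n similarity matrix.
    W = kneighbors_graph(X, k, mode='distance', include_self=False)
    W.data = np.exp(-W.data ** 2)      # Gaussian similarity on kept edges
    W = 0.5 * (W + W.T)                # symmetrize the graph
    d = np.asarray(W.sum(axis=1)).ravel()
    D = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D @ W @ D                      # normalized similarity matrix
    # One-hot label matrix; unlabeled rows stay all-zero.
    Y = np.zeros((X.shape[0], len(classes)))
    for j, c in enumerate(classes):
        Y[y == c, j] = 1.0
    F = Y.copy()
    for _ in range(n_iter):            # F <- alpha*S*F + (1-alpha)*Y
        F = alpha * (S @ F) + (1 - alpha) * Y
    return classes[F.argmax(axis=1)]
```

With k neighbours per point, each propagation step costs O(nk) rather than the O(n^2) of a dense similarity matrix, which is what makes hundreds of thousands of examples feasible.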

A Semi-Supervised Spam Mail Detector
Bernhard Pfahringer. A Semi-Supervised Spam Mail Detector. In Steffen Bickel, editor, Proceedings of the ECML/PKDD 2006 Discovery Challenge Workshop, pages 48-53. Humboldt University Berlin, 2006. [paper]
This paper describes a novel semi-supervised approach to spam classification, which was successful at the ECML/PKDD 2006 spam classification challenge. A local learning method based on lazy projections was successfully combined with a variant of a standard semi-supervised learning algorithm.

Using Weighted Nearest Neighbor to Benefit from Unlabeled Data
Kurt Driessens, Peter Reutemann, Bernhard Pfahringer, and Claire Leschi. Using Weighted Nearest Neighbor to Benefit from Unlabeled Data. In Wee Keong Ng, Masaru Kitsuregawa, Jianzhong Li, and Kuiyu Chang, editors, Advances in Knowledge Discovery and Data Mining, 10th Pacific-Asia Conference, PAKDD 2006, volume 3918 of LNCS, pages 60-69, 2006. [paper]
The development of data-mining applications such as text classification and molecular profiling has shown the need for machine learning algorithms that can benefit from both labeled and unlabeled data, where the unlabeled examples often greatly outnumber the labeled ones. In this paper we present a two-stage classifier that improves its predictive accuracy by making use of the available unlabeled data. It applies a weighted nearest neighbor classification algorithm to the combined example sets as its knowledge base. The examples from the unlabeled set are pre-labeled by an initial classifier that is built using the limited available training data. By choosing appropriate weights for this pre-labeled data, the nearest neighbor classifier consistently improves on the original classifier.
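
A minimal sketch of the two-stage idea follows, assuming scikit-learn for the initial classifier; the base learner, the value of k, and the down-weighting factor for pre-labeled examples are illustrative assumptions, not the paper's settings.

```python
# Sketch: pre-label unlabeled data, then classify by weighted kNN vote.
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_stage_wknn(X_lab, y_lab, X_unlab, X_test, k=5, unlab_weight=0.2):
    # Stage 1: train an initial classifier on the scarce labeled data
    # and use it to pre-label the unlabeled examples.
    base = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    X_all = np.vstack([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, base.predict(X_unlab)])
    # Stage 2: pre-labeled examples get a smaller vote, so noisy
    # stage-1 predictions count for less than true labels.
    w = np.concatenate([np.ones(len(y_lab)),
                        np.full(len(X_unlab), unlab_weight)])
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_all - x, axis=1)   # Euclidean distances
        nn = np.argsort(d)[:k]                  # k nearest neighbours
        votes = {}
        for i in nn:                            # weighted majority vote
            votes[y_all[i]] = votes.get(y_all[i], 0.0) + w[i]
        preds.append(max(votes, key=votes.get))
    return np.array(preds)
```

Keeping unlab_weight below 1 reflects the point made in the abstract: pre-labeled examples are noisier than genuinely labeled ones, so their votes should count for less.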

A novel two stage scheme utilizing the test set for model selection in text classification
Bernhard Pfahringer, Peter Reutemann, and Mike Mayo. A novel two stage scheme utilizing the test set for model selection in text classification. In Ranadhir Ghosh, Brijesh Verma, and Xue Li, editors, Proceedings of the Workshop on Learning Algorithms for Pattern Recognition, Eighteenth Australian Joint Conference on Artificial Intelligence (AI'05), pages 60-65, Sydney, Australia, 2005. University of Technology, Sydney. [paper]
Text classification is a natural application domain for semi-supervised learning, as labeling documents is expensive while unlabeled documents are usually available in abundance. We describe a novel, simple two-stage scheme based on dagging which allows the test set to be utilized for model selection. The dagging ensemble can also be used by itself in place of the original classifier. We evaluate the performance of a meta classifier choosing between various base learners and their respective dagging ensembles. The selection process seems to perform robustly, especially for small percentages of available labels for training.
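
Dagging (disjoint aggregation) trains one copy of a base learner per disjoint fold of the training data and combines their predictions by majority vote. The sketch below, assuming scikit-learn and integer class labels, shows the ensemble plus an agreement score on the unlabeled test set as one plausible hook for test-set-based model selection; the agreement heuristic and all parameter choices are illustrative assumptions, not the paper's exact scheme.

```python
# Sketch: dagging ensemble with a test-set agreement score.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def dagging_fit(X, y, base=None, n_folds=4, seed=0):
    # Train one copy of the base learner on each disjoint fold.
    base = base if base is not None else LogisticRegression(max_iter=1000)
    idx = np.random.default_rng(seed).permutation(len(y))
    return [clone(base).fit(X[f], y[f]) for f in np.array_split(idx, n_folds)]

def dagging_predict(models, X):
    # Majority vote over the ensemble, one test point at a time.
    votes = np.stack([m.predict(X) for m in models])
    return np.array([np.bincount(col).argmax() for col in votes.T])

def agreement(models, X_test):
    # Fraction of test points on which all ensemble members agree;
    # no test labels are used, only the unlabeled test inputs.
    votes = np.stack([m.predict(X_test) for m in models])
    return float((votes == votes[0]).all(axis=0).mean())
```

A meta classifier in the spirit of the paper could then prefer, for each base learner, whichever of the single model or its dagging ensemble scores better under such a criterion.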