[1]

Combining Modifications to Multinomial Naive Bayes for Text
Classification
Antti Puurula.
Combining modifications to multinomial naive bayes for text
classification.
In Proc 8th Asia Information Retrieval Societies Conference,
Tianjin, China, pages 114125, 2012.
[ bib 
http ]
Multinomial Naive Bayes (MNB) is a preferred classifier for many text classification tasks, due to simplicity and trivial scaling to large scale tasks. However, in terms of classification accuracy it has a performance gap to modern discriminative classifiers, due to strong data assumptions. This paper explores the optimized combination of popular modifications to generative models in the context of MNB text classification. In order to optimize the introduced classifier metaparameters, we explore direct search optimization using random search algorithms. We evaluate 7 basic modifications and 4 search algorithms across 5 publicly availably available datasets, and give comparisons to similarly optimized Multiclass Support Vector Machine (SVM) classifiers. The use of optimized modifications results in over 20% mean reduction in classification errors compared to baseline MNB models, reducing the gap between SVM and MNB mean performance by over 60%. Some of the individual modifications are shown to have substantial and significant effects, while differences between the random search algorithms are smaller and not statistically significant. The evaluated modifications are potentially applicable to many applications of generative text modeling, where similar performance gains can be achieved.


[2]

Scalable Text Classification with Sparse Generative Modeling
Antti Puurula.
Scalable text classification with sparse generative modeling.
In Proc 12th Pacific Rim International Conference on Artificial
Intelligence, Kuching, Malaysia, pages 458469, 2012.
[ bib 
http ]
Machine learning technology faces challenges in handling Big Data: vast volumes of online data such as web pages, news stories and articles. A dominant solution has been parallelization, but this does not make the tasks less challenging. An alternative solution is using sparse computation methods to fundamentally change the complexity of the processing tasks themselves. This can be done by using both the sparsity found in natural data and sparsified models. In this paper we show that sparse representations can be used to reduce the time complexity of generative classifiers to build fundamentally more scalable classifiers. We reduce the time complexity of Multinomial Naive Bayes classification with sparsity and show how to extend these findings into three multilabel extensions: Binary Relevance, Label Powerset and Multilabel Mixture Models. To provide competitive performance we provide the methods with smoothing and pruning modifications and optimize model metaparameters using direct search optimization. We report on classification experiments on 5 publicly available datasets for largescale multilabel classification. All three methods scale easily to the largest available tasks, with training times measured in seconds and classification times in milliseconds, even with millions of training documents, features and classes. The presented sparse modeling techniques should be applicable to many other classifiers, providing the same types of fundamental complexity reductions when applied to large scale tasks.


[3]

Ensembles of Sparse Multinomial Classifiers for Scalable Text Classification
Antti Puurula and Albert Bifet.
Ensembles of sparse multinomial classifiers for scalable text
classification.
In Proc ECML PKDD 2012 Workshop on LargeScale Hierarchical
Classification, Bristol, UK, 2012.
[ bib 
.pdf ]
Machine learning techniques face new challenges in scalability to largescale tasks. Many of the existing algorithms are unable to
scale to potentially millions of features and structured classes encountered in webscale datasets such as Wikipedia. The third Large Scale
Hierarchical Text Classification evaluation (LSHTC3) evaluated systems
for multilabel hierarchical categorization of Wikipedia articles. In this
paper we present a broad overview of our system in the evaluation, performing among the top systems in the challenge. We describe the several
new modeling ideas we used that make text classification systems both
more effective and scalable. These are: reduction of inference time complexity for probabilistic classifiers using inverted indices, classifier modifications optimized with direct search algorithms, ensembles of diverse
multilabel classifiers and a novel featureregression based method for
scalable ensemble combination.


[4]

Online Estimation of Discrete Densities using Classifier Chains
Michael Geilke, Eibe Frank, and Stefan Kramer.
Online estimation of discrete densities using classifier chains.
In Proc ECML PKDD 2012 Workshop on Instant Interactive Data
Mining, Bristol, UK, 2012.
[ bib 
.pdf ]
We propose an approach to estimate a discrete joint density
online, that is, the algorithm is only provided the current example, its
current estimate, and a limited amount of memory. To design an online estimator for discrete densities, we use classifier chains to model
dependencies among features. Each classifier in the chain estimates the
probability of one particular feature. Because a single chain may not pro
vide a reliable estimate, we also consider ensembles of classifier chains.
Our experiments on synthetic data show that the approach is feasible
and the estimated densities approach the true, known distribution with
increasing amounts of data.


[5]

Learning a Conceptbased Document Similarity Measure
Lan Huang, David Milne, Eibe Frank, and Ian H. Witten.
Learning a conceptbased document similarity measure.
Journal of the American Society for Information Science and
Technology, 2012.
[ bib 
.pdf ]
Document similarity measures are crucial components of many text analysis tasks, including information retrieval, document classification, and document clustering. Conventional measures are brittle: they estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. We propose a new measure that assesses similarity at both the lexical and semantic levels, and learns from human judgments how to combine them by using machine learning techniques. Experiments show that the new measure produces values for documents that are more consistent with people’s judgments than people are with each other. We also use it to classify and cluster large document sets covering different genres and topics, and find that it improves both classification and clustering performance.


[6]

Ensembles of Restricted Hoeffding Trees
Albert Bifet, Eibe Frank, Geoffrey Holmes, and Bernhard Pfahringer.
Ensembles of restricted hoeffding trees.
ACM Transactions on Intelligent Systems and Technology, 3(2),
2012.
[ bib 
.pdf ]
The success of simple methods for classification shows that is is often not necessary to model complex attribute interactions to obtain good classification accuracy on practical problems. In this paper, we propose to exploit this phenomenon in the data stream context by building an ensemble of Hoeffding trees that are each limited to a small subset of attributes. In this way, each tree is restricted to model interactions between attributes in its corresponding subset. Because it is not known a priori which attribute subsets are relevant for prediction, we build exhaustive ensembles that consider all possible attribute subsets of a given size. As the resulting Hoeffding trees are not all equally important, we weigh them in a suitable manner to obtain accurate classifications. This is done by combining the logodds of their probability estimates using sigmoid perceptrons, with one perceptron per class. We propose a mechanism for setting the perceptrons' learning rate using the ADWIN change detection method for data streams, and also use ADWIN to reset ensemble members (i.e. Hoeffding trees) when they no longer perform well. Our experiments show that the resulting ensemble classifier outperforms bagging for data streams in terms of accuracy when both are used in conjunction with adaptive naive Bayes Hoeffding trees, at the expense of runtime and memory consumption. We also show that our stacking method can improve the performance of a bagged ensemble.


[7]

Experiment databases  A new way to share, organize and
learn from experiments
Joaquin Vanschoren, Hendrik Blockeel, Bernhard Pfahringer, and Geoffrey Holmes.
Experiment databases  a new way to share, organize and learn from
experiments.
Machine Learning, 87(2):127158, 2012.
[ bib ]
Thousands of machine learning research papers contain extensive experimental comparisons. However, the details of those experiments are often lost after publication, making it impossible to reuse these experiments in further research, or reproduce them to verify the claims made. In this paper, we present a collaboration framework designed to easily share machine learning experiments with the community, and automatically organize them in public databases. This enables immediate reuse of experiments for subsequent, possibly much broader investigation and offers faster and more thorough analysis based on a large set of varied results. We describe how we designed such an experiment database, currently holding over 650,000 classification experiments, and demonstrate its use by answering a wide range of interesting research questions and by verifying a number of recent studies.


[8]

Scalable and efficient multilabel classification for evolving
data streams
Jesse Read, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer.
Scalable and efficient multilabel classification for evolving data
streams.
Machine Learning, 88(12):243272, 2012.
[ bib ]
Many challenging real world problems involve multilabel data streams. Efficient methods exist for multilabel classification in nonstreaming scenarios. However, learning in evolving streaming scenarios is more challenging, as classifiers must be able to deal with huge numbers of examples and to adapt to change using limited time and memory while being ready to predict at any point.
This paper proposes a new experimental framework for learning and evaluating on multilabel data streams, and uses it to study the performance of various methods. From this study, we develop a multilabel Hoeffding tree with multilabel classifiers at the leaves. We show empirically that this method is well suited to this challenging task. Using our new framework, which allows us to generate realistic multilabel data streams with concept drift (as well as real data), we compare with a selection of baseline methods, as well as new learning methods from the literature, and show that our Hoeffding tree method achieves fast and more accurate performance.


[9]

Bagging Ensemble Selection for Regression
Quan Sun and Bernhard Pfahringer.
Bagging ensemble selection for regression.
In Australasian Conference on Artificial Intelligence, pages
695706. Springer, 2012.
[ bib ]
Bagging ensemble selection (BES) is a relatively new ensemble learning strategy. The strategy can be seen as an ensemble of the ensemble selection from libraries of models (ES) strategy. Previous experimental results on binary classification problems have shown that using random trees as base classifiers, BESOOB (the most successful variant of BES) is competitive with (and in many cases, superior to) other ensemble learning strategies, for instance, the original ES algorithm, stacking with linear regression, random forests or boosting. Motivated by the promising results in classification, this paper examines the predictive performance of the BESOOB strategy for regression problems. Our results show that the BESOOB strategy outperforms Stochastic Gradient Boosting and Bagging when using regression trees as the base learners. Our results also suggest that the advantage of using a diverse model library becomes clear when the model library size is relatively large. We also present encouraging results indicating that the nonnegative least squares algorithm is a viable approach for pruning an ensemble of ensembles.


[10]

Stream Data Mining Using the MOA Framework
Philipp Kranen, Hardy Kremer, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff
Holmes, Bernhard Pfahringer, and Jesse Read.
Stream data mining using the moa framework.
In Proceedings Database Systems for Advanced Applications,
pages 309313. Springer, 2012.
[ bib ]
Massive Online Analysis (MOA) is a software framework that provides algorithms and evaluation methods for mining tasks on evolving data streams. In addition to supervised and unsupervised learning, MOA has recently been extended to support multilabel classification and graph mining. In this demonstrator we describe the main features of MOA and present the newly added methods for outlier detection on streaming data. Algorithms can be compared to established baseline methods such as LOF and ABOD using standard ranking measures including Spearman rank coefficient and the AUC measure. MOA is an open source project and videos as well as tutorials are publicly available on the MOA homepage.


[11]

Full model selection in the space of data mining operators
Quan Sun, Bernhard Pfahringer, and Michael Mayo.
Full model selection in the space of data mining operators.
In Proceedings Genetic and Evolutionary Computation Conference,
pages 15031504. ACM, 2012.
[ bib ]
We propose a framework and a novel algorithm for the full model selection (FMS) problem. The proposed algorithm, combining both genetic algorithms (GA) and particle swarm optimization (PSO), is named GPS (which stands for GAPSOFMS), in which a GA is used for searching the optimal structure of a data mining solution, and PSO is used for searching the optimal parameter set for a particular structure instance. Given a classification or regression problem, GPS outputs a FMS solution as a directed acyclic graph consisting of diverse data mining operators that are applicable to the problem, including data cleansing, data sampling, feature transformation/selection and algorithm operators. The solution can also be represented graphically in a human readable form. Experimental results demonstrate the benefit of the algorithm.


[12]

BatchIncremental versus InstanceIncremental Learning in
Dynamic and Evolving Data
Jesse Read, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes.
Batchincremental versus instanceincremental learning in dynamic and
evolving data.
In Proceedings Symposium on Advances in Intelligent Data
Analysis, pages 313323, 2012.
[ bib ]
Many real world problems involve the challenging context of data streams, where classifiers must be incremental: able to learn from a theoreticallyinfinite stream of examples using limited time and memory, while being able to predict at any point. Two approaches dominate the literature: batchincremental methods that gather examples in batches to train models; and instanceincremental methods that learn from each example as it arrives. Typically, papers in the literature choose one of these approaches, but provide insufficient evidence or references to justify their choice. We provide a first indepth analysis comparing both approaches, including how they adapt to concept drift, and an extensive empirical study to compare several different versions of each approach. Our results reveal the respective advantages and disadvantages of the methods, which we discuss in detail.


[13]

Maximum Common Subgraph based locally weighted regression
Madeleine Seeland, Fabian Buchwald, Stefan Kramer, and Bernhard Pfahringer.
Maximum common subgraph based locally weighted regression.
In Proceedings of the ACM Symposium on Applied Computing, pages
165172. ACM, 2012.
[ bib ]
This paper investigates a simple, yet effective method for regression on graphs, in particular for applications in cheminformatics and for quantitative structureactivity relationships (QSARs). The method combines Locally Weighted Learning (LWL) with Maximum Common Subgraph (MCS) based graph distances. More specifically, we investigate a variant of locally weighted regression on graphs (structures) that uses the maximum common subgraph for determining and weighting the neighborhood of a graph and feature vectors for the actual regression model. We show that this combination, LWLMCS, outperforms other methods that use the local neighborhood of graphs for regression. The performance of this method on graphs suggests it might be useful for other types of structured data as well.


[14]

Multilabel classification using boolean matrix decomposition
Jörg Wicker, Bernhard Pfahringer, and Stefan Kramer.
Multilabel classification using boolean matrix decomposition.
In Proceedings of the ACM Symposium on Applied Computing, pages
179186. ACM, 2012.
[ bib ]
This paper introduces a new multilabel classifier based on Boolean matrix decomposition. Boolean matrix decomposition is used to extract, from the full label matrix, latent labels representing useful Boolean combinations of the original labels. Base level models predict latent labels, which are subsequently transformed into the actual labels by Boolean matrix multiplication with the second matrix from the decomposition. The new method is tested on six publicly available datasets with varying numbers of labels. The experimental evaluation shows that the new method works particularly well on datasets with a large number of labels and strong dependencies among them.


[15]

Parameter Tuning Using Gaussian Processes
Jinjin Ma.
Parameter tuning using gaussian processes.
Master's thesis, Department of Computer Science, University of
Waikato, 2012.
[ bib 
http ]
Most machine learning algorithms require us to set up their parameter values before applying these algorithms to solve problems. Appropriate parameter settings will bring good performance while inappropriate parameter settings generally result in poor modelling. Hence, it is necessary to acquire the “best” parameter values for a particular algorithm before building the model. The “best” model not only reflects the “real” function and is well fitted to existing points, but also gives good performance when making predictions for new points with previously unseen values.
A number of methods exist that have been proposed to optimize parameter values. The basic idea of all such methods is a trialanderror process whereas the work presented in this thesis employs Gaussian process (GP) regression to optimize the parameter values of a given machine learning algorithm. In this thesis, we consider the optimization of only twoparameter learning algorithms. All the possible parameter values are specified in a 2dimensional grid in this work. To avoid bruteforce search, Gaussian Process Optimization (GPO) makes use of “expected improvement” to pick useful points rather than validating every point of the grid step by step. The point with the highest expected improvement is evaluated using crossvalidation and the resulting data point is added to the training set for the Gaussian process model. This process is repeated until a stopping criterion is met. The final model is built using the learning algorithm based on the best parameter values identified in this process.
In order to test the effectiveness of this optimization method on regression and classification problems, we use it to optimize parameters of some wellknown machine learning algorithms, such as decision tree learning, support vector machines and boosting with trees. Through the analysis of experimental results obtained on datasets from the UCI repository, we find that the GPO algorithm yields competitive performance compared with a bruteforce approach, while exhibiting a distinct advantage in terms of training time and number of crossvalidation runs. Overall, the GPO method is a promising method for the optimization of parameter values in machine learning.


[16]

An application of data mining to fruit and vegetable sample identification using Gas ChromatographyMass Spectrometry
Geoffrey Holmes, Dale Fletcher, and Peter Reutemann.
An application of data mining to fruit and vegetable sample
identification using gas chromatographymass spectrometry.
In Proceedings of the International Congress on Environmental
Modelling and Software (IEMSS), Leipzig, Germany, 2012.
[ bib 
.pdf ]
One of the uses of Gas ChromatographyMass Spectrometry (GCMS) is in the detection of pesticide residues in fruit and vegetables. In a high throughput laboratory there is the potential for sample swaps or mislabelling, as once a sample has been preprocessed to be injected into the GCMS analyser, it is no longer distinguishable by eye. Possible consequences of such mistakes can be the destruction of large amounts of actually safe produce or pesticidecontaminated produce reaching the consumer. For the purposes of food safety and traceability, it can also be extremely valuable to know the source (country of origin) of a food product. This can help uncover fraudulent attempts of trying to sell food originating from countries deemed unsafe. In this study, we use the workﬂow environment ADAMS to examine whether we can determine the fruit/vegetable, and the country of origin of a sample from a GCMS chromatogram. A workﬂow is used to generate data sets using different data preprocessing methods, and data representations from a database of over 8000 GCMS chromatograms, consisting of more than 100 types of fruit and vegetables from more than 120 countries. A variety of classiﬁcation algorithms are evaluated using the WEKA data mining workbench. We demonstrate excellent results, both for the determination of fruit/vegetable type and for the country of origin, using a histogram of ion counts, and Classiﬁcation by Regression using Random Regression Forest with PLStransformed data.


[17]

Evolutionary Data Selection for Enhancing Models of Intraday
Forex Time Series
Michael Mayo.
Evolutionary data selection for enhancing models of intraday forex
time series.
In Proceedings Applications of Evolutionary Computation, pages
184193. Springer, 2012.
[ bib ]
The hypothesis in this paper is that a significant amount of intraday market data is either noise or redundant, and that if it is eliminated, then predictive models built using the remaining intraday data will be more accurate. To test this hypothesis, we use an evolutionary method (called Evolutionary Data Selection, EDS) to selectively remove out portions of training data that is to be made available to an intraday market predictor. After performing experiments in which dataselected and nondataselected versions of the same predictive models are compared, it is shown that EDS is effective and does indeed boost predictor accuracy. It is also shown in the paper that building multiple models using EDS and placing them into an ensemble further increases performance. The datasets for evaluation are large intraday forex time series, specifically series from the EUR/USD, the USD/JPY and the EUR/JPY markets, and predictive models for two primary tasks per market are built: intraday return prediction and intraday volatility prediction.


[18]

Developing data mining applications
Geoff Holmes.
Developing data mining applications.
In International Conference on Knowledge Discovery and Data
Mining, page 225. ACM, 2012.
[ bib ]
In this talk I will review several realworld applications developed at the University of Waikato over the past 15 years. These include the use of near infrared spectroscopy coupled with data mining as an alternate laboratory technique for predicting compound concentrations in soil and plant samples, and the analysis of gas chromatography mass spectrometry (GCMS) data, a technique used to determine in environmental applications, for example, the petroleum content in soil and water samples. I will then briefly discuss how experience with these applications has led to the development of an opensource framework for application development.


[19]

Scientific Workflow Management with ADAMS
Peter Reutemann and Joaquin Vanschoren.
Scientific workflow management with adams.
In PeterA. Flach, Tijl Bie, and Nello Cristianini, editors,
Machine Learning and Knowledge Discovery in Databases, volume 7524 of
Lecture Notes in Computer Science, pages 833837. Springer Berlin
Heidelberg, 2012.
[ bib 
http ]
We demonstrate the Advanced Data mining And Machine learning System (ADAMS), a novel workflow engine designed for rapid prototyping and maintenance of complex knowledge workflows. ADAMS does not require the user to manually connect inputs to outputs on a large canvas. It uses a compact workflow representation, control operators, and a simple interface between operators, allowing them to be autoconnected. It contains an extensive library of operators for various types of analysis, and a convenient plugin architecture to easily add new ones.
Keywords: scientific workflows; machine learning; data mining

