Combining Modifications to Multinomial Naive Bayes for Text Classification

Antti Puurula. Combining modifications to multinomial naive bayes for text classification. In Proc 8th Asia Information Retrieval Societies Conference, Tianjin, China, pages 114-125, 2012.
[ bib | http ]
Multinomial Naive Bayes (MNB) is a preferred classifier for many text classification tasks, due to simplicity and trivial scaling to large scale tasks. However, in terms of classification accuracy it has a performance gap to modern discriminative classifiers, due to strong data assumptions. This paper explores the optimized combination of popular modifications to generative models in the context of MNB text classification. In order to optimize the introduced classifier metaparameters, we explore direct search optimization using random search algorithms. We evaluate 7 basic modifications and 4 search algorithms across 5 publicly availably available datasets, and give comparisons to similarly optimized Multiclass Support Vector Machine (SVM) classifiers. The use of optimized modifications results in over 20% mean reduction in classification errors compared to baseline MNB models, reducing the gap between SVM and MNB mean performance by over 60%. Some of the individual modifications are shown to have substantial and significant effects, while differences between the random search algorithms are smaller and not statistically significant. The evaluated modifications are potentially applicable to many applications of generative text modeling, where similar performance gains can be achieved.


Scalable Text Classification with Sparse Generative Modeling

Antti Puurula. Scalable text classification with sparse generative modeling. In Proc 12th Pacific Rim International Conference on Artificial Intelligence, Kuching, Malaysia, pages 458-469, 2012.
[ bib | http ]
Machine learning technology faces challenges in handling Big Data: vast volumes of online data such as web pages, news stories and articles. A dominant solution has been parallelization, but this does not make the tasks less challenging. An alternative solution is using sparse computation methods to fundamentally change the complexity of the processing tasks themselves. This can be done by using both the sparsity found in natural data and sparsified models. In this paper we show that sparse representations can be used to reduce the time complexity of generative classifiers to build fundamentally more scalable classifiers. We reduce the time complexity of Multinomial Naive Bayes classification with sparsity and show how to extend these findings into three multi-label extensions: Binary Relevance, Label Powerset and Multi-label Mixture Models. To provide competitive performance we provide the methods with smoothing and pruning modifications and optimize model meta-parameters using direct search optimization. We report on classification experiments on 5 publicly available datasets for large-scale multi-label classification. All three methods scale easily to the largest available tasks, with training times measured in seconds and classification times in milliseconds, even with millions of training documents, features and classes. The presented sparse modeling techniques should be applicable to many other classifiers, providing the same types of fundamental complexity reductions when applied to large scale tasks.


Ensembles of Sparse Multinomial Classifiers for Scalable Text Classification

Antti Puurula and Albert Bifet. Ensembles of sparse multinomial classifiers for scalable text classification. In Proc ECML PKDD 2012 Workshop on Large-Scale Hierarchical Classification, Bristol, UK, 2012.
[ bib | .pdf ]
Machine learning techniques face new challenges in scalability to large-scale tasks. Many of the existing algorithms are unable to scale to potentially millions of features and structured classes encountered in web-scale datasets such as Wikipedia. The third Large Scale Hierarchical Text Classification evaluation (LSHTC3) evaluated systems for multi-label hierarchical categorization of Wikipedia articles. In this paper we present a broad overview of our system in the evaluation, performing among the top systems in the challenge. We describe the several new modeling ideas we used that make text classification systems both more effective and scalable. These are: reduction of inference time complexity for probabilistic classifiers using inverted indices, classifier modifications optimized with direct search algorithms, ensembles of diverse multi-label classifiers and a novel feature-regression based method for scalable ensemble combination.


Online Estimation of Discrete Densities using Classifier Chains

Michael Geilke, Eibe Frank, and Stefan Kramer. Online estimation of discrete densities using classifier chains. In Proc ECML PKDD 2012 Workshop on Instant Interactive Data Mining, Bristol, UK, 2012.
[ bib | .pdf ]
We propose an approach to estimate a discrete joint density online, that is, the algorithm is only provided the current example, its current estimate, and a limited amount of memory. To design an on-line estimator for discrete densities, we use classifier chains to model dependencies among features. Each classifier in the chain estimates the probability of one particular feature. Because a single chain may not pro- vide a reliable estimate, we also consider ensembles of classifier chains. Our experiments on synthetic data show that the approach is feasible and the estimated densities approach the true, known distribution with increasing amounts of data.


Learning a Concept-based Document Similarity Measure

Lan Huang, David Milne, Eibe Frank, and Ian H. Witten. Learning a concept-based document similarity measure. Journal of the American Society for Information Science and Technology, 2012.
[ bib | .pdf ]
Document similarity measures are crucial components of many text analysis tasks, including information retrieval, document classification, and document clustering. Conventional measures are brittle: they estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. We propose a new measure that assesses similarity at both the lexical and semantic levels, and learns from human judgments how to combine them by using machine learning techniques. Experiments show that the new measure produces values for documents that are more consistent with people’s judgments than people are with each other. We also use it to classify and cluster large document sets covering different genres and topics, and find that it improves both classification and clustering performance.


Ensembles of Restricted Hoeffding Trees

Albert Bifet, Eibe Frank, Geoffrey Holmes, and Bernhard Pfahringer. Ensembles of restricted hoeffding trees. ACM Transactions on Intelligent Systems and Technology, 3(2), 2012.
[ bib | .pdf ]
The success of simple methods for classification shows that is is often not necessary to model complex attribute interactions to obtain good classification accuracy on practical problems. In this paper, we propose to exploit this phenomenon in the data stream context by building an ensemble of Hoeffding trees that are each limited to a small subset of attributes. In this way, each tree is restricted to model interactions between attributes in its corresponding subset. Because it is not known a priori which attribute subsets are relevant for prediction, we build exhaustive ensembles that consider all possible attribute subsets of a given size. As the resulting Hoeffding trees are not all equally important, we weigh them in a suitable manner to obtain accurate classifications. This is done by combining the log-odds of their probability estimates using sigmoid perceptrons, with one perceptron per class. We propose a mechanism for setting the perceptrons' learning rate using the ADWIN change detection method for data streams, and also use ADWIN to reset ensemble members (i.e. Hoeffding trees) when they no longer perform well. Our experiments show that the resulting ensemble classifier outperforms bagging for data streams in terms of accuracy when both are used in conjunction with adaptive naive Bayes Hoeffding trees, at the expense of runtime and memory consumption. We also show that our stacking method can improve the performance of a bagged ensemble.


Experiment databases - A new way to share, organize and learn from experiments

Joaquin Vanschoren, Hendrik Blockeel, Bernhard Pfahringer, and Geoffrey Holmes. Experiment databases - a new way to share, organize and learn from experiments. Machine Learning, 87(2):127-158, 2012.
[ bib ]
Thousands of machine learning research papers contain extensive experimental comparisons. However, the details of those experiments are often lost after publication, making it impossible to reuse these experiments in further research, or reproduce them to verify the claims made. In this paper, we present a collaboration framework designed to easily share machine learning experiments with the community, and automatically organize them in public databases. This enables immediate reuse of experiments for subsequent, possibly much broader investigation and offers faster and more thorough analysis based on a large set of varied results. We describe how we designed such an experiment database, currently holding over 650,000 classification experiments, and demonstrate its use by answering a wide range of interesting research questions and by verifying a number of recent studies.


Scalable and efficient multi-label classification for evolving data streams

Jesse Read, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. Scalable and efficient multi-label classification for evolving data streams. Machine Learning, 88(1-2):243-272, 2012.
[ bib ]
Many challenging real world problems involve multi-label data streams. Efficient methods exist for multi-label classification in non-streaming scenarios. However, learning in evolving streaming scenarios is more challenging, as classifiers must be able to deal with huge numbers of examples and to adapt to change using limited time and memory while being ready to predict at any point. This paper proposes a new experimental framework for learning and evaluating on multi-label data streams, and uses it to study the performance of various methods. From this study, we develop a multi-label Hoeffding tree with multi-label classifiers at the leaves. We show empirically that this method is well suited to this challenging task. Using our new framework, which allows us to generate realistic multi-label data streams with concept drift (as well as real data), we compare with a selection of baseline methods, as well as new learning methods from the literature, and show that our Hoeffding tree method achieves fast and more accurate performance.


Bagging Ensemble Selection for Regression

Quan Sun and Bernhard Pfahringer. Bagging ensemble selection for regression. In Australasian Conference on Artificial Intelligence, pages 695-706. Springer, 2012.
[ bib ]
Bagging ensemble selection (BES) is a relatively new ensemble learning strategy. The strategy can be seen as an ensemble of the ensemble selection from libraries of models (ES) strategy. Previous experimental results on binary classification problems have shown that using random trees as base classifiers, BES-OOB (the most successful variant of BES) is competitive with (and in many cases, superior to) other ensemble learning strategies, for instance, the original ES algorithm, stacking with linear regression, random forests or boosting. Motivated by the promising results in classification, this paper examines the predictive performance of the BES-OOB strategy for regression problems. Our results show that the BES-OOB strategy outperforms Stochastic Gradient Boosting and Bagging when using regression trees as the base learners. Our results also suggest that the advantage of using a diverse model library becomes clear when the model library size is relatively large. We also present encouraging results indicating that the non-negative least squares algorithm is a viable approach for pruning an ensemble of ensembles.


Stream Data Mining Using the MOA Framework

Philipp Kranen, Hardy Kremer, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Jesse Read. Stream data mining using the moa framework. In Proceedings Database Systems for Advanced Applications, pages 309-313. Springer, 2012.
[ bib ]
Massive Online Analysis (MOA) is a software framework that provides algorithms and evaluation methods for mining tasks on evolving data streams. In addition to supervised and unsupervised learning, MOA has recently been extended to support multi-label classification and graph mining. In this demonstrator we describe the main features of MOA and present the newly added methods for outlier detection on streaming data. Algorithms can be compared to established baseline methods such as LOF and ABOD using standard ranking measures including Spearman rank coefficient and the AUC measure. MOA is an open source project and videos as well as tutorials are publicly available on the MOA homepage.


Full model selection in the space of data mining operators

Quan Sun, Bernhard Pfahringer, and Michael Mayo. Full model selection in the space of data mining operators. In Proceedings Genetic and Evolutionary Computation Conference, pages 1503-1504. ACM, 2012.
[ bib ]
We propose a framework and a novel algorithm for the full model selection (FMS) problem. The proposed algorithm, combining both genetic algorithms (GA) and particle swarm optimization (PSO), is named GPS (which stands for GA-PSO-FMS), in which a GA is used for searching the optimal structure of a data mining solution, and PSO is used for searching the optimal parameter set for a particular structure instance. Given a classification or regression problem, GPS outputs a FMS solution as a directed acyclic graph consisting of diverse data mining operators that are applicable to the problem, including data cleansing, data sampling, feature transformation/selection and algorithm operators. The solution can also be represented graphically in a human readable form. Experimental results demonstrate the benefit of the algorithm.


Batch-Incremental versus Instance-Incremental Learning in Dynamic and Evolving Data

Jesse Read, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes. Batch-incremental versus instance-incremental learning in dynamic and evolving data. In Proceedings Symposium on Advances in Intelligent Data Analysis, pages 313-323, 2012.
[ bib ]
Many real world problems involve the challenging context of data streams, where classifiers must be incremental: able to learn from a theoretically-infinite stream of examples using limited time and memory, while being able to predict at any point. Two approaches dominate the literature: batch-incremental methods that gather examples in batches to train models; and instance-incremental methods that learn from each example as it arrives. Typically, papers in the literature choose one of these approaches, but provide insufficient evidence or references to justify their choice. We provide a first in-depth analysis comparing both approaches, including how they adapt to concept drift, and an extensive empirical study to compare several different versions of each approach. Our results reveal the respective advantages and disadvantages of the methods, which we discuss in detail.


Maximum Common Subgraph based locally weighted regression

Madeleine Seeland, Fabian Buchwald, Stefan Kramer, and Bernhard Pfahringer. Maximum common subgraph based locally weighted regression. In Proceedings of the ACM Symposium on Applied Computing, pages 165-172. ACM, 2012.
[ bib ]
This paper investigates a simple, yet effective method for regression on graphs, in particular for applications in chem-informatics and for quantitative structure-activity relationships (QSARs). The method combines Locally Weighted Learning (LWL) with Maximum Common Subgraph (MCS) based graph distances. More specifically, we investigate a variant of locally weighted regression on graphs (structures) that uses the maximum common subgraph for determining and weighting the neighborhood of a graph and feature vectors for the actual regression model. We show that this combination, LWL-MCS, outperforms other methods that use the local neighborhood of graphs for regression. The performance of this method on graphs suggests it might be useful for other types of structured data as well.


Multi-label classification using boolean matrix decomposition

Jörg Wicker, Bernhard Pfahringer, and Stefan Kramer. Multi-label classification using boolean matrix decomposition. In Proceedings of the ACM Symposium on Applied Computing, pages 179-186. ACM, 2012.
[ bib ]
This paper introduces a new multi-label classifier based on Boolean matrix decomposition. Boolean matrix decomposition is used to extract, from the full label matrix, latent labels representing useful Boolean combinations of the original labels. Base level models predict latent labels, which are subsequently transformed into the actual labels by Boolean matrix multiplication with the second matrix from the decomposition. The new method is tested on six publicly available datasets with varying numbers of labels. The experimental evaluation shows that the new method works particularly well on datasets with a large number of labels and strong dependencies among them.


Parameter Tuning Using Gaussian Processes

Jinjin Ma. Parameter tuning using gaussian processes. Master's thesis, Department of Computer Science, University of Waikato, 2012.
[ bib | http ]
Most machine learning algorithms require us to set up their parameter values before applying these algorithms to solve problems. Appropriate parameter settings will bring good performance while inappropriate parameter settings generally result in poor modelling. Hence, it is necessary to acquire the “best” parameter values for a particular algorithm before building the model. The “best” model not only reflects the “real” function and is well fitted to existing points, but also gives good performance when making predictions for new points with previously unseen values. A number of methods exist that have been proposed to optimize parameter values. The basic idea of all such methods is a trial-and-error process whereas the work presented in this thesis employs Gaussian process (GP) regression to optimize the parameter values of a given machine learning algorithm. In this thesis, we consider the optimization of only two-parameter learning algorithms. All the possible parameter values are specified in a 2-dimensional grid in this work. To avoid brute-force search, Gaussian Process Optimization (GPO) makes use of “expected improvement” to pick useful points rather than validating every point of the grid step by step. The point with the highest expected improvement is evaluated using cross-validation and the resulting data point is added to the training set for the Gaussian process model. This process is repeated until a stopping criterion is met. The final model is built using the learning algorithm based on the best parameter values identified in this process. In order to test the effectiveness of this optimization method on regression and classification problems, we use it to optimize parameters of some well-known machine learning algorithms, such as decision tree learning, support vector machines and boosting with trees. Through the analysis of experimental results obtained on datasets from the UCI repository, we find that the GPO algorithm yields competitive performance compared with a brute-force approach, while exhibiting a distinct advantage in terms of training time and number of cross-validation runs. Overall, the GPO method is a promising method for the optimization of parameter values in machine learning.


An application of data mining to fruit and vegetable sample identification using Gas Chromatography-Mass Spectrometry

Geoffrey Holmes, Dale Fletcher, and Peter Reutemann. An application of data mining to fruit and vegetable sample identification using gas chromatography-mass spectrometry. In Proceedings of the International Congress on Environmental Modelling and Software (IEMSS), Leipzig, Germany, 2012.
[ bib | .pdf ]
One of the uses of Gas Chromatography-Mass Spectrometry (GC-MS) is in the detection of pesticide residues in fruit and vegetables. In a high throughput laboratory there is the potential for sample swaps or mislabelling, as once a sample has been pre-processed to be injected into the GC-MS analyser, it is no longer distinguishable by eye. Possible consequences of such mistakes can be the destruction of large amounts of actually safe produce or pesticide-contaminated produce reaching the consumer. For the purposes of food safety and traceability, it can also be extremely valuable to know the source (country of origin) of a food product. This can help uncover fraudulent attempts of trying to sell food originating from countries deemed unsafe. In this study, we use the workflow environment ADAMS to examine whether we can determine the fruit/vegetable, and the country of origin of a sample from a GC-MS chromatogram. A workflow is used to generate data sets using different data pre-processing methods, and data representations from a database of over 8000 GC-MS chromatograms, consisting of more than 100 types of fruit and vegetables from more than 120 countries. A variety of classification algorithms are evaluated using the WEKA data mining workbench. We demonstrate excellent results, both for the determination of fruit/vegetable type and for the country of origin, using a histogram of ion counts, and Classification by Regression using Random Regression Forest with PLS-transformed data.


Evolutionary Data Selection for Enhancing Models of Intraday Forex Time Series

Michael Mayo. Evolutionary data selection for enhancing models of intraday forex time series. In Proceedings Applications of Evolutionary Computation, pages 184-193. Springer, 2012.
[ bib ]
The hypothesis in this paper is that a significant amount of intraday market data is either noise or redundant, and that if it is eliminated, then predictive models built using the remaining intraday data will be more accurate. To test this hypothesis, we use an evolutionary method (called Evolutionary Data Selection, EDS) to selectively remove out portions of training data that is to be made available to an intraday market predictor. After performing experiments in which data-selected and non-data-selected versions of the same predictive models are compared, it is shown that EDS is effective and does indeed boost predictor accuracy. It is also shown in the paper that building multiple models using EDS and placing them into an ensemble further increases performance. The datasets for evaluation are large intraday forex time series, specifically series from the EUR/USD, the USD/JPY and the EUR/JPY markets, and predictive models for two primary tasks per market are built: intraday return prediction and intraday volatility prediction.


Developing data mining applications

Geoff Holmes. Developing data mining applications. In International Conference on Knowledge Discovery and Data Mining, page 225. ACM, 2012.
[ bib ]
In this talk I will review several real-world applications developed at the University of Waikato over the past 15 years. These include the use of near infrared spectroscopy coupled with data mining as an alternate laboratory technique for predicting compound concentrations in soil and plant samples, and the analysis of gas chromatography mass spectrometry (GCMS) data, a technique used to determine in environmental applications, for example, the petroleum content in soil and water samples. I will then briefly discuss how experience with these applications has led to the development of an open-source framework for application development.


Scientific Workflow Management with ADAMS

Peter Reutemann and Joaquin Vanschoren. Scientific workflow management with adams. In PeterA. Flach, Tijl Bie, and Nello Cristianini, editors, Machine Learning and Knowledge Discovery in Databases, volume 7524 of Lecture Notes in Computer Science, pages 833-837. Springer Berlin Heidelberg, 2012.
[ bib | http ]
We demonstrate the Advanced Data mining And Machine learning System (ADAMS), a novel workflow engine designed for rapid prototyping and maintenance of complex knowledge workflows. ADAMS does not require the user to manually connect inputs to outputs on a large canvas. It uses a compact workflow representation, control operators, and a simple interface between operators, allowing them to be auto-connected. It contains an extensive library of operators for various types of analysis, and a convenient plug-in architecture to easily add new ones.
Keywords: scientific workflows; machine learning; data mining