2011

[1]

Clustering mixed data

Lynette Hunt and Murray Jorgensen. Clustering mixed data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(4):352-361, 2011.
[ bib | http ]
Mixture model clustering proceeds by fitting a finite mixture of multivariate distributions to data, the fitted mixture density then being used to allocate the data to one of the components. Common model formulations assume that either all the attributes are continuous or all the attributes are categorical. In this paper, we consider options for model formulation in the more practical case of mixed data: multivariate data sets that contain both continuous and categorical attributes.
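The allocation step described above can be illustrated with a minimal sketch for one continuous and one categorical attribute, assuming a location-model-style component density (Gaussian times multinomial). All parameter values below are illustrative; a real model would be fitted with EM.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixed_density(x_cont, x_cat, components):
    """Mixture density for one continuous and one categorical attribute.
    Each component: (weight, mu, sigma, {category: probability})."""
    return sum(w * gaussian_pdf(x_cont, mu, sigma) * cat_probs[x_cat]
               for w, mu, sigma, cat_probs in components)

def allocate(x_cont, x_cat, components):
    """Allocate a data point to the component with the largest responsibility."""
    scores = [w * gaussian_pdf(x_cont, mu, sigma) * cat_probs[x_cat]
              for w, mu, sigma, cat_probs in components]
    return max(range(len(scores)), key=scores.__getitem__)

# Two illustrative components:
comps = [(0.6, 0.0, 1.0, {"a": 0.8, "b": 0.2}),
         (0.4, 5.0, 1.0, {"a": 0.1, "b": 0.9})]
print(allocate(0.2, "a", comps))  # 0
```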

[2]

Using the online cross-entropy method to learn relational policies for playing different games

Samuel Sarjant, Bernhard Pfahringer, Kurt Driessens, and Tony Smith. Using the online cross-entropy method to learn relational policies for playing different games. In Proceedings IEEE Conference on Computational Intelligence and Games, pages 182-189. IEEE, 2011.
[ bib ]
By defining a video-game environment as a collection of objects, relations, actions and rewards, the relational reinforcement learning algorithm presented in this paper generates and optimises a set of concise, human-readable relational rules for achieving maximal reward. Rule learning is achieved using a combination of incremental specialisation of rules and a modified online cross-entropy method, which dynamically adjusts the rate of learning as the agent progresses. The algorithm is tested on the Ms. Pac-Man and Mario environments, with results indicating the agent learns an effective policy for acting within each environment.
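The cross-entropy method at the core of this approach can be sketched in its generic, offline form on a toy problem (the paper's online variant additionally adapts the learning rate dynamically; everything below, including the OneMax objective, is illustrative):

```python
import random

def cross_entropy_onemax(n_bits=10, pop=50, elite_frac=0.2, alpha=0.6, iters=40, seed=1):
    """Generic cross-entropy method on OneMax: repeatedly sample candidate
    bit-vectors, keep the highest-reward 'elite' fraction, and move the
    Bernoulli sampling parameters towards the elite frequencies."""
    rng = random.Random(seed)
    p = [0.5] * n_bits                       # sampling distribution
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = [[1 if rng.random() < pi else 0 for pi in p] for _ in range(pop)]
        samples.sort(key=sum, reverse=True)  # reward = number of ones
        elite = samples[:n_elite]
        # Move each parameter towards the elite frequency (step size alpha)
        for i in range(n_bits):
            freq = sum(s[i] for s in elite) / n_elite
            p[i] = (1 - alpha) * p[i] + alpha * freq
    return p

p = cross_entropy_onemax()
print([round(pi, 2) for pi in p])  # parameters drift towards 1.0
```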

[3]

Classifier chains for multi-label classification

Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333-359, 2011.
[ bib | http ]
The widely known binary relevance method for multi-label classification, which considers each label as an independent binary problem, has often been overlooked in the literature due to the perceived inadequacy of not directly modelling label correlations. Most current methods invest considerable complexity to model interdependencies between labels. This paper shows that binary relevance-based methods have much to offer, and that high predictive performance can be obtained without impeding scalability to large datasets. We exemplify this with a novel classifier chains method that can model label correlations while maintaining acceptable computational complexity. We extend this approach further in an ensemble framework. An extensive empirical evaluation covers a broad range of multi-label datasets with a variety of evaluation metrics. The results illustrate the competitiveness of the chaining method against related and state-of-the-art methods, both in terms of predictive performance and time complexity.
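The chaining idea can be sketched in a few lines: label j is predicted by a binary classifier whose inputs are the features plus the predictions for labels 1..j-1. The 1-nearest-neighbour base learner and the tiny dataset below are illustrative stand-ins, not the paper's experimental setup.

```python
def nn_predict(train_X, train_y, x):
    """1-nearest-neighbour base classifier (stand-in for any binary learner)."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    return train_y[best]

def chain_predict(X, Y, x):
    """Classifier chain: label j is predicted from the features extended with
    the predictions for labels 1..j-1, so label correlations can propagate."""
    n_labels = len(Y[0])
    preds = []
    for j in range(n_labels):
        # Training inputs for label j: original features + true labels 1..j-1
        Xj = [xi + yi[:j] for xi, yi in zip(X, Y)]
        preds.append(nn_predict(Xj, [yi[j] for yi in Y], x + preds))
    return preds

X = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
Y = [[0, 0], [1, 1], [1, 0]]
print(chain_predict(X, Y, [0.9, 0.95]))  # [1, 1]
```

The ensemble version described in the paper trains several chains with different label orders and averages their votes.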

[4]

Beyond Trees: Adopting MITI to Learn Rules and Ensemble Classifiers for Multi-Instance Data

Luke Bjerring and Eibe Frank. Beyond trees: Adopting MITI to learn rules and ensemble classifiers for multi-instance data. In Proc 24th Australasian Joint Conference on Artificial Intelligence, Perth, Australia, pages 41-50. Springer, 2011.
[ bib | .pdf ]
MITI is a simple and elegant decision tree learner designed for multi-instance classification problems, where examples for learning consist of bags of instances. MITI grows a tree in best-first manner by maintaining a priority queue containing the unexpanded nodes in the fringe of the tree. When the head node contains instances from positive examples only, it is made into a leaf, and any bag of data that is associated with this leaf is removed. In this paper we first revisit the basic algorithm and consider the effect of parameter settings on classification accuracy, using several benchmark datasets. We show that the chosen splitting criterion in particular can have a significant effect on accuracy. We identify a potential weakness of the algorithm—subtrees can contain structure that has been created using data that is subsequently removed—and show that a simple modification turns the algorithm into a rule learner that avoids this problem. This rule learner produces more compact classifiers with comparable accuracy on the benchmark datasets we consider. Finally, we present randomized algorithm variants that enable us to generate ensemble classifiers. We show that these can yield substantially improved classification accuracy.

[5]

Using Output Codes for Two-class Classification Problems

Fanhua Zeng. Using output codes for two-class classification problems. Master's thesis, Department of Computer Science, University of Waikato, 2011.
[ bib | http ]
Error-correcting output codes (ECOCs) have been widely used in many applications for multi-class classification problems. The problem is that ECOCs cannot be applied directly on two-class datasets. The goal of this thesis is to design and evaluate an approach to solve this problem, and then investigate whether the approach can yield better classification models. To be able to use ECOCs, we turn two-class datasets into multi-class datasets first, by using clustering. With the resulting multi-class datasets in hand, we evaluate three different encoding methods for ECOCs: exhaustive coding, random coding and a “pre-defined” code that is found using random search. The exhaustive coding method has the highest error-correcting abilities. However, this method is limited due to the exponential growth of bit columns in the codeword matrix, precluding it from being used for problems with large numbers of classes. Random coding can be used to cover situations with large numbers of classes in the data. To improve on completely random matrices, “pre-defined” codeword matrices can be generated by using random search that optimizes row separation, yielding better error correction than a purely random matrix. To speed up the process of finding good matrices, GPU parallel programming is investigated in this thesis. From the empirical results, we can say that the new algorithm, which applies multi-class ECOCs on two-class data using clustering, does improve the performance for some base learners, when compared to applying them directly to the original two-class datasets.
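The random-search step for "pre-defined" codes can be sketched as follows: repeatedly draw random codeword matrices, score each by its minimum pairwise Hamming row separation, and keep the best. Matrix sizes and the number of tries below are illustrative, not the thesis's GPU-accelerated settings.

```python
import random

def min_row_separation(matrix):
    """Smallest pairwise Hamming distance between codewords (rows)."""
    n = len(matrix)
    return min(sum(a != b for a, b in zip(matrix[i], matrix[j]))
               for i in range(n) for j in range(i + 1, n))

def random_search_matrix(n_classes, n_bits, tries=200, seed=7):
    """'Pre-defined' code: keep the random codeword matrix with the best
    minimum row separation found over a fixed number of tries."""
    rng = random.Random(seed)
    best, best_sep = None, -1
    for _ in range(tries):
        m = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_classes)]
        sep = min_row_separation(m)
        if sep > best_sep:
            best, best_sep = m, sep
    return best, best_sep

def decode(matrix, bits):
    """Assign the class whose codeword is closest in Hamming distance."""
    return min(range(len(matrix)),
               key=lambda c: sum(a != b for a, b in zip(matrix[c], bits)))

m, sep = random_search_matrix(n_classes=4, n_bits=8)
print(sep)  # minimum pairwise Hamming distance of the chosen code
```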

[6]

Smoothing in Probability Estimation Trees

Zhimeng Han. Smoothing in probability estimation trees. Master's thesis, Department of Computer Science, University of Waikato, 2011.
[ bib | http ]
Classification learning is a type of supervised machine learning technique that uses a classification model (e.g. a decision tree) to predict unknown class labels for previously unseen instances. In many applications it can be very useful to additionally obtain class probabilities for the different class labels. Decision trees that yield these probabilities are also called probability estimation trees (PETs). Smoothing is a technique used to improve the probability estimates. There are several existing smoothing methods, such as the Laplace correction, M-Estimate smoothing and M-Branch smoothing. Smoothing does not just apply to PETs: in the field of text compression, PPM in particular, smoothing methods play an important role. This thesis migrates smoothing methods from text compression to PETs. The newly migrated methods are compared with the best of the existing smoothing methods considered in this thesis under different experimental setups. Unpruned, pruned and bagged trees are considered in the experiments. The main finding is that the PPM-based methods yield the best probability estimates when used with bagged trees, but not when used with individual (pruned or unpruned) trees.
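Two of the baseline smoothing methods named above are simple enough to show directly. A minimal sketch of the Laplace correction and M-Estimate smoothing applied to the class counts at a tree leaf (the counts and priors are illustrative):

```python
def laplace_probabilities(class_counts):
    """Laplace-corrected class probability estimates at a tree leaf:
    add one pseudo-count per class before normalising."""
    total = sum(class_counts) + len(class_counts)
    return [(c + 1) / total for c in class_counts]

def m_estimate(class_counts, priors, m=2.0):
    """M-estimate smoothing: blend leaf frequencies with class priors,
    with m controlling the strength of the prior."""
    total = sum(class_counts)
    return [(c + m * p) / (total + m) for c, p in zip(class_counts, priors)]

# A pure leaf of 3 instances: the raw estimate would be [1.0, 0.0];
# both corrections pull it towards the prior.
print(laplace_probabilities([3, 0]))  # [0.8, 0.2]
print(m_estimate([3, 0], [0.5, 0.5]))
```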

[7]

Concept-based text clustering

Lan Huang. Concept-based text clustering. PhD thesis, Department of Computer Science, University of Waikato, 2011.
[ bib | http ]
Thematic organization of text is a natural practice of humans and a crucial task for today’s vast repositories. Clustering automates this by assessing the similarity between texts and organizing them accordingly, grouping like ones together and separating those with different topics. Clusters provide a comprehensive logical structure that facilitates exploration, search and interpretation of current texts, as well as organization of future ones. Automatic clustering is usually based on words. Text is represented by the words it mentions, and thematic similarity is based on the proportion of words that texts have in common. The resulting bag-of-words model is semantically ambiguous and undesirably orthogonal—it ignores the connections between words. This thesis claims that using concepts as the basis of clustering can significantly improve effectiveness. Concepts are defined as units of knowledge. When organized according to the relations among them, they form a concept system. Two concept systems are used here: WordNet, which focuses on word knowledge, and Wikipedia, which encompasses world knowledge. We investigate a clustering procedure with three components: using concepts to represent text; taking the semantic relations among them into account during clustering; and learning a text similarity measure from concepts and their relations. First, we demonstrate that concepts provide a succinct and informative representation of the themes in text, exemplifying this with the two concept systems. Second, we define methods for utilizing concept relations to enhance clustering by making the representation models more discriminative and extending thematic similarity beyond surface overlap. Third, we present a similarity measure based on concepts and their relations that is learned from a small number of examples, and show that it both predicts similarity consistently with human judgement and improves clustering.
The thesis provides strong support for the use of concept-based representations instead of the classic bag-of-words model.

[8]

Sequence-based protein classification: binary Profile Hidden Markov Models and propositionalisation

Stefan Mutter. Sequence-based protein classification: binary Profile Hidden Markov Models and propositionalisation. PhD thesis, Department of Computer Science, University of Waikato, 2011.
[ bib | http ]
Detecting similarity in biological sequences is a key element to understanding the mechanisms of life. Researchers infer potential structural, functional or evolutionary relationships from similarity. However, the concept of similarity is complex in biology. Sequences consist of different molecules with different chemical properties, have short and long distance interactions, form 3D structures and change through evolutionary processes. Amino acids are one of the key molecules of life. Most importantly, a sequence of amino acids constitutes the building block for proteins which play an essential role in cellular processes. This thesis investigates similarity amongst proteins. In this area of research there are two important and closely related classification tasks – the detection of similar proteins and the discrimination amongst them. Hidden Markov Models (HMMs) have been successfully applied to the detection task as they model sequence similarity very well. From a Machine Learning point of view these HMMs are essentially one-class classifiers trained solely on a small number of similar proteins neglecting the vast number of dissimilar ones. Our basic assumption is that integrating this neglected information will be highly beneficial to the classification task. Thus, we transform the problem representation from a one-class to a binary one. Equipped with the necessary sound understanding of Machine Learning, especially concerning problem representation and statistically significant evaluation, our work pursues and combines two different avenues on this aforementioned transformation. First, we introduce a binary HMM that discriminates significantly better than the standard one, even when only a fraction of the negative information is used. Second, we interpret the HMM as a structured graph of information. This information cannot be accessed by highly optimised standard Machine Learning classifiers as they expect a fixed length feature vector representation. 
Propositionalisation is a technique to transform the former representation into the latter. This thesis introduces new propositionalisation techniques. The change in representation changes the learning problem from a one-class, generative to a propositional, discriminative one. It is a common assumption that discriminative techniques are better suited for classification tasks, and our results validate this assumption. We suggest a new way to significantly improve on discriminative power and runtime by means of terminating the time-intense training of HMMs early, subsequently applying propositionalisation and classifying with a discriminative, binary learner.

[9]

Experiments with Multi-view Multi-instance Learning for Supervised Image Classification

Michael Mayo and Eibe Frank. Experiments with multi-view multi-instance learning for supervised image classification. In Proc 26th International Conference Image and Vision Computing New Zealand, Auckland, New Zealand, pages 363-369, 2011.
[ bib | .pdf ]
In this paper we empirically investigate the benefits of multi-view multi-instance (MVMI) learning for supervised image classification. In multi-instance learning, examples for learning contain bags of feature vectors and thus data from different views cannot simply be concatenated as in the single-instance case. Hence, multi-view learning, where one classifier is built per view, is particularly attractive when applying multi-instance learning to image classification. We take several diverse image data sets—ranging from person detection to astronomical object classification to species recognition—and derive a set of multiple instance views from each of them. We then show via an extensive set of 10×10 stratified cross-validation experiments that MVMI, based on averaging predicted confidence scores, generally exceeds the performance of traditional single-view multi-instance learning, when using support vector machines and boosting as the underlying learning algorithms.

[10]

A comparison of methods for estimating prediction intervals in NIR spectroscopy: Size matters

Remco R. Bouckaert, Eibe Frank, Geoffrey Holmes, and Dale Fletcher. A comparison of methods for estimating prediction intervals in NIR spectroscopy: Size matters. Chemometrics and Intelligent Laboratory Systems, 109(2):139-145, 2011.
[ bib | http ]
In this article we demonstrate that, when evaluating a method for determining prediction intervals, interval size matters more than coverage because the latter can be fixed at a chosen confidence level with good reliability. To achieve the desired coverage, we employ a simple non-parametric method to compute prediction intervals by calibrating estimated prediction errors, and we extend the basic method with a continuum correction to deal with small data sets. In our experiments on a collection of several NIR data sets, we evaluate several existing methods of computing prediction intervals for partial least-squares (PLS) regression. Our results show that, when coverage is fixed at a chosen confidence level, and the number of PLS components is selected to minimize squared error of point estimates, interval estimation based on the classic ordinary least-squares method produces the narrowest intervals, outperforming the U-deviation method and linearization, regardless of the confidence level that is chosen.
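The calibration idea behind the non-parametric method can be sketched as follows: widen a point estimate by an empirical quantile of held-out absolute prediction errors. This is a simplified illustration of the general principle, not the paper's exact procedure (the residuals and confidence level below are made up):

```python
def calibrated_interval(residuals, point_estimate, confidence=0.9):
    """Non-parametric prediction interval: widen the point estimate by the
    empirical quantile of held-out absolute prediction errors, so the
    requested coverage is met on the calibration data."""
    errs = sorted(abs(r) for r in residuals)
    # error bound that covers at least the requested fraction of residuals
    k = min(len(errs) - 1, int(confidence * len(errs)))
    half_width = errs[k]
    return (point_estimate - half_width, point_estimate + half_width)

residuals = [-0.3, 0.1, -0.2, 0.4, 0.05, -0.15, 0.25, -0.1, 0.2, -0.05]
lo, hi = calibrated_interval(residuals, point_estimate=10.0, confidence=0.8)
print(lo, hi)
```

Fixing coverage this way is what makes interval size, rather than coverage, the quantity on which methods should be compared.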

[11]

Data Mining: Practical Machine Learning Tools and Techniques

Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA, 3 edition, 2011.
[ bib | .html ]

[12]

Bagging Ensemble Selection

Quan Sun and Bernhard Pfahringer. Bagging ensemble selection. In Australasian Conference on Artificial Intelligence, pages 251-260. Springer, 2011.
[ bib | http ]
Ensemble selection has recently emerged as a popular ensemble learning method, not only because its implementation is fairly straightforward, but also because of its excellent predictive performance on practical problems. The method has been highlighted in winning solutions of many data mining competitions, such as the Netflix competition, the KDD Cup 2009 and 2010, the UCSD FICO contest 2010, and a number of data mining competitions on the Kaggle platform. In this paper we present a novel variant: bagging ensemble selection. Three variations of the proposed algorithm are compared to the original ensemble selection algorithm and other ensemble algorithms. Experiments with ten real world problems from diverse domains demonstrate the benefit of the bagging ensemble selection algorithm.

[13]

Semi-random Model Tree Ensembles: An Effective and Scalable Regression Method

Bernhard Pfahringer. Semi-random model tree ensembles: An effective and scalable regression method. In Australasian Conference on Artificial Intelligence, pages 231-240. Springer, 2011.
[ bib | http ]
We present and investigate ensembles of semi-random model trees as a novel regression method. Such ensembles combine the scalability of tree-based methods with predictive performance rivalling the state of the art in numeric prediction. An empirical investigation shows that semi-random model trees produce predictive performance which is competitive with state-of-the-art methods like Gaussian Processes Regression or Additive Groves of Regression Trees. The training and optimization of semi-random model trees scales better than Gaussian Processes Regression to larger datasets, and is consistently one to two orders of magnitude faster than Additive Groves.

[14]

A Model for Predicting the Resolution of Type 2 Diabetes in Severely Obese Subjects Following Roux-en Y Gastric Bypass Surgery

Mark Thomas Hayes, Lynette Anne Hunt, Jonathan Foo, Yulia Tychinskaya, and Richard Strawson Stubbs. A model for predicting the resolution of type 2 diabetes in severely obese subjects following Roux-en Y gastric bypass surgery. Obesity Surgery, 21(7):910-916, 2011.
[ bib | http ]
Severely obese type 2 diabetics who undergo Roux-en Y gastric bypass surgery have significant improvements in glycaemic control. Little work has been undertaken to establish the independent predictors of such resolution or to develop a predictive model. The aim of this study was to develop a mathematical model and establish independent predictors for the resolution of diabetes. METHODS: A consecutive sample of 130 severely obese type 2 diabetics who underwent gastric bypass surgery for weight loss from November 1997 to May 2007 with prospective pre-operative documentation of biochemical and clinical measurements was followed up over 12 months. Logistic discrimination analysis was undertaken to identify those variables with independent predictive value and to develop a predictive model for resolution of type 2 diabetes. Consecutive samples of 130 patients with body mass index (BMI) ≥ 35 with type 2 diabetes were selected. One hundred and twenty-seven patients completed the study with a sufficient data set. Patients were deemed unresolved if (1) diabetic medication was still required after surgery; (2) if fasting plasma glucose (FPG) remained >7 mmol/L; or (3) HbA1c remained >7%. RESULTS: Resolution of diabetes was seen in 84%, while diabetes remained but was improved in 16% of patients. Resolution was rapid and sustained with 74% of those on medication before surgery being able to discontinue this by the time of discharge 6 days following surgery. Five pre-operative variables were found to have independent predictive value for resolution of diabetes, including BMI, HbA1c, FPG, hypertension and requirement for insulin. Two models have been proposed for prediction of diabetes resolution, each with 86% correct classification in this cohort of patients. CONCLUSIONS: Type 2 diabetes resolves in a very high percentage of patients undergoing gastric bypass surgery for severe obesity. 
The key predictive variables include pre-operative BMI, HbA1c, FPG, the presence of hypertension and diabetic status.

[15]

Hybridizing Data Stream Mining and Technical Indicators in Automated Trading Systems

Michael Mayo. Hybridizing data stream mining and technical indicators in automated trading systems. In Modeling Decisions for Artificial Intelligence - 8th International Conference, MDAI 2011, Changsha, Hunan, China, July 28-30, 2011, Proceedings, pages 79-90. Springer, 2011.
[ bib | http ]
Automated trading systems for financial markets can use data mining techniques for future price movement prediction. However, classifier accuracy is only one important component in such a system: the other is a decision procedure utilizing the prediction in order to be long, short or out of the market. In this paper, we investigate the use of technical indicators as a means of deciding when to trade in the direction of a classifier’s prediction. We compare this “hybrid” technical/data stream mining-based system with a naive system that always trades in the direction of predicted price movement. We are able to show via evaluations across five financial market datasets that our novel hybrid technique frequently outperforms the naive system. To strengthen our conclusions, we also include in our evaluation several “simple” trading strategies without any data mining component that provide a much stronger baseline for comparison than traditional buy-and-hold or sell-and-hold strategies.

[16]

Modelling epistasis in genetic disease using Petri nets, evolutionary computation and frequent itemset mining

Michael Mayo and Lorenzo Beretta. Modelling epistasis in genetic disease using Petri nets, evolutionary computation and frequent itemset mining. Expert Syst. Appl., 38(4):4006-4013, 2011.
[ bib | http ]
Petri nets are useful for mathematically modelling disease-causing genetic epistasis. A Petri net model of an interaction has the potential to lead to biological insight into the cause of a genetic disease. However, defining a Petri net by hand for a particular interaction is extremely difficult because of the sheer complexity of the problem and degrees of freedom inherent in a Petri net’s architecture. We propose therefore a novel method, based on evolutionary computation and data mining, for automatically constructing Petri net models of non-linear gene interactions. The method comprises two main steps. Firstly, an initial partial Petri net is set up with several repeated sub-nets that model individual genes and a set of constraints, comprising relevant common sense and biological knowledge, is also defined. These constraints characterise the class of Petri nets that are desired. Secondly, this initial Petri net structure and the constraints are used as the input to a genetic algorithm. The genetic algorithm searches for a Petri net architecture that is both a superset of the initial net, and also conforms to all of the given constraints. The genetic algorithm evaluation function that we employ gives equal weighting to both the accuracy of the net and also its parsimony. We demonstrate our method using an epistatic model related to the presence of digital ulcers in systemic sclerosis patients that was recently reported in the literature. Our results show that although individual “perfect” Petri nets can frequently be discovered for this interaction, the true value of this approach lies in generating many different perfect nets, and applying data mining techniques to them in order to elucidate common and statistically significant patterns of interaction.

[17]

MOA-TweetReader: Real-Time Analysis in Twitter Streaming Data

Albert Bifet, Geoffrey Holmes, and Bernhard Pfahringer. MOA-TweetReader: Real-time analysis in Twitter streaming data. In Discovery Science - 14th International Conference, DS 2011, Espoo, Finland, October 5-7, 2011. Proceedings, pages 46-60. Springer, 2011.
[ bib | http ]
Twitter is a micro-blogging service built to discover what is happening at any moment in time, anywhere in the world. Twitter messages are short, generated constantly, and well suited for knowledge discovery using data stream mining. We introduce MOA-TweetReader, a system for processing tweets in real time. We show two main applications of the new system for studying Twitter data: detecting changes in term frequencies and performing real-time sentiment analysis.

[18]

Active Learning with Evolving Streaming Data

Indre Zliobaite, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes. Active learning with evolving streaming data. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part III, pages 597-612. Springer, 2011.
[ bib | http ]
In learning to classify streaming data, obtaining the true labels may require major effort and may incur excessive cost. Active learning focuses on learning an accurate model with as few labels as possible. Streaming data poses additional challenges for active learning, since the data distribution may change over time (concept drift) and classifiers need to adapt. Conventional active learning strategies concentrate on querying the most uncertain instances, which are typically concentrated around the decision boundary. If changes do not occur close to the boundary, they will be missed and classifiers will fail to adapt. In this paper we develop two active learning strategies for streaming data that explicitly handle concept drift. They are based on uncertainty, dynamic allocation of labeling efforts over time and randomization of the search space. We empirically demonstrate that these strategies react well to changes that can occur anywhere in the instance space and unexpectedly.
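The combination of uncertainty, budget allocation and randomization can be sketched as a simple query rule; this is an illustrative variant, not the paper's exact formulation, and all numeric settings below are made up:

```python
import random

def should_query(confidence, threshold, rng, randomize_std=1.0):
    """Randomised uncertainty strategy: query when the (randomly perturbed)
    confidence falls below the current threshold, so labels are occasionally
    requested away from the decision boundary and drift there can be noticed."""
    perturbed = confidence * abs(rng.gauss(1.0, randomize_std))
    return perturbed < threshold

def adjust_threshold(threshold, queried, step=0.01):
    """Variable threshold: spread the labelling budget over time by
    tightening after a query and relaxing otherwise."""
    return threshold * (1 - step) if queried else threshold * (1 + step)

rng = random.Random(0)
threshold, queries = 0.5, 0
for conf in [0.9, 0.55, 0.95, 0.4, 0.85] * 40:   # simulated confidence stream
    q = should_query(conf, threshold, rng)
    threshold = adjust_threshold(threshold, q)
    queries += q
print(queries)  # only part of the stream gets labelled
```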

[19]

MOA: A Real-Time Analytics Open Source Framework

Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Jesse Read, Philipp Kranen, Hardy Kremer, Timm Jansen, and Thomas Seidl. MOA: A real-time analytics open source framework. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part III, pages 617-620. Springer, 2011.
[ bib | http ]
Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA is designed to deal with the challenging problems of scaling up the implementation of state of the art algorithms to real world dataset sizes and of making algorithms comparable in benchmark streaming settings. It contains a collection of offline and online algorithms for classification, clustering and graph mining as well as tools for evaluation. For researchers the framework yields insights into advantages and disadvantages of different approaches and allows for the creation of benchmark streaming data sets through stored, shared and repeatable settings for the data feeds. Practitioners can use the framework to easily compare algorithms and apply them to real world data sets and settings. MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis. Besides providing algorithms and measures for evaluation and comparison, MOA is easily extensible with new contributions and allows for the creation of benchmark scenarios.

[20]

Online Evaluation of Email Streaming Classifiers Using GNUsmail

José M. Carmona-Cejudo, Manuel Baena-García, José del Campo-Ávila, Albert Bifet, Joao Gama, and Rafael Morales Bueno. Online evaluation of email streaming classifiers using GNUsmail. In Advances in Intelligent Data Analysis X - 10th International Symposium, IDA 2011, Porto, Portugal, October 29-31, 2011. Proceedings, pages 90-100. Springer, 2011.
[ bib | http ]
Real-time email classification is a challenging task because of its online nature, which is subject to concept drift. Identifying spam, where only two labels exist, has received great attention in the literature. We are nevertheless interested in classification involving multiple folders, which is an additional source of complexity. Moreover, neither cross-validation nor other sampling procedures are suitable for data stream evaluation. Therefore, other metrics, like the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using mechanisms such as fading factors. In this paper we present GNUsmail, an open-source extensible framework for email classification, and focus on its ability to perform online evaluation. GNUsmail’s architecture supports incremental and online learning, and it can be used to compare different online mining methods, using state-of-the-art evaluation metrics. We show how GNUsmail can be used to compare different algorithms, including a tool for launching replicable experiments.
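The prequential error with fading factors mentioned in the abstract can be sketched in a few lines (shown here as faded prequential accuracy; the stream of outcomes is simulated):

```python
def prequential_accuracy(outcomes, fading=0.99):
    """Prequential (test-then-train) accuracy with a fading factor: recent
    predictions weigh more, so the estimate can track concept drift instead
    of being dominated by the distant past."""
    s = n = 0.0
    history = []
    for correct in outcomes:          # 1 if the prediction was right, else 0
        s = fading * s + correct      # faded count of correct predictions
        n = fading * n + 1            # faded count of all predictions
        history.append(s / n)
    return history

# Accuracy collapses after a simulated concept drift halfway through
stream = [1] * 50 + [0] * 50
acc = prequential_accuracy(stream, fading=0.9)
print(round(acc[49], 3), round(acc[99], 3))
```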

[21]

Mining frequent closed graphs on evolving data streams

Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Ricard Gavaldà. Mining frequent closed graphs on evolving data streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, pages 591-599. ACM, 2011.
[ bib | http ]
Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this paper we present a framework for studying graph pattern mining on time-varying streams. Three new methods for mining frequent closed subgraphs are presented. All methods work on coresets of closed subgraphs, compressed representations of graph sets, and maintain these sets in a batch-incremental manner, but use different approaches to address potential concept drift. An evaluation study on datasets comprising up to four million graphs explores the strength and limitations of the proposed methods. To the best of our knowledge this is the first work on mining frequent closed subgraphs in non-stationary data streams.

[22]

An effective evaluation measure for clustering on evolving data streams

Hardy Kremer, Philipp Kranen, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. An effective evaluation measure for clustering on evolving data streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, pages 868-876. ACM, 2011.
[ bib | http ]
Due to the ever growing presence of data streams, there has been a considerable amount of research on stream mining algorithms. While many algorithms have been introduced that tackle the problem of clustering on evolving data streams, hardly any attention has been paid to appropriate evaluation measures. Measures developed for static scenarios, namely structural measures and ground-truth-based measures, cannot correctly reflect errors attributable to emerging, splitting, or moving clusters. These situations are inherent to the streaming context due to the dynamic changes in the data distribution. In this paper we develop a novel evaluation measure for stream clustering called Cluster Mapping Measure (CMM). CMM effectively indicates different types of errors by taking the important properties of evolving data streams into account. We show in extensive experiments on real and synthetic data that CMM is a robust measure for stream clustering evaluation.

[23]

Mining frequent closed trees in evolving data streams

Albert Bifet and Ricard Gavaldà. Mining frequent closed trees in evolving data streams. Intell. Data Anal., 15(1):29-48, 2011.
[ bib | http ]
We propose new algorithms for adaptively mining closed rooted trees, both labeled and unlabeled, from data streams that change over time. Closed patterns are powerful representatives of frequent patterns, since they eliminate redundant information. Our approach is based on an advantageous representation of trees and a low-complexity notion of relaxed closed trees, as well as ideas from Galois Lattice Theory. More precisely, we present three closed tree mining algorithms in sequence: an incremental one, IncTreeMiner, a sliding-window based one, WinTreeMiner, and finally one that mines closed trees adaptively from data streams, AdaTreeMiner. By adaptive we mean here that it presents at all times the closed trees that are frequent in the current state of the data stream. To the best of our knowledge this is the first work on mining closed frequent trees in streaming data varying with time. We give a first experimental evaluation of the proposed algorithms.

[24]

Using GNUsmail to Compare Data Stream Mining Methods for On-line Email Classification

José M. Carmona-Cejudo, Manuel Baena-García, Rafael Morales Bueno, Joao Gama, and Albert Bifet. Using GNUsmail to compare data stream mining methods for on-line email classification. Journal of Machine Learning Research - Proceedings Track, 17:12-18, 2011.
[ bib | .pdf ]
Real-time classification of emails is a challenging task because of its online nature, and also because email streams are subject to concept drift. Identifying email spam, where only two different labels or classes are defined (spam or not spam), has received great attention in the literature. We are nevertheless interested in a more specific classification where multiple folders exist, which is an additional source of complexity: the class can have a very large number of different values. Moreover, neither cross-validation nor other sampling procedures are suitable for evaluation in data stream contexts, which is why other metrics, like the prequential error, have been proposed. In this paper, we present GNUsmail, an open-source extensible framework for email classification, and we focus on its ability to perform online evaluation. GNUsmail's architecture supports incremental and online learning, and it can be used to compare different data stream mining methods, using state-of-the-art online evaluation metrics. Besides describing the framework, characterized by two overlapping phases, we show how it can be used to compare different algorithms in order to find the most appropriate one. The GNUsmail source code includes a tool for launching replicable experiments.
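The prequential error mentioned in the abstract follows a test-then-train protocol that can be sketched in a few lines; the `predict`/`learn` interface and the `LastLabelModel` baseline below are illustrative assumptions, not GNUsmail's actual API:

```python
def prequential_error(model, stream):
    """Prequential (test-then-train) evaluation: each incoming instance
    is first used to test the current model, then to train it, so no
    holdout set or cross-validation is needed on a stream."""
    errors = 0
    n = 0
    for x, y in stream:
        if model.predict(x) != y:  # test first ...
            errors += 1
        model.learn(x, y)          # ... then train on the same instance
        n += 1
    return errors / n


class LastLabelModel:
    """Toy baseline that predicts the most recently seen label."""
    def __init__(self):
        self.last = 0

    def predict(self, x):
        return self.last

    def learn(self, x, y):
        self.last = y


# One mistake at the 0 -> 1 label switch out of four instances.
stream = [(None, 0), (None, 0), (None, 1), (None, 1)]
print(prequential_error(LastLabelModel(), stream))  # → 0.25
```

Variants with a sliding window or a fading factor weight recent instances more heavily, which matters under concept drift; the plain average above is the simplest form.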

[25]

Streaming Multi-label Classification

Jesse Read, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. Streaming multi-label classification. Journal of Machine Learning Research - Proceedings Track, 17:19-25, 2011.
[ bib | .pdf ]
This paper presents a new experimental framework for studying multi-label evolving stream classification, with efficient methods that combine the best practices in streaming scenarios with the best practices in multi-label classification. Many real world problems involve data which can be considered as multi-label data streams. Efficient methods exist for multi-label classification in non-streaming scenarios. However, learning in evolving streaming scenarios is more challenging, as the learners must be able to adapt to change using limited time and memory. We present new experimental software that extends the MOA framework. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. It is released under the GNU GPL license.

[26]

MOA Concept Drift Active Learning Strategies for Streaming Data

Indre Zliobaite, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. MOA concept drift active learning strategies for streaming data. Journal of Machine Learning Research - Proceedings Track, 17:48-55, 2011.
[ bib | .pdf ]
We present a framework for active learning on evolving data streams, as an extension to the MOA system. In learning to classify streaming data, obtaining the true labels may require major effort and may incur excessive cost. Active learning focuses on learning an accurate model with as few labels as possible. Streaming data poses additional challenges for active learning, since the data distribution may change over time (concept drift) and classifiers need to adapt. Conventional active learning strategies concentrate on querying the most uncertain instances, which are typically concentrated around the decision boundary. If changes do not occur close to the boundary, they will be missed and classifiers will fail to adapt. We propose a software system that implements active learning strategies, extending the MOA framework. This software is released under the GNU GPL license.
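The trade-off the abstract describes, querying uncertain instances near the decision boundary while still catching drift elsewhere, can be illustrated by mixing a fixed-uncertainty rule with occasional random queries; the function name, threshold, and exploration rate below are assumptions for illustration, not the exact strategies implemented in the MOA extension:

```python
import random


def query_label(max_posterior, threshold=0.9, explore_prob=0.05, rng=random):
    """Decide whether to request the true label for an instance.

    Queries when the classifier is uncertain (maximum posterior below
    the threshold), i.e. near the decision boundary, plus occasional
    random queries so that concept drift far from the boundary can
    still be detected within the labeling budget."""
    if max_posterior < threshold:
        return True  # uncertain: always query
    return rng.random() < explore_prob  # confident: query rarely, at random


# Uncertain predictions are always queried; confident ones only rarely.
print(query_label(0.55))  # → True
```

With `explore_prob = 0`, this reduces to the conventional strategy the abstract critiques: changes occurring only among confidently classified instances would never trigger a query, so the classifier could not adapt.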

[27]

Detecting Sentiment Change in Twitter Streaming Data

Albert Bifet, Geoffrey Holmes, Bernhard Pfahringer, and Ricard Gavaldà. Detecting sentiment change in Twitter streaming data. Journal of Machine Learning Research - Proceedings Track, 17:5-11, 2011.
[ bib | .pdf ]
MOA-TweetReader is a system that reads tweets in real time, detects changes, and finds the terms whose frequency has changed. Twitter is a micro-blogging service built to discover what is happening at any moment in time, anywhere in the world. Twitter messages are short, generated constantly, and well suited for knowledge discovery using data stream mining. MOA-TweetReader is a software extension to the MOA framework. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams.