[1]

Clustering mixed data
Lynette Hunt and Murray Jorgensen.
Clustering mixed data.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery, 1(4):352–361, 2011.
[ bib 
http ]
Mixture model clustering proceeds by fitting a finite mixture of multivariate distributions to data, the fitted mixture density then being used to allocate the data to one of the components. Common model formulations assume that either all the attributes are continuous or all the attributes are categorical. In this paper, we consider options for model formulation in the more practical case of mixed data: multivariate data sets that contain both continuous and categorical attributes.


[2]

Using the online cross-entropy method to learn relational
policies for playing different games
Samuel Sarjant, Bernhard Pfahringer, Kurt Driessens, and Tony Smith.
Using the online cross-entropy method to learn relational policies
for playing different games.
In Proceedings IEEE Conference on Computational Intelligence and
Games, pages 182–189. IEEE, 2011.
[ bib ]
By defining a video-game environment as a collection of objects, relations, actions and rewards, the relational reinforcement learning algorithm presented in this paper generates and optimises a set of concise, human-readable relational rules for achieving maximal reward. Rule learning is achieved using a combination of incremental specialisation of rules and a modified online cross-entropy method, which dynamically adjusts the rate of learning as the agent progresses. The algorithm is tested on the Ms. Pac-Man and Mario environments, with results indicating the agent learns an effective policy for acting within each environment.


[3]

Classifier chains for multi-label classification
Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank.
Classifier chains for multi-label classification.
Machine Learning, 85(3):333–359, 2011.
[ bib 
http ]
The widely known binary relevance method for multi-label classification, which considers each label as an independent binary problem, has often been overlooked in the literature due to the perceived inadequacy of not directly modelling label correlations. Most current methods invest considerable complexity to model interdependencies between labels. This paper shows that binary relevance-based methods have much to offer, and that high predictive performance can be obtained without impeding scalability to large datasets. We exemplify this with a novel classifier chains method that can model label correlations while maintaining acceptable computational complexity. We extend this approach further in an ensemble framework. An extensive empirical evaluation covers a broad range of multi-label datasets with a variety of evaluation metrics. The results illustrate the competitiveness of the chaining method against related and state-of-the-art methods, both in terms of predictive performance and time complexity.
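The chaining idea lends itself to a compact sketch. The following is an illustrative reconstruction, not the paper's code: binary model j is trained on the original features augmented with labels 0..j-1, and at prediction time the chain feeds its own earlier predictions forward. The toy 1-nearest-neighbour base learner and the tiny dataset are assumptions for this example; the paper pairs the chain with standard base classifiers.

```python
# Illustrative classifier-chain sketch (not the paper's code).

class OneNN:
    """Toy base classifier: predict the label of the nearest training point."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, x):
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in self.X]
        return self.y[dists.index(min(dists))]

class ClassifierChain:
    def fit(self, X, Y):
        self.models = []
        for j in range(len(Y[0])):
            # Augment each instance with the true values of labels 0..j-1.
            extended = [list(x) + list(row[:j]) for x, row in zip(X, Y)]
            self.models.append(OneNN().fit(extended, [row[j] for row in Y]))
        return self

    def predict(self, x):
        preds = []
        for model in self.models:  # feed predicted labels down the chain
            preds.append(model.predict(list(x) + preds))
        return preds

X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
Y = [[1, 1], [0, 0], [1, 0]]  # two labels per training instance
cc = ClassifierChain().fit(X, Y)
print(cc.predict((0.1, 0.1)))  # -> [1, 1]
```

Because later models see the predictions of earlier ones, label correlations are modelled at the cost of only k binary problems for k labels.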


[4]

Beyond Trees: Adopting MITI to Learn Rules and Ensemble
Classifiers for Multi-Instance Data
Luke Bjerring and Eibe Frank.
Beyond trees: Adopting MITI to learn rules and ensemble classifiers
for multi-instance data.
In Proc 24th Australasian Joint Conference on Artificial
Intelligence, Perth, Australia, pages 41–50. Springer, 2011.
[ bib 
.pdf ]
MITI is a simple and elegant decision tree learner designed for multi-instance classification problems, where examples for learning consist of bags of instances. MITI grows a tree in best-first manner by maintaining a priority queue containing the unexpanded nodes in the fringe of the tree. When the head node contains instances from positive examples only, it is made into a leaf, and any bag of data that is associated with this leaf is removed. In this paper we first revisit the basic algorithm and consider the effect of parameter settings on classification accuracy, using several benchmark datasets. We show that the chosen splitting criterion in particular can have a significant effect on accuracy. We identify a potential weakness of the algorithm—subtrees can contain structure that has been created using data that is subsequently removed—and show that a simple modification turns the algorithm into a rule learner that avoids this problem. This rule learner produces more compact classifiers with comparable accuracy on the benchmark datasets we consider. Finally, we present randomized algorithm variants that enable us to generate ensemble classifiers. We show that these can yield substantially improved classification accuracy.


[5]

Using Output Codes for Two-class Classification Problems
Fanhua Zeng.
Using output codes for two-class classification problems.
Master's thesis, Department of Computer Science, University of
Waikato, 2011.
[ bib 
http ]
Error-correcting output codes (ECOCs) have been widely used in many applications for multi-class classification problems. The problem is that ECOCs cannot be applied directly to two-class datasets. The goal of this thesis is to design and evaluate an approach to solve this problem, and then investigate whether the approach can yield better classification models. To be able to use ECOCs, we first turn two-class datasets into multi-class datasets by using clustering. With the resulting multi-class datasets in hand, we evaluate three different encoding methods for ECOCs: exhaustive coding, random coding and a “predefined” code that is found using random search. The exhaustive coding method has the highest error-correcting ability. However, this method is limited by the exponential growth in the number of bit columns in the codeword matrix, which precludes it from being used for problems with large numbers of classes. Random coding can be used to cover situations with large numbers of classes in the data. To improve on completely random matrices, “predefined” codeword matrices can be generated by a random search that optimizes row separation, yielding better error correction than a purely random matrix. To speed up the process of finding good matrices, GPU parallel programming is investigated in this thesis. From the empirical results, we can say that the new algorithm, which applies multi-class ECOCs to two-class data using clustering, does improve the performance for some base learners, when compared to applying them directly to the original two-class datasets.
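The decoding step shared by all three encodings can be sketched briefly: each binary classifier predicts one bit, and the class whose codeword has minimal Hamming distance to the predicted bit vector wins. The 4-class, 7-bit matrix below is a standard exhaustive-style code chosen purely for illustration, not one from the thesis.

```python
# Minimal Hamming-distance ECOC decoding with an illustrative code matrix.

def hamming(a, b):
    """Number of bit positions in which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(codewords, predicted_bits):
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(codewords, key=lambda c: hamming(codewords[c], predicted_bits))

codewords = {
    0: (0, 0, 0, 0, 0, 0, 0),
    1: (0, 0, 0, 1, 1, 1, 1),
    2: (0, 1, 1, 0, 0, 1, 1),
    3: (1, 0, 1, 0, 1, 0, 1),
}

# Any two rows differ in 4 bits, so a single flipped bit classifier
# still decodes to the correct class:
noisy = (0, 1, 1, 0, 0, 1, 0)  # codeword of class 2 with the last bit flipped
print(ecoc_decode(codewords, noisy))  # -> 2
```

Row separation (the minimum pairwise Hamming distance) determines how many bit-classifier errors can be corrected, which is what the random search over "predefined" matrices optimizes.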


[6]

Smoothing in Probability Estimation Trees
Zhimeng Han.
Smoothing in probability estimation trees.
Master's thesis, Department of Computer Science, University of
Waikato, 2011.
[ bib 
http ]
Classification learning is a type of supervised machine learning technique that uses a classification model (e.g. decision tree) to predict unknown class labels for previously unseen instances. In many applications it can be very useful to additionally obtain class probabilities for the different class labels. Decision trees that yield these probabilities are also called probability estimation trees (PETs). Smoothing is a technique used to improve the probability estimates. There are several existing smoothing methods, such as the Laplace correction, M-Estimate smoothing and M-Branch smoothing. Smoothing does not just apply to PETs. In the field of text compression, PPM in particular, smoothing methods play an important role. This thesis migrates smoothing methods from text compression to PETs. The newly migrated methods in PETs are compared with the best of the existing smoothing methods considered in this thesis under different experimental setups. Unpruned, pruned and bagged trees are considered in the experiments. The main finding is that the PPM-based methods yield the best probability estimates when used with bagged trees, but not when used with individual (pruned or unpruned) trees.
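For reference, the simplest of the smoothing methods mentioned above can be sketched in a few lines: smoothing pulls a leaf's class probabilities away from the extremes 0 and 1. The leaf counts below are invented for illustration, and the PPM-based methods studied in the thesis are considerably more involved.

```python
# Laplace correction and m-estimate smoothing for a PET leaf.

def raw(counts):
    """Unsmoothed relative frequencies at a leaf."""
    n = sum(counts.values())
    return {c: k / n for c, k in counts.items()}

def laplace(counts):
    """Add one virtual instance per class before normalising."""
    n, k_classes = sum(counts.values()), len(counts)
    return {c: (k + 1) / (n + k_classes) for c, k in counts.items()}

def m_estimate(counts, m=2.0):
    """Blend leaf frequencies with a uniform prior, weighted by m."""
    n, k_classes = sum(counts.values()), len(counts)
    return {c: (k + m / k_classes) / (n + m) for c, k in counts.items()}

leaf = {"yes": 3, "no": 0}  # a pure-looking leaf with very few instances
print(raw(leaf))      # {'yes': 1.0, 'no': 0.0}
print(laplace(leaf))  # {'yes': 0.8, 'no': 0.2}
```

With only three instances, the raw estimate of 0 for "no" is overconfident; the smoothed estimates reflect the small sample size.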


[7]

Concept-based text clustering
Lan Huang.
Concept-based text clustering.
PhD thesis, Department of Computer Science, University of Waikato,
2011.
[ bib 
http ]
Thematic organization of text is a natural practice of humans and a crucial task for today's vast repositories. Clustering automates this by assessing the similarity between texts and organizing them accordingly, grouping like ones together and separating those with different topics. Clusters provide a comprehensive logical structure that facilitates exploration, search and interpretation of current texts, as well as organization of future ones.
Automatic clustering is usually based on words. Text is represented by the words it mentions, and thematic similarity is based on the proportion of words that texts have in common. The resulting bag-of-words model is semantically ambiguous and undesirably orthogonal—it ignores the connections between words.
This thesis claims that using concepts as the basis of clustering can significantly improve effectiveness. Concepts are defined as units of knowledge. When organized according to the relations among them, they form a concept system. Two concept systems are used here: WordNet, which focuses on word knowledge, and Wikipedia, which encompasses world knowledge.
We investigate a clustering procedure with three components: using concepts to represent text; taking the semantic relations among them into account during clustering; and learning a text similarity measure from concepts and their relations. First, we demonstrate that concepts provide a succinct and informative representation of the themes in text, exemplifying this with the two concept systems. Second, we define methods for utilizing concept relations to enhance clustering by making the representation models more discriminative and extending thematic similarity beyond surface overlap. Third, we present a similarity measure based on concepts and their relations that is learned from a small number of examples, and show that it both predicts similarity consistently with human judgement and improves clustering. The thesis provides strong support for the use of concept-based representations instead of the classic bag-of-words model.


[8]

Sequence-based protein classification: binary Profile Hidden Markov Models and propositionalisation
Stefan Mutter.
Sequence-based protein classification: binary Profile Hidden
Markov Models and propositionalisation.
PhD thesis, Department of Computer Science, University of Waikato,
2011.
[ bib 
http ]
Detecting similarity in biological sequences is a key element to understanding the mechanisms of life. Researchers infer potential structural, functional or evolutionary relationships from similarity. However, the concept of similarity is complex in biology. Sequences consist of different molecules with different chemical properties, have short and long distance interactions, form 3D structures and change through evolutionary processes. Amino acids are one of the key molecules of life. Most importantly, a sequence of amino acids constitutes the building block for proteins which play an essential role in cellular processes. This thesis investigates similarity amongst proteins. In this area of research there are two important and closely related classification tasks – the detection of similar proteins and the discrimination amongst them. Hidden Markov Models (HMMs) have been successfully applied to the detection task as they model sequence similarity very well. From a Machine Learning point of view these HMMs are essentially one-class classifiers trained solely on a small number of similar proteins neglecting the vast number of dissimilar ones. Our basic assumption is that integrating this neglected information will be highly beneficial to the classification task. Thus, we transform the problem representation from a one-class to a binary one. Equipped with the necessary sound understanding of Machine Learning, especially concerning problem representation and statistically significant evaluation, our work pursues and combines two different avenues on this aforementioned transformation. First, we introduce a binary HMM that discriminates significantly better than the standard one, even when only a fraction of the negative information is used. Second, we interpret the HMM as a structured graph of information. This information cannot be accessed by highly optimised standard Machine Learning classifiers as they expect a fixed length feature vector representation.
Propositionalisation is a technique to transform the former representation into the latter. This thesis introduces new propositionalisation techniques. The change in representation changes the learning problem from a one-class, generative to a propositional, discriminative one. It is a common assumption that discriminative techniques are better suited for classification tasks, and our results validate this assumption. We suggest a new way to significantly improve on discriminative power and runtime by means of terminating the time-intense training of HMMs early, subsequently applying propositionalisation and classifying with a discriminative, binary learner.


[9]

Experiments with Multi-view Multi-instance Learning for Supervised Image Classification
Michael Mayo and Eibe Frank.
Experiments with multi-view multi-instance learning for supervised
image classification.
In Proc 26th International Conference Image and Vision Computing
New Zealand, Auckland, New Zealand, pages 363–369, 2011.
[ bib 
.pdf ]
In this paper we empirically investigate the benefits of multi-view multi-instance (MVMI) learning for supervised image classification. In multi-instance learning, examples for learning contain bags of feature vectors and thus data from different views cannot simply be concatenated as in the single-instance case. Hence, multi-view learning, where one classifier is built per view, is particularly attractive when applying multi-instance learning to image classification. We take several diverse image data sets—ranging from person detection to astronomical object classification to species recognition—and derive a set of multiple instance views from each of them. We then show via an extensive set of 10×10 stratified cross-validation experiments that MVMI, based on averaging predicted confidence scores, generally exceeds the performance of traditional single-view multi-instance learning, when using support vector machines and boosting as the underlying learning algorithms.


[10]

A comparison of methods for estimating prediction intervals in NIR spectroscopy: Size matters
Remco R. Bouckaert, Eibe Frank, Geoffrey Holmes, and Dale Fletcher.
A comparison of methods for estimating prediction intervals in NIR
spectroscopy: Size matters.
Chemometrics and Intelligent Laboratory Systems, 109(2):139–145, 2011.
[ bib 
http ]
In this article we demonstrate that, when evaluating a method for determining prediction intervals, interval size matters more than coverage because the latter can be fixed at a chosen confidence level with good reliability. To achieve the desired coverage, we employ a simple nonparametric method to compute prediction intervals by calibrating estimated prediction errors, and we extend the basic method with a continuum correction to deal with small data sets. In our experiments on a collection of several NIR data sets, we evaluate several existing methods of computing prediction intervals for partial least-squares (PLS) regression. Our results show that, when coverage is fixed at a chosen confidence level, and the number of PLS components is selected to minimize squared error of point estimates, interval estimation based on the classic ordinary least-squares method produces the narrowest intervals, outperforming the U-deviation method and linearization, regardless of the confidence level that is chosen.
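The calibration idea can be sketched as follows, under the assumption that absolute prediction errors on a held-out calibration set are available; the PLS model fitting and the small-sample correction discussed in the article are omitted, and the numbers are invented.

```python
# Sketch: interval half-width = empirical quantile of absolute calibration
# errors, so coverage matches the chosen confidence level.
import math

def calibrated_interval(point_estimate, calibration_errors, confidence=0.9):
    """Prediction interval centred on the point estimate."""
    r = sorted(abs(e) for e in calibration_errors)
    idx = min(len(r) - 1, math.ceil(confidence * len(r)) - 1)
    half_width = r[idx]  # empirical `confidence` quantile of |error|
    return point_estimate - half_width, point_estimate + half_width

errors = [0.1, -0.2, 0.05, 0.3, -0.15, 0.25, -0.1, 0.2, 0.12, -0.08]
lo, hi = calibrated_interval(5.0, errors, confidence=0.9)
print((lo, hi))  # -> (4.75, 5.25)
```

Because coverage is pinned down by construction, competing interval-estimation methods can then be compared purely on the width of the intervals they produce, which is the article's point.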


[11]

Data Mining: Practical Machine Learning Tools and Techniques
Ian H. Witten, Eibe Frank, and Mark A. Hall.
Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann, Burlington, MA, 3rd edition, 2011.
[ bib 
.html ]


[12]

Bagging Ensemble Selection
Quan Sun and Bernhard Pfahringer.
Bagging ensemble selection.
In Australasian Conference on Artificial Intelligence, pages
251–260. Springer, 2011.
[ bib 
http ]
Ensemble selection has recently appeared as a popular ensemble learning method, not only because its implementation is fairly straightforward, but also due to its excellent predictive performance on practical problems. The method has been highlighted in winning solutions of many data mining competitions, such as the Netflix competition, the KDD Cup 2009 and 2010, the UCSD FICO contest 2010, and a number of data mining competitions on the Kaggle platform. In this paper we present a novel variant: bagging ensemble selection. Three variations of the proposed algorithm are compared to the original ensemble selection algorithm and other ensemble algorithms. Experiments with ten real world problems from diverse domains demonstrate the benefit of the bagging ensemble selection algorithm.
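A hedged sketch of the greedy hillclimbing core of ensemble selection, which the paper wraps in a bagging procedure: repeatedly add whichever library model most improves the ensemble's accuracy on a selection set. Representing models by their confidence scores on that set, the score-averaging combination rule, and the toy model library are all assumptions made for this example.

```python
# Greedy ensemble selection over a library of precomputed model scores.

def accuracy(scores, truth):
    """0/1 accuracy after thresholding averaged scores at 0.5."""
    preds = [1 if s >= 0.5 else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def averaged(members):
    """Per-instance mean of the selected members' scores."""
    return [sum(col) / len(col) for col in zip(*members)]

def ensemble_selection(library, truth, rounds=3):
    selected = []
    for _ in range(rounds):
        # Greedily add the model that most improves selection-set accuracy.
        best = max(library,
                   key=lambda name: accuracy(averaged(selected + [library[name]]), truth))
        selected.append(library[best])  # selection with replacement
    return averaged(selected)

truth = [1, 0, 1, 1, 0]
library = {
    "m1": [0.9, 0.2, 0.8, 0.3, 0.1],  # 80% accurate alone
    "m2": [0.6, 0.7, 0.9, 0.8, 0.2],  # 80% accurate alone
    "m3": [0.4, 0.1, 0.3, 0.9, 0.2],  # 60% accurate alone
}
print(accuracy(ensemble_selection(library, truth), truth))  # -> 1.0
```

The bagged variants in the paper run this selection on bootstrap replicates and combine the resulting ensembles, which reduces the hillclimbing's tendency to overfit the selection set.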


[13]

Semi-Random Model Tree Ensembles: An Effective and Scalable
Regression Method
Bernhard Pfahringer.
Semi-random model tree ensembles: An effective and scalable
regression method.
In Australasian Conference on Artificial Intelligence, pages
231–240. Springer, 2011.
[ bib 
http ]
We present and investigate ensembles of semi-random model trees as a novel regression method. Such ensembles combine the scalability of tree-based methods with predictive performance rivalling the state of the art in numeric prediction. An empirical investigation shows that Semi-Random Model Trees produce predictive performance which is competitive with state-of-the-art methods like Gaussian Processes Regression or Additive Groves of Regression Trees. The training and optimization of Semi-Random Model Trees scales better than Gaussian Processes Regression to larger datasets, and enjoys a constant advantage of one to two orders of magnitude over Additive Groves.


[14]

A Model for Predicting the Resolution of Type 2 Diabetes in Severely Obese Subjects Following Roux-en-Y Gastric Bypass Surgery
Mark Thomas Hayes, Lynette Anne Hunt, Jonathan Foo, Yulia Tychinskaya, and
Richard Strawson Stubbs.
A model for predicting the resolution of type 2 diabetes in severely
obese subjects following Roux-en-Y gastric bypass surgery.
Obesity Surgery, 21(7):910–916, 2011.
[ bib 
http ]
Severely obese type 2 diabetics who undergo Roux-en-Y gastric bypass surgery have significant improvements in glycaemic control. Little work has been undertaken to establish the independent predictors of such resolution or to develop a predictive model. The aim of this study was to develop a mathematical model and establish independent predictors for the resolution of diabetes.
METHODS:
A consecutive sample of 130 severely obese type 2 diabetics who underwent gastric bypass surgery for weight loss from November 1997 to May 2007 with prospective preoperative documentation of biochemical and clinical measurements was followed up over 12 months. Logistic discrimination analysis was undertaken to identify those variables with independent predictive value and to develop a predictive model for resolution of type 2 diabetes. Consecutive samples of 130 patients with body mass index (BMI) ≥ 35 with type 2 diabetes were selected. One hundred and twenty-seven patients completed the study with a sufficient data set. Patients were deemed unresolved if (1) diabetic medication was still required after surgery; (2) if fasting plasma glucose (FPG) remained >7 mmol/L; or (3) HbA1c remained >7%.
RESULTS:
Resolution of diabetes was seen in 84%, while diabetes remained but was improved in 16% of patients. Resolution was rapid and sustained with 74% of those on medication before surgery being able to discontinue this by the time of discharge 6 days following surgery. Five preoperative variables were found to have independent predictive value for resolution of diabetes, including BMI, HbA1c, FPG, hypertension and requirement for insulin. Two models have been proposed for prediction of diabetes resolution, each with 86% correct classification in this cohort of patients.
CONCLUSIONS:
Type 2 diabetes resolves in a very high percentage of patients undergoing gastric bypass surgery for severe obesity. The key predictive variables include preoperative BMI, HbA1c, FPG, the presence of hypertension and diabetic status.


[15]

Hybridizing Data Stream Mining and Technical Indicators
in Automated Trading Systems
Michael Mayo.
Hybridizing data stream mining and technical indicators in automated
trading systems.
In Modeling Decision for Artificial Intelligence – 8th
International Conference, MDAI 2011, Changsha, Hunan, China, July 28–30,
2011, Proceedings, pages 79–90. Springer, 2011.
[ bib 
http ]
Automated trading systems for financial markets can use data mining techniques for future price movement prediction. However, classifier accuracy is only one important component in such a system: the other is a decision procedure utilizing the prediction in order to be long, short or out of the market. In this paper, we investigate the use of technical indicators as a means of deciding when to trade in the direction of a classifier’s prediction. We compare this “hybrid” technical/data stream mining-based system with a naive system that always trades in the direction of predicted price movement. We are able to show via evaluations across five financial market datasets that our novel hybrid technique frequently outperforms the naive system. To strengthen our conclusions, we also include in our evaluation several “simple” trading strategies without any data mining component that provide a much stronger baseline for comparison than traditional buy-and-hold or sell-and-hold strategies.


[16]

Modelling epistasis in genetic disease using Petri nets,
evolutionary computation and frequent itemset mining
Michael Mayo and Lorenzo Beretta.
Modelling epistasis in genetic disease using Petri nets, evolutionary
computation and frequent itemset mining.
Expert Systems with Applications, 38(4):4006–4013, 2011.
[ bib 
http ]
Petri nets are useful for mathematically modelling diseasecausing genetic epistasis. A Petri net model of an interaction has the potential to lead to biological insight into the cause of a genetic disease. However, defining a Petri net by hand for a particular interaction is extremely difficult because of the sheer complexity of the problem and degrees of freedom inherent in a Petri net’s architecture.
We propose therefore a novel method, based on evolutionary computation and data mining, for automatically constructing Petri net models of nonlinear gene interactions. The method comprises two main steps. Firstly, an initial partial Petri net is set up with several repeated subnets that model individual genes and a set of constraints, comprising relevant common sense and biological knowledge, is also defined. These constraints characterise the class of Petri nets that are desired. Secondly, this initial Petri net structure and the constraints are used as the input to a genetic algorithm. The genetic algorithm searches for a Petri net architecture that is both a superset of the initial net, and also conforms to all of the given constraints. The genetic algorithm evaluation function that we employ gives equal weighting to both the accuracy of the net and also its parsimony.
We demonstrate our method using an epistatic model related to the presence of digital ulcers in systemic sclerosis patients that was recently reported in the literature. Our results show that although individual “perfect” Petri nets can frequently be discovered for this interaction, the true value of this approach lies in generating many different perfect nets, and applying data mining techniques to them in order to elucidate common and statistically significant patterns of interaction.


[17]

MOA-TweetReader: Real-Time Analysis in Twitter Streaming
Data
Albert Bifet, Geoffrey Holmes, and Bernhard Pfahringer.
MOA-TweetReader: Real-time analysis in Twitter streaming data.
In Discovery Science – 14th International Conference, DS 2011,
Espoo, Finland, October 5–7, 2011. Proceedings, pages 46–60. Springer,
2011.
[ bib 
http ]
Twitter is a microblogging service built to discover what is happening at any moment in time, anywhere in the world. Twitter messages are short, generated constantly, and well suited for knowledge discovery using data stream mining. We introduce MOA-TweetReader, a system for processing tweets in real time. We show two main applications of the new system for studying Twitter data: detecting changes in term frequencies and performing real-time sentiment analysis.


[18]

Active Learning with Evolving Streaming Data
Indre Zliobaite, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes.
Active learning with evolving streaming data.
In Machine Learning and Knowledge Discovery in Databases –
European Conference, ECML PKDD 2011, Athens, Greece, September 5–9, 2011,
Proceedings, Part III, pages 597–612. Springer, 2011.
[ bib 
http ]
In learning to classify streaming data, obtaining the true labels may require major effort and may incur excessive cost. Active learning focuses on learning an accurate model with as few labels as possible. Streaming data poses additional challenges for active learning, since the data distribution may change over time (concept drift) and classifiers need to adapt. Conventional active learning strategies concentrate on querying the most uncertain instances, which are typically concentrated around the decision boundary. If changes do not occur close to the boundary, they will be missed and classifiers will fail to adapt. In this paper we develop two active learning strategies for streaming data that explicitly handle concept drift. They are based on uncertainty, dynamic allocation of labeling efforts over time and randomization of the search space. We empirically demonstrate that these strategies react well to changes that can occur anywhere in the instance space and unexpectedly.
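The combination of uncertainty and randomization described above can be illustrated with a small sketch: a label is requested when the classifier's confidence falls below a threshold that is randomly perturbed, so queries are not confined to the decision boundary. The constants, the Gaussian perturbation, and the omitted budget-driven threshold adaptation are assumptions for this example rather than the paper's exact formulation.

```python
# Sketch: randomized-uncertainty querying for a stream of confidence scores.
import random

def should_query(confidence, threshold, rng, spread=0.5):
    """Perturb the threshold so confident regions are occasionally queried."""
    randomized = threshold * (1.0 + rng.gauss(0.0, spread))
    return confidence < randomized

rng = random.Random(42)
stream_confidences = [0.95, 0.6, 0.99, 0.55, 0.8]
decisions = [should_query(c, threshold=0.7, rng=rng) for c in stream_confidences]
print(decisions)  # which instances get labelled varies with the random draws
```

With spread set to zero this degrades to plain uncertainty sampling, which is exactly the strategy the paper shows can miss drift occurring far from the boundary.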


[19]

MOA: A Real-Time Analytics Open Source Framework
Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Jesse Read, Philipp Kranen,
Hardy Kremer, Timm Jansen, and Thomas Seidl.
MOA: A real-time analytics open source framework.
In Machine Learning and Knowledge Discovery in Databases –
European Conference, ECML PKDD 2011, Athens, Greece, September 5–9, 2011,
Proceedings, Part III, pages 617–620. Springer, 2011.
[ bib 
http ]
Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA is designed to deal with the challenging problems of scaling up the implementation of state-of-the-art algorithms to real-world dataset sizes and of making algorithms comparable in benchmark streaming settings. It contains a collection of offline and online algorithms for classification, clustering and graph mining as well as tools for evaluation. For researchers the framework yields insights into advantages and disadvantages of different approaches and allows for the creation of benchmark streaming data sets through stored, shared and repeatable settings for the data feeds. Practitioners can use the framework to easily compare algorithms and apply them to real-world data sets and settings. MOA supports bidirectional interaction with WEKA, the Waikato Environment for Knowledge Analysis. Besides providing algorithms and measures for evaluation and comparison, MOA is easily extensible with new contributions and allows for the creation of benchmark scenarios.


[20]

Online Evaluation of Email Streaming Classifiers Using GNUsmail
José M. Carmona-Cejudo, Manuel Baena-García, José del
Campo-Ávila, Albert Bifet, João Gama, and Rafael Morales Bueno.
Online evaluation of email streaming classifiers using GNUsmail.
In Advances in Intelligent Data Analysis X – 10th International
Symposium, IDA 2011, Porto, Portugal, October 29–31, 2011. Proceedings,
pages 90–100. Springer, 2011.
[ bib 
http ]
Real-time email classification is a challenging task because of its online nature, which is subject to concept drift. Identifying spam, where only two labels exist, has received great attention in the literature. We are nevertheless interested in classification involving multiple folders, which is an additional source of complexity. Moreover, neither cross-validation nor other sampling procedures are suitable for data stream evaluation. Therefore, other metrics, like the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using mechanisms such as fading factors. In this paper we present GNUsmail, an open-source extensible framework for email classification, and focus on its ability to perform online evaluation. GNUsmail’s architecture supports incremental and online learning, and it can be used to compare different online mining methods, using state-of-the-art evaluation metrics. We show how GNUsmail can be used to compare different algorithms, and it includes a tool for launching replicable experiments.
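For context, the prequential error with a fading factor has a simple recursive form: discounted cumulative loss divided by a discounted example count. This sketch, with an invented 0/1 loss stream, shows how fading lets the estimate track recent performance instead of being dominated by old errors.

```python
# Faded prequential error: S_i = loss_i + alpha*S_{i-1}, N_i = 1 + alpha*N_{i-1},
# error_i = S_i / N_i. With alpha = 1 this is the plain test-then-train error.

def prequential_error(losses, alpha=0.99):
    s = n = 0.0
    errors = []
    for loss in losses:
        s = loss + alpha * s   # discounted cumulative loss
        n = 1.0 + alpha * n    # discounted example count
        errors.append(s / n)
    return errors

# 0/1 losses from a learner that stops erring halfway through the stream.
losses = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(prequential_error(losses, alpha=0.9)[-1])  # about 0.37, versus 0.5 unfaded
```

The faded estimate reflects the recent error-free stretch, whereas the unfaded prequential error would still report the lifetime average of 0.5 after a drift.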


[21]

Mining frequent closed graphs on evolving data streams
Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Ricard Gavaldà.
Mining frequent closed graphs on evolving data streams.
In Proceedings of the 17th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21–24,
2011, pages 591–599. ACM, 2011.
[ bib 
http ]
Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this paper we present a framework for studying graph pattern mining on time-varying streams. Three new methods for mining frequent closed subgraphs are presented. All methods work on coresets of closed subgraphs, compressed representations of graph sets, and maintain these sets in a batch-incremental manner, but use different approaches to address potential concept drift. An evaluation study on datasets comprising up to four million graphs explores the strengths and limitations of the proposed methods. To the best of our knowledge this is the first work on mining frequent closed subgraphs in non-stationary data streams.


[22]

An effective evaluation measure for clustering on evolving
data streams
Hardy Kremer, Philipp Kranen, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff
Holmes, and Bernhard Pfahringer.
An effective evaluation measure for clustering on evolving data
streams.
In Proceedings of the 17th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21–24,
2011, pages 868–876. ACM, 2011.
[ bib 
http ]
Due to the ever-growing presence of data streams, there has been a considerable amount of research on stream mining algorithms. While many algorithms have been introduced that tackle the problem of clustering on evolving data streams, hardly any attention has been paid to appropriate evaluation measures. Measures developed for static scenarios, namely structural measures and ground-truth-based measures, cannot correctly reflect errors attributable to emerging, splitting, or moving clusters. These situations are inherent to the streaming context due to the dynamic changes in the data distribution. In this paper we develop a novel evaluation measure for stream clustering called Cluster Mapping Measure (CMM). CMM effectively indicates different types of errors by taking the important properties of evolving data streams into account. We show in extensive experiments on real and synthetic data that CMM is a robust measure for stream clustering evaluation.


[23]

Mining frequent closed trees in evolving data streams
Albert Bifet and Ricard Gavaldà.
Mining frequent closed trees in evolving data streams.
Intelligent Data Analysis, 15(1):29–48, 2011.
[ bib 
http ]
We propose new algorithms for adaptively mining closed rooted trees, both labeled and unlabeled, from data streams that change over time. Closed patterns are powerful representatives of frequent patterns, since they eliminate redundant information. Our approach is based on an advantageous representation of trees and a low-complexity notion of relaxed closed trees, as well as ideas from Galois Lattice Theory. More precisely, we present three closed tree mining algorithms in sequence: an incremental one, IncTreeMiner, a sliding-window-based one, WinTreeMiner, and finally one that mines closed trees adaptively from data streams, AdaTreeMiner. By adaptive we mean here that it presents at all times the closed trees that are frequent in the current state of the data stream. To the best of our knowledge this is the first work on mining closed frequent trees in streaming data varying with time. We give a first experimental evaluation of the proposed algorithms.


[24]

Using GNUsmail to Compare Data Stream Mining Methods for
Online Email Classification
José M. Carmona-Cejudo, Manuel Baena-García, Rafael Morales Bueno,
João Gama, and Albert Bifet.
Using GNUsmail to compare data stream mining methods for online
email classification.
Journal of Machine Learning Research - Proceedings Track,
17:12-18, 2011.
[ bib 
.pdf ]
Real-time classification of emails is a challenging task because of its online nature, and also because email streams are subject to concept drift. Identifying email spam, where only two different labels or classes are defined (spam or not spam), has received great attention in the literature. We are nevertheless interested in a more specific classification where multiple folders exist, which is an additional source of complexity: the class can have a very large number of different values. Moreover, neither cross-validation nor other sampling procedures are suitable for evaluation in data stream contexts, which is why other metrics, like the prequential error, have been proposed. In this paper, we present GNUsmail, an open-source extensible framework for email classification, and we focus on its ability to perform online evaluation. GNUsmail's architecture supports incremental and online learning, and it can be used to compare different data stream mining methods, using state-of-the-art online evaluation metrics. Besides describing the framework, characterized by two overlapping phases, we show how it can be used to compare different algorithms in order to find the most appropriate one. The GNUsmail source code includes a tool for launching replicable experiments.
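The prequential (test-then-train) error mentioned in this abstract is computed by scoring each incoming example before the model trains on it. A minimal sketch follows; the class name and the fading-factor variant (which discounts old errors so the estimate tracks concept drift) are illustrative, not GNUsmail's actual implementation:

```python
class PrequentialError:
    """Accumulates test-then-train loss over a stream.

    With fading_factor < 1, older losses are discounted, so the
    estimate follows recent performance under concept drift.
    """

    def __init__(self, fading_factor=1.0):
        self.alpha = fading_factor
        self.loss_sum = 0.0   # (faded) sum of per-example losses
        self.count = 0.0      # (faded) number of examples seen

    def update(self, loss):
        # Fade the previous statistics, then add the current loss.
        self.loss_sum = self.alpha * self.loss_sum + loss
        self.count = self.alpha * self.count + 1.0
        return self.error()

    def error(self):
        return self.loss_sum / self.count if self.count else 0.0


# Usage: 0/1 loss per example, as in online classification.
pe = PrequentialError(fading_factor=0.99)
for predicted, actual in [(1, 1), (0, 1), (1, 1), (1, 1)]:
    pe.update(0.0 if predicted == actual else 1.0)
```

With `fading_factor=1.0` this reduces to the plain running mean of the loss, which is the basic prequential estimate.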


[25]

Streaming Multi-label Classification
Jesse Read, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer.
Streaming multi-label classification.
Journal of Machine Learning Research - Proceedings Track,
17:19-25, 2011.
[ bib 
.pdf ]
This paper presents a new experimental framework for studying multi-label evolving stream classification, with efficient methods that combine the best practices in streaming scenarios with the best practices in multi-label classification. Many real-world problems involve data which can be considered as multi-label data streams. Efficient methods exist for multi-label classification in non-streaming scenarios. However, learning in evolving streaming scenarios is more challenging, as the learners must be able to adapt to change using limited time and memory. We present new experimental software that extends the MOA framework. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. It is released under the GNU GPL license.


[26]

MOA Concept Drift Active Learning Strategies for Streaming
Data
Indre Zliobaite, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer.
MOA concept drift active learning strategies for streaming data.
Journal of Machine Learning Research - Proceedings Track,
17:48-55, 2011.
[ bib 
.pdf ]
We present a framework for active learning on evolving data streams, as an extension to the MOA system. In learning to classify streaming data, obtaining the true labels may require major effort and may incur excessive cost. Active learning focuses on learning an accurate model with as few labels as possible. Streaming data poses additional challenges for active learning, since the data distribution may change over time (concept drift) and classifiers need to adapt. Conventional active learning strategies concentrate on querying the most uncertain instances, which are typically concentrated around the decision boundary. If changes do not occur close to the boundary, they will be missed and classifiers will fail to adapt. We propose a software system that implements active learning strategies, extending the MOA framework. This software is released under the GNU GPL license.
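The uncertainty-plus-budget idea described in this abstract can be sketched as a simple querying rule; the class name, threshold, and randomisation scheme below are illustrative assumptions, not the actual MOA strategies:

```python
import random

class UncertaintyWithBudget:
    """Query a label when the classifier is uncertain, subject to a
    labelling budget (the fraction of stream instances we may label).

    Hypothetical simplification: jittering the threshold occasionally
    queries instances away from the decision boundary, so drift
    occurring far from the boundary is not missed entirely.
    """

    def __init__(self, budget=0.1, threshold=0.8, randomise=0.0, rng=None):
        self.budget = budget          # max fraction of labels to request
        self.threshold = threshold    # query if max posterior is below this
        self.randomise = randomise    # relative jitter applied to threshold
        self.rng = rng or random.Random(42)
        self.seen = 0
        self.queried = 0

    def query(self, max_posterior):
        self.seen += 1
        if self.queried >= self.budget * self.seen:
            return False              # budget currently exhausted
        t = self.threshold
        if self.randomise:
            t *= 1.0 + self.rng.uniform(-self.randomise, self.randomise)
        if max_posterior < t:
            self.queried += 1
            return True
        return False


# Simulated stream of max-posterior confidences from a classifier.
strategy = UncertaintyWithBudget(budget=0.5, threshold=0.8)
decisions = [strategy.query(p) for p in [0.55, 0.95, 0.6, 0.99]]
```

Tracking the budget against the number of instances seen, rather than as a fixed quota, lets the strategy keep querying indefinitely on an unbounded stream.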


[27]

Detecting Sentiment Change in Twitter Streaming Data
Albert Bifet, Geoffrey Holmes, Bernhard Pfahringer, and Ricard Gavaldà.
Detecting sentiment change in Twitter streaming data.
Journal of Machine Learning Research - Proceedings Track,
17:5-11, 2011.
[ bib 
.pdf ]
MOA-TweetReader is a system that reads tweets in real time, detects changes, and finds the terms whose frequency has changed. Twitter is a microblogging service built to discover what is happening at any moment in time, anywhere in the world. Twitter messages are short, generated constantly, and well suited for knowledge discovery using data stream mining. MOA-TweetReader is a software extension to the MOA framework. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams.
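A minimal two-window sketch of finding terms whose frequency changed: compare each term's relative frequency in a current window against a reference window and flag large ratios. This is an illustrative simplification (function name, smoothing, and thresholds are invented here), not the streaming change detectors used by MOA-TweetReader:

```python
from collections import Counter

def changed_terms(reference, current, min_count=5, ratio=2.0):
    """Return terms whose relative frequency in `current` differs
    from `reference` by at least `ratio` in either direction,
    mapped to the observed frequency ratio."""
    ref_total = sum(reference.values()) or 1
    cur_total = sum(current.values()) or 1
    changes = {}
    for term in set(reference) | set(current):
        if reference[term] + current[term] < min_count:
            continue  # ignore rare terms
        ref_freq = reference[term] / ref_total
        cur_freq = current[term] / cur_total
        # Small smoothing constant avoids division by zero for
        # terms that appear in only one of the two windows.
        r = (cur_freq + 1e-9) / (ref_freq + 1e-9)
        if r >= ratio or r <= 1.0 / ratio:
            changes[term] = r
    return changes


# Example: compare a reference hour of tweets against the current hour.
reference = Counter({"goal": 30, "rain": 30, "news": 30, "quake": 1})
current = Counter({"goal": 30, "rain": 30, "news": 10, "quake": 21})
spikes = changed_terms(reference, current)
```

Here `quake` is flagged as spiking (ratio well above 1) and `news` as fading (ratio below 1), while the stable terms pass unflagged.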

