Efficient online evaluation of big data stream classifiers

Albert Bifet, Gianmarco de Francisci Morales, Jesse Read, Geoff Holmes, and Bernhard Pfahringer. Efficient online evaluation of big data stream classifiers. In Proc 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 59-68. ACM, 2015.
[ bib | http ]
The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.


From unlabelled tweets to Twitter-specific opinion words

Felipe Bravo-Marquez, Eibe Frank, and Bernhard Pfahringer. From unlabelled tweets to twitter-specific opinion words. In Proc 38th Annual ACM SIGIR Conference, Santiago de Chile, Chile. ACM Press, 2015.
[ bib | .pdf ]
In this article, we propose a word-level classification model for automatically generating a Twitter-specific opinion lexicon from a corpus of unlabelled tweets. The tweets from the corpus are represented by two vectors: a bag-of-words vector and a semantic vector based on word-clusters. We propose a distributional representation for words by treating them as the centroids of the tweet vectors in which they appear. The lexicon generation is conducted by training a word-level classifier using these centroids to form the instance space and a seed lexicon to label the training instances. Experimental results show that the two types of tweet vectors complement each other in a statistically significant manner and that our generated lexicon produces significant improvements for tweet-level polarity classification.


Positive, Negative, or Neutral: Learning an Expanded Opinion Lexicon from Emoticon-annotated Tweets

Felipe Bravo-Marquez, Eibe Frank, and Bernhard Pfahringer. Positive, negative, or neutral: Learning an expanded opinion lexicon from emoticon-annotated tweets. In Proc 23rd International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, pages 1229-1235. IJCAI/AAAI, 2015.
[ bib | .pdf ]
We present a supervised framework for expanding an opinion lexicon for tweets. The lexicon contains part-of-speech (POS) disambiguated entries with a three-dimensional probability distribution for pos- itive, negative, and neutral polarities. To obtain this distribution using machine learning, we propose word-level attributes based on POS tags and information calculated from streams of emoticon- annotated tweets. Our experimental results show that our method outperforms the three-dimensional word-level polarity classification performance obtained by semantic orientation, a state-of-the-art measure for establishing world-level sentiment.


Alternating Model Trees

Eibe Frank, Michael Mayo, and Stefan Kramer. Alternating model trees. In Proc 30th ACM Symposium on Applied Computing, Data Mining Track, pages 871-878. ACM Press, 2015.
[ bib | .pdf ]
Model tree induction is a popular method for tackling regression problems requiring interpretable models. Model trees are decision trees with multiple linear regression models at the leaf nodes. In this paper, we propose a method for growing alternating model trees, a form of option tree for regression problems. The motivation is that alternating decision trees achieve high accuracy in classification problems because they represent an ensemble classifier as a single tree structure. As in alternating decision trees for classification, our alternating model trees for regression contain splitter and prediction nodes, but we use simple linear regression functions as opposed to constant predictors at the prediction nodes. Moreover, additive regression using forward stagewise modeling is applied to grow the tree rather than a boosting algorithm. The size of the tree is determined using cross-validation. Our empirical results show that alternating model trees achieve significantly lower squared error than standard model trees on several regression datasets.


Missing data imputation and its effect on the accuracy of classification

Lynette Hunt. Missing data imputation and its effect on the accuracy of classification. In Conference of the International Federation of Classification Societies, Book of Abstracts, pages 105-105. ifcs, 2015.
[ bib ]
Multivariate data sets frequently have missing observations scattered throughout the data set. These missing values can have no particular pattern of occurrence. Several methods have been proposed to address missing data values including imputation, likelihood and weighting approaches. Many machine learning algorithms assume that there is no particular significance in the fact that a particular observation has an attribute value missing. A common approach in coping with these missing values is to replace the missing value using some plausible value, and the resulting completed data set is analysed using standard methods. We evaluate the effect that some commonly used imputation methods have on the accuracy of classifiers in supervised leaning. The effect is assessed in simulations performed on several classical datasets where observations have been made missing at random in different proportions. Our analysis finds that missing data imputation using hot deck, iterative robust model based imputation (IRMI) and factorial analysis for mixed data (FAMD) perform in a similar manner regardless of the amount of missing data and have the highest mean percentage of observations correctly classified. Other methods investigated did not perform as well.


Toward Large-Scale Continuous EDA: A Random Matrix Theory Perspective

Ata Kabán, Jakramate Bootkrajang, and Robert J. Durrant. Toward large-scale continuous eda: A random matrix theory perspective. Evolutionary Computation, 2015.
[ bib | http ]
Estimations of distribution algorithms (EDAs) are a major branch of evolutionary algorithms (EA) with some unique advantages in principle. They are able to take advantage of correlation structure to drive the search more efficiently, and they are able to provide insights about the structure of the search space. However, model building in high dimensions is extremely challenging, and as a result existing EDAs may become less attractive in large-scale problems because of the associated large computational requirements. Large-scale continuous global optimisation is key to many modern-day real-world problems. Scaling up EAs to large-scale problems has become one of the biggest challenges of the field. This paper pins down some fundamental roots of the problem and makes a start at developing a new and generic framework to yield effective and efficient EDA-type algorithms for large-scale continuous global optimisation problems. Our concept is to introduce an ensemble of random projections to low dimensions of the set of fittest search points as a basis for developing a new and generic divide-and-conquer methodology. Our ideas are rooted in the theory of random projections developed in theoretical computer science, and in developing and analysing our framework we exploit some recent results in nonasymptotic random matrix theory.


Scalable Text Mining with Sparse Generative Models

Antti Puurula. Scalable Text Mining with Sparse Generative Models. PhD thesis, Department of Computer Science, University of Waikato, 2015.
[ bib | http ]
The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: generative models combined with sparse computation. A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models, but ignored parallel developments. This framework allows the use of methods developed in different processing tasks such as retrieval and classification, yielding effective solutions across different text mining tasks. Sparse computation using inverted indices is proposed for inference on probabilistic models. This reduces the computational complexity of the common text mining operations according to sparsity, yielding probabilistic models with the scalability of modern search engines. The proposed combination provides sparse generative models: a solution for text mining that is general, effective, and scalable. Extensive experimentation on text classification and ranked retrieval datasets are conducted, showing that the proposed solution matches or outperforms the leading task-specific methods in effectiveness, with a order of magnitude decrease in classification times for Wikipedia article categorization with a million classes. The developed methods were further applied in two 2014 Kaggle data mining prize competitions with over a hundred competing teams, earning first and second places.


Cranioplasty outcomes and associated complications: a single centre observational study

Ee Shern Liang, Geoffrey Tipper, Lyn Hunt, and Peter Yee Chiung Gan. Cranioplasty outcomes and associated complications: a single centre observational study. British Journal of Neurosurgery, 2015.
[ bib | http ]
The resurgence of decompressive craniectomy has led to recent published reviews of the safety of cranioplasties. To date there is a wide range of reported mortality and morbidity. This observational study reports the outcomes of the cranioplasty operations from a single centre and evaluates the factors involved in their management. A retrospective search of all theatre logs was performed for the years 2006-2013 inclusive. 88 operations were documented as 'Cranioplasty'. Data collection include patient demographics, type of cranioplasty used, time lapse between decompression and cranioplasty, seniority of the operating surgeon(s), antibiotic regimen and complications. Outcomes were recorded at the three-month follow-up. The overall complication rate was 6.8%. The mean patient age was 36.2 years. 52.2% of patients had decompressive craniectomy for trauma, 11.3% had infectious pathology, 9% had subarachnoid haemorrhage, 9% had tumour with bone infiltration and 3.4% had stroke. 55.7% of patients had cranioplasty within 6 months of craniectomy. 61.3% of cranioplasties were with autologous bone, 20.4% titanium, 10.2% acrylic and 7.9% polyetheretherketone (PEEK). Significant complications included one case of infection, two cases of subgaleal haematoma and one extradural collection. No deaths were noted. No correlation was found between infection and the use of drains. 68.6% of cases were done by either a senior surgeon or a supervised registrar. There was an observable difference in complication rates in relation to the seniority and experience of the operator. However, patient numbers and complications were insufficient to achieve statistical significance. Strict antimicrobial prescribing was observed. Some potentially preventable complications have been addressed with a resulting rate of complications lower than other published reports. We use two standard adjuncts: the presence of a senior surgeon and strict antimicrobial regimens. We believe that our results could be transferrable to other units by following similar guidelines.


An adaptive model-based mutation operator for the wind farm layout optimisation problem

Michael Mayo and Maisa Daoud. An adaptive model-based mutation operator for the wind farm layout optimisation problem. In IEEE International Conference on System, Man and Cybernetics (IEEE SMC2015). IEEE, 2015.
[ bib | http ]
A novel mutation operator for the wind farm layout optimisation problem is proposed and tested. When a wind farm layout is simulated, statistics such as an individual turbine’s wake free ratio can be computed. These statistics are in addition to the global measure being optimised, for example the overall cost of energy extraction of the farm. We present algorithms that first of all build a predictive model of the wake free ratio across an entire wind farm. This model is then used inside a mutation operator to perturb turbines towards positions of high predicted wake free ratio. We evaluate our approach by comparing a 1+1 Evolutionary Strategy using this new mutation operator vs. the same algorithm with a more standard random mutation operator, and show that our new operator leads to the discovery of wind farm layouts having a statistically significantly lower cost of energy extraction.


AI 2015: Advances in Artificial Intelligence: 28th Australasian Joint Conference Canberra, ACT, Australia, November 30 – December 4, 2015 Proceedings

AI 2015: Advances in Artificial Intelligence: 28th Australasian Joint Conference Canberra, ACT, Australia, November 30 – December 4, 2015 Proceedings, 2015.
[ bib | http ]
This book constitutes the refereed proceedings of the 28th Australasian Joint Conference on Artificial Intelligence, AI 2015, held in Canberra, Australia, in November/December 2015. The 39 full papers and 18 short papers presented were carefully reviewed and selected from 102 submissions.


Big data with ADAMS

Peter Reutemann and Geoff Holmes. Big data with adams. In Proc 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, volume 41, pages 5-8. JMLR Workshop and Conference Proceedings, 2015.
[ bib | .pdf ]
ADAMS is a modular open-source Java framework for developing workflows available for academic research as well as commercial applications. It integrates data mining applications, like MOA, WEKA, MEKA and R, image and video processing and feature generation capabilities, spreadsheet and database access, visualizations, GIS, webservices and fast pro-toyping of new functionality using scripting languages (Groovy/Jython).


Accurate photometric redshift probability density estimation - method comparison and application

Markus Michael Rau, Stella Seitz, Fabrice Brimioulle, Eibe Frank, Oliver Friedrich, Daniel Gruen, and Ben Hoyle. Accurate photometric redshift probability density estimation - method comparison and application. Monthly Notices of the Royal Astronomical Society, 452(4):3710-3725, 2015.
[ bib | http ]
We introduce an ordinal classification algorithm for photometric redshift estimation, which significantly improves the reconstruction of photometric redshift probability density functions (PDFs) for individual galaxies and galaxy samples. As a use case we apply our method to CFHTLS galaxies. The ordinal classification algorithm treats distinct redshift bins as ordered values, which improves the quality of photometric redshift PDFs, compared with non-ordinal classification architectures. We also propose a new single value point estimate of the galaxy redshift, which can be used to estimate the full redshift PDF of a galaxy sample. This method is competitive in terms of accuracy with contemporary algorithms, which stack the full redshift PDFs of all galaxies in the sample, but requires orders of magnitude less storage space. The methods described in this paper greatly improve the log-likelihood of individual object redshift PDFs, when compared with a popular neural network code (annz). In our use case, this improvement reaches 50 per cent for high-redshift objects (z ≥ 0.75). We show that using these more accurate photometric redshift PDFs will lead to a reduction in the systematic biases by up to a factor of 4, when compared with less accurate PDFs obtained from commonly used methods. The cosmological analyses we examine and find improvement upon are the following: gravitational lensing cluster mass estimates, modelling of angular correlation functions and modelling of cosmic shear correlation functions.


Use of ensembles of Fourier spectra in capturing recurrent concepts in data streams

Sripirakas Sakthithasan, Russel Pears, Albert Bifet, and Bernhard Pfahringer. Use of ensembles of fourier spectra in capturing recurrent concepts in data streams. In Proc 2015 International Joint Conference on Neural Networks, pages 1-8. IEEE, 2015.
[ bib | .pdf ]
In this research, we apply ensembles of Fourier encoded spectra to capture and mine recurring concepts in a data stream environment. Previous research showed that compact versions of Decision Trees can be obtained by applying the Discrete Fourier Transform to accurately capture recurrent concepts in a data stream. However, in highly volatile environments where new concepts emerge often, the approach of encoding each concept in a separate spectrum is no longer viable due to memory overload and thus in this research we present an ensemble approach that addresses this problem. Our empirical results on real world data and synthetic data exhibiting varying degrees of recurrence reveal that the ensemble approach outperforms the single spectrum approach in terms of classification accuracy, memory and execution time.


Resampling strategies for regression

Luis Torgo, Paula Branco, Rita P. Ribeiro, and Bernhard Pfahringer. Resampling strategies for regression. Expert Systems, 32(3):465-476, 2015.
[ bib | http ]
Several real world prediction problems involve forecasting rare values of a target variable. When this variable is nominal, we have a problem of class imbalance that was thoroughly studied within machine learning. For regression tasks, where the target variable is continuous, few works exist addressing this type of problem. Still, important applications involve forecasting rare extreme values of a continuous target variable. This paper describes a contribution to this type of tasks. Namely, we propose to address such tasks by resampling approaches that change the distribution of the given data set to decrease the problem of imbalance between the rare target cases and the most frequent ones. We present two modifications of well-known resampling strategies for classification tasks: the under-sampling and the synthetic minority over-sampling technique (SMOTE) methods. These modifications allow the use of these strategies on regression tasks where the goal is to forecast rare extreme values of the target variable. In an extensive set of experiments, we provide empirical evidence for the superiority of our proposals for these particular regression tasks. The proposed resampling methods can be used with any existing regression algorithm, which means that they are general tools for addressing problems of forecasting rare extreme values of a continuous target variable.


Having a BLast: meta-learning and heterogeneous ensembles for data streams

Jan N. van Rijn, Geoffrey Holmes, Bernhard Pfahringer, and Joaquin Vanschoren. Having a blast: meta-learning and heterogeneous ensembles for data streams. In Proc IEEE International Conference on Data Mining (ICDM 2015). IEEE, 2015.
[ bib | http ]
Ensembles of classifiers are among the best performing classifiers available in many data mining applications. However, most ensembles developed specifically for the dynamic data stream setting rely on only one type of base-level classifier, most often Hoeffding Trees. In this paper, we study the use of heterogeneous ensembles, comprised of fundamentally different model types. Heterogeneous ensembles have proven successful in the classical batch data setting, however they do not easily transfer to the data stream setting. We therefore introduce the Online Performance Estimation framework, which can be used in data stream ensembles to weight the votes of (heterogeneous) ensemble members differently across the stream. Experiments over a wide range of data streams show performance that is competitive with state of the art ensemble techniques, including Online Bagging and Leveraging Bagging. All experimental results from this work are easily reproducible and publicly available on OpenML for further analysis.


Case study on bagging stable classifiers for data streams

Jan N. van Rijn, Geoffrey Holmes, Bernhard Pfahringer, and Joaquin Vanschoren. Case study on bagging stable classifiers for data streams. In Proc 24th Belgian-Dutch Conference on Machine Learning (BENELEARN 2015), 2016.
[ bib | .pdf ]
Ensembles of classifiers are among the strongest classi-fiers in most data mining applications. Bagging ensembles exploit the instability of base-classifiers by training them on different bootstrap replicates. It has been shown that Bagging instable classifiers, such as decision trees, yield generally good results, whereas bagging stable classifiers, such as k-NN, makes little difference. However, recent work suggests that this cognition applies to the classical batch data mining setting rather than the data stream setting. We present an empirical study that supports this observation.


Ian H. Witten: A stroll through the gardens of computer science

Ian H. Witten. Ian H. Witten: A stroll through the gardens of computer science, chapter 9, pages 127-138. Imperial College Press, 2016.
[ bib | http ]
Ian Witten, http://www.cs.waikato.ac.nz/ ihw/, is a Professor in the Department of Computer Science at the University of Waikato in New Zealand. His many research interests include information retrieval, machine learning, text compression and programming by demonstration. As head of the New Zealand Digital Library Research Group http://www.nzdl.org/cgi-bin/library.cgi, Professor Witten oversaw the development of Greenstone Digital Library software, used by the BBC, New York Botanical Gardens and UNESCO. He has written several books, the latest being Data Mining (2000), How to Build a Digital Library (2002) and Web Dragons: Inside the Myths of Search Engine Technology (2007). His Birthday Book, 2007 (http://www.nzdl.org/Books/Birthday/index.html) includes contributions from collaborators, friends and students, from all around the world. Professor Witten is a Fellow of the ACM and the Royal Society of New Zealand. He was awarded the IFIP Namur Prize (2004) for “contributions to the awareness of social implications of information technology… and the need for a holistic approach in the use of information technology that takes account of social implications”, and won the 2010 World Class New Zealand Award. In 2012 he was appointed Chief Scientific Advisor at Pingar http://www.pingar.com to provide advice on developing technologies and solutions to augment enterprise ability to manage unstructured data.


Random projections as regularizers: learning a linear discriminant from fewer observations than dimensions

Robert J. Durrant and Ata Kabán. Random projections as regularizers: learning a linear discriminant from fewer observations than dimensions. Machine Learning, 99(2):257-286, 2015.
[ bib | http ]
We prove theoretical guarantees for an averaging-ensemble of randomly projected Fisher linear discriminant classifiers, focusing on the case when there are fewer training observations than data dimensions. The specific form and simplicity of this ensemble permits a direct and much more detailed analysis than existing generic tools in previous works. In particular, we are able to derive the exact form of the generalization error of our ensemble, conditional on the training set, and based on this we give theoretical guarantees which directly link the performance of the ensemble to that of the corresponding linear discriminant learned in the full data space. To the best of our knowledge these are the first theoretical results to prove such an explicit link for any classifier and classifier ensemble pair. Furthermore we show that the randomly projected ensemble is equivalent to implementing a sophisticated regularization scheme to the linear discriminant learned in the original data space and this prevents overfitting in conditions of small sample size where pseudo-inverse FLD learned in the data space is provably poor. Our ensemble is learned from a set of randomly projected representations of the original high dimensional data and therefore for this approach data can be collected, stored and processed in such a compressed form. We confirm our theoretical findings with experiments, and demonstrate the utility of our approach on several datasets from the bioinformatics domain and one very high dimensional dataset from the drug discovery domain, both settings in which fewer observations than dimensions are the norm.


Evaluation methods and decision theory for classification of streaming data with temporal dependence

Indrė Žliobaitė, Albert Bifet, Jesse Read, Bernhard Pfahringer, and Geoff Holmes. Evaluation methods and decision theory for classification of streaming data with temporal dependence. Machine Learning, 98(3):455-482, 2015.
[ bib | http ]
Predictive modeling on data streams plays an important role in modern data analysis, where data arrives continuously and needs to be mined in real time. In the stream setting the data distribution is often evolving over time, and models that update themselves during operation are becoming the state-of-the-art. This paper formalizes a learning and evaluation scheme of such predictive models. We theoretically analyze evaluation of classifiers on streaming data with temporal dependence. Our findings suggest that the commonly accepted data stream classification measures, such as classification accuracy and Kappa statistic, fail to diagnose cases of poor performance when temporal dependence is present, therefore they should not be used as sole performance indicators. Moreover, classification accuracy can be misleading if used as a proxy for evaluating change detectors with datasets that have temporal dependence. We formulate the decision theory for streaming data classification with temporal dependence and develop a new evaluation methodology for data stream classification that takes temporal dependence into account. We propose a combined measure for classification performance, that takes into account temporal dependence, and we recommend using it as the main performance measure in classification of streaming data.