Scientific workflow management with ADAMS: building and data mining a database of crop protection and related data

Peter Reutemann and Geoffrey Holmes. Scientific workflow management with ADAMS: building and data mining a database of crop protection and related data. In Beresford RM, Froud KJ, Kean JM, and Worner SP, editors, The plant protection data tool box, pages 167-174, 2014.
[ bib | .pdf ]
Data mining is said to be a field that encourages data to speak for itself rather than “forcing” data to conform to a pre-specified model, but we have to acknowledge that what is spoken by the data may well be gibberish. To obtain meaning from data it is important to use techniques systematically, to follow sound experimental procedure and to examine results expertly. This paper presents a framework for scientific discovery from data with two examples from the biological sciences. The first case is a re-investigation of previously published work on aphid trap data to predict aphid phenology and the second is a commercial application for identifying and counting insects captured on sticky plates in greenhouses. Using support vector machines rather than neural networks or linear regression gives better results in the case of the aphid trap data. For both cases, we use the open source machine learning workbench WEKA for predictive modelling and the open source ADAMS workflow system for automating data collection, preparation, feature generation, application of predictive models and output generation.


Fully Supervised Training of Gaussian Radial Basis Function Networks in WEKA

Eibe Frank. Fully supervised training of Gaussian radial basis function networks in WEKA. Technical Report 04/14, Department of Computer Science, University of Waikato, 2014.
[ bib | .pdf ]
Radial basis function networks are a type of feedforward network with a long history in machine learning. In spite of this, there is relatively little literature on how to train them so that accurate predictions are obtained. A common strategy is to train the hidden layer of the network using k-means clustering and the output layer using supervised learning. However, Wettschereck and Dietterich [2] found that supervised training of hidden layer parameters can improve predictive performance. They investigated learning center locations, local variances of the basis functions, and attribute weights, in a supervised manner. This document discusses supervised training of Gaussian radial basis function networks in the WEKA machine learning software. More specifically, we discuss the RBFClassifier and RBFRegressor classes available as part of the RBFNetwork package for WEKA 3.7 and consider (a) learning of center locations and one global variance parameter, (b) learning of center locations and one local variance parameter per basis function, and (c) learning center locations with per-attribute local variance parameters. We also consider learning attribute weights jointly with other parameters.
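As a rough illustration of the model this abstract discusses, the following minimal sketch computes the output of a Gaussian radial basis function network with one global variance parameter, which corresponds to variant (a). It shows only the forward pass; in fully supervised training the centers, the variance parameter, and the output weights would all be fitted to the labeled data. The function names and example values are hypothetical and are not the API of the RBFClassifier or RBFRegressor classes.

```python
import math

def gaussian_rbf(x, center, sigma):
    """Activation of one Gaussian basis function with variance sigma**2."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def rbf_network_output(x, centers, sigma, weights, bias=0.0):
    """Linear output layer applied to the hidden-unit activations."""
    hidden = [gaussian_rbf(x, c, sigma) for c in centers]
    return bias + sum(w * h for w, h in zip(weights, hidden))

# Hypothetical two-unit network; all values are illustrative assumptions.
centers = [[0.0, 0.0], [1.0, 1.0]]
weights = [2.0, -1.0]
print(rbf_network_output([0.0, 0.0], centers, sigma=1.0, weights=weights))
```

Variants (b) and (c) would replace the single `sigma` with one value per basis function or one value per basis function and attribute, respectively.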


DNA methylation-associated colonic mucosal immune and defense responses in treatment-naïve pediatric ulcerative colitis

R Alan Harris, Dorottya Nagy-Szakal, Sabina AV Mir, Eibe Frank, Reka Szigeti, Jess L Kaplan, Jiri Bronsky, Antone Opekun, George D Ferry, Harland Winter, and Richard Kellermayer. DNA methylation-associated colonic mucosal immune and defense responses in treatment-naïve pediatric ulcerative colitis. Epigenetics, 9(8):1131-1137, 2014.
[ bib | http ]
Inflammatory bowel diseases (IBD) are emerging globally, indicating that environmental factors may be important in their pathogenesis. Colonic mucosal epigenetic changes, such as DNA methylation, can occur in response to the environment and have been implicated in IBD pathology. However, mucosal DNA methylation has not been examined in treatment-naive patients. We studied DNA methylation in untreated, left sided colonic biopsy specimens using the Infinium HumanMethylation450 BeadChip array. We analyzed 22 control (C) patients, 15 untreated Crohn’s disease (CD) patients, and 9 untreated ulcerative colitis (UC) patients from two cohorts. Samples obtained at the time of clinical remission from two of the treatment-naive UC patients were also included in the analysis. UC-specific gene expression was interrogated in a subset of adjacent samples (5 C and 5 UC) using the Affymetrix GeneChip PrimeView Human Gene Expression arrays. Only treatment-naive UC separated from control. One hundred and twenty genes with significant expression change in UC (> 2-fold, P < 0.05) were associated with differentially methylated regions (DMRs). Epigenetically associated gene expression changes (including gene expression changes in the IFITM1, ITGB2, S100A9, SLPI, SAA1, and STAT3 genes) were linked to colonic mucosal immune and defense responses. These findings underscore the relationship between epigenetic changes and inflammation in pediatric treatment-naive UC and may have potential etiologic, diagnostic, and therapeutic relevance for IBD.


Change detection in categorical evolving data streams

Dino Ienco, Albert Bifet, Bernhard Pfahringer, and Pascal Poncelet. Change detection in categorical evolving data streams. In Proc 29th Annual ACM Symposium on Applied Computing, pages 792-797. ACM, 2014.
[ bib ]
Detecting change in evolving data streams is a central issue for accurate adaptive learning. In real world applications, data streams have categorical features, and changes induced in the data distribution of these categorical features have not been considered extensively so far. Previous work on change detection focused on detecting changes in the accuracy of the learners, but without considering changes in the data distribution. To cope with these issues, we propose a new unsupervised change detection method, called CDCStream (Change Detection in Categorical Data Streams), well suited for categorical data streams. The proposed method is able to detect changes in a batch incremental scenario. It is based on the two following characteristics: (i) a summarization strategy is proposed to compress the actual batch by extracting a descriptive summary and (ii) a new segmentation algorithm is proposed to highlight changes and issue warnings for a data stream. To evaluate our proposal we employ it in a learning task over real world data and we compare its results with state of the art methods. We also report qualitative evaluation in order to show the behavior of CDCStream.


Détection de changements dans des flots de données qualitatives

Dino Ienco, Albert Bifet, Bernhard Pfahringer, and Pascal Poncelet. Détection de changements dans des flots de données qualitatives. In Proc 14èmes Journées Francophones Extraction et Gestion des Connaissances, pages 517-520. Hermann-Éditions, 2014.
[ bib ]
In recent years, specific approaches have been proposed to better analyze and extract knowledge from data streams. One of the challenges they face is detecting change in the data. Although more and more categorical data is being generated, little research has addressed change detection in this setting, and existing work has focused mainly on the quality of a learned model rather than on the actual change in the data. In this article we propose a new unsupervised change detection method, called CDCStream (Change Detection in Categorical Data Streams), suited to categorical data streams.


Evolving artificial datasets to improve interpretable classifiers

Michael Mayo and Quan Sun. Evolving artificial datasets to improve interpretable classifiers. In Proc 2014 IEEE Congress on Evolutionary Computation, pages 2367-2374. IEEE, 2014.
[ bib ]
Differential Evolution can be used to construct effective and compact artificial training datasets for machine learning algorithms. In this paper, a series of comparative experiments are performed in which two simple interpretable supervised classifiers (specifically, Naive Bayes and linear Support Vector Machines) are trained (i) directly on “real” data, as would be the normal case, and (ii) indirectly, using special artificial datasets derived from real data via evolutionary optimization. The results across several challenging test problems show that supervised classifiers trained indirectly using our novel evolution-based approach produce models with superior predictive classification performance. Besides presenting the accuracy of the learned models, we also analyze the sensitivity of our artificial data optimization process to Differential Evolution's parameters, and then we examine the statistical characteristics of the artificial data that is evolved.


Kaggle LSHTC4 Winning Solution

Antti Puurula, Jesse Read, and Albert Bifet. Kaggle LSHTC4 winning solution. CoRR, abs/1405.0546, 2014.
[ bib | http ]
Our winning submission to the 2014 Kaggle competition for Large Scale Hierarchical Text Classification (LSHTC) consists mostly of an ensemble of sparse generative models extending Multinomial Naive Bayes. The base-classifiers consist of hierarchically smoothed models combining document, label, and hierarchy level Multinomials, with feature pre-processing using variants of TF-IDF and BM25. Additional diversification is introduced by different types of folds and random search optimization for different measures. The ensemble algorithm optimizes macroFscore by predicting the documents for each label, instead of the usual prediction of labels per document. Scores for documents are predicted by weighted voting of base-classifier outputs with a variant of Feature-Weighted Linear Stacking. The number of documents per label is chosen using label priors and thresholding of vote scores. This document describes the models and software used to build our solution. Reproducing the results for our solution can be done by running the scripts included in the Kaggle package. A package omitting precomputed result files is also distributed. All code is open source, released under GNU GPL 2.0, and GPL 3.0 for Weka and Meka dependencies.


Multi-label Classification with Meta-Labels

Jesse Read, Antti Puurula, and Albert Bifet. Multi-label classification with meta-labels. In Proc 2014 IEEE International Conference on Data Mining, pages 941-946. IEEE, 2014.
[ bib ]
The area of multi-label classification has rapidly developed in recent years. It has become widely known that the baseline binary relevance approach can easily be outperformed by methods which learn labels together. A number of methods have grown around the label powerset approach, which models label combinations together as class values in a multi-class problem. We describe the label-powerset-based solutions under a general framework of meta-labels and provide some theoretical justification for this framework which has been lacking; explaining how meta-labels essentially allow a random projection into a space where non-linearities can easily be tackled with established linear learning algorithms. The proposed framework enables comparison and combination of related approaches to different multi-label problems. We present a novel model in the framework and evaluate it empirically against several high-performing methods, with respect to predictive performance and scalability, on a number of datasets and evaluation metrics. This deployment obtains competitive accuracy for a fraction of the computation required by the current meta-label methods for multi-label classification.
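The label powerset approach mentioned in this abstract can be sketched in a few lines: each distinct combination of labels is treated as a single class value, so any multi-class learner can then be applied. This is only an illustration of the basic transformation, not the meta-label model proposed in the paper; the function name and the example labels are hypothetical.

```python
def label_powerset_transform(label_sets):
    """Map each distinct label combination to one multi-class value."""
    class_of = {}   # frozenset of labels -> integer class id
    classes = []    # class id for every training example, in order
    for labels in label_sets:
        key = frozenset(labels)
        if key not in class_of:
            class_of[key] = len(class_of)
        classes.append(class_of[key])
    return classes, class_of

# Illustrative label sets; the second and third are the same combination.
y = [{"sports"}, {"sports", "politics"}, {"politics", "sports"}, set()]
classes, mapping = label_powerset_transform(y)
print(classes)
```

The number of classes grows with the number of distinct label combinations seen in training, which is one reason methods in this family work with smaller subsets of labels.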


Crowd-Sourcing Ontology Content and Curation: The Massive Ontology Interface

Samuel Sarjant, Catherine Legg, Matt Stannett, and Duncan Willcock. Crowd-sourcing ontology content and curation: The massive ontology interface. In Proc 8th International Conference on Formal Ontology in Information Systems, pages 251-260. IOS Press, 2014.
[ bib | http ]
Crowd-sourcing is an increasingly popular approach to building large, complex public-interest projects. The ontology infrastructure that is required to scaffold the goals of the Semantic Web is such a project. We have been thinking hard about what ‘crowd-sourced ontology’ might look like, and are currently advancing on two fronts: user-supplied content and user-supplied curation. We achieve the former by mining 90% of the concepts and relations in our ontology from Wikipedia. However other research groups are also pursuing this strategy (e.g. DBpedia, YAGO). Our claim to be on the cutting edge is in our latter goal. We are building a web portal: The Massive Ontology Interface, for users to interact with our ontology in a clean, syntax-light format. The interface is designed to enable users to identify errors and add new concepts and assertions, and to discuss the knowledge in the open-ended way that fosters real collaboration in Wikipedia. We here present our system, discuss the design decisions that have shaped it and the motivation we offer users to interact with it.


Meta-learning and the full model selection problem

Quan Sun. Meta-learning and the full model selection problem. PhD thesis, Department of Computer Science, University of Waikato, 2014.
[ bib | http ]
When working as a data analyst, one of my daily tasks is to select appropriate tools from a set of existing data analysis techniques in my toolbox, including data preprocessing, outlier detection, feature selection, learning algorithm and evaluation techniques, for a given data project. This indeed was an enjoyable job at the beginning, because to me finding patterns and valuable information from data is always fun. Things become tricky when several projects needed to be done in a relatively short time. Naturally, as a computer science graduate, I started to ask myself, What can be automated here?; because, intuitively, part of my work is more or less a loop that can be programmed. Literally, the loop is choose, run, test and choose again... until some criterion/goals are met. In other words, I use my experience or knowledge about machine learning and data mining to guide and speed up the process of selecting and applying techniques in order to build a relatively good predictive model for a given dataset for some purpose. So the following questions arise: Is it possible to design and implement a system that helps a data analyst to choose from a set of data mining tools? Or at least that provides a useful recommendation about tools that potentially save some time for a human analyst. To answer these questions, I decided to undertake a long-term study on this topic, to think, define, research, and simulate this problem before coding my dream system. This thesis presents research results, including new methods, algorithms, and theoretical and empirical analysis from two directions, both of which try to propose systematic and efficient solutions to the questions above, using different resource requirements, namely, the meta-learning-based algorithm/parameter ranking approach and the meta-heuristic search-based full-model selection approach. 
Some of the results have been published in research papers; thus, this thesis also serves as a coherent collection of results in a single volume.


Hierarchical meta-rules for scalable meta-learning

Quan Sun and Bernhard Pfahringer. Hierarchical meta-rules for scalable meta-learning. In Duc-Nghia Pham and Seong-Bae Park, editors, Proc 13th Pacific Rim International Conference on Artificial intelligence, pages 383-395. Springer, 2014.
[ bib ]
The Pairwise Meta-Rules (PMR) method proposed in [18] has been shown to improve the predictive performances of several meta-learning algorithms for the algorithm ranking problem. Given m target objects (e.g., algorithms), the training complexity of the PMR method with respect to m is quadratic: $\binom{m}{2} = m(m-1)/2$. This is usually not a problem when m is moderate, such as when ranking 20 different learning algorithms. However, for problems with a much larger m, such as the meta-learning-based parameter ranking problem, where m can be 100+, the PMR method is less efficient. In this paper, we propose a novel method named Hierarchical Meta-Rules (HMR), which is based on the theory of orthogonal contrasts. The proposed HMR method has a linear training complexity with respect to m, providing a way of dealing with a large number of objects that the PMR method cannot handle efficiently. Our experimental results demonstrate the benefit of the new method in the context of meta-learning.
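To make the quadratic growth concrete, the short sketch below counts the unordered pairs of target objects for the two sizes mentioned in the abstract. The function name is hypothetical, and the sketch only evaluates the pair count; it does not implement PMR or HMR.

```python
import math

def pairwise_count(m):
    """Number of unordered pairs among m objects: m * (m - 1) / 2."""
    return math.comb(m, 2)

for m in (20, 100):
    print(m, pairwise_count(m))  # prints "20 190" and then "100 4950"
```

Going from 20 to 100 objects multiplies the number of pairs by about 26, whereas a method that is linear in m would grow by a factor of 5.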


Improvements to BM25 and Language Models Examined

Andrew Trotman, Antti Puurula, and Blake Burgess. Improvements to BM25 and language models examined. In Proc 2014 Australasian Document Computing Symposium, page 58. ACM, 2014.
[ bib ]
Recent work on search engine ranking functions reports improvements on BM25 and Language Models with Dirichlet Smoothing. In this investigation 9 recent ranking functions (BM25, BM25+, BM25T, BM25-adpt, BM25L, TFlID, LM-DS, LM-PYP, and LM-PYP-TFIDF) are compared by training on the INEX 2009 Wikipedia collection and testing on INEX 2010 and 9 TREC collections. We find that once trained (using particle swarm optimization) there is very little difference in performance between these functions, that relevance feedback is effective, that stemming is effective, and that it remains unclear which function is best overall.
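For orientation, here is a minimal sketch of one common formulation of the baseline BM25 function, with the two parameters (`k1` and `b`) that are typically tuned when such functions are trained. The IDF form used here is one of several in circulation and is an assumption, as are the function name, the default parameter values, and the toy inputs; the variants compared in the paper differ in their details.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document for a query with a common BM25 formulation."""
    doc_len = len(doc_terms)
    # Document-length normalization factor shared by all query terms.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)   # term frequency in this document
        df = doc_freq.get(term, 0)   # number of documents containing term
        if tf == 0 or df == 0:
            continue
        # Non-negative IDF variant (an assumption of this sketch).
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1.0) / (tf + norm)
    return score

# Toy collection statistics, purely illustrative.
doc = ["stream", "mining", "stream"]
doc_freq = {"stream": 1, "mining": 2}
print(bm25_score(["stream"], doc, doc_freq, n_docs=3, avg_doc_len=3.0))
```

Setting `b=0.0` switches off document-length normalization, and larger `k1` values let repeated terms contribute more before the score saturates.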


WISE 2014 Challenge: Multi-label Classification of Print Media Articles to Topics

Grigorios Tsoumakas, Apostolos Papadopoulos, Weining Qian, Stavros Vologiannidis, Alexander D'yakonov, Antti Puurula, Jesse Read, Jan Svec, and Stanislav Semenov. WISE 2014 challenge: Multi-label classification of print media articles to topics. In Proc 15th International Conference on Web Information Systems Engineering, pages 541-548, 2014.
[ bib | http ]
The WISE 2014 challenge was concerned with the task of multi-label classification of articles coming from Greek print media. Raw data comes from the scanning of print media, article segmentation, and optical character segmentation, and therefore is quite noisy. Each article is examined by a human annotator and categorized to one or more of the topics being monitored. Topics range from specific persons, products, and companies that can be easily categorized based on keywords, to more general semantic concepts, such as environment or economy. Building multi-label classifiers for the automated annotation of articles into topics can support the work of human annotators by suggesting a list of all topics by order of relevance, or even automate the annotation process for media and/or categories that are easier to predict. This saves valuable time and allows a media monitoring company to expand the portfolio of media being monitored. This paper summarizes the approaches of the top 4 among the 121 teams that participated in the competition.


Algorithm selection on data streams

Jan N van Rijn, Geoffrey Holmes, Bernhard Pfahringer, and Joaquin Vanschoren. Algorithm selection on data streams. In Proc 17th International Conference on Discovery Science, pages 325-336. Springer, 2014.
[ bib ]
We explore the possibilities of meta-learning on data streams, in particular algorithm selection. In a first experiment we calculate the characteristics of a small sample of a data stream, and try to predict which classifier performs best on the entire stream. This yields promising results and interesting patterns. In a second experiment, we build a meta-classifier that predicts, based on measurable data characteristics in a window of the data stream, the best classifier for the next window. The results show that this meta-algorithm is very competitive with state of the art ensembles, such as OzaBag, OzaBoost and Leveraged Bagging. The results of all experiments are made publicly available in an online experiment database, for the purpose of verifiability, reproducibility and generalizability.


Towards meta-learning over data streams

Jan N van Rijn, Geoffrey Holmes, Bernhard Pfahringer, and Joaquin Vanschoren. Towards meta-learning over data streams. In Proc International Workshop on Meta-learning and Algorithm Selection, volume Vol-1201, pages 37-38. ceur-ws.org, 2014.
[ bib | .pdf ]
Modern society produces vast streams of data. Many stream mining algorithms have been developed to capture general trends in these streams, and make predictions for future observations, but relatively little is known about which algorithms perform particularly well on which kinds of data. Moreover, it is possible that the characteristics of the data change over time, and thus that a different algorithm should be recommended at various points in time. Figure 1 illustrates this. As such, we are dealing with the Algorithm Selection Problem [9] in a data stream setting. Based on measurable meta-features from a window of observations from a data stream, a meta-algorithm is built that predicts the best classifier for the next window. Our results show that this meta-algorithm is competitive with state-of-the-art data streaming ensembles, such as OzaBag [6], OzaBoost [6] and Leveraged Bagging [3].


Active learning with drifting streaming data

Indrė Žliobaitė, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes. Active learning with drifting streaming data. IEEE Transactions on Neural Networks and Learning Systems, 25(1):27-39, 2014.
[ bib ]
In learning to classify streaming data, obtaining true labels may require major effort and may incur excessive cost. Active learning focuses on carefully selecting as few labeled instances as possible for learning an accurate predictive model. Streaming data poses additional challenges for active learning, since the data distribution may change over time (concept drift) and models need to adapt. Conventional active learning strategies concentrate on querying the most uncertain instances, which are typically concentrated around the decision boundary. Changes occurring further from the boundary may be missed, and models may fail to adapt. This paper presents a theoretically supported framework for active learning from drifting data streams and develops three active learning strategies for streaming data that explicitly handle concept drift. They are based on uncertainty, dynamic allocation of labeling efforts over time, and randomization of the search space. We empirically demonstrate that these strategies react well to changes that can occur anywhere in the instance space and unexpectedly.
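The contrast drawn in this abstract, between querying only uncertain instances and also sampling elsewhere in the instance space, can be illustrated with a simplified decision rule. This sketch is not one of the three strategies developed in the paper: the fixed threshold, the random query rate, and the function name are all assumptions chosen for illustration.

```python
import random

def should_query(posteriors, threshold=0.6, random_rate=0.1, rng=random):
    """Decide whether to request the true label of one streaming instance.

    posteriors: class probabilities predicted by the current model.
    """
    # Occasionally query regardless of confidence, so that changes far
    # from the decision boundary still have a chance of being noticed.
    if rng.random() < random_rate:
        return True
    # Otherwise query only when the model is uncertain.
    return max(posteriors) < threshold

print(should_query([0.55, 0.45], random_rate=0.0))  # uncertain: True
print(should_query([0.95, 0.05], random_rate=0.0))  # confident: False
```

With `random_rate=0.0` the rule reduces to plain uncertainty sampling, which never spends labels on confidently classified regions and can therefore miss drift that occurs there.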


Meta-level sentiment models for big social data analysis

Felipe Bravo-Marquez, Marcelo Mendoza, and Barbara Poblete. Meta-level sentiment models for big social data analysis. Knowl.-Based Syst., 69:86-99, 2014.
[ bib | http | .pdf ]
People react to events, topics and entities by expressing their personal opinions and emotions. These reactions can correspond to a wide range of intensities, from very mild to strong. An adequate processing and understanding of these expressions has been the subject of research in several fields, such as business and politics. In this context, Twitter sentiment analysis, which is the task of automatically identifying and extracting subjective information from tweets, has received increasing attention from the Web mining community. Twitter provides an extremely valuable insight into human opinions, as well as new challenging Big Data problems. These problems include the processing of massive volumes of streaming data, as well as the automatic identification of human expressiveness within short text messages. In that area, several methods and lexical resources have been proposed in order to extract sentiment indicators from natural language texts at both syntactic and semantic levels. These approaches address different dimensions of opinions, such as subjectivity, polarity, intensity and emotion. This article is the first study of how these resources, which are focused on different sentiment scopes, complement each other. With this purpose we identify scenarios in which some of these resources are more useful than others. Furthermore, we propose a novel approach for sentiment classification based on meta-level features. This supervised approach boosts existing sentiment classification of subjectivity and polarity detection on Twitter. Our results show that the combination of meta-level features provides significant improvements in performance. However, we observe that there are important differences that rely on the type of lexical resource, the dataset used to build the model, and the learning strategy. Experimental results indicate that manually generated lexicons are focused on emotional words, being very useful for polarity prediction. 
On the other hand, lexicons generated with automatic methods include neutral words, introducing noise in the detection of subjectivity. Our findings indicate that polarity and subjectivity prediction are different dimensions of the same problem, but they need to be addressed using different subspace features. Lexicon-based approaches are recommendable for polarity, and stylistic part-of-speech based approaches are meaningful for subjectivity. With this research we offer a more global insight of the resource components for the complex task of classifying human emotion and opinion.


A novel deterministic approach for aspect-based opinion mining in tourism products reviews

Edison Marrese-Taylor, Juan D. Velásquez, and Felipe Bravo-Marquez. A novel deterministic approach for aspect-based opinion mining in tourism products reviews. Expert Syst. Appl., 41(17):7764-7775, 2014.
[ bib | http | .pdf ]
This work proposes an extension of Bing Liu’s aspect-based opinion mining approach in order to apply it to the tourism domain. The extension addresses the fact that users refer differently to different kinds of products when writing reviews on the Web. Since Liu’s approach is focused on physical product reviews, it could not be directly applied to the tourism domain, which presents features that are not considered by the model. Through a detailed study of on-line tourism product reviews, we identified these features and then modeled them in our extension, proposing the use of new and more complex NLP-based rules for the tasks of subjective and sentiment classification at the aspect level. We also address the task of opinion visualization and summarization and propose new methods to help users digest the vast availability of opinions in an easy manner. Our work also included the development of a generic architecture for an aspect-based opinion mining tool, which we then used to create a prototype and analyze opinions from TripAdvisor in the context of the tourism industry in Los Lagos, a Chilean administrative region also known as the Lake District. Results show that our extension is able to perform better than Liu’s model in the tourism domain, improving both Accuracy and Recall for the tasks of subjective and sentiment classification. In particular, the approach is very effective in determining the sentiment orientation of opinions, achieving an F-measure of 92% for the task. However, on average, the algorithms were only capable of extracting 35% of the explicit aspect expressions, using a non-extended approach for this task. Finally, results also showed the effectiveness of our design when applied to solving the industry’s specific issues in the Lake District, since almost 80% of the users that used our tool considered that it adds valuable information to their business.