2009

[1]

The Positive Effects of Negative Information: Extending One-Class Classification Models in Binary Proteomic Sequence Classification

Stefan Mutter, Bernhard Pfahringer, and Geoffrey Holmes. The positive effects of negative information: Extending one-class classification models in binary proteomic sequence classification. In Proc 22nd Australasian Conference on Artificial Intelligence, Melbourne, Australia, pages 260-269. Springer, 2009.
[ bib | http ]
Profile Hidden Markov Models (PHMMs) have been widely used as models for Multiple Sequence Alignments. By their nature, they are generative one-class classifiers trained only on sequences belonging to the target class they represent. Nevertheless, they are often used to discriminate between classes. In this paper, we investigate the beneficial effects of information from non-target classes in discriminative tasks. Firstly, the traditional PHMM is extended to a new binary classifier. Secondly, we propose propositional representations of the original PHMM that capture information from target and non-target sequences and can be used with standard binary classifiers. Since PHMM training is time intensive, we investigate whether our approach allows the training of the PHMM to be stopped before it has fully converged, without loss of predictive power.

[2]

Relational Random Forests Based on Random Relational Rules

Grant Anderson and Bernhard Pfahringer. Relational random forests based on random relational rules. In Proc 21st International Joint Conference on Artificial Intelligence, Pasadena, California, pages 986-991. AAAI Press, 2009.
[ bib | .pdf ]
Random Forests have been shown to perform very well in propositional learning. FORF is an upgrade of Random Forests for relational data. In this paper we investigate shortcomings of FORF and propose an alternative algorithm, RF, for generating Random Forests over relational data. RF employs randomly generated relational rules as fully self-contained Boolean tests inside each node in a tree and thus can be viewed as an instance of dynamic propositionalization. The implementation of RF allows for the simultaneous or parallel growth of all the branches of all the trees in the ensemble in an efficient shared, but still single-threaded way. Experiments favorably compare RF to both FORF and the combination of static propositionalization together with standard Random Forests. Various strategies for tree initialization and splitting of nodes, as well as resulting ensemble size, diversity, and computational complexity of RF are also investigated.

[3]

The WEKA data mining software: an update

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10-18, 2009.
[ bib | .pdf ]
More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been re-written entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.

[4]

3D Face Recognition Using Multiview Keypoint Matching

Michael Mayo and Edmond Zhang. 3d face recognition using multiview keypoint matching. In Proc 6th International Conference on Advanced Video and Signal Based Surveillance, Genova, Italy, pages 290-295. IEEE Computer Society, 2009.
[ bib | http ]
A novel algorithm for 3D face recognition based on point cloud rotations, multiple projections, and voted keypoint matching is proposed and evaluated. The basic idea is to rotate each 3D point cloud representing an individual's face around the x, y or z axes, iteratively projecting the 3D points onto multiple 2.5D images at each step of the rotation. Labelled keypoints are then extracted from the resulting collection of 2.5D images, and this much smaller set of keypoints replaces the original face scan and its projections in the face database. Unknown test faces are recognised firstly by performing the same multiview keypoint extraction technique, and secondly by applying a new weighted keypoint matching algorithm. In an extensive evaluation using the GavabDB 3D face recognition dataset (61 subjects, 9 scans per subject), our method achieves up to 95% recognition accuracy for faces with neutral expressions only, and over 90% accuracy for face recognition where expressions (such as a smile or a strong laugh) and random face-occluding gestures are permitted.
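
To make the projection step concrete, the following sketch (an illustration only, with assumed details such as array shapes and image resolution, not the authors' implementation) rotates a point cloud about the y axis and orthographically projects it onto a 2.5D depth map; the face scan is assumed to be an N x 3 NumPy array.

    import numpy as np

    def rotate_y(points, angle_deg):
        """Rotate an (N, 3) point cloud about the y axis by angle_deg degrees."""
        a = np.radians(angle_deg)
        r = np.array([[ np.cos(a), 0.0, np.sin(a)],
                      [ 0.0,       1.0, 0.0      ],
                      [-np.sin(a), 0.0, np.cos(a)]])
        return points @ r.T

    def project_to_depth_map(points, resolution=128):
        """Orthographic projection onto the x-y plane, keeping the nearest z per pixel (a 2.5D image)."""
        xy = points[:, :2]
        mins, maxs = xy.min(axis=0), xy.max(axis=0)
        scale = (resolution - 1) / np.maximum(maxs - mins, 1e-9)
        cols, rows = ((xy - mins) * scale).astype(int).T
        depth = np.full((resolution, resolution), -np.inf)
        for r_, c_, z in zip(rows, cols, points[:, 2]):
            depth[r_, c_] = max(depth[r_, c_], z)   # keep the point closest to the viewer
        return depth

    # Example: nine projections at 20-degree steps around the y axis (random points stand in for a real scan).
    cloud = np.random.rand(5000, 3)
    views = [project_to_depth_map(rotate_y(cloud, a)) for a in range(-80, 81, 20)]

Keypoint extraction (e.g. SIFT) and the weighted matching described in the abstract would then operate on these depth images.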

[5]

SIFTing the Relevant from the Irrelevant: Automatically Detecting Objects in Training Images

Edmond Zhang and Michael Mayo. Sifting the relevant from the irrelevant: Automatically detecting objects in training images. In Proc Digital Image Computing: Techniques and Applications, Melbourne, Australia, pages 317-324. IEEE Computer Society, 2009.
[ bib | http ]
Many state-of-the-art object recognition systems rely on identifying the location of objects in images, in order to better learn their visual attributes. In this paper, we propose four simple yet powerful hybrid ROI detection methods (combining both local and global features), based on frequently occurring keypoints. We show that our methods demonstrate competitive performance on two different types of datasets, the Caltech101 dataset and the GRAZ-02 dataset, where the pairs of keypoints bounding box method achieved the best accuracies overall.

[6]

New ensemble methods for evolving data streams

Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavaldà. New ensemble methods for evolving data streams. In KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 139-148, New York, NY, USA, 2009. ACM.
[ bib | .pdf ]
Advanced analysis of data streams is quickly becoming a key area of data mining research as the number of applications demanding such processing increases. Online mining when such data streams evolve over time, that is when concepts drift or change completely, is becoming one of the core issues. When tackling non-stationary concepts, ensembles of classifiers have several advantages over single classifier methods: they are easy to scale and parallelize, they can adapt to change quickly by pruning under-performing parts of the ensemble, and they therefore usually also generate more accurate concept descriptions. This paper proposes a new experimental data stream framework for studying concept drift, and two new variants of Bagging: ADWIN Bagging and Adaptive-Size Hoeffding Tree (ASHT) Bagging. Using the new experimental framework, an evaluation study on synthetic and real-world datasets comprising up to ten million examples shows that the new ensemble methods perform very well compared to several known methods.
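
The general shape of the approach can be sketched as follows (a rough illustration only, not the MOA implementation evaluated in the paper): online bagging replaces bootstrap sampling with Poisson(1) example weighting, and each ensemble member carries its own change detector. Here a naive sliding-window error check stands in for ADWIN, and an incremental scikit-learn learner stands in for a Hoeffding tree.

    import numpy as np
    from collections import deque
    from sklearn.linear_model import SGDClassifier

    class OnlineBaggingWithReset:
        """Online bagging plus a naive per-member drift check (illustrative only)."""
        def __init__(self, n_members=10, classes=(0, 1), window=500, seed=1):
            self.rng = np.random.default_rng(seed)
            self.classes = np.array(classes)
            self.members = [SGDClassifier() for _ in range(n_members)]
            self.errors = [deque(maxlen=window) for _ in range(n_members)]

        def learn_one(self, x, y):
            x = np.asarray(x, dtype=float).reshape(1, -1)
            for i, m in enumerate(self.members):
                try:  # record the member's prequential error before training on this example
                    self.errors[i].append(int(m.predict(x)[0] != y))
                except Exception:  # member not fitted yet
                    pass
                # Poisson(1) weight simulates how often this example would appear in a bootstrap sample
                for _ in range(self.rng.poisson(1.0)):
                    m.partial_fit(x, [y], classes=self.classes)
                # crude stand-in for ADWIN: reset a member whose recent error is very high
                if len(self.errors[i]) == self.errors[i].maxlen and np.mean(self.errors[i]) > 0.5:
                    self.members[i] = SGDClassifier()
                    self.errors[i].clear()

        def predict_one(self, x):
            x = np.asarray(x, dtype=float).reshape(1, -1)
            votes = [m.predict(x)[0] for m in self.members if hasattr(m, "coef_")]
            if not votes:
                return self.classes[0]
            values, counts = np.unique(votes, return_counts=True)
            return values[np.argmax(counts)]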

[7]

Improving Adaptive Bagging Methods for Evolving Data Streams

Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Ricard Gavaldà. Improving adaptive bagging methods for evolving data streams. In Proceedings of the 1st Asian Conference on Machine Learning, Nanjing, China. Springer, 2009.
[ bib | .pdf ]
We propose two new improvements for bagging methods on evolving data streams. Recently, two new variants of Bagging were proposed: ADWIN Bagging and Adaptive-Size Hoeffding Tree (ASHT) Bagging. ASHT Bagging uses trees of different sizes, and ADWIN Bagging uses ADWIN as a change detector to decide when to discard underperforming ensemble members. We improve ADWIN Bagging using Hoeffding Adaptive Trees, trees that can adaptively learn from data streams that change over time. To speed up the time for adapting to change of Adaptive-Size Hoeffding Tree (ASHT) Bagging, we add an error change detector for each classifier. We test our improvements by performing an evaluation study on synthetic and real-world datasets comprising up to ten million examples.

[8]

Conditional Density Estimation with Class Probability Estimators

Eibe Frank and Remco Bouckaert. Conditional density estimation with class probability estimators. In Proceedings of the 1st Asian Conference on Machine Learning, Nanjing, China. Springer, 2009.
[ bib | .pdf ]
Many regression schemes deliver a point estimate only, but often it is useful or even essential to quantify the uncertainty inherent in a prediction. If a conditional density estimate is available, then prediction intervals can be derived from it. In this paper we compare three techniques for computing conditional density estimates using a class probability estimator, where this estimator is applied to the discretized target variable and used to derive instance weights for an underlying univariate density estimator; this yields a conditional density estimate. The three density estimators we compare are: a histogram estimator that has been used previously in this context, a normal density estimator, and a kernel estimator. In our experiments, the latter two deliver better performance, both in terms of cross-validated log-likelihood and in terms of quality of the resulting prediction intervals. The empirical coverage of the intervals is close to the desired confidence level in most cases. We also include results for point estimation, as well as a comparison to Gaussian process regression and nonparametric quantile estimation.
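
A minimal sketch of this construction (the kernel-estimator variant, with assumed details such as equal-width discretization and scikit-learn components; the paper's experiments use a different toolkit and setup): the target is discretized, a class probability estimator supplies per-bin probabilities for a test point, those probabilities become instance weights, and a weighted Gaussian kernel estimator over the original target values gives the conditional density.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def conditional_density(X_train, y_train, x_test, n_bins=10, bandwidth=None):
        """Estimate p(y | x_test) on a grid, using class probabilities as instance weights."""
        # 1. Discretize the numeric target into equal-width bins.
        edges = np.linspace(y_train.min(), y_train.max(), n_bins + 1)
        bins = np.clip(np.digitize(y_train, edges[1:-1]), 0, n_bins - 1)

        # 2. Train a class probability estimator on the discretized target.
        clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, bins)
        probs = clf.predict_proba(np.asarray(x_test, dtype=float).reshape(1, -1))[0]

        # 3. Spread each bin's probability evenly over the training instances in that bin.
        weights = np.zeros(len(y_train))
        for cls, p in zip(clf.classes_, probs):
            members = bins == cls
            weights[members] = p / members.sum()

        # 4. Weighted Gaussian kernel density over the original target values.
        h = bandwidth or 1.06 * y_train.std() * len(y_train) ** (-0.2)  # rule-of-thumb bandwidth
        grid = np.linspace(y_train.min(), y_train.max(), 200)
        kernels = np.exp(-0.5 * ((grid[:, None] - y_train[None, :]) / h) ** 2) / (h * np.sqrt(2 * np.pi))
        return grid, kernels @ weights  # densities on the grid; prediction intervals can be read off this curve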

[9]

Tree-based Density Estimation: Algorithms and Applications

Gabi Schmidberger. Tree-based Density Estimation: Algorithms and Applications. PhD thesis, Department of Computer Science, University of Waikato, 2009.
[ bib | http ]
Data Mining can be seen as an extension to statistics. It comprises the preparation of data and the process of gathering new knowledge from it. The extraction of new knowledge is supported by various machine learning methods. Many of the algorithms are based on probabilistic principles or use density estimations for their computations. Density estimation has been practised in the field of statistics for several centuries. In the simplest case, a histogram estimator, like the simple equal-width histogram, can be used for this task and has been shown to be a practical tool to represent the distribution of data visually and for computation. Like other nonparametric approaches, it can provide a flexible solution. However, flexibility in existing approaches is generally restricted because the size of the bins is fixed: either the width of the bins or the number of values in them. Attempts have been made to generate histograms with a variable bin width and a variable number of values per interval, but the computational approaches in these methods have proven too difficult and too slow even with modern computer technology. In this thesis new flexible histogram estimation methods are developed and tested as part of various machine learning tasks, namely discretization, naive Bayes classification, clustering and multiple-instance learning. Not only are the new density estimation methods applied to machine learning tasks, they also borrow design principles from algorithms that are ubiquitous in artificial intelligence: divide-and-conquer methods are a well-known way to tackle large problems by dividing them into small subproblems. Decision trees, used for machine learning classification, successfully apply this approach. This thesis presents algorithms that build density estimators using a binary split tree to cut a range of values into subranges of varying length. No class values are required for this splitting process, making it an unsupervised method. The result is a histogram estimator that adapts well even to complex density functions: a novel density estimation method with flexible density estimation ability and good computational behaviour. Algorithms are presented for both univariate and multivariate data. The univariate histogram estimator is applied to discretization for density estimation and also used as the density estimator inside a naive Bayes classifier. The multivariate histogram, used as the basis for a clustering method, is applied to improve the runtime behaviour of a well-known algorithm for multiple-instance classification. Performance in these applications is evaluated by comparing the new approaches with existing methods.
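
A toy univariate version of the central idea (my own sketch, not the thesis code): split a range recursively without using class labels, and read the density of a leaf as its relative frequency divided by its width, giving bins of varying width and varying counts.

    import numpy as np

    def build_density_tree(values, lo, hi, n_total, min_points=30):
        """Recursively split [lo, hi] at the median; leaves store count / (n_total * width)."""
        split = float(np.median(values)) if len(values) else lo
        if len(values) < min_points or split <= lo or split >= hi:
            return {"lo": lo, "hi": hi,
                    "density": len(values) / (n_total * max(hi - lo, 1e-12))}
        return {"split": split,
                "left":  build_density_tree(values[values <  split], lo, split, n_total, min_points),
                "right": build_density_tree(values[values >= split], split, hi, n_total, min_points)}

    def density(tree, x):
        while "split" in tree:
            tree = tree["left"] if x < tree["split"] else tree["right"]
        return tree["density"]

    data = np.random.randn(10000)
    tree = build_density_tree(data, data.min(), data.max(), len(data))
    print(density(tree, 0.0))   # roughly 0.4, the peak of the standard normal density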

[10]

Classifier Chains for Multi-label Classification

Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. In Proc 13th European Conference on Principles and Practice of Knowledge Discovery in Databases and 20th European Conference on Machine Learning, Bled, Slovenia. Springer, 2009.
[ bib | .pdf ]
The widely known binary relevance method for multi-label classification, which considers each label as an independent binary problem, has been sidelined in the literature due to the perceived inadequacy of its label-independence assumption. Instead, most current methods invest considerable complexity to model interdependencies between labels. This paper shows that binary relevance-based methods have much to offer, especially in terms of scalability to large datasets. We exemplify this with a novel chaining method that can model label correlations while maintaining acceptable computational complexity. Empirical evaluation over a broad range of multi-label datasets with a variety of evaluation metrics demonstrates the competitiveness of our chaining method against related and state-of-the-art methods, both in terms of predictive performance and time complexity.
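
A minimal sketch of a single chain (my own illustration with a scikit-learn base learner; the paper's evaluation uses different base classifiers and also an ensemble of chains with random label orderings): each binary model is trained with the previous labels in the chain appended to the feature vector, and at prediction time the earlier predictions feed the later models.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    class ClassifierChain:
        """One chain over the labels in a fixed order; Y is an (n_samples, n_labels) 0/1 matrix."""
        def __init__(self, base=LogisticRegression, order=None):
            self.base, self.order = base, order

        def fit(self, X, Y):
            self.order_ = list(self.order) if self.order is not None else list(range(Y.shape[1]))
            self.models_, augmented = [], np.asarray(X, dtype=float)
            for j in self.order_:
                self.models_.append(self.base(max_iter=1000).fit(augmented, Y[:, j]))
                augmented = np.hstack([augmented, Y[:, [j]]])      # true label extends the features
            return self

        def predict(self, X):
            augmented = np.asarray(X, dtype=float)
            preds = np.zeros((len(X), len(self.order_)), dtype=int)
            for k, j in enumerate(self.order_):
                preds[:, j] = self.models_[k].predict(augmented)
                augmented = np.hstack([augmented, preds[:, [j]]])  # predicted label feeds the next model
            return preds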

[11]

Continuous Typist Verification using Machine Learning

Kathryn Hempstalk. Continuous Typist Verification using Machine Learning. PhD thesis, Department of Computer Science, University of Waikato, 2009.
[ bib | http ]
A keyboard is a simple input device. Its function is to send keystroke information to the computer (or other device) to which it is attached. Normally this information is employed solely to produce text, but it can also be utilized as part of an authentication system. Typist verification exploits a typist's patterns to check whether they are who they say they are, even after standard authentication schemes have confirmed their identity. This thesis investigates whether typists behave in a sufficiently unique yet consistent manner to enable an effective level of verification based on their typing patterns. Typist verification depends on more than the typist's behaviour. The quality of the patterns and the algorithms used to compare them also determine how accurately verification is performed. This thesis sheds light on all technical aspects of the problem, including data collection, feature identification and extraction, and sample classification. A dataset has been collected that is comparable in size, timing accuracy and content to others in the field, with one important exception: it is derived from real emails, rather than samples collected in an artificial setting. This dataset is used to gain insight into what features distinguish typists from one another. The features and dataset are used to train learning algorithms that make judgements on the origin of previously unseen typing samples. These algorithms use 'one-class classification'; they make predictions for a particular user having been trained on only that user's patterns. This thesis examines many one-class classification algorithms, including ones designed specifically for typist verification. New algorithms and features are proposed to increase speed and accuracy. The best method proposed performs at the state of the art in terms of classification accuracy, while decreasing the time taken for a prediction from minutes to seconds, and - more importantly - without requiring any negative data from other users. Also, it is general: it applies not only to typist verification, but to any other one-class classification problem. Overall, this thesis concludes that typist verification can be considered a useful biometric technique.

[12]

Human-competitive Automatic Topic Indexing

Olena Medelyan. Human-competitive Automatic Topic Indexing. PhD thesis, Department of Computer Science, University of Waikato, 2009.
[ bib | http ]
Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document's topics helps people judge its relevance quickly. However, assigning topics manually is labor intensive. This thesis shows how to generate them automatically in a way that competes with human performance. Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples. This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers, and by amateurs. We claim that the algorithm is human-competitive because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages.

[13]

Human-competitive tagging using automatic keyphrase extraction

Olena Medelyan, Eibe Frank, and Ian H. Witten. Human-competitive tagging using automatic keyphrase extraction. In Proc Conf on Empirical Methods in Natural Language Processing. ACL, 2009.
[ bib | .pdf ]
This paper connects two research areas: automatic tagging on the web and statistical keyphrase extraction. First, we analyze the quality of tags in a collaboratively created folksonomy using traditional evaluation techniques. Next, we demonstrate how documents can be tagged automatically with a state-of-the-art keyphrase extraction algorithm, and further improve performance in this new domain using a new algorithm, "Maui", that utilizes semantic information extracted from Wikipedia. Maui outperforms existing approaches and extracts tags that are competitive with those assigned by the best performing human taggers.

[14]

Clustering Documents using a Wikipedia-based Concept Representation

Anna Huang, David Milne, Eibe Frank, and Ian H. Witten. Clustering documents using a wikipedia-based concept representation. In Proc 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Bangkok, Thailand, pages 628-636. Springer, 2009.
[ bib | .pdf ]
This paper shows how Wikipedia and the semantic knowledge it contains can be exploited for document clustering. We first create a concept-based document representation by mapping the terms and phrases within documents to their corresponding articles (or concepts) in Wikipedia. We also develop a similarity measure that evaluates the semantic relatedness between concept sets for two documents. We test the concept-based representation and the similarity measure on two standard text document datasets. Empirical results show that although further optimizations could be performed, our approach already improves upon related techniques.
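
One way to picture such a concept-based similarity (a simplified sketch only; the measure in the paper combines concept weights with Wikipedia-derived relatedness scores, and both the concept mapping and the relatedness function are assumed to be given here): weight the overlap between two concept sets by how related their concepts are.

    import numpy as np

    def concept_similarity(doc_a, doc_b, relatedness):
        """doc_a, doc_b: dicts mapping Wikipedia concept ids to weights (e.g. tf-idf).
        relatedness(c1, c2): assumed to return a score in [0, 1] from Wikipedia's link structure."""
        if not doc_a or not doc_b:
            return 0.0
        norm_a = np.sqrt(sum(w * w for w in doc_a.values()))
        norm_b = np.sqrt(sum(w * w for w in doc_b.values()))
        score = sum(wa * wb * (1.0 if ca == cb else relatedness(ca, cb))
                    for ca, wa in doc_a.items()
                    for cb, wb in doc_b.items())
        return score / (norm_a * norm_b)

A clustering algorithm such as k-medoids or agglomerative clustering can then consume the resulting pairwise similarity (or distance) matrix.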

[15]

Sampling-based Prediction of Algorithm Runtime

Quan Sun. Sampling-based prediction of algorithm runtime. Master's thesis, Department of Computer Science, University of Waikato, 2009.
[ bib | http ]
The ability to handle and analyse massive amounts of data has been progressively improved during the last decade with the growth of computing power and the opening up of the Internet era. Nowadays, machine learning algorithms have been widely applied in various fields of engineering sciences and in real world applications. However, currently, users of machine learning algorithms do not usually receive feedback on when a given algorithm will have finished building a model for a particular data set. While in theory such estimation can be obtained by asymptotic performance analysis, the complexity of machine learning algorithms means theoretical asymptotic performance analysis can be a very difficult task. This work has two goals. The first goal is to investigate how to use sampling-based techniques to predict the running time of a machine learning algorithm training on a particular data set. The second goal is to empirically evaluate a set of sampling-based running time prediction methods. Experimental results show that, with some care in the sampling stage, application of appropriate transformations on the running time observations followed by the use of suitable curve fitting algorithms makes it possible to obtain useful average-case running time predictions and an approximate time function for a given machine learning algorithm building a model on a particular data set. 41 WEKA (Witten & Frank, 2005) machine learning algorithms are used in the experiments.
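
The core of the approach can be illustrated in a few lines (my own sketch; the thesis evaluates several sampling schemes, transformations and curve-fitting methods): time the algorithm on a handful of small subsamples, fit a curve in log-log space, and extrapolate to the full dataset size.

    import time
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def predict_runtime(X, y, fractions=(0.05, 0.1, 0.2, 0.3), make_model=RandomForestClassifier):
        """Fit a power law time ~ a * n^b from timings on subsamples, then extrapolate to len(X)."""
        sizes, times = [], []
        for f in fractions:
            n = max(int(len(X) * f), 10)
            idx = np.random.choice(len(X), n, replace=False)
            start = time.perf_counter()
            make_model().fit(X[idx], y[idx])                        # time a training run on the subsample
            times.append(time.perf_counter() - start)
            sizes.append(n)
        b, log_a = np.polyfit(np.log(sizes), np.log(times), 1)      # straight line in log-log space
        return np.exp(log_a) * len(X) ** b                          # predicted seconds on the full data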

[16]

Large-scale attribute selection using wrappers

Martin Gütlein, Eibe Frank, Mark Hall, and Andreas Karwath. Large-scale attribute selection using wrappers. In Proc IEEE Symposium on Computational Intelligence and Data Mining, pages 332-339. IEEE, 2009.
[ bib | .pdf ]
Scheme-specific attribute selection with the wrapper and variants of forward selection is a popular attribute selection technique for classification that yields good results. However, it can run the risk of overfitting because of the extent of the search and the extensive use of internal cross-validation. Moreover, although wrapper evaluators tend to achieve superior accuracy compared to filters, they face a high computational cost. The problems of overfitting and high runtime occur in particular on high-dimensional datasets, like microarray data. We investigate Linear Forward Selection, a technique to reduce the number of attribute expansions in each forward selection step. Our experiments demonstrate that this approach is faster, finds smaller subsets and can even increase the accuracy compared to standard forward selection. We also investigate a variant that applies explicit subset size determination in forward selection to combat overfitting, where the search is forced to stop at a precomputed "optimal" subset size. We show that this technique reduces subset size while maintaining comparable accuracy.
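
A compact sketch of the wrapper with a limited expansion set (my own illustration with scikit-learn; the paper works within WEKA and studies several ranking and stopping strategies): rank the attributes once, then at each forward-selection step only try extending the current subset with the k best-ranked unused attributes.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.feature_selection import mutual_info_classif

    def linear_forward_selection(X, y, k=20, estimator=None, cv=5):
        """Forward selection that only expands the k best-ranked unused attributes per step."""
        estimator = estimator if estimator is not None else GaussianNB()
        ranking = np.argsort(mutual_info_classif(X, y))[::-1]        # one-off attribute ranking
        selected, best_score = [], -np.inf
        while True:
            candidates = [a for a in ranking if a not in selected][:k]
            scored = [(cross_val_score(estimator, X[:, selected + [a]], y, cv=cv).mean(), a)
                      for a in candidates]
            if not scored:
                break
            score, attr = max(scored)
            if score <= best_score:
                break                                                # no candidate improves the subset
            selected.append(attr)
            best_score = score
        return selected, best_score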

[17]

Analysing chromatographic data using data mining to monitor petroleum content in water

G. Holmes, D. Fletcher, P. Reutemann, and E. Frank. Analysing chromatographic data using data mining to monitor petroleum content in water. In Proc 4th International ICSC Symposium on Information Technologies in Environmental Engineering, Thessaloniki, Greece, pages 279-290, May 2009.
[ bib | .pdf ]
Chromatography is an important analytical technique that has widespread use in environmental applications. A typical application is the monitoring of water samples to determine if they contain petroleum. These tests are mandated in many countries to enable environmental agencies to determine if tanks used to store petrol are leaking into local water systems.
Chromatographic techniques, typically using gas or liquid chromatography coupled with mass spectrometry, allow an analyst to detect a vast array of compounds, potentially in the order of thousands. Accurate analysis relies heavily on the skills of a limited pool of experienced analysts utilising semi-automatic techniques to analyse these datasets, making the outcomes subjective.
The focus of current laboratory data analysis systems has been on refinements of existing approaches. The work described here represents a paradigm shift achieved through applying data mining techniques to tackle the problem. These techniques are compelling because the efficacy of preprocessing methods, which are essential in this application area, can be objectively evaluated. This paper presents preliminary results using a data mining framework to predict the concentrations of petroleum compounds in water samples. Experiments demonstrate that the framework can be used to produce models of sufficient accuracy, measured in terms of root mean squared error and correlation coefficients, to offer the potential for significantly reducing the time spent by analysts on this task.

[18]

Off-line signature verification

Robert L. Larkins. Off-line signature verification. Master's thesis, Department of Computer Science, University of Waikato, 2009.
[ bib | http ]
In today's society signatures are the most accepted form of identity verification. However, they have the unfortunate side-effect of being easily abused by those who would feign the identification or intent of an individual. This thesis implements and tests current approaches to off-line signature verification with the goal of determining the most beneficial techniques that are available. This investigation will also introduce novel techniques that are shown to significantly boost the achieved classification accuracy for both person-dependent (one-class training) and person-independent (two-class training) signature verification learning strategies. The findings presented in this thesis show that many common techniques do not always give any significant advantage and in some cases they actually detract from the classification accuracy. Using the techniques that are proven to be most beneficial, an effective approach to signature verification is constructed, which achieves approximately 90% and 91% on the standard CEDAR and GPDS signature datasets respectively. These results are significantly better than the majority of results that have been previously published. Additionally, this approach is shown to remain relatively stable when a minimal number of training signatures are used, representing feasibility for real-world situations.

[19]

Machine learning for adaptive computer game opponents

Jonathan Miles. Machine learning for adaptive computer game opponents. Master's thesis, Department of Computer Science, University of Waikato, 2009.
[ bib ]

[20]

Prediction of Oestrus in Dairy Cows: An Application of Machine Learning to Skewed Data

Adam David Lynam. Prediction of oestrus in dairy cows: An application of machine learning to skewed data. Master's thesis, Department of Computer Science, University of Waikato, 2009.
[ bib | http ]
The dairy industry requires accurate detection of oestrus (heat) in dairy cows to maximise the output of the animals. Traditionally this is a process dependent on human observation and interpretation of the various signs of heat. Many areas of the dairy industry can be automated; however, the detection of oestrus is an area that still requires human experts. This thesis investigates the application of Machine Learning classification techniques, on dairy cow milking data provided by the Livestock Improvement Corporation, to predict oestrus. The usefulness of various ensemble learning algorithms such as Bagging and Boosting is explored, as well as specific skewed data techniques. An empirical study into the effectiveness of classifiers designed to target skewed data is included as a significant part of the investigation. Roughly Balanced Bagging and the novel Under Bagging classifiers are explored in considerable detail and found to perform quite favourably compared with the SMOTE technique for the datasets selected. This study uses commonplace, non-dairy machine learning datasets, many of which are found in the UCI Machine Learning Repository.
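
For readers unfamiliar with the skewed-data ensembles mentioned above, here is a minimal under-bagging sketch (my own illustration; the thesis's Under Bagging and Roughly Balanced Bagging variants differ in how the per-bag class counts are drawn): every bag keeps all minority-class examples plus an equally sized random draw of majority-class examples, so each base classifier is trained on balanced data.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def under_bagging(X, y, n_members=25, seed=1):
        """Train an ensemble where every bag is balanced by undersampling the majority class."""
        rng = np.random.default_rng(seed)
        classes, counts = np.unique(y, return_counts=True)
        minority, majority = classes[np.argmin(counts)], classes[np.argmax(counts)]
        min_idx = np.flatnonzero(y == minority)
        maj_idx = np.flatnonzero(y == majority)
        members = []
        for _ in range(n_members):
            bag = np.concatenate([min_idx, rng.choice(maj_idx, size=len(min_idx), replace=False)])
            members.append(DecisionTreeClassifier().fit(X[bag], y[bag]))
        return members

    def vote(members, X):
        """Majority vote over the ensemble; assumes non-negative integer class labels."""
        votes = np.array([m.predict(X) for m in members])
        return np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)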

[21]

Accuracy of machine learning models versus hand crafted expert systems - A credit scoring case study

Arie Ben-David and Eibe Frank. Accuracy of machine learning models versus hand crafted expert systems - a credit scoring case study. Expert Systems with Applications, 36(3):5264-5271, 2009.
[ bib | http ]
Relatively few publications compare machine learning models with expert systems when applied to the same problem domain. Most publications emphasize those cases where the former beat the latter. Is this a realistic picture of the state of the art? Some other findings are presented here. The accuracy of a real-world hand-crafted credit scoring expert system is compared with dozens of machine learning models. The results show that while some machine learning models can surpass the expert system's accuracy with statistical significance, most models do not. More interestingly, this happened only when the problem was treated as regression. In contrast, no machine learning model showed any statistically significant advantage over the expert system's accuracy when the same problem was treated as classification. Since the true nature of the class data was ordinal, the latter is the more appropriate setting. It is also shown that the answer to the question is highly dependent on the metric that is used to define accuracy.

[22]

Encyclopedia of Database Systems

Ian H. Witten. Encyclopedia of Database Systems, chapter Classification, pages 331-335. Springer, 2009.
[ bib | http ]