[1]

Unsupervised learning from incomplete data using a mixture model approach
Lynette Hunt and Murray Jorgensen.
Unsupervised learning from incomplete data using a mixture model
approach.
In Statistical Data Mining and Knowledge Discovery, pages
173188. Chapman & Hall/CRC, 2004.
[ bib ]
Many unsupervised learning tasks involve high dimensional data sets where some of the attributes are continuous and some are categorical. One possible approach to clustering such data is to assume that the data to be clustered arise from a finite mixture of populations. The mixture likelihood approach has been well developed and much used, especially for mixtures where the component distributions are multivariate normal. Hunt and Jorgensen [17] presented methodology that enabled the clustering of mixed categorical and continuous data using a mixture approach. However many multivariate data sets encountered also contain unobserved or missing values. In this paper we demonstrate that the mixture approach to clustering can handle mixed categorical and continuous data where data are missing at random in the sense of Little and Rubin [30]. The methodology is illustrated by clustering two data sets.


[2]

Experiments in predicting biodegradability
H. Blockeel, S. Dzeroski, B. Kompare, S. Kramer, B. Pfahringer, and
W. Van Laer.
Experiments in predicting biodegradability.
Journal of Applied Artificial Intelligence, 18(2), Feburary
2004.
[ bib 
.ps 
.pdf ]
This paper is concerned with the use of AI techniques in ecology. More specifically, we present a novel application of inductive logic programming (ILP) in the area of quantitative structureactivity relationships (QSARs). The activity we want to predict is the biodegradability of chemical compounds in water. In particular, the target variable is the halflife for aerobic aqueous biodegradation. Structural descriptions of chemicals in terms of atoms and bonds are derived from the chemicals' SMILES encodings. Definition of substructures are used as background knowledge. Predicting biodegradability is essentially a regression problem, but we also consider a discretized version of the target variable. We thus employ a number of relational classification and regression methods on the relational representation and compare these to propositional methods applied to different propositionalizations of the problem. We also experiment with the prediction technique that consists of merging upper and lower bound predictions into one prediction. Some conclusions are drawn concerning the applicability of machine learning systems and the merging technique in this domain and concerning the evaluation of hypotheses.


[3]

Bayesian network classifiers in Weka
R. Bouckaert.
Bayesian network classifiers in weka.
Technical Report 14/2004, The University of Waikato, Department of
Computer Science, Hamilton, New Zealand, 2004.
[ bib 
.pdf ]
Various Bayesian network classifier learning algorithms are implemented in Weka. This note provides some user documentation and implementation details. Summary of the main capabilities:
(a) Structure learning of Bayesian networks using various hill climbing (K2, B, etc) and general purpose (simulated annealing, tabu search) algorithms.
(b) Local score metrics implemented; Bayes, BDe, MDL, entropy, AIC.
(c) Global score metrics implemented; leave one out cv, kfold cv and cumulative cv.
(d) Conditional independence based causal recovery algorithm available.
(e) Parameter estimation using direct estimates and Bayesian model averaging.
(f) GUI for easy inspection of Bayesian networks.
(g) Part of Weka allowing systematic experiments to compare Bayes net performance with general purpose classifiers like C4.5, nearest neighbor, support vector, etc.
(h) Source code available under GPL allows for integration in other systems and makes it easy to extend.


[4]

Estimating replicability of classifier learning
R. Bouckaert.
Estimating replicability of classifier learning.
In Proc International Conference on Machine Learning, 2004.
[ bib 
.ps 
.pdf ]
Replicability of machine learning experiments measures how likely it is that the outcome of one experiment is repeated when performed with a different randomization of the data. In this paper, we present an estimator or replicability of an experiment that is efficient. More precisely, the estimator is unbiased and has lowest variance in the class of estimators formed by a linear combination of outcomes of experiments on a given data set.
We gathered empirical data for comparing experiments consisting of different sampling schemes and hypothesis tests. Both factors are shown to have an impact on replicability of experiments. The data suggests that sign tests should not be used due to low replicability. Ranked sum tests show better performance, but the combination of a sorted runs sampling scheme with a ttest gives the most desirable performance judged on Type I and II error and replicability.


[5]

Evaluating the Replicability of Significance Tests for Comparing
Learning Algorithms
Remco R. Bouckaert and Eibe Frank.
Evaluating the replicability of significance tests for comparing
learning algorithms.
In Proc 8th PacificAsia Conference on Knowledge Discovery and
Data Mining, volume 3056 of LNAI, pages 312, Sydney, Australia,
2004. Springer.
[ bib 
.ps 
.pdf ]
Empirical research in learning algorithms for classification tasks generally requires the use of significance tests. The quality of a test is typically judged on Type I error (how often the test indicates a difference when it should not) and Type II error (how often it indicates no difference when it should). In this paper we argue that the replicability of a test is also of importance. We say that a test has low replicability if its outcome strongly depends on the particular random partitioning of the data that is used to perform it. We present empirical measures of replicability and use them to compare the performance of several popular tests in a realistic setting involving standard learning algorithms and benchmark datasets. Based on our results we give recommendations on which test to use.


[6]

Naive Bayes classifiers that perform well with continuous variables
R. Bouckaert.
Naive bayes classifiers that perform well with continuous variables.
In G.I. Webb and X. Yu, editors, Proc Seventeenth Australian
Joint Conference on Artificial Intelligence (AI 2004), Advances in Artificial
Intelligence, volume 3339 of LNAI, pages 10891094, Cairns,
Australia, December 2004. Springer.
[ bib ]
There are three main methods for handling continuous variables in naive Bayes classifiers, namely, the normal method (parametric approach), the kernel method (non parametric approach) and discretization. In this article, we perform a methodologically sound comparison of the three methods, which shows large mutual differences of each of the methods and no single method being universally better. This suggests that a method for selecting one of the three approaches to continuous variables could improve overall performance of the naive Bayes classifier. We present three methods that can be implemented efficiently vfold cross validation for the normal, kernel and discretization method. Empirical evidence suggests that selection using 10 fold cross validation (especially when repeated 10 times) can largely and significantly improve over all performance of naive Bayes classifiers and consistently outperform any of the three popular methods for dealing with continuous variables on their own. This is remarkable, since selection among more classifiers does not consistently result in better accuracy.


[7]

Data mining in bioinformatics using Weka
Eibe Frank, Mark Hall, Leonard E. Trigg, Geoffrey Holmes, and Ian H. Witten.
Data mining in bioinformatics using Weka.
Bioinformatics, 20(15):24792481, April 2004.
[ bib 
http 
.ps 
.pdf ]
The Weka machine learning workbench provides a generalpurpose environment for automatic classification, regression, clustering, and feature selectioncommon data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data preprocessing methods complemented by graphical user interfaces for data exploration and the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it.


[8]

Predicting Library of Congress classifications from Library
of Congress subject headings
Eibe Frank and Gordon W. Paynter.
Predicting library of congress classifications from library of
congress subject headings.
JASIST, 55(3):214227, 2004.
Also published as Working Paper 01/2003, Department of Computer
Science, The University of Waikato, Hamilton.
[ bib 
.ps 
.pdf ]
This paper addresses the problem of automatically assigning a Library of Congress Classification (LCC) to work given its set of Library of Congress Subject Headings (LCSH). LCC are organized in a tree: the root node of this hierarchy comprises all possible topics, and leaf nodes correspond to the most specialized topic areas defined. We describe a procedure that, given a resource identified by its LCSH, automatically places that resource in the LCC hierarchy. The procedure uses machine learning techniques and training data from a large library catalog to learn a classification model mapping from sets of LCSH to nodes in the LCC tree. We present empirical results for our technique showing its accuracy on an independent collection of 50,000 LCSH/LCC pairs.


[9]

Ensembles of nested dichotomies for multiclass problems
Eibe Frank and Stefan Kramer.
Ensembles of nested dichotomies for multiclass problems.
In R. Greiner and D. Schuurmans, editors, Proc 21st
International Conference on Machine Learning, pages 305312, Banff, Canada,
2004. ACM Press.
Also published as Working Paper 06/2004, Department of Computer
Science, The University of Waikato, Hamilton.
[ bib 
.ps.gz 
.pdf ]
Nested dichotomies are a standard statistical technique for tackling certain polytomous classification problems with logistic regression. They can be represented as binary trees that recursively split a multiclass classification task into a system of dichotomies and provide a statistically sound way of applying twoclass learning algorithms to multiclass problems (assuming these algorithms generate class probability estimates). However, there are usually many candidate trees for a given problem and in the standard approach the choice of a particular tree is based on domain knowledge that may not be available in practice. An alternative is to treat every system of nested dichotomies as equally likely and to form an ensemble classifier based on this assumption. We show that this approach produces more accurate classifications than applying C4.5 and logistic regression directly to multiclass problems. Our results also show that ensembles of nested dichotomies produce more accurate classifiers than pairwise classification if both techniques are used with C4.5, and comparable results for logistic regression. Compared to errorcorrecting output codes, they are preferable if logistic regression is used, and comparable in the case of C4.5. An additional benefit is that they generate class probability estimates. Consequently they appear to be a good generalpurpose method for applying binary classifiers to multiclass problems.


[10]

An Instrument Control System Using Predictive Modelling
Geoffrey Holmes and Dale Fletcher.
An instrument control system using predictive modelling.
In H. Araujo and et al., editors, Proceedings of the First
International Conference on Informatics in Control, Automation and Robotics,
pages 292295, Setubal, Portugal, 2004. INSTICC Press.
[ bib ]
We describe a system for providing early warning of possible error to an operator in control of an instrument providing results in batches from samples, for example, chemical elements found in soil or water samples. The system has the potential to be used with any form of instrument that provides multiple results for a given sample. The idea is to train models for each measurement, using historical data. The set of trained models are then capable of making predictions on new data based on the values of the other measurements. This approach has the potential to uncover previously unknown relationships between the measurements. An example application has been constructed that highlights the difference of the actual value for a measurement from its predicted value. The operator is provided with sliders to attenuate the sensitivity of the measurement perhaps based on its importance or its known sensitivity.


[11]

Mining data streams using option trees (revised edition, 2004)
G. Holmes, R. Kirkby, and B. Pfahringer.
Mining data streams using option trees (revised edition, 2004).
Technical Report 03/2004, The University of Waikato, Department of
Computer Science, Hamilton, New Zealand, 2004.
[ bib 
.pdf ]
The data stream model for data mining places harsh restrictions on a learning algorithm. A model must be induced following the briefest interrogation of the data, must use only available memory and must update itself over time within these constraints. Additionally, the model must be able to be used for data mining at any point in time. This paper describes a data stream classification algorithm using an ensemble of option trees. The ensemble of trees is induced by boosting and iteratively combined into a single interpretable model. The algorithm is evaluated using benchmark datasets for accuracy against stateoftheart algorithms that make use of the entire dataset.


[12]

Mining data streams using option trees
G. Holmes, R. Kirkby, and B. Pfahrinnger.
Mining data streams using option trees.
In J. AguilarRuiz and J. Gama, editors, Proc Workshop W10 on
Knowledge Discovery in Data Streams, Fifthteenth Conference on Machine
Learning, Eighth European Conference on Principles and Practice of Knowledge
Discovery in Databases (ECML/PKDD), pages 7584, Pisa, Italy, 2004.
[ bib ]
The data stream model for data mining places harsh restrictions on a learning algorithm. A model must be induced following the briefest interrogation of the data, must use only available memory and must update itself over time within these constraints. Additionally, the model must be able to be used for data mining at any point in time. This paper describes a data stream classification algorithm using an ensemble of option trees. This ensemble is iteratively combined into a single interpretable model. The algorithm is evaluated using benchmark datasets for accuracy against stateoftheart algorithms that make use of the entire dataset.


[13]

Multinomial Naive Bayes for Text Categorization Revisited
Ashraf M. Kibriya, Eibe Frank, Bernhard Pfahringer, and Geoffrey Holmes.
Multinomial naive Bayes for text categorization revisited.
In G. I. Webb and X. Yu, editors, Proc 17th Australian Joint
Conference on Artificial Intelligence, volume 3339 of LNAI, pages
488499, Cairns, Australia, 2004. Springer.
[ bib 
.ps.gz 
.pdf ]
This paper presents empirical results for several versions of the multinomial naive Bayes classifier on four text categorization problems, and a way of improving it using locally weighted learning. More specifically, it compares standard multinomial naive Bayes to the recently proposed transformed weightnormalized complement naive Bayes classifier (TWCNB) [1], and shows that some of the modifications included in TWCNB may not be necessary to achieve optimum performance on some datasets. However, it does show that TFIDF conversion and document length normalization are important. It also shows that support vector machines can, in fact, sometimes very significantly outperform both methods. Finally, it shows how the performance of multinomial naive Bayes can be improved using locally weighted learning. However, the overall conclusion of our paper is that support vector machines are still the method of choice if the aim is to maximize accuracy.


[14]

Clustering Large Datasets Using Cobweb and KMeans in Tandem
Mi Li, Geoffrey Holmes, and Bernhard Pfahringer.
Clustering large datasets using cobweb and kmeans in tandem.
In G. I. Webb and X. Yu, editors, Proc 17th Australian Joint
Conference on Artificial Intelligence, volume 3339 of LNAI, pages
368379, Cairns, Australia, 2004. Springer.
[ bib ]
This paper presents a single scan algorithm for clustering large datasets based on a two phase process which combines two well known clustering methods. The Cobweb algorithm is modified to produce a balanced tree with subclusters at the leaves, and then Kmeans is applied to the resulting subclusters. The resulting method, Scalable Cobweb, is then compared to a single pass Kmeans algorithm and standard Kmeans. The evaluation looks at error as measured by the sum of squared error and vulnerability to the order in which data points are processed.


[15]

Flow clustering using machine learning techniques
A. McGregor, M. Hall, P. Lorier, and J. Brunskill.
Flow clustering using machine learning techniques.
In C. Barakat and I. Pratt, editors, Proc Fifth International
Workshop on Passive and Active Network Measurement (PAM 2004), volume 3015
of LNCS, pages 205214, Antibes JuanlesPins, France, 2004. Springer.
[ bib ]
Packet header traces are widely used in network analysis. Header traces are the aggregate of traffic from many concurrent applications. We present a methodology, based on machine learning, that can break the trace down into clusters of traffic where each cluster has different traffic characteristics. Typical clusters include bulk transfer, single and multiple transactions and interactive traffic, amongst others. The paper includes a description of the methodology, a visualisation of the attribute statistics that aids in recognising cluster types and a discussion of the stability and effectiveness of the methodology.


[16]

Using Classification to Evaluate the Output of ConfidenceBased
Association Rule Mining
Stefan Mutter, Mark Hall, and Eibe Frank.
Using classification to evaluate the output of confidencebased
association rule mining.
In G. I. Webb and X. Yu, editors, Proc 17th Australian Joint
Conference on Artificial Intelligence, volume 3339 of LNAI, pages
538549, Cairns, Australia, 2004. Springer.
[ bib 
.ps.gz 
.pdf ]
Association rule mining is a data mining technique that reveals interesting relationships in a database. Existing approaches employ different parameters to search for interesting rules. This fact and the large number of rules make it difficult to compare the output of confidencebased association rule miners. This paper explores the use of classification performance as a metric for evaluating their output. Previous work on forming classifiers from association rules has focussed on accurate classification, whereas we concentrate on using the properties of the resulting classifiers as a basis for comparing confidencebased association rule learners. Therefore, we present experimental results on 12 UCI datasets showing that the quality of small rule sets generated by Apriori can be improved by using the predictive Apriori algorithm. We also show that CBA, the standard method for classification using association rules, is generally inferior to standard rule learners concerning both running time and size of rule sets.


[17]

Applying machine learning to programming by demonstration
Gordon W. Paynter and Ian H. Witten.
Applying machine learning to programming by demonstration.
J. Exp. Theor. Artif. Intell., 16(3):161188, 2004.
[ bib ]
'Familiar' is a tool that helps endusers automate iterative tasks in their applications by showing examples of what they want to do. It observes the user's actions, predicts what they will do next, and then offers to complete their task. Familiar learns in two ways. First, it creates a model, based on data gathered from training tasks, that selects the best prediction from among several candidates. Experiments show that decision trees outperform heuristic methods, and can be further improved by incrementally updating the classifier at task time. Second, it uses decision stumps inferred from analogous examples in the event trace to predict the parameters of conditional rules. Because data is sparsefor most users balk at giving more than a few training examplespermutation tests are used to calculate the statistical significance of each stump, successfully eliminating bias towards attributes with many different values.


[18]

A Toolbox for Learning from Relational Data with Propositional
and Multiinstance Learners
Peter Reutemann, Bernhard Pfahringer, and Eibe Frank.
A toolbox for learning from relational data with propositional and
multiinstance learners.
In G. I. Webb and X. Yu, editors, Proc 17th Australian Joint
Conference on Artificial Intelligence, volume 3339 of LNAI, pages
10171023, Cairns, Australia, 2004. Springer.
[ bib 
.ps.gz 
.pdf ]
Most databases employ the relational model for data storage. To use this data in a propositional learner, a propositionalization step has to take place. Similarly, the data has to be transformed to be amenable to a multiinstance learner. The Proper Toolbox contains an extended version of RELAGGS, the MultiInstance Learning Kit MILK, and can also combine the multiinstance data with aggregated data from RELAGGS. RELAGGS was extended to handle arbitrarily nested relations and to work with both primary keys and indices. For MILK the relational model is flattened into a single table and this data is fed into a multiinstance learner. REMILK finally combines the aggregated data produced by RELAGGS and the multiinstance data, flattened for MILK, into a single table that is once again the input for a multiinstance learner. Several wellknown datasets are used for experiments which highlight the strengths and weaknesses of the different approaches.


[19]

Computer Aided Software Engineering (CASE), empowered by Natural Language Processing (NLP) for object oriented analysis and design using Unified Modeling Language (UML)
T. C. Smith and R. T. Giganto.
Computer aided software engineering (case), empowered by natural
language processing (nlp) for object oriented analysis and design using
unified modeling language (uml).
In Proc Second Mindanao Conference on Information Technology
Education (MCITE'04), pages 1923, Davao City, Philippines, 2004.
[ bib ]
Computer Aided Software Engineering (CASE) tools are often regarded as the ultimate aide for software engineers in developing software. However, studies show that CASE tools are not enough to assist developers fully in their analysis tasks. Natural Language Processing (NLP) has been empowering these CASE tools to at least aid developers in the analysis phases. Still, NLPbased CASE tools have certain weaknesses when it comes to detecting classes or objects in the system. With the advent of object oriented analysis and design (OOAD), Unified Modeling Language (UML) has come to be the preferred medium for modeling systems. Existing NLPbased CASE tools produce object models in UML; however, unnecessary classes are often created in the process, while other object models are not produced at all. This paper introduces a research motivation aimed at designing a new NLPbased CASE system that minimizes generation of unnecessary classes and expresses object models in UML.


[20]

A textclassification approach to the prediction and characterization of signal peptides
T. C. Smith.
A textclassification approach to the prediction and characterization
of signal peptides.
In Proc X Convencion Internacional y Feria, Libro Reumen,
Informatica 2004, pages 361362, La Habana, Cuba, 2004.
[ bib ]
This paper describes an unconventional machine learning approach to predicting signal peptide cleavage points through a kind of pseudotext classification procedure. It is based on the view that the first amino residue in the mature protein can be characterized with a textual summary of some of its specific properties. By creating a textual description of every amino residuethat is, an account of such things as the chemical properties of the residue, its context in the aminoacid chain, its relative proximity to the amino terminus, and so fortha text classification algorithm can infer a characteristic model of those textual features that imply whether or not any one residue is or is not the start of the mature protein. More to the point, insofar as a biologist might give an account, in English, as to why some particular residue in the aminoacid chain actually marks the cleavage site, a ranked list of the textual attributes inferred by a text classification algorithm can do the same.


[21]

Data mining bread quality and process data in a plant bakery
A.J. Wilson, M.P. Morgenstern, B. Pfahringer, and C. Leschi.
Data mining bread quality and process data in a plant bakery.
In Proc Twelfth ICC Cereal and Bread Congress, Harrogate, UK,
May 2004.
[ bib 
.ps 
.pdf ]
In modern automated plant bakeries a large amount of data is collected on the operation of the plant. When this data is combined with product quality data such as loaf colour, appearance, consumer complaints sales data etc, it has the potential to be used to improve processing efficiency, final product quality, and product marketability. However the huge volume of this data means it is often ignored as being too hard to analyse in any meaningful way. Data mining, which is a combination of techniques that produces information from large data sets, has the potential to be applied to this data to extract useful information.
This paper describes our experience of applying data mining techniques to a plant bakery in New Zealand. The process involved setting up the systems required to extract data from the bakeries SCADA system, setting up sensors to automatically measure and record quality parameters, cleaning the data to remove faulted or anomalous results and then combining all the separate data blocks into one complete database for analysis.
Data were analysed at two levels. Firstly, selected data were analysed for simple trends on an individual loaf basis which served to identify variability caused by divider pockets, tin positioning etc. Secondly data mining techniques such as various classifiers and principal components were applied to the whole data set to find relationships between process data and product quality.


[22]

Text mining in a digital library
Ian H. Witten, Katherine J. Don, Michael Dewsnip, and Valentin Tablan.
Text mining in a digital library.
Int. J. on Digital Libraries, 4(1):5659, 2004.
[ bib ]
Digital librarians strive to add value to the collections they create and maintain. One way is through selectivity: a carefully chosen set of authoritative documents in a particular topic area is far more useful to those working in the area than a huge, unfocused collections (like the Web). Another is by augmenting the collection with highquality metedata, which supports activities of searching and browsing in a uniform and useful way. A third way, and our topic here, is to enrich the documents by examining their content, extracting information, and using it to enhance the ways they can be located and presented.


[23]

Adaptive text mining: inferring structure from sequences
Ian H. Witten.
Adaptive text mining: inferring structure from sequences.
J. Discrete Algorithms, 2(2):137159, 2004.
[ bib ]
Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although handcrafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tacked adaptively.


[24]

Logistic Regression and Boosting for Labeled Bags of Instances
Xin Xu and Eibe Frank.
Logistic regression and boosting for labeled bags of instances.
In H. Dai, R. Srikant, and C. Zhang, editors, Proc 8th
PacificAsia Conference on Knowledge Discovery and Data Mining, volume 3056
of LNAI, pages 272281, Sydney, Australia, 2004. Springer.
[ bib 
.ps 
.pdf ]
In this paper we upgrade linear logistic regression and boosting to multiinstance data, where each example consists of a labeled bag of instances. This is done by connecting predictions for individual instances to a baglevel probability estimate by simple averaging and maximizing the likelihood at the bag levelin other words, by assuming that all instances contribute equally and independently to a bag?s label. We present empirical results for artificial data generated according to the underlying generative model that we assume, and also show that the two algorithms produce competitive results on the Musk benchmark datasets.

