Dataset cataloging metadata for machine learning applications and research

S.J. Cunningham. Dataset cataloging metadata for machine learning applications and research. In Proc International Artificial Intelligence and Statistics Workshop, Ft. Lauderdale, Florida, 1997.
[ bib | .ps | .pdf ]
As the field of machine learning (ML) matures, two types of data archives are developing: collections of benchmark data sets used to test the performance of new algorithms, and data stores to which machine learning/data mining algorithms are applied to create scientific or commercial applications. At present, the catalogs of these archives are ad hoc and not tailored to machine learning analysis. This paper considers the cataloging metadata required to support these two types of repositories, and discusses the organizational support necessary for archive catalog maintenance.


Machine learning applications in anthropology: automated discovery over kinship structures

S.J. Cunningham. Machine learning applications in anthropology: automated discovery over kinship structures. Computers and the Humanities, 30:401-406, 1997.
[ bib | .ps | .pdf ]
A common problem in anthropological field work is generalizing rules governing social interactions and relations (particularly kinship) from a series of examples. One class of machine learning algorithms is particularly well-suited to this task: inductive logic programming systems, as exemplified by FOIL. A knowledge base of relationships among individuals is established, in the form of a series of single-predicate facts. Given a set of positive and negative examples of a new relationship, the machine learning programs build a Horn clause description of the target relationship. The power of these algorithms to derive complex hypotheses is demonstrated for a set of kinship relationships drawn from the anthropological literature. FOIL extends the capabilities of earlier anthropology-specific learning programs by providing a more powerful representation for induced relationships, and is better able to learn in the face of noisy or incomplete data.


Brain-Like Computing and Intelligent Information Systems

S.J. Cunningham, G. Holmes, J. Littin, R. Beale, and I.H. Witten. Brain-Like Computing and Intelligent Information Systems, chapter Applying connectionist models to information retrieval, pages 435-457. Springer-Verlag, 1997.
[ bib | .ps | .pdf ]
Adaptive information retrieval (IR) systems based on connectionist architectures have captured the attention of researchers over the past decade. This paper provides a review of connectionist IR research, including the major models for connectionist document and query representation, techniques to enhance query re-formulation, dynamic document routing (information filtering), and connectionist techniques for document clustering.


Applications of machine learning in information retrieval

S.J. Cunningham, J.N. Littin, and I.H. Witten. Applications of machine learning in information retrieval. Technical Report 97/6, University of Waikato, Department of Computer Science, Hamilton, New Zealand, February 1997.
[ bib | .ps | .pdf ]
Information retrieval systems provide access to collections of thousands, or millions, of documents, from which, by providing an appropriate description, users can recover any one. Typically, users iteratively refine the descriptions they provide to satisfy their needs, and retrieval systems can utilize user feedback on selected documents to indicate the accuracy of the description at any stage. The style of description required from the user, and the way it is employed to search the document database, are consequences of the indexing method used for the collection. The index may take different forms, from storing keywords with links to individual document, to clustering documents under related topics.

Much of the work in information retrieval can be automated. Processes such as document indexing and query refinement are usually accomplished by computer, while document classification and index term selection are more often performed manually. However, manual development and maintenance of document databases is time-consuming, tedious, and error-prone. Algorithms that mine documents for indexing information, and model user interests to help them formulate queries, reduce the workload and can ensure more consistent behavior. Such algorithms are based in machine learning, a dynamic, burgeoning area of computer science which is finding application in domains ranging from expert systems, where learning algorithms supplement-or even supplant-domain experts for generating rules and explanations (Langley and Simon, 1995), to intelligent agents, which learn to play particular, highly-specialized, support roles for individual people and are seen by some to herald a new renaissance of artificial intelligence in information technology (Hendler, 1997).


Feature subset selection: a correlation based filter approach

M.A. Hall and L.A. Smith. Feature subset selection: a correlation based filter approach. In N. Kasabov and et al., editors, Proc Fourth International Conference on Neural Information Processing and Intelligent Information Systems, pages 855-858, Dunedin, New Zealand, 1997.
[ bib | .ps | .pdf ]
Recent work has shown that feature subset selection can have a positive affect on the performance of machine learning algorithms. Some algorithms can be slowed or their performance adversely affected by too much data, some of which may be irrelevant or redundant to the learning task. Feature subset selection, then, is a method for enhancing the performance of learning algorithms, reducing the storage requirement. This paper describes a feature subset selector that uses a correlation based heuristic to determine the goodness of feature subsets, and evaluates its effectiveness with three common ML algorithms: a decision tree inducer (C4.5), a naive Bayes classifier, and an instance based learner (IB1). Experiments using a number of standard data sets drawn from real and artificial domains are presented. Feature subset selection gave significant improvement for all three algorithms; C4.5 generated smaller decision trees.


Discovering inter-attribute relationships

G. Holmes. Discovering inter-attribute relationships. In N. Kasabov and et al., editors, Proc Fourth International Conference on Neural Information Processing and Intelligent Information Systems, pages 914-917, Dunedin, New Zealand, 1997.
[ bib | .ps | .pdf ]
It is important to discover relationships between attributes being used to predict a class attribute in supervised learning situations for two reasons. First, any such relationship will be potentially interesting to the provider of a dataset in its own right. Second, it would simplify a learning algorithm's search space, and the related irrelevant feature and subset selection problem, if the relationships were removed from datasets ahead of learning. An algorithm to discover such relationships is presented in this paper. The algorithm is described and a surprising number of inter-attribute relationships are discovered in datasets from the University of California at Irvine (UCI) repository.


Mining for causes of cancer: machine learning experiments at various levels of detail

S. Kramer, B. Pfahringer, and C. Helma. Mining for causes of cancer: machine learning experiments at various levels of detail. In Proc International Conference on Knowledge Discovery and Data Mining, pages 223-226, Menlo Park, 1997. AAAI Press.
[ bib | .ps | .pdf ]
This paper presents first results of an interdisciplinary project in scientific data mining. We analyze data about the carcinogenicity of chemicals derived from the carcinogenesis bioassay program performed by the US National Institute of Environmental Health Sciences. The database contains detailed descriptions of 6823 tests performed with more than 330 compounds and animals of different species, strains and sexes. The chemical structures are described at the atom and bond level, and in terms of various relevant structural properties. The goal of this paper is to investigate the effects that various levels of detail and amounts of information have on the resulting hypotheses, both quantitatively and qualitatively. We apply relational and propositional machine learning algorithms to learning problems formulated as regression or as classification tasks. In addition, these experiments have been conducted with two learning problems which are at different levels of detail. Quantitatively, our experiments indicate that additional information not necessarily improves accuracy. Qualitatively, a number of potential discoveries have been made by the algorithm for Relational Regression because it can utilize all the information contained in the relations of the database as well as in the numerical dependent variable.


CIMA: An interactive concept learning system for end-user applications

D. Maulsby and I.H. Witten. Cima: An interactive concept learning system for end-user applications. Applied Artificial Intelligence, 11(7-8):653-671, 1997.
[ bib | .ps | .pdf ]
Personalizable software agents will learn new tasks from their users. In many cases the most appropriate way for users to teach is to demonstrate examples. Learning complex concepts from examples alone is hard, but agents can exploit other forms of instruction that users might give, ranging from yes/no responses to ambiguous, incomplete hints. Agents can also exploit background knowledge customized for applications such as drawing, word processing and form-filling.

The Cima system learns generalized rules for classifying, generating, and modifying data, given examples, hints, and background knowledge. It copes with the ambiguity of user instructions by combining evidence from these sources. A dynamic bias manager generates candidate features (attribute values, functions or relation) from which the learning algorithm selects relevant ones and forms appropriate rules. When tested on dialogues observed in a prior user study on a simulated interface agent, the system achieved 95% of the learning efficiency observed in that study.


Learning agents: from user study to implementation

D. Maulsby and I.H. Witten. Learning agents: from user study to implementation. IEEE Computer, 30(11):36-44, 1997.
[ bib | .ps | .pdf ]
This paper describes the design, implementation and evaluation of an agent that learns to automate repetitive tasks within the human-computer interface. In contrast to most Artificial Intelligence projects, the design centers on a user study, with a human-simulated agent used to discover the interactions that people find natural. This study shows the ways in which users instinctively communicate via hints, or partially-specified, ambiguous, instructions. Hints may be communicated using speech, by pointing, or by selecting from menus. We develop a taxonomy of such instructions. The implementation demonstrates that computers can learn from examples and ambiguous hints.


Compression and explanation using hierarchical grammars

C.G. Nevill-Manning and I.H. Witten. Compression and explanation using hierarchical grammars. Computer Journal, 40(2/3):103-116, 1997.
[ bib | .ps.gz | .pdf ]
This paper describes an algorithm, called SEQUITUR, that identifies hierarchical structure in sequences of discrete symbols and uses that information for compression. On many practical sequences it performs well at both compression and structural inference, producing comprehensible descriptions of sequence structure in the form of grammar rules. The algorithm can be stated concisely in the form of two constraints on a context-free grammar. Inference is performed incrementally, the structure faithfully representing the input at all times. It can be implemented efficiently and operates in a time that is approximately linear in sequence length. Despite its simplicity and efficiency, SEQUITUR succeeds in inferring a range of interesting hierarchical structures from naturally occurring sequences.


Identifying hierarchical structure in sequence: a linear-time algorithm

C.G. Nevill-Manning and I.H. Witten. Identifying hierarchical structure in sequence: a linear-time algorithm. Artificial Intelligence Research, 7:67-82, 1997.
[ bib | .ps | .pdf ]
SEQUITUR is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing repeated phrases with a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence, which offers insights into its lexical structure. The algorithm is driven by two constraints that reduce the size of the grammar, and produce structure as a by-product. SEQUITUR breaks new ground by operating incrementally. Moreover, the method's simple structure permits a proof that it operates in space and time that is linear in the size of the input. Our implementation can process 50,000 symbols per second and has been applied to an extensive range of real world sequences.


Linear-time, incremental hierarchy inference for compression

C.G. Nevill-Manning and I.H. Witten. Linear-time, incremental hierarchy inference for compression. In J.A. Storer and M. Cohn, editors, Proc Data Compression Conference, pages 3-11, Los Alamitos, CA, 1997. IEEE Press.
[ bib | .ps | .pdf ]
Data compression and learning are, in some sense, two sides of the same coin. If we paraphrase Occam's razor by saying that a small theory is better than a larger theory with the same explanatory power, we can characterize data compression as a preoccupation with small, and learning as a preoccupation with better. Nevill-Manning et al. (1994) presented an algorithm, since dubbed SEQUITUR, that presents both faces of the compression/learning coin. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences as diverse as DNA and music are intuitively compelling.

In this paper, we present three new results that characterize SEQUITUR's computational and compression performance. First, we prove that SEQUITUR operates in time linear in n, the length of the input sequence, despite its ability to build a hierarchy as deep as log(n). Second, we show that a sequence can be compressed incrementally, improving on the non-incremental algorithm that was described by Nevill-Manning et al. (1994), and making on-line compression feasible. Third, we present an intriguing result that emerged during benchmarking; whereas PPMC (Moffat, 1990) outperforms SEQUITUR on most files in the Calgary corpus, SEQUITUR regains the lead when tested on multi-megabyte sequences. We make some tentative conclusions about the underlying reasons for this phenomenon, and about the nature of current compression benchmarking.


Compression-based pruning of decision lists

B. Pfahringer. Compression-based pruning of decision lists. In Proc European Conference on Machine Learning, volume 1224 of LNAI, pages 199-212, Prague, Czech Republic, 1997. Springer-Verlag.
[ bib | .ps | .pdf ]
We define a formula for estimating the coding costs of decision lists for propositional domains. This formula allows for multiple classes and both categorical and numerical attributes. For artificial domains the formula performs quite satisfactory, whereas results are rather mixed and inconclusive for natural domains. Further experiments lead to a principled simplification of the original formula which is robust in both artificial and natural domains. Simple hill-climbing search for the most compressive decision list significantly reduces the complexity of a given decision list while not impeding and sometimes even improving its predictive accuracy.


Probabilistic unification grammars

T.C. Smith and J.G. Cleary. Probabilistic unification grammars. In Workshop notes: ACSC'97 Australasian Natural Language Processing Summer Workshop, pages 25-32, 1997.
[ bib | .ps | .pdf ]
Recent research has shown that unification grammars can be adapted to incorporate statistical information, thus preserving the processing benefits of stochastic context-free grammars while offering an efficient mechanism for handling dependencies. While complexity studies show that a probabilistic unification grammar achieves an appropriately lower entropy estimate than an equivalent PCFG, the problem of parameter estimation prevents results from reflecting the empirical distribution. This paper describes how a PUG can be implemented as a Prolog DCG annotated with weights, and how the weights can be interpreted to give accurate entropy estimates. An algorithm for learning correct weights is provided, along with results from some complexity analyses.


Models of English text

W.J. Teahan and J.G. Cleary. Models of english text. In J.A. Storer and M. Cohn, editors, Proc Data Compression Conference, pages 12-21, Los Alamitos, CA, 1997. IEEE Press.
[ bib | .ps | .pdf ]
The problem of constructing models of English text is considered. A number of applications of such models including cryptology, spelling correction and speech recognition are reviewed. The best current models for English text have been the result of research into compression. Not only is this an important application of such models but the amount of compression provides a measure of how well such models perform. Three main classes of models are considered: character based models, word based models, and models which use auxiliary information in the form of parts of speech. These models are compared in terms of their memory usage and compression.


Inducing cost-sensitive trees via instance-weighting

K.M. Ting. Inducing cost-sensitive trees via instance-weighting. Technical Report 97/22, University of Waikato, Department of Computer Science, Hamilton, New Zealand, September 1997.
[ bib ]
We introduce an instance-weighting method to induce cost-sensitive trees in this paper. It is a generalization of the standard tree induction process where only the initial instance weights determine the type of tree (i.e., minimum error trees or minimum cost trees) to be induced. We demonstrate that it can be easily adopted to an existing tree learning algorithm.

Previous research gave insufficient evidence to support the fact that the greedy divide-and-conquer algorithm can effectively induce a truly cost-sensitive tree directly from the training data. We provide this empirical evidence in this paper. The algorithm employing the instance-weighting method is found to be comparable to or better than both C4.5 and C5 in terms of total misclassification costs, tree size and the number of high cost errors. The instance-weighting method is also simpler and more effective in implementation than a method based on altered priors.


Model combination in the multiple-data-batched scenario

K.M. Ting and B.T. Low. Model combination in the multiple-data-batched scenario. In Proc European Conference on Machine Learning, volume 1224 of LNAI, pages 250-265, Prague, Czech Republic, 1997. Springer-Verlag.
[ bib | .ps | .pdf ]
The approach of combining models learned from multiple batches of data provide an alternative to the common practice of learning one model from all the available data (i.e., the data combination approach). This paper empirically examines the base-line behaviour of the model combination approach in this multiple-data-batches scenario. We find that model combination can lead to better performance even if the disjoint batches of data are drawn randomly from a larger sample, and relate the relative performance of the two approaches to the learning curve of the classifier used.

The practical implication of our results is that one should consider using model combination rather than data combination, especially when multiple batches of data for the same task are readily available.

Another interesting result is that we empirically show that the near-asymptotic performance of a single model, in some classification task, can be significantly improved by combining multiple models (derived from the same algorithm) if the constituent models are substantially different and there is some regularity in the models to be exploited by the combination method used. Comparisons with known theoretical results are also provided.


Learning from batched data: model combination vs data combination

K.M. Ting, B.T. Low, and I.H. Witten. Learning from batched data: model combination vs data combination. Technical Report 97/14, University of Waikato, Department of Computer Science, Hamilton, New Zealand, May 1997.
[ bib ]
When presented with multiple batches of data, one can either combine them into a single batch before applying a machine learning procedure or learn from each batch independently and combine the resulting models. The former procedure, data combination, is straightforward; this paper investigates the latter, model combination. Given an appropriate combination method, one might expect model combination to prove superior when the data in each batch was obtained under somewhat different conditions or when different learning algorithms were used on the batches. Empirical results show that model combination often outperforms data combination even when the batches are drawn randomly from a single source of data and the same learning method is used on each. Moreover, this is not just an artifact of one particular method of combining models: it occurs with several different combination methods.

We relate this phenomenon to the learning curve of the classifiers being used. Early in the learning process when the learning curve is steep there is much to gain from data combination, but later when it becomes shallow there is less to gain and model combination achieves a greater reduction in variance and hence a lower error rate.

The practical implication of these results is that one should consider using model combination rather than data combination, especially when multiple batches of data for the same task are readily available. It is often superior even when the batches are drawn randomly from a single sample, and we expect its advantage to increase if genuine statistical differences between the batches exist.


Stacked generalization: when does it work?

K.M. Ting and I.H. Witten. Stacked generalization: when does it work? In Proc International Joint Conference on Artificial Intelligence, pages 866-871, Japan, 1997.
[ bib | .ps | .pdf ]
Stacked generalization is a general method of using a high-level model to combine lower-level models to achieve greater predictive accuracy. In this paper we address two crucial issues which have been considered to be a 'black art' in classification tasks ever since the introduction of stacked generalization in 1992 by Wolpert: the type of generalizer that is suitable to derive the higher-level model, and the kind of attributes that should be used as its input. We demonstrate the effectiveness of stacked generalization for combining three different types of learning algorithms.


Stacking bagged and dagged models

K.M. Ting and I.H. Witten. Stacking bagged and dagged models. In Proc International Conference on Machine Learning, pages 367-375, Tennessee, USA, 1997.
[ bib | .ps | .pdf ]
In this paper, we investigate the method of stacked generalization in combining models derived from different subsets of a training dataset by a single learning algorithm, as well as different algorithms. The simplest way to combine predictions from competing models is majority vote, and the effect of the sampling regime used to generate training subsets has already been studied in this context-when bootstrap samples are used the method is called bagging, and for disjoint samples we call it dagging. This paper extends these studies to stacked generalization, where a learning algorithm is employed to combine the models. This yields new methods dubbed bag-stacking and dag-stacking.

We demonstrate that bag-stacking and dag-stacking can be effective for classification tasks even when the training samples cover just a small fraction of the full dataset. In contrast to earlier bagging results, we show that bagging and bag-stacking work for stable as well as unstable learning algorithms, as do dagging and dag-stacking. We find that bag-stacking (dag-stacking) almost always has higher predictive accuracy than bagging (dagging), and we also show that bag-stacking models derived using two different algorithms is more effective than bagging.


Induction of model trees for predicting continuous classes

Y. Wang and I.H. Witten. Induction of model trees for predicting continuous classes. In Proc European Conference on Machine Learning Poster Papers, pages 128-137, Prague, Czech Republic, 1997.
[ bib | .ps | .pdf ]
Many problems encountered when applying machine learning in practice involve predicting a class that takes on a continuous numeric value, yet few machine learning schemes are able to do this. This paper describes a rational reconstruction of M5, a method developed by Quinlan (1992) for inducing trees of regression models. In order to accommodate data typically encountered in practice it is necessary to deal effectively with enumerated attributes and with missing values, and techniques devised by Breiman et al. (1984) are adapted for this purpose. The resulting system seems to outperform M5, based on the scanty published data that is available.


User manual - Weka: the Waikato environment for knowledge analysis

Waikato ML group. User manual - weka: the waikato environment for knowledge analysis, June 1997. This is out of date and refers to the old C version of WEKA, which we no longer support or distribute.
[ bib ]