[1]
|
Experience with OB1: an optimal Bayes decision tree learner
J.G. Cleary and L. Trigg.
Experience with ob1: an optimal bayes decision tree learner.
Technical Report 98/10, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, May 1998.
[ bib |
.ps |
.pdf ]
Machine learning algorithms for inferring decision trees typically choose a single best tree to describe the training data. Recent research has shown that classification performance can be significantly improved by voting predictions of multiple, independently produced decision trees. This paper describes an algorithm, OB1, that makes a weighted sum over many possible models. We describe one instance of OB1, that includes all possible decision trees as well as naive Bayesian models. OB1 is compared with a number of other decision tree and instance based learning alogrithms on some of the data sets from the UCI repository. Both an information gain and an accuracy measure are used for the comparison. On the information gain measure OB1 performs significantly better than all the other algorithms. On the accuracy measure it is significantly better than all the algorithms except naive Bayes which performs comparably to OB1.
|
|
[2]
|
Generating Accurate Rule Sets Without Global Optimization
Eibe Frank and Ian H. Witten.
Generating accurate rule sets without global optimization.
In Proc 15th International Conference on Machine Learning,
Madison, Wisconsin, pages 144-151. Morgan Kaufmann, 1998.
Also available as Working Paper 98/2, Department of Computer Science,
University of Waikato; January.
[ bib |
.ps |
.pdf ]
The two dominant schemes for rule-learning, C4.5 and RIPPER, both operate in two stages. First they induce an initial rule set and then they refine it using a rather complex optimization stage that discards (C4.5) or adjusts (RIPPER) individual rules to make them work better together. In contrast, this paper shows how good rule sets can be learned one rule at a time, without any need for global optimization. We present an algorithm for inferring rules by repeatedly generating partial decision trees, thus combining the two major paradigms for rule generation-creating rules from decision trees and the separate-and-conquer rule-learning technique. The algorithm is straightforward and elegant: despite this, experiments on standard datasets show that it produces rule sets that are as accurate as and of similar size to those generated by C4.5, and more accurate than RIPPER's. Moreover, it operates efficiently, and because it avoids postprocessing, does not suffer the extremely slow performance on pathological example sets for which the C4.5 method has been criticized.
|
|
[3]
|
Using a Permutation Test for Attribute Selection in Decision
Trees
Eibe Frank and Ian H. Witten.
Using a permutation test for attribute selection in decision trees.
In Proc 15th International Conference on Machine Learning,
Madison, Wisconsin, pages 152-160. Morgan Kaufmann, 1998.
[ bib |
.ps |
.pdf ]
Most techniques for attribute selection in decision trees are biased towards attributes with many values, and several ad hoc solutions to this problem have appeared in the machine learning literature. Statistical tests for the existence of an association with a prespecified significance level provide a well-founded basis for addressing the problem. However, many statistical tests are computed from a chi-squared distribution, which is only a valid approximation to the actual distribution in the large-sample case-and this patently does not hold near the leaves of a decision tree. An exception is the class of permutation tests. We describe how permutation tests can be applied to this problem. We choose one such test for further exploration, and give a novel two-stage method for applying it to select attributes in a decision tree. Results on practical datasets compare favorably with other methods that also adopt a pre-pruning strategy.
|
|
[4]
|
Using Model Trees for Classification
Eibe Frank, Yong Wang, Stuart Inglis, Geoffrey Holmes, and Ian H. Witten.
Using model trees for classification.
Machine Learning, 32(1):63-76, 1998.
Also available as Working Paper 97/12, Department of Computer
Science, University of Waikato; April.
[ bib |
.ps |
.pdf ]
Model trees, which are a type of decision tree with linear regression functions at the leaves, form the basis of a recent successful technique for predicting continuous numeric values. They can be applied to classification problems by employing a standard method of transforming a classification problem into a problem of function approximation. Surprisingly, using this simple transformation the model tree inducer M5', based on Quinlan's M5, generates more accurate classifiers than the state-of-the-art decision tree learner C5.0, particularly when most of the attributes are numeric.
|
|
[5]
|
Practical feature subset selection for machine learning
Mark Andrew Hall and Lloyd Smith.
Practical feature subset selection for machine learning.
In Proc 21st Australian Computer Science Conference, pages
181-191, Perth, Australia, 1998. Springer.
[ bib |
.ps |
.pdf ]
Machine learning algorithms automatically extract knowledge from machine readable information. Unfortunately, their success is usually dependant on the quality of the data that they operate on. If the data is inadequate, or contains extraneous and irrelevant information, machine learning algorithms may produce less accurate and less understandable results, or may fail to discover anything of use at all. Feature subset selectors are algorithms that attempt to identify and remove as much irrelevant and redundant information as possible prior to learning. Feature subset selection can result in enhanced performance, a reduced hypothesis search space, and, in some cases, reduced storage requirement. This paper describes a new feature selection algorithm that uses a correlation based heuristic to determine the goodness of feature subsets, and evaluates its effectiveness with three common machine learning algorithms. Experiments using a number of standard machine learning data sets are presented. Feature subset selection gave significant improvement for all three algorithms.
|
|
[6]
|
Optimisation techniques for a computer simulation of a pastoral dairy farm
Hart R., M.T. Larcombe, R.A. Sherlock, and L.A. Smith.
Optimisation techniques for a computer simulation of a pastoral dairy
farm.
Journal Computing and Electronics in Agriculture, 19:129-153,
1998.
[ bib |
.ps |
.pdf ]
This paper compares different methods of optimising the management variables in UDDER, a commercially-available computer simulation model of a pastoral dairy farm. The emphasis is on identifying the best optimisation strategy for this complex multi-dimensional system, taking the simulation model as a given constant. The optimisation methods studied are based on significantly different principles, with differing strengths and weaknesses: two hill-climbing algorithms (Nelder-Mead Simplex and Powell's Direction Set), and a genetic algorithm. Rather than examine all facets of dairy farm management, a single problem is optimised-that of maximising milkfat production while maintaining the health of the herd and pasture.
The results show that while the genetic algorithm can determine good regions within the search space quickly, it is considerably slower than either hill-climber at finding the optimal point within that region. The hillclimbers, in contrast, are fast but have a tendency to get trapped on local maxima and thus fail to find the true optimum. This led to the development of a hybrid algorithm which utilises the initial global search of the genetic algorithm, followed by the more efficient local search of a hill-climber. This hybrid algorithm discovered near-optimal points much more quickly than the genetic algorithm, and with more reliability than the hill-climber.
|
|
[7]
|
Predicting apple bruising using machine learning
Holmes G., S.J. Cunningham, B.T. Dela Rue, and A.F. Bollen.
Predicting apple bruising using machine learning.
In L.M.M. Tijskens and M.L.A.T.M. Hertog, editors, Model-IT
Conference, Acta Horticulturae, volume 476, pages 289-296, The Netherlands,
1998.
Also available as Working Paper 98/7, Department of Computer Science,
University of Waikato; April.
[ bib |
.ps |
.pdf ]
Many models have been used to describe the influence of internal or external factors on apple bruising. Few of these have addressed the application of derived relationships to the evaluation of commercial operations. From an industry perspective, a model must enable fruit to be rejected on the basis of a commercially significant bruise and must also accurately quantify the effects of various combinations of input features (such as cultivar, maturity, size, and so on) on bruise prediction. Input features must in turn have characteristics which are measurable commercially; for example the measure of force should be impact energy rather than energy absorbed. Further, as the commercial criteria for acceptable damage levels change, the model should be versatile enough to regenerate new bruise thresholds from existing data.
Machine learning is a burgeoning technology with a vast range of potential applications particularly in agriculture where large amounts of data can be readily collected [1]. The main advantage of using a machine learning method in an application is that the models built for prediction can be viewed and understood by the owner of the data who is in a position to determine the usefulness of the model, an essential component in a commercial environment.
Machine Learning software recently developed at Waikato University [2] offers potential as a prediction tool for the classification of bruising based on historical data. It gives the user the opportunity to select any number of measured input attributes and determine the influence of that combination on a range of bruise size categories. The user may require a high degree of accuracy in the classification and therefore prune the attributes or bruise classes accordingly, or alternatively seek to discover trends in the dataset (in which case a lower level of accuracy often clarifies implicit structures in the data).
Models such as the theory of elasticity suggest that impact energy and radius of curvature will have a significant effect on the bruise surface area. Cell structure is also thought to contribute to variation in bruise size. The experiment described in this paper uses the machine learning programs C5.4 [3] and M5' [4] to determine the influence of impact energy, radius of curvature and impact site location on bruise area.
|
|
[8]
|
Knowledge visualization techniques for machine learning
M. Humphrey, S.J. Cunningham, and I.H. Witten.
Knowledge visualization techniques for machine learning.
Intelligent Data Analysis, 2(1-4):333-347, 1998.
[ bib ]
Researchers in machine learning primarily use decision trees, production rules, and decision graphs for visualizing classification data, with the graphic form in which a structure is portrayed as having a strong influence on comprehensibility. We analyze the questions that, in our experience, end users of machine learning tend to ask of the structures inferred from their empirical data. By mapping these questions onto visualization tasks, we have created new graphical representations that show the flow of examples through a decision structure. The knowledge visualization techniques are particularly appropriate in helping to answer the questions that users typically ask, and we describe their use in discovering new properties of a data set. In the case of decision trees, an automated software tool has been developed to construct the visualizations.
|
|
[9]
|
Objective measurement of mushroom quality
N. Kusabs, F. Bollen, L. Trigg, G. Holmes, and S. Inglis.
Objective measurement of mushroom quality.
In Proc New Zealand Institute of Agricultural Science and the
New Zealand Society for Horticultural Science Annual Convention, page 51,
Hawke's Bay, New Zealand, 1998.
[ bib ]
This paper describes a methodology for establishing an objective measurement of mushroom quality, based on a set of measured physical attributes. These attributes were used by a machine learning tool to model quality based on the classification of professional graders (experts).
Four experts visually evaluate 300 mushrooms and graded them into three major and eight subclasses of commercial quality. Weight, firmness and images of the top and bottom of each mushroom were then measured. These physical parameters were used to construct a model of the grades assigned by the four experts. Grader consistency was also assessed by repeated classification (four repetitions) of two 100-mushroom sets. Grader repeatability ranged from 6 to 15% misclassification.
Misclassification by the model increased with grading complexity (16-32% for the three major grades and 17-63% for the eight subclasses). This depended heavily on grader classifications and the variables used for training the model (weight, firmness, estimated gill opening and red/green/blue histograms from the image). Accuracy of classification using various combinations of physical attributes, which indicate the relative weight placed on each attribute by the different graders, are detailed in the paper.
|
|
[10]
|
Predicting query times
McNab R., Y. Wang, I.H. Witten, and C. Gutwin.
Predicting query times.
In W.B. Croft and et al., editors, Proc ACM SIGIR Conference on
Research and Development in Information Retrieval, pages 355-356,
Melbourne, Australia, 1998. ACM Press.
[ bib ]
We outline the need for search engines to provide user feedback on the expected time for a query, describe a scheme for learning a model of query time by observing sample queries, and discuss the results obtained for a set of actual user queries on a document collection using the MG search engine.
|
|
[11]
|
User perceptions of machine learning
R.J. McQueen and G. Holmes.
User perceptions of machine learning.
In E.D. Hoadley and I. Benbasat, editors, Proc Association of
Information Systems Conference, pages 180-182, Maryland, Baltimore, 1998.
Association for Information Systems, Atlanta, GA.
[ bib |
.pdf ]
Machine learning has potential use in the understanding of information hidden in large datasets, but little is known about user's perceptions about the use of the technology. In this study, a number of datasets were solicited from agricultural researchers and processed using a machine learning workbench. The results were reported to the researchers, and then interviews were conducted with some of them to determine their perceptions about the use of machine learning as an additional analysis technique to traditional statistical analysis. An number of themes about their satisfaction with this technique were constructed from the interview transcripts, which generally indicate that machine learning may be able to contribute to analysis and understanding of these kinds of datasets.
|
|
[12]
|
User satisfaction with machine learning as a data analysis method in agricultural research
R.J. McQueen, G. Holmes, and L. Hunt.
User satisfaction with machine learning as a data analysis method in
agricultural research.
New Zealand Journal of Agricultural Research, 41(4):577-584,
1998.
[ bib |
.pdf ]
Machine learning has potential use in the understanding of information hidden in large datasets, but little is know about user perceptions about the use of the technology. In this study, a number of datasets were solicited from agricultural researchers and processed using a machine learning workbench. The results were reported to the researchers, and then interviews were conducted with some of them to determine their perceptions about the use of machine learning as an additional analysis technique to traditional statistical analysis. An number of themes about their satisfaction with this technique were constructed from the interview transcripts, which generally indicate that machine learning may be able to contribute to analysis and understanding of these kinds of datasets.
|
|
[13]
|
Inferring lexical and grammatical structure from sequences
C.G. Nevill-Manning and I.H. Witten.
Inferring lexical and grammatical structure from sequences.
In B. Carpentieri and et al., editors, Proc Compression and
Complexity of Sequences, pages 265-274, Los Alamitos, CA, 1997. IEEE
Computer Society Press.
Published as Nevill-Manning and Witten, 1998.
[ bib |
.ps |
.pdf ]
In a wide variety of sequences from various sources, from music and text to DNA and computer programs, two different but related kinds of structure can be discerned. First, some segments tend to be repeated exactly, such as motifs in music, words or phrases in text, identifiers and syntactic idioms in computer programs. Second, these segments interact with each other in variable but constrained ways. For example, in English text only certain syntactic word classes can appear after the word 'the'-many parts of speech (such as verbs) are necessarily excluded.
This paper shows how these kinds of structure can be inferred automatically from sequences. Let us make clear at the outset what aspects of sequence structure we are not concerned with. We take no account of numerical frequencies other than the 'more than once' that defines repetition. We do not consider any similarity metrics between the individual symbols that make up the sequence, nor between 'similar' subsequences such as transposed or transformed motifs in music. Finally, although we are certainly interested in nested repetitions, we do not analyze recursive structure in sequences-such as self-similarity in fractal sequences. All of these regularities are interesting ones that would be well worth taking into account, but lie beyond the scope of this paper.
|
|
[14]
|
Phrase hierarchy inference and compression in bounded space
C.G. Nevill-Manning and I.H. Witten.
Phrase hierarchy inference and compression in bounded space.
In J.A. Storer and M. Cohn, editors, Proc Data Compression
Conference, pages 179-188, Los Alamitos, CA, 1998. IEEE Press.
[ bib |
.ps |
.pdf ]
Text compression by inferring a phrase hierarch from the input is a recent technique that shows promise both as a compression scheme and as a machine learning method that extracts some comprehensible account of the structure of the input text. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences have been put to such eclectic uses as phrase browsing in digital libraries, music analysis, and inferring rules for fractal images.
|
|
[15]
|
Fast convergence with a greedy tag-phrase dictionary
T.C. Smith and R. Peeters.
Fast convergence with a greedy tag-phrase dictionary.
In J.A. Storer and M. Cohn, editors, Proc Data Compression
Conference, pages 33-42, Los Alamitos, CA, 1998. IEEE Press.
[ bib |
.ps |
.pdf ]
The best general-purpose compression schemes make their gains by estimating a probability distribution over all possible next symbols given the context established by some number of previous symbols. Such context models typically obtain good compression results for plain text by taking advantage of regularities in character sequences. Frequent words and syllables can be incorporated into the model quickly and thereafter used for reasonably accurate prediction. However, the precise context in which frequent patterns emerge is often extremely varied, and each new word or phrase immediately introduces new contexts which can adversely affect the compression rate.
A great deal of the structural regularity in a natural language is given rather more by properties of its grammar than by the orthographic transcription of its phonology. This implies that access to a grammatical abstraction might lead to good compression. While grammatical models have been used successfully for compressing computer programs [4], grammar-based compression of plain text has received little attention, primarily because of the difficulties associated with constructing a suitable natural language grammar. But even without a precise formulation of the syntax of a language, there is a linguistic abstraction which is easily accessed and which demonstrates a high degree of regularity which can be exploited for compression purposes-namely, lexical categories.
|
|
[16]
|
Learning feature-value grammars from plain text
T.C. Smith.
Learning feature-value grammars from plain text.
In David M.-W. Powers, editor, Proc Joint Conference on New
Methods in Language Processing and Computational Natural Language Learning,
pages 291-294, Sydney, Australia, 1998.
[ bib |
.ps |
.pdf ]
This paper outlines preliminary work on learning feature-value grammars from plain text. Common suffixes are gleaned from a word suffix tree and used to form a first approximation of how regular inflection is marked. Words are generalised into lexical categories according to regularities in how these suffixes appear in trigram context. The categories are expressed as a lexical feature whose value is given by the most frequent suffix for similar trigrams. The trigrams are subsequently used to infer agreement dependencies which are captured through the creation of additional feature structures. Agreement and linear precedence are preserved through the iterative creation of unification rules for pairs of terms.
|
|
[17]
|
Boosting trees for cost-sensitive classification
K.M. Ting and Z. Zheng.
Boosting trees for cost-sensitive classification.
In Proc European Conference on Machine Learning, volume 1398 of
LNAI, pages 190-195, Berlin, 1998.
[ bib |
.ps |
.pdf ]
This paper explores two boosting techniques for cost-sensitive tree classifications in the situation where misclassification costs change very often. Ideally, one would like to have only one induction, and use the induced model for different misclassification costs. Thus, it demands robustness of the induced model against cost changes. Combining multiple trees gives robust predictions against this change. We demonstrate that the two boosting techniques are a good solution in different aspects under this situation.
|
|
[18]
|
An entropy gain measure of numeric prediction performance
L. Trigg.
An entropy gain measure of numeric prediction performance.
Technical Report 98/11, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, May 1998.
[ bib |
.ps |
.pdf ]
Categorical classifier performance is typically evaluated with respect to error rate, expressed as a percentage of test instances that were not correctly classified. When a classifier produces multiple classifications for a test instance, the prediction is counted as incorrect (even if the correct class was one of the predictions). Although commonly used in the literature, error rate is a coarse measure of classifier performance, as it is based only on a single prediction offered for a test instance. Since many classifiers can produce a class distribution as a prediction, we should use this to provide a better measure of how much information the classifier is extracting from the domain.
Numeric classifiers are a relatively new development in machine learning, and as such there is no single performance measure that has become standard. Typically these machine learning schemes predict a single real number for each test instance, and the error between the predicted and actual value is used to calculate a myriad of performance measures such as correlation coefficient, root mean squared error, mean absolute error, relative absolute error, and root relative squared error. With so many performance measures it is difficult to establish an overall performance evaluation.
The next section describes a performance measure for machine learning schemes that attempts to overcome the problems with current measures. In addition, the same evaluation measure is used for categorical and numeric classifier.
|
|