[1]

Improving Browsing in Digital Libraries with Keyphrase Indexes
Carl Gutwin, Gordon W. Paynter, Ian H. Witten, Craig G. Nevill-Manning, and
Eibe Frank.
Improving browsing in digital libraries with keyphrase indexes.
Decision Support Systems, 27(1/2):81-104, 1999.


[2]

Developing innovative applications of machine learning
S.J. Cunningham and G. Holmes.
Developing innovative applications of machine learning.
In Proc Southeast Asia Regional Computer Confederation
Conference, Singapore, 1999.
The WEKA (Waikato Environment for Knowledge Analysis) system provides a comprehensive suite of facilities for applying data mining techniques to large data sets. This paper discusses a process model for analyzing data, and describes the support that WEKA provides for this model. The domain model 'learned' by the data mining algorithm can then be readily incorporated into a software application. The WEKA-based analysis and application construction process is illustrated through a case study in the agricultural domain: mushroom grading.


[3]

Market Basket Analysis of Library Circulation Data
Sally Jo Cunningham and Eibe Frank.
Market basket analysis of library circulation data.
In T. Gedeon, P. Wong, S. Halgamuge, N. Kasabov, D. Nauck, and
K. Fukushima, editors, Proc 6th International Conference on Neural
Information Processing, volume II, Perth, Australia, pages 825-830.
IEEE Service Center, 1999.
Market Basket Analysis algorithms have recently seen widespread use in analyzing consumer purchasing patterns, specifically in detecting products that are frequently purchased together. We apply the Apriori market basket analysis tool to the task of detecting subject classification categories that co-occur in transaction records of books borrowed from a university library. This information can be useful in directing users to additional portions of the collection that may contain documents relevant to their information needs, and in determining a library's physical layout. These results can also provide insight into the degree of scatter that the classification scheme induces in a particular collection of documents.
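As a rough illustration of the co-occurrence idea in this abstract, the sketch below counts pairs of subject categories that appear together in borrowing transactions and keeps those meeting a minimum support. It is not the Apriori tool used in the paper: it only handles pairs, and the function name `frequent_pairs`, the `min_support` parameter, and the sample call numbers are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return category pairs whose support (fraction of transactions
    containing both categories) is at least min_support."""
    n = len(transactions)
    pair_counts = Counter()
    for categories in transactions:
        # Count each unordered pair at most once per transaction.
        for pair in combinations(sorted(set(categories)), 2):
            pair_counts[pair] += 1
    return {
        pair: count / n
        for pair, count in pair_counts.items()
        if count / n >= min_support
    }


# Hypothetical borrowing records: the subject classes borrowed together.
loans = [
    {"QA76", "Z699"},
    {"QA76", "Z699", "HF5548"},
    {"QA76", "HF5548"},
    {"PR6000"},
]
pairs = frequent_pairs(loans, min_support=0.5)
for pair, support in sorted(pairs.items()):
    print(pair, support)
```

A full Apriori implementation would extend frequent pairs to larger itemsets and prune candidates whose subsets are infrequent; the pair-only version is enough to show which categories tend to be borrowed together.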


[4]

Domain-Specific Keyphrase Extraction
Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and Craig G.
Nevill-Manning.
Domain-specific keyphrase extraction.
In Proc 16th International Joint Conference on Artificial
Intelligence, Stockholm, Sweden, pages 668-673. Morgan Kaufmann, 1999.
Automatic keyphrase extraction is a promising area for applied machine learning because keyphrases are an important means for document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore, it is highly desirable to automate the keyphrase extraction process.
This paper presents a simple procedure for keyphrase extraction based on the naive Bayes learning scheme, which is shown to perform comparably to the state of the art. It goes on to explain how the performance of this procedure can be boosted by automatically tailoring the extraction process to the particular document collection at hand. Results on a large collection of technical reports in computer science show that the quality of the extracted keyphrases improves significantly if domain-specific information is exploited.


[5]

Making Better Use of Global Discretization
Eibe Frank and Ian H. Witten.
Making better use of global discretization.
In Proc 16th International Conference on Machine Learning,
Bled, Slovenia, pages 115-123. Morgan Kaufmann, 1999.
Before applying learning algorithms to datasets, practitioners often globally discretize any numeric attributes. If the algorithm cannot handle numeric attributes directly, prior discretization is essential. Even if it can, prior discretization often accelerates induction, and may produce simpler and more accurate classifiers.
As it is generally done, global discretization denies the learning algorithm any chance of taking advantage of the ordering information implicit in numeric attributes. However, a simple transformation of discretized data preserves this information in a form that learners can use. We show that, compared to using the discretized data directly, this transformation significantly increases the accuracy of decision trees built by C4.5, decision lists built by PART, and decision tables built using the wrapper method, on several benchmark datasets. Moreover, it can significantly reduce the size of the resulting classifiers.
This simple technique makes global discretization an even more useful tool for data preprocessing.
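The abstract does not spell out the transformation, so the following is only one plausible sketch of how ordering information can be kept after discretization: an attribute discretized into n ordered bins is re-expressed as n-1 binary indicators, the i-th of which says whether the value lies above bin i. The function name `ordinal_to_binary` and the zero-based bin indexing are assumptions made for the example.

```python
def ordinal_to_binary(bin_index, n_bins):
    """Encode an ordered bin index (0 .. n_bins-1) as n_bins-1 binary
    indicators; indicator i is 1 when the value lies above bin i."""
    if not 0 <= bin_index < n_bins:
        raise ValueError("bin_index must be in range(n_bins)")
    return [1 if bin_index > i else 0 for i in range(n_bins - 1)]


# A numeric attribute discretized into four ordered bins.
for b in range(4):
    print(b, ordinal_to_binary(b, 4))
```

With this encoding, a single test on one indicator splits the bins into a lower and an upper group, which is the kind of ordered split a learner cannot express when the bins are treated as unordered nominal values.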


[6]

Correlation-based feature selection for machine learning
M.A. Hall.
Correlation-based feature selection for machine learning.
PhD thesis, University of Waikato, Department of Computer Science,
Hamilton, New Zealand, April 1999.
A central problem in machine learning is identifying a representative set of features from which to construct a classification model for a particular task. This thesis addresses the problem of feature selection for machine learning through a correlation-based approach. The central hypothesis is that good feature sets contain features that are highly correlated with the class, yet uncorrelated with each other. A feature evaluation formula, based on ideas from test theory, provides an operational definition of this hypothesis. CFS (Correlation-based Feature Selection) is an algorithm that couples this evaluation formula with an appropriate correlation measure and a heuristic search strategy.
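The abstract does not state the evaluation formula. A commonly cited form of the CFS subset merit is k * r_cf / sqrt(k + k(k-1) * r_ff), where k is the number of features, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. The sketch below assumes that form and takes the correlations as precomputed inputs; the function and parameter names are illustrative only.

```python
from math import sqrt


def cfs_merit(feature_class_corrs, feature_feature_corrs):
    """Merit of a feature subset, assuming the form
    k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    feature_class_corrs: one correlation with the class per feature.
    feature_feature_corrs: one correlation per pair of features.
    """
    k = len(feature_class_corrs)
    if k == 0:
        return 0.0
    r_cf = sum(feature_class_corrs) / k
    if feature_feature_corrs:
        r_ff = sum(feature_feature_corrs) / len(feature_feature_corrs)
    else:
        r_ff = 0.0  # a single feature has no pairs
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Two equally predictive features: independent versus fully redundant.
print(cfs_merit([0.5, 0.5], [0.0]))
print(cfs_merit([0.5, 0.5], [1.0]))
```

Under this form, adding a second feature that is uncorrelated with the first raises the merit, while adding a fully redundant one leaves it at the single-feature value, which matches the hypothesis stated in the abstract.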


[7]

Feature selection for discrete and numeric class machine learning
M.A. Hall.
Feature selection for discrete and numeric class machine learning.
Technical Report 99/4, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, April 1999.
Algorithms for feature selection fall into two broad categories: wrappers use the learning algorithm itself to evaluate the usefulness of features, while filters evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems.
This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. Experiments using the new method as a preprocessing step for naive Bayes, instance-based learning, decision trees, locally weighted regression, and model trees show it to be an effective feature selector: it reduces the dimensionality of the data by more than sixty percent in most cases without negatively affecting accuracy. Also, decision and model trees built from the preprocessed data are often significantly smaller.


[8]

Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper
Mark Andrew Hall and Lloyd Smith.
Feature selection for machine learning: comparing a correlation-based
filter approach to the wrapper.
In A. N. Kumar and I. Russel, editors, Proc Florida Artificial
Intelligence Symposium, pages 235-239, Orlando, Florida, 1999. AAAI Press.
Feature selection is often an essential data processing step prior to applying a learning algorithm. The removal of irrelevant and redundant information often improves the performance of machine learning algorithms. There are two common approaches: a wrapper uses the intended learning algorithm itself to evaluate the usefulness of features, while a filter evaluates features according to heuristics based on general characteristics of the data. The wrapper approach is generally considered to produce better feature subsets but runs much more slowly than a filter. This paper describes a new filter approach to feature selection that uses a correlation-based heuristic to evaluate the worth of feature subsets. When applied as a data preprocessing step for two common machine learning algorithms, the new method compares favourably with the wrapper but requires much less computation.


[9]

A diagnostic tool for tree based supervised classification learning algorithms
G. Holmes and L. Trigg.
A diagnostic tool for tree based supervised classification learning
algorithms.
In Proc Sixth International Conference on Neural Information
Processing (ICONIP'99), volume II, pages 514-519, Perth, Western Australia,
November 1999.
The process of developing applications of machine learning and data mining that employ supervised classification algorithms includes the important step of knowledge verification. Interpretable output is presented to a user so that they can verify that the knowledge contained in the output makes sense for the given application. As the development of an application is an iterative process it is quite likely that a user would wish to compare models constructed at various times or stages.
One crucial stage where comparison of models is important is when the accuracy of a model is being estimated, typically using some form of cross-validation. This stage is used to establish an estimate of how well a model will perform on unseen data. This is vital information to present to a user, but it is also important to show the degree of variation between models obtained from the entire dataset and models obtained during cross-validation. In this way it can be verified that the cross-validation models are at least structurally aligned with the model garnered from the entire dataset.
This paper presents a diagnostic tool for the comparison of tree-based supervised classification models. The method is adapted from work on approximate tree matching and applied to decision trees. The tool is described together with experimental results on standard datasets.


[10]

Generating Rule Sets from Model Trees
Geoffrey Holmes, Mark Hall, and Eibe Frank.
Generating rule sets from model trees.
In Proc 12th Australian Joint Conference on Artificial
Intelligence, Sydney, Australia, pages 1-12. Springer, 1999.
Model trees (decision trees with linear models at the leaf nodes) have recently emerged as an accurate method for numeric prediction that produces understandable models. However, it is known that decision lists (ordered sets of If-Then rules) have the potential to be more compact and therefore more understandable than their tree counterparts.
We present an algorithm for inducing simple, accurate decision lists from model trees. Model trees are built repeatedly and the best rule is selected at each iteration. This method produces rule sets that are as accurate as, but smaller than, the model tree constructed from the entire dataset. Experimental results for various heuristics which attempt to find a compromise between rule accuracy and rule coverage are reported. We show that our method produces comparably accurate and smaller rule sets than the commercial state-of-the-art rule learning system Cubist.


[11]

Fitting a mixture model to three-mode three-way data with categorical and continuous variables
L.A. Hunt and K.E. Basford.
Fitting a mixture model to three-mode three-way data with categorical
and continuous variables.
Journal of Classification, 16(2):283-296, 1999.
The mixture likelihood approach to clustering is most often used with two-mode two-way data to cluster one of the modes (e.g., the entities) into homogeneous groups on the basis of the other mode (e.g., the attributes). In this case, the attributes can either be continuous or categorical. When the data set consists of a three-mode three-way array (e.g., attributes measured on entities in different situations), an analogous procedure is needed to enable the clustering of the entities (i.e., one of the modes) on the basis of both of the other modes simultaneously (i.e., the attributes measured in different situations). In this paper, it is shown that the finite mixture approach to clustering can be extended to analyze three-mode three-way data where some of the attributes are continuous and some are categorical. The methodology is illustrated by clustering the genotypes in a three-way soybean data set where various attributes were measured on genotypes grown in several environments.


[12]

Mixture model clustering using the MULTIMIX program
L. Hunt and M. Jorgensen.
Mixture model clustering using the multimix program.
Australian and New Zealand Journal of Statistics,
41(2):153-171, 1999.
Hunt (1996) implemented the finite mixture model approach to clustering in a program called MULTIMIX. The program is designed to cluster multivariate data that have categorical and continuous variables and that possibly contain missing values. This paper describes the approach taken to design MULTIMIX and how some of the statistical problems were dealt with. As an example, the program is used to cluster a large medical dataset.


[13]

Issues in stacked generalization
K.M. Ting and I.H. Witten.
Issues in stacked generalization.
Journal of Artificial Intelligence Research, 10:271-289, May
1999.
Stacked generalization is a general method of using a high-level model to combine lower-level models to achieve greater predictive accuracy. In this paper we address two crucial issues which have been considered to be a 'black art' in classification tasks ever since the introduction of stacked generalization in 1992 by Wolpert: the type of generalizer that is suitable to derive the higher-level model, and the kind of attributes that should be used as its input. We find that best results are obtained when the higher-level model combines the confidence (and not just the predictions) of the lower-level ones.
We demonstrate the effectiveness of stacked generalization for combining three different types of learning algorithms for classification tasks. We also compare the performance of stacked generalization with majority vote and published results of arcing and bagging.
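To make the "confidence, not just predictions" point concrete, the sketch below builds the training data for a higher-level model from cross-validated class probabilities of lower-level models. It is a minimal illustration, not the paper's experimental setup: the two toy learners, the fold assignment by index, and all names are assumptions made for the example.

```python
from collections import Counter


def fit_prior(X, y, classes):
    """Toy lower-level learner: always outputs training class frequencies."""
    counts = Counter(y)
    probs = [counts[c] / len(y) for c in classes]
    return lambda x: probs


def fit_nearest_mean(X, y, classes):
    """Toy lower-level learner for 1-D inputs: confidence in a class
    decays with the distance from x to that class's training mean."""
    means = {}
    for c in classes:
        values = [x for x, label in zip(X, y) if label == c]
        means[c] = sum(values) / len(values) if values else None

    def predict(x):
        scores = [
            0.0 if means[c] is None else 1.0 / (1.0 + abs(x - means[c]))
            for c in classes
        ]
        total = sum(scores)
        if total == 0:
            return [1.0 / len(classes)] * len(classes)
        return [s / total for s in scores]

    return predict


def level1_dataset(learners, X, y, classes, n_folds):
    """Build higher-level training rows: for each instance, the class
    probabilities from every lower-level model trained without it."""
    rows = [None] * len(X)
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        held_out = [i for i in range(len(X)) if i % n_folds == fold]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        models = [fit(X_train, y_train, classes) for fit in learners]
        for i in held_out:
            # Concatenate each model's probability vector into one row.
            rows[i] = [p for model in models for p in model(X[i])]
    return rows


X = [0.0, 0.2, 0.1, 1.0, 0.9, 1.1]
y = ["a", "a", "a", "b", "b", "b"]
rows = level1_dataset([fit_prior, fit_nearest_mean], X, y, ["a", "b"], n_folds=3)
for row, label in zip(rows, y):
    print([round(p, 2) for p in row], label)
```

Any higher-level learner can then be trained on `rows` paired with `y`. Replacing each probability vector with a single predicted label would give the predictions-only variant that the abstract reports as less effective.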


[14]

Clustering with finite data from semiparametric mixture distributions
Y. Wang and I.H. Witten.
Clustering with finite data from semiparametric mixture
distributions.
In Proc Symposium on the Interface: Models, Predictions, and
Computing, Schaumburg, Illinois, 1999.
Existing clustering methods for the semiparametric mixture distribution perform well as the volume of data increases. However, they all suffer from a serious drawback in finite-data situations: small outlying groups of data points can be completely ignored in the clusters that are produced, no matter how far away they lie from the major clusters. This can result in unbounded loss if the loss function is sensitive to the distance between clusters.
This paper proposes a new distance-based clustering method that overcomes the problem by avoiding global constraints. Experimental results illustrate its superiority to existing methods when small clusters are present in finite data sets; they also suggest that it is more accurate and stable than other methods even when there are no small clusters.


[15]

Pace regression
Y. Wang and I.H. Witten.
Pace regression.
Technical Report 99/12, University of Waikato, Department of Computer
Science, Hamilton, New Zealand, September 1999.
This paper articulates a new method of linear regression, pace regression, that addresses many drawbacks of standard regression reported in the literature, particularly the subset selection problem. Pace regression improves on classical ordinary least squares (OLS) regression by evaluating the effect of each variable and using a clustering analysis to improve the statistical basis for estimating their contribution to the overall regression. As well as outperforming OLS, it also outperforms, in a remarkably general sense, other linear modeling techniques in the literature, including subset selection procedures, which seek a reduction in dimensionality that falls out as a natural byproduct of pace regression. The paper defines six procedures that share the fundamental idea of pace regression, all of which are theoretically justified in terms of asymptotic performance. Experiments confirm the performance improvement over other techniques.


[16]

KEA: Practical Automatic Keyphrase Extraction
Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G.
Nevill-Manning.
KEA: Practical automatic keyphrase extraction.
In Proc 4th ACM conference on Digital Libraries, Berkeley, CA,
pages 254-255. ACM, August 1999.
Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea's effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available.


[17]

Weka: Practical Machine Learning Tools and Techniques with Java Implementations
Ian H. Witten, Eibe Frank, Len Trigg, Mark Hall, Geoffrey Holmes, and Sally Jo
Cunningham.
Weka: Practical machine learning tools and techniques with Java
implementations.
In Nikola Kasabov and Kitty Ko, editors, Proceedings of the
ICONIP/ANZIIS/ANNES'99 Workshop on Emerging Knowledge Engineering and
Connectionist-Based Information Systems, pages 192-196, 1999.
Dunedin, New Zealand.
The Waikato Environment for Knowledge Analysis (Weka) is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning and data mining algorithms. Weka is freely available on the World-Wide Web and accompanies a new text on data mining [1] which documents and fully explains all the algorithms it contains. Applications written using the Weka class libraries can be run on any computer with a Web browsing capability; this allows users to apply machine learning techniques to their own data regardless of computer platform.

