Improving Browsing in Digital Libraries with Keyphrase Indexes

Carl Gutwin, Gordon W. Paynter, Ian H. Witten, Craig G. Nevill-Manning, and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Decision Support Systems, 27(1/2):81-104, 1999.
[ bib | .ps.gz ]

Developing innovative applications of machine learning

S.J. Cunningham and G. Holmes. Developing innovative applications of machine learning. In Proc Southeast Asia Regional Computer Confederation Conference, Singapore, 1999.
[ bib | .ps | .pdf ]
The WEKA (Waikato Environment for Knowledge Analysis) system provides a comprehensive suite of facilities for applying data mining techniques to large data sets. This paper discusses a process model for analyzing data, and describes the support that WEKA provides for this model. The domain model 'learned' by the data mining algorithm can then be readily incorporated into a software application. The WEKA-based analysis and application construction process is illustrated through a case study in the agricultural domain: mushroom grading.


Market Basket Analysis of Library Circulation Data

Sally Jo Cunningham and Eibe Frank. Market basket analysis of library circulation data. In T. Gedeon, P. Wong, S. Halgamuge, N. Kasabov, D. Nauck, and K. Fukushima, editors, Proc 6th International Conference on Neural Information Processing, volume II, Perth, Australia, pages 825-830. IEEE Service Center, 1999.
[ bib | .ps | .pdf ]
Market Basket Analysis algorithms have recently seen widespread use in analyzing consumer purchasing patterns, specifically in detecting products that are frequently purchased together. We apply the Apriori market basket analysis tool to the task of detecting subject classification categories that co-occur in transaction records of books borrowed from a university library. This information can be useful in directing users to additional portions of the collection that may contain documents relevant to their information needs, and in determining a library's physical layout. These results can also provide insight into the degree of scatter that the classification scheme induces in a particular collection of documents.
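As a rough illustration of the co-occurrence detection described above, the following sketch counts pairs of subject categories that are borrowed together often enough. It is a minimal two-pass version of the Apriori idea and not the tool used in the paper; the function name `frequent_pairs`, the sample records, and the support threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for category pairs that meet min_support.

    Uses the Apriori pruning idea: a pair can only be frequent if both of
    its members are frequent on their own.
    """
    n = len(transactions)
    baskets = [set(t) for t in transactions]

    # Pass 1: support of each individual category.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}

    # Pass 2: count only pairs built from frequent categories.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(basket & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {p: c / n for p, c in pair_counts.items() if c / n >= min_support}


# Hypothetical loan records: the subject classes borrowed in one transaction.
loans = [
    {"QA", "T"},
    {"QA", "T", "HF"},
    {"QA", "HF"},
    {"T", "QA"},
    {"PR"},
]
pairs = frequent_pairs(loans, min_support=0.4)
```

With these made-up records, the pairs that survive are the ones a library might use to point a borrower from one part of the collection to another.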


Domain-Specific Keyphrase Extraction

Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and Craig G. Nevill-Manning. Domain-specific keyphrase extraction. In Proc 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, pages 668-673. Morgan Kaufmann, 1999.
[ bib | .ps | .pdf ]
Automatic keyphrase extraction is a promising area for applied machine learning because keyphrases are an important means for document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore, it is highly desirable to automate the keyphrase extraction process.

This paper presents a simple procedure for keyphrase extraction based on the naive Bayes learning scheme, which is shown to perform comparably to the state-of-the-art. It goes on to explain how the performance of this procedure can be boosted by automatically tailoring the extraction process to the particular document collection at hand. Results on a large collection of technical reports in computer science show that the quality of the extracted keyphrases improves significantly if domain-specific information is exploited.


Making Better Use of Global Discretization

Eibe Frank and Ian H. Witten. Making better use of global discretization. In Proc 16th International Conference on Machine Learning, Bled, Slovenia, pages 115-123. Morgan Kaufmann, 1999.
[ bib | .ps | .pdf ]
Before applying learning algorithms to datasets, practitioners often globally discretize any numeric attributes. If the algorithm cannot handle numeric attributes directly, prior discretization is essential. Even if it can, prior discretization often accelerates induction, and may produce simpler and more accurate classifiers.

As it is generally done, global discretization denies the learning algorithm any chance of taking advantage of the ordering information implicit in numeric attributes. However, a simple transformation of discretized data preserves this information in a form that learners can use. We show that, compared to using the discretized data directly, this transformation significantly increases the accuracy of decision trees built by C4.5, decision lists built by PART, and decision tables built using the wrapper method, on several benchmark datasets. Moreover, it can significantly reduce the size of the resulting classifiers.

This simple technique makes global discretization an even more useful tool for data preprocessing.
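The abstract does not spell out the transformation. One ordering-preserving encoding, assumed here for illustration, replaces a discretized attribute with k ordered intervals by k-1 binary indicators, where indicator i records whether the value lies above interval boundary i. The function name `ordinal_to_binary` is hypothetical.

```python
def ordinal_to_binary(bin_index, num_bins):
    """Encode a 0-based ordered bin index as num_bins - 1 binary indicators.

    Indicator i is 1 when the value falls in a bin above boundary i, so a
    test on one indicator acts like a threshold test on the original
    numeric attribute.
    """
    if not 0 <= bin_index < num_bins:
        raise ValueError("bin_index out of range")
    return [1 if bin_index > i else 0 for i in range(num_bins - 1)]


# Encode every bin of a four-interval discretization.
encoded = [ordinal_to_binary(b, 4) for b in range(4)]
```

Under this encoding, neighbouring intervals differ in exactly one indicator, which is how the ordering survives in a form a learner for nominal attributes can exploit.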


Correlation-based feature selection for machine learning

M.A. Hall. Correlation-based feature selection for machine learning. PhD thesis, University of Waikato, Department of Computer Science, Hamilton, New Zealand, April 1999.
[ bib | .ps | .pdf ]
A central problem in machine learning is identifying a representative set of features from which to construct a classification model for a particular task. This thesis addresses the problems of feature selection for machine learning through a correlation based approach. The central hypothesis is that good feature sets contain features that are highly correlated with the class, yet uncorrelated with each other. A feature evaluation formula, based on ideas from test theory, provides an operational definition of this hypothesis. CFS (Correlation based Feature Selection) is an algorithm that couples this evaluation formula with an appropriate correlation measure and a heuristic search strategy.
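The hypothesis above can be made concrete with a subset merit score of the form k * r_cf / sqrt(k + k(k-1) * r_ff), where k is the number of features, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. The sketch below assumes the correlations have already been computed by some measure; the function name `cfs_merit` and the sample values are illustrative only.

```python
from math import sqrt


def cfs_merit(class_corrs, feature_corrs):
    """Score a feature subset from precomputed correlations.

    class_corrs:   correlation of each feature in the subset with the class.
    feature_corrs: correlation of each pair of features in the subset.
    """
    k = len(class_corrs)
    if k == 0:
        return 0.0
    mean_cf = sum(class_corrs) / k
    mean_ff = sum(feature_corrs) / len(feature_corrs) if feature_corrs else 0.0
    return k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)


# Two features, each correlated 0.6 with the class (made-up values).
complementary = cfs_merit([0.6, 0.6], [0.0])  # uncorrelated with each other
redundant = cfs_merit([0.6, 0.6], [1.0])      # perfectly correlated with each other
```

The redundant pair scores no better than a single feature with correlation 0.6, while the complementary pair scores higher, which matches the stated hypothesis.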


Feature selection for discrete and numeric class machine learning

M.A. Hall. Feature selection for discrete and numeric class machine learning. Technical Report 99/4, University of Waikato, Department of Computer Science, Hamilton, New Zealand, April 1999.
[ bib | .ps | .pdf ]
Algorithms for feature selection fall into two broad categories: wrappers use the learning algorithm itself to evaluate the usefulness of features, while filters evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems.

This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. Experiments using the new method as a preprocessing step for naive Bayes, instance-based learning, decision trees, locally weighted regression, and model trees show it to be an effective feature selector: it reduces the dimensionality of the data by more than sixty percent in most cases without negatively affecting accuracy. Also, decision and model trees built from the pre-processed data are often significantly smaller.


Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper

Mark Andrew Hall and Lloyd Smith. Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. In A. N. Kumar and I. Russel, editors, Proc Florida Artificial Intelligence Symposium, pages 235-239, Orlando, Florida, 1999. AAAI Press.
[ bib | .ps | .pdf ]
Feature selection is often an essential data processing step prior to applying a learning algorithm. The removal of irrelevant and redundant information often improves the performance of machine learning algorithms. There are two common approaches: a wrapper uses the intended learning algorithm itself to evaluate the usefulness of features, while a filter evaluates features according to heuristics based on general characteristics of the data. The wrapper approach is generally considered to produce better feature subsets but runs much more slowly than a filter. This paper describes a new filter approach to feature selection that uses a correlation-based heuristic to evaluate the worth of feature subsets. When applied as a data preprocessing step for two common machine learning algorithms, the new method compares favourably with the wrapper but requires much less computation.


A diagnostic tool for tree based supervised classification learning algorithms

G. Holmes and L. Trigg. A diagnostic tool for tree based supervised classification learning algorithms. In Proc Sixth International Conference on Neural Information Processing (ICONIP'99), volume II, pages 514-519, Perth, Western Australia, November 1999.
[ bib | .ps | .pdf ]
The process of developing applications of machine learning and data mining that employ supervised classification algorithms includes the important step of knowledge verification. Interpretable output is presented to a user so that they can verify that the knowledge contained in the output makes sense for the given application. As the development of an application is an iterative process it is quite likely that a user would wish to compare models constructed at various times or stages.

One crucial stage where comparison of models is important is when the accuracy of a model is being estimated, typically using some form of cross-validation. This stage is used to establish an estimate of how well a model will perform on unseen data. This is vital information to present to a user, but it is also important to show the degree of variation between models obtained from the entire dataset and models obtained during cross-validation. In this way it can be verified that the cross-validation models are at least structurally aligned with the model garnered from the entire dataset.

This paper presents a diagnostic tool for the comparison of tree-based supervised classification models. The method is adapted from work on approximate tree matching and applied to decision trees. The tool is described together with experimental results on standard datasets.


Generating Rule Sets from Model Trees

Geoffrey Holmes, Mark Hall, and Eibe Frank. Generating rule sets from model trees. In Proc 12th Australian Joint Conference on Artificial Intelligence, Sydney, Australia, pages 1-12. Springer, 1999.
[ bib | .ps | .pdf ]
Model trees - decision trees with linear models at the leaf nodes - have recently emerged as an accurate method for numeric prediction that produces understandable models. However, it is known that decision lists - ordered sets of If-Then rules - have the potential to be more compact and therefore more understandable than their tree counterparts.

We present an algorithm for inducing simple, accurate decision lists from model trees. Model trees are built repeatedly and the best rule is selected at each iteration. This method produces rule sets that are as accurate as, but smaller than, the model tree constructed from the entire dataset. Experimental results are reported for various heuristics that attempt to find a compromise between rule accuracy and rule coverage. We show that our method produces rule sets that are comparably accurate to, and smaller than, those of the commercial state-of-the-art rule learning system Cubist.


Fitting a mixture model to three-mode three-way data with categorical and continuous variables

L.A. Hunt and K.E. Basford. Fitting a mixture model to three-mode three-way data with categorical and continuous variables. Journal of Classification, 16(2):283-296, 1999.
[ bib ]
The mixture likelihood approach to clustering is most often used with two-mode two-way data to cluster one of the modes (e.g., the entities) into homogeneous groups on the basis of the other mode (e.g., the attributes). In this case, the attributes can either be continuous or categorical. When the data set consists of a three-mode three-way array (e.g., attributes measured on entities in different situations), an analogous procedure is needed to enable the clustering of the entities (i.e., one of the modes) on the basis of both of the other modes simultaneously (i.e., the attributes measured in different situations). In this paper, it is shown that the finite mixture approach to clustering can be extended to analyze three-mode three-way data where some of the attributes are continuous and some are categorical. The methodology is illustrated by clustering the genotypes in a three-way soybean data set where various attributes were measured on genotypes grown in several environments.


Mixture model clustering using the MULTIMIX program

L. Hunt and M. Jorgensen. Mixture model clustering using the multimix program. Australian and New Zealand Journal of Statistics, 41(2):153-171, 1999.
[ bib ]
Hunt (1996) implemented the finite mixture model approach to clustering in a program called MULTIMIX. The program is designed to cluster multivariate data that have categorical and continuous variables and that possibly contain missing values. This paper describes the approach taken to design MULTIMIX and how some of the statistical problems were dealt with. As an example, the program is used to cluster a large medical dataset.


Issues in stacked generalization

K.M. Ting and I.H. Witten. Issues in stacked generalization. Journal of Artificial Intelligence Research, 10:271-289, May 1999.
[ bib | .ps | .pdf ]
Stacked generalization is a general method of using a high-level model to combine lower-level models to achieve greater predictive accuracy. In this paper we address two crucial issues which have been considered to be a 'black art' in classification tasks ever since the introduction of stacked generalization in 1992 by Wolpert: the type of generalizer that is suitable to derive the higher-level model, and the kind of attributes that should be used as its input. We find that best results are obtained when the higher-level model combines the confidence (and not just the predictions) of the lower-level ones.

We demonstrate the effectiveness of stacked generalization for combining three different types of learning algorithms for classification tasks. We also compare the performance of stacked generalization with majority vote and published results of arcing and bagging.
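To illustrate the distinction between combining confidences and combining predictions, the sketch below builds the input vector for a higher-level model from the outputs of lower-level models. It is only a toy: the lower-level "models" are hypothetical callables that return class probabilities, and in practice the higher-level training data would be generated by cross-validating the lower-level models on the training set.

```python
def level1_features(models, x, use_confidence=True):
    """Build the higher-level model's input vector for instance x.

    Each model maps x to a list of class probabilities. With
    use_confidence=True the probabilities are passed on unchanged;
    otherwise only the predicted class survives, as a one-hot vector.
    """
    features = []
    for model in models:
        probs = model(x)
        if use_confidence:
            features.extend(probs)
        else:
            best = max(range(len(probs)), key=probs.__getitem__)
            features.extend(1.0 if c == best else 0.0 for c in range(len(probs)))
    return features


# Hypothetical lower-level models for a two-class problem.
def confident_model(x):
    return [0.9, 0.1]


def hesitant_model(x):
    return [0.55, 0.45]


models = [confident_model, hesitant_model]
with_conf = level1_features(models, x=None, use_confidence=True)
preds_only = level1_features(models, x=None, use_confidence=False)
```

With predictions only, the two models look identical to the higher-level learner; with confidences, it can see that one model is far less certain than the other.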


Clustering with finite data from semi-parametric mixture distributions

Y. Wang and I.H. Witten. Clustering with finite data from semi-parametric mixture distributions. In Proc Symposium on the Interface: Models, Predictions, and Computing, Schaumburg, Illinois, 1999.
[ bib ]
Existing clustering methods for the semi-parametric mixture distribution perform well as the volume of data increases. However, they all suffer from a serious drawback in finite-data situations: small outlying groups of data points can be completely ignored in the clusters that are produced, no matter how far away they lie from the major clusters. This can result in unbounded loss if the loss function is sensitive to the distance between clusters.

This paper proposes a new distance-based clustering method that overcomes the problem by avoiding global constraints. Experimental results illustrate its superiority to existing methods when small clusters are present in finite data sets; they also suggest that it is more accurate and stable than other methods even when there are no small clusters.


Pace regression

Y. Wang and I.H. Witten. Pace regression. Technical Report 99/12, University of Waikato, Department of Computer Science, Hamilton, New Zealand, September 1999.
[ bib | .ps | .pdf ]
This paper articulates a new method of linear regression, pace regression, that addresses many drawbacks of standard regression reported in the literature, particularly the subset selection problem. Pace regression improves on classical ordinary least squares (OLS) regression by evaluating the effect of each variable and using a clustering analysis to improve the statistical basis for estimating their contribution to the overall regression. As well as outperforming OLS, it also outperforms, in a remarkably general sense, other linear modeling techniques in the literature, including subset selection procedures, which seek a reduction in dimensionality that falls out as a natural byproduct of pace regression. The paper defines six procedures that share the fundamental idea of pace regression, all of which are theoretically justified in terms of asymptotic performance. Experiments confirm the performance improvement over other techniques.


KEA: Practical Automatic Keyphrase Extraction

Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. KEA: Practical automatic keyphrase extraction. In Proc 4th ACM conference on Digital Libraries, Berkeley, CA, pages 254-255. ACM, August 1999.
[ bib | http | .ps | .pdf ]
Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea's effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available.
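The abstract does not list the features computed for each candidate. As an assumption for illustration, the sketch below computes two plausible ones: a TF x IDF score and the relative position of the candidate's first occurrence. The function name `candidate_features`, the sample document, and the document-frequency figures are hypothetical; the values would then be passed to a learned model such as naive Bayes.

```python
import math


def candidate_features(phrase, doc_tokens, doc_freq, num_docs):
    """Return (tfidf, first_occurrence) for a candidate phrase, or None.

    phrase:     list of tokens making up the candidate.
    doc_tokens: list of tokens in the document.
    doc_freq:   number of corpus documents containing the phrase (> 0).
    num_docs:   total number of documents in the corpus.
    """
    n = len(phrase)
    positions = [
        i for i in range(len(doc_tokens) - n + 1) if doc_tokens[i:i + n] == phrase
    ]
    if not positions or doc_freq <= 0:
        return None
    tf = len(positions) / len(doc_tokens)        # how often the phrase occurs
    idf = math.log2(num_docs / doc_freq)         # how rare it is in the corpus
    first = positions[0] / len(doc_tokens)       # how early it first appears
    return tf * idf, first


# Made-up document and corpus statistics.
doc = ("machine learning makes keyphrase extraction practical "
       "and keyphrase extraction useful").split()
features = candidate_features(["keyphrase", "extraction"], doc, doc_freq=2, num_docs=8)
```

A phrase that is frequent in the document, rare in the corpus, and introduced early would score as a stronger candidate under these two features.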


Weka: Practical Machine Learning Tools and Techniques with Java Implementations

Ian H. Witten, Eibe Frank, Len Trigg, Mark Hall, Geoffrey Holmes, and Sally Jo Cunningham. Weka: Practical machine learning tools and techniques with Java implementations. In Nikola Kasabov and Kitty Ko, editors, Proceedings of the ICONIP/ANZIIS/ANNES'99 Workshop on Emerging Knowledge Engineering and Connectionist-Based Information Systems, pages 192-196, 1999. Dunedin, New Zealand.
[ bib | .ps | .pdf ]
The Waikato Environment for Knowledge Analysis (Weka) is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning and data mining algorithms. Weka is freely available on the World-Wide Web and accompanies a new text on data mining [1] which documents and fully explains all the algorithms it contains. Applications written using the Weka class libraries can be run on any computer with a Web browsing capability; this allows users to apply machine learning techniques to their own data regardless of computer platform.