2009

[1]

Conditional Density Estimation with Class Probability Estimators

Eibe Frank and Remco Bouckaert. Conditional density estimation with class probability estimators. In Proceedings of the 1st Asian Conference on Machine Learning, Nanjing, China. Springer, 2009.
[ bib | .pdf ]
Many regression schemes deliver a point estimate only, but often it is useful or even essential to quantify the uncertainty inherent in a prediction. If a conditional density estimate is available, then prediction intervals can be derived from it. In this paper we compare three techniques for computing conditional density estimates using a class probability estimator, where this estimator is applied to the discretized target variable and used to derive instance weights for an underlying univariate density estimator; this yields a conditional density estimate. The three density estimators we compare are: a histogram estimator that has been used previously in this context, a normal density estimator, and a kernel estimator. In our experiments, the latter two deliver better performance, both in terms of cross-validated log-likelihood and in terms of quality of the resulting prediction intervals. The empirical coverage of the intervals is close to the desired confidence level in most cases. We also include results for point estimation, as well as a comparison to Gaussian process regression and nonparametric quantile estimation.

[2]

Classifier Chains for Multi-label Classification

Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. In Proc 13th European Conference on Principles and Practice of Knowledge Discovery in Databases and 20th European Conference on Machine Learning, Bled, Slovenia. Springer, 2009.
[ bib | .pdf ]
The widely known binary relevance method for multi-label classification, which considers each label as an independent binary problem, has been sidelined in the literature due to the perceived inadequacy of its label-independence assumption. Instead, most current methods invest considerable complexity to model interdependencies between labels. This paper shows that binary relevance-based methods have much to offer, especially in terms of scalability to large datasets. We exemplify this with a novel chaining method that can model label correlations while maintaining acceptable computational complexity. Empirical evaluation over a broad range of multi-label datasets with a variety of evaluation metrics demonstrates the competitiveness of our chaining method against related and state-of-the-art methods, both in terms of predictive performance and time complexity.

[3]

Human-competitive tagging using automatic keyphrase extraction

Olena Medelyan, Eibe Frank, and Ian H. Witten. Human-competitive tagging using automatic keyphrase extraction. In Proc Conf on Empirical Methods in Natural Language Processing. ACL, 2009.
[ bib | .pdf ]
This paper connects two research areas: automatic tagging on the web and statistical keyphrase extraction. First, we analyze the quality of tags in a collaboratively created folksonomy using traditional evaluation techniques. Next, we demonstrate how documents can be tagged automatically with a state-of-the-art keyphrase extraction algorithm, and further improve performance in this new domain using a new algorithm, “Maui”, that utilizes semantic information extracted from Wikipedia. Maui outperforms existing approaches and extracts tags that are competitive with those assigned by the best performing human taggers.

[4]

Clustering Documents using a Wikipedia-based Concept Representation

Anna Huang, David Milne, Eibe Frank, and Ian H. Witten. Clustering documents using a wikipedia-based concept representation. In Proc 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Bangkog, Thailand, pages 628-636. Springer, 2009.
[ bib | .pdf ]
This paper shows how Wikipedia and the semantic knowledge it contains can be exploited for document clustering. We first create a concept-based document representation by mapping the terms and phrases within documents to their corresponding articles (or concepts) in Wikipedia. We also developed a similarity measure that evaluates the semantic relatedness between concept sets for two documents. We test the concept-based representation and the similarity measure on two standard text document datasets. Empirical results show that although further optimizations could be performed, our approach already improves upon related techniques.

[5]

Large-scale attribute selection using wrappers

Martin Gütlein, Eibe Frank, Mark Hall, and Andreas Karwath. Large-scale attribute selection using wrappers. In Proc IEEE Symposium on Computational Intelligence and Data Mining, pages 332-339. IEEE, 2009.
[ bib | .pdf ]
Scheme-specific attribute selection with the wrapper and variants of forward selection is a popular attribute selection technique for classification that yields good results. However, it can run the risk of overfitting because of the extent of the search and the extensive use of internal cross-validation. Moreover, although wrapper evaluators tend to achieve superior accuracy compared to filters, they face a high computational cost. The problems of overfitting and high runtime occur in particular on high-dimensional datasets, like microarray data. We investigate Linear Forward Selection, a technique to reduce the number of attributes expansions in each forward selection step. Our experiments demonstrate that this approach is faster, finds smaller subsets and can even increase the accuracy compared to standard forward selection. We also investigate a variant that applies explicit subset size determination in forward selection to combat overfitting, where the search is forced to stop at a precomputed ``optimal'' subset size. We show that this technique reduces subset size while maintaining comparable accuracy.

[6]

Analysing chromatographic data using data mining to monitor petroleum content in water

G. Holmes, D. Fletcher, P. Reutemann, and E. Frank. Analysing chromatographic data using data mining to monitor petroleum content in water. In Proc 4th International ICSC Symposium, Thessaloniki, Greece, Information Technologies in Enviromental Engineering, pages 279-290, May 2009.
[ bib | .pdf ]
Chromatography is an important analytical technique that has wide- spread use in environmental applications. A typical application is the monitoring of water samples to determine if they contain petroleum. These tests are mandated in many countries to enable environmental agencies to determine if tanks used to store petrol are leaking into local water systems.
Chromatographic techniques, typically using gas or liquid chromatography coupled with mass spectrometry, allow an analyst to detect a vast array of compounds—potentially in the order of thousands. Accurate analysis relies heavily on the skills of a limited pool of experienced analysts utilising semi-automatic techniques to analyse these datasets—making the out- comes subjective.
The focus of current laboratory data analysis systems has been on refine- ments of existing approaches. The work described here represents a para- digm shift achieved through applying data mining techniques to tackle the problem. These techniques are compelling because the efficacy of pre- processing methods, which are essential in this application area, can be ob- jectively evaluated. This paper presents preliminary results using a data mining framework to predict the concentrations of petroleum compounds in water samples. Experiments demonstrate that the framework can be used to produce models of sufficient accuracy—measured in terms of root mean squared error and correlation coefficients—to offer the potential fo significantly reducing the time spent by analysts on this task.