Department of
Computer Science
Tari Rorohiko

Computing and Mathematical Sciences

2008 Seminars

Learning classifiers from only positive and unlabeled data

Charles Elkan
University of California, San Diego, USA
Tuesday, 2nd December 2008
The input to an algorithm that learns a binary classifier normally consists of two sets of examples, where one set consists of positive examples of the concept to be learned, and the other set consists of negative examples. However, it is often the case that the available training data are an incomplete set of positive examples, and a set of unlabeled examples, some of which are positive and some of which are negative. The problem solved here is how to learn a standard binary classifier given a nontraditional training set of this nature. Under the assumption that the labeled examples are selected randomly from the positive examples, we show that a classifier trained on positive and unlabeled examples predicts probabilities that differ by only a constant factor from the true conditional probabilities of being positive. We show how to use this result in three different ways to learn a classifier from a nontraditional training set. We then apply the new methods to solve a real-world problem: identifying protein records that should be included in an incomplete specialized molecular biology database. Experiments in this domain show that models trained using the new methods perform better than the current state-of-the-art SVM-based method for learning from positive and unlabeled examples.
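
As a rough illustration of the constant-factor result: if g(x) is the output of a classifier trained to separate labeled from unlabeled examples, and c is the probability that a positive example is labeled, then g(x) is approximately c times the true probability of being positive. The sketch below is not the speaker's code. It assumes that c can be estimated as the average of g over held-out labeled positives, and all function names and scores are made up for illustration.

```python
from statistics import mean


def estimate_label_frequency(scores_on_labeled_positives):
    """Estimate c = p(labeled | positive) as the mean score g(x) over
    held-out examples that are known (labeled) positives. This estimator
    is an assumption of the sketch."""
    return mean(scores_on_labeled_positives)


def to_positive_probability(score, c):
    """Rescale a labeled-vs-unlabeled score g(x) to an estimate of
    p(positive | x), clipped so it never exceeds 1."""
    return min(1.0, score / c)


# Hypothetical scores g(x) from a classifier trained on labeled vs. unlabeled data.
held_out_labeled_scores = [0.25, 0.5, 0.25, 0.5, 0.5]
c = estimate_label_frequency(held_out_labeled_scores)

unlabeled_scores = [0.1, 0.2, 0.4]
positive_probabilities = [to_positive_probability(g, c) for g in unlabeled_scores]
```

Dividing by c leaves the ranking of examples unchanged; only the scale of the predicted probabilities changes.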


Bioinformatics algorithms: overview and current research directions

Sanghamitra Bandyopadhyay
Indian Statistical Institute, Kolkata, India
Monday, 24th November 2008
Over the past few decades, major advances in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the biological information generated by the scientific community. Computational biology and bioinformatics, an area that has evolved in response to this deluge of information, can be viewed as the use of computational methods to handle biological data. It is an interdisciplinary field that draws on biology, computer science, mathematics and statistics to analyze biological sequence data and genome content and arrangement, and to predict the function and structure of macromolecules. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be derived. One important sub-discipline within bioinformatics involves the development of new algorithms and models to extract new, and potentially useful, information from various types of biological data including DNA, RNA and proteins. Analysis of these macromolecules is performed both structurally and functionally. This talk will start with a brief overview of molecular biology and bioinformatics, followed by some of the basic research issues in bioinformatics. Thereafter, two tasks, namely protein superfamily classification and structure-based ligand design, will be discussed.


Event-driven database information sharing

Ken Moody
University of Cambridge, UK
Tuesday, 14th October 2008
Database systems have been designed to manage business critical information and provide this information on request to connected clients, a passive model. Increasingly, applications need to share information actively with clients and/or external systems, so that they can react to relevant information as soon as it becomes available. Event-driven architecture (EDA) is a software architectural pattern that models these requirements based on the production of, consumption of, and reaction to events. Publish/subscribe provides a loosely-coupled communication paradigm between the components of a system, through many-to-many, push-based event delivery. We describe our work integrating distributed content-based publish/subscribe functionality into a database system. We have extended existing database technology with new capabilities to realise EDA in a reliable, scalable, and secure manner. We discuss the design, architecture, implementation, and evaluation of PostgreSQL-PS, a prototype built on the PostgreSQL open-source database system.
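
To make the content-based publish/subscribe idea concrete, here is a minimal in-memory sketch. It is not PostgreSQL-PS and says nothing about how that prototype is built. The `Broker` class, the event fields and the subscriber predicates are all hypothetical, and the sketch ignores distribution, reliability and security entirely.

```python
class Broker:
    """Toy content-based publish/subscribe broker (illustrative only)."""

    def __init__(self):
        self._subscriptions = []  # list of (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        """Register interest in every event for which predicate(event) is true."""
        self._subscriptions.append((predicate, callback))

    def publish(self, event):
        """Push the event to all matching subscribers; return how many matched."""
        delivered = 0
        for predicate, callback in self._subscriptions:
            if predicate(event):
                callback(event)
                delivered += 1
        return delivered


broker = Broker()
low_stock, hamilton_events = [], []

# Subscribers describe the content they want; they never name a publisher.
broker.subscribe(lambda e: e["stock"] < 10, low_stock.append)
broker.subscribe(lambda e: e["city"] == "Hamilton", hamilton_events.append)

first = broker.publish({"city": "Hamilton", "stock": 3})
second = broker.publish({"city": "Auckland", "stock": 50})
```

The publisher and the subscribers only share the broker and the event format, which is the loose coupling the abstract refers to.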


Text analysis using Wikipedia knowledge base: Texterra experience

Dmitry Lizorkin
Institute for System Programming, Russian Academy of Sciences, Russia
Wednesday, 27th August 2008
In the present information era, an important portion of information is represented as natural language texts that require special techniques for automated analysis and management, e.g. search, keyphrase extraction, classification by topic, and fact extraction. Because documents authored by humans are written in natural language, naive techniques like exact word matching generally give results of inadequate quality, and more sophisticated techniques are required for semantic analysis of texts. Wikipedia can be used effectively as a knowledge base for developing such techniques. It provides the necessary coverage, a rich link structure between concepts, content structure imposed by the wiki markup, and linguistic statistics based on its own textual corpus, which together form a sufficient knowledge base for semantic analysis of texts. Texterra is another research project that utilizes the benefits of the Wikipedia knowledge base for text analysis tasks. The Texterra project started by using only the link structure of Wikipedia to compute semantic similarity between concepts, and is currently researching other possibilities for applying Wikipedia in text analysis. This technical talk presents ideas developed in the Texterra project in this field of research.
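
The abstract does not say which similarity measure Texterra uses. As a stand-in, the sketch below scores two concepts by the Jaccard overlap of the sets of pages they link to, which is one simple way to derive similarity from link structure alone. The link data and the function name are invented for illustration.

```python
def link_similarity(links_a, links_b):
    """Jaccard overlap of two link sets: |A & B| / |A | B| (0.0 if both are empty)."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical outgoing links for three concepts.
links = {
    "Cat": {"Mammal", "Pet", "Felidae"},
    "Dog": {"Mammal", "Pet", "Canidae"},
    "Car": {"Vehicle", "Engine"},
}

cat_dog = link_similarity(links["Cat"], links["Dog"])
cat_car = link_similarity(links["Cat"], links["Car"])
```

Concepts that share many link targets score close to 1, while concepts with no links in common score 0.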


Next Generation Augmented Reality

Mark Billinghurst
Human Interface Technology Laboratory (HIT Lab NZ), University of Canterbury
Friday, 22nd August 2008
Augmented Reality (AR) is a technology that enables virtual imagery to be seamlessly blended with the real world. Although it was first developed in the 1960s, it is only recently that consumer hardware and software platforms have been developed to the point that the technology really can be placed in the hands of everyday users. For the first time, people have the computing and communications power in their pocket to provide a ubiquitous personal augmented reality experience. However, there are a number of interesting obstacles that must be overcome before AR experiences become commonplace. This presentation gives a brief review of key research milestones in the past and then describes the state of the art today and the research opportunities that exist in the field.


Copyright vs community

Richard Stallman
GNU Project and Free Software Foundation
Wednesday, 20th August 2008
Copyright developed in the age of the printing press, and was designed to fit with the system of centralized copying imposed by the printing press. But the copyright system does not fit well with computer networks, and only draconian punishments can enforce it. The global corporations that profit from copyright are lobbying for draconian punishments, and to increase their copyright powers, while suppressing public access to technology. But if we seriously hope to serve the only legitimate purpose of copyright (to promote progress, for the benefit of the public), then we must make changes in the other direction.


21st Century Raga: Digitizing North Indian Music

Ajay Kapur
California Institute of the Arts, USA
Monday, 4th August 2008
This presentation describes methods for digitizing, analyzing, preserving and extending North Indian Classical Music. Custom-built controllers, influenced by the Human Computer Interaction (HCI) community, serve as new interfaces to gather musical gestures from a performing artist. Modified tabla, dholak, and sitar will be described. Experiments using wearable sensors to capture ancillary gestures of a human performer will also be discussed. A brief history of the world of Musical Robotics will be followed by an introduction to the MahaDeviBot, a 12-armed solenoid-based drummer used to accompany a live sitar player. The presentation is full of video examples showing the evolution of the body of work from the laboratory to the live performance stage.


Introducing the networked environment for music analysis: Phase I

Stephen Downie
University of Illinois, Urbana-Champaign
Friday, 23rd May 2008
Phase I of the Networked Environment for Music Analysis (NEMA) framework project is a multinational, multidisciplinary cyberinfrastructure project for music information processing that builds upon and extends the music information retrieval research being conducted by the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL) at the University of Illinois at Urbana-Champaign (UIUC). NEMA brings together the collective projects and the associated tools of six world leaders in the domains of music information retrieval (MIR), computational musicology (CM), data mining, digital libraries and e-humanities research. The NEMA team aims to create an open and extensible webservice-based resource framework that facilitates the integration of music data and analytic/evaluative tools that can be used by the global MIR and CM research and education communities on a basis independent of time or location. To help achieve this goal, the NEMA team will be working co-operatively with the UIUC-based, Mellon-funded, Software Environment for the Advancement of Scholarly Research (SEASR) project to exploit SEASR’s expertise and technologies in the domains of data mining and webservice-based resource framework development. Here at the University of Waikato, the NEMA project is represented by Dr. David Bainbridge, leader of the Greenstone Digital Library (GSDL) Project.


Exact matching under noise and applications to music information retrieval

Edgar Chavez
University of Michoacan, Mexico
Tuesday, 20th May 2008
If we are given a set of $k$ positions where a string $s$ may have mismatches, a noisy exact matching is a string $s'$ matching all the characters of $s$, except perhaps the characters in the given $k$ positions. Such noisy exact matchings appear naturally in many pattern recognition tasks where a continuous function is discretized. One example is when we have different versions of the same song. The standard way to solve the problem is to use the Hamming distance, which measures the number of mismatches between strings. This implies sequentially searching over the collection of samples. For very large collections this approach does not scale. We present a randomized algorithm to find exact matches under noise using a standard relational database. This is the first truly scalable approach for the exact matching under noise problem. We exemplify the method applied to music information retrieval using a novel fingerprinting technique based on the instantaneous entropy measured directly in the time domain.
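
The definitions above can be sketched in a few lines. The code shows the noisy exact matching test and the sequential Hamming-distance baseline that the abstract says does not scale. It does not attempt the randomized, database-backed algorithm presented in the talk. Function names and example strings are chosen for illustration.

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))


def is_noisy_exact_match(s, t, noisy_positions):
    """True if t agrees with s everywhere except, perhaps, at noisy_positions."""
    if len(s) != len(t):
        return False
    return all(a == b
               for i, (a, b) in enumerate(zip(s, t))
               if i not in noisy_positions)


def sequential_search(query, collection, max_mismatches):
    """Baseline: scan every sample, so the cost grows linearly with the collection."""
    return [sample for sample in collection
            if len(sample) == len(query)
            and hamming_distance(query, sample) <= max_mismatches]


matches = sequential_search("abcdef", ["abXdeY", "abcdef", "uvwxyz"], max_mismatches=2)
```

Note that the Hamming test bounds the number of mismatches anywhere in the string, whereas the noisy exact matching test fixes where mismatches are allowed.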


BuildIT: building research capability

Steve Reeves
Department of Computer Science, The University of Waikato
Tuesday, 29th April 2008
This project, BuildIT, focuses on developing the ICT research community within NZ, with particular focus on excellence in young and emerging researchers, and recognising the gap in post-doctorate support within the ICT domain. BuildIT recognises the short research history of Computer Science in the context of the phenomenal growth that this discipline has experienced, and factors impacting negatively on the research capability of the discipline. This project has been funded via the New Zealand government, and has charged the two universities leading the project (Waikato and Auckland) with distributing funding amongst all the New Zealand universities. The various parts of the project are run by academics from Waikato and Auckland (Mark Apperley, Steve Reeves, Robert Amor and John Hosking). There is also an overall governing board with members from New Zealand universities and industry. They will make the various awards against soon-to-be-agreed criteria—and you'll be hearing much more about this soon!


Interpreting the data: parallel analysis with Sawzall

Rob Pike
Google Australia
Tuesday, 22nd April 2008
Very large data sets, for example web document repositories, often have a flat but regular structure and span multiple disks and machines. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on. Rob Pike, a Principal Engineer at Google, will discuss a system for automating such analyses that exploits the parallelism inherent in having data and computation distributed across many machines.
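
The shape of such an analysis can be sketched as follows. This is plain Python, not Sawzall syntax, and it runs on one machine. It only illustrates why filtering and aggregation distribute easily: each shard is processed on its own, and the partial results are merged afterwards. The record fields and shard contents are hypothetical.

```python
from collections import Counter


def analyze_shard(records):
    """Filter and aggregate one shard independently of all other shards."""
    counts = Counter()
    for record in records:
        if record["status"] == 200:           # filtering
            counts[record["language"]] += 1   # aggregation
    return counts


def merge(partial_counts):
    """Combine per-shard counts; addition makes the merge order irrelevant."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical shards, standing in for data spread over several machines.
shards = [
    [{"language": "en", "status": 200}, {"language": "mi", "status": 200}],
    [{"language": "en", "status": 404}, {"language": "en", "status": 200}],
]
totals = merge(analyze_shard(shard) for shard in shards)
```

Because `analyze_shard` never looks outside its own shard, the calls could run on different machines without coordination until the merge step.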


Prototyping the network management cycle

Michael O'Sullivan
Engineering Science Department, Auckland University
Tuesday, 15th April 2008
Due to the affordability of their components, and the complexity involved in accurately modeling their operation, storage systems have tended to develop myopically, in response to localised storage and performance issues. There tends to be no accepted "best practice" on how to combine monitoring, modeling, and design in an ongoing network management process. We will introduce the Network Management Cycle (NMC), a framework intended to apply best practice from Engineering Design in the Network Management domain. This framework includes detailed network monitoring, simulation modeling, and mixed-integer programming to introduce a global performance-based rationale for strategic network management decisions. We will use examples from our recent research to illustrate our experience following the steps of the NMC.


Detection of mastitis pathogens by analysis of volatile bacterial metabolites

Kasper Hettinga
Dairy Science and Technology Group, Wageningen University, Wageningen, The Netherlands
Thursday, 10th April 2008
Mastitis, with or without clinical symptoms, is most often caused by bacteria. Determining the mastitis-causing pathogen is of great interest, both for the choice of treatment and for possible measures that have to be taken on the farm to prevent the spread of mastitis. Currently, the pathogen is determined with classical microbiological methods. The main disadvantage of these methods is that they are time-consuming. Faster and more accurate methods of pathogen detection are advantageous, because they enable farmers to choose an optimal treatment earlier.


Data mining in reaction databases: towards predicting biodegradation products and pathways

Stefan Kramer
University of Technology, Munich, Germany
Tuesday, 8th April 2008
In the talk, I will give an overview of data mining methods for databases of chemical reactions. The overall goal is to predict reaction products and pathways for compounds without experimental data. We build on data from the University of Minnesota Biocatalysis/Biodegradation Database (UM-BBD, http://umbbd.msi.umn.edu/), a database containing information on microbial biocatalytic reactions and biodegradation pathways for about 1000 chemical compounds. Recently, the collection of known reactions and pathways in the UM-BBD has been used to develop a knowledge-based system for the prediction of plausible microbial catabolic reactions and pathways, the University of Minnesota Pathway Prediction System (PPS, http://umbbd.msi.umn.edu/predict/). In the first part of the talk, I will present a graph-based representation of chemical reactions for the application of graph mining methods to such data. In the second part, I will present two hybrid knowledge-based and machine-learning based approaches to limit the combinatorial explosion in pathway prediction. Both approaches make use of the existing UM-PPS transformation rules. The first approach extracts so-called relative reasoning rules for the resolution of conflicts of transformation rules. The second learns one classifier for each UM-PPS rule and provides probabilities of suggested transformation products.


Adaptation and resource-awareness in data stream processing

Mohamed Medhat Gaber
Monash University, Victoria, Australia
Wednesday, 2nd April 2008
Advances in both hardware and software technologies have led to faster-than-ever data generation. As a result, the area of data stream processing has emerged. Streaming data is ubiquitous, and there is a real challenge in storing, querying, analyzing and visualizing such rapid and continuous large volumes of data. Resource constraints of ubiquitous computing environments represent the main research issue in realizing the potential of this field and its various important applications. Examples of such applications include processing data streams produced from sensor networks, web clickstreams, ATM transactions, stock markets and many others. In this talk, we review the challenges facing data stream processing. The need for resource-awareness and adaptation is presented, leading to an overview of our novel techniques in mining/querying data streams. Applications of these techniques are also presented. Finally, the talk concludes with future visions in the area of data streams and resource-constrained processing environments.


Utility-based regression - recent developments

Luis Torgo
Faculty of Economics, University of Porto, Portugal
Tuesday, 18th March 2008
Cost-sensitive learning is a key technique for addressing many real world data mining applications. Most existing research has been focused on classification problems. In this talk we describe recent developments we have undertaken to propose a framework for evaluating regression models in applications with non-uniform costs and benefits across the domain of the continuous target variable.


The role of inter-word distances in modelling natural language and mining useful information from it

Justin Washtell
School of Computing, University of Leeds, England
Tuesday, 26th February 2008
Distance-based language models are concerned with the distances between words in a corpus, as opposed to word frequencies derived from counts. They have scarcely been considered in the literature. This work presents one such model, the "nearest-neighbour" model - inspired by the problem of capturing dispersion - and illustrates how it can be applied to other tasks such as measuring association and performing machine translation and information retrieval. Theoretical benefits over traditional approaches to these tasks form a key part of the discussion.
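
The abstract does not define the "nearest-neighbour" model, so the following is only a sketch of the kind of raw measurement a distance-based approach starts from: for each occurrence of one word, the distance in tokens to the nearest occurrence of another word. The function name and the toy sentence are assumptions made for illustration.

```python
def nearest_distances(tokens, source, target):
    """For each occurrence of source, the token distance to the nearest target."""
    target_positions = [i for i, tok in enumerate(tokens) if tok == target]
    if not target_positions:
        return []
    return [min(abs(i - j) for j in target_positions)
            for i, tok in enumerate(tokens) if tok == source]


tokens = "the cat sat on the mat and the dog sat".split()
cat_to_sat = nearest_distances(tokens, "cat", "sat")
dog_to_cat = nearest_distances(tokens, "dog", "cat")
```

Unlike a co-occurrence count within a fixed window, these distances need no window size to be chosen in advance.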

