2008
-
Milne, D. and Witten, I.H. (2008) Learning to link with Wikipedia. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM'2008), Napa Valley, California. (Best Paper Award)
This paper describes how to automatically cross-reference documents with Wikipedia: the largest knowledge base ever known. It explains how machine learning can be used to identify significant terms within unstructured text, and enrich it with links to the appropriate Wikipedia articles. The resulting link detector and disambiguator performs very well, with recall and precision of almost 75%. This performance is constant whether the system is evaluated on Wikipedia articles or "real world" documents.
This work has implications far beyond enriching documents with explanatory links. It can provide structured knowledge about any unstructured fragment of text. Any task that is currently addressed with bags of words—indexing, clustering, retrieval, and summarization to name a few—could use the techniques described here to draw on a vast network of concepts and semantics.

-
Milne, D. and Witten, I.H. (2008) An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of the first AAAI Workshop on Wikipedia and Artificial Intelligence (WIKIAI'08), Chicago, IL.
This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide structured world knowledge about the terms of interest. Our approach is unique in that it does so using the hyperlink structure of Wikipedia rather than its category hierarchy or textual content. Evaluation with manually defined measures of semantic relatedness reveals this to be an effective compromise between the ease of computation of the former approach and the accuracy of the latter.
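A minimal sketch of what a link-based relatedness measure can look like. The abstract above does not give the formula, so the normalized-distance form used here, the function name `link_relatedness`, and the toy link sets are all assumptions for illustration rather than the paper's definition.

```python
import math
from typing import Set


def link_relatedness(inlinks_a: Set[str], inlinks_b: Set[str], total_articles: int) -> float:
    """Relatedness in [0, 1] from the overlap of two articles' incoming links.

    Assumed formula (not taken from the abstract): a normalized distance over
    shared incoming links, converted to a similarity by subtracting from 1.
    """
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    if total_articles <= larger:
        raise ValueError("total_articles must exceed the size of both link sets")

    common = len(inlinks_a & inlinks_b)
    if common == 0:
        # No shared incoming links: treat the articles as unrelated.
        return 0.0

    distance = (math.log(larger) - math.log(common)) / (
        math.log(total_articles) - math.log(smaller)
    )
    # Clamp so that very distant pairs do not produce negative scores.
    return max(0.0, 1.0 - distance)


# Hypothetical toy data: titles of articles that link to each concept.
cat_inlinks = {"Pet", "Mammal", "Felidae", "Animal"}
dog_inlinks = {"Pet", "Mammal", "Canidae", "Animal"}
car_inlinks = {"Vehicle", "Engine"}

cat_dog = link_relatedness(cat_inlinks, dog_inlinks, total_articles=1000)
cat_car = link_relatedness(cat_inlinks, car_inlinks, total_articles=1000)
```

Only set intersections and sizes are needed, which is what makes a measure of this kind cheap to compute compared with processing full article text.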

-
Medelyan, O., Witten, I.H., and Milne, D. (2008) Topic Indexing with Wikipedia. To appear in Proceedings of the first AAAI Workshop on Wikipedia and Artificial Intelligence (WIKIAI'08), Chicago, IL.
Wikipedia article names can be utilized as a controlled vocabulary for identifying the main topics in a document. Wikipedia's two million articles cover the terminology of nearly any document collection, which permits controlled indexing in the absence of manually created vocabularies. We combine state-of-the-art strategies for automatic controlled indexing with Wikipedia's unique property—a richly hyperlinked encyclopedia. We evaluate the scheme by comparing automatically assigned topics with those chosen manually by human indexers. Analysis of indexing consistency shows that our algorithm performs as well as the average human.

-
Milne, D., Nichols, D.M., and Witten, I.H. (2008) A competitive environment for exploratory query expansion. In Proceedings of the Joint Conference on Digital Libraries (JCDL 2008), Pittsburgh, PA.
Most information workers query digital libraries many times a day. Yet people have little opportunity to hone their skills in a controlled environment, or compare their performance with others in an objective way. Conversely, although search engine logs record how users evolve queries, they lack crucial information about the user's intent. This paper describes an environment for exploratory query expansion that pits users against each other and lets them compete, and practice, in their own time and on their own workstation. The system captures query evolution behavior on predetermined information-seeking tasks. It is publicly available, and the code is open source so that others can set up their own competitive environments.

-
Medelyan, O. and Milne, D. (2008) Augmenting domain-specific thesauri with knowledge from Wikipedia. In Proceedings of the NZ Computer Science Research Student Conference (NZCSRSC 2008), Christchurch, New Zealand.
We propose a new method for extending a domain-specific thesaurus with valuable information from Wikipedia. The main obstacle is disambiguating thesaurus concepts to the correct Wikipedia articles. Given the concept name, we first identify candidate mappings by analyzing article titles, their redirects and disambiguation pages. Then, for each candidate, we compute a link-based similarity score to all mappings of context terms related to this concept. The article with the highest score is then used to augment the thesaurus concept. It is the source for an extended gloss explaining the concept's meaning; synonymous expressions that can be used as additional nondescriptors in the thesaurus; translations of the concept into other languages; and new domain-relevant concepts.
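The mapping steps described above (find candidates, score each against the context, keep the best) can be sketched roughly as follows. The toy link data, the `CANDIDATES` index, and the Jaccard overlap used as the similarity score are hypothetical stand-ins; the abstract does not specify the actual scoring function.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy data: article title -> set of articles it links to.
LINKS: Dict[str, Set[str]] = {
    "Mercury (planet)": {"Solar System", "Planet", "Sun"},
    "Mercury (element)": {"Chemical element", "Metal", "Toxicity"},
    "Venus": {"Solar System", "Planet", "Sun"},
    "Orbit": {"Planet", "Sun", "Gravity"},
}

# Hypothetical candidate index, standing in for a lookup over article
# titles, redirects and disambiguation pages.
CANDIDATES: Dict[str, List[str]] = {
    "mercury": ["Mercury (planet)", "Mercury (element)"],
    "venus": ["Venus"],
    "orbit": ["Orbit"],
}


def link_similarity(a: str, b: str) -> float:
    """Jaccard overlap of outgoing links (an assumed stand-in score)."""
    links_a, links_b = LINKS.get(a, set()), LINKS.get(b, set())
    union = links_a | links_b
    return len(links_a & links_b) / len(union) if union else 0.0


def disambiguate(term: str, context_terms: List[str]) -> Optional[str]:
    """Pick the candidate article most similar to the context's candidates."""
    candidates = CANDIDATES.get(term.lower(), [])
    if not candidates:
        return None

    # Collect every candidate article for every context term.
    context_articles = [
        article
        for context_term in context_terms
        for article in CANDIDATES.get(context_term.lower(), [])
    ]

    def score(candidate: str) -> float:
        return sum(link_similarity(candidate, ctx) for ctx in context_articles)

    return max(candidates, key=score)


best = disambiguate("Mercury", ["Venus", "Orbit"])
```

With astronomy terms as context, the planet article shares links with the context articles while the element article shares none, so the planet sense is chosen.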

2007
-
Milne, D., Witten, I.H. and Nichols, D.M. (2007). A Knowledge-Based Search Engine Powered by Wikipedia. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM'2007), Lisbon, Portugal.
This paper describes Koru, a new search interface that offers effective domain-independent knowledge-based information retrieval. Koru exhibits an understanding of the topics of both queries and documents. This allows it to (a) expand queries automatically and (b) help guide the user as they evolve their queries interactively. Its understanding is mined from the vast investment of manual effort and judgment that is Wikipedia. We show how this open, constantly evolving encyclopedia can yield inexpensive knowledge structures that are specifically tailored to expose the topics, terminology and semantics of individual document collections. We conducted a detailed user study with 12 participants and 10 topics from the 2005 TREC HARD track, and found that Koru and its underlying knowledge base offer significant advantages over traditional keyword search. It was capable of lending assistance to almost every query issued to it, making their entry more efficient, improving the relevance of the documents they return, and narrowing the gap between expert and novice seekers.

-
Milne, D. (2007). Computing Semantic Relatedness using Wikipedia Link Structure. In Proceedings of the New Zealand Computer Science Research Student Conference (NZCSRSC 2007), Hamilton, New Zealand.
This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide a vast amount of structured world knowledge about the terms of interest. Our system, the Wikipedia Link Vector Model or WLVM, is unique in that it does so using only the hyperlink structure of Wikipedia rather than its full textual content. To evaluate the algorithm we use a large, widely used test set of manually defined measures of semantic relatedness as our benchmark. This allows direct comparison of our system with other similar techniques.

2006
-
Milne, D., Medelyan, O. and Witten, I. H. (2006). Mining Domain-Specific Thesauri from Wikipedia: A case study. In Proceedings of the International Conference on Web Intelligence (IEEE/WIC/ACM WI'2006), Hong Kong.
Domain-specific thesauri are high-cost, high-maintenance, high-value knowledge structures. We show how the classic thesaurus structure of terms and links can be mined automatically from Wikipedia, a vast, open encyclopedia. In a comparison with a professional thesaurus for agriculture (Agrovoc) we find that Wikipedia contains a substantial proportion of its domain-specific concepts and semantic relations; furthermore it has impressive coverage of a collection of contemporary documents in the domain. Thesauri derived using these techniques are attractive because they capitalize on existing public efforts and tend to reflect contemporary language usage better than their costly, painstakingly-constructed manual counterparts.

-
Witten, I. H., Medelyan, O. and Milne, D. (2006). Finding documents and reading them: Semantic metadata extraction, topic browsing and realistic books. In Proceedings of the Russian Conference on Digital Libraries (RCDL 2006), Suzdal, Russia.
What would it take to provide a congenial and comfortable environment for finding and reading books in a digital library? To locate information we need algorithms that extract semantic metadata in forms such as keyphrases, with accuracy and consistency comparable to human indexers. To support this we need comprehensive, detailed thesauri, automatically created, that embody contemporary language and usage. To emulate and enjoy the serendipitous adventures found in real libraries and bookstores we need browsing environments that provide readers with multiple clues in parallel: keyphrases, text excerpts, and supplementary knowledge structures, as well as the documents themselves. For readers to cherish and enjoy individual works we need to transcend the bland reading environment provided by the web by recreating the subjective impact and pleasurable experience of interacting with real books. This paper describes research that aims to achieve these goals.

2005
-
Bouamrane, M., Luz, S., Masoodian, M., and King, D. (2005) Supporting Remote Collaboration through Structured Activity Logging. In Proceedings of the 4th International Conference on Grid and Cooperative Computing (GCC 2005), Beijing, China.
This paper describes an integrated architecture for online collaborative multimedia (audio and text) meetings. It supports the recording of participants' audio exchanges, automatic metadata generation, and logging of users' editing interactions, as well as information derived from the use of group awareness widgets (gesturing), for post-meeting processing and access. We propose a formal model for timestamping the generation and manipulation of textual artefacts. Post-meeting processing of the interaction information highlights the usefulness of such histories in terms of tracking information that would normally be lost in usual collaborative editing settings. The potential applications of such automatic interaction history generation range from quantitative analysis of group interaction to cooperation modelling and multimedia meeting mining.

-
Masoodian, M., Luz, S., Bouamrane, M., and King, D. (2005) RECOLED: A Group-Aware Collaborative Text Editor for Capturing Document History. In Proceedings of the IADIS International Conference on WWW/Internet, Lisbon, Portugal.
This paper presents a usability analysis of RECOLED, a shared document editor which supports recording of audio communication in remote collaborative writing sessions, and transparent monitoring of interactions, such as editing, gesturing and scrolling. The editor has been designed so that the collaboration results in the production of a multimedia document history which enriches the final product of the writing activity and can serve as a basis for post-meeting information retrieval. A discussion is presented on how post-meeting processing can highlight the usefulness of such histories in terms of tracking information that would be normally lost in usual collaborative editing settings.

2004
-
Bouamrane, M., King, D., Luz, S., and Masoodian, M. (2004) A Framework for Collaborative Writing with Recording and Post-Meeting Retrieval Capabilities. In Proceedings of CSCW 2004 Workshop, The 6th International Workshop on Collaborative Editing Systems, Chicago, USA.
From an HCI perspective, elucidating and supporting the context in which collaboration takes place is key to implementing successful collaborative systems. Synchronous collaborative writing usually takes place in contexts involving a "meeting" of some sort. Collaborative writing meetings can be face-to-face or, increasingly, remote Internet-based meetings. The latter presents software developers with the possibility of incorporating multimedia recording and information retrieval capabilities into the collaborative environment. The collaborative writing that ensues can be seen as an activity encompassing asynchronous as well as synchronous aspects. In order for revisions, information retrieval and other forms of post-meeting, asynchronous work to be effectively supported, the synchronous collaborative editor must be able to appropriately detect and record meeting metadata. This paper presents a collaborative editor that supports recording of user actions and explicit metadata production. Design and technical implications of introducing such capabilities are discussed with respect to document segmentation, consistency control, and awareness mechanisms.

Theses
-
Milne, D. (2006) From Phrase Browsing to Interactive Query Expansion, an AJAX enabled approach. Unpublished Masters Thesis, University of Waikato, New Zealand.
Interactive query expansion covers a group of techniques that provide a useful compromise between searching and browsing. Their case is compelling; they expose available knowledge and assist with the difficult task of constructing effective queries. One such technique is phrase browsing, in which queries are treated as single phrases and evolved by exploring an automatically generated hierarchy of terms. Extensive development and evaluation at the University of Waikato has shown phrase browsing to be promising but lacking in several important respects, particularly its inability to cope with multi-topic queries.
This thesis takes phrase browsing as the starting point and generalizes it into a new interactive query expansion technique. This transcends the narrow definition of phrase browsing by removing its restrictions to provide flexible searching and browsing. Multiple topics can be expanded with terms obtained from different sources, including both general and domain-specific thesauri. These modifications to the general approach are matched by a complete redesign and reimplementation of the interface. The AJAX framework is used to provide a highly responsive web application.
The new system has been compared directly with keyword searching and indirectly with an earlier phrase browser. This formal evaluation confirmed that many phrase browsing problems have been resolved. The new interface was well received by subjects, who preferred it to keyword searching. However, much of the improvement is due to interface features that could be incorporated into the less popular system. This research calls into question the whole idea of phrase browsing and raises the possibility of topic browsing, a more general approach that is less closely tied to specific terminology.
-
Milne, D. (2004) Design and implementation of an XML-based collaborative document editor. Unpublished Honors Thesis, University of Waikato, New Zealand.
It is very rare for a significant document to be produced entirely as an individual effort. Despite this, the majority of document editors are geared towards the individual author. The purpose of this project is to develop a document editor for groups of authors, allowing them to simultaneously and remotely collaborate around a central shared document. To this end, the RECOLED prototype was designed, implemented and evaluated.
While there have been many research projects involving similar prototypes, this project seeks to differentiate itself from these by providing an XML-based record of the authoring process for post-meeting analysis and by maintaining a strong focus on usability and group awareness.
Confusingly enough, some of the publications above have me under an older alias: David King.
I forget why—dodging taxes perhaps?