1. Introduction
2. Related Work
2.1 Associating Keyphrases with Documents
2.2 Kea
2.3 Evaluating Keyphrase Extraction
3. Evaluation of Kea Keyphrase Sets
3.1 Research Questions
3.2 Experimental Texts
3.3 Subjects
3.4 Tasks
3.5 Paper Allocation
3.6 Instructions
3.7 Candidate Phrase Sets
4. Results
4.1 Overall Quality of Keyphrase Sets
4.2 Comparison Between Keyphrase Sources
4.3 Inter-subject Agreement
5. Analysis and Discussion
5.1 Overall Quality of Keyphrase Sets
5.2 Comparing Keyphrase Sources
5.3 Comparing Sets to Individual Phrases
5.4 Experimental Design
6. Compositional coverage
7. Conclusions
References
This metadata is equally useful in digital document repositories such as digital libraries and Web search engines. When available, it can facilitate document clustering, document retrieval, thesaurus production, browsing mechanisms, and many other access tools. It can be provided to end users to enrich query result sets and document displays, and to help them discriminate between documents.
Unfortunately, digital document repositories often lack sufficient resources to assign detailed metadata. One of the research goals of the New Zealand Digital Library Project (http://www.nzdl.org) is the automatic generation of metadata from source documents, and one aspect of this work is the Kea algorithm, which selects words and phrases from within a document that reflect its content. Some documents, such as scientific papers, are assigned sets of keyphrases (in our use of this term we are also referring to keywords) by authors or professional cataloguers. Many more documents are not. The goal of keyphrase extraction algorithms like Kea (Frank et al. 1999; Witten et al. 1999) and Extractor (Turney 1999; Turney 2000) is to automatically identify a set of keyphrases for a document that approximates the list that might be supplied by a document's author. Identified keyphrases may be exploited in (amongst others) clustering algorithms (Anick & Vaithynathan 1997; Jones & Mahoui 2000), retrieval algorithms (Arampatzis et al. 1998; Croft et al. 1991), thesaurus construction (Paynter et al. 2000), and browsing interfaces (Gutwin et al. 1999; Jones & Paynter 1999).
This paper reports on an evaluation of the keyphrases produced by document authors, and by the Kea and Extractor algorithms. Kea and Extractor have previously been evaluated in two ways: automatically, by measuring the algorithm's precision and recall of author-supplied keyphrases (Frank et al. 1999; Turney 2000); and subjectively, by asking human assessors to rate the quality of individual keyphrases (Barker & Cornacchia 2000; Jones & Paynter 2001). The evaluation reported here complements these studies.
Jones and Paynter (2001) describe a subjective evaluation of Kea and author keyphrases that asks participants to rate individual keyphrases, then uses these ratings to evaluate the various phrase sources. Although this experiment showed that the phrases in the author and Kea sets were viewed favourably, a limitation is that there is no evidence that the quality of a phrase set as a whole is directly related to the quality of its individual members. In fact, Barker and Cornacchia (2000) report the opposite phenomenon: when evaluating two algorithms, in some cases human assessors preferred the individual phrases from one set, but preferred the other as a whole. The purpose of the present experiment was to evaluate the quality of the phrase sets generated by various sources, and to test the hypothesis that the quality of a phrase set is directly and positively related to the quality of its constituent phrases.
At first glance, this hypothesis appears trivially true, but it is not necessarily so. Consider as a (simple) example a document that discusses two topics, dogs and fish, for which we wish to choose three keyphrases. Our first attempt yields dingoes, wolves, and domestic canines, all of which are sensible phrases, but none of which covers the second topic of the paper. A second attempt might yield poodles, quadrupeds, and ocean life, all of which seem less useful, but which cover both topics. A third attempt might suggest dogs, fish, and electricity, which has two phrases that cover each of the topics perfectly, and one that is unrelated to the document and likely to mislead the user about its content. It is not clear which of the three attempts is the best—this will depend on the way the phrases are employed and on the preferences of the user.
In this paper we show that sets composed of phrases that are very good in isolation are likely to form a very good keyphrase set when it is considered as a whole. However, this is not always the case. The relationship between the quality of phrases and of sets is too complex for the current generation of keyphrase extraction algorithms to fully encompass because they are syntactically based, and do not understand the semantics of the phrases they extract. Document authors do take this additional information into account, and their phrase sets were viewed most favourably in this evaluation.
In the following section we summarize keyphrase extraction, the Kea algorithm and the results of previous evaluations. We then describe the method of the present study, and go on to present our findings. These results are then discussed and our conclusions summarized.
In the second approach, called keyphrase extraction, the text of a document is analyzed and its most appropriate words and phrases are identified and associated with the document. This means that every phrase that occurs in the document is a potential keyphrase of the document, and a controlled vocabulary is not required. However, the keyphrases generated are less consistent. Automatically identifying and extracting phrases is a complex task, but a range of techniques for identifying useful, descriptive and meaningful phrases have been suggested (Chen 1999; Krulwich & Burkey 1997; Smeaton & Kelledy 1998; Tolle & Chen 2000) (see Jones & Paynter 2002 for an overview). However, few of these conform to our definition of a keyphrase extraction algorithm. The distinction between keyphrase extraction algorithms and other phrase extraction algorithms is important because the author keyphrases provide an objective basis for evaluation, as will be discussed below.
Turney (1999; 2000) was the first to frame keyphrase extraction as a supervised learning problem, where all the phrases in a document are potential keyphrases, but only those that match the authors' choices are "correct" keyphrases. Turney devised an algorithm, called Extractor, that uses a set of heuristics and a genetic algorithm to identify the phrases that are most likely to be the authors'. Barker and Cornacchia (2000) suggest an alternative strategy; they identify noun phrases using dictionary lookup, and then consider the frequency of a given noun as a phrase head within a document, discarding those that fall below a given threshold. We will discuss a third algorithm, Kea, in greater detail.
Kea operates in two distinct stages. First, a model is built from a set of training documents with exemplar keyphrases (usually the author keywords, though any authoritative source may be used). Second, documents without keyphrase metadata are presented to Kea, and the model is used to identify those of their phrases that are most likely to be keyphrases; these phrases are "extracted" from the document and provided as output. The process is illustrated in Figure 1.
To learn a model, Kea extracts every phrase from each of the training documents. Many phrases are immediately discarded, including proper nouns, those that begin or end with a stopword, those that do not match predefined phrase length constraints, and those that occur only once within a document. The remaining phrases are called the candidate phrases of the document. Three attributes are calculated for each candidate phrase: whether or not it is an author-specified keyphrase of the document, the distance into a document that it first occurs, and how specific it is to the document. The third attribute is represented using the TF·IDF measure. TF is the frequency of the phrase in the document, and is divided by DF, the number of other documents in which the phrase occurs. The candidate phrases from every training document are combined into a single dataset and used to construct a Naïve Bayes classifier (Domingos & Pazzani 1997) that predicts whether or not a phrase is an author keyphrase based on its other attributes.
Once a model has been learned from the training documents, it can be used to extract keyphrases from new documents. The candidate phrases are extracted from the new document as described above, and the distance and TF·IDF attributes computed. The Naïve Bayes model uses these attributes to calculate the probability that each candidate phrase is a keyphrase of its document. The most probable candidates for each document are output in ranked order; these are the keyphrases that Kea associates with the document.
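The candidate filtering and the two numeric attributes described above can be sketched as follows. This is a minimal illustration, not Kea's actual implementation: the stopword list is a tiny hypothetical stand-in, the proper-noun filter is omitted, TF·IDF is assumed to take the common form tf × −log2(df / N) with add-one smoothing, and distance is assumed to be the fraction of the document that precedes the first occurrence. The function names and the `doc_freq` table are hypothetical, and the Naïve Bayes step is not shown.

```python
import math
import re

# Hypothetical stopword list; a real system would use a much longer one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is", "on", "by", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_phrases(tokens, min_len=1, max_len=3):
    """Map each candidate phrase to (occurrence count, index of first occurrence)."""
    stats = {}
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Discard phrases that begin or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            phrase = " ".join(gram)
            count, first = stats.get(phrase, (0, i))
            stats[phrase] = (count + 1, first)
    # Discard phrases that occur only once in the document.
    return {p: s for p, s in stats.items() if s[0] > 1}


def phrase_features(text, doc_freq, n_docs, min_len=1, max_len=3):
    """Return {phrase: (tfidf, distance)} for every candidate phrase in text."""
    tokens = tokenize(text)
    features = {}
    for phrase, (count, first) in candidate_phrases(tokens, min_len, max_len).items():
        tf = count / len(tokens)
        # Add-one smoothing (an assumption) avoids log(0) for unseen phrases.
        idf = -math.log2((doc_freq.get(phrase, 0) + 1) / (n_docs + 1))
        distance = first / len(tokens)
        features[phrase] = (tf * idf, distance)
    return features
```

Setting `min_len=2` mimics the "2-3" length restriction used for some of the phrase sets in this study; a trained classifier would then turn each (TF·IDF, distance) pair into a probability and rank the candidates.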
Changes to the model building process affect the parameters of the model and the characteristics of the keyphrases that will ultimately be extracted. The simplest and most significant possible change is to vary the training data. In this study we have used two different sets of training documents labelled cstr and aliweb. A question that arises is whether the level of similarity between the training and target documents has a noticeable effect on the quality of the extracted keyphrases. Therefore we have applied a domain-focussed extraction model (cstr) and a non-domain focussed model (aliweb) in our study.
The cstr corpus was drawn from a collection of Computer Science research papers gathered for inclusion in the New Zealand Digital Library (http://www.nzdl.org). This corpus was used as training data for Kea to produce the cstr keyphrase extraction model. Frank et al. (1999) generated the model that we use in this study. The cstr training corpus and keyphrase extraction model may be considered domain-focussed in that all documents discuss Computer Science research, although a range of research areas are represented.
The aliweb corpus contains HTML web pages gathered by Turney via the Aliweb search engine for his studies on keyphrase extraction (1999; 2000). The documents address a broad variety of topics such as micro-breweries, law libraries, text processing and university departments and contain author-assigned keyphrases specified within them using the HTML META tag. This corpus was used to train Kea and produce the aliweb keyphrase extraction model. The training material and the consequent extraction model are clearly not domain-focussed. Many Kea users will not have suitable training data that matches the documents from which they wish to extract keyphrase sets, and so it is important to establish the utility of a generic model such as aliweb.
The extraction model can be tailored in further ways. The length of the phrases to be extracted, expressed as a minimum and maximum number of words, can be constrained, as can the number of phrases extracted from each document.
Finally, the model can be extended by the inclusion of the keyphrase-frequency attribute in addition to the three attributes described above. This attribute represents the number of times a phrase has occurred as an author-specified keyphrase in the set of training documents. The effect of including this attribute is to bias the extraction model in favour of the most common author-selected keyphrases in the training corpus. The resulting model is more domain specific and it is suited to situations where there is a strong relationship between the domains of the training and testing documents. This study evaluates a Kea model called cstr-kf that is based on the cstr corpus and employs the keyphrase-frequency attribute.
Both Kea (Frank et al. 1999) and Extractor (Turney 2000) have been evaluated using precision and recall. These studies show that there is no statistically significant difference between the two algorithms, though Kea's performance can be significantly improved by the keyphrase-frequency attribute. However, the keyphrase-frequency attribute may not always be available and is domain-specific, meaning that it boosts performance only for documents describing a single domain, such as computer science or physics. Further, recent work by Turney has shown that keyphrase frequencies can be detrimental when used in other domains (personal communication).
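As an illustration of the automatic measure, precision and recall against author keyphrases can be computed as in the sketch below. It assumes exact matching after lowercasing; the published evaluations may normalise phrases differently (for instance by stemming), and the function name is hypothetical.

```python
def precision_recall(extracted, author):
    """Precision and recall of extracted keyphrases against author keyphrases."""
    extracted_set = {p.strip().lower() for p in extracted}
    author_set = {p.strip().lower() for p in author}
    matches = extracted_set & author_set
    # Precision: share of extracted phrases that the author also chose.
    precision = len(matches) / len(extracted_set) if extracted_set else 0.0
    # Recall: share of author phrases that were extracted.
    recall = len(matches) / len(author_set) if author_set else 0.0
    return precision, recall
```

An author keyphrase that never appears in the document text can never be matched, which caps the recall that any extraction-based method can reach.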
There are well-known problems with evaluations that rely on the author's keyphrases. Author keyphrases do not always appear in the text of the document to which they belong, and therefore cannot be found by an extraction-based technique. Further, author keyphrases are available for a limited number and type of documents, and authors rarely provide more than a few keyphrases, far fewer than may be extracted automatically. Finally, authors choose keyphrases for purposes other than document description—to increase the likelihood of publication, for example. Barker and Cornacchia (2000) observed these deficiencies, and proposed instead a subjective evaluation of Extractor and their own algorithm (called B&C), though their results are limited by the paucity and disparity of assessors and test documents, and a consequent lack of inter-assessor agreement.
Jones and Paynter (2002) describe a more extensive subjective evaluation of Kea and author keyphrases, using the same test material and similar assessors to the study described later in this paper. In this previous evaluation, assessors were presented with scientific papers and an associated list of individual keyphrases for each paper. The keyphrases were produced from a number of sources, including author keyphrases, and three Kea models (aliweb, cstr and cstr-kf, which are described below). Assessors rated each of the keyphrases using a numeric scale, based on how well they reflected the content of the paper.
This study revealed that:
This limitation is particularly relevant to single-word keyphrases like WWW, Clustering, Categorization, Scripting and navigation, which are too general to describe a document when they appear in isolation, but benefit from the context provided by other phrases. For example, the keyword navigation has many possible interpretations when it appears on its own, but its meaning is plain when it occurs in a keyphrase list with WWW, hypertext and history mechanisms. It is possible that a keyphrase that is rated poorly on its own makes a significant contribution to a keyphrase list, or that a phrase that is rated highly in isolation is redundant when presented with other, similar phrases. Consequently, a list comprised of phrases with high individual ratings is not necessarily better than one whose constituents were less well-regarded.
In fact, Barker and Cornacchia observed exactly this phenomenon (2000). They determined their subjects' preference for phrase sets and for individual phrase scores from two sources (Extractor and B&C). Only half of the phrase set preferences matched those derived from individual phrase scores. They conjecture that the phrase set preferences of the subjects were not simply based on the individual phrases that constituted the sets, and that the sets were more (or less) than the sum of their parts.
Additionally we were interested in the extent to which the sets of keyphrases covered the content of a document. We wished to determine whether the phrase sources focussed on particular portions of a document for keyphrase extraction, or whether the extracted phrases occurred throughout the document text. This is of particular interest with respect to Kea, given that the distance attribute of the Kea models favours extraction of phrases that occur near to the start of a document.
| Paper ID | Reference | Author keyphrases |
|---|---|---|
| 1 | Borchers, J.O., "WorldBeat: Designing a Baton-Based Interface for an Interactive Music Exhibit", pp. 131-138. | interface design, interactive exhibit, baton, music, education |
| 2 | Kandogan, E. and Shneiderman, B., "Elastic Windows: Evaluation of Multi-Window Operations", pp. 250-257. | Window Management, Multi-window operations, Personal Role Management, Tiled Layout, User Interfaces, Information Access and Organization |
| 3 | Wilcox, L.D., Schilit, B.N. and Sawhney, N., "Dynomite: a Dynamically Organized Ink and Audio Notebook", pp. 186-193. | Electronic notebook, note-taking, audio interfaces, hand-writing, keyword indexing, ink properties, retrieval, paper-like interfaces, PDA, pen computing |
| 4 | Myers, B.A., "Scripting Graphical Applications by Demonstration", pp. 534-541. | Scripting, Macros, Programming by Demonstration (PBD), Command Objects, Toolkits, User Interface Development Environments, Amulet |
| 5 | Tauscher, L. and Greenberg, S., "Revisitation Patterns in World Wide Web Navigation", pp. 399-406. | History mechanisms, WWW, web, hypertext, navigation |
| 6 | Pitkow, J. and Pirolli, P., "Life, Death and Lawfulness on the Electronic Frontier", pp. 383-390. | Clustering, categorization, co-citation analysis, World Wide Web, hypertext, survival analysis, usage model |
The papers were chosen because they contain author-specified keywords and phrases, and provide a good fit with the background and experience of our subjects. Each paper was eight pages long, and the authors' keyphrases were removed from each so that they would not influence the subjects' responses.
Three Kea models were used to extract keyphrases. The first, aliweb, was trained on a set of general content web pages gathered by Turney (1999; 2000). The second, cstr, is derived from a collection of technical reports in a range of computer science subject areas. The third, cstr-kf, was trained on the same documents as cstr, but uses a further attribute which reflects how frequently a phrase occurs as an author keyphrase in a set of training documents. Both cstr and cstr-kf were produced by Frank et al. (1999).
The minimum phrase length was varied for each Kea model. Two phrase sets were produced with each model, corresponding to phrases of 1 to 3 words and 2 to 3 words. The six Kea phrase sets for each text were the same as those used in our previous study.
In our previous study we captured subjects' ratings of individual phrases provided by Kea and the authors. In this study we use this data to calculate the mean score across subjects for all phrases for each of the texts. For each text we then produced a list of phrases ranked in descending order by the mean of the scores assigned to them by subjects; this we called the best-individual set.
The number of phrases in each set was determined by the number of author phrases specified for the text (it is, of course, possible to select more phrases from the other sets). In the case of the Kea and best-individual sets the list of phrases is already ranked and we chose the first N from the beginning of the list. A command-line parameter was used to request the appropriate number of phrases from Extractor.
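The construction of the best-individual set can be sketched as below: average the subjects' ratings for each phrase, sort in descending order of the mean, and keep the first N. The function name and the alphabetical tie-break are assumptions made for illustration, and any ratings passed in are up to the caller.

```python
from statistics import mean


def best_individual(ratings, n):
    """Return the n phrases with the highest mean rating, best first."""
    means = {phrase: mean(scores) for phrase, scores in ratings.items()}
    # Sort by descending mean; break ties alphabetically (an assumption).
    ranked = sorted(means, key=lambda p: (-means[p], p))
    return ranked[:n]
```

Here `n` would be the number of author keyphrases supplied for the paper, so that every phrase set for a given text has the same size.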
The nine phrase sets for each paper were labelled as follows (labels for Kea sets reflect the model used and the length restriction, in words, placed on the phrases):
| Document text | Phrase sets |
| Paper 1 : Borchers | Paper 1 : Borchers |
| Paper 2 : Kandogan & Shneiderman | Paper 2 : Kandogan & Shneiderman |
| Paper 3 : Wilcox et al | Paper 3 : Wilcox et al |
| Paper 4 : Myers | Paper 4 : Myers |
| Paper 5 : Tauscher & Greenberg | Paper 5 : Tauscher & Greenberg |
| Paper 6 : Pitkow & Pirolli | Paper 6 : Pitkow & Pirolli |
Table 3 shows the ranks and mean scores when data from all six papers are combined, providing an immediate indication of the "best" keyphrase sources. The author phrase sets were judged to be of the highest quality with a mean score of 6.65. They were closely followed by the best-individual set, which were the top-ranking individual phrases from our previous study, and which received a mean score of 6.63. The Kea cstr 2-3 sets were ranked third overall, followed by Extractor. The aliweb 1-3 and cstr 1-3 sets were ranked fifth and sixth respectively, and on average received positive ratings that were higher than the midpoint on the scale used by the subjects. Three sets received negative ratings overall, with mean scores below the midpoint: aliweb 2-3, cstr-kf 2-3 and cstr-kf 1-3.
| Keyphrase set | author | best-individual | cstr 2-3 | Extractor | aliweb 1-3 | cstr 1-3 | aliweb 2-3 | cstr-kf 2-3 | cstr-kf 1-3 |
|---|---|---|---|---|---|---|---|---|---|
| rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| mean score | 6.65 | 6.63 | 6.20 | 5.93 | 5.78 | 5.53 | 4.98 | 4.65 | 4.25 |
| sd | 1.75 | 1.72 | 2.26 | 1.80 | 2.15 | 2.22 | 3.10 | 1.89 | 2.33 |
The ranks and mean scores for the individual papers are shown in Table 4. These scores vary from paper to paper. For example, the author set ranked first for papers 3 and 6, but ranked fifth for paper 2. The cstr 2-3 set ranked first for papers 3 and 5, but ranked sixth for papers 4 and 6. Note that in the case of a tie, more than one phrase set may be assigned the same ranking for a paper.
| Paper | Statistic | author | best-individual | cstr 2-3 | Extractor | aliweb 1-3 | cstr 1-3 | aliweb 2-3 | cstr-kf 2-3 | cstr-kf 1-3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Paper 1 | rank | 3 | 6 | 2 | 4 | 6 | 5 | 1 | 6 | 6 |
| | mean | 6.00 | 4.86 | 6.29 | 5.43 | 4.86 | 5.14 | 8.00 | 4.86 | 4.86 |
| | sd | 2.00 | 1.77 | 2.21 | 1.62 | 2.61 | 2.04 | 1.15 | 0.90 | 1.46 |
| Paper 2 | rank | 5 | 1 | 3 | 5 | 7 | 4 | 1 | 8 | 9 |
| | mean | 5.50 | 6.75 | 6.63 | 5.50 | 5.13 | 6.00 | 6.75 | 4.63 | 3.38 |
| | sd | 1.77 | 1.04 | 1.77 | 1.85 | 2.10 | 1.51 | 0.89 | 1.19 | 1.77 |
| Paper 3 | rank | 1 | 4 | 1 | 4 | 3 | 6 | 8 | 7 | 8 |
| | mean | 6.71 | 5.57 | 6.71 | 5.57 | 5.71 | 5.43 | 3.00 | 5.29 | 3.00 |
| | sd | 0.76 | 1.13 | 1.70 | 1.72 | 2.14 | 1.81 | 2.31 | 1.70 | 1.83 |
| Paper 4 | rank | 2 | 1 | 6 | 3 | 4 | 4 | 8 | 7 | 9 |
| | mean | 6.67 | 7.50 | 5.00 | 6.00 | 5.33 | 5.33 | 3.50 | 4.17 | 2.67 |
| | sd | 1.37 | 1.38 | 0.89 | 1.67 | 1.63 | 1.21 | 1.52 | 2.40 | 1.97 |
| Paper 5 | rank | 8 | 2 | 1 | 6 | 5 | 3 | 3 | 9 | 6 |
| | mean | 6.20 | 8.40 | 8.80 | 7.20 | 7.40 | 7.80 | 7.80 | 5.00 | 7.20 |
| | sd | 0.84 | 1.14 | 1.30 | 0.84 | 1.67 | 1.64 | 1.10 | 2.12 | 1.30 |
| Paper 6 | rank | 1 | 2 | 6 | 4 | 3 | 7 | 9 | 7 | 5 |
| | mean | 8.86 | 7.29 | 4.29 | 6.29 | 6.71 | 4.00 | 1.14 | 4.00 | 5.14 |
| | sd | 1.21 | 1.50 | 2.81 | 2.56 | 2.14 | 3.37 | 2.61 | 2.94 | 2.79 |
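The per-paper ranks in the table above appear to follow a "competition" scheme in which tied means share a rank and the next rank is skipped. A minimal sketch of that scheme, with a hypothetical function name, is:

```python
def competition_ranks(means):
    """Rank 1 = highest mean; tied means share a rank and the next rank is skipped."""
    # A score's rank is one more than the number of strictly higher scores.
    return [1 + sum(other > m for other in means) for m in means]


# Mean scores for Paper 5, in the column order of the table above.
paper5 = [6.20, 8.40, 8.80, 7.20, 7.40, 7.80, 7.80, 5.00, 7.20]
print(competition_ranks(paper5))  # [8, 2, 1, 6, 5, 3, 3, 9, 6]
```

The two sets tied at 7.80 both receive rank 3 and no set receives rank 4, which matches the Paper 5 row.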
Table 5 summarises Table 4 by showing how many times each phrase set achieved each ranking. aliweb 2-3 was the most variable: its average rating was below halfway, but twice it was ranked first.
| Keyphrase Set | ||||||||||
| author | cstr 1-3 | cstr 2-3 | aliweb 1-3 | aliweb 2-3 | cstr-kf 1-3 | cstr-kf 2-3 | Extractor | best-individual | ||
| Ranking | 1 | 2 | 2 | 2 | 2 | |||||
| 2 | 1 | 1 | 1 | 2 | ||||||
| 3 | 1 | 1 | 1 | 2 | 1 | |||||
| 4 | 2 | 1 | 3 | 1 | ||||||
| 5 | 1 | 1 | 1 | 1 | 1 | |||||
| 6 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | |||
| 7 | 1 | 1 | 3 | |||||||
| 8 | 1 | 2 | 1 | 1 | ||||||
| 9 | 1 | 2 | 1 | |||||||
Given the variation observed between papers we asked if there was a significant difference between the quality of the phrase sets for each of the six papers. Using Friedman's two-way analysis of variance by ranks (Siegel & Castellan 1988) we established that at the p=0.05 level there was a significant difference between at least one pair of phrase sets for all papers except for paper 5. (For paper 5 there was a significant difference at the p=0.051 level.) Table 6 shows the results of this analysis. All scores shown were adjusted for ties in the data.
| Paper | Fr | p |
|---|---|---|
| 1 | 23.53 | =0.003 |
| 2 | 28.67 | <0.001 |
| 3 | 24.34 | =0.002 |
| 4 | 26.42 | =0.001 |
| 5 | 15.52 | =0.051 |
| 6 | 40.15 | <0.001 |
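For readers who wish to reproduce this kind of analysis, the Friedman statistic with the usual adjustment for ties can be sketched as follows. The sketch assumes a complete ratings matrix with one row per subject and one column per phrase set; the function names are hypothetical, and the subjects' raw ratings are not reproduced here.

```python
from collections import Counter


def average_ranks(scores):
    """Rank scores in ascending order, giving tied scores their mean rank."""
    ordered = sorted(scores)
    counts = Counter(ordered)
    first = {}
    for pos, value in enumerate(ordered, start=1):
        first.setdefault(value, pos)
    # A group of t tied values occupying positions p..p+t-1 has mean rank p + (t-1)/2.
    return [first[s] + (counts[s] - 1) / 2 for s in scores]


def friedman(ratings):
    """Friedman statistic, adjusted for ties (rows = subjects, columns = conditions)."""
    n, k = len(ratings), len(ratings[0])
    rank_sums = [0.0] * k
    tie_term = 0
    for row in ratings:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
        tie_term += sum(t ** 3 - t for t in Counter(row).values())
    uncorrected = 12 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3 * n * (k + 1)
    # The correction factor is zero only if every subject ties every condition.
    correction = 1 - tie_term / (n * k * (k * k - 1))
    return uncorrected / correction
```

The result is compared with the chi-square distribution on k - 1 degrees of freedom. With nine phrase sets there are 8 degrees of freedom, for which the critical value at the 0.05 level is approximately 15.51.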
We investigated further to identify which phrase sets are significantly different (at p=0.05, excluding data from paper 5). To do so we extended the Friedman analysis by carrying out multiple comparisons between conditions (Siegel & Castellan 1988).
Table 7 summarizes the phrase sources that are significantly different. Each cell of the table shows the papers for which the phrase set labeling the row was significantly better than the phrase set labeling the column. For example, the row labeled cstr 2-3 shows that the cstr 2-3 phrase set was significantly better than the aliweb 2-3 phrase set for papers 2 and 3, and also significantly better than the cstr-kf 1-3 phrase set for paper 3.
| Keyphrase Set | ||||||||
| aliweb 1-3 | aliweb 2-3 | cstr 1-3 | cstr 2-3 | cstr-kf 1-3 | cstr-kf 2-3 | Extractor | best-individual | |
| author | 6 | 6 | 3,4 | 6 | ||||
| aliweb 1-3 | 6 | |||||||
| aliweb 2-3 | 1 | 2 | 1 | |||||
| cstr 1-3 | ||||||||
| cstr 2-3 | 2,3 | 3 | ||||||
| cstr-kf 1-3 | ||||||||
| cstr-kf 2-3 | ||||||||
| Extractor | 6 | |||||||
| best-individual | 4,6 | 4 | 4 | |||||
W has a value between 0 (agreement as expected by chance) and 1 (complete agreement). Table 8 shows the agreement values for each paper. In each case, W is non-zero, indicating that there is some level of inter-subject agreement other than that expected by chance. The weakest agreement of 0.39 was observed for paper 5, and the strongest of 0.72 for paper 6. The chi-square score (X2 in Table 8) and the degrees of freedom (8, for 9 phrase sets) can be used to determine the level of significance of the W value. The level of agreement is highly significant (to at least the p=0.003 level) for all papers except paper 5, which is significant only at the p=0.0514 level.
| Paper | W | X2 | p |
|---|---|---|---|
| 1 | 0.42 | 23.53 | =0.0027 |
| 2 | 0.45 | 28.67 | =0.0004 |
| 3 | 0.44 | 24.34 | =0.0020 |
| 4 | 0.55 | 26.41 | =0.0009 |
| 5 | 0.39 | 15.43 | =0.0514 |
| 6 | 0.72 | 40.15 | <0.0001 |
| All | 0.39 | 18.91 | =0.0153 |
We also measured overall agreement, across all papers. For each paper we computed the mean score for each phrase set (across all subjects), and converted the mean scores to rankings. The resulting W value is also shown in Table 8 and provides evidence of significant agreement. Consequently, we conclude that subjects agree sufficiently for us to consider the reported results to be meaningful.
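The W and X2 columns of the agreement table are linked by the standard identity X2 = n(k - 1)W, where n is the number of judges and k the number of items ranked. The sketch below applies it to the "All" row, where the six papers play the role of judges and the nine phrase sets are the items; the function name is hypothetical.

```python
def kendall_w(chi_square, n_judges, n_items):
    """Kendall's coefficient of concordance from the Friedman chi-square."""
    return chi_square / (n_judges * (n_items - 1))


# "All" row of the agreement table: six papers rank nine phrase sets.
w_all = kendall_w(18.91, n_judges=6, n_items=9)
print(round(w_all, 2))  # 0.39
```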
After author, the next most highly-rated set was the best-individual set, whose average was only fractionally less than the author set. This was followed by one of the Kea sets, cstr 2-3, and the Extractor set. The next two Kea sets, aliweb 1-3 and cstr 1-3 were also rated positively, but it is notable that the Kea phrases are more variable than the author, best-individual and Extractor sets (Table 3, standard deviation row). This indicates that information providers should utilise author keyphrases whenever possible. However, such metadata is often unavailable, and consequently automated extraction is required. In this case, the results indicate that a focussed Kea extraction model (cstr in this case) can be used to provide good quality keyphrase sets where there is a good fit between the model and the target documents. The situation will arise where it is not possible to create a new model to match the target documents and no pre-existing focussed model is appropriate. In such a case, the results indicate that generic models such as Extractor and aliweb 1-3 can provide useful phrase sets.
The two cstr-kf sets performed particularly poorly: they received the lowest average ratings, and were not ranked better than fifth for any of the papers (Table 4). This is consistent with our previous study where the individual phrases produced by this model were significantly worse than those from other sources. As in the previous study, we attribute its poor performance to a mismatch between the domain of the training documents (Computer Science) and of the documents used in the evaluation (Human Computer Interaction). There is mounting evidence that domain-specific models using the keyphrase-frequency attribute perform significantly worse than generic models when applied outside the domain on which they are trained: this effect was visible in both of our subjective evaluations, and has been demonstrated by Turney when Kea is evaluated against author keyphrases (personal communication).
Some significant differences were detected within individual papers, as described in Table 7. Most of these differences involve one of the top three sets (author, best-individual, cstr 2-3) being significantly better than one of the bottom three sets (aliweb 2-3, cstr-kf 1-3, cstr-kf 2-3). The aliweb 2-3 set is anomalous in that it is both significantly better and significantly worse than other sets for more than one document. This is because the ratings for this set were the most variable, as can be seen from the standard deviation row of Table 3, and in Table 4. The cstr-kf phrase sets, as we have observed, simply perform poorly. In summary, the best sources of keyphrase sets are author, best-individual and cstr 2-3, while the domain-specific Kea models perform poorly.
One approach is to compare the average score assigned to each phrase set in this evaluation to the average score assigned to each constituent phrase in the previous evaluation. In our previous study, the mean score of an individual author phrase was 6.36, which is very similar to the mean score of 6.65 for author phrase sets in this study. These figures, and figures for the other sets, are shown in the second and third columns of Table 9. (Extractor is omitted because it did not appear in the previous experiment.) Column 4 of this table contains the difference between the two scores; the differences are in all cases small.
| Phrase set | Mean individual score (Rank) | Mean set score (Rank) | Difference in means | Difference in ranks |
|---|---|---|---|---|
| best-individual | n/a (1) | 6.63 (2) | n/a | -1 |
| author | 6.36 (2) | 6.65 (1) | 0.29 | +1 |
| cstr 2-3 | 5.87 (3) | 6.20 (3) | 0.33 | 0 |
| aliweb 2-3 | 5.65 (4) | 4.98 (6) | 0.67 | -2 |
| cstr 1-3 | 5.62 (5) | 5.53 (5) | 0.09 | 0 |
| cstr-kf 2-3 | 5.59 (6) | 4.65 (7) | 0.94 | -1 |
| aliweb 1-3 | 5.25 (7) | 5.78 (4) | 0.53 | +3 |
| cstr-kf 1-3 | 4.92 (8) | 4.25 (8) | 0.67 | 0 |
Ultimately, the mean individual scores and mean set scores in Table 9 are not directly comparable because they measure different things. Instead, we can consider the performance of each set by each measure relative to the other sets. Table 9 also assigns a rank to each of the sets when they are sorted by mean individual score (from the previous experiment) and by mean set score (from this experiment). The author set, for example, was ranked second (behind best-individual) when sorted by mean individual score, but ranked first when sorted by mean set score. The ranks for each of the other sets are shown in parentheses, while the rightmost column contains the change in rank for each set between the two evaluations.
Generally, the keyphrase sources ranked approximately the same when ranked as sets as when they are ranked individually. Only the phrase sets based on the aliweb model change by more than 1 position in the rank hierarchy. aliweb 2-3 has already proved to be the most variable of the sets, and performed much better when its individual phrases were measured than when they were considered as a group. Conversely, aliweb 1-3 was ranked seventh when its phrases were considered in isolation, but improved to fourth when considered as a group.
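The rank comparison can be reproduced from the mean scores in the table of individual and set scores above. In the sketch below, `math.inf` is a placeholder (an assumption of this sketch) for the best-individual set, which has no individual mean but is ranked first by construction.

```python
import math

# Mean scores copied from the table of individual and set scores above.
individual_means = {
    "best-individual": math.inf,  # placeholder: top-ranked by definition
    "author": 6.36, "cstr 2-3": 5.87, "aliweb 2-3": 5.65, "cstr 1-3": 5.62,
    "cstr-kf 2-3": 5.59, "aliweb 1-3": 5.25, "cstr-kf 1-3": 4.92,
}
set_means = {
    "best-individual": 6.63, "author": 6.65, "cstr 2-3": 6.20,
    "aliweb 2-3": 4.98, "cstr 1-3": 5.53, "cstr-kf 2-3": 4.65,
    "aliweb 1-3": 5.78, "cstr-kf 1-3": 4.25,
}


def ranks(scores):
    """Rank 1 = highest score (no ties occur in this data)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: pos for pos, name in enumerate(ordered, start=1)}


individual_rank = ranks(individual_means)
set_rank = ranks(set_means)
# Positive change: the source ranks higher as a set than as individual phrases.
rank_change = {name: individual_rank[name] - set_rank[name] for name in set_means}
moved = sorted(name for name, change in rank_change.items() if abs(change) > 1)
print(moved)  # ['aliweb 1-3', 'aliweb 2-3']
```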
The relationship between the best-individual set and the author set is interesting. best-individual is constructed of the best individual phrases drawn from both the author and Kea sets, and is by definition the top-ranked set when measured by individual phrases. The author set must therefore contain some phrases that are (when considered individually) ranked less highly than those in best-individual, yet the author phrases as a group very slightly outperform the best-individual set. The prominence of the author set suggests that authors take care to choose a set of complementary keyphrases, rather than choosing good phrases in isolation. This can be explained intuitively: if a document describes five topics, the reader is much better served by five adequate phrases that describe each of the five different topics than they are by five excellent phrases that describe only one or two of the topics.
We conclude from this data that if we compare two sets of keyphrases, it is likely that the set whose constituent phrases are individually the best (in the opinion of a user) will be the best set when the phrases are considered as a whole. However, this is not always the case. Barker and Cornacchia (2000) observed that phrase sets and individual phrases were ranked differently; our experiments suggest this is the exception rather than the rule.
Developers using keyphrase extraction algorithms should consider whether their phrases are to be considered as a group (e.g. when a document is to be summarised), or in isolation (e.g. when indexing a set of documents by topic). Overall, Kea appears to perform equivalently by both measures, and both Kea and Extractor perform well. However, domain-specific models like cstr-kf, which have been reported to boost performance in particular domains, are demonstrably worse when applied outside their domains, even to seemingly very similar domains like computer science and human-computer interaction.
A design goal of both experiments was to maximise the number of judgements per phrase or phrase set given the constraints of the resources available. Other researchers have reported very low levels of agreement in phrase evaluation studies (Barker & Cornacchia 2000; Chen 1999). Such inconsistency is a common difficulty experienced in studies that gather subjective ratings by human assessors. Our previous study of individual keyphrases reported a high level of agreement, which we attributed to the expertise of the subjects in the domain of the documents that they considered. We have again observed significant agreement for our current study, with the exception of one paper where agreement was marginally not significant. This strengthens our belief that it is important to match subject knowledge to the documents for which they provide assessments, following the methodology of Tolle and Chen (2000).
Kea and the other algorithms are handicapped by their inability to explicitly identify and label the high-level concepts in a document. Our studies have established that some of the phrase sets extracted by Kea models are good ones, but have not directly addressed how completely those sets cover the topics within each document. However, we assume that subjects considered this to some degree in their rating of the phrase sets.
Documents are normally composed of multiple topics related to the main theme, and these may be interwoven to form the document as a whole. This paper makes the topical structure reasonably clear through the use of sections and subsections, but many documents do not contain such cues (and Kea does not use them). They do however suggest that different topics will be discussed at different points in the document, and that in order to cover all the topics in a document, we must draw keyphrases from the document's entire length.
We have investigated this idea further by considering the distribution of extracted keyphrases within documents. Each document was split into ten segments of equal size, and the number and proportion of occurrences of keyphrases from each keyphrase set within each segment were calculated (the "Keywords" list was removed from the start of each document). The mean proportions of keyphrases occurring within each segment are shown in Figure 3.
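The segment analysis can be illustrated with a short Python sketch. Several details are assumptions made here for illustration, since the text does not specify them: segments are measured in characters, phrases are matched case-insensitively as whole words without stemming, and an occurrence is counted in the segment where it starts. The function names are hypothetical.

```python
import re


def segment_proportions(text, keyphrases, n_segments=10):
    """Proportion of keyphrase occurrences that start in each segment.

    Assumptions of this sketch: equal-sized segments are measured in
    characters, and matching is case-insensitive on whole words with
    no stemming.
    """
    lowered = text.lower()
    length = len(lowered)
    counts = [0] * n_segments
    for phrase in keyphrases:
        if not phrase:
            continue  # an empty phrase would match at every word boundary
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        for match in re.finditer(pattern, lowered):
            # Integer arithmetic maps a character offset to a segment index.
            segment = min(match.start() * n_segments // length, n_segments - 1)
            counts[segment] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_segments
    return [count / total for count in counts]


def mean_proportions(per_document):
    """Average the per-segment proportions over several documents."""
    n_docs = len(per_document)
    return [sum(column) / n_docs for column in zip(*per_document)]


# Toy document: the phrase occurs exactly once in each tenth of the text.
document = "kea " * 10
print(segment_proportions(document, ["kea"]))
```

Calling `segment_proportions` once per document for a given keyphrase set and passing the results to `mean_proportions` gives one curve per keyphrase source, which is the kind of summary a figure such as Figure 3 would plot.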
The overall trend is that the occurrence of keyphrases declines as distance into the document increases. (Note that Kea is biased towards phrases that occur early in the text through the distance attribute described earlier.) More than a third (38%) of aliweb 2-3 keyphrase occurrences were in the first 10% of the document text; the corresponding figures for cstr 2-3 and author were 25% and 26% respectively. The effect was less evident for aliweb 1-3, cstr 1-3, and each of the cstr-kf phrase sources, all of which tended to be rated less highly by the human assessors. Extractor provided the most even distribution of phrase occurrences across document segments.
Document structure clearly impacts upon these observations. Each of the study documents is a research paper from an ACM conference, and includes an abstract, introduction, conclusion and references section. Consequently, one might expect that the topics discussed later in the document are presented in overview near to the start. The phrases chosen by authors do tend to occur near to the beginning of documents. However, the author phrases do occur throughout the document text, so this emphasis does not exclude selection from, and consequently coverage of topics within, the entire document. It is also noticeable that phrase occurrence displays a regular pattern in the last three sections of the documents, which we attribute to the similar structures of conclusions and reference lists.
We are interested in testing the hypothesis that when two phrase sets are compared, the set whose constituent phrases are rated highest is also rated highest when considered as a whole. The evidence suggests this is often but not always true, and that the quality of a phrase set depends not only on the quality of its individual phrases, but on the relationships between the phrases, such as whether the terms are synonyms, how many phrases describe the same topic, and how many of the concepts in the document are reflected in the phrase set.
Kea, and other automatic keyphrase generation algorithms, work by attempting to find good individual phrases, but this does not always lead to a phrase set that reflects the composition of topics in the document. Author keyphrases, on the other hand, are consistently judged highly, better even than the set of best individual phrases from our previous evaluation, which suggests that authors take care to get a balanced phrase set, and this care is reflected in the scores assigned to their choices.
The topical composition of the source documents is generally not considered in keyphrase extraction algorithms and evaluations. We have attempted to gauge its effect by examining the distributions of keyphrases throughout the documents. Although Kea favours extraction of phrases occurring near to the start of the document, it selects phrases that occur throughout the document text, providing good coverage of document content.
(1997) Proceedings of CHI'97: Human Factors in Computing Systems (ACM Press)
(1998) Proceedings of CHI'98: Human Factors in Computing Systems (ACM Press)
Anick, P. and Vaithyanathan, S. (1997) "Exploiting Clustering and Phrases for Context-Based Information Retrieval". In Proceedings of SIGIR'97: the 20th International Conference on Research and Development in Information Retrieval, ACM Press, pp. 314-322
Arampatzis, A.T., Tsoris, T., Koster, C.H.A. and Van der Weide, T.P. (1998) "Phrase-based information retrieval". Information Processing & Management, 34(6), 693-707
Barker, K. and Cornacchia, N. (2000) "Using Noun Phrase Heads to Extract Document Keyphrases". In Proceedings of the Thirteenth Canadian Conference on Artificial Intelligence (LNAI 1822), pp. 40-52
Chen, K.-H. (1999) "Automatic Identification of Subjects for Textual Documents in Digital Libraries". Computation and Language E-Print Archive, Los Alamos National Laboratory, http://xxx.lanl.gov/abs/cs.DL/9902002
Croft, B., Turtle, H. and Lewis, D. (1991) "The Use of Phrases and Structured Queries in Information Retrieval". In Proceedings of SIGIR'91, ACM Press, pp. 32-45
Domingos, P. and Pazzani, M. (1997) "On the Optimality of the Simple Bayesian Classifier Under Zero-One Loss". Machine Learning 29(2/3), 103-130
Dumais, S.T., Platt, J., Heckerman, D. and Sahami, M. (1998) "Inductive Learning Algorithms and Representations for Text Categorization". In Proceedings of the 7th International Conference on Information and Knowledge Management, ACM Press, pp. 148-155
Frank, E., Paynter, G., Witten, I., Gutwin, C. and Nevill-Manning, C. (1999) "Domain-specific Keyphrase Extraction". In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 668-673
Gutwin, C., Paynter, G.W., Witten, I.H., Nevill-Manning, C. and Frank, E. (1999) "Improving Browsing in Digital Libraries with Keyphrase Indexes". Journal of Decision Support Systems, 27(1-2), 81-104
Jones, S. and Mahoui, M. (2000) "Hierarchical Document Clustering Using Automatically Extracted Keyphrases". In Proceedings of the Third International Conference on Asian Digital Libraries, pp. 113-120
Jones, S. and Paynter, G.W. (2002) "Automatic Extraction of Document Keyphrases for Use in Digital Libraries: Evaluation and Applications". Journal of the American Society for Information Science and Technology, 53(8), 653-677
Jones, S. and Paynter, G.W. (2001) "Human Evaluation of Kea, an Automatic Keyphrasing System". In Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries, ACM Press, pp. 148-156
Jones, S. and Paynter, G. (1999) "Topic-based Browsing Within a Digital Library Using Keyphrases". In Proceedings of Digital Libraries'99: The Fourth ACM Conference on Digital Libraries, ACM Press, pp. 114-121
Krulwich, B. and Burkey, C. (1997) "The Infofinder Agent - Learning User Interests Through Heuristic Phrase Extraction". IEEE Intelligent Systems & Their Applications 12(5), 22-27
Paynter, G.W., Witten, I.H. and Cunningham, S.J. (2000) "Evaluating Extracted Phrases and Extending Thesauri". In Proceedings of the Third International Conference on Asian Digital Libraries, pp. 131-138
Siegel, S. and Castellan, N.J. (1988) Nonparametric Statistics for the Behavioral Sciences (2nd edition), McGraw Hill College Div
Smeaton, A. and Kelledy, F. (1998) "User-Chosen Phrases in Interactive Query Formulation for Information Retrieval". In Proceedings of the 20th BCS IRSG Colloquium, (Grenoble, France)
Tolle, K.M. and Chen, H. (2000) "Comparing Noun Phrasing Techniques for Use with Medical Digital Library Tools". Journal of the American Society for Information Science, 51(4), 352-370
Turney, P.D. (2000) "Learning Algorithms for Keyphrase Extraction". Information Retrieval, 2(4), 303-336
Turney, P.D. (1999) "Learning to Extract Keyphrases from Text". Technical Report ERB-1057 (NRC #41622). Canadian National Research Council, Institute for Information Technology
Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C. and Nevill-Manning, C.G. (1999) "KEA: Practical Automatic Keyphrase Extraction". In Proceedings of Digital Libraries '99: The Fourth ACM Conference on Digital Libraries, pp. 254-255