Clustering Full Text Documents Abstract An index or topic hierarchy of full-text documents can organize a domain and speed information retrieval. The automatic generation of such an index is an ideal application for unsupervised learning, where the learner creates an integrated summary of a domain. Traditional indexes, like the Library of Congress system or Dewey Decimal system, are generated by hand, updated infrequently, and applied inconsistently. With machine learning, they can be generated automatically, updated as new documents arrive, and applied consistently. Despite the appeal of automatic indexing, organizing natural language documents is a difficult balance between what we want to do and what we can do. For optimal performance, the machine learner must know or acquire all that a human library patron knows about natural language. This will be beyond the capabilities of machine learning for many years to come. For the foreseeable future, we will have to apply approximate solutions to the problem and do whatever data engineering is necessary to yield good performance. This paper describes an application of clustering to full-text databases, presents a new clustering method, and discusses the data engineering necessary to use clustering for this application. In particular, the paper deals with engineering the feature set to permit learning and otherwise engineering the data to match assumptions underlying the learning algorithm. 1. Introduction Few problems fit neatly into the parameter space of a given machine learning method. Often, the data must be engineered to match the assumptions underlying the learning method. This paper discusses the simplification of natural language data for probabilistic, unsupervised learning across a collection of full-text documents. There is an ever-increasing number of on-line, full-text documents available and, as a result, access to just the right document is becoming much more difficult. As in a traditional library, this electronic maze of documents should be organized to make items easy to retrieve and browse. Ideally, this index would be generated automatically and would support hierarchical browsing of related documents. For example, a patron may want documents on the population of the California county of El Dorado. The user of such a system would not need to enter a query of ?El ADJ Dorado AND (county OR california OR CA) AND population?. Instead, at each level of a hierarchical index, they would be given a few terms to choose among. Their choices would determine the next level of the hierarchy presented. Once they have found useful documents, they may wish to be notified of any other documents that arrive on that topic. Generating a topic hierarchy is an ideal application for unsupervised learning, in which an integrated overview is created for a domain (cf. Rasmussen, 1992). In traditional libraries, hierarchical indexes, like the Library of Congress system or Dewey Decimal system, are generated by hand, updated infrequently, and applied haphazardly by individual librarians. With machine learning, they can be generated automatically, updated as new documents arrive, and applied consistently. As alluring as this possibility is, there is a problem. Clustering natural language documents is a difficult balance between what we want to do and what we are able to do. To maximize performance, the learner must know or acquire all that a human library patron knows about natural language. Unfortunately, such profound understanding is many decades away. For the foreseeable future, some data engineering is necessary for any application of machine learning to the organization, indexing, and retrieval of full-text documents. Even if we wanted to design a more appropriate learning algorithm that appeals to the rich structure of natural language, it is not clear where to start or even whether we could achieve better performance (cf. Soderland & Lehnert,1994). In the rest of the paper, we present the learning task, our learning method, and the data engineering difficulties encountered. Section 2 discusses the task of learning a hierarchy of documents. Section 3 presents a method for learning overlapping clusters. Section 4 briefly lists the data engineering problems we encountered. For Clustering Full Text Documents IJCAI-95 Workshop on Data Engineering for Inductive Learning 3 each problem described, we present our approach and our reasons for that approach. 2. Organizing a Document Hierarchy There are many ways to search a complex collection of on-line information and many have direct analogs in the traditional library. Currently, we can perform content searches, browse hierarchically organized information, or ask a librarian to find books similar to an example. Once we have a book, we can scan its index or the table of contents. With a subject index in the card catalog, we can find the card labelled ?El Dorado? to find which books are available. Then, once we reach the stacks of books armed with a Library of Congress call number, we can find our book. If we cannot find the book or it is not quite what we wanted, we can browse the nearby, related books, those that have a similar call number. Finally, if we want to be aware of any new related information, we can check that shelf every week or so. A key organizing principle for the library is the use of the Dewey Decimal system or Library of Congress system for organizing related materials. This system allows patrons to browse books on the same or similar topics, looking for those they did not know about but might like to read anyway. The patrons do not need to have a fully specified query as long as they get to the right shelf of books. In addition, the organization scheme allows a librarian or patron to quickly check if there are new books on a particular topic. Hierarchical indexes can play an important role for on-line documents. A large, growing source of on-line documents is the World Wide Web (Berners-Lee, 1994), a set of conventions for publishing information on the internet, and already many people have begun to publish hierarchical indexes to the information available on the World Wide Web. Indeed, one of the most popular WWW documents is a hand-built hierarchical index of the Web called ?YAHOO? (http:// www.yahoo.com/). Our goal is to build one of these indexes automatically. It would serve several purposes. First, it could be used to browse on-line documents quickly. For instance, a patron might want to know the population of El Dorado, CA. They might start by typing ?El Dorado? and choosing from the following hierarchy: The ?County, California? selection would then expand to include entries for, ?IRS ?, ?Population?, ?Maps?. Then by selecting ?Population?, the patron would discover one document containing census information: gopher://gopher.census.gov:70/00/Bureau/Population/Estimate/County/cou es_CA.txt. Finally, they would find that as of July 1, 1994, 144,839 of the 31,430,697 people in California lived in El Dorado county. The system constructs the presented hierarchy from pieces of the hierarchical index it generated ahead of time. Second, a hierarchical index could be used to organize an individual?s ?hotlist?, a list of favourite documents. The hierarchical structure of all documents can impose order on a subset of documents. Third, a hierarchical index could be used automatically to suggest extensions to a ?hotlist?. Essentially saying, ?Because you read pages related to El Dorado, CA, you might be interested in the following new page that was recently found: Los Rios College District (gopher://gopher.losrios.cc.ca.us:70/00/info)?. 3. Building a Hierarchical Index To build an index of documents, the learning or clustering method must group related documents together (Rasmussen, 1992). In addition, because different people will wish to use the resulting index in many different ways, the learner ideally should build several overlapping hierarchies of the documents (Martin & Billman, 1994). We will use an algorithm designed to learn several different divisions of the documents. Given a set of n documents, it will learn one or more ways to divide the documents into related subsets. Each different way to divide the documents will be called a factor of the documents. Each cluster of documents for a factor is given a name called a value of the factor. For example, a set of documents on El Dorado might be clustered by type of document: Postscript, HTML, GIF etc. Hence, one factor would be the type, with values corresponding to each of the types of document. Another factor might correspond to the content, with values related to different geographical locations. The method used is similar to that introduced by Martin and Billman (1994) and similar to a clustering algorithm used by Anderson and Matessa (1992). The algorithm, FL (Factor Learner), is a learning method that constructs clusters from empirical data. It takes as input a set of variable-value vectors and produces one or more factors. A variable-value vector is simply a list of variable names, each paired with one possible value for that variable. FL searches for an appropriate set of factors. First, it uses classical tests of association to decide how many sets of overlapping clusters to build, i.e., to decide which observable variables each factor should cover and how many factors to use. Second, it uses an approximation of Bayesian clustering to determine a highly probable number of values per factor and highly probable conditional dependencies between factors and observable variables. Because we expect that we will have to engineer the data in order to use FL, we must be clear about what assumptions underlie it. FL is based on the following five assumptions: 1. FL assumes that the domain of objects to be clustered really are clustered in the world. For example, all the documents in one cluster were generated by members of the same research community or are otherwise related by some structure in the world. However, we do not have any direct access to the clusters from the world only to observable features. 2. In general, the clusters in FL are used by providing some variable values and requesting other variable values to be predicted. For example, if documents have been clustered by keywords, ideally some keywords would be given and FL would be used to predict what other keywords are relevant. 3. The objects to be clustered are represented as variable-value vectors such that all possible variables have an associated value. This means, for example, that if some relational information is relevant to a task, it will have to be changed into a zero-th order representation. 4. There is no mathematical reason for additional levels to the hierarchy. Whatever the probability distribution that represents the domain, it can be represented by one level of factors. The only rationale for multiple levels of a hierarchy is to improve retrieval time. 5. Once a factor value is known, no features relevant to that factor are inter-correlated. 3.1 Finding the factors The topology of factors, i.e., the number of factors and which variables are relevant to the factor, is determined using classical tests of association between pairs of discrete variables.1 A structure of factors can be identified using the pattern of dependence and independence between pairs of observable variables.2 Two rules govern the structure: 1. If two variables are interdependent, one or more factors directly influence both. 1. FL currently uses Pearson?s test of association. However, there is no theoretical commitment to a particular test. 2. Higher-order conditional dependencies and independencies could provide converging evidence but are not strictly necessary. 2. If two variables are independent, no factor directly influences both. FL constructs the factors by finding a set of cliques in an undirected graph of dependency relations. In this graph, a direct link between two variables means they are dependent, and no link means they are independent. First, FL constructs the graph by introducing a link between two variables when the test of association indicates dependence between those two variables. FL labels each link with the probability (computed using classical statistics) that those two variables are dependent. Second, FL extracts cliques by iterating through the variables and adding a variable to an existing clique if the variable is dependent on all other variables in that clique. After doing this, if the variable has dependency links not already in a clique, FL starts another clique with two variables. Third, FL performs a greedy search for the cliques that jointly maximizes the sum of the labels represented in the set of cliques. Each of the resulting cliques corresponds to a factor in the factor structure and the variables in the cliques are the variables that directly depend on that factor. When two or more cliques share a variable, then that variable has two or more factors on which its value depends. 3.2 Constructing the factors Once the factors and related variables are chosen, FL must choose the number of values for each factor, that is cluster the objects according to a factor. It must also choose the conditional probabilities. FL attempts to do so very rapidly, only approximating the normative Bayesian method for learning hidden variables. The normative way to learn hidden variables is to consider all possible numbers of values and all possible assignments of hidden variable values to the instances in the dataset (Cooper & Herskovits, 1992; Cooper, 1994). To be faster than the normative approach, a learner must have some way to find a highly probable assignment without considering all possible assignments of factor values to instances. FL uses an incremental, greedy search for a single most probable assignment of values. FL uses a variation of Anderson and Matessa?s (1992) single factor algorithm. It assigns factor-values to each instance one at a time. The assignment for the nth instance is the most probable given the current set of values for factors which is based on the first n-1 instances. Also, FL does not consider all possible combinations of factor values when seeking the most probable assignment for an instance. Instead, it finds the most probable value for each factor serially, each time including the influence of previously chosen factor-values. For example, if there are three binary factors, there are eight possible combinations of factor values. FL does not examine all eight possi- bilities. Rather it greedily considers the values serially. To reduce the bias that serial assignment introduces, FL performs the assignment to factors in a random order. In all cases where probabilities must be estimated from frequencies we used the following formula: The quantities, , are parameters of a Dirichlet distribution (one for each value of the variable) and is the sum over those parameters. We set all the ?s to be 1.0 and never vary them. The probability of a new value is a constant and its magnitude is a parameter. FL is very fast. Each of m instances is tested against each of r values for each of the factors, making for tests. Each test iterates over all observable variables, producing an overall cost of . If we assume a constant number of values, this becomes . 3.3 An example Figure 1 shows an example factor structure. We took a random sample of 5000 instances based on the distribution shown in the Figure and gave those instances to FL. In the Figure, factors are represented by rectangular boxes and observable variables are represented by circles. Each of the values for each factor has an associated prior probability that is given in each box. As well, there is a conditional probability of each observable variable value given any factor values. Figure 1: The target factors from which the example data was generated. Those conditional probabilities are represented in the matrices associated with the arrows. For example, the probability of V3 = 1 given that A has value 1 and B has value 0 is 0.2. Also, the probability of V4 =0 given that B has value 1 and C has value 1 is 0.1 Given this data, FL first computed the classical test of association for all pairs of variables. It returns the probability of the observed association between the variables given that the variables are independent. Table 1 shows the results of the tests of association. FL concludes that there are dependencies between V1, V2, and V3, a dependency between V3 and V4, and dependency between V4 and V5. As a result, FL creates three factors: one that influences variables V1, V2, and V3, one that influences variables V3 and V4, and one that influences V4 and V5. After learning, FL recovered the correct factor structure as shown in Figure 1. However, for two of the factors, it proposed additional values that had extremely low prior probabilities. These extra values are artifacts of the incremental hillclimbing search. FL should have a means of eliminating those values that are not adequately supported by the instances. 4. Applying FL for Full-Text Clustering Although FL has many advantages over alternative document clustering approaches, it is still woefully inadequate for the task of automatic index generation. Several information retrieval researchers have applied statistical clustering methods to the organization of documents (Rasmussen, 1992). FL is a significant improvement over those approaches because it permits cross classification using overlapping hierarchies. The documents are not given one ?best? organization. Instead, FL permits several good organizations. In addition, FL, because it is incremental, naturally permits efficient cluster maintenance (Can & Ozkarahan, 1989). The structure of factors and clusters can be updated easily as new documents are added. Finally, unlike traditional clustering, FL?s applicability for particular tasks can be more easily evaluated because its underlying assumptions can be clearly stated. This is also true for such systems as Fisher?s (1987) COBWEB and Anderson and Matessa?s (1992) method. However, we are not here to sing the praises of this method. Rather we will highlight the ways in which FL is ill-suited to our task. Because of its underlying assumptions, FL does not seem the right tool to build a retrieval hierarchy of natural language documents. 1. It does not build a hierarchy 2. It cannot accept natural language 3. Even if two documents share features, people may not find those documents similar for all purposes. 4. If words are binary variables (they either occur or do not) there are too few shared variable values. 5. The majority of shared words are not at all correlated with document similarity. 6. Too many documents are very similar because they share the absence of many words. We discuss these problems with FL in terms of how we addressed them. Whenever there is a mismatch between a learning method and data, there are three general ways to reduce the difference: 1. Build a better learning method, Table 1. Triangular matrix containing the probability that two variables are independent. Probabilities below 0.05 taken as evidence of association. 2. Simplify or restrict the task until it meets the learning assumptions 3. Transform the data to increase the probability of uncovering important structure. 4.1 Building a Better Learning Method First, if there is enough time, the problem is general enough, and the task is well understood, we can design a specialized machine learning algorithm built just for the task. Clearly, we cannot do this for the whole task of natural language understanding. However, we can address some of the mismatch by modifying our algorithm. The FL algorithm does not learn hierarchies. We can simply pull a page out of COBWEB?s (Fisher, 1987) book and apply the clustering recursively. Once one level of factors has been formed, each cluster of documents can itself be further divided. This will clearly lead to a hierarchical structure. However, it should be clear that this is an ad hoc solution to achieve the task we want. There is no clear definition of what each level of the hierarchy means. There is no way to argue that this hierarchy is better than some other hierarchy. In fact, building a hierarchy in this way seems to violate one of the assumptions underlying FL. Once a set of factors has been learned, each cluster of documents should not have correlated variables. If there are no correlated variables, there is no reason to further subdivide the cluster. In general, however, subsequent levels of the hierarchy will cluster based on different variables. The first level of clustering will group based on a subset of the variables and result in clusters in which that subset of variables is no longer intercorrelated. 4.2 Making Index-building a Clustering Problem Building an index of full-text documents and using that index for browsing is not like building a taxonomy of plants and animals and then using the hierarchy to make predictions about new plants or animals. Table 2 summarizes two ways that makes clustering a difficult match for index-building: FL cannot use structure found in the data and the task of retrieval is not the same task that was being optimized during learning. Table 2. Meeting FL?s assumptions. Problem Our Approach Data represented without relations Extract keywords and proximity features Predicting relevance. Adopt the cluster hypothesis 4.2.1 No structure On the one hand, it is not as easy to identify unary predicates concerning a text as it is for animals. This is possibly because we have more experience translating relational information about animals into simple features. Whatever the case, in order to use FL or another clustering approach, we must translate the rich structure of language into simple features. There are many ways to do this. The simplest of which is to identify keywords in the text and remove all sequential or proximity information. At the other extreme is attempting to parse the entire text. Such parsing is very slow and current methods paradoxically lead to diminished performance (e.g., Soderland & Lehnert, 1994). Many information retrieval researchers have chosen some intermediate approach, including frequency counts , simple adjacency information and possibly rudimentary syntactic and semantic information (e.g., DeJong, 1979; Mauldin, 1991, Soderland & Lehnert, 1994). For our application, we have chosen to identify keywords and maintain frequency and proximity information, such that two documents are more similar if they have certain keywords within N of each other. This gives us considerable flexibility in mimicking syntactic and semantic structure without requiring natural language capability. More specifically, for every cluster, FL maintains a description of the frequency of every keyword that appeared and a matrix of the expected distance between any two keywords. 4.2.2 Prediction and Retrieval Even if this is an adequate simplification of natural language, library patrons want to use the factors to do something they were not designed to do. The factors and clusters are learned to recover the underlying factors that produced the documents. For example, it might cluster all the documents generated by one group of researchers. The result of FL?s learning would be ideal for predicting what other keywords are strongly correlated with a provided query of other keywords. However, a library patron simply wants documents relevant to their current real-world task. That task could be anything, and we cannot guarantee that FL will cluster together documents that the arbitrary user will find relevant. To be clear about what FL can do, we will place a restriction on what types of retrieval the patron can expect to achieve. The user?s definition of what documents are relevant to their task should be a list of words, phrases, and partial phrases. The strings of words that are allowed depends further on how the keywords are extracted (see following subsection). Finally, during retrieval, the patron will find their document by repeatedly choosing the most relevant word, phrase, or partial phrase from a menu. This process will lead them to the documents mostly likely to Clustering Full Text Documents IJCAI-95 Workshop on Data Engineering for Inductive Learning 12 share all their words, phrases, and partial phrases. In this view, relevance to the patron must be changed from relevance to a task to a list of ideas that themselves are possibly relevant to the task. The translation from the task to the list of ideas is the patron?s responsibility. This restriction is very similar to that made throughout the area of information retrieval and in the context of clustering is sometimes called the ?clustering hypothesis? (Rasmussen, 1992). 4.3 Engineering the Features Assuming FL keeps track of keywords and keyword positions, can the underlying structure of documents be learned? If the authors of the documents chose their words carefully, then the keywords likely will be most highly related to the key ideas of those documents. However, this is rarely true. Blindly applying FL to a set of documents and using only the words that occur in the documents as possible keywords leads to very poor clustering performance. Related documents do not get clustered together. 4.3.1 Too few matches There are three major reasons why FL does not work well using only the surface words (see Table 3). First, there are so many synonyms in English that it is hard to recognize two similar documents. They would have to use enough of the same words. A common solution to this problem is to use a dictionary or thesaurus to translate words into some canonical form and increase matching. We decided to use WordNet (Miller et al., 1992) to add to (rather than replace) the keywords. In particular, we added synonyms, hypernyms, and contains words. A hypernym is a word that is more general than the given word. ?Vehicle? is a hypernym of ?bicycle?. A contains word is a word that describes a system of objects, one of which is the given word. ?Automobile? is a contains word for ?radiator?. For efficiency, these words are only added when a document is being compared with another document or a summary of a cluster. For efficiency reasons, the only thesaurus words that are added are those that are shared by both the document and its comparison. We have found that adding synonyms is generally not useful. More often, the hypernyms fill that role adequately. Beside using the thesaurus, we also augment the keywords with a limited list of keywords from hyperlinks or references that point to or refer to the document being indexed. For example, in the WWW, most documents have pointers or hyperlinks to other documents. If you click on one of the hyperlinks, you are shown another document. Each of the hyperlinks has a short description of what document will be found by clicking there. The words in the short description for a given document can be useful in describing the content or type of document it is. Again, keywords in this class are only added to a cluster summary if it is shared by some of the documents in the cluster. 4.3.2 Too many matches by missing keywords Most or all the documents will be highly similar in that they do not share most of the dictionary of words. Each document contains only a fraction of the possible keywords. Two documents can be seen as very similar simply because they both omit 2000 important keywords. FL applied blindly would cluster unrelated documents solely because of the shared absence of words. This is a side effect of using binary variables to represent the presence or absence of words. For practical reasons, when we can we treat the absence of words as a missing variable. If two documents both do not have a particular keyword, we assume that they could and we just have not been told. Similarly, if a document and a summary of a cluster both do not have a particular keyword, we assume that they could share that keyword or not, we just do not know. However, if one document has a keyword and the other does not, we assume that the documents differ in that respect. We no longer assume that the document without the keyword has a missing variable. This is a messy bit of practical engineering that prevents a clear description of what the resulting clusters mean. If one document omits the word ?retrieval?, FL in some cases treats that as meaning that the document could have ?retrieval? if we read between the lines or it might not have the word ?retrieval?. We just do not know. However, in other cases, FL treats the omission as meaning that the document definitely does not have the word ?retrieval?. By adopting this policy, we greatly simplify the building of clusters by only focusing on clearly relevant keywords. As well, we eliminate the problem of strong similarity by omission. In the process, however, we interpret the data inconsistently. 4.3.3 Too many irrelevant matches The thesaurus and references were used to increase the overlap between related documents. There is still a problem with high overlap between unrelated documents. There are many words that are shared by documents just by virtue of both documents being in English. As well, there are many documents that use different meanings of the same words. The standard response to extraneous matching of keywords is to use a stop list of words that are too common. These are usually the 100 most common words. However, in a clustering setting, the stop list will be different for different clusters and can grow to be very large deep in the tree of clusters. Any word that appears in most or all the clusters of a factor are essentially stop words, words so common they can be ignored. We implemented a stop list as a list of words and the probability that the word will be in an arbitrary document. Any word more probable than a threshold is ignored. At each level of the hierarchy, the stop list is augmented by the cluster summary at the next level up. Essentially, this implements a statistical measure of informativeness or interestingness. 4.4 Preliminary Results Preliminary results suggest that at least our decisions about augmenting and filtering the words of documents helps organize documents intuitively. We accepted the Yahoo organization of documents as a gold standard and found a branch of that index with 54 documents (http://www.yahoo.com/ Society_and_Culture/Holidays/). We applied FL to build a hierarchical index for those documents and compared the learned hierarchy with the provided one. The best hierarchy generated by FL when only actual words were used as features seems considerably different from the Yahoo directory, sharing only a few headings. When we permitted the introduction of synonyms, hypernyms, and partof terms, the best FL hierarchy shared half of the headings with the Yahoo directory. 5. Discussion When faced with a mismatch between a database and existing machine learning method, we have three choices. Either the dataset and task characteristics seem general enough to motivate the design of a specialized machine learning algorithm, Clustering Full Text Documents IJCAI-95 Workshop on Data Engineering for Inductive Learning 15 the task can be redefined or simplified, or the data can be engineered to fit an existing learning method. When applying to the problem of automatic generation of a hierarchical index, we did all three. In the best of all possible worlds, we would always generalize existing learning methods to include our new task. However, for the next decade, this will not be possible when dealing with natural language data. We must make do, simplifying our tasks and engineering our data. Acknowledgments Many of the ideas presented in this paper concerning feature engineering arose in my frequent conversations with Peter Turney. References Anderson, J. R. & Matessa, M. (1992). Explorations of an incremental, Bayesian algorithm for categorization. Machine Learning, 9, 275-308. Berners-Lee, T. The World Wide Web. Communications of the ACM, 37. Can, F. & Ozkarahan, E. A. (1989) Dynamic Cluster Maintenance. Information Processing and Management, 25, 275-291. Cooper, G.F. (1994). A Bayesian method for learning belief networks that contain hidden variables. Journal of Intelligent Information Systems. Cooper, G.F. & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347. G. F. DeJong. (1979). Prediction and Substantiation: a New Approach to Natural Language Processing. Cognitive Science, 3. Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2, 139-172. Martin, J. D. and Billman, D. O. (1994). The acquisition and use of overlapping concepts, Machine Learning. Mauldin, M. L. (1991). Retrieval Performance in FERRET: A Conceptual Information Retrieval System, In the Proceedings of the 14th International Conference on Research and Development in Information Retrieval. Rasmussen, E. (1992). Clustering Algorithms. In W. B. Frakes & R. Baeza-Yates (Eds), Information Retrieval: Data Structures and Algorithms. New Jersey: Prentice Hall. Soderland, S. and Lehnert, W. (1994). Wrap-Up: a Trainable Discourse Module for Information Extraction, Journal of Artificial Intelligence Research, 2, 131- 158.