TOPIC INDEXING WITH WIKIPEDIA


Wikipedia can be used as a controlled vocabulary for identifying the main topics in a document, with article titles serving as index terms and redirect titles as their synonyms. Wikipedia contains over 4M such titles, covering the terminology of nearly any document collection. This permits controlled indexing in the absence of manually created vocabularies. We combine state-of-the-art strategies for automatic controlled indexing with Wikipedia's unique property: it is a richly hyperlinked encyclopedia. We evaluate the scheme by comparing automatically assigned topics with those chosen manually by human indexers. Analysis of indexing consistency shows that our algorithm performs as well as the average person.
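The idea of treating article titles as index terms and redirect titles as their synonyms can be sketched as a simple lookup. This is only a minimal illustration of the vocabulary-matching step, not the method described in the paper: the names ARTICLES, REDIRECTS, build_lookup and candidate_topics are hypothetical, the entries are a toy sample rather than a real Wikipedia dump, and the ranking of candidates (including any use of hyperlinks) is not shown.

```python
# Toy controlled vocabulary: article titles are the index terms,
# redirect titles are synonyms that resolve to an article title.
# All entries below are illustrative, not a real Wikipedia dump.
ARTICLES = {"Central processing unit", "Regression testing", "Parsing"}
REDIRECTS = {"CPU": "Central processing unit", "Parser": "Parsing"}


def normalize(phrase):
    """Case-fold and collapse whitespace so lookups are forgiving."""
    return " ".join(phrase.lower().split())


def build_lookup(articles, redirects):
    """Map every normalized title or redirect to its article title."""
    lookup = {normalize(title): title for title in articles}
    for synonym, target in redirects.items():
        if target in articles:  # skip redirects to unknown articles
            lookup[normalize(synonym)] = target
    return lookup


def candidate_topics(text, lookup, max_words=3):
    """Return article titles matched by any word n-gram in the text."""
    words = [w.strip(".,;:()") for w in text.split()]
    found = set()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            key = normalize(" ".join(words[i:i + n]))
            if key in lookup:
                found.add(lookup[key])
    return found


LOOKUP = build_lookup(ARTICLES, REDIRECTS)
print(sorted(candidate_topics("The parser runs on one CPU.", LOOKUP)))
```

Because both titles and redirects resolve to the same article title, different surface forms in a document map to a single index term.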

Full paper (to appear in Proceedings of the WikiAI Workshop at AAAI-2008, Chicago, US)

To evaluate the approach, we used 20 Computer Science technical reports.
Fifteen teams of two senior CS undergraduates each independently assigned topics from Wikipedia to each report. The following table compares their topics with those picked automatically by our algorithm.
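Document | Topics by the teams (number of teams that chose each Wikipedia article) | Topics by the algorithm

For each document, the lines with a leading count list the teams' topics; the five lines without a count that follow are the topics assigned by the algorithm.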
10894.txt 15 Regression testing
13 Software maintenance
10 Control flow graph
9 Software testing
7 Algorithm
4 Software development process
3 Interprocedural optimization
2 Code coverage
2 Test suite
2 Software engineering
1 Revision control
1 Selection algorithm
1 Computer software
1 Upgrade
1 Quality control
1 Subroutine
1 Regression
1 Procedure
1 Procedural programming
1 Performance analysis
1 Software quality
1 Test automation
Regression testing
Quality control
Algorithm
Software maintenance
Test suite
12049.txt 13 Yacc
12 Parsing
9 Compiler-compiler
9 Backus Naur form
6 Compiler
6 Occam (programming language)
5 Occam's razor
4 Programming language
3 Lexical analysis
2 C (programming language)
2 Van Wijngaarden grammar
2 Formal grammar
2 Lex programming tool
1 Code generation
1 Modularity (programming)
1 Stack-oriented programming language
1 Parsing expression grammar
1 Context-free grammar
1 ANSI escape code
1 Evaluation
1 Syntax (logic)
1 Two-level grammar
1 Subroutine
1 Technology
1 Decompiler
1 Syntax
1 Regular grammar
1 Regular language
1 Formal specification
1 Run-time system
1 Declarative programming
1 High-level programming language
1 Abstract art
1 Formal language
Yacc
Parsing
Occam (programming language)
Lexical analysis
Compiler-compiler
13259.txt 7 Hierarchical model
7 3D computer graphics
6 Visualization (graphic)
5 Tree (data structure)
3 Computer graphics
3 Tree structure
3 Constructive solid geometry
3 Treemapping
2 Fisheye lens
2 GUI widget
2 3D modeling
2 Visualization
2 Command line interface
2 Scientific visualization
2 PARC (company)
2 Unix
2 Object-oriented programming
2 Human-computer interaction
2 Rapid prototyping
1 Dimension
1 Product visualization
1 Computer display
1 Computer animation
1 Visual communication
1 Hierarchical Data Format
1 Object oriented design
1 Color code
1 Animation
1 Tree (Unix)
1 Code
1 Model (person)
1 Coding
1 UXGA
1 Software visualization
1 Three-dimensional space
1 Scripting language
1 3D computer graphics software
1 Programming language
1 Object-oriented programming language
1 Computer Graphics (Publication)
1 Shape
1 Channel (communications)
1 Icon
1 Information theory
1 Hierarchy
1 File (Unix)
Map projection
Visualization (graphic)
3D computer graphics
Syntax
Unix
16393.txt 15 Virtual memory
9 Multiprocessing
9 Cache coherency
7 Consistency model
6 Cache
6 Sequential consistency
4 Shared memory
4 Release consistency
3 Distributed shared memory
3 Memory management unit
2 Physical address
2 Central processing unit
2 Synchronization (computer science)
2 Coherence protocol
2 Memory coherence
2 Parallel computing
1 Memory management
1 Bus (computing)
1 Computer architecture
1 Application software
1 Thread (computer science)
1 Access time
1 Mechanism
1 Computer simulation
1 Bus sniffing
1 Virtual machine
1 Synchronization
1 Uniprocessor
1 Concurrent computing
1 System software
1 Scalability
1 Latency (engineering)
1 Gaussian elimination
Cache coherency
Central processing unit
Cache
Computer data storage
Consistency model
18209.txt 15 Object-oriented programming
14 Logic programming
6 Linear logic
5 Immutable object
4 Deductive database
3 Object (computer science)
2 Dynamic logic (modal logic)
2 Logic
2 State (computer science)
2 Assertion (computing)
2 Programming paradigm
2 Encapsulation
2 Dynamic logic
1 Concurrency (computer science)
1 Linear
1 Semantics of logic
1 Database
1 Class (computer science)
1 Rewriting Techniques and Applications
1 Class invariant
1 Theory of Computing (journal)
1 Computer science
1 Computer programming
1 Object-Oriented Modeling
1 Modal logic
1 Validity
1 Transaction Processing Facility
1 Event calculus
1 Type polymorphism
1 Abstract data type
1 Prolog
1 Object database
1 Object-oriented programming language
1 Multi-paradigm programming language
1 Mathematical logic
1 Object composition
1 Logic Programming Associates
1 Resource (computer science)
1 Inheritance
1 Dynamic programming
1 Logic in computer science
1 First-order logic
Logic programming
Object-oriented programming
Logic
Prolog
Object (computer science)
19970.txt 13 Sorting algorithm
10 Parallel computing
6 Deterministic algorithm
5 Computational complexity theory
4 Load balancing (computing)
4 Parallel algorithm
3 Deterministic automaton
3 High-level programming language
2 Distributed computing
2 Platform (computing)
2 Algorithm
2 Split-C
2 Central processing unit
2 Optimization (mathematics)
2 Optimization (computer science)
2 Scalability
2 Sorting
2 Cray T3D
2 Performance evaluation
1 Performance testing
1 Collective
1 Complexity
1 Cost modeling
1 Randomized algorithm
1 Computation time
1 Benchmark
1 Computer simulation
1 Subsequence
1 Performance
1 Analysis of algorithms
1 Benchmark (computing)
1 Runtime
1 Sampling (statistics)
1 Sampling (music)
1 IBM
Sorting algorithm
Algorithm
Central processing unit
NP (complexity)
Benchmark (computing)
20782.txt 15 Geographic information system
9 Parallel computing
9 Distributed Interactive Simulation
8 Load balancing (computing)
4 Parallel programming model
3 Visualization
3 Visualization (graphic)
3 Virtual reality
3 Spatial analysis
2 High-performance computing
2 Address space
2 Uniform Memory Access
2 United States Army Research Laboratory
2 Optimization (computer science)
2 3D computer graphics
1 Military
1 Data set
1 Performance engineering
1 Computer graphics
1 Evaluation
1 Model (person)
1 Silicon Graphics Image
1 Simulation
1 Fetch
1 Real-time computing
1 Programming language
1 Interaction
1 Map
1 Microprocessor
1 Polygonal modeling
1 Army
1 Message passing
1 Idle
1 Geographic data
Geographic information system
Polygon
Central processing unit
Message passing
Load balancing (computing)
23267.txt 12 Programming language
6 C++
6 Component-based software engineering
6 Encapsulation
5 Abstraction (computer science)
5 Software architecture
5 Inheritance (computer science)
4 Database management system
4 Information hiding
4 Software engineering
3 Parametrization
3 Object-oriented programming
3 High-level programming language
2 Domain-specific programming language
2 Systems design
2 Abstraction
2 Domain model
1 Architecture
1 Array programming
1 Software framework
1 Computer programming
1 Component
1 Methodology (software engineering)
1 David Parnas
1 Modeling language
1 Parameter (computer science)
1 Software development process
1 Software architect
1 Software development
1 Procedural programming
1 Separation of concerns
1 Object-oriented programming language
1 Design pattern (computer science)
1 Extensibility
1 Design paradigm
1 Automatic programming
Computer software
Information hiding
Programming language
Software engineering
Object-oriented programming
23507.txt 11 Parody
9 Language model
9 Artificial intelligence
7 Vocabulary
6 Natural language processing
5 Ernest Hemingway
4 Natural language generation
4 Grammar
3 Data compression
3 Statistics
2 Logical disjunction
2 Computer
2 Computational linguistics
2 Phrase structure rules
2 Lexical category
2 Thomas Hardy
2 Semantics
2 Natural language
1 Context-free language
1 Pattern
1 Computational semantics
1 Generalised phrase structure grammar
1 Computer program
1 Imitation
1 Prose
1 Inferential programming
1 Grammatology
1 Phrase
1 Turing test
1 Grammar induction
1 Data mining
1 Artificial Creativity
1 Statistical model
1 Computer-generated
1 Pseudorandomness
1 Stylometry
1 Linguistics
1 Hypothesis
1 Formal grammar
Parody
Ernest Hemingway
Grammar
Lexical category
Computer science
23596.txt 8 Communication
8 Computer supported cooperative work
7 Collaborative software
7 Collaborative workspace
5 Collaboration
5 Human-computer interaction
4 Face-to-face
3 Questionnaire
2 Small-group communication
2 Interactivity
2 Experiment
2 Cooperation
2 Communication studies
2 Interpersonal communication
2 Computer-supported collaboration
2 Interaction
2 Empirical studies
2 Problem solving
1 Pattern
1 Observational study
1 Natural environment
1 Conversation analysis
1 Collaborative Networked Learning
1 Mass collaboration
1 Online deliberation
1 Video
1 Computer-mediated communication
1 Videoconferencing
1 Constructivism (art)
1 Human communication
1 Large-group communication
1 User (computing)
1 Group collaboration
1 Workspace
1 Jigsaw puzzle
1 Teamwork
1 Group process
Communication
Sound recording and reproduction
Face-to-face
Computer-supported collaboration
Computer supported cooperative work
25473.txt 7 Content-based image retrieval
5 Image processing
5 Computer vision
5 Image compression
5 Feature extraction
5 Information systems
3 Digital image processing
3 Data compression
2 Data structure
2 Database
2 Query
2 Discrete cosine transform
2 Quadtree
2 Color histogram
2 Image retrieval
2 Signal processing
2 Multimedia
2 Index (database)
2 Pattern recognition
1 Mahalanobis distance
1 Algorithm
1 Computational complexity theory
1 Prototype
1 Visual communication
1 Visualization (graphic)
1 Video compression
1 Visionaire (software)
1 Wavelet
1 Query language
1 Texture
1 Image analysis
1 Wavelet series
1 Feature (Computer vision)
1 Tree (data structure)
1 Visual system
1 Image editing
1 RGB color model
1 Gabor filter
1 Information retrieval
1 Prototyping
1 Index (publishing)
Texture mapping
Color
Database
Wavelet
Data compression
287.txt 13 Machine learning
10 Cluster analysis
8 Information retrieval
6 Index (search engine)
5 Natural language
4 Information Engineering
4 Unsupervised learning
4 Natural language processing
4 Index
3 Data mining
3 Full text search
2 Algorithm
2 Inverted index
2 Index (information technology)
2 Hierarchy
2 Index (publishing)
1 Engineering
1 Digital library
1 Dewey Decimal Classification
1 Document retrieval
1 Document management system
1 Keywords
1 Inductive reasoning
1 Hierarchical classifier
1 Table of contents
1 Optimization (computer science)
1 Library of Congress
1 Index (database)
1 Dirichlet distribution
Machine learning
Cluster analysis
Hierarchy
Information retrieval
Natural language
37632.txt 12 Software visualization
8 Electronic learning
6 Distance education
4 Education
4 Internet
3 Computer programming
3 Computer supported cooperative work
2 Visualization
2 Knowledge engineering
2 Computer-supported collaborative learning
2 Virtual learning environment
2 Web server
2 Web application
2 Software engineering
2 Teacher
2 Educational technology
1 Algorithm
1 Client-server
1 Knowledge visualization
1 Computer program
1 Communication
1 Synchronous learning
1 Online tutoring
1 Curriculum
1 Face-to-face
1 Collaborative software
1 Collaborative Networked Learning
1 Virtual reality
1 Collaboration
1 Synchronization
1 Sorting algorithm
1 Software development
1 Prolog
1 Information design
1 Social software
1 Laboratory
1 Learning management system
1 Student
1 Information
1 Hypertext Transfer Protocol
1 Asynchronous learning
Software visualization
Internet
Prolog
Eisenstadt
Computer programming
39172.txt 12 NP-complete
10 String searching algorithm
5 Computational complexity theory
5 Levenshtein distance
5 String (computer science)
4 Polynomial time
4 String-to-string correction problem
3 Approximate string matching
3 Pattern matching
3 Molecular biology
2 Algorithm
2 BLAST
2 Substring
2 Handwriting recognition
2 Tablet PC
2 Dynamic programming
1 Graphology
1 Polynomial-time approximation scheme
1 Approximation algorithm
1 Optical character recognition
1 Edit distance
1 Distance
1 Disjoint sets
1 Statement block
1 Fuzzy string searching
1 Polynomial
1 Lemma (mathematics)
1 Information retrieval
1 Graphonomics
1 Bioinformatics sequence alignment
1 Pattern recognition
Edit distance
String (computer science)
NP-complete
Approximate string matching
Compact Disc
39955.txt 15 Object-oriented programming
6 Software engineering
6 Structured interview
6 Interview
5 Software maintenance
5 Programming paradigm
4 C++
4 Inheritance (computer science)
3 Empirical research
3 Empirical method
2 Carpool
2 Empirical
2 High-level programming language
2 Object-based language
2 Inheritance
1 Entropy
1 Questionnaire
1 Information hiding
1 Experiment
1 Polymorphism (biology)
1 Mathematical analysis
1 Assessment
1 Systems design
1 Quantitative research
1 Introspection
1 Dynamic binding
1 Type polymorphism
1 Polymorphism in object-oriented programming
1 Software development
1 Object-oriented programming language
1 Object-oriented software engineering
1 Object-based
1 Name binding
1 Hierarchy (object-oriented programming)
1 Software quality
Object-oriented programming
Software maintenance
Empirical
Inheritance (computer science)
Multiple inheritance
40879.txt 14 Machine learning
9 Nearest neighbour algorithm
5 Algorithm
5 Training set
4 Artificial intelligence
3 Computer data storage
3 Data mining
3 K-nearest neighbor algorithm
2 Reduction (complexity)
2 Pruning (algorithm)
2 Pruning
2 Cross-validation
2 Optimization (computer science)
2 Supervised learning
2 Generalization
2 Pattern recognition
1 Heuristic (computer science)
1 Computational complexity theory
1 Cluster analysis
1 Euclidean distance
1 Noise reduction
1 Data cleansing
1 Nearest neighbor interpolation
1 Search algorithm
1 Pattern matching
1 Data analysis
1 Generalization error
1 Runtime
1 Case-based reasoning
1 Object (computer science)
1 Kd-tree
1 Reduction (mathematics)
Algorithm
Training set
K-nearest neighbor algorithm
Machine learning
Nearest neighbor search
43032.txt 15 Internationalization and localization
6 User interface
6 Software engineering
5 Translation
5 Linguistics
4 Character encoding
3 Culture
3 Computer program
3 Computer software
3 Globalization
3 Human-computer interaction
2 Cultural identity
2 Usability
2 Compile time
2 Language
2 Multilingualism
2 Localization
1 Software maintenance
1 Collation
1 Translations
1 Categorization
1 Language localisation
1 American National Standards Institute
1 Interface (computer science)
1 Abstraction (computer science)
1 Text user interface
1 Kanji
1 Code refactoring
1 Software development
1 Liquid-liquid extraction
1 Cross-cultural communication
1 User interface design
1 Natural language
1 Machine translation
Internationalization and localization
Jakob Nielsen (usability consultant)
Computer software
User interface
String (computer science)
7183.txt 17 Expert system
12 Artificial intelligence
6 Model (abstract)
5 Abstraction
5 Knowledge base
4 Inference
3 Domain knowledge
3 Abstraction (computer science)
3 Knowledge-based systems
2 Medical cybernetics
2 Methodology (software engineering)
2 Knowledge engineers
2 KADS
1 Categorization
1 Cybernetics
1 Clinical decision support system
1 Mathematical logic
1 Medicine
1 Case-based reasoning
1 Knowledge representation
1 Problem solving
1 Domain expert
1 Model-based testing
1 Deep inference
Artificial intelligence
Expert system
Inference
Knowledge-based systems
Medicine
7502.txt 10 Machine learning
7 Reasoning
7 Introspection
6 Artificial intelligence
3 Learning
2 Algorithm
2 Inductive reasoning
2 Strategy
2 Failure analysis
2 Case-based reasoning
2 Index
2 Problem solving
2 Meta
1 Node (networking)
1 Regression analysis
1 Reinforcement learning
1 Learning by doing
1 Logical reasoning
1 Deductive reasoning
1 Cognitive science
1 Meta learning (computer science)
1 Knowledge engineering
1 Control flow
1 Software agent
1 Failure
1 Data mining
1 Type introspection
1 Logic
1 Question answering
1 Windows XP
1 Inference
1 Taxonomy
1 Metaknowledge
1 Prediction
1 Knowledge
1 Meta learning
1 Performance
1 Mathematical logic
1 Knowledge engineers
1 Failure assessment
1 Learning theory (education)
1 Computational intelligence
1 Information retrieval
1 Knowledge-based systems
1 Decision model
1 Expert system
Random access memory
Reasoning
XML Paper Specification
Aqua (band)
Learning
9307.txt 13 Object-oriented programming
7 Software development process
7 Software engineering
5 Class (computer science)
4 Computer-aided software engineering
3 C++
3 Product life cycle management
2 Adaptive system
2 Pattern
2 Object oriented design
2 Wave propagation
2 Object-Oriented Modeling
2 Class diagram
2 Programming language
2 Object (computer science)
2 Graph theory
1 Code generation
1 Class-based programming
1 Behavioral pattern
1 Data model
1 Software maintenance
1 Data modeling
1 Event-driven programming
1 Graph (data structure)
1 Descriptive linguistics
1 Software Lifecycle Processes
1 Abstraction
1 Social class
1 Component-based software engineering
1 Inheritance (computer science)
1 Object graph
1 Law of Demeter
1 Object-oriented programming language
1 Object-oriented software engineering
1 High-level programming language
1 Inheritance
1 Propagation of schema
1 Novell Evolution
Radio propagation
Graph (mathematics)
Object-oriented programming
Class (computer science)
Evolution
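
As a rough way to read the table, one can tally how many teams chose each topic the algorithm assigned. The sketch below does this for 10894.txt using counts copied from the table. It is not the indexing-consistency analysis from the paper, which would need each team's individual topic set; only aggregate counts are listed here. The names TEAM_COUNTS, ALGORITHM_TOPICS, team_support and fraction_supported are hypothetical.

```python
# Team counts for 10894.txt, copied from the table above.
# Topics chosen by a single team are omitted, except "Quality control",
# which is needed because the algorithm also assigned it.
TEAM_COUNTS = {
    "Regression testing": 15,
    "Software maintenance": 13,
    "Control flow graph": 10,
    "Software testing": 9,
    "Algorithm": 7,
    "Software development process": 4,
    "Interprocedural optimization": 3,
    "Code coverage": 2,
    "Test suite": 2,
    "Software engineering": 2,
    "Quality control": 1,
}

# The five topics the algorithm assigned to 10894.txt.
ALGORITHM_TOPICS = [
    "Regression testing",
    "Quality control",
    "Algorithm",
    "Software maintenance",
    "Test suite",
]


def team_support(algorithm_topics, team_counts):
    """Number of teams that chose each algorithm topic (0 if none did)."""
    return {topic: team_counts.get(topic, 0) for topic in algorithm_topics}


def fraction_supported(algorithm_topics, team_counts, min_teams=1):
    """Share of algorithm topics chosen by at least min_teams teams."""
    support = team_support(algorithm_topics, team_counts)
    hits = sum(1 for count in support.values() if count >= min_teams)
    return hits / len(algorithm_topics)


support = team_support(ALGORITHM_TOPICS, TEAM_COUNTS)
for topic, count in support.items():
    print(f"{count:2d} teams  {topic}")
print(fraction_supported(ALGORITHM_TOPICS, TEAM_COUNTS, min_teams=2))
```

Raising min_teams gives a stricter reading: a topic picked by a single team counts as agreement at min_teams=1 but not at min_teams=2.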