- (Velardi et al., 2008) ⇒ Paola Velardi, Roberto Navigli, and Pierluigi D'Amadio. (2008). “Mining the Web to Create Specialized Glossaries.” In: IEEE Intelligent Systems Journal, 23(5). doi:10.1109/MIS.2008.88
- Web text analysis; natural language processing; artificial intelligence; knowledge acquisition; knowledge management
Glossaries are helpful for integrating information, reducing semantic heterogeneity, and facilitating communication between information systems. Commercial publishers charge lexicographers with building glossaries, but this isn't appropriate when a domain's semantics are continuously evolving rather than precisely characterized, as in emerging Web communities and interest groups. In emerging domains, glossary building is the cooperative effort of a team of domain experts. It involves several steps, including identifying the domain-relevant terminology, defining each term, and harmonizing the results. This is a time-consuming, costly process that often requires support from a collaborative platform to facilitate shared decisions and validation. TermExtractor and GlossExtractor, two Web applications based on Web mining techniques, support this complete glossary-building procedure. The tools exploit the Web's evolving nature, allowing one to continually update the emerging community's vocabulary. TermExtractor and GlossExtractor, which were used in the European project Interop, are freely available and are being used in experiments in different domains across the world. This article is part of a special issue on Natural Language Processing and the Web.
Related work in glossary extraction
In the past few years, the Web has been a valuable resource for extracting several kinds of information (such as terminologies, facts, and social-networking information) to aid text-mining methods. Nonetheless, the literature on automatic glossary extraction, especially from the Web, isn’t very rich. Atsushi Fujii and Tetsuya Ishikawa presented a method that extracts fragments of Web pages using patterns typically used to describe terms. They used an n-gram model to quantify how well formed a given text fragment was. Although they used a clustering approach to identify the most representative definitions, they didn’t perform domain-based selection. Youngja Park, Roy Byrd, and Branimir Boguraev’s method, implemented in the GlossEx system, extracts glossary items (that is, a terminology) on the basis of a pipeline architecture. This method performs domain-oriented filtering that improves over the state of the art in terminology extraction; nonetheless, it provides no means for learning definitions for the extracted terms. Judith Klavans and Smaranda Muresan’s text-mining method, the Definder system, extracts definitions embedded in online texts. The system is based on pattern matching at the lexical level and guided by cue phrases such as “is called” and “is defined as.” However, as in Fujii and Ishikawa’s method, the Definder system applies patterns in an unconstrained manner, which implies low precision when applied to the Web rather than to a domain corpus.
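As a rough illustration of cue-phrase matching at the lexical level, the following Python sketch flags sentences in which a term co-occurs with a cue phrase. It is not the Definder implementation: the function name, the regular expression, and the example sentences are assumptions made for illustration, and only the two cue phrases come from the text above.

```python
import re

# Only these two cue phrases are taken from the text; a real system would use more.
CUE_PHRASES = ["is called", "is defined as"]


def find_definition_candidates(term, sentences, cue_phrases=CUE_PHRASES):
    """Return the sentences in which `term` is adjacent to a cue phrase.

    Hypothetical helper: matches "<term> <cue> ..." and "... <cue> [a|an|the] <term>".
    """
    cues = "|".join(re.escape(cue) for cue in cue_phrases)
    term_re = re.escape(term)
    pattern = re.compile(
        rf"\b{term_re}\b\s+(?:{cues})\b"                      # term before cue
        rf"|\b(?:{cues})\s+(?:an?\s+|the\s+)?{term_re}\b",    # cue before term
        re.IGNORECASE,
    )
    return [sentence for sentence in sentences if pattern.search(sentence)]


# Made-up sentences: two senses of "bridge" and one non-definition.
EXAMPLE_SENTENCES = [
    "A bridge is defined as a structure that spans a physical obstacle.",
    "A device that connects two network segments is called a bridge.",
    "The bridge was closed for repairs last week.",
]

if __name__ == "__main__":
    for hit in find_definition_candidates("bridge", EXAMPLE_SENTENCES):
        print(hit)
```

The matcher accepts both the civil-engineering and the networking sentence, because nothing in the pattern constrains the domain. This is the kind of unconstrained behavior that leads to low precision on Web text.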
A problem closely related to glossary extraction is open-domain definitional question answering (QA), which focuses on unconstrained “who is X?” and “what is X?” questions. Many QA approaches, especially those evaluated in the TREC (Text Retrieval Conference, http://trec.nist.gov) evaluation competitions, require the availability of training data, such as large collections of sentences tagged as “definitions” or “nondefinitions.” Applying supervised approaches to glossary extraction in emerging domains would be costly and time-consuming. Ion Androutsopoulos and Dimitrios Galanis propose a weakly supervised approach in which feature vectors associated with each candidate definition are augmented with automatically learned patterns. These patterns are n-grams that surround a term for which a definition is sought. Horacio Saggion’s method extracts definitions from a large corpus using lexical patterns coupled with a search for cue terms that recur in dictionary and encyclopedic definitions of the target terms.
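The idea of n-gram patterns surrounding a term can be sketched as follows. This is a minimal Python illustration, not the feature extraction used by Androutsopoulos and Galanis: whitespace tokenization, single-token terms, the window size, and the function names are all assumptions.

```python
from collections import Counter


def context_ngrams(tokens, term, n=2):
    """Collect the n tokens immediately left and right of each occurrence of `term`.

    Assumes `term` is a single token; contexts shorter than n tokens are skipped.
    """
    patterns = []
    for i, token in enumerate(tokens):
        if token.lower() != term.lower():
            continue
        left = tuple(tokens[max(0, i - n):i])
        right = tuple(tokens[i + 1:i + 1 + n])
        if len(left) == n:
            patterns.append(("left", left))
        if len(right) == n:
            patterns.append(("right", right))
    return patterns


def count_patterns(snippets, term, n=2):
    """Count surrounding n-grams over several snippets (lowercased, split on whitespace)."""
    counts = Counter()
    for snippet in snippets:
        counts.update(context_ngrams(snippet.lower().split(), term, n))
    return counts


if __name__ == "__main__":
    # Made-up snippets for illustration only.
    snippets = [
        "an ontology is a formal specification",
        "the ontology is a shared vocabulary",
    ]
    print(count_patterns(snippets, "ontology").most_common())
```

N-grams that recur around many different terms, such as the right-hand context "is a" here, are the kind of pattern that could then be added to a candidate definition's feature vector.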
A recent article by Cui, Kan, and Chua proposes using probabilistic lexico-semantic patterns (called soft patterns) for definitional QA. These patterns cope better with the diversity of definition sentences than predefined patterns (or hard patterns) can. The authors evaluated the method on data sets made available through TREC. However, we must point out some important differences between glossary extraction and the TREC “what is” task.
First, in TREC, the goal is to strike the best balance between precision and recall; when the target is an emerging domain, recall is the most relevant performance figure. For certain novel concepts, there might be very few definitions (or just one). The aim is to capture the majority of them, because rejecting a bad definition takes just seconds, whereas manually creating a definition takes several minutes. (We consulted a professional lexicographer, Orin Hargraves, to estimate the manual effort of glossary development.)
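The cost argument for favoring recall can be made concrete with a back-of-the-envelope model. The Python sketch below is an assumption-laden illustration: the function, the 10-second review time, the 5-minute writing time, and the precision and recall figures are all hypothetical and do not come from the article, which says only "seconds" and "several minutes."

```python
def curation_minutes(n_terms, recall, precision,
                     review_seconds=10.0, write_minutes=5.0):
    """Estimate curator effort, in minutes, for a glossary of `n_terms` terms.

    Hypothetical model: every extracted candidate is read once (accepted or
    rejected), and every term left without a good definition is written by hand.
    """
    good = recall * n_terms                  # terms that get a usable definition
    candidates = good / precision            # good plus bad candidates to review
    missed = n_terms - good                  # terms that need a manual definition
    return candidates * review_seconds / 60.0 + missed * write_minutes


if __name__ == "__main__":
    high_recall = curation_minutes(100, recall=0.9, precision=0.5)
    high_precision = curation_minutes(100, recall=0.6, precision=0.9)
    print(f"high recall, low precision: {high_recall:.1f} min")
    print(f"high precision, low recall: {high_precision:.1f} min")
```

Under these made-up figures, the high-recall setting costs roughly 80 minutes against roughly 211 for the high-precision one, because each missed definition is far more expensive than each rejected candidate.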
Second, TREC questions span heterogeneous domains; that is, there’s no domain relation between definitions, in contrast with glossaries.
Third, the TREC evaluation data set doesn’t have true definitions but rather sentence “nuggets” that provide some relevant fact about a target. The following example shows a list of answers, classified as “vital” or just “okay,” for the query, “What is Bollywood?” from TREC 2003.
Qid 2201: What is Bollywood?
- 2201 1 vital is Bombay-based film industry
- 2201 2 okay bollywood is indian version of Hollywood
- 2201 3 okay name bollywood derived from Bombay
- 2201 4 vital second largest movie industry outside of hollywood
- 2201 5 okay the bollywood movies in UK’s top ten this year
- 2201 6 okay empty theaters because films shown on cable
- 2201 7 okay Bollywood awards equivalent to Oscars
- 2201 8 vital 700 or more films produced by india with 200 or more from bollywood
- 2201 9 vital Bolywood — an industry reeling from its worst year ever
This example shows that manually validated answers (even those classified as “vital”) aren’t always true definitions (for example, answer 8) and aren’t professionally written (for example, answer 9). Wikipedia gives a true definition of Bollywood: “Bollywood is the informal term popularly used for the Mumbai-based Hindi-language film industry in India.” Given the TREC task objectives, the use of soft matching criteria is certainly crucial, but in glossary definitions, style constraints should be followed closely to provide clarity and knowledge sharing, as well as to facilitate validation and agreement among specialists.
In summary, the state of the art on glossary and definition extraction only partially satisfies the objectives of the glossary extraction life cycle. In the main article, we present our glossary extraction system, which employs natural language techniques to exploit the Web’s potential.
- A. Fujii and T. Ishikawa, “Utilizing the World Wide Web as an Encyclopedia: Extracting Term Descriptions from Semi-Structured Texts,” Proceedings of 38th Ann. Meeting Assoc. for Computational Linguistics, Morgan Kaufmann, 2000, pp. 488–495.
- Y. Park, R. Byrd, and B. Boguraev, “Automatic Glossary Extraction: Beyond Terminology Identification,” Proceedings of 19th Int’l Conf. Computational Linguistics, Howard Int’l House and Academia Sinica, 2002, pp. 1–7.
- J. Klavans and S. Muresan, “Evaluation of the Definder System for Fully Automatic Glossary Construction,” Proceedings of American Medical Informatics Association Symp., 2001, pp. 324–328.
- S. Miliaraki and I. Androutsopoulos, “Learning to Identify Single-Snippet Answers to Definition Questions,” Proceedings of 20th Int’l Conf. Computational Linguistics, Morgan Kaufmann, 2004, pp. 1360–1366.
- H.T. Ng, J.L.P. Kwan, and Y. Xia, “Question Answering Using a Large Text Database: A Machine Learning Approach,” Proceedings of Conf. Empirical Methods in Natural Language Processing, Assoc. for Computational Linguistics, 2001, pp. 67–73.
- I. Androutsopoulos and D. Galanis, “A Practically Unsupervised Learning Method to Identify Single-Snippet Answers to Definition Questions on the Web,” Proceedings of Human Language Technology Conf. and Conf. Empirical Methods in Natural Language Processing, Assoc. for Computational Linguistics, 2005, pp. 323–330.
- H. Saggion, “Identifying Definitions in Text Collections for Question Answering,” Proceedings of Language Resources and Evaluation Conf., European Language Resources Assoc., 2004.
- H. Cui, M.K. Kan, and T.S. Chua, “Soft Pattern Matching Models for Definitional Question Answering,” ACM Trans. Information Systems, vol. 25, no. 2, 2007, pp. 1–30.