- (Wong et al., 2012) ⇒ Wilson Wong, Wei Liu, and Mohammed Bennamoun. (2012). “Ontology Learning from Text: A Look Back and Into the Future.” In: ACM Computing Surveys (CSUR) Journal, 44(4). doi:10.1145/2333112.2333115
Subject Headings: Ontology-Learning from Text.
- Ontology learning; application of ontologies; concept discovery; semantic relation acquisition; term recognition
Ontologies are often viewed as the answer to the need for interoperable semantics in modern information systems. The explosion of textual information on the Read/Write Web coupled with the increasing demand for ontologies to power the Semantic Web have made (semi-)automatic ontology learning from text a very promising research area. This together with the advanced state in related areas, such as natural language processing, have fueled research into ontology learning over the past decade. This survey looks at how far we have come since the turn of the millennium and discusses the remaining challenges that will define the research directions in this area in the near future.
Advances in areas such as natural language processing, information retrieval, machine learning, data mining, and knowledge representation have been fundamental in our quest for means to make sense of an ever-growing body of textual information in electronic forms, known simply as information from here on. The intermingling of techniques from these areas has enabled us to extract and represent facts and patterns for improving the management, access, and interpretability of information. However, it was not until the turn of the millennium with the Semantic Web dream [Maedche and Staab 2001] and the explosion of information due to the Read/Write Web that the need for a systematic body of study in large-scale extraction and representation of facts and patterns became more obvious. Over the years, that realization gave rise to a research area now known as ontology learning from text, which aims to turn facts and patterns from an ever-growing body of information into shareable high-level constructs for enhancing everyday applications (e.g., Web search) and enabling intelligent systems (e.g., Semantic Web).
Ontologies are effectively formal and explicit specifications, in the form of concepts and relations, of shared conceptualizations [Gruber 1993]. Ontologies may contain axioms for validation and enforcing constraints. There has always been a subtle confusion or controversy regarding the difference between an ontology and a knowledge base. In an attempt to draw a line between these two structures, consider the loosely applicable analogy where ontologies are cupcake molds and knowledge bases are the actual cupcakes of assorted colours, tastes, and so on. Ontologies, in this sense, represent the intensional aspect of a domain for governing the way the corresponding knowledge bases (i.e., extensional aspect) are populated [Buitelaar et al. 2005]. In other words, every knowledge base has to be committed to a conceptualization, whether implicitly or explicitly. This conceptualization is what we refer to as ontologies [Gruber 1993]. With this in mind, knowledge bases can be created by extracting the relevant instances from information to populate the corresponding ontologies, a process known as ontology population or knowledge markup. Ontology learning from text is then essentially the process of deriving high-level concepts and relations as well as the occasional axioms from information to form an ontology.
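The mold-and-cupcake distinction above can be pictured with a small data-structure sketch. This is only an illustration: the class names `Ontology` and `KnowledgeBase`, their fields, and the cupcake data are hypothetical and are not taken from the survey. The ontology holds concepts and relation signatures (the intensional side), and the knowledge base accepts only instances and facts that commit to it (the extensional side).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Intensional side: concepts and the relations allowed between them."""
    concepts: set[str]
    # relation name -> (domain concept, range concept)
    relations: dict[str, tuple[str, str]]


@dataclass
class KnowledgeBase:
    """Extensional side: instances and facts committed to one ontology."""
    ontology: Ontology
    instances: dict[str, str] = field(default_factory=dict)  # instance -> concept
    facts: set[tuple[str, str, str]] = field(default_factory=set)

    def add_instance(self, name: str, concept: str) -> None:
        # Ontology population: an instance must belong to a known concept.
        if concept not in self.ontology.concepts:
            raise ValueError(f"unknown concept: {concept}")
        self.instances[name] = concept

    def add_fact(self, subject: str, relation: str, obj: str) -> None:
        # A fact must use a known relation with matching domain and range.
        if relation not in self.ontology.relations:
            raise ValueError(f"unknown relation: {relation}")
        domain, range_ = self.ontology.relations[relation]
        if self.instances.get(subject) != domain or self.instances.get(obj) != range_:
            raise ValueError(f"{subject} {relation} {obj} violates the ontology")
        self.facts.add((subject, relation, obj))


# Toy data (assumed for illustration): the "mold" and two "cupcakes".
mold = Ontology(
    concepts={"Cupcake", "Flavour"},
    relations={"hasFlavour": ("Cupcake", "Flavour")},
)
kb = KnowledgeBase(mold)
kb.add_instance("cupcake_1", "Cupcake")
kb.add_instance("vanilla", "Flavour")
kb.add_fact("cupcake_1", "hasFlavour", "vanilla")
```

A fact such as `kb.add_fact("vanilla", "hasFlavour", "cupcake_1")` is rejected because it does not fit the relation signature, which is the sense in which the ontology governs how the knowledge base is populated.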
Ontology learning has benefited from the adoption of established techniques from the related areas just discussed. Aside from the inherent challenges of processing natural language, one of the remaining obstacles preventing the large-scale deployment of ontology learning systems is the bottleneck in handcrafting structured knowledge sources (e.g., dictionaries, taxonomies, knowledge bases) [Cullen and Bryman 1988] and training data (e.g., annotated text corpora). It is gradually becoming apparent that in order to minimize human efforts in the learning process and to improve the scalability and robustness of the system, static and expert crafted resources may no longer be adequate. Recognizing this, an increasing amount of research effort is gradually being directed towards harnessing the collective intelligence on the Web in the hopes of addressing this one major bottleneck. At the same time, as with many fields before ontology learning, the process of maturing has triggered a mounting awareness of the actual intricacies involved in automatically discovering concepts, relations, and even axioms. This gives rise to the question of whether the ultimate goal of achieving full-fledged formal ontologies automatically can be achieved. While certain individuals dwell on the question, many others move on with a more pragmatic goal, which is to focus on learning lightweight ontologies first and extend them later if possible. With high hopes and achievable aims, we are seeing a gradual rise in the adoption of ontologies across many domains that require knowledge engineering, in particular, interoperability of semantics in their applications (e.g., document retrieval [Castells et al. 2007], image retrieval [Hyvonen et al. 2003], bioinformatics [Baker et al. 2007], manufacturing [Cho et al. 2006], industrial safety [Abou-Assali et al. 2007], law [Volker et al. 2008], environment [Raskin and Pan 2005], disaster management [Klien et al. 2006], e-Government [Kayed et al. 2010], e-Commerce [Liu et al. 2008], and tourism [Park et al. 2009]).
6. CURRENT TRENDS AND FUTURE RESEARCH DIRECTIONS
To summarize, we began this survey with an overview of ontologies and ontology learning from text. In particular, we introduced a unified way of looking at the types of output, tasks, techniques, and resources in ontology learning as well as the associations between these different dimensions in Figure 2. We summarized several widely used evaluation methods in the ontology learning community. The differences between a formal and a lightweight ontology were also explained. Finally, we reviewed seven prominent ontology learning systems as well as recent advances in the field. A summary of the systems reviewed is provided in Table I.
In this section, we bring this survey to a close by summarizing the progress and trends that the ontology learning community has witnessed over the past ten years. We then look at several open issues that will likely define the future research directions of the community.
The intertwining of the Web with ontology learning is a natural progression for many reasons. The ability to harvest consensus (considering that ontologies are shared conceptualizations) and accessibility to very large samples required by many learning techniques are amongst the reasons. In addition to the already existing problems in ontology learning, the growing use of Web data will introduce new challenges. At the moment, research involving the use of Web data for addressing the bottleneck of manual knowledge crafting has already begun. For instance, we are already seeing the marrying of Web data with term, concept, and relation extraction techniques that can easily benefit from larger datasets. For all we know, the Web may very well be the key ingredient in constructing ontologies with minimal human intervention required for cross-language and cross-domain applications and, eventually, the Semantic Web. When this happens, the role of formal ontology languages will become much more significant, and heavyweight ontologies will take center stage. We close this survey by looking at some of the present and future research problems in the area.
First, we foresee that more and more research efforts will be dedicated to creating new or adapting existing techniques to work with the noise, richness, diversity, and scale of Web data. In regard to noise, there is currently little mention of data cleanliness during ontology learning. As the use of Web data becomes more common, integrated techniques for addressing spelling errors, abbreviations, grammatical errors, word variants, and so on in texts are turning into a necessity. For instance, looking for a more representative word count on the Web for “endeavour” will require consideration for its variants (e.g., “endeavor”) and spelling errors (e.g., “endevour”). Moreover, the issues of authority and validity in Web data sources must also be investigated. Otherwise, relations frequently occurring on the Web, such as <Vladimir Putin><is-a><president of Germany>, will end up in the knowledge base. We predict that social data from the Web (e.g., collaborative tagging) will play an increasingly important role in addressing the authority and validity aspects of ontology learning. Probabilities and ranking based on wisdom of the masses are one way to assign trust to concepts and relations acquired from Web sources.
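The “endeavour” example can be made concrete with a small sketch that folds variants and misspellings into one canonical form before counting. The variant table, the function name `normalized_counts`, and the three sample sentences are assumptions made for illustration; a real system would need to derive such a table rather than write it by hand.

```python
import re
from collections import Counter

# Hypothetical, hand-written mapping from variants and misspellings
# to a canonical spelling.
VARIANTS = {"endeavor": "endeavour", "endevour": "endeavour"}


def normalized_counts(texts, variants):
    """Count word occurrences after mapping each token to its canonical form."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            counts[variants.get(token, token)] += 1
    return counts


# Toy documents standing in for Web text.
docs = [
    "Our endeavour continues.",
    "A noble endeavor!",
    "This endevour failed.",
]
counts = normalized_counts(docs, VARIANTS)
print(counts["endeavour"])  # prints 3
```

Without the mapping, each spelling would be counted once and none of the three counts would reflect how often the word is actually used.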
Second, the richness of Web data in terms of (semi-)structured, collaboratively maintained resources, such as Wikipedia, is increasingly being used to improve higher-layer tasks, such as concept formation and relation discovery. We observed in the literature a current mushrooming of techniques for finding semantic relations using the categorical structure of Wikipedia. These techniques are mostly focused on hierarchical relations and often leave out the details on how to cope with concepts that do not appear in Wikipedia. We foresee that more effort will be dedicated to studying and exploiting associative relations on Wikipedia (e.g., links under the “See also” section) for ontology learning. We have already noticed work on identifying coarse-grained unlabeled associative relations from Wikipedia and the adaptive matching of terms to Wikipedia topics where exact matches are not available. We will definitely see more work in this direction. An example would be the use of the coarse-grained associative relations as seeds together with triples extracted from Web search results for bootstrapping the discovery of more detailed semantic relations. The verbs from the triples could then be used to label the relations. Unless improvements are made in these tasks, many of the current elaborate and expert-crafted ontologies, such as the Gene Ontology, cannot be replicated using ontology learning from text systems.
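The seed-and-triple idea in the paragraph above could look roughly like the following sketch. It is not the method of any system reviewed in the survey: the function name `label_relations`, the majority-vote rule, and the toy triples are all assumptions. Each unlabeled seed pair is labeled with the verb that most often connects its two terms in the extracted triples.

```python
from collections import Counter, defaultdict


def label_relations(seed_pairs, triples):
    """Label each unlabeled seed pair with its most frequent connecting verb.

    seed_pairs: iterable of (term, term) pairs, treated as unordered.
    triples: iterable of (subject, verb, object) tuples.
    """
    seeds = {frozenset(pair) for pair in seed_pairs}
    verbs = defaultdict(Counter)
    for subject, verb, obj in triples:
        key = frozenset((subject, obj))
        if key in seeds:
            verbs[key][verb] += 1
    # Majority vote per pair; pairs with no matching triple stay unlabeled.
    return {
        tuple(sorted(key)): counter.most_common(1)[0][0]
        for key, counter in verbs.items()
    }


# Toy input (assumed): one coarse-grained seed and a few extracted triples.
seeds = [("aspirin", "headache")]
triples = [
    ("aspirin", "relieves", "headache"),
    ("aspirin", "relieves", "headache"),
    ("aspirin", "causes", "headache"),
    ("coffee", "contains", "caffeine"),
]
labels = label_relations(seeds, triples)
print(labels)  # prints {('aspirin', 'headache'): 'relieves'}
```

Treating the pair as unordered discards the direction of the relation, which a fuller treatment would need to keep.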
Third, the diversity of Web data has also contributed to the rise of cross-language ontology learning in the past few years. As more communities of different cultural and linguistic backgrounds contribute to the Web, the availability of textual resources required for ontology learning across different languages will improve. The potential growth of cross-language research in the future signals the need to gradually move ontologies away from language dependency. Considering that formal ontologies are shared conceptualizations and should not contain lexical knowledge [Hjelm and Volk 2011], apples should not be represented lexically as “apple” in an ontology so that the overall fruit ontology can be applicable to other languages. For this to happen, we need more research into mechanisms for encoding and representing ontological entities as low-level constructs and for mapping these constructs into natural language symbols to facilitate human interpretation.
Fourth, the ability to cope with the scale of Web data required for ontology learning is also another concern. The efficiency and robustness in processing an exponentially growing volume of text will likely receive increasing attention. The issues that researchers will look at extend beyond mere storage space or other hardware considerations. Some of the topics of potential interest include the ease of analyzing petabyte collections for corpus statistics, the ability to commit, resume, and rollback the learning process in the event of errors or interruptions, and the efficiency of techniques for the various tasks of ontology learning from Web-scale data (e.g., large-scale sentence parsing). The latter topic is of particular interest considering that many of the current ontology learning systems employ readily available off-the-shelf tools or incorporate techniques designed for small datasets or without efficiency in mind. In particular, systems that are the result of putting together existing tools may not be streamlined and hence may suffer in performance when faced with Web-scale text analysis.
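One way to picture the commit, resume, and rollback requirement is a batch loop that checkpoints corpus statistics after every batch. The sketch below is an assumption-laden illustration, not a description of any existing system: the JSON checkpoint layout and the names `commit`, `load`, and `learn` are hypothetical, and whitespace tokenization stands in for real text processing.

```python
import json
import os
import tempfile
from collections import Counter


def commit(path, state):
    """Write the checkpoint atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load(path):
    """Return the last committed state, or an empty one on a fresh start."""
    if not os.path.exists(path):
        return {"next_doc": 0, "counts": {}}
    with open(path) as handle:
        return json.load(handle)


def learn(docs, path, batch_size=2):
    """Accumulate term counts batch by batch, resuming from the checkpoint."""
    state = load(path)
    counts = Counter(state["counts"])
    for begin in range(state["next_doc"], len(docs), batch_size):
        batch = docs[begin:begin + batch_size]
        pending = Counter()  # uncommitted work for this batch only
        for doc in batch:
            pending.update(doc.lower().split())
        # Reached only if the whole batch succeeded; a failure above leaves
        # the checkpoint at the previous batch, i.e., an implicit rollback.
        counts.update(pending)
        commit(path, {"next_doc": begin + len(batch), "counts": dict(counts)})
    return counts
```

If a batch raises an error, calling `learn` again with the same checkpoint path restarts from the first uncommitted document instead of from the beginning of the collection.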
Fifth, we speculate that the related area of ontology mapping, also known as ontology alignment, will become more pertinent as the availability of ontologies increases. The availability of multiple and potentially conflicting or complementing ontologies will call for better means to determine correspondences between concepts and even relations [de Bruijn et al. 2006]. A gradual rise in interest in ontology mapping is obvious as we look at the publication trend shown in Figure 5. The data for this graph are obtained by searching for publications containing the phrase “ontology mapping” or “ontology alignment” on Google Scholar. The graphs in Figures 4 and 5 may not be representative of the actual publication trends. They do, however, offer testable predictions about the current state of, as well as future, research interests. In addition, we are predicting a rise in focus on logic-based techniques in ontology learning, as our techniques for the lower layers (i.e., term, concept, relation) mature and our systems become more comprehensive (i.e., inclusion of higher-layer outputs), reaching towards the learning of full-fledged ontologies.
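As a rough illustration of what determining correspondences between concepts can mean at its simplest, the sketch below pairs concept labels from two ontologies by string similarity, using `difflib.SequenceMatcher` from the Python standard library. This is a deliberately naive baseline chosen for brevity, not a technique attributed to the survey or to de Bruijn et al.; the function name `align`, the 0.8 threshold, and the concept lists are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(concepts_a, concepts_b, threshold=0.8):
    """Pair each concept in A with its most similar label in B.

    Pairs scoring below the (assumed) threshold are dropped.
    Expects concepts_b to be non-empty.
    """
    matches = []
    for a in concepts_a:
        best = max(concepts_b, key=lambda b: similarity(a, b))
        if similarity(a, best) >= threshold:
            matches.append((a, best))
    return matches


# Toy concept labels from two hypothetical ontologies.
ontology_a = ["Author", "Publication", "Organisation"]
ontology_b = ["author", "Organization", "Journal"]
mapping = align(ontology_a, ontology_b)
print(mapping)  # prints [('Author', 'author'), ('Organisation', 'Organization')]
```

Label matching of this kind misses correspondences between differently named concepts (here, "Publication" and "Journal" stay unmatched), which is one reason mapping relies on more than lexical evidence.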
Last, it remains a fact that the majority of the ontologies out there at the moment are lightweight. To be able to semi-automatically learn formal ontologies, we have to improve on our current axiom learning techniques as well as to find ways of incorporating the consensus aspect into the learning process, amongst others. As formal ontologies take center stage, we foresee an increase in concern regarding the extensibility of existing lightweight ontologies to full-fledged ones. All in all, there are several key issues that will likely define the research directions in this area in the near future, namely, (1) the issue of noise, authority, and validity in Web data for ontology learning; (2) the integration of social data into the learning process to incorporate consensus into ontology building; (3) the design of new techniques for exploiting the structural richness of collaboratively maintained Web data; (4) the representation of ontological entities as language-independent constructs; (5) the applicability of existing techniques for learning ontologies for different writing systems (e.g., alphabetic, logographic); (6) the efficiency and robustness of existing techniques for Web-scale ontology learning; (7) the increasing role of ontology mapping as more ontologies become available; and (8) the extensibility of existing lightweight ontologies to formal ones. Key phrases, such as Web-scale, open, consensus, social, formal, and cross-language ontology learning or ontologies, are all buzzwords that we will encounter very often in the future.