2009 EmployingTopicModelsForPatSemClasDisc

Subject Headings: Text-based Semantic Class Expansion Task, Probabilistic Topic Model.




A semantic class is a collection of items (words or phrases) that share a semantic peer or sibling relationship. This paper studies the use of topic models to automatically construct semantic classes, taking as the source data a collection of raw semantic classes (RASCs), which were extracted by applying predefined patterns to web pages. The primary requirement (and challenge) here is dealing with multi-membership: an item may belong to multiple semantic classes, and we need to discover as many as possible of the different semantic classes the item belongs to. To adopt topic models, we treat RASCs as “documents”, items as “words”, and the final semantic classes as “topics”. Appropriate preprocessing and postprocessing are performed to improve result quality, to reduce computation cost, and to tackle the fixed-k constraint of a typical topic model. Experiments conducted on 40 million web pages show that our approach could yield better results than alternative approaches.
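The mapping described above (RASCs as "documents", items as "words") can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: it uses three of the sample RASCs that the paper lists in its Table 2, and the names `rascs`, `vocabulary`, `to_count_vector`, and `doc_term` are hypothetical, not taken from the paper. It builds the bag-of-items count vectors that a topic model would take as input; it does not fit a topic model.

```python
from collections import Counter

# Sample RASCs copied from Table 2 of the paper; each one plays the role of a "document".
rascs = {
    "R1": ["gold", "silver", "copper", "coal", "iron", "uranium"],
    "R2": ["red", "yellow", "color", "gold", "silver", "copper"],
    "R3": ["red", "green", "blue", "yellow"],
}

# The "words" are the distinct items that occur in any RASC.
vocabulary = sorted({item for items in rascs.values() for item in items})


def to_count_vector(items, vocabulary):
    """Return the item counts of one RASC, aligned with the vocabulary order."""
    counts = Counter(items)
    return [counts[word] for word in vocabulary]


# One count vector per RASC: the document-word matrix a topic model would consume.
doc_term = {name: to_count_vector(items, vocabulary) for name, items in rascs.items()}

for name, vector in doc_term.items():
    print(name, vector)
```

Each row of `doc_term` corresponds to one RASC, and each column to one item; the "topics" a model infers over these rows would then play the role of the final semantic classes.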

1 Introduction

Semantic class construction (Lin and Pantel, 2001; Pantel and Lin, 2002; Paşca, 2004; Shinzato and Torisawa, 2005; Ohshima et al., 2006) tries to discover the peer or sibling relationship among terms or phrases by organizing them into semantic classes. For example, {red, white, black…} is a semantic class consisting of color instances. A popular method for semantic class discovery is the pattern-based approach, where predefined patterns (Table 1) are applied to a collection of web pages or an online web search engine to produce some raw semantic classes (abbreviated as RASCs, Table 2). RASCs cannot be treated as the ultimate semantic classes, because they are typically noisy and incomplete, as shown in Table 2. In addition, the information of one real semantic class may be distributed across many RASCs (R2 and R3 in Table 2).

Table 1. Sample patterns (SENT: Sentence structure patterns; TAG: HTML Tag patterns)

Type    Pattern
SENT    NP {, NP}*{,} (and|or) {other} NP
TAG     HTML list markup (shown here as rendered):

  • item
  • item
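As an illustration of how the SENT pattern in Table 1 might be applied, here is a minimal sketch using a regular expression. It rests on a strong simplifying assumption that is not in the paper: a noun phrase (NP) is approximated by a single word token, whereas a real system would rely on a chunker or parser. The names `NP`, `SENT_PATTERN`, and `extract_rasc` are hypothetical.

```python
import re

# Simplifying assumption: an NP is a single word token.
NP = r"\w+"

# NP {, NP}*{,} (and|or) {other} NP, with single spaces between tokens.
SENT_PATTERN = re.compile(
    rf"({NP}(?:, {NP})*),? (?:and|or) (?:other )?({NP})"
)


def extract_rasc(sentence):
    """Return the items matched by the simplified SENT pattern, or [] if none."""
    match = SENT_PATTERN.search(sentence)
    if match is None:
        return []
    # Group 1 holds the comma-separated items, group 2 the final NP.
    return match.group(1).split(", ") + [match.group(2)]


print(extract_rasc("red, green, blue, and yellow"))
print(extract_rasc("gold, silver, copper and other metals"))
```

Note that in the second call the final NP ("metals") is kept as an item even though it names the category rather than a member. This is one way the noise seen in raw semantic classes can arise, for example "color" in R2 of Table 2.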


Table 2. Sample raw semantic classes (RASCs)

R1: {gold, silver, copper, coal, iron, uranium}
R2: {red, yellow, color, gold, silver, copper}
R3: {red, green, blue, yellow}
R4: {HTML, Text, PDF, MS Word, Any file type}
R5: {Today, Tomorrow, Wednesday, Thursday, Friday, Saturday, Sunday}
R6: {Bush, Iraq, Photos, USA, War}

This paper aims to discover high-quality semantic classes from a large collection of noisy RASCs. The primary requirement (and challenge) here is to deal with multi-membership, i.e., one item may belong to multiple different semantic classes. For example, the term “Lincoln” can simultaneously represent a person, a place, or a car brand name. Multi-membership is more common than it appears at first glance, because many common English words have also been borrowed as company names, places, or product names. For a given item (as a query) which belongs to multiple semantic classes, we intend to return the semantic classes separately, rather than mixing all their items together.
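To see why multi-membership makes the task hard, consider a minimal sketch that looks up every RASC containing a query item. This is only an illustrative first step under assumptions, not the paper's method as stated in this section: the data are three sample RASCs from Table 2, and the names `rascs` and `build_item_index` are hypothetical.

```python
from collections import defaultdict

# Sample RASCs copied from Table 2 of the paper.
rascs = {
    "R1": ["gold", "silver", "copper", "coal", "iron", "uranium"],
    "R2": ["red", "yellow", "color", "gold", "silver", "copper"],
    "R3": ["red", "green", "blue", "yellow"],
}


def build_item_index(rascs):
    """Map each item to the names of the RASCs that contain it."""
    index = defaultdict(list)
    for name, items in rascs.items():
        for item in set(items):  # count each RASC once per item
            index[item].append(name)
    return index


index = build_item_index(rascs)

# The query "gold" appears in R1 (metals) and in R2 (a noisy mix of colors and metals).
query = "gold"
for name in index[query]:
    print(name, rascs[name])
```

Simply merging the retrieved RASCs would mix metals with colors in a single list. Returning the classes separately, as the paper intends, requires a further step that groups the items, which is the role the topic model plays.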




Citation: 2009 EmployingTopicModelsForPatSemClasDisc
Authors: Huibin Zhang, Mingjie Zhu, Shuming Shi, Ji-Rong Wen
Title: Employing Topic Models for Pattern-based Semantic Class Discovery
Venue: Proceedings of the Annual Meeting of the Association for Computational Linguistics
URL: http://www.aclweb.org/anthology/P/P09/P09-1052.pdf
Year: 2009