2015 MiningLatentEntityStructures

From GM-RKB

Subject Headings: Latent Entity Structure Mining.

Notes

Cited By

Quotes

Author Keywords

information networks, text mining, link analysis, topic modeling, phrase extraction, role discovery, clustering, ranking, relationship mining, probabilistic models, real-world applications, efficient and scalable algorithms

Abstract

The “big data” era is characterized by an explosion of information in the form of digital data collections, ranging from scientific knowledge, to social media, news, and everyone's daily life. Examples of such collections include scientific publications, enterprise logs, news articles, social media, and general web pages. Valuable knowledge about multi-typed entities is often hidden in the unstructured or loosely structured, interconnected data. Mining latent structures around entities uncovers hidden knowledge such as implicit topics, phrases, entity roles and relationships. In this monograph, we investigate the principles and methodologies of mining latent entity structures from massive unstructured and interconnected data. We propose a text-rich information network model for modeling data in many different domains. This leads to a series of new principles and powerful methodologies for mining latent structures, including (1) latent topical hierarchy, (2) quality topical phrases, (3) entity roles in hierarchical topical communities, and (4) entity relations. This book also introduces applications enabled by the mined structures and points out some promising research directions.

Table of Contents

1 Introduction
1.1 Motivation
1.2 Data Model: A Text-Rich Heterogeneous Information Network Model
1.3 Latent Entity Structure
1.4 The Mining Framework
1.4.1 Hierarchical Topic and Community Discovery
1.4.2 Topical Phrase Mining
1.4.3 Entity Topical Role Analysis
1.4.4 Entity Relationship Mining
2 Hierarchical Topic and Community Discovery
2.1 Generative Model for Text or Homogeneous Networks
2.2 Generative Model for Heterogeneous Network
2.2.1 The Basic Model
2.2.2 Learning Link-Type Weights
2.2.3 Shape of Hierarchy
2.3 Empirical Analysis
2.3.1 Efficacy of Subtopic Discovery
2.3.2 Topical Hierarchy Quality
2.3.3 Case Study
3 Topical Phrase Mining
3.1 Criteria of Good Phrases and Topical Phrases
3.2 KERT: Mining Phrases in Short, Content-Representative Text
3.2.1 Phrase Quality
3.2.2 Topical Phrase Quality
3.3 ToPMine: Mining Phrases in General Text
3.3.1 Frequent Phrase Mining
3.3.2 Segmentation and Phrase Filtering
3.3.3 Topical Phrase Ranking
3.4 Empirical Analysis
3.4.1 The Impact of the Four Criteria
3.4.2 Comparison of Mining Methods
3.4.3 Scalability
4 Entity Topical Role Analysis
4.1 Role of Given Entities
4.1.1 Entity Specific Phrase Ranking
4.1.2 Distribution over Subtopics
4.1.3 Case Study
4.2 Entities of Given Roles
5 Mining Entity Relations
5.1 Unsupervised Hierarchical Relation Mining
5.1.1 Notations
5.1.2 Assumptions and Framework
5.1.3 Stage 1: Preprocessing
5.1.4 Stage 2: TPFG Model
5.1.5 Model Inference
5.1.6 Empirical Analysis
5.2 Supervised Hierarchical Relation Mining
5.2.1 Conditional Random Field for Hierarchical Relationship
5.2.2 Potential Function Design
5.2.3 Model Inference and Learning
5.2.4 Empirical Analysis
5.3 Semi-Supervised Co-Profiling
5.3.1 Observations
5.3.2 Model
5.3.3 Inference Algorithm
5.3.4 Empirical Analysis
6 Scalable and Robust Topic Discovery
6.1 Latent Dirichlet Allocation with Topic Tree
6.2 The STROD Algorithm
6.2.1 Moment-Based Inference
6.2.2 Scalability Improvement
6.2.3 Hyperparameter Learning
6.3 Empirical Analysis
6.3.1 Scalability
6.3.2 Robustness
6.3.3 Interpretability
7 Application and Research Frontier
7.1 Application
7.1.1 Online Analytical Processing of Information Networks
7.1.2 Social Influence and Viral Marketing
7.1.3 Relevance Targeting
7.2 Research Frontier

1. Introduction

1.1. Motivation

The success of database technology is largely attributed to the efficient and effective handling of structured data. The construction of a well-structured database is often the premise of subsequent applications. However, the explosion of “big data” poses great challenges to this practice, since real-world data are largely unstructured or loosely structured. It is crucial to uncover latent structures of real-world entities, such as the topics and communities they are involved in, the roles the entities play in these topics and communities, and the relations they potentially have with each other. By mining massive unstructured or loosely structured data associated with entities, one can construct semantically rich structures which reveal the relationships among entities. The uncovered structures facilitate browsing information and retrieving knowledge from the data. Mining latent entity structures will enhance knowledge engineering effectively in many applications. For example, mining entity structures hidden in billions of web pages will turn extensive web data into knowledge that will enrich open-domain knowledge-bases. Mining entity structures hidden in social media will help reorganize scattered information from hundreds of millions of individuals and improve social network services. In news data, the topics, as well as the entity relations, are buried in the text rather than in the form of relational tuples. Mining such news data will enable us to extract multiple types of entities like people, locations, organizations, and events, for effective news understanding and analysis. In a bibliographic database like DBLP[1] or PubMed,[2] research papers are explicitly linked with authors, venues, and terms. Many interesting semantic relationships, such as advisor-advisee between authors, are hidden in the publication records; moreover, the research topics of authors, venues and terms are also hidden or unorganized, preventing insightful organization of the entities. Mining hidden research network structures will help scientific research tremendously.

Footnotes

References


2015 MiningLatentEntityStructures
Author: Chi Wang, Jiawei Han
Title: Mining Latent Entity Structures
DOI: 10.2200/S00625ED1V01Y201502DMK010
Year: 2015