2012 AddingSemanticstoMicroblogPosts

(Meij et al., 2012) ⇒ Edgar Meij, Wouter Weerkamp, and Maarten de Rijke. (2012). “Adding Semantics to Microblog Posts.” In: Proceedings of the fifth ACM International Conference on Web search and data mining. doi:10.1145/2124295.2124364

Subject Headings: Topic Detection, Text Item Semantic Classification, Microblog Post, Semantic Linking.

Notes

Cited By

Quotes

Author Keywords

microblogs; semantic linking; wikipedia

Abstract

Microblogs have become an important source of information for the purpose of marketing, intelligence, and reputation management. Streams of microblogs are of great value because of their direct and real-time nature. Determining what an individual microblog post is about, however, can be non-trivial because of creative language usage, the highly contextualized and informal nature of microblog posts, and the limited length of this form of communication. We propose a solution to the problem of determining what a microblog post is about through semantic linking: we add semantics to posts by automatically identifying concepts that are semantically related to it and generating links to the corresponding Wikipedia articles. The identified concepts can subsequently be used for, e.g., social media mining, thereby reducing the need for manual inspection and selection. Using a purpose-built test collection of tweets, we show that recently proposed approaches for semantic linking do not perform well, mainly due to the idiosyncratic nature of microblog posts. We propose a novel method based on machine learning with a set of innovative features and show that it is able to achieve significant improvements over all other methods, especially in terms of precision.

1. INTRODUCTION

In recent years Twitter has become one of the largest online microblogging platforms with over 65M unique visitors and around 200M tweets per day.^[1] Microblogging streams have become invaluable sources for many kinds of analyses, including online reputation management, news and trend detection, and targeted marketing and customer services [4, 18, 32, 35]. Searching and mining microblog streams offers interesting technical challenges, because of the sheer volume of the data, its dynamic nature, the creative language usage, and the length of individual posts [17, 22].

In many microblog search scenarios the goal is to find out what people are saying about concepts such as products, brands, persons, et cetera [31]. Here, it is important to be able to accurately retrieve tweets that are on topic, including all possible naming and other lexical variants. So, it is common to manually construct lengthy keyword queries that (hopefully) capture all possible variants [2]. We propose an alternative approach, namely to determine what a microblog post is about by automatically identifying the concepts in it. We take a concept to be any item that has a unique and unambiguous entry in a well-known large-scale knowledge source, Wikipedia.

Little research exists on understanding and modeling the semantics of individual microblog posts. Linking free text to knowledge resources, on the other hand, has received an increasing amount of attention in recent years. Starting from the domain of named entity recognition (NER), current approaches establish links not just to entity types, but to the actual entities themselves [15, 20, 30]. Instead of merely identifying types, we also aim to disambiguate the found concepts and link them to Wikipedia articles. With over 3.5 million articles, Wikipedia has become a rich source of knowledge and a common target for linking; automatic linking approaches using Wikipedia have met with considerable success [14, 25, 27, 28].

Most, if not all, of the linking methods assume that the input text is relatively clean and grammatically correct and that it provides sufficient context for the purposes of identifying concepts. Microblog posts are short, noisy, and full of shorthand and other ungrammatical text and provide very limited context for the words they contain [17, 22]. Hence, it is not obvious that automatic concept detection methods that have been shown to work well on news articles or web pages, perform equally well on microblog posts.

We present a robust method for automatically mapping tweets to Wikipedia articles to facilitate social media mining on a semantic level. The first research question we address is: What is the performance of state-of-the-art approaches for linking text to Wikipedia in the context of microblog posts? Our proposed approach involves a two-step method for semantic linking. The first step is recall-oriented where the aim is to obtain a ranked list of candidate concepts. In the next step, we enhance precision and determine which of the candidate concepts to keep. Our second research question concerns a comparison of methods for the initial concept ranking step; we consider lexical matching, language modeling, and other state-of-the-art baselines and compare their effectiveness. Our third research question concerns the second, precision-enhancing step. We approach this as a machine learning problem and consider a broad set of features, some of which have been proposed previously in the literature on semantic linking, some newly introduced. In addition to multiple features, we also consider multiple machine learning algorithms and examine which of these are most effective for our problem. Finally, we examine the relative effectiveness of the precision-enhancing step on top of different initial concept ranking methods. The paper focuses on the effectiveness of concept detection methods in the setting of microblog posts. In the conclusion to the paper we also discuss efficiency considerations.
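The two-step structure described above (recall-oriented ranking, then precision-enhancing selection) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the word-overlap scorer, and the threshold rule are hypothetical stand-ins, and in the paper the second step is a learned classifier over many features rather than a fixed threshold.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (concept title, ranking score)


def rank_candidates(tweet: str,
                    scorer: Callable[[str, str], float],
                    concepts: List[str],
                    top_k: int = 10) -> List[Candidate]:
    """Step 1 (recall-oriented): score every concept, keep the top_k non-zero ones."""
    scored = [(c, scorer(tweet, c)) for c in concepts]
    scored = [(c, s) for c, s in scored if s > 0.0]
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda cs: (-cs[1], cs[0]))
    return scored[:top_k]


def select_concepts(tweet: str,
                    candidates: List[Candidate],
                    keep: Callable[[str, str, float], bool]) -> List[str]:
    """Step 2 (precision-enhancing): keep only the candidates that `keep` accepts."""
    return [c for c, s in candidates if keep(tweet, c, s)]


def overlap_scorer(tweet: str, concept: str) -> float:
    """Hypothetical scorer: fraction of the concept title's words found in the tweet."""
    tweet_words = set(tweet.lower().split())
    title_words = concept.lower().split()
    return sum(w in tweet_words for w in title_words) / len(title_words)


def threshold_classifier(tweet: str, concept: str, score: float) -> bool:
    """Hypothetical stand-in for a trained classifier: accept full title matches only."""
    return score >= 1.0


tweet = "is google instant worth the attention"
concepts = ["Google Instant", "Google", "Instant coffee", "Attention"]
ranked = rank_candidates(tweet, overlap_scorer, concepts)
selected = select_concepts(tweet, ranked, threshold_classifier)
print(ranked)
print(selected)
```

Keeping the two steps behind separate callables mirrors the claim that the selection step can sit on top of different initial ranking methods: any ranker that returns scored candidates can be passed to the same selector.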

Tweet: Is it me or does Google Instant encourage you to pay more attention to their Ads and Shopping links?
Concepts: ADS, AND, ATTENTION, DOES, GOOGLE, GOOGLE INSTANT, IS, IT, LINKS, ME, MORE, etc.

Tweet: Keep your eyes out for an actress called Judi Dench. She’s a promising talent and I predict we’ll be hearing more about her.
Concepts: A, ABOUT, ACTRESS, AN, AND, AND I, BE, CALLED, DENCH, FOR, HEARING, HER, I, I PREDICT, JUDI DENCH, etc.

Table 1: Example tweets with concepts recognized using lexical matching on Wikipedia article titles.
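Lexical matching of the kind shown in Table 1 can be sketched as follows: generate all word n-grams of the tweet and keep those that equal an article title. This is a minimal illustration and not the authors' implementation; the tokenizer, the `max_n` limit, and the tiny title set are assumptions chosen for the example.

```python
import re
from typing import List, Set


def word_ngrams(text: str, max_n: int = 3) -> List[str]:
    """Return all word n-grams (n = 1..max_n) of the lowercased text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def lexical_match(text: str, titles: Set[str], max_n: int = 3) -> List[str]:
    """Return the titles that occur verbatim (case-insensitively) as an n-gram."""
    normalized = {t.lower(): t for t in titles}
    found = {normalized[g] for g in word_ngrams(text, max_n) if g in normalized}
    return sorted(found)


# Hypothetical title set; a real system would use all Wikipedia article titles.
titles = {"An", "Actress", "Judi Dench", "Google"}
matches = lexical_match("Keep your eyes out for an actress called Judi Dench.", titles)
print(matches)
```

Even with this toy title set the function word "An" is matched alongside "Judi Dench", which is the same kind of noise visible in Table 1 and the reason a precision-enhancing step is needed.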

Our main contributions are: (i) a robust, successful method for linking tweets to Wikipedia articles, based on a combination of high-recall concept ranking and high-precision machine learning, including state-of-the-art machine learning algorithms, (ii) insights into the influence of various features and machine learning algorithms on the task, and (iii) a reusable dataset, with which we aim to facilitate follow-up research. The remainder of this paper is organized as follows. In Section 2 we discuss related work, followed by a description of our method. In Section 4 we discuss the experimental setup and, in Section 5, the experiments with which we answer our research questions. We end with a concluding section.

2. RELATED WORK

In this section we review related …

…

Added feature                                        MAP      Δ
TEN(c; q) / TWCT(c; Q) / REDIRECT(c) / LINKPROB(q)   0.4964
+ IDFanchor(q)                                       0.5452   9.83%
+ KEYPHRASENESS(q)                                   0.5523   1.30%
+ SNIL(q)                                            0.5502   -0.38%
+ IDFtitle(q)                                        0.5547   0.82%
+ SNCL(q)                                            0.5593   0.83%
+ TFparagraph(c; q)                                  0.5674   1.45%
+ TFsentence(c; q)                                   0.5834   2.82%
+ TCN(c; q)                                          0.5866   0.55%
+ COMMONNESS(c; q)                                   0.6216   5.97%
+ TFtitle(c; q)                                      0.6340   1.99%
+ URL(q; Q)                                          0.6405   1.03%
+ POS1(c; q)                                         0.6435   0.47%
+ GEN(c)                                             0.6475   0.62%
+ WIG(q)                                             0.6491   0.25%
+ IDFcontent(q)                                      0.6511   0.31%

Table 9: MAP of CMNS-RF after incrementally adding features proportional to their information gain (truncated to show only the top features).
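The percentage column in Table 9 appears to be the relative change in MAP with respect to the previous row. This reading is inferred from the numbers, not stated in the excerpt. A short Python check under that assumption, using the first four MAP values from the table:

```python
from typing import List


def relative_change(previous: float, current: float) -> float:
    """Relative change from `previous` to `current`, in percent."""
    return (current - previous) / previous * 100.0


# First four MAP values of Table 9, in the order the features were added.
map_scores: List[float] = [0.4964, 0.5452, 0.5523, 0.5502]

changes = [round(relative_change(a, b), 2)
           for a, b in zip(map_scores, map_scores[1:])]
print(changes)  # compare with the table's 9.83%, 1.30%, -0.38%
```

The negative value for SNIL(q) shows that adding a feature with high information gain does not guarantee a higher MAP at that step.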

…

6. CONCLUSION AND FUTURE WORK

Microblogging streams have become an invaluable resource for marketing, search, information dissemination, and online reputation management. Searching and mining microblog streams offers interesting challenges and in this paper we have presented a successful semantic linking method for microblog posts. The identified concepts, i.e., Wikipedia articles, can subsequently be used for, e.g., social media mining or advanced search result presentation. Our novel method uses machine learning and is based on a high-recall concept ranking and a high-precision concept selection step. Using a purpose-built test collection, we have shown that it significantly outperforms other methods, including various recently proposed approaches. Moreover, the concept selection step can be applied to any method that returns concepts for an input text. Our results show that this step, in particular using random forests or gradient boosted regression trees, can significantly improve a weak baseline, especially in terms of precision. It is even able to improve when the concept ranking performance is already strong on its own.

We have focused mainly on the effectiveness of semantic linking in the setting of microblog posts as opposed to the efficiency. Since both best performing machine learning algorithms are easily parallelizable, the bulk of the processing happens during feature extraction. From the results obtained during feature analysis, we note that not all features are equally important and that a minimal, easily computable set can already obtain good performance. Moreover, our analysis has shown that only a relatively small number of iterations is needed to achieve optimal performance. We finally note that, in the cases where a real-time analysis of a stream of microblog posts is required, merely using the low-cost CMNS feature already obtains very good performance.
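The excerpt does not define the CMNS feature. In the semantic linking literature, commonness is usually the prior probability that an anchor text q links to a concept c, estimated from Wikipedia link counts. The sketch below assumes that definition; the function name, the data layout, and the counts are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def commonness(anchor_counts: Dict[str, Counter], q: str) -> List[Tuple[str, float]]:
    """Rank target concepts for anchor text q by their share of q's links."""
    targets = anchor_counts.get(q, Counter())
    total = sum(targets.values())
    if total == 0:
        return []  # q never occurs as an anchor text
    ranked = [(c, n / total) for c, n in targets.items()]
    # Highest share first; ties are broken alphabetically.
    ranked.sort(key=lambda cs: (-cs[1], cs[0]))
    return ranked


# Invented counts: how often the anchor "instant" links to each article.
toy_counts = {
    "instant": Counter({"Google Instant": 6, "Instant coffee": 3, "Instant messaging": 1}),
}
cmns = commonness(toy_counts, "instant")
print(cmns)
```

Because the score is a lookup plus a division over precomputed counts, it is cheap enough for stream processing, which is consistent with the remark above about real-time analysis.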

Future work includes the following. First, although our method is not language-dependent in any way, the manual annotations are indeed language-specific. Wikipedia, on the other hand, already contains numerous, manually-curated inter-language links that we could use for this purpose. Second, we already mentioned a post-hoc evaluation of our semantic linking method for future work in Section 5.3. We also acknowledge that our sample of tweets, based on “authoritative users,” is comparatively small and might be biased. Therefore, we intend to apply the best-performing methods to a much larger, random sample of microblog posts to see how they perform there. Further, in this paper we have focused on a domain-independent way of obtaining high-recall concept candidate rankings. We believe, however, that including additional information such as from NER could further improve semantic linking performance. For future work we also intend to consider bootstrapping or co-training, in which the concepts with the highest confidence are fed back as new training material. Finally we note that Wikipedia contains a few thousand links to Twitter in the articles’ “External Links” sections and we intend to investigate to what extent we can use this information for semantic linking.

References


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2012 AddingSemanticstoMicroblogPosts	Edgar Meij Maarten de Rijke Wouter Weerkamp			Adding Semantics to Microblog Posts				10.1145/2124295.2124364		2012

↑ [1] http://blog.twitter.com/2011/06/200-million-tweets-per-day.html
