2011 RecognizingNamedEntitiesinTweet

From GM-RKB

Subject Headings: Named Entity Recognition Algorithm.

Notes

Cited By

Quotes

Abstract

The challenges of Named Entity Recognition (NER) for tweets lie in the insufficient information in a tweet and the unavailability of training data. We propose to combine a K-Nearest Neighbors (KNN) classifier with a linear Conditional Random Fields (CRF) model under a semi-supervised learning framework to tackle these challenges. The KNN-based classifier conducts pre-labeling to collect global coarse evidence across tweets, while the CRF model conducts sequential labeling to capture the fine-grained information encoded in a tweet. The semi-supervised learning, plus the gazetteers, alleviates the lack of training data. Extensive experiments show the advantages of our method over the baselines, as well as the effectiveness of KNN and semi-supervised learning.

1 Introduction

Named Entity Recognition (NER) is generally understood as the task of identifying mentions of rigid designators in text belonging to named-entity types such as persons, organizations and locations (Nadeau and Sekine, 2007). Proposed solutions to NER fall into three categories: 1) rule-based methods (Krupka and Hausman, 1998); 2) machine-learning-based methods (Finkel and Manning, 2009; Singh et al., 2010); and 3) hybrid methods (Jansche and Abney, 2002). With the availability of annotated corpora, such as ACE05, Enron (Minkov et al., 2005) and CoNLL03 (Tjong Kim Sang and De Meulder, 2003), data-driven methods have become dominant.

However, current NER work mainly focuses on formal text such as news articles (McCallum and Li, 2003; Etzioni et al., 2005); exceptions include studies on informal text such as emails, blogs and clinical notes (Wang, 2009). Because of this domain mismatch, systems trained on non-tweet text perform poorly on tweets, a new genre of text that is short, informal, ungrammatical and noise prone. For example, the average F1 of the Stanford NER (Finkel et al., 2005), which is trained on the CoNLL03 shared-task data set and achieves state-of-the-art performance on that task, drops from 90.8% (Ratinov and Roth, 2009) to 45.8% on tweets.

Thus, a domain-specific NER system for tweets is needed, which in turn requires a large number of annotated tweets or rules; creating them manually is tedious and prohibitively expensive. Proposed approaches to alleviate this issue include: 1) domain adaptation, which aims to reuse the knowledge of a source domain in a target domain. Two recent examples are Wu et al. (2009), which uses data that is informative about the target domain and also easy to label to bridge the two domains, and Chiticariu et al. (2010), which introduces a high-level rule language, called NERL, to build general and domain-specific NER systems; and 2) semi-supervised learning, which aims to use abundant unlabeled data to compensate for the lack of annotated data. Suzuki and Isozaki (2008) is one such example.

Another challenge is the limited information in a tweet. Two factors contribute to this difficulty. One is the tweet’s informal nature, which makes conventional features such as part-of-speech (POS) tags and capitalization unreliable; the performance of current NLP tools drops sharply on tweets. For example, OpenNLP [1], a state-of-the-art POS tagger, achieves an accuracy of only 74.0% on our test data set. The other is the tweet’s short length, which leads to excessive use of abbreviations and shorthand and to very limited context. Tackling this challenge, ideally, requires either adapting related NLP tools to fit tweets or normalizing tweets to accommodate existing tools, both of which are hard tasks.
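To make this degradation concrete, here is a minimal sketch (not part of the paper) that runs an off-the-shelf POS tagger on a tweet-like sentence. NLTK is used purely as a stand-in for the OpenNLP tagger cited above, and the example tweet is invented.

    # Minimal illustration (assumption: NLTK installed); not the paper's setup.
    # Taggers trained on edited text often mis-handle lowercase names,
    # abbreviations and slang of the kind found in tweets.
    import nltk

    nltk.download("punkt", quiet=True)                       # tokenizer model
    nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

    tweet = "omg just saw justin bieber @ the airport lol"   # invented example
    tokens = nltk.word_tokenize(tweet)
    print(nltk.pos_tag(tokens))
    # Lowercased person names such as "justin bieber" tend to be tagged as
    # common nouns rather than proper nouns, weakening downstream NER features.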

We propose a novel NER system to address these challenges. First, a K-Nearest Neighbors (KNN) based classifier conducts word-level classification, leveraging similar, recently labeled tweets. Following the two-stage prediction aggregation method (Krishnan and Manning, 2006), these pre-labeled results, together with other conventional features used by state-of-the-art NER systems, are fed into a linear Conditional Random Fields (CRF) (Lafferty et al., 2001) model, which conducts fine-grained, tweet-level NER. Furthermore, the KNN and CRF models are repeatedly retrained on an incrementally augmented training set, into which confidently labeled tweets are added. Indeed, it is the combination of KNN and CRF under a semi-supervised learning framework that differentiates our method from existing ones. Finally, following Ratinov and Roth (2009), 30 gazetteers are used, which cover common names, countries, locations, temporal expressions, etc.; these gazetteers represent general knowledge across domains. The underlying idea of our method is to combine global evidence from KNN and the gazetteers with local contextual information, and to use common knowledge and unlabeled tweets to make up for the lack of training data.
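A minimal sketch of this two-stage idea follows. It is not the authors' implementation: scikit-learn and sklearn-crfsuite are assumed as stand-in libraries, the context-window representation and feature set are simplified placeholders, and the gazetteer and semi-supervised components are omitted.

    # Sketch only: KNN pre-labels each word, and the pre-label is injected as
    # an extra feature into a linear-chain CRF (assumed libraries:
    # scikit-learn, sklearn-crfsuite).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    import sklearn_crfsuite

    def window(tweet, i, size=2):
        # Context window around word i; a crude stand-in for the paper's
        # word-level representation.
        lo, hi = max(0, i - size), min(len(tweet), i + size + 1)
        return " ".join(tweet[lo:hi])

    def train_knn(train_tweets, train_labels, k=5):
        # Stage 1: word-level KNN classifier over context-window features.
        docs, ys = [], []
        for tweet, labels in zip(train_tweets, train_labels):
            for i, lab in enumerate(labels):
                docs.append(window(tweet, i))
                ys.append(lab)
        vec = CountVectorizer().fit(docs)
        knn = KNeighborsClassifier(n_neighbors=k).fit(vec.transform(docs), ys)
        return vec, knn

    def word2features(tweet, i, prelabel):
        # Stage 2: CRF features, with the KNN pre-label as coarse global
        # evidence; the real system also uses POS tags, gazetteers, etc.
        w = tweet[i]
        return {"word.lower": w.lower(), "word.istitle": w.istitle(),
                "word.isupper": w.isupper(), "knn.prelabel": prelabel}

    def tweet2features(tweet, vec, knn):
        pre = knn.predict(vec.transform([window(tweet, i)
                                         for i in range(len(tweet))]))
        return [word2features(tweet, i, pre[i]) for i in range(len(tweet))]

    # Putting it together (y_train is a list of BIO tag sequences):
    # vec, knn = train_knn(train_tweets, y_train)
    # X_train = [tweet2features(t, vec, knn) for t in train_tweets]
    # crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    # crf.fit(X_train, y_train)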

12,245 tweets are manually annotated as the test data set. Experimental results show that our method outperforms the baselines. It is also demonstrated that integrating KNN-classified results into the CRF model and applying semi-supervised learning considerably boost performance. Our contributions are summarized as follows.

  1. We propose a novel method that combines a KNN classifier with a conventional CRF-based labeler under a semi-supervised learning framework, to combat the lack of information in a tweet and the unavailability of training data.
  2. We evaluate our method on a human-annotated data set, and show that it outperforms the baselines and that both the combination with KNN and the semi-supervised learning strategy are effective (a minimal evaluation sketch follows this list).
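The kind of entity-level evaluation referred to in the second contribution can be sketched as follows; seqeval is an assumed library (not the authors' evaluation code) and the tag sequences are invented placeholders.

    # Entity-level precision/recall/F1 over BIO tag sequences (sketch only).
    from seqeval.metrics import classification_report

    y_true = [["B-PERSON", "I-PERSON", "O", "O", "B-LOCATION"]]  # gold tags
    y_pred = [["B-PERSON", "I-PERSON", "O", "O", "O"]]           # predicted tags
    print(classification_report(y_true, y_pred))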

The rest of our paper is organized as follows. In the next section, we introduce related work. In Section 3, we formally define the task and present the challenges. In Section 4, we detail our method. In Section 5, we evaluate our method. Finally, Section 6 concludes our work.

6 Conclusions and Future Work

We propose a novel NER system for tweets, which combines a KNN classifier with a CRF labeler under a semi-supervised learning framework. The KNN classifier collects global information across recently labeled tweets, while the CRF labeler exploits information from a single tweet and from the gazetteers. A series of experiments shows the effectiveness of our method and, in particular, the positive effects of KNN and semi-supervised learning.
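As a rough illustration of this semi-supervised framework (not the authors' code), the loop below retrains a labeler on an incrementally augmented training set, promoting only tweets whose predicted label sequences exceed a confidence threshold; the fit and predict_with_conf callables, as well as the threshold, are assumptions standing in for the KNN+CRF labeler.

    # Generic self-training loop (sketch only); `fit` and `predict_with_conf`
    # are placeholder callables wrapping the KNN+CRF labeler.
    def self_train(train, unlabeled, fit, predict_with_conf,
                   threshold=0.9, rounds=5):
        for _ in range(rounds):
            model = fit(train)
            promoted, remaining = [], []
            for tweet in unlabeled:
                labels, confidence = predict_with_conf(model, tweet)
                if confidence >= threshold:
                    promoted.append((tweet, labels))   # add to training data
                else:
                    remaining.append(tweet)
            if not promoted:
                break                                  # nothing confident; stop
            train = train + promoted
            unlabeled = remaining
        return fit(train)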

In the future, we plan to further improve the performance of our method in two directions. First, we hope to develop tweet normalization technology to make tweets friendlier to the NER task. Second, we are interested in integrating new entities from tweets or other channels into the gazetteers.

Footnotes

References

  • 1. Peter F. Brown, Peter V. DeSouza, Robert L. Mercer, Vincent J. Della Pietra, Jenifer C. Lai, Class-based n-gram Models of Natural Language, Computational Linguistics, v.18 n.4, p.467-479, December 1992
  • 2. Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Frederick Reiss, Shivakumar Vaithyanathan, Domain Adaptation of Rule-based Annotators for Named-entity Recognition Tasks, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, p.1002-1012, October 09-11, 2010, Cambridge, Massachusetts
  • 3. Michael Collins, Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, p.1-8, July 06, 2002 doi:10.3115/1118693.1118694
  • 4. Doug Downey, Matthew Broadhead, Oren Etzioni, Locating Complex Named Entities in Web Text, Proceedings of the 20th International Joint Conference on Artifical Intelligence, p.2733-2739, January 06-12, 2007, Hyderabad, India
  • 5. Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, Alexander Yates, Unsupervised Named-entity Extraction from the Web: An Experimental Study, Artificial Intelligence, v.165 n.1, p.91-134, June 2005 doi:10.1016/j.artint.2005.03.001
  • 6. Tim Finin, Will Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, Mark Dredze, Annotating Named Entities in Twitter Data with Crowdsourcing, Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, p.80-88, June 06-06, 2010, Los Angeles, California
  • 7. Jenny Rose Finkel, Christopher D. Manning, Nested Named Entity Recognition, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, August 06-07, 2009, Singapore
  • 8. Jenny Rose Finkel, Trond Grenager, Christopher Manning, Incorporating Non-local Information Into Information Extraction Systems by Gibbs Sampling, Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, p.363-370, June 25-30, 2005, Ann Arbor, Michigan doi:10.3115/1219840.1219885
  • 9. Honglei Guo, Huijia Zhu, Zhili Guo, Xiaoxun Zhang, Xian Wu, Zhong Su, Domain Adaptation with Latent Semantic Association for Named Entity Recognition, Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, May 31-June 05, 2009, Boulder, Colorado
  • 10. Martin Jansche, Steven P. Abney, Information Extraction from Voicemail Transcripts, Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, p.320-327, July 06, 2002 doi:10.3115/1118693.1118734
  • 11. Jing Jiang and ChengXiang Zhai. 2007. Instance Weighting for Domain Adaptation in NLP. In ACL, pages 264-271.
  • 12. Dan Klein, Christopher D. Manning, Accurate Unlexicalized Parsing, Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, p.423-430, July 07-12, 2003, Sapporo, Japan doi:10.3115/1075096.1075150
  • 13. Vijay Krishnan, Christopher D. Manning, An Effective Two-stage Model for Exploiting Non-local Dependencies in Named Entity Recognition, Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, p.1121-1128, July 17-18, 2006, Sydney, Australia doi:10.3115/1220175.1220316
  • 14. George R. Krupka and Kevin Hausman. 1998. IsoQuest: Description of the NetOwl™ Extractor System as Used in MUC-7. In MUC-7.
  • 15. John D. Lafferty, Andrew McCallum, Fernando C. N. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, Proceedings of the Eighteenth International Conference on Machine Learning, p.282-289, June 28-July 01, 2001
  • 16. Andrew McCallum, Wei Li, Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-enhanced Lexicons, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, p.188-191, May 31, 2003, Edmonton, Canada doi:10.3115/1119176.1119206
  • 17. Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name Tagging with Word Clusters and Discriminative Training. In HLT-NAACL, pages 337-342.
  • 18. Einat Minkov, Richard C. Wang, William W. Cohen, Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text, Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, p.443-450, October 06-08, 2005, Vancouver, British Columbia, Canada doi:10.3115/1220575.1220631
  • 19. David Nadeau and Satoshi Sekine. 2007. A Survey of Named Entity Recognition and Classification. Linguisticae Investigationes, 30:3--26.
  • 20. Lev Ratinov, Dan Roth, Design Challenges and Misconceptions in Named Entity Recognition, Proceedings of the Thirteenth Conference on Computational Natural Language Learning, June 04-05, 2009, Boulder, Colorado
  • 21. Sameer Singh, Dustin Hillard, Chris Leggetter, Minimally-supervised Extraction of Entities from Text Advertisements, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, p.73-81, June 02-04, 2010, Los Angeles, California
  • 22. Jun Suzuki and Hideki Isozaki. 2008. Semi-supervised Sequential Labeling and Segmentation Using Giga-word Scale Unlabeled Data. In ACL, pages 665-673.
  • 23. Erik F. Tjong Kim Sang, Fien De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, p.142-147, May 31, 2003, Edmonton, Canada doi:10.3115/1119176.1119195
  • 24. Yefeng Wang, Annotating and Recognising Named Entities in Clinical Notes, Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, August 04-04, 2009, Suntec, Singapore
  • 25. Dan Wu, Wee Sun Lee, Nan Ye, Hai Leong Chieu, Domain Adaptive Bootstrapping for Named Entity Recognition, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, August 06-07, 2009, Singapore
  • 26. Kazuhiro Yoshida, Jun'ichi Tsujii, Reranking for Biomedical Named-entity Recognition, Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, June 29-29, 2007, Prague, Czech Republic
  • 27. Tong Zhang, David Johnson, A Robust Risk Minimization based Named Entity Recognition System, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, p.204-207, May 31, 2003, Edmonton, Canada doi:10.3115/1119176.1119210
  • 28. GuoDong Zhou, Jian Su, Named Entity Recognition Using An HMM-based Chunk Tagger, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, July 07-12, 2002, Philadelphia, Pennsylvania doi:10.3115/1073083.1073163


Author: Furu Wei, Ming Zhou, Xiaohua Liu, Shaodian Zhang
Title: Recognizing Named Entities in Tweets
Year: 2011