2007 AddingNPStructToThePennTreebank


Subject Headings: Penn Treebank, Noun Phrase, Noun-Phrase Premodifier, NP Bracketing Task.

Notes

Cited By

  • ~31 …

Quotes

Abstract

The Penn Treebank does not annotate within base noun phrases (NPs), committing only to flat structures that ignore the complexity of English NPs. This means that tools trained on Treebank data cannot learn the correct internal structure of NPs.

This paper details the process of adding gold-standard bracketing within each noun phrase in the Penn Treebank. We then examine the consistency and reliability of our annotations. Finally, we use this resource to determine NP structure using several statistical approaches, thus demonstrating the utility of the corpus. This adds detail to the Penn Treebank that is necessary for many NLP applications.


References

  • Ann Bies, Mark Ferguson, Karen Katz, and Robert MacIntyre. (1995). Bracketing guidelines for Treebank II style Penn Treebank project. Technical report, University of Pennsylvania.
  • Daniel M. Bikel. (2004). On the Parameter Space of Generative Lexicalized Statistical Parsing Models. Ph.D. thesis, University of Pennsylvania.
  • Thorsten Brants and Alex Franz. (2006). Web 1T 5-gram version 1. Linguistic Data Consortium.
  • Ted Briscoe and John Carroll. (2006). Evaluating the accuracy of an unlexicalized statistical parser on the PARC DepBank. In: Proceedings of the Poster Session of COLING/ACL-06. Sydney, Australia.
  • Michael Collins. (1999). Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.
  • Usama M. Fayyad and Keki B. Irani. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI–93), pages 1022–1029. Chambery, France.
  • Roxana Girju, Dan Moldovan, Marta Tatu, and Daniel Antohe. (2005). On the semantics of noun compounds. Journal of Computer Speech and Language - Special Issue on Multiword Expressions, 19(4):313–330.
  • Julia Hockenmaier. (2003). Data and Models for Statistical Parsing with Combinatory Categorial Grammar. Ph.D. thesis, University of Edinburgh.
  • Tracy Holloway King, Richard Crouch, Stefan Riezler, Mary Dalrymple, and Ronald M. Kaplan. (2003). The PARC700 dependency bank. In: Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03). Budapest, Hungary.
  • Seth Kulick, Ann Bies, Mark Liberman, Mark Mandel, Ryan McDonald, Martha Palmer, Andrew Schein, and Lyle Ungar. (2004). Integrated annotation for biomedical information extraction. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Boston.
  • Mirella Lapata and Frank Keller. (2004). The web as a baseline: Evaluating the performance of unsupervised web-based models for a range of NLP tasks. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 121–128. Boston.
  • (Lauer, 1995a) ⇒ Mark Lauer. (1995). “Corpus Statistics Meet the Noun Compound: Some empirical results.” In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. doi:10.3115/981658.981665
  • Mitchell Marcus. (1980). A Theory of Syntactic Recognition for Natural Language. MIT Press, Cambridge, MA.
  • Mitchell Marcus, Beatrice Santorini, and Mary Marcinkiewicz. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
  • Preslav Nakov and Marti Hearst. (2005). Search engine statistics beyond the n-gram: Application to noun compound bracketing. In: Proceedings of CoNLL-2005, Ninth Conference on Computational Natural Language Learning. Ann Arbor, MI.
  • Lance A. Ramshaw and Mitchell P. Marcus. (1995). Text chunking using transformation-based learning. In: Proceedings of the Third ACL Workshop on Very Large Corpora. Cambridge, MA, USA.
  • Mark Steedman. (2000). The Syntactic Process. MIT Press, Cambridge, MA.
  • Ralph Weischedel and Ada Brunstein. (2005). BBN pronoun coreference and entity type corpus. Technical report, Linguistic Data Consortium.


2007 AddingNPStructToThePennTreebank
Author: David Vadas, James R. Curran
Title: Adding Noun Phrase Structure to the Penn Treebank
Journal: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics
Title URL: http://www.cs.usyd.edu.au/~james/pubs/pdf/acl07nps.pdf
Year: 2007