2005 MultiLingNEExtAndTrans

From GM-RKB

Subject Headings: Named Entity Recognition

Notes

Cited By

Quotes

Abstract

  • Named entities (NE), the nouns or noun phrases referring to persons, locations and organizations, are among the most information-bearing linguistic structures. Extracting and translating named entities benefits many natural language processing problems such as cross-lingual information retrieval, cross-lingual question answering and machine translation.
  • In this thesis work we propose an efficient and effective framework to extract and translate NEs from text and speech. We adopt the hidden Markov model (HMM) as the baseline NE extraction system, and investigate its performance in multiple language pairs with varying amounts of training data. We expand the baseline text NE tagger with a context-based NE extraction model, which aims to detect and correct NE recognition errors from automatic speech recognition hypotheses. We also adapt the broadcast news trained NE tagger for meeting transcripts.
  • We develop several language-independent features to capture phonetic and semantic similarity measures between source and target NE pairs. We incorporate these features to solve various NE translation problems presented in different language pairs (Chinese to English, Arabic to English and Hindi to English), with varying resources (parallel and non-parallel corpora as well as the World Wide Web) and different input data streams (text and speech).
  • We also propose a cluster-specific name transliteration framework. By grouping names from similar origins into one cluster and training cluster-specific transliteration and language models, we manage to dramatically reduce the name transliteration error rates.
  • The proposed NE extraction and translation framework improves NE detection performance, boosts NE translation and transliteration accuracies and helps increase machine translation quality. Overall, it significantly reduces NE information loss caused by machine translation errors and enables efficient information access overcoming language and media barriers.
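The abstract names a hidden Markov model as the baseline NE tagger. As a rough illustration of how such a tagger decodes a sentence, here is a minimal Viterbi sketch in Python. The state set, all probabilities, the `unk` floor for unseen words, and the function name `viterbi` are illustrative assumptions, not the thesis's model or parameters.

```python
import math

def viterbi(tokens, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most likely state sequence for tokens under a toy HMM."""
    if not tokens:
        return []
    # best[i][s] = (log-probability of the best path ending in s at i, previous state)
    best = [{}]
    for s in states:
        e = emit_p[s].get(tokens[0], unk)  # unk: assumed floor for unseen words
        best[0][s] = (math.log(start_p[s]) + math.log(e), None)
    for i in range(1, len(tokens)):
        best.append({})
        for s in states:
            e = emit_p[s].get(tokens[i], unk)
            score, prev = max(
                (best[i - 1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            best[i][s] = (score + math.log(e), prev)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]

# Toy parameters, invented for this example only.
states = ["PER", "LOC", "O"]
start_p = {"PER": 0.2, "LOC": 0.1, "O": 0.7}
trans_p = {
    "PER": {"PER": 0.5, "LOC": 0.05, "O": 0.45},
    "LOC": {"PER": 0.05, "LOC": 0.4, "O": 0.55},
    "O": {"PER": 0.15, "LOC": 0.15, "O": 0.7},
}
emit_p = {
    "PER": {"fei": 0.4, "huang": 0.4},
    "LOC": {"pittsburgh": 0.6},
    "O": {"visited": 0.3, "in": 0.3},
}

tags = viterbi(["fei", "huang", "visited", "pittsburgh"],
               states, start_p, trans_p, emit_p)
print(tags)  # ['PER', 'PER', 'O', 'LOC']
```

A trained tagger would estimate these tables from annotated text; the sketch only shows the decoding step.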

Thesis Contribution

This thesis work advances the research on NE extraction and translation in the following ways:

  • We design a set of crosslingual, language-independent similarity features which characterize the pronunciation similarity, the semantic similarity and the contextual similarity between NE translations;
  • We propose an NE translation framework that integrates the above features to solve various NE translation problems: bilingual NE alignment, NE projection and NE translation mining from non-parallel corpora. We successfully apply the framework in multiple language pairs: Chinese-English, Arabic-English and Hindi-English. We improve both NE translation accuracy and machine translation quality when integrating the NE translation into a statistical machine translation system.
  • We develop a cluster-specific name transliteration framework and substantially improve name transliteration accuracy and reduce character error rate.
  • We design an information-theoretic measure to estimate information loss from speech recognition and machine translation. Based on this measure, the proposed NE translation techniques significantly reduce the NE information loss by about 50%.
  • We extend the HMM NE tagger with a context-based NE extraction model, aiming to detect and correct speech NE recognition errors. This approach, combined with speech recognition confidence measures and information retrieval techniques, improves speech NE extraction and translation accuracy. To the author’s knowledge, this is the first attempt towards speech NE translation.
  • We adapt a broadcast news trained NE tagger on meeting transcripts, and significantly improve the NE extraction performance.
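The contributions above mention a pronunciation similarity feature between NE translations. The quoted text does not say how it is computed, so the following Python sketch uses a deliberately crude stand-in: normalized edit distance between a romanized source name and an English candidate. The function names and the romanized input are assumptions made for illustration; this is not the thesis's feature.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def surface_similarity(src_romanized, tgt):
    """Similarity in [0, 1]; 1.0 means identical strings (case-insensitive)."""
    a, b = src_romanized.lower(), tgt.lower()
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# Hypothetical usage: rank English candidates for a romanized source name.
candidates = ["Boston", "Bush", "Blair"]
best = max(candidates, key=lambda c: surface_similarity("bushi", c))
print(best)  # Bush
```

A score like this could serve as one feature among several when ranking candidate NE pairs; a learned transliteration model would replace the unit edit costs.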

Named Entity Recognition

  • Named entity recognition (NER), also known as NE extraction, NE detection, NE tagging or NE identification, is the task of recognizing structured information, such as proper names (person, location and organization), time (date and time) and numerical values (currency and percentage), in natural language text. It is one of the first IE tasks to be researched. Many NER systems based on pattern-matching rules or statistical models have achieved satisfactory performance on well-formed text. Based on the 1997 MUC-7/MET-2 evaluation, NE recognition systems have achieved a 94% F score on English newswire text, 85%-91% on Chinese text and 87%-93% on Japanese text.
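The F scores quoted above combine precision and recall. As a small worked example, the Python sketch below computes them at the entity level using exact span-and-type matching. Exact matching and the `(start, end, type)` tuple layout are assumptions for this illustration; the official MUC scorer followed its own conventions and may differ.

```python
def ne_f_score(gold, predicted):
    """Return (precision, recall, F1) over sets of (start, end, type) spans."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)  # exact span and type match (assumed)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Invented spans: three of four predictions match the four gold entities.
gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG"), (14, 15, "LOC")]
pred = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG"), (14, 15, "PER")]
p, r, f = ne_f_score(gold, pred)
print(p, r, f)  # 0.75 0.75 0.75
```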

References

  • AL-ONAIZAN, Y. & KNIGHT, K. (2002). Translating named entities using monolingual and bilingual resources. In ACL, 400–408. 2.4
  • Douglas E. Appelt, HOBBS, J., ISRAEL, D. & TYSON, M. (1993). Fastus: A finite-state processor for information extraction from real world texts. In: Proceedings of IJCAI-93. 2.2.1, 6.2
  • ARBABI, M., FISCHTHAL, S.M., CHENG, V.C. & BART, E. (1994). Algorithms for arabic name transliteration. IBM Journal of Research and Development, 38, 183. 2.4
  • BANERJEE, S. & LAVIE, A. (2005). Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In: Proceedings of Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association of Computational Linguistics (ACL-2005), Ann Arbor, Michigan. 9.3.4
  • BIKEL, D.M., MILLER, S., SCHWARTZ, R. & WEISCHEDEL, R. (1997). Nymble: a high-performance learning name-finder. In: Proceedings of Applied Natural Language Processing, 194–201. 2.2.2, 3.1, 3.1, 6.2
  • BLUM, A. & Tom M. Mitchell. (1998). Combining labeled and unlabeled data with cotraining. In: Proceedings of the 11th Annual Conference on Computational Learning Theory, ACM. 7.2.2
  • BORTHWICK, A. (1999). A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, New York University. 2.2.2
  • BRILL, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21, 543–565. 2.2.2
  • BROWN, P.F., COCKE, J., PIETRA, S.A.D., PIETRA, V.J.D., JELINEK, F., LAFFERTY, J.D., MERCER, R.L. & ROOSSIN, P.S. (1990). A statistical approach to machine translation. Comput. Linguist., 16, 79–85. 2.3.3.2, 7.1.1
  • BROWN, P.F., PIETRA, V.J.D., PIETRA, S.A.D. & MERCER, R.L. (1993). The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19, 263–311. 2.3.3.2, 4.2, 4.3.2, 5.2.1
  • BROWN, R. (2002). Example-based machine translation, a tutorial. AMTA Tutorials.
  • BROWN, R.D. (2000). Automated generalization of translation examples. In: Proceedings of the Eighteenth International Conference on Computational Linguistics (COLING-2000), 125–131, Saarbrücken, Germany. 2.3.3.1
  • CALIFF, M.E. & MOONEY, R.J. (1997). Relational learning of pattern-match rules for information extraction. In: Proceedings of the ACL Workshop on Natural Language Learning, 9–15, Madrid, Spain. 2.2.1
  • CARRERAS, X., MÀRQUEZ, L. & PADRÓ, L. (2002). Named entity extraction using adaboost. In: Proceedings of CoNLL-2002, 167–170, Taipei, Taiwan. 2.2.2
  • CHENG, P.J., TENG, J.W., CHEN, R.C., WANG, J.H., LU, W.H. & CHIEN, L.F. (2004). Translating unknown queries with web corpora for crosslingual information retrieval. In: Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR 2004), 146–153, ACM Press, Sheffield, United Kingdom. 2.4
  • CHIANG, D. (2005). A hierarchical phrase-based model for statistical machine translation. In ACL ’05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 263–270, Association for Computational Linguistics.
  • CHINCHOR, N. (1998). Overview of MUC-7/MET-2. In: Proceedings of the Seventh Message Understanding Conference (MUC-7). 1.1
  • Michael Collins (2002). Ranking algorithms for named-entity extraction: boosting and the voted perceptron. In ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 489–496, Association for Computational Linguistics, Morristown, NJ, USA. 2.2.2
  • Arthur P. Dempster, LAIRD, N.M. & RUBIN, D.B. (1977). Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, B, 39. 4.1
  • George Doddington (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proceedings of the Human Language Technology Conference (HLT), San Diego, CA. 5.1.4
  • DORR, B.J., JORDAN, P.W. & BENOIT, J.W. (1998). A survey of current paradigms in machine translation. Tech. Rep. CS-TR-3961. 2.3
  • FARWELL, D. & WILKS, Y. (1990). Ultra: A multi-lingual machine translator. Technical Report MCCS-90-202, Computing Research Laboratory, New Mexico State University. 2.3.1
  • FLORIAN, R., ITTYCHERIAH, A., JING, H. & ZHANG, T. (2003). Named entity recognition through classifier combination. In W. Daelemans & M. Osborne, eds., Proceedings of CoNLL-2003, 168–171, Edmonton, Canada. 2.2.2
  • GRISHMAN, R. (1997). Information extraction: Techniques and challenges. Summer Convention on Information Extraction (SCIE), 10–27. 2.1, 2.2.1
  • GRISHMAN, R. & SUNDHEIM, B. (1995). Design of the muc-6 evaluation. In: Proceedings of MUC-6. 6.2
  • HUANG, F. & VOGEL, S. (2002). Improved named entity translation and bilingual named entity extraction. In: Proceedings of the 2002 International Conference on Multimodal Interfaces (ICMI ’02). 2.4
  • HUANG, F., ZHANG, Y. & VOGEL, S. (2005a). Mining key phrase translations from web corpora. In: Proceedings of the HLT-EMNLP 2005, Vancouver, BC, Canada. 2.4
  • HUANG, F., ZHANG, Y. & VOGEL, S. (2005b). Mining key phrase translations from web corpora. In the Proceedings of the Human Language Technology and Empirical Methods for Natural Language Processing (HLT-EMNLP), Vancouver, BC, Canada.
  • KNIGHT, K. & GRAEHL, J. (1997). Machine transliteration. In P.R. Cohen & W. Wahlster, eds., Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, 128–135, Association for Computational Linguistics, Somerset, New Jersey. 2.4, 7
  • KUBALA, F., SCHWARTZ, R., STONE, R. & WEISCHEDEL, R. (1998). Named entity extraction from speech. In DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA. 6.2
  • John D. Lafferty, Andrew McCallum & Fernando Pereira (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the International Conference on Machine Learning. 2.2.2
  • LAVIE, A., LANGLEY, C., WAIBEL, A., LAZZARI, G., PIANESI, F., COLETTI, P., BALDUCCI, F. & TADDEI, L. (2001). Architecture and design considerations in nespole!: a speech translation system for e-commerce applications. In: Proceedings of HLT 2001 Human Language Technology Conference, San Diego, California. 2.3.4
  • LAVIE, A., VOGEL, S., LEVIN, L., PETERSON, E., PROBST, K., FONT, A., REYNOLDS, R., CARBONELL, J. & COHEN, R. (2003). Experiments with a hindi-to-english transfer-based mt system under a miserly data scenario.
  • LAWRENCE, S. & GILES, C.L. (1999). Accessibility of Information on the Web. Nature, 400, 107–109. 1.1
  • LYMAN, P., VARIAN, H.R., CHARLES, P., GOOD, N., JORDAN, L.L. & PALE, J. (2003). How much information? (2003). 1.1
  • Christopher D. Manning & SCHÜTZE, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press. 7.1.1, 7.3.1
  • MARCU, D. & WONG, W. (2002). A phrase-based, joint probability model for statistical machine translation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA. 7.3.1
  • MCCALLUM, A. & LI, W. (2003). Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of the Seventh Conference on Natural Language Learning (CoNLL). 2.2.2
  • MELAMED, I.D. (2000). Models of translational equivalence among words. Computational Linguistics, 26(2), 221–249. 5.1.2
  • MENG, H., LO, W.K., CHEN, B. & TANG, K. (2001). Generating phonetic cognates to handle named entities in english-chinese cross-language spoken document retrieval. In: Proceedings of the ASRU-2001, Trento, Italy. 2.4, 7
  • MILLER, D., BOISEN, S., SCHWARTZ, R., STONE, R. & WEISCHEDEL, R. (2000). Named entity extraction from broadcast news. In the sixth conference on Applied Natural Language Processing, 316–324, Seattle, WA. 6.2
  • MITAMURA, T., NYBERG, E. & CARBONELL, J. (1991). An efficient interlingua translation system for multi-lingual document production. 2.3.1
  • MOORE, R.C. (2003). Learning translations of named-entity phrases from parallel corpora. In: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary. 2.4
  • NEY, H. (1999). Speech translation: Coupling of recognition and translation. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 517–520, Phoenix, AZ. 2.3.4
  • NGAI, G. & FLORIAN, R. (2001). Transformation-based learning in the fast lane. In: Proceedings of NAACL’01, 40–47, Pittsburgh, PA. 2.2.2
  • OARD, D. (2003). The surprise language exercises. ACM Transactions on Asian Language Information Processing, 2. 5.2
  • OCH, F.J., TILLMANN, C. & NEY, H. (1999). Improved alignment models for statistical machine translation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 20–28, University of Maryland, College Park, MD. 2.3.3.2, 7.3.1
  • OGILVIE, P. & CALLAN, J. (2001). Experiments using the lemur toolkit. In: Proceedings of the 2001 Text REtrieval Conference (TREC 2001), National Institute of Standards and Technology, special publication 500-250, 103–108. 5.3.1
  • PALMER, D., OSTENDORF, M. & BURGER, J. (2000). Robust information extraction from automatically generated speech transcriptions. Speech Communication, 32, 95–109. 6.2
  • PAPINENI, K., ROUKOS, S., WARD, T. & ZHU, W.J. (2002). Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the Association of Computational Linguistics, 311–318. 2.3.3.2, 5.1.4
  • RILOFF, E. (1996). Automatically generating extraction patterns from untagged text. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence.
  • ROBINSON, P., BROWN, E., BURGER, J., CHINCHOR, N., DOUTHAT, A., FERRO, L. & Lynette Hirschman (1999). Overview: Information extraction from broadcast news. In: Proceedings of DARPA Broadcast News Workshop, 27–30. 6.1
  • S. MILLER, H.F.L.R.R.S.R.S.R.W., M. CRYSTAL & THE ANNOTATION GROUP (1998). Bbn: Description of the sift system as used for muc-7. In: Proceedings of 7th Message Understanding Conference, Fairfax, VA. 2.2.2
  • Satoshi Sekine, GRISHMAN, R. & SHINNOU, H. (1998). A decision tree method for finding and classifying names in japanese texts. In: Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada. 2.2.2
  • SHERMAN, C. (2001). Google Fires New Salvo in Search Engine Size Wars. http://searchenginewatch.com/searchday/article.php/2158371.
  • STALLS, B. & KNIGHT, K. (1998). Translating names and technical terms in arabic text. In: Proceedings of the COLING/ACL Workshop on Computational Approaches to Semitic Languages, Montreal, Quebec Canada. 2.4, 7
  • TANG, M., LUO, X. & ROUKOS, S. (2002). Active learning for statistical natural language parsing. In ACL (2002). 3.3
  • UCHIDA, H. (1985). Fujitsu machine translation system atlas. In: Proceedings of International Symposium MT. 2.3.1
  • VENUGOPAL, A., VOGEL, S. & WAIBEL, A. (2003). Effective phrase translation extraction from alignment models. In ACL, 319–326. 2.3.3.2
  • VIRGA, P. & KHUDANPUR, S. (2003). Transliteration of proper names in crosslingual information retrieval. In: Proceedings of the ACL-2003 Workshop on Multi-lingual Named Entity Recognition, Japan. 7, 7.3.3
  • VITERBI, A. (1967). Error bound for convolutional codes and asymptotically optimum decoding algorithm. IEEE Transaction on Information Theory, 13, 260–269. 3.1
  • VOGEL, S., NEY, H. & TILLMANN, C. (1996). Hmm based word alignment in statistical translation. In: Proceedings of the 16th International Conference on Computational Linguistics (COLING’96), Copenhagen, Denmark. 2.3.3.2, 4.2
  • VOGEL, S., ZHANG, Y., HUANG, F., TRIBBLE, A., VENUGOPAL, A., ZHAO, B. & WAIBEL, A. (2003). The CMU statistical translation system. In: Proceedings of MT Summit IX, New Orleans, LA. 2.3.3.2, 5.1.4, 5.2.1, 5.3.4.2, 7.3.1
  • WAHLSTER, W., ed. (2000). Verbmobil: Foundations of Speech-to-Speech Translation. Springer, Berlin. 2.3.4
  • WAIBEL, A., LAVIE, A. & LEVIN, L.S. (1997). Janus: A system for translation of conversational speech. Künstliche Intelligenz. 2.3.1
  • WU, D. (1997). Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23, 377–404. 2.3.3.2
  • WU, D., NGAI, G. & CARPUAT, M. (2003). A stacked, voted, stacked model for named entity recognition. In W. Daelemans & M. Osborne, eds., Proceedings of CoNLL-2003, 200–203, Edmonton, Canada. 2.2.2
  • YAMADA, K. & KNIGHT, K. (2001). A syntax-based statistical translation model. In ACL ’01: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, 523–530, Association for Computational Linguistics, Morristown, NJ, USA. 2.3.3.2
  • YAROWSKY, D. & NGAI, G. (2001). Inducing multilingual pos taggers and np bracketers via robust projection across aligned corpora. In the Proceedings of NAACL, 200–207. 2.4, 5.2.1
  • ZECHNER, K. (2001). Automatic summarization of spoken dialogs in unrestricted domains. Ph.D. thesis, Language Technologies Institute, Carnegie Mellon University. 6
  • ZHAI, L., FUNG, P., SCHWARTZ, R., CARPUAT, M. & WU, D. (2004). Using nbest lists for named entity recognition from chinese speech. In the Proceedings of the HLT/NAACL 2004, Boston, MA. 6.2
  • ZHANG, Y. & VINES, P. (2004). Using the web for automated translation extraction in cross-language information retrieval. In: Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR 2004), 162–169, ACM Press, Sheffield, United Kingdom. 2.4
  • ZHANG, Y. & VOGEL, S. (2005). An efficient phrase-to-phrase alignment model for arbitrarily long phrase and large corpora. In: Proceedings of the Tenth Conference of the European Association for Machine Translation (EAMT-05), The European Association for Machine Translation, Budapest, Hungary. 7.2.2
  • ZHANG, Y., VOGEL, S. & WAIBEL, A. (2003). Integrated phrase segmentation and alignment algorithm for statistical machine translation. In: Proceedings of International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE’03), Beijing, China. 2.3.3.2
  • ZHANG, Y., HUANG, F. & VOGEL, S. (2005). Mining translations of oov terms from the web through cross-lingual query expansion. In the Proceedings of the 28th Annual International ACM SIGIR, Salvador, Brazil. 5.3.2, 9.3.2
  • ZHAO, B. & VOGEL, S. (2003). Word alignment based on bilingual bracketing. In Rada Mihalcea & T. Pedersen, eds., HLT-NAACL 2003 Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, 15–18, Association for Computational Linguistics, Edmonton, Alberta, Canada. 2.3.3.2
  • ZHOU, G. & SU, J. (2002). Named entity recognition using an hmm-based chunk tagger. In: Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, PA. 2.2.2


  • Key: 2005 MultiLingNEExtAndTrans
  • Author: Fei Huang
  • Title: Multilingual Named Entity Extraction and Translation from Text and Speech
  • Type: Doctoral Dissertation
  • URL: http://www.lti.cs.cmu.edu/Research/Thesis/FeiHuang06.pdf
  • Year: 2005