2008 ADependencyParsApprToBiomedTMining

Subject Headings: Text Mining, Biomedical Domain, Dependency Parsing.

Notes

Cited By

Quotes

Abstract

Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language.

This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. The work described here involves more specifically the goal of extracting information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach taken is one using full parsing — syntactic analysis of the entire structure of sentences — and machine learning, aiming to develop reliable methods that can further be generalized to apply also to other domains.

The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance and introduced a domain-adapted parser that reduced the error rate of the original by 10% while also roughly halving parsing time.

To establish the relative merits of parsers that differ in the applied formalisms and the representation given to their syntactic analyses, we have also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unification of diverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization.
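
As a rough illustration of what a dependency-based evaluation can look like once parser outputs have been converted to a shared representation, the sketch below scores a set of predicted dependencies against a gold-standard set using precision, recall and F-score. This is not the evaluation procedure of the thesis: the triple format (head, dependent, type), the function name `dependency_prf`, and the example sentence and dependency labels are all assumptions made for illustration only.

```python
from typing import Iterable, Tuple

# Hypothetical representation: one dependency is (head word, dependent word, type).
Dependency = Tuple[str, str, str]


def dependency_prf(
    gold: Iterable[Dependency], predicted: Iterable[Dependency]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) of predicted dependencies against gold.

    A predicted dependency counts as correct only if head, dependent and type
    all match a gold dependency exactly.
    """
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Invented example for "profilin binds monomeric actin".
gold = [
    ("binds", "profilin", "nsubj"),
    ("binds", "actin", "dobj"),
    ("actin", "monomeric", "amod"),
]
# The parser attaches "monomeric" to the wrong head.
predicted = [
    ("binds", "profilin", "nsubj"),
    ("binds", "actin", "dobj"),
    ("binds", "monomeric", "amod"),
]

precision, recall, f_score = dependency_prf(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F={f_score:.3f}")
```

Because both analyses are compared as sets of triples in one scheme, the same function can score parsers built on different formalisms, provided their output has first been converted to that scheme.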

To support this and other research, we have also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships. The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6000 entities, 2500 relationships and 28,000 syntactic dependencies in 1100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships.

Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications.

References

2007

  • Ahmed, S. T., Chidambaram, D., Davulcu, H., and Baral, C. (2005). IntEx: A syntactic role driven protein-protein interaction extractor for bio-medical text. In: Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases (BioLINK’05), pages 54–61.
  • Airola, A., Pyysalo, S., Bj¨orne, J.,, Pahikkala, T., Ginter, F., and Salakoski, T. (2008). A graph kernel for protein-protein interaction extraction. In Proceedings of the ACL’08 Workshop on Current Trends in Biomedical Natural Language Processing (BioNLP’08), pages 1–9. Association for Computational Linguistics.
  • Alex, B., Grover, C., Haddow, B., Kabadjov, M., Klein, E., Matthews, M., Roebuck, S., Tobin, R., and Wang, X. (2008). Assisted curation: Does text mining really help? In: Proceedings of the Pacific Symposium on Biocomputing (PSB’08).
  • Alphonse, E., Aubin, S., Bessi´eres, P., Bisson, G., Hamon, T., Laguarigue, S., Nazarenko, A., Manine, A.-P., N´edellec, C., Vetah, M. O. A., Poibeau, T., and Weissenbacher, D. (2004). Event-based information extraction for the biomedical domain: The Caderige project. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA), pages 43–49.
  • Ananiadou, S., Kell, D. B., and Tsujii, J. (2006). Text mining and its potential applications in systems biology. Trends in Biotechnology, 24:571–579.
  • Ananiadou, S. and McNaught, J., editors (2006). Text Mining for Biology and Biomedicine. Artech House Publishers. Ananiadou, S. and Nenadic, G. (2006). Automatic terminology management in biomedicine. In Ananiadou, S. and McNaught, J., editors, Text Mining for Biology and Biomedicine, pages 67–97. Artech house.
  • Ando, R. K. (2007). BioCreative II gene mention tagging system at IBM Watson. In: Proceedings of the Second BioCreative Challenge Evaluation, pages 101–103.
  • Andrade, M. A. and Valencia, A. (1998). Automatic extraction of keywords from scientific text: Application to the knowledge domain of protein families. Bioinformatics, 14(7):600–607.
  • Aubin, S. (2003). Evaluation comparative de deux analyseurs produisant des relations syntaxiques. In: Proceedings of the Workshop Traitement Automatique des Langues Naturelles (TALN), pages 67–76.
  • Aubin, S. (2005). LLL challenge - syntactic analysis guidelines. Technical report, LIPN, Universit´e Paris Nord, Villetaneuse. Aubin, S., Nazarenko, A., and N´edellec, C. (2005). Adapting a general parser to a sublanguage. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’05), pages 89–93.
  • Bader, G. D., Donaldson, I., Wolting, C., Ouellette, B. F., Pawson, T., and Hogue, C. W. (2001). BIND– the biomolecular interaction network database. Nucleic Acids Reserarch, 29(1):242–245.
  • Bairoch, A., Apweiler, R., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M. J., Natale, D. A., O’Donovan, C., Redaschi, N., and Yeh, L.-S. L. (2005). The universal protein resource (UniProt). Nucleic Acids Research, 33(Suppl. 1):D154–159.
  • Baumgartner, William A., J., Cohen, K. B., Fox, L. M., Acquaah-Mensah, G., and Hunter, L. (2007). Manual curation is not sufficient for annotation of genomic databases. Bioinformatics, 23(13):i41–48.
  • Bj¨orne, J., Pyysalo, S., Ginter, F., and Salakoski, T. (2008). How complex are complex protein-protein interactions? In: Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM’08). To appear.
  • Black, E., Abney, S., Flickenger, D., Gdaniec, C., Grishman, R., Harrison, P., Hindle, D., Ingria, R., Jelinek, F., Klavans, J., Liberman, M., Marcus, M., Roukos, S., Santorini, B., and Strzalkowski, T. (1991). A procedure for quantitatively comparing the syntactic coverage of english grammars. In: Proceedings of the DARPA Speech and Natural Language Workshop, pages 306–311.
  • Blaschke, C., Andrade, M. A., Ouzounis, C., and Valencia, A. (1999). Automatic extraction of biological information from scientific text: Proteinprotein interactions. In: Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (ISMB’99), pages 60–67.
  • Blaschke, C. and Valencia, A. (2002). The frame-based module of the SUISEKI information extraction system. IEEE Intelligent Systems, 17(2).
  • Bloomfield, L. (1933). Language. Holt, Rinehart and Winston. Bod, R. (2001). What is the minimal set of fragments that achieves maximal parse accuracy? In: Proceedings of 39th Annual Meeting of the Association for Computational Linguistics (ACL’01), pages 66–73.
  • Bod, R. (2003). An efficient implementation of a new dop model. In: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL’03), pages 19–26.
  • Bodenreider, O. (2004). The unified medical language system (UMLS): Integrating biomedical terminology. Nucleic Acids Research, 32(Suppl. 1):D267–270.
  • Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.-C., Estreicher, A., Gasteiger, E., Martin, M. J., Michoud, K., O’Donovan, C., Phan, I., Pilbout, S., and Schneider, M. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research, 31(1):365–370.
  • Bresnan, J. and Kaplan, R. (1982). Lexical-functional grammar: A formal system for grammatical representation. In The Mental Representation of Grammatical Relations, pages 173–281. MIT Press.
  • Brill, E. (1992). A simple rule-based part of speech tagger. In: Proceedings of the Third Conference on Applied Natural Language Processing (ANLP’02), pages 152–155.
  • Briscoe, T. and Carroll, J. (2002). Robust accurate statistical annotation of general text. In: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), pages 1499–1504.
  • Bunescu, R. C. (2007). Learning for Information Extraction: From Named Entity Recognition and Disambiguation To Relation Extraction. PhD thesis, University of Texas at Austin.
  • Bunescu, R. C., Ge, R., Kate, R. J., Marcotte, E. M., Mooney, R. J., Ramani, A. K., and Wong, Y. W. (2005). Comparative experiments on learning information extractors for proteins and their interactions. Artificial Intelligence in Medicine, 33(2):139–155.
  • Bunescu, R. C. and Mooney, R. (2006). Subsequence kernels for relation extraction. In Advances in Neural Information Processing Systems 1. (NIPS’06), pages 171–178.
  • Bunescu, R. C. and Mooney, R. J. (2005). A shortest path dependency kernel for relation extraction. In: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT-EMNLP’05), pages 724–731.
  • Carroll, J. E., Briscoe, E., and Sanfilippo, A. (1998). Parser evaluation: A survey and a new proposal. In: Proceedings of the First International Conference on Language Resources and Evaluation (LREC’98).
  • Casta˜no, J. and Pustejovsky, J. (2005). Tagging with delayed disambiguation. In: Proceedings of the Fifth International Workshop on Finite-State Methods and Natural Language Processing (FSMNLP’05), pages 285–287.
  • Charniak, E. (1997). Statistical Parsing with a Context-Free Grammar and Word Statistics. In: Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI’97).
  • Charniak, E. (2000). A maximum-entropy-inspired parser. In: Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL’00), pages 132–139.
  • Charniak, E. and Johnson, M. (2005). Coarse-to-fine n-best parsing and maxent discriminative reranking. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 173–180.
  • Chomsky, N. (1957). Syntactic Structures. Mouton. Chun, H.-W., Yoshimasa Tsuruoka., Jin-Dong Kim., Shiba, R., Nagata, N., Hishiki, T., and Tsujii, J. (2006). Extraction of gene-disease relations from medline using domain dictionaries and machine learning. In: Proceedings of the Pacific Symposium on Biocomputing (PSB’06), pages 4–15.
  • Clark, S. and Curran, J. (2007). Formalism-independent parser evaluation with ccg and depbank. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL’07), pages 248–255.
  • Clegg, A. B. (2008). Computational-Linguistic Approaches to Biomedical Text Mining. PhD thesis, University of London.
  • Clegg, A. B. and Shepherd, A. J. (2005). Evaluating and integrating treebank parsers on a biomedical corpus. In: Proceedings of the Association for Computational Linguistics Workshop on Software.
  • Clegg, A. B. and Shepherd, A. J. (2007). Benchmarking natural-language parsers for biological applications using dependency graphs. BMC Bioinformatics, 8(1):24.
  • Cohen, A. M. and Hersh, W. R. (2005). A survey of current work in biomedical text mining. Briefings in Bioinformatics, 6(1):57–71.
  • Collins, M. (1996). A new statistical parser based on bigram lexical dependencies. In: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL’96), pages 184–191.
  • Collins, M. (1997). Three generative, lexicalised models for statistical parsing. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL’97), pages 16–23.
  • Collins, M. (1999). Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania. Collins, M. (2000). Discriminative reranking for natural language parsing. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML’00).
  • Corney, D. P. A., Buxton, B. F., Langdon, W. B., and Jones, D. T. (2004). BioRAT: Extracting biological information from full-length papers. Bioinformatics, 20(17):3206–3213.
  • Couto, F. M., Silva, M. J., Lee, V., Dimmer, E., Camon, E., Apweiler, R., Kirsch, H., and Rebholz-Schuhmann, D. (2006). GOAnnotator: linking protein GO annotations to evidence text. Journal of Biomedical Discovery and Collaboration, 1:19.
  • Covington, M. A. (1990). A dependency parser for variable-word-order languages. Technical Report AI-1990-01, University of Georgia.
  • Craven, M. and Kumlien, J. (1999). Constructing biological knowledge bases by extracting information from text sources. In: Proceedings of the Seventh International Conference on Intelligent Systems in Molecular Biology (ISMB’99), pages 77–86.
  • Culotta, A. and Sorensen, J. (2004). Dependency tree kernels for relation extraction. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (ACL’04), pages 423–429.
  • Daraselia, N., Yuryev, A., Egorov, S., Novichkova, S., Nikitin, A., and Mazo, I. (2004). Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics, 20(5):604–611.
  • Deriviere, J., Hamon, T., and Nazarenko, A. (2006). A scalable and distributed NLP architecture for web document annotation. In: Proceedings of the Fifth International Conference on Natural Language Processing Fin- TAL’06, pages 56–67.
  • Ding, J., Berleant, D., Nettleton, D., and Wurtele, E. (2002). Mining MEDLINE: Abstracts, sentences, or phrases? In: Proceedings of the Pacific Symposium on Biocomputing (PSB’02), pages 326–337.
  • Ding, J., Berleant, D., Xu, J., and Fulmer, A. W. (2003). Extracting biochemical interactions from MEDLINE using a link grammar parser. In Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’03), pages 467–471.
  • Dingare, S., Nissim, M., Finkel, J., Manning, C., and Grover, C. (2005). A system for identifying named entities in biomedical text: How results from two evaluations reflect on both the system and the evaluations. Comparative and Functional Genomics, 6(1-2):77–85.
  • Doddington, G., Mitchell, A., Przybocki, M., Ramshaw, L., Strassel, S., and Weischedel, R. (2004). The Automatic Content Extraction (ACE) program: Tasks, data, and evaluation. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), pages 837–840.
  • Donaldson, I., Martin, J., de Bruijn, B., Wolting, C., Lay, V., Tuekam, B., Zhang, S., Baskin, B., Bader, G. D., Michalickova, K., Pawson, T., and Hogue, C. W. (2003). PreBIND and Textomy – mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics, 4:11.
  • Entwisle, J. and Powers, D. (1998). The present use of statistics in the evaluation of nlp parsers. In: Proceedings of the Joint Conference on New Methods in Language Processing and Computational Natural Language Learning (NeMLaP-CoNLL’98), pages 215–224.
  • Erkan, G., ¨Ozgür, A., and Radev, D. R. (2007). Semi-supervised classification for extracting protein interaction sentences using dependency parsing. In: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL’07), pages 228–237.
  • Finkel, J., Dingare, S., Manning, C. D., Nissim, M., Alex, B., and Grover, C. (2005). Exploring the boundaries: Gene and protein identification in biomedical text. BMC Bioinformatics, 6(Suppl. 1):S5.
  • Finkel, J., Dingare, S., Nguyen, H., Nissim, M., Manning, C., and Sinclair, G. (2004). Exploiting context for biomedical entity recognition: From syntax to the web. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA), pages 88–91.
  • Franz´en, K., Eriksson, G., Olsson, F., Asker, L., Lid´en, P., and C¨oster, J. (2002). Protein names and how to find them. International Journal of Medical Informatics, 4(67):49–61.
  • Friedman, C., Kra, P., and Rzhetsky, A. (2002). Two biomedical sublanguages: A description based on the theories of Zellig Harris. Journal of Biomedical Informatics, 35:222–235.
  • Friedman, C., Kra, P., Yu, H., Krauthammer, M., and Rzhetsky, A. (2001). GENIES: A natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17(Suppl. 1):S74–S82.
  • Fukuda, K., Tsunoda, T., Tamura, A., and Takagi, T. (1998). Toward information extraction: Identifying protein names from biological papers. In Proceedings of the Pacific Symposium on Biocomputing (PSB’98), pages 707–718.
  • Fundel, K., Kuffner, R., and Zimmer, R. (2007). RelEx–Relation extraction using dependency parse trees. Bioinformatics, 23(3):365–371.
  • Gaifman, H. (1965). Dependency systems and phrase-structure systems. Information and Control, 8:304–337.
  • Gaizauskas, R., Demetriou, G., Artymiuk, P. J., and Willett, P. (2003). Protein structures and information extraction from biological texts: The PASTA system. Bioinformatics, 19(1):135–143.
  • Galton, A. (2006). Processes as continuants (abstract). In: Proceedings of the Thirteenth International Symposium on Temporal Representation and Reasoning (TIME’06).
  • Ginter, F. (2007). Towards Information Extraction in the Biomedical Domain: Methods and Resources. PhD thesis, Turku Centre for Computer Science (TUCS).
  • Ginter, F., Boberg, J., J¨arvinen, J., and Salakoski, T. (2004a). New techniques for disambiguation in natural language and their application to biological text. Journal of Machine Learning Research, 5:605–621.
  • Ginter, F., Pyysalo, S., Bj¨orne, J., Heimonen, J., and Salakoski, T. (2007). BioInfer relationship annotation manual. Technical Report TR 806, Turku Centre for Computer Science (TUCS).
  • Ginter, F., Pyysalo, S., Boberg, J., J¨arvinen, J., and Salakoski, T. (2004b). Ontology-based feature transformations: A data-driven approach. In Proceedings of the Fourth International Conference EsTAL 04, Alicante, Spain, pages 279–290.
  • Ginter, F., Pyysalo, S., and Salakoski, T. (2005). Document classification using semantic networks with an adaptive similarity measure. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’05), pages 204–211.
  • Giuliano, C., Lavelli, A., and Romano, L. (2006). Exploiting shallow linguistic information for relation extraction from biomedical literature. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL’06), pages 401–408.
  • Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J. D., Jacq, C., Johnston, M., Louis, E. J., Mewes, H. W., Murakami, Y., Philippsen, P., Tettelin, H., and Oliver, S. G. (1996). Life with 6000 genes. Science, 274(5287):546–567.
  • Goodman, J. (1997). Probabilistic feature grammars. In: Proceedings of the Fourth International Workshop on Parsing Technologies (IWPT’97).
  • Grigoriev, A. (2003). On the number of protein-protein interactions in the yeast proteome. Nucleic Acids Research, 31(14):4157–4161.
  • Grinberg, D., Lafferty, J., and Sleator, D. (1995). A robust parsing algorithm for link grammars. In: Proceedings of the Fourth International Workshop on Parsing Technologies (IWPT’95).
  • Grishman, R. (2001). Adaptive information extraction and sublanguage analysis. In: Proceedings of the IJCAI’01 Workshop on Adaptive Text Extraction and Mining.
  • Grishman, R. (2003). Information extraction. In Mitkov, R., editor, The Oxford Handbook of Computational Linguistics, pages 545–559. Oxford University Press.
  • Grishman, R. and Sundheim, B. (1996). Message understanding conference- 6: A brief history. In: Proceedings of the 16th International Conference on Computational Linguistics (COLING’96), pages 466–471.
  • Grover, C., Carroll, J., and Briscoe, T. (1993). The Alvey natural language tools grammar (4th release). Technical Report 284, University of Cambridge.
  • Grover, C., Lapata, M., and Lascarides, A. (2005). A comparison of parsing technologies for the biomedical domain. Journal of Natural Language Engineering, 11(1):27–65.
  • Grover, C. and Lascarides, A. (2001). Xml-based data preparation for robust deep parsing. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL’01), pages 260–267.
  • Hanley, J. A. and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1).
  • Hao, Y., Zhu, X., Huang, M., and Li, M. (2005). Discovering patterns to extract protein-protein interactions from the literature: Part II. Bioinformatics, 21(15):3294–3300.
  • Hara, T., Miyao, Y., and Tsujii, J. (2007). Evaluating impact of re-training a lexical disambiguation model on domain adaptation of an hpsg parser. In: Proceedings of the Tenth International Conference on Parsing Technologies (IWPT’07), pages 11–22.
  • Harris, Z. (1968). Mathematical Structures of Language. Wiley-Interscience. Hasegawa, T., Sekine, S., and Grishman, R. (2004). Discovering relations among named entities from large corpora. In: Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), pages 415–422.
  • Hatzivassiloglou, V., Dubou´e, P., A., and Rzhetsky, A. (2001). Disambiguating proteins, genes and RNA in text: A machine learning approach. Bioinformatics, 17(Suppl. 1):97–106.
  • Haverinen, K., Ginter, F., Pyysalo, S., and Salakoski, T. (2008). Accurate conversion of dependency parses: targeting the stanford scheme. In: Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM’08). To appear.
  • Hays, D. G. (1964). Dependency theory: A formalism and some observations. Language, 40:511–525.
  • Heimonen, J., Pyysalo, S., Ginter, F., and Salakoski, T. (2008). Complexto- pairwise mapping of biological relationships using a semantic network representation. In: Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM’08). To appear.
  • Hersh, W., Buckley, C., Leone, T. J., and Hickam, D. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In: Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR’94), pages 192–201.
  • Hersh, W. R., Bhupatiraju, R. T., Ross, L., Johnson, P., Cohen, A. M., and Kraemer, D. F. (2004). TREC 2004 genomics track overview. In Proceedings of the 13th Text Retrieval Conference (TREC’04).
  • Hirschman, L., Colosimo, M., Morgan, A., and Yeh, A. (2005a). Overview of BioCreAtIvE task 1B: Normalized gene lists. BMC Bioinformatics, 6(Suppl. 1):S11.
  • Hirschman, L., Morgan, A. A., and Yeh, A. S. (2002a). Rutabaga by any other name: extracting biological names. Journal of Biomedical Informatics, 35(4):247–259.
  • Hirschman, L., Park, J. C., Tsujii, J., Wong, L., and Wu, C. H. (2002b). Accomplishments and challenges in literature data mining for biology. Bioinformatics, 18(12):1553–1561.
  • Hirschman, L., Yeh, A., Blaschke, C., and Valencia, A. (2005b). Overview of BioCreAtIvE: Critical assessment of information extraction for biology. BMC Bioinformatics, 6(Suppl. 1):S1.
  • Huang, L. (2008). Forest reranking: Discriminative parsing with non-local features. In: Proceedings of ACL-08: HLT, pages 586–594.
  • Huang, M., Zhu, X., Hao, Y., Payan, D. G., Qu, K., and Li, M. (2004). Discovering patterns to extract protein-protein interactions from full texts. Bioinformatics, 20(18):3604–3612.
  • Hunter, L. and Cohen, K. B. (2006). Biomedical language processing: What’s beyond PubMed? Molecular Cell, 21(5):589–594.
  • Hunter, L., Lu, Z., Firby, J., Baumgartner, W. A., Johnson, H. L., Ogren, P. V., and Cohen, K. B. (2008). OpenDMAP: An open-source, ontologydriven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-specific gene expression. BMC Bioinformatics, 9(78).
  • Jang, H., Lim, J., Lim, J.-H., Park, S.-J., Lee, K.-C., and Park, S.-H. (2006). Finding the evidence for protein-protein interactions from pubmed abstracts. Bioinformatics, 22(14):e220–226.
  • Jenssen, T.-K., Laegreid, A., Komorowski, J., and Hovig, E. (2001). A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28:21–28.
  • Joshi-Tope, G., Gillespie, M., Vastrik, I., D’Eustachio, P., Schmidt, E., de Bono, B., Jassal, B., Gopinath, G., Wu, G., Matthews, L., Lewis, S., Birney, E., and Stein, L. (2005). Reactome: a knowledgebase of biological pathways. Nucleic Acids Research, 33(Suppl. 1):D428–432.
  • Jurafsky, D. and Martin, J. H. (2000). Speech and Language Processing. Prentice-Hall.
  • Kakkonen, T. (2007). Framework and Resources for Natural Language Parser Evaluation. PhD thesis, University of Joensuu. Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1):27–30.
  • Kaplan, R., Riezler, S., King, T. H., Maxwell III, J. T., Vasserman, A., and Crouch, R. (2004). Speed and accuracy in shallow and deep stochastic parsing. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HTL-NAACL’04), pages 97–104.
  • Karamanis, N., Seal, R., Lewin, I., McQuilton, P., Vlachos, A., Gasperin, C., Drysdale, R., and Briscoe, T. (2008). Natural Language Processing in aid of FlyBase curators. BMC Bioinformatics, 9:193.
  • Karlsson, F. (1990). Constraint grammar as a framework for parsing running text. In: Proceedings of the 13th International Conference on Computational linguistics (COLING’90), pages 168–173.
  • Karopka, T., Scheel, T., Bansemer, S., and Glass, ¨A. (2004). Automatic construction of gene relation networks using text mining and gene expression data. Medical Informatics and the Internet in Medicine, 29(2):169–183.
  • Karp, P. D., Riley, M., Saier, M., Paulsen, I. T., Collado-Vides, J., Paley, S. M., Pellegrini-Toole, A., Bonavides, C., and Gama-Castro, S. (2002). The EcoCyc database. Nucleic Acids Research, 30(1):56–58.
  • Karttunen, L. (2007). Word play. Computational Linguistics, 33(4):443–467.
  • Katrenko, S. and Adriaans, P. (2006). Learning relations from biomedical corpora using dependency trees. In: Proceedings of the First Workshop on Knowledge Discovery and Emergent Complexity in BioInformatics (KDECB’06), pages 61–80.
  • Jin-Dong Kim., Ohta, T., and Tsujii, J. (2008a). Corpus annotation for mining biomedical events from literature. BMC Bioinformatics, 9(10).
  • Jin-Dong Kim., Ohta, T., Yoshimasa Tsuruoka., Tateisi, Y., and Collier, N. (2004). Introduction to the bio-entity recognition task at JNLPBA. In Collier, N., Ruch, P., and Nazarenko, A., editors, Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA), pages 70–75.
  • Kim, S., Shin, S.-Y., Lee, I.-H., Kim, S.-J., Sriram, R., and Zhang, B.-T. (2008b). PIE: an online prediction system for protein-protein interactions from text. Nucleic Acids Research, 36(Suppl. 2):W411–415.
  • Kim, S., Yoon, J., and Yang, J. (2008c). Kernel approaches for genic interaction extraction. Bioinformatics, 24(1):118–126.
  • Koike, A., Kobayashi, Y., and Takagi, T. (2003). Kinase pathway database: An integrated protein-kinase and nlp-based protein-interaction resource. Genome Research, 13:1241–1243.
  • Koike, A., Niwa, Y., and Takagi, T. (2005). Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics, 21(7):1227–1236.
  • Krallinger, M., Leitner, F., and Valencia, A. (2007). Assessment of the second BioCreative PPI task: Automatic extraction of protein-protein interactions. In: Proceedings of the Second BioCreative Challenge Evaluation, pages 41–54.
  • Krauthammer, M. and Nenadic, G. (2004). Term identification in the biomedical literature. Journal of Biomedical Informatics, 37:512–526.
  • Kulick, S., Bies, A., Liberman, M., Mandel, M., McDonald, R., Palmer, M., Schein, A., and Ungar, L. (2004). Integrated annotation for biomedical information extraction. In: Proceedings of the HLT-NAACL 2004 Workshop on Linking Biological Literature, Ontologies and Databases (BioLINK’04), pages 61–68.
  • Laippala, V., Ginter, F., Pyysalo, S., and Salakoski, T. (2008). Resourceefficient construction of a full parser for Finnish nursing narratives. In Proceedings of the First Louhi Conference on Text and Data Mining of Clinical Documents. To appear.
  • Lease, M. and Charniak, E. (2005). Parsing biomedical literature. In: Proceedings of the Second International Joint Conference on Natural Langage Processing (IJCNLP’05), pages 58–69.
  • Lehnert, W., Cardie, C., Fisher, D., McCarthy, J., Riloff, E., and Soderland, S. (1992). University of massachusetts: Muc-4 test results and analysis. In Proceedings of the Fourth Message Understanding Conference (MUC-4), pages 151–158.
  • Leroy, G. and Chen, H. (2002). Filling preposition-based templates to capture information from medical abstracts. In: Proceedings of the Pacific Symposium on Biocomputing (PSB’02), pages 350–361.
  • Leroy, G., Chen, H., and Martinez, J. D. (2003). A shallow parser based on closed-class words to capture relations in biomedical text. Journal of Biomedical Informatics, 36(3):145–158.
  • Leser, U. and Hakenberg, J. (2005). What makes a gene name? named entity recognition in the biomedical literature. Briefings in Bioinformatics, 6(4):357–369.
  • Levy, R. and Manning, C. (2004). Deep dependencies from context-free statistical parsers: Correcting the surface dependency approximation. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), pages 327–334.
  • Lin, D. (1995). A dependency-based method for evaluating broad-coverage parsers. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’95), pages 1420–1427.
  • Liu, H., Hu, Z.-Z., Zhang, J., and Wu, C. (2006). BioThesaurus: a web-based thesaurus of protein and gene names. Bioinformatics, 22(1):103–105.
  • Magerman, D. M. (1994). Natural language parsing as statistical pattern recognition. PhD thesis, Stanford University.
  • Magerman, D. M. (1995). Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL’95), pages 276–283.
  • Manning, C. D. and Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. MIT Press.
  • Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330.
  • de Marneffe, M.-C., MacCartney, B., and Manning, C. D. (2006). Generating typed dependency parses from phrase structure parses. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), pages 449–454.
  • Mathivanan, S., Periaswamy, B., Gandhi, T., Kandasamy, K., Suresh, S., Mohmood, R., Ramachandra, Y., and Pandey, A. (2006). An evaluation of human protein-protein interaction data in the public domain. BMC Bioinformatics, 7(Suppl. 5)(S19).
  • McClosky, D., Charniak, E., and Johnson, M. (2006). Effective self-training for parsing. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL’06), pages 152–159.
  • McCray, A. T., Aronson, A. R., Browne, A. C., Rindflesch, T. C., Razi, A., and Srinivasan, S. (1993). UMLS knowledge for biomedical language processing. Bulletin of the Medical Library Association, 81(2):184–194.
  • McDonald, D. M., Chen, H., Su, H., and Marshall, B. B. (2004). Extracting gene pathway relations using a hybrid grammar: The Arizona relation parser. Bioinformatics, 20(18):3370–3378.
  • McDonald, R., Pereira, F., Kulick, S., Winters, S., Jin, Y., and White, P. (2005). Simple algorithms for complex relation extraction with applications to biomedical IE. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 491–498.
  • McNaught, J. and Black, W. J. (2006). Information extraction. In Ananiadou, S. and McNaught, J., editors, Text Mining for Biology and Biomedicine, pages 143–177. Artech House.
  • Gabor Melli. (2007). Inductive approaches to the detection and classification of semantic relation mentions. Technical report, Simon Fraser School of Computing Science.
  • (Melli et al., 2007) ⇒ Gabor Melli, Martin Ester, and Anoop Sarkar. (2007). “Recognition of Multi-sentence n-ary Subcellular Localization Mentions in Biomedical Abstracts.” In: Proceedings of the 2nd International Symposium on Languages in Biology and Medicine (LBM 2007).
  • Mel’čuk, I. A. (1988). Dependency Syntax: Theory and Practice. State University of New York Press.
  • Mikheev, A. (1997). Automatic rule induction for unknown-word guessing. Computational Linguistics, 23(3):405–423.
  • Miller, S., Crystal, M., Fox, H., Ramshaw, L., Schwartz, R., Stone, R., and Weischedel, R. (1998). Algorithms that learn to extract information - BBN: Description of the SIFT system as used for MUC-7. In: Proceedings of the Seventh Message Understanding Conference (MUC-7).
  • Mitkov, R., editor (2003). The Oxford Handbook of Computational Linguistics. Oxford University Press.
  • Mitsumori, T., Murata, M., Fukuda, Y., Doi, K., and Doi, H. (2006). Extracting protein-protein interaction information from biomedical text with SVM. IEICE Transactions on Information and Systems, E89-D(8).
  • Miyao, Y., Sætre, R., Sagae, K., Matsuzaki, T., and Tsujii, J. (2008). Task-oriented evaluation of syntactic parsers and their representations. In: Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics (ACL’08), pages 46–54.
  • Miyao, Y., Sagae, K., and Tsujii, J. (2007). Towards framework-independent evaluation of deep linguistic parsers. In: Proceedings of the Grammar Engineering across Frameworks Workshop (GEAF’07).
  • Miyao, Y. and Tsujii, J. (2005). Probabilistic disambiguation models for wide-coverage HPSG parsing. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 83–90, Ann Arbor, Michigan. Association for Computational Linguistics.
  • Morgan, A. A. and Hirschman, L. (2007). Overview of BioCreative II gene normalization. In: Proceedings of the Second BioCreative Challenge Evaluation, pages 101–103.
  • Mueller, E. T. (1987). Daydreaming and Computation: A computer model of everyday creativity, learning, and emotions in the human stream of thought. PhD thesis, University of California, Los Angeles.
  • Müller, H.-M., Kenny, E. E., and Sternberg, P. W. (2004). Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLoS Biology, 2(11):e309.
  • Nédellec, C. (2005). Learning language in logic - genic interaction extraction challenge. In: Proceedings of the Learning Language in Logic Workshop (LLL’05).
  • Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., and Yuret, D. (2007). The CoNLL 2007 shared task on dependency parsing. In: Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL’07, pages 915–932.
  • Nobata, C., Collier, N., and Tsujii, J. (2000). Comparison between tagged corpora for the named entity task. In: Proceedings of the ACL Workshop on Comparing Corpora, pages 20–27.
  • Ohta, T., Miyao, Y., Ninomiya, T., Yoshimasa Tsuruoka, Yakushiji, A., Masuda, K., Takeuchi, J., Yoshida, K., Hara, T., Jin-Dong Kim, Tateisi, Y., and Tsujii, J. (2006). An intelligent search engine and GUI-based efficient MEDLINE search tool based on deep syntactic parsing. In: Proceedings of COLING-ACL’06, pages 17–20.
  • Ohta, T., Tateisi, Y., Mima, H., and Tsujii, J. (2002). GENIA corpus: An annotated research abstract corpus in molecular biology domain. In Proceedings of the Human Language Technology Conference (HLT’02), pages 73–77.
  • Ono, T., Hishigaki, H., Tanigami, A., and Takagi, T. (2001). Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics, 17(2):155–161.
  • Pahikkala, T. (2008). New Kernel Functions and Learning Methods for Text and Data Mining. PhD thesis, Turku Centre for Computer Science (TUCS).
  • Pahikkala, T., Ginter, F., Boberg, J., Järvinen, J., and Salakoski, T. (2005a). Contextual weighting for support vector machines in literature mining: An application to gene versus protein name disambiguation. BMC Bioinformatics, 6(1):157.
  • Pahikkala, T., Pyysalo, S., Boberg, J., Järvinen, J., and Salakoski, T. (2008). Matrix representations, linear transformations, and kernels for natural language processing. Machine Learning. To appear.
  • Pahikkala, T., Pyysalo, S., Boberg, J., Mylläri, A., and Salakoski, T. (2005b). Improving the performance of Bayesian and support vector classifiers in word sense disambiguation using positional information. In: Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05), pages 90–97.
  • Pahikkala, T., Pyysalo, S., Ginter, F., Boberg, J., Järvinen, J., and Salakoski, T. (2005c). Kernels incorporating word positional information in natural language disambiguation tasks. In: Proceedings of the Eighteenth International Florida Artificial Intelligence Research Society Conference (FLAIRS’05), pages 442–447.
  • Pahikkala, T., Tsivtsivadze, E., Airola, A., Boberg, J., and Salakoski, T. (2007). Learning to rank with pairwise regularized least-squares. In SIGIR’07 Workshop on Learning to Rank for Information Retrieval, pages 27–33.
  • Pahikkala, T., Tsivtsivadze, E., Boberg, J., and Salakoski, T. (2006). Graph kernels versus graph representations: A case study in parse ranking. In Proceedings of the ECML-PKDD’06 workshop on Mining and Learning with Graphs (MLG’06).
  • Palakal, M., Stephens, M., Mukhopadhyay, S., Raje, R., and Rhodes, S. (2002). A multi-level text mining method to extract biological relationships. In: Proceedings of the First IEEE Computer Society Bioinformatics Conference (CSB’02), pages 97–108.
  • Pallett, D. S., Garofolo, J. S., and Fiscus, J. G. (2000). Measurements in support of research accomplishments. Communications of the ACM, 43(2):75–79.
  • Park, J. C. (2001). Using combinatory categorial grammar to extract biomedical information. IEEE Intelligent Systems, 16(6):62–67.
  • Park, J. C., Kim, H. S., and Kim, J.-J. (2001). Bidirectional incremental parsing for automatic pathway identification with combinatory categorial grammar. In: Proceedings of the Pacific Symposium on Biocomputing (PSB’01).
  • Park, J. C. and Kim, J.-J. (2006). Named entity recognition. In Ananiadou, S. and McNaught, J., editors, Text Mining for Biology and Biomedicine, pages 121–142. Artech House.
  • Peri, S. et al. (2004). Human protein reference database as a discovery resource for proteomics. Nucleic Acids Research, 32(Suppl. 1):D497–501.
  • Petrov, S. and Klein, D. (2007). Improved inference for unlexicalized parsing. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL’07), pages 404–411.
  • Phuong, T. M., Lee, D., and Lee, K. H. (2003). Learning rules to extract protein interactions from biomedical text. In: Proceedings of the Seventh Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’03), pages 148–158.
  • Plake, C., Hakenberg, J., and Leser, U. (2005). Optimizing syntax patterns for discovering protein-protein interactions. In: Proceedings of the ACM Symposium on Applied Computing, pages 195–201.
  • Poggio, T. and Smale, S. (2003). The mathematics of learning: Dealing with data. Notices of the American Mathematical Society (AMS), 50(5).
  • Pollard, C. J. and Sag, I. A. (1994). Head-Driven Phrase Structure Grammar. University of Chicago Press.
  • Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(2).
  • Preiss, J. (2003). Using grammatical relations to compare parsers. In: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL’03), pages 291–298.
  • Proux, D., Rechenmann, F., and Juillard, L. (2000). A pragmatic information extraction strategy for gathering data on genetic interactions. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB’00), pages 279–285.
  • Proux, D., Rechenmann, F., Julliard, L., Pillet, V., and Jacq, B. (1998). Detecting gene symbols and names in biological texts: A first step toward pertinent information extraction. Genome Informatics, 9:72–80.
  • Pustejovsky, J., Castaño, J., Zhang, J., Kotecki, M., and Cochran, B. (2002). Robust relational parsing over biomedical literature: Extracting inhibit relations. In: Proceedings of the Pacific Symposium on Biocomputing (PSB’02), pages 362–373.
  • Pyysalo, S. (2003). Mining biomedical literature for protein-protein interactions using support vector machines. Master’s thesis, University of Oulu.
  • Pyysalo, S., Ginter, F., Pahikkala, T., Boberg, J., Järvinen, J., Salakoski, T., and Koivula, J. (2004). Analysis of link grammar on biomedical dependency corpus targeted at protein-protein interactions. In: Proceedings of the International Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA), pages 15–21.
  • Pyysalo, S., Sætre, R., Tsujii, J., and Salakoski, T. (2008). Why biomedical relation extraction results are incomparable and what to do about it. In Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM’08). To appear.
  • Radford, A. (2004). Minimalist Syntax: Exploring the Structure of English. Cambridge University Press.
  • Ramani, A. K., Bunescu, R. C., Mooney, R. J., and Marcotte, E. M. (2005). Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome. Genome Biology, 6(R40).
  • Ratnaparkhi, A. (1997). A linear observed time statistical parser based on maximum entropy models. In: Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP’97).
  • Rebholz-Schuhmann, D., Kirsch, H., and Couto, F. (2005). Facts from text – is text mining ready to deliver? PLoS Biology, 3(2):188–191.
  • Rebholz-Schuhmann, D., Marcel, S., Albert, S., Tolle, R., Casari, G., and Kirsch, H. (2004). Automatic extraction of mutations from Medline and cross-validation with OMIM. Nucleic Acids Research, 32(1):135–142.
  • Reynar, J. C. and Ratnaparkhi, A. (1997). A maximum entropy approach to identifying sentence boundaries. In: Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP’97), pages 16–19.
  • Rifkin, R., Yeo, G., and Poggio, T. (2003). Regularized least-squares classification. In Suykens, J., Horvath, G., Basu, S., Micchelli, C., and Vandewalle, J., editors, Advances in Learning Theory: Methods, Model and Applications, chapter 7, pages 131–154.
  • van Rijsbergen, C. J. (1979). Information Retrieval. Butterworth-Heinemann.
  • Rinaldi, F., Schneider, G., Kaljurand, K., Hess, M., and Romacker, M. (2006). An environment for relation mining over richly annotated corpora: The case of GENIA. In: Proceedings of the Second International Symposium on Semantic Mining in Biomedicine (SMBM’06).
  • Rindflesch, T., Rajan, J., and Hunter, L. (2000a). Extracting molecular binding relationships from biomedical text. In: Proceedings of the Applied Natural Language Processing Conference of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL’00), pages 188–195.
  • Rindflesch, T., Tanabe, L., Weinstein, J. N., and Hunter, L. (2000b). EDGAR: Extraction of drugs, genes and relations from the biomedical literature. In: Proceedings of the Pacific Symposium on Biocomputing (PSB’00), pages 514–525.
  • Rindflesch, T. C., Hunter, L., and Aronson, A. R. (1999). Mining molecular binding terminology from biomedical text. In: Proceedings of the AMIA Annual Symposium, pages 127–131.
  • Rosario, B. and Hearst, M. (2004). Classifying semantic relations in bioscience texts. In: Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), pages 430–437.
  • Rzhetsky, A., Iossifov, I., Koike, T., Krauthammer, M., Kra, P., Morris, M., Yu, H., Duboué, P. A., Weng, W., Wilbur, W. J., Hatzivassiloglou, V., and Friedman, C. (2004). GeneWays: A system for extracting, analyzing, visualizing, and integrating molecular pathway data. Journal of Biomedical Informatics, 37(1):43–53.
  • Sætre, R., Sagae, K., and Tsujii, J. (2007). Syntactic features for protein-protein interaction extraction. In: Proceedings of the Second International Symposium on Languages in Biology and Medicine (LBM’07).
  • Sagae, K., Miyao, Y., Matsuzaki, T., and Tsujii, J. (2008a). Challenges in mapping of syntactic representations for framework-independent parser evaluation. In: Proceedings of the ICGL’08 Workshop on Automated Syntactic Annotations for Interoperable Language Resources.
  • Sagae, K., Miyao, Y., Saetre, R., and Tsujii, J. (2008b). Evaluating the effects of treebank size in a practical application for parsing. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 14–20.
  • Sagae, K. and Tsujii, J. (2007). Dependency parsing and domain adaptation with LR models and parser ensembles. In: Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL’07, pages 1044–1050.
  • Samuelsson, C. and Voutilainen, A. (1997). Comparing a linguistic and a stochastic tagger. In: Proceedings of the 35th Annual Meeting of the Association of Computational Linguistics (ACL’97), pages 246–253.
  • Sanchez-Graillet, O. and Poesio, M. (2007). Negation of protein protein interactions: analysis and extraction. Bioinformatics, 23(13):i424–432.
  • Schneider, G. (2007). Hybrid Long-Distance Functional Dependency Parsing. PhD thesis, University of Zurich.
  • Sekimizu, T., Park, H. S., and Tsujii, J. (1998). Identifying the interaction between genes and gene products based on frequently seen verbs in MEDLINE abstracts. Genome Informatics, 9:62–71.
  • Sekine, S. (1997). The domain dependence of parsing. In: Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP’97), pages 96–102.
  • Shen, L., Satta, G., and Joshi, A. (2007). Guided learning for bidirectional sequence classification. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL’07), pages 760–767.
  • Sleator, D. D. and Temperley, D. (1991). Parsing English with a Link Grammar. Technical Report CMU-CS-91-196, Carnegie Mellon University.
  • Smith, B., Ceusters, W., Klagges, B., Kohler, J., Kumar, A., Lomax, J., Mungall, C., Neuhaus, F., Rector, A., and Rosse, C. (2005). Relations in biomedical ontologies. Genome Biology, 6(5):R46.
  • Smith, L., Rindflesch, T., and Wilbur, W. J. (2004). MedPost: A part-of-speech tagger for biomedical text. Bioinformatics, 20(14):2320–2321.
  • Spasic, I., Ananiadou, S., McNaught, J., and Kumar, A. (2005). Text mining and ontologies in biomedicine: Making sense of raw text. Briefings in Bioinformatics, 6(3):239–251.
  • Stapley, B. and Benoit, G. (2000). Biobibliometrics: Information retrieval and visualization from co-occurrences of gene names in MEDLINE abstracts. In: Proceedings of the Pacific Symposium on Biocomputing (PSB’00), pages 529–540.
  • Stelzl, U., Worm, U., Lalowski, M., Haenig, C., Brembeck, F. H., Goehler, H., Stroedicke, M., Zenkner, M., Schoenherr, A., Koeppen, S., Timm, J., Mintzlaff, S., Abraham, C., Bock, N., Kietzmann, S., Goedde, A., Toksoz, E., Droege, A., Krobitsch, S., Korn, B., Birchmeier, W., Lehrach, H., and Wanker, E. E. (2005). A human protein-protein interaction network: A resource for annotating the proteome. Cell, 122:957–968.
  • Stephens, M., Palakal, M., Mukhopadhyay, S., Raje, R., and Mostafa, J. (2001). Detecting gene relations from MEDLINE abstracts. In: Proceedings of the Pacific Symposium on Biocomputing (PSB’01), pages 483–495.
  • Suber, P. (2002). Open access to the scientific journal literature. Journal of Biology, 1(3).
  • Sun, C., Lin, L., Wang, X., and Guan, Y. (2007). Using maximum entropy model to extract protein-protein interaction information from biomedical literature. In: Proceedings of the Third International Conference on Intelligent Computing (ICIC’07), pages 730–737.
  • Sundheim, B. and Chinchor, N. (1995). Named entity task definition. In Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 319–332.
  • Sundheim, B. M. (1995). Overview of results of the MUC-6 evaluation. In Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 13–31.
  • Swanson, D. R. (1986). Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 20:7–18.
  • Swanson, D. R. (1988). Migraine and magnesium: Eleven neglected connections. Perspectives in Biology and Medicine, 31(4):526–557.
  • Szolovits, P. (2003). Adding a medical lexicon to an English parser. In Proceedings of the 2003 AMIA Annual Symposium, pages 639–643.
  • Tanabe, L., Xie, N., Thom, L. H., Matten, W., and Wilbur, W. J. (2005). GENETAG: A tagged corpus for gene/protein named entity recognition. BMC Bioinformatics, 6(Suppl. 1):S3.
  • Tapanainen, P. and Järvinen, T. (1997). A non-projective dependency parser. In: Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP’97), pages 64–71.
  • Tateisi, Y., Yakushiji, A., Ohta, T., and Tsujii, J. (2005). Syntax annotation for the GENIA corpus. In: Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP’05), pages 222–227.
  • Temkin, J. M. and Gilder, M. R. (2003). Extraction of protein interaction information from unstructured text using a context-free grammar. Bioinformatics, 19(16):2046–2053.
  • Tesnière, L. (1959). Éléments de Syntaxe Structurale. Klincksieck.
  • Thomas, J., Milward, D., Ouzounis, C., Pulman, S., and Carroll, M. (2000). Automatic extraction of protein interactions from scientific abstracts. In Proceedings of the Pacific Symposium on Biocomputing (PSB’00), pages 538–549, Honolulu, HI.
  • Tomanek, K., Wermter, J., and Hahn, U. (2007). A reappraisal of sentence and token splitting for life science documents. In: Proceedings of the 12th International Medical Informatics Congress (MedInfo’07).
  • Torii, M., Kamboj, S., and Vijay-Shanker, K. (2003). An investigation of various information sources for classifying biological names. In: Proceedings of the ACL’03 Workshop on Natural Language Processing in the Biomedical Domain (BioNLP’03), pages 113–120.
  • Torii, M., Kamboj, S., and Vijay-Shanker, K. (2004). Using name-internal and contextual features to classify biological terms. Journal of Biomedical Informatics, 37:498–511.
  • Tsai, R. T.-H., Wu, S.-H., Chou, W.-C., Lin, Y.-C., He, D., Hsian, J., Sung, T.-Y., and Hsu, W.-L. (2006). Various criteria in the evaluation of biomedical named entity recognition. BMC Bioinformatics, 7:92.
  • Tsivtsivadze, E., Pahikkala, T., Airola, A., Boberg, J., and Salakoski, T. (2008). A sparse regularized least-squares preference learning algorithm. In: Proceedings of the Tenth Scandinavian Conference on Artificial Intelligence (SCAI’08). To appear.
  • Tsivtsivadze, E., Pahikkala, T., Boberg, J., and Salakoski, T. (2007). Locality kernels for sequential data and their applications to parse ranking. Applied Intelligence. To appear.
  • Tsivtsivadze, E., Pahikkala, T., Pyysalo, S., Boberg, J., Mylläri, A., and Salakoski, T. (2005). Regularized least-squares for parse ranking. In: Proceedings of the Sixth International Symposium on Intelligent Data Analysis (IDA’05), Madrid, Spain, pages 464–474.
  • Yoshimasa Tsuruoka, Tateishi, Y., Jin-Dong Kim, Ohta, T., McNaught, J., Ananiadou, S., and Tsujii, J. (2005). Developing a robust part-of-speech tagger for biomedical text. In: Proceedings of the Panhellenic Conference on Informatics, pages 382–392.
  • Turmo, J., Ageno, A., and Català, N. (2006). Adaptive information extraction. ACM Computing Surveys, 38(2):4.
  • Van Landeghem, S., Saeys, Y., De Baets, B., and Van de Peer, Y. (2008). Extracting protein-protein interactions from text using rich feature vectors and feature selection. In: Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM’08). To appear.
  • Venter, C. J. et al. (2001). The sequence of the human genome. Science, 291(5507):1304–1351.
  • Voutilainen, A. (1995). A syntax-based part-of-speech analyser. In: Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics (EACL’95), pages 157–164.
  • Voutilainen, A. (1997). Designing a (finite-state) parsing grammar. In Roche, E. and Schabes, Y., editors, Finite-State Language Processing, pages 283–303. MIT Press.
  • Voutilainen, A., Heikkilä, J., and Anttila, A. (1992). Constraint grammar of English: A performance-oriented introduction. University of Helsinki, Department of General Linguistics.
  • Wattarujeekrit, T., Shah, P., and Collier, N. (2004). PASBio: Predicate-argument structures for event extraction in molecular biology. BMC Bioinformatics, 5(1):155.
  • Wilbur, J., Smith, L., and Tanabe, L. (2007). BioCreative 2 gene mention task. In: Proceedings of the Second BioCreative Challenge Evaluation, pages 7–16.
  • Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics, 1:80–83.
  • Xenarios, I., Rice, D. W., Salwinski, L., Baron, M. K., Marcotte, E. M., and Eisenberg, D. (2000). DIP: The Database of Interacting Proteins. Nucleic Acids Research, 28(1):289–291.
  • Xiao, J., Su, J., Zhou, G., and Tan, C. (2005). Protein-protein interaction extraction: A supervised learning approach. In: Proceedings of the First International Symposium on Semantic Mining in Biomedicine (SMBM’05), pages 51–59, Hinxton, UK.
  • Xuan, W., Watson, S. J., and Meng, F. (2007). Tagging sentence boundaries in biomedical literature. In: Proceedings of the Eighth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’07), pages 186–195.
  • Yakushiji, A. (2006). Relation Information Extraction Using Deep Syntactic Analysis. PhD thesis, University of Tokyo.
  • Yakushiji, A., Miyao, Y., Ohta, T., Tateisi, Y., and Tsujii, J. (2006). Automatic construction of predicate-argument structure patterns for biomedical information extraction. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’06), pages 284–292.
  • Yakushiji, A., Miyao, Y., Tateisi, Y., and Tsujii, J. (2005). Biomedical information extraction with predicate-argument structure patterns. In Proceedings of the First International Symposium on Semantic Mining in Biomedicine (SMBM’05), pages 60–69.
  • Yakushiji, A., Tateisi, Y., Miyao, Y., and Tsujii, J. (2001). Event extraction from biomedical papers using a full parser. In: Proceedings of the Pacific Symposium on Biocomputing (PSB’01), pages 408–419.
  • Yang, Z., Lin, H., and Wu, B. (2007). BioPPIExtractor: A protein-protein interaction extraction system for biomedical literature. Expert Systems with Applications.
  • Yeh, A., Morgan, A., Colosimo, M., and Hirschman, L. (2005). BioCreAtIvE task 1A: Gene mention finding evaluation. BMC Bioinformatics, 6(Suppl. 1):S2.
  • Zanzoni, A., Montecchi-Palazzi, L., Quondam, M., Ausiello, G., Helmer-Citterich, M., and Cesareni, G. (2002). Mint: A molecular interaction database. FEBS Letters, 513(1):135–140.
  • Zelenko, D., Aone, C., and Richardella, A. (2003). Kernel methods for relation extraction. Journal of Machine Learning Research, 3:1083–1106.
  • Zhang, M., Zhang, J., and Su, J. (2006). Exploring syntactic features for relation extraction using a convolution tree kernel. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL’06), pages 288–295.
  • Zhou, D., He, Y., and Kwoh, C. K. (2006). Extracting protein-protein interactions from the literature using the hidden vector state model. In Proceedings of the Second International Workshop on Bioinformatics Research and Applications (IWBRA’06), pages 718–725.
  • Zhou, G., Shen, D., Zhang, J., Su, J., and Tan, S. (2005). Recognition of protein/gene names from text using an ensemble of classifiers. BMC Bioinformatics, 6(Suppl. 1):S7.
  • Zhou, G. and Su, J. (2004). Exploring deep knowledge resources in biomedical name recognition. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA), pages 96–99.
  • Zweigenbaum, P., Demner-Fushman, D., Yu, H., and Cohen, K. B. (2007). Frontiers of biomedical text mining: Current progress. Briefings in Bioinformatics.


2008 ADependencyParsApprToBiomedTMining
Author: Sampo Pyysalo
Title: A Dependency Parsing Approach to Biomedical Text Mining
Title URL: https://oa.doria.fi/bitstream/handle/10024/39934/pyysalo-phdthesis2008.pdf
Year: 2008