2004 ExtractingProteinFunctionInfoFromMEDLINE

(Daraselia et al., 2004) ⇒ Nikolai Daraselia, Sergei Egorov, Andrey Yazhuk, Svetlana Novichkova, Anton Yuryev, Ilya Mazo. (2004). “Extracting Protein Function Information from MEDLINE Using a Full-Sentence Parser.” In: Second European Workshop on Data Mining and Text Mining for Bioinformatics, pages 11-18.

Subject Headings: NER - Protein, Relation Recognition, GenBank.

Notes

Cited By

Rinaldi et al., 2006

Quotes

Abstract

Identification of proteins and other domain specific concepts

Our approach for protein identification utilizes a semiautomatically curated protein name dictionary, which was compiled from the LocusLink database and enriched with protein names, aliases, descriptions and gene names from the linked GenBank, GoldenPath, and HUGO database entries. The resulting collection of protein “descriptors” contained, along with correct protein names, functional keywords (e.g., “kinase”), clone names, and some completely irrelevant contaminant words and phrases. To improve the quality of this collection, the occurrence of each potential protein name in the 2003 MEDLINE release was determined by the method described below, and erroneous names were manually removed from the top 20,000 entries sorted by occurrence.

The rest of the entries were automatically processed in order to:

Remove records containing a single word with a character length of 1 or 2 (e.g., ‘A’, ‘C’, ‘AS’)
Remove entries with length 3 or 4 not containing at least one digit (e.g., ‘AHH,’ ‘ATDC’)
Remove purely numerical entries (e.g., ‘3742643’)
Remove entries consisting only of measures (e.g., ‘23 kDa protein’)

The resulting protein name dictionary consists of 245,248 records describing 81,915 unique proteins each assigned a LocusLink identifier.
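
The four automatic filters listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `keep_entry` and the regular expression used to recognize "measure only" entries are assumptions, since the source gives only one example of a measure (‘23 kDa protein’).

```python
import re

# Assumed pattern for entries that consist only of a measure, e.g. '23 kDa protein'.
# The source does not define what counts as a measure; this covers kDa only.
MEASURE_RE = re.compile(r"^\d+(\.\d+)?\s*kda(\s+protein)?$", re.IGNORECASE)


def keep_entry(name: str) -> bool:
    """Return True if a candidate protein name survives the four filters."""
    text = name.strip()
    if not text:
        return False
    words = text.split()
    # Filter 1: a single word of character length 1 or 2 (e.g. 'A', 'AS').
    if len(words) == 1 and len(text) <= 2:
        return False
    # Filter 2: length 3 or 4 without at least one digit (e.g. 'AHH', 'ATDC').
    if len(text) in (3, 4) and not any(c.isdigit() for c in text):
        return False
    # Filter 3: purely numerical entries (e.g. '3742643').
    if text.isdigit():
        return False
    # Filter 4: entries consisting only of a measure (assumed pattern above).
    if MEASURE_RE.match(text):
        return False
    return True


candidates = ["A", "AS", "AHH", "ATDC", "3742643", "23 kDa protein", "p53", "insulin receptor"]
kept = [c for c in candidates if keep_entry(c)]
```

With the sample candidates, only the last two entries pass every filter; short names such as ‘p53’ survive because they contain a digit.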

In order to ignore variations in protein name spelling, we use a single specialized tokenization process for target text and dictionary entries. Tokenization converts the input text into a sequence of tokens; tokens are made from the longest sequences of characters belonging to the same class. The preprocessor considers each punctuation character as belonging to a separate class. All letters belong to the alphabetical class, and all digits to the numerical class. White space is treated as a token separator and is not considered a token. Numerical and punctuation sequences are converted into tokens with no special processing. Alphabetical sequences are first converted to lower case and then searched for prefixes and suffixes made of the English spellings of Greek letters (e.g., ‘alpha,’ ‘beta,’ ‘gamma,’ etc.). If such prefixes or suffixes are identified, they are stripped off and treated as separate tokens. The described tokenization procedure is followed by a simple and efficient subsequence search applied to token sequences.
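
The tokenization described above might look like the following sketch. It rests on several assumptions: the Greek-letter list is a small illustrative subset, the names `char_class`, `split_greek`, and `tokenize` are hypothetical, and "each punctuation character is a separate class" is read as "a run of the same punctuation character forms one token".

```python
from itertools import groupby

# Illustrative subset of Greek letter names; the source does not give the full list.
GREEK = ("alpha", "beta", "gamma", "delta", "epsilon", "kappa")


def char_class(ch: str) -> str:
    """Map a character to its class; each punctuation character is its own class."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "num"
    if ch.isspace():
        return "space"
    return ch


def split_greek(word: str) -> list:
    """Strip Greek-letter prefixes and suffixes off a lower-cased word."""
    for g in GREEK:
        if word != g and word.startswith(g):
            return [g] + split_greek(word[len(g):])
    for g in GREEK:
        if word != g and word.endswith(g):
            return split_greek(word[:-len(g)]) + [g]
    return [word]


def tokenize(text: str) -> list:
    """Split text into maximal same-class runs, dropping white space."""
    tokens = []
    for cls, run in groupby(text, key=char_class):
        chunk = "".join(run)
        if cls == "space":
            continue  # white space only separates tokens
        if cls == "alpha":
            tokens.extend(split_greek(chunk.lower()))
        else:
            tokens.append(chunk)  # numbers and punctuation are kept unchanged
    return tokens


print(tokenize("TNFalpha"))   # ['tnf', 'alpha']
print(tokenize("NF-kappaB"))  # ['nf', '-', 'kappa', 'b']
```

Because dictionary entries and abstracts pass through the same function, spelling variants such as ‘TNFalpha’ and ‘TNF alpha’ yield the same token sequence. A naive prefix check like this one would also split ordinary words such as ‘alphabet’; how the original system avoids that is not stated in the source.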

Protein names are identified by a variation of a string search algorithm. To implement the original idea of “relaxed” protein name matching, MEDLINE abstracts and dictionary entries are processed by the same tokenizer, and tokens belonging to the small list of “excluded” words are ignored. Next, a sentence is scanned for the presence of uninterrupted token sequences, consisting only of tokens derived from the dictionary processing.

When the token sequence is assembled, it is required that the corresponding original character sequence is not immediately preceded by a word or a number with no separating white space and does not end in a word or a number not immediately followed by a punctuation mark (,;.?). Next, each token sequence is passed through a validation step to check if it satisfies the following constraints:

Comma (,) is not allowed as first or last token
Comma is allowed between single quote (') and a number
Comma is allowed between a word (alphabetical token) of character length > 1 and "a"
Comma is allowed between two alphabetical tokens/numbers the second of which is not "a"
Comma is not allowed in other cases
Slash (/) is only allowed between + and –
Period (.) should not be followed by white space
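
As a rough illustration, the comma constraints in the list above could be checked as in the sketch below. It assumes that "between X and Y" means the token immediately before the comma is X and the token immediately after it is Y, which the source does not spell out. The slash and period rules are omitted because they depend on details (the original character positions) not modelled here. The function name `commas_ok` is hypothetical.

```python
def is_word(tok: str) -> bool:
    """Alphabetical token."""
    return tok.isalpha()


def is_number(tok: str) -> bool:
    """Numerical token."""
    return tok.isdigit()


def commas_ok(tokens: list) -> bool:
    """Check every comma token against the comma constraints (assumed reading)."""
    for i, tok in enumerate(tokens):
        if tok != ",":
            continue
        # A comma is not allowed as the first or last token.
        if i == 0 or i == len(tokens) - 1:
            return False
        prev, nxt = tokens[i - 1], tokens[i + 1]
        # Allowed between a single quote and a number.
        if prev == "'" and is_number(nxt):
            continue
        # Allowed between a word longer than one character and "a".
        if is_word(prev) and len(prev) > 1 and nxt == "a":
            continue
        # Allowed between two words/numbers when the second is not "a".
        if (is_word(prev) or is_number(prev)) and (is_word(nxt) or is_number(nxt)) and nxt != "a":
            continue
        # Not allowed in any other case.
        return False
    return True
```

For example, `commas_ok(["receptor", ",", "beta"])` is accepted under this reading, while a sequence that starts or ends with a comma is rejected.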

Each qualifying token sequence is searched for the presence of dictionary entries by trying all of its subsequences from long to short and from left to right. If the subsequence lookup in the dictionary results in positive identification, the subsequence is marked up with a corresponding ID and the rest of the tokens to its right are searched for more matches.
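
One way to read the lookup described above is sketched below: try subsequences from longest to shortest and, within each length, from left to right; on a hit, record the match and continue with the tokens to its right. The dictionary layout (token tuples mapped to identifiers), the identifiers themselves, and the function name `find_matches` are illustrative assumptions.

```python
def find_matches(tokens: list, dictionary: dict) -> list:
    """Return (start, end, id) triples for dictionary entries found in tokens."""
    matches = []
    offset = 0  # only tokens at or after this index are still searched
    while offset < len(tokens):
        hit = None
        remaining = len(tokens) - offset
        # Long to short ...
        for length in range(remaining, 0, -1):
            # ... and left to right within each length.
            for start in range(offset, len(tokens) - length + 1):
                key = tuple(tokens[start:start + length])
                if key in dictionary:
                    hit = (start, start + length, dictionary[key])
                    break
            if hit:
                break
        if hit is None:
            break
        matches.append(hit)
        offset = hit[1]  # continue with the tokens to the right of the match
    return matches


# Hypothetical dictionary: token tuples mapped to made-up identifiers.
demo_dict = {("insulin", "receptor"): "ID1", ("insulin",): "ID2", ("akt",): "ID3"}
demo = find_matches(["insulin", "receptor", "activates", "akt"], demo_dict)
print(demo)  # [(0, 2, 'ID1'), (3, 4, 'ID3')]
```

Trying longer subsequences first means ‘insulin receptor’ is preferred over the shorter entry ‘insulin’ when both are present in the dictionary.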

Our approach to the identification of chemical names is somewhat different: because of the large size of chemical dictionaries, we have chosen not to use a formal mapping of a chemical name to an ID in any formal nomenclature and simply mark up the name of a chemical substance in the text. Therefore, we employ a simplified matching algorithm that disregards rules of chemical nomenclature and instead utilizes a list of chemical “root words” that was created using the 2001 edition of the Unified Medical Language System (UMLS) Metathesaurus. A total of 309,160 UMLS concept strings with semantic types “Organic chemical” and “Pharmacologic substance” were tokenized by our preprocessor module. Numbers and punctuation marks were removed, and the resulting list of approximately 77,000 nonredundant alphabetical tokens was used for chemical name tagging. The algorithm tokenizes the input sentence using the same procedure as for protein name recognition and the resulting sequence of tokens is searched for the longest subsequence satisfying the following criteria:

It starts with a numerical token, a Greek letter spelled out in English, or an alphabetical token belonging to the list of chemical name constituents (the sum of “adjective” and “noun” morphemes).
It contains numerical tokens, Greek letters, alphabetical tokens belonging to the list of chemical name constituents, and special punctuation symbols: comma, single quote, plus, minus, and round parentheses.
If it contains parentheses, they should be balanced.
A comma should be surrounded by two numerical tokens.
A single quote should follow a numerical token.
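
The five criteria above can be sketched as a brute-force search for the longest valid token span. This is an illustration under stated assumptions, not the authors' implementation: the Greek-letter set is a small subset, `roots` stands in for the list of roughly 77,000 chemical root words, and the names `valid_span` and `longest_chemical_span` are hypothetical.

```python
GREEK = {"alpha", "beta", "gamma", "delta"}  # illustrative subset
PUNCT = {",", "'", "+", "-", "(", ")"}       # punctuation allowed inside a name


def is_core(tok: str, roots: set) -> bool:
    """Numerical token, spelled-out Greek letter, or chemical root word."""
    return tok.isdigit() or tok in GREEK or tok in roots


def valid_span(span: list, roots: set) -> bool:
    """Check a token span against the five criteria."""
    if not span or not is_core(span[0], roots):
        return False
    depth = 0
    for i, tok in enumerate(span):
        if not (is_core(tok, roots) or tok in PUNCT):
            return False
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                return False  # closing parenthesis without an opening one
        elif tok == ",":
            # A comma must sit between two numerical tokens.
            if i + 1 >= len(span) or not span[i - 1].isdigit() or not span[i + 1].isdigit():
                return False
        elif tok == "'":
            # A single quote must follow a numerical token.
            if not span[i - 1].isdigit():
                return False
    return depth == 0  # parentheses must be balanced


def longest_chemical_span(tokens: list, roots: set):
    """Return (start, end) of the longest valid span, or None if there is none."""
    for length in range(len(tokens), 0, -1):
        for start in range(len(tokens) - length + 1):
            if valid_span(tokens[start:start + length], roots):
                return start, start + length
    return None


roots = {"dimethylbenzanthracene"}  # hypothetical one-entry root list
sentence = ["treated", "with", "7", ",", "12", "-", "dimethylbenzanthracene", "daily"]
span = longest_chemical_span(sentence, roots)
```

For the sample sentence, `span` covers the tokens from ‘7’ through ‘dimethylbenzanthracene’. The exhaustive search is cubic in the number of tokens, which is acceptable for single sentences; the source does not say how the original system performs the search.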

Identification of other domain-specific concepts (cellular objects, complexes, and cellular processes) is based on a separate dictionary of such entities and is done in a manner identical to protein identification.



	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2004 ExtractingProteinFunctionInfoFromMEDLINE	Nikolai Daraselia Sergei Egorov Andrey Yazhuk Svetlana Novichkova Anton Yuryev Ilya Mazo			Extracting Protein Function Information from MEDLINE Using a Full-Sentence Parser			http://www2.informatik.hu-berlin.de/Forschung Lehre/wm/ws04/2.pdf