2004 ExactVsApproxProteinNER

Subject Headings: NER - Protein, ProMiner

Notes

We present a simple and efficient tool for exact matching of terms in a synonymlist against MEDLINE abstracts. It does not recognize spellings of a synonym which are not in the synonymlist and does not consider context for matching. Its main application is to test different synonymlists and to evaluate different kinds of expansions of synonymlists performed during curation. This tool allows us to rapidly evaluate modifications of synonyms and enables us to build high-quality synonymlists. These can then also be used as a prerequisite for text mining with other text-mining programs. Additionally, we used a simple post filter in order to improve the specificity of our results. Our goal in participating in the BioCreative contest was to assess the sensitivity and specificity that can be achieved with extensively curated synonymlists and basically naive exact string matching, and to assess the difference from more sophisticated text-mining approaches. We participated as group 24 in Task 1b for yeast and mouse. We did not submit results for fly because of the significant overlap of fly protein names with common English words, for which our approach is not adapted. Our mouse synonymlist was also used by group 16 with a more sophisticated search algorithm implemented in the tool ProMiner [1, 2]. This contest allows us to compare the two approaches on a blind prediction basis and for an independent test set.
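The exact-matching idea described above can be illustrated with a minimal Python sketch. This is not the authors' tool: the function names, the identifiers `ID_A` and `ID_B`, and the word-boundary rule are assumptions made for illustration, and the post filter mentioned in the abstract is not modeled because the text gives no details about it.

```python
import re
from typing import Dict, List, Pattern, Tuple


def build_matcher(synonyms: Dict[str, List[str]]) -> Tuple[Pattern[str], Dict[str, str]]:
    """Compile one case-sensitive alternation over every synonym.

    Assumption: each synonym maps to a single identifier; a real list
    would need to handle synonyms shared by several proteins.
    """
    term_to_id: Dict[str, str] = {}
    for protein_id, terms in synonyms.items():
        for term in terms:
            term_to_id[term] = protein_id
    # Longest terms first, so a longer synonym wins over one it contains.
    ordered = sorted(term_to_id, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # Assumed boundary rule: a match may not touch a letter or digit.
    pattern = re.compile(r"(?<![A-Za-z0-9])(?:" + alternation + r")(?![A-Za-z0-9])")
    return pattern, term_to_id


def find_mentions(text: str, pattern: Pattern[str],
                  term_to_id: Dict[str, str]) -> List[Tuple[int, str, str]]:
    """Return (offset, matched term, identifier) for each exact match."""
    return [(m.start(), m.group(0), term_to_id[m.group(0)])
            for m in pattern.finditer(text)]


# Hypothetical two-entry synonym list and example sentence.
synonyms = {"ID_A": ["CDC28", "Cdc28p"], "ID_B": ["CLN3"]}
pattern, term_to_id = build_matcher(synonyms)
mentions = find_mentions("Cdc28p binds CLN3, but CDC28x is not matched.",
                         pattern, term_to_id)
```

Because matching is exact, a spelling that is absent from the list (here `CDC28x`) is never reported, which is why the quality of the synonym list determines the sensitivity of this kind of approach.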

With our system we showed that it is possible to achieve good performance in protein name recognition with exact text matching. Our system does not need to be adapted for a specific synonymlist in terms of parameter tuning or internal lists. This allows for straightforward application.
It is crucial for our attempt to use synonymlists which are as complete and correct as possible. Therefore, we used a system for the extensive curation of protein synonym lists. This curation is largely independent of the synonymlist to be curated as the curation steps are of general character. Nevertheless, the system can be adapted easily to cover specific problems of synonymlists, like missing synonyms which are frequently used in texts but are present in the synonymlist only with slight differences in spelling. One disadvantage of the extensive curation is the fact that the synonym lists become very large as they need to cover all possible different spellings of a protein name. In order to avoid this, one could consider making the text search more flexible, e.g. by including certain equivalent expressions directly in the search tool.
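To make the size problem concrete, the following sketch expands one synonym into spelling variants. The rule used here (treating a hyphen, a space, and no separator as interchangeable between name parts) is a hypothetical example of a curation step, not one documented in the text, and `spelling_variants` is an illustrative name.

```python
import re
from itertools import product
from typing import Set

# Assumed set of interchangeable separators between name parts.
SEPARATORS = ("", " ", "-")

# Split on hyphens/whitespace and at letter-digit boundaries
# (splitting on empty matches requires Python 3.7 or later).
PART_BOUNDARY = re.compile(r"[-\s]+|(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])")


def spelling_variants(term: str) -> Set[str]:
    """Return every spelling obtained by varying the separators."""
    parts = [p for p in PART_BOUNDARY.split(term) if p]
    if len(parts) < 2:
        return {term}
    variants = set()
    # One separator choice per gap: 3 ** (number of gaps) combinations.
    for seps in product(SEPARATORS, repeat=len(parts) - 1):
        pieces = [parts[0]]
        for sep, part in zip(seps, parts[1:]):
            pieces.append(sep + part)
        variants.add("".join(pieces))
    return variants
```

Under this assumed rule a two-part name such as `IL-2` yields three spellings and a three-part name yields nine, so the list grows exponentially with the number of parts. Applying the same equivalence inside the search tool would keep the list at one entry per name.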

D. Hanisch, K. Fundel, H.T. Mevissen, R. Zimmer, and J. Fluck. (2004). “ProMiner: Organism-specific protein name detection using approximate string matching.” In: Proceedings of the BioCreative Challenge Evaluation Workshop 2004.
D. Hanisch, J. Fluck, H.T. Mevissen, and R. Zimmer. (2003). “Playing Biology's Name Game: Identifying protein names in scientific text.” In: Pacific Symposium on Biocomputing, 8.


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2004 ExactVsApproxProteinNER	Katrin Fundel, Ralf Zimmer, Daniel Guttler, and Joannis Apostolakis			Exact versus approximate string matching for protein name identification		Proceedings of BioCreative 2004 Workshop	http://www.pdg.cnb.uam.es/BioLink/workshop BioCreative 04/handout/pdf/BioCreative WorkshopPaper 040301.pdf			2004