2004 AnEntityTagger


Subject Headings: Gene Named Entity Recognition Algorithm, Named Entity Recognition Algorithm, Machine Learning, Conditional Random Field





VTag is an application for identifying the type, genomic location and genomic state-change of acquired genomic aberrations described in text. The application uses a machine learning technique called conditional random fields. VTag was tested with 345 training and 200 evaluation documents pertaining to cancer genetics. Our experiments resulted in 0.8541 precision, 0.7870 recall and 0.8192 F-measure on the evaluation set.

The software is available at http://www.cis.upenn.edu/group/datamining/software_dist/biosfier/.
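The reported F-measure is consistent with the balanced harmonic mean of precision and recall. The check below is a minimal sketch that assumes the usual F1 definition; the source does not state which F-measure variant was used, and the function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the VTag evaluation set.
print(round(f_measure(0.8541, 0.7870), 4))  # prints 0.8192
```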


The core of VTag is a probability model called Conditional Random Fields (CRFs) (Lafferty et al., 2001). These models are convenient because they allow us to combine the effects of many potentially informative features and have previously been successfully used for other biomedical named entity taggers (McDonald and Pereira, 2004). CRFs model the conditional probability of a tag sequence given an observation sequence: P(T|O), where O is an observation sequence, in our case a sequence of tokens in the abstract, and T=t1,t2,...,tn is a corresponding tag sequence in which each tag labels the corresponding token with one of TYPE, LOCATION, INITIAL-STATE, ALTERED-STATE and OTHER. CRFs are log-linear models based on a set of feature functions, fi(tj,tj-1,O), that map predicates on observation/tag-transition pairs to binary values. Each feature has an associated weight, λi, that measures its effect on the overall choice of tags. …
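To make the log-linear form concrete, the sketch below scores a tag sequence as a weighted sum of binary feature functions and normalizes over all possible tag sequences to obtain P(T|O). It is a toy illustration, not VTag's implementation: the feature functions, their weights, and the example tokens are all hypothetical, the token position j is passed explicitly as an extra argument, and the normalizer is computed by brute-force enumeration, which is only feasible for very short sequences.

```python
import itertools
import math

TAGS = ["TYPE", "LOCATION", "INITIAL-STATE", "ALTERED-STATE", "OTHER"]


# Hypothetical binary feature functions f_i(t_j, t_{j-1}, O); the position j
# is passed explicitly so each function can inspect the current token.
def f_mutation_is_type(tag, prev_tag, tokens, j):
    return int(tokens[j].lower() == "mutation" and tag == "TYPE")


def f_codon_is_location(tag, prev_tag, tokens, j):
    return int(tokens[j].lower() == "codon" and tag == "LOCATION")


def f_number_continues_location(tag, prev_tag, tokens, j):
    return int(tokens[j].isdigit() and tag == "LOCATION" and prev_tag == "LOCATION")


def f_other(tag, prev_tag, tokens, j):
    return int(tag == "OTHER")


# (feature, weight) pairs; these weights are made up, not learned from data.
WEIGHTED_FEATURES = [
    (f_mutation_is_type, 2.0),
    (f_codon_is_location, 2.0),
    (f_number_continues_location, 1.5),
    (f_other, 0.5),
]


def sequence_score(tags, tokens):
    """Sum of weight * feature value over all positions and features."""
    total = 0.0
    for j, tag in enumerate(tags):
        prev_tag = tags[j - 1] if j > 0 else None
        total += sum(w * f(tag, prev_tag, tokens, j) for f, w in WEIGHTED_FEATURES)
    return total


def conditional_probability(tags, tokens):
    """P(T|O): exponentiated score divided by the sum over all tag sequences."""
    z = sum(
        math.exp(sequence_score(candidate, tokens))
        for candidate in itertools.product(TAGS, repeat=len(tokens))
    )
    return math.exp(sequence_score(tags, tokens)) / z


tokens = ["mutation", "at", "codon", "12"]
gold = ("TYPE", "OTHER", "LOCATION", "LOCATION")
print(conditional_probability(gold, tokens))
```

In practice the normalizer is computed with dynamic programming rather than enumeration, and the weights are estimated from annotated training data.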

Given a trained model, the optimal tag sequence for new examples is found with the Viterbi algorithm (Rabiner, 1993).
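The decoding step can be sketched as follows. This is a minimal Viterbi implementation under stated assumptions, not the code used in VTag: `local_score` stands in for the weighted feature sum at one position, and `toy_score`, its weights, and the example tokens are hypothetical.

```python
TAGS = ["TYPE", "LOCATION", "INITIAL-STATE", "ALTERED-STATE", "OTHER"]


def toy_score(tag, prev_tag, tokens, j):
    """Hypothetical weighted feature sum for tagging token j (weights made up)."""
    word = tokens[j].lower()
    score = 0.0
    if tag == "OTHER":
        score += 0.5
    if word == "mutation" and tag == "TYPE":
        score += 2.0
    if word == "codon" and tag == "LOCATION":
        score += 2.0
    if word.isdigit() and tag == "LOCATION" and prev_tag == "LOCATION":
        score += 1.5
    return score


def viterbi(tokens, tags, local_score):
    """Return the highest-scoring tag sequence by dynamic programming."""
    if not tokens:
        return []
    # best[j][t]: best total score of any tag prefix ending with tag t at position j.
    best = [{t: local_score(t, None, tokens, 0) for t in tags}]
    back = []  # back[j - 1][t]: best previous tag when position j carries tag t
    for j in range(1, len(tokens)):
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + local_score(t, p, tokens, j))
            scores[t] = best[-1][prev] + local_score(t, prev, tokens, j)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


print(viterbi(["mutation", "at", "codon", "12"], TAGS, toy_score))
# prints ['TYPE', 'OTHER', 'LOCATION', 'LOCATION']
```

Because each score depends only on the current and previous tag, the search costs time proportional to the sequence length times the square of the number of tags, instead of enumerating every tag sequence.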

. …


  • Collier, N., Nobata, C. and Tsujii, J. (2000) Extracting the names of genes and gene products with a hidden Markov model. In: Proceedings of the 18th International Conference on Computational Linguistics (COLING’2000), Saarbrucken, Germany, pp. 201–207.
  • (Kulick et al., 2003) ⇒ S. Kulick, M. Liberman, Martha Palmer, and A. Schein. (2003). “Shallow Semantic Annotations of Biomedical Corpora for Information Extraction.” In: Proceedings of the Third Meeting of the Special Interest Group on Text Mining at ISMB 2003.
  • J. Lafferty, A. McCallum, and F. Pereira. (2001) Conditional Random Fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of ICML-01, pp. 282–289.
  • McCallum, A.K. (2002). MALLET: a machine learning for language toolkit.
  • McDonald, R. and Pereira, F. (2004). Identifying gene and protein mentions in text using conditional random fields. In: A Critical Assessment of Text Mining Methods in Molecular Biology workshop, 2004.
  • Rabiner, L. (1993) A tutorial on hidden Markov models and selected applications in speech recognition. In Waibel, A. and Lee, K.F. (eds), Readings in Speech Recognition. Morgan Kaufmann Publishers, San Francisco, CA, pp. 267–296.
  • (Sha & Pereira, 2003) ⇒ Fei Sha, and Fernando Pereira. (2003). “Shallow Parsing with Conditional Random Fields.” In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT-NAACL 2003). doi:10.3115/1073445.1073473
  • Tanabe, L. and Wilbur, W. (2002). Tagging gene and protein names in biomedical text. Bioinformatics, 18, 1124–1132.
  • UPenn Biomedical Information Extraction Group (2003) BioEntities: entity definitions for oncology.
  • Wang, J.Y., Lian, S.T., Chen, Y.F., Yang, Y.C., Chen, L.T., Lee, K.T., Huang, T.J. and Lin, S.R. (2002). Unique K-ras mutational pattern in pancreatic adenocarcinoma from Taiwanese patients. Cancer Lett., 180, 153–158.
  • Yu, H., Hatzivassiloglou, V., Rzhetsky, A. and Wilbur, W.J. (2002). Automatically identifying gene/protein terms in MEDLINE abstracts. J. Biomed. Inform., 35, 322–330.


2004 AnEntityTagger
Author: Ryan T. McDonald, R. Scott Winters, Mark Mandel, Yang Jin, Peter S. White, Fernando Pereira
title: An Entity Tagger for Recognizing Acquired Genomic Variations in Cancer Literature
journal: Bioinformatics Subject Area
titleUrl: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/17/3249
doi: 10.1093/bioinformatics/bth350
year: 2004