2005 ASurveyOfCurrWorkInBioTextMining


Subject Headings: Biomedical Text Mining, Text Mining, Biomedical Literature text-mining, bioinformatics, natural language processing.


Cited By



  • The volume of published biomedical research, and therefore the underlying biomedical knowledge base, is expanding at an increasing rate. Among the tools that can aid researchers in coping with this information overload are text mining and knowledge extraction. Significant progress has been made in applying text mining to named entity recognition, text classification, terminology extraction, relationship extraction and hypothesis generation. Several research groups are constructing integrated flexible text-mining systems intended for multiple uses. The major challenge of biomedical text mining over the next 5–10 years is to make these systems useful to biomedical researchers. This will require enhanced access to full text, better understanding of the feature space of biomedical literature, better methods for measuring the usefulness of systems to users, and continued cooperation with the biomedical research community to ensure that their needs are addressed.

Named entity recognition

  • At first glance, the task of named entity recognition (NER) appears straightforward. The goal is to identify, within a collection of text, all of the instances of a name for a specific type of thing: for example, all of the drug names within a collection of journal articles, or all of the gene names and symbols within a collection of MEDLINE abstracts. Hanisch and de Bruijn and coworkers [9,10] believed that solving this problem would allow more complex text-mining tasks to be addressed. The idea is that recognising biological entities in text allows for further extraction of relationships and other information by identifying the key concepts of interest and allowing those concepts to be represented in some consistent, normalised form.
  • This task has been challenging for several reasons. First, there does not exist a complete dictionary for most types of biological named entities, so simple text-matching algorithms do not suffice. In addition, the same word or phrase can refer to a different thing depending upon context (eg ferritin can be a biological substance or a laboratory test). Conversely, many biological entities have several names (eg PTEN and MMAC1 refer to the same gene). Biological entities may also have multi-word names (eg carotid artery), so the problem is additionally complicated by the need to determine name boundaries and resolve overlap of candidate names.
  • Because of the potential utility and complexity of the problem, NER has attracted the interest of many researchers, and there is a tremendous amount of published research on this topic. Given the large amount of genomic information being generated by biomedical researchers, it should not be surprising that, in the genomics era, much of the work in biomedical NER has focused on recognising gene and protein names in free text.
  • The approaches generally fall into three categories: lexicon-based, rules-based and statistically based. Combined approaches also have been used. The output may be a set of tags assigning a predicted type to each word or phrase of interest, as in part-of-speech (POS) tagging [11], or a score designating the confidence that a word or phrase is of a given type of interest. Systems are typically measured in terms of precision (number of correct predictions divided by total number of predictions) and recall (number of correct predictions divided by number of actual named entities in the text). Precision and recall are often combined into a single measure, either using the F-score, defined as the harmonic mean of precision and recall (2PR/[P+R]) [12], or by reporting the balanced precision and recall, defined as the point where precision and recall are equal.
  • One of the most successful rules-based approaches to gene and protein NER in biomedical texts has been the AbGene system of Tanabe and Wilbur [13]. It has been used as the NER component in extracting relationships by several other researchers [14,15]. AbGene works by extending the Brill POS tagger [11,16,17] to include gene and protein names as a tag type, with the system trained on 7,000 hand-tagged sentences from biomedical text. AbGene then applies manually generated post-processing rules based on lexical-statistical characteristics that help further identify the context in which gene names are used and eliminate false positives and negatives. The system achieved a precision of 85.7 per cent at a recall of 66.7 per cent.
  • In contrast to the tagging approach used by Tanabe and Wilbur, Chang et al. created the GAPSCORE system [18], which assigns a numerical score to each word within a sentence by examining the appearance, morphology and context of the word and then applying a classifier trained on these features. Words with higher scores are more likely to be gene and protein names or symbols. After training on the Yapex corpus [19], precision, recall and F-score were computed for both exact matches and ‘sloppy’ matches (defined as a true positive if any part of the gene name is predicted correctly). The system performed much better with sloppy matches (precision 74 per cent, recall 81 per cent, F-measure 77 per cent) than with exact matches (precision 59 per cent, recall 50 per cent, F-measure 54 per cent).
  • A number of other groups have worked in this area. Hanisch et al. used a large dictionary of gene and protein names and semantically classified words that tend to appear in context with protein names, reporting a specificity of 95 per cent and sensitivity of 90 per cent [10]. Zhou et al. trained a hidden Markov model (HMM) on a set of features based on word formation (ie capitalisation), morphology (ie prefix and suffix), POS, semantic triggers (head nouns and verbs) and intra-document name aliases [20]. They reported an overall precision of 66.5 per cent at a recall of 66.6 per cent on the GENIA corpus [21]. Other gene and protein NER systems include those by Narayanaswamy et al. [22], Settles [23] and Mika and Rost [24].
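
The evaluation measures quoted above (precision, recall, the F-score 2PR/[P+R], and exact versus ‘sloppy’ matching) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Span` type, the `evaluate` function and the character offsets are hypothetical, and the sloppy rule is only one possible reading of "any part of the gene name is predicted correctly" (a prediction counts if it overlaps a gold-standard mention). It is not the scoring code used by any of the systems cited.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """An entity mention as character offsets (hypothetical representation)."""
    start: int  # inclusive
    end: int    # exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, gold, sloppy=False):
    """Return (precision, recall, F-score) for predicted vs gold mentions.

    Exact mode: a prediction is correct only if its boundaries equal a
    gold mention. Sloppy mode (assumed reading): any overlap counts.
    """
    pred_set, gold_set = set(predicted), set(gold)
    if sloppy:
        correct_pred = sum(any(overlaps(p, g) for g in gold_set) for p in pred_set)
        found_gold = sum(any(overlaps(g, p) for p in pred_set) for g in gold_set)
    else:
        correct_pred = found_gold = len(pred_set & gold_set)
    precision = correct_pred / len(pred_set) if pred_set else 0.0
    recall = found_gold / len(gold_set) if gold_set else 0.0
    return precision, recall, f_score(precision, recall)


# Made-up offsets: three gold mentions, three predictions.
gold = [Span(0, 4), Span(9, 14), Span(20, 34)]
predicted = [Span(0, 4), Span(9, 12), Span(40, 45)]

exact = evaluate(predicted, gold)                # one exact boundary match
sloppy = evaluate(predicted, gold, sloppy=True)  # Span(9, 12) now also counts
```

On this toy input, one of three predictions matches exactly, so exact precision and recall are both 1/3, while sloppy matching credits the partial hit and raises both to 2/3. Applying `f_score` to the GAPSCORE figures quoted above gives about 0.77 for sloppy matches (P = 0.74, R = 0.81) and about 0.54 for exact matches (P = 0.59, R = 0.50), in agreement with the reported F-measures.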



  • Page: 2005 ASurveyOfCurrWorkInBioTextMining
  • Author: Aaron M. Cohen, William R. Hersh
  • Title: A Survey of Current Work in Biomedical Text Mining
  • URL: http://skynet.ohsu.edu/~hersh/briefings-05-cohen.pdf
  • DOI: 10.1093/bib/6.1.57