2008 OverviewofBioCreativeIIGeneMent

Subject Headings: BioCreative II, Gene Mention Task.

Notes

Cited By

Quotes

Abstract

Nineteen teams presented results for the Gene Mention Task at the BioCreative II Workshop. In this task participants designed systems to identify substrings in sentences corresponding to gene name mentions. A variety of different methods were used and the results varied with a highest achieved F1 score of 0.8721. Here we present brief descriptions of all the methods used and a statistical analysis of the results. We also demonstrate that, by combining the results from all submissions, an F score of 0.9066 is feasible, and furthermore that the best result makes use of the lowest scoring submissions.

Background

Results and discussion

Basic concepts

Before proceeding to the individual system descriptions, we give, for readers who are not familiar with natural language processing (NLP), a few paragraphs summarizing the basic terminology. For an introduction to NLP see [9] or [10]. Text is commonly processed by segmenting it into sentences or excerpts, and tokenized by breaking it up further into words, numbers, and punctuation generally called tokens, which each consist of a string of characters without white space. In this process, hyphens and punctuation often receive special treatment. A word may be further analyzed by a process called lemmatization into its lemma, which is the uninflected base form of the word that you would find as a dictionary entry. Different derivations and inflections are said to have this base form as their lemma. There is sometimes ambiguity in this concept. Alternatively, words may be stemmed by an algorithm that strips off suffixes to yield a reduced form, and this often gives a good approximation to the lemma. Tokens of text may be assigned tags which are categories from some given domain, for instance parts of speech (POS; for example, noun, verb, auxiliary). The process of identifying noun phrases and verb phrases is called chunking, which usually relies on POS tagging as its first step. As a further refinement, a sentence may be analyzed into its full syntactic structure, which is called parsing.
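The tokenization and stemming steps described above can be illustrated with a minimal sketch. The function names `tokenize` and `naive_stem`, the regular expression, and the suffix list are assumptions made for this illustration; they are not taken from any system in the challenge. Real stemmers, such as the Porter stemmer, apply many more rules.

```python
import re

def tokenize(sentence):
    """Split a sentence into word/number tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one punctuation character, so hyphens become their own tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

def naive_stem(token, suffixes=("ing", "ed", "es", "s")):
    """Lowercase the token and strip the first matching suffix (toy rule set)."""
    lowered = token.lower()
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone.
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered

tokens = tokenize("The p53-binding proteins were phosphorylated.")
stems = [naive_stem(t) for t in tokens]
print(tokens)
print(stems)
```

Note that the stem of "phosphorylated" under these toy rules is "phosphorylat", which is not a dictionary word. This is the sense in which stemming only approximates the lemma.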

Named entity recognition (NER) seeks to identify the words and phrases in text that reference entities in a given category, such as people, places, or companies, or in this application genes and proteins. NER is frequently accomplished with B-I-O tagging, which classifies each token as being at the beginning of the named entity (B), continuing the entity (I), or outside of any entity to be tagged (O). There are several lexical resources (sources of information about words) commonly used in solving the NER problem. A gazetteer is a list of names belonging to a particular category, such as places, persons, companies, genes, and so on. A lexicon is a source of information about different forms or grammatical properties of words. A thesaurus is a source of information indicating words with similar and/or related meanings. Systems in the BioCreative I challenge were classified as open if they used lexical resources, particularly gazetteers, and otherwise closed. A commonly used lexical resource is the Unified Medical Language System (UMLS), a controlled vocabulary of biomedical terminology maintained by the US National Library of Medicine.
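As a rough illustration of B-I-O tagging and gazetteer lookup, the sketch below marks gene mentions by longest-match lookup against a tiny gazetteer and converts the matched spans to tags. The helper names `gazetteer_spans` and `bio_tags`, the example sentence, and the two gazetteer entries are hypothetical. Participating systems generally learned their tags with statistical models rather than relying on exact lookup alone.

```python
def bio_tags(tokens, entity_spans):
    """Label each token B, I, or O given (start, end) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"                 # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"                 # remaining tokens of the entity
    return tags

def gazetteer_spans(tokens, gazetteer):
    """Return non-overlapping spans found by greedy longest-match lookup."""
    spans = []
    max_len = max((len(name) for name in gazetteer), default=0)
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so multi-token names win.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + length]) in gazetteer:
                spans.append((i, i + length))
                i += length
                break
        else:
            i += 1                        # no name starts at this token
    return spans

tokens = ["Mutations", "in", "breast", "cancer", "1", "and", "TP53", "were", "found", "."]
gazetteer = {("breast", "cancer", "1"), ("TP53",)}   # toy entries, for illustration only
spans = gazetteer_spans(tokens, gazetteer)
tags = bio_tags(tokens, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```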

Machine learning refers to computer algorithms that 'learn' to recognize concepts given a training set, which is a collection of pre-classified entities that serve as examples and counter-examples of the concept of interest. When training set examples have been classified by a human expert, the training is called supervised; otherwise it is unsupervised. Semisupervised approaches use a combination of the two. An important approach in machine learning describes each entity by a set of features, or attributes that are either present or absent for that entity. For example, the words appearing in text are frequently used as features, as are sequences of n words appearing consecutively, called n-grams. A new unseen entity can be analyzed into its description by features and categorized by a previously trained machine learning algorithm. Since most machine learning algorithms are very successful in classifying the examples of the training set, it is important to evaluate the performance of the algorithm on a test set of entities that do not appear in the training set. In this challenge, a test set was provided to participants for evaluating their systems after they were given a period of time with the training set. Often, it is necessary to randomly divide a collection (or corpus) and to use one portion as training and the remainder for testing. When this is done repeatedly it is called cross-validation. Decision trees, boosted decision trees, support vector machines (SVMs), and case-based reasoning are general machine learning methods. Some machine learning algorithms can be conveniently applied to problems involving tagging, including hidden Markov models (HMMs), SVMs, and conditional random fields (CRFs). There are public domain libraries that are frequently used for machine learning, among them WEKA [11] for general machine learning and MALLET [1] for CRFs.
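The notions of n-gram features and cross-validation can be sketched in a few lines. The functions `ngrams` and `k_fold_splits`, the fold count, and the fixed seed are assumptions chosen for this example; they do not reproduce the evaluation protocol of the challenge, which used a separately provided test set.

```python
import random

def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def k_fold_splits(items, k, seed=0):
    """Yield (train, test) pairs so that each item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)       # random division of the corpus
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal portions
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

print(ngrams(["the", "BRCA1", "gene", "product"], 2))

sentences = [f"sentence-{i}" for i in range(10)]    # stand-in for a corpus
for train, test in k_fold_splits(sentences, k=5):
    # A model would be trained on `train` and scored on `test` here.
    print(len(train), len(test))
```

Averaging the score over the folds gives an estimate of performance on unseen data that does not depend on a single arbitrary split.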

Individual

References

Author: Sophia Katrenko, Christian Blaschke, Preslav Nakov, Chengjie Sun, Chun-Nan Hsu, Cheng-Ju Kuo, Larry Smith, I-Fang Chung, Yu-Shi Lin, Roman Klinger, Kuzman Ganchev, Lorraine K. Tanabe, Rie J. Ando, Christoph M. Friedrich, Andreas Vlachos
Title: Overview of BioCreative II Gene Mention Recognition
DOI: 10.1186/gb-2008-9-s2-s2
Year: 2008