NER Predictor Feature

References

(Tsai et al., 2006) ⇒ Tzong-han Tsai, Wen-Chi Chou, Shih-Hung Wu, Ting-Yi Sung, Jieh Hsiang, and Wen-Lian Hsu. (2006). “Integrating Linguistic Knowledge into a Conditional Random Field Framework to Identify Biomedical Named Entities.” In: Expert Systems with Applications: An International Journal, 30(1). doi:10.1016/j.eswa.2005.09.072
- In the NER problem, we regard each word in a sentence as a token. Each token is associated with a tag that indicates the category of the NE and the location of the token within the NE, for example, B_c, I_c where c is a category. These two tags denote respectively the beginning token and the following token of an NE in category c. In addition, we use the tag O to indicate that a token is not part of an NE. The NER problem can then be phrased as the problem of assigning one of 2n + 1 tags to each token, where n is the number of NE categories. In the JNLPBA 2004 task, there are 5 named entity categories and 11 tags. For example, one way to tag the phrase “IL-2 gene expression, CD28, and NFkappa B” in a paper is “B-DNA, I-DNA, O, O, B-protein, O, O, B-protein, I-protein”.
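
The tagging scheme quoted above can be illustrated with a minimal Python sketch. It builds the 2n + 1 tag set and decodes the quoted example back into entity spans. The function names `build_tag_set` and `extract_entities` are hypothetical and not taken from the paper. The category list assumes the five JNLPBA 2004 classes (protein, DNA, RNA, cell_line, cell_type), of which the quoted passage names only DNA and protein. The tokenization, which treats each comma as its own token, is inferred from the nine tags in the example.

```python
def build_tag_set(categories):
    """Return the 2n + 1 tags: one O tag plus a B- and an I- tag per category."""
    tags = ["O"]
    for category in categories:
        tags.append(f"B-{category}")
        tags.append(f"I-{category}")
    return tags


def extract_entities(tokens, labels):
    """Group tokens into (entity text, category) pairs from B-/I-/O labels."""
    entities = []
    current_tokens, current_category = [], None

    def flush():
        # Close the entity being built, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_category))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_tokens, current_category = [token], label[2:]
        elif label.startswith("I-") and current_category == label[2:]:
            current_tokens.append(token)
        else:
            # An O tag (or an I- tag with no matching open entity) ends the span.
            flush()
            current_tokens, current_category = [], None
    flush()
    return entities


# Assumed category names for the JNLPBA 2004 task (n = 5).
CATEGORIES = ["protein", "DNA", "RNA", "cell_line", "cell_type"]
TAGS = build_tag_set(CATEGORIES)

# The example phrase from the quoted passage, with punctuation as separate tokens.
TOKENS = ["IL-2", "gene", "expression", ",", "CD28", ",", "and", "NFkappa", "B"]
LABELS = ["B-DNA", "I-DNA", "O", "O", "B-protein", "O", "O", "B-protein", "I-protein"]

if __name__ == "__main__":
    print(len(TAGS))
    print(extract_entities(TOKENS, LABELS))
```

With five categories the tag set has 2 × 5 + 1 = 11 members, matching the count given in the passage, and decoding the example labels yields the spans “IL-2 gene” (DNA), “CD28” (protein), and “NFkappa B” (protein).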