2008 ExploitingMultiplyAnnotatedCorporaInBiomedIE

(Haddow & Alex, 2008) ⇒ Barry Haddow, Beatrice Alex. (2008). “Exploiting Multiply Annotated Corpora in Biomedical Information Extraction Tasks.” In: Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008).

Subject Headings: Information Extraction, ITI TXM Corpora, Annotation Task

Notes

This paper measures the performance gain that can be obtained from having documents annotated by more than one person. It is common practice to have documents annotated by more than one annotator in order to report inter-annotator agreement (IAA). The IAA is then treated as the upper bound expected of an automated system.
Interestingly, for their NER and relation detection tasks, it appeared to be more effective to have an annotator annotate a brand-new paper than to spend the same effort cleaning existing data by having a second annotator redo an already annotated paper.
Figure 1 of the paper provides the empirical evidence: as more training documents are added, performance improves faster when the added documents are new (singly annotated) than when they are further multiply annotated documents. The caption reads: "Comparison of the improvement gained from adding further singly annotated data, versus further multiply annotated data, for (a) PPI and (b) TE named entity recognition."
This result helps us because for the PPLRE annotation we did not perform a robust inter-annotator agreement analysis. While much of the data was reviewed two or three times, we did not analyze agreement; that is, we prioritized annotating more papers over producing very clean data. This paper provides some evidence that this was an appropriate decision.
This is another paper by the School of Informatics group at the University of Edinburgh.
It would be interesting to see an analysis of how to estimate IAA from a small sample of documents, rather than relying on the apparently wasteful exercise of having every record reviewed by a second annotator.
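One way the sampling idea in the last note could be explored is sketched below. This is not the method used in the paper; it is a minimal illustration that assumes IAA is measured as a per-document F1 between two annotators' entity sets and that a percentile bootstrap is an acceptable way to attach an uncertainty range to a small sample. The function names `pairwise_f1` and `bootstrap_iaa`, and the example annotations, are hypothetical.

```python
import random
from statistics import mean


def pairwise_f1(ann_a, ann_b):
    """F1 agreement between two annotators' entity sets for one document.

    Assumption: annotator A is treated as the reference and B as the response.
    F1 is symmetric, so swapping them gives the same score.
    """
    a, b = set(ann_a), set(ann_b)
    if not a and not b:
        return 1.0  # both annotators marked nothing: count as full agreement
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)
    recall = overlap / len(a)
    return 2 * precision * recall / (precision + recall)


def bootstrap_iaa(doc_pairs, n_resamples=1000, seed=0):
    """Mean per-document F1 on a sample, with a 95% percentile bootstrap range.

    doc_pairs: list of (annotations_by_A, annotations_by_B), one per
    doubly annotated document in the sample.
    """
    rng = random.Random(seed)
    scores = [pairwise_f1(a, b) for a, b in doc_pairs]
    # Resample documents with replacement and recompute the mean each time.
    estimates = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = estimates[int(0.025 * n_resamples)]
    hi = estimates[int(0.975 * n_resamples) - 1]
    return mean(scores), lo, hi


if __name__ == "__main__":
    # Hypothetical sample: entities as (start, end, type) tuples per document.
    sample = [
        ({(0, 3, "Protein"), (10, 14, "Protein")}, {(0, 3, "Protein")}),
        ({(5, 9, "Tissue")}, {(5, 9, "Tissue")}),
        ({(2, 6, "Protein")}, {(2, 7, "Protein")}),
    ]
    print(bootstrap_iaa(sample))
```

A wide bootstrap range would suggest that the sample is too small to report IAA with confidence; a narrow one would suggest that further double annotation adds little to the estimate.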

Email

> I am following up on my question about the x-axis of the learning curves in Figure 1 of your LREC08 paper on exploiting multiply annotated corpora.
>
> Figure 1: Comparison of the improvement gained from adding further singly annotated data, versus further multiply annotated data, for (a) PPI and (b) TE named entity recognition.
>
> It would help me to reference your paper if you could tell me to quantify how many records each of the values in the x-axis represented.

The relevant figures (numbers of documents in the training set) are as follows:

x-axis value	ppi corpus (both re and ner)	te corpus (both re and ner)
0	82	115
1	91	123
2	100	131
3	109	139
4	118	146
5	127	154
6	136	162
7	145	169
8	154	177
9	163	185
10	172	192
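
The training-set sizes in the reply above can be turned into per-step increments with a few lines of Python. This is only a convenience sketch for reading Figure 1; the numbers are copied from the reply, while the names `PPI_TRAIN_DOCS`, `TE_TRAIN_DOCS`, `increments`, and `docs_added` are illustrative and not taken from the paper.

```python
# Training-set sizes (documents) per x-axis value 0..10, as given in the reply.
PPI_TRAIN_DOCS = [82, 91, 100, 109, 118, 127, 136, 145, 154, 163, 172]
TE_TRAIN_DOCS = [115, 123, 131, 139, 146, 154, 162, 169, 177, 185, 192]


def increments(sizes):
    """Number of documents added between consecutive x-axis values."""
    return [after - before for before, after in zip(sizes, sizes[1:])]


def docs_added(sizes, x):
    """Documents added to the baseline training set at x-axis value x."""
    return sizes[x] - sizes[0]


if __name__ == "__main__":
    print("PPI increments:", increments(PPI_TRAIN_DOCS))
    print("TE increments:", increments(TE_TRAIN_DOCS))
    print("PPI docs added at x=10:", docs_added(PPI_TRAIN_DOCS, 10))
    print("TE docs added at x=10:", docs_added(TE_TRAIN_DOCS, 10))
```

Each x-axis step therefore corresponds to 9 additional documents for the PPI corpus and to 7 or 8 additional documents for the TE corpus, for totals of 90 and 77 added documents at the final point.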

Cited By

Quotes

Abstract

This paper discusses the problem of utilising multiply annotated data in training biomedical information extraction systems. Two corpora, annotated with entities and relations, and containing a number of multiply annotated documents, are used to train named entity recognition and relation extraction systems. Several methods of automatically combining the multiple annotations to produce a single annotation are compared, but none produces better results than simply picking one of the annotated versions at random. It is also shown that adding extra singly annotated documents produces faster performance gains than adding extra multiply annotated documents.

References


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2008 ExploitingMultiplyAnnotatedCorporaInBiomedIE	Beatrice Alex Barry Haddow			Exploiting Multiply Annotated Corpora in Biomedical Information Extraction Tasks		Proceedings of the 6th Language Resources and Evaluation Conference	http://www.ltg.ed.ac.uk/np/publications/ltg/papers/Haddow2008Exploiting.pdf			2008