- LREC-2008 is the biennial conference of the European Language Resources Association (ELRA).
- It took place in Marrakech - the first time outside of the European Union.
- It had four days of workshops and three days of talks/posters.
- I attended one of the workshop days and two days of talks/posters.
- It had between 1,000 and 1,100 attendees.
- The workshop that I attended was the full-day workshop on "Building & Evaluating Resources for Biomedical Text Mining".
- The talks that made an impression on me are summarized below.
- The organizers were:
(Liberman, 2008) ⇒ Mark Liberman. (2008). “The Annotation Conundrum.” Invited Talk. In: Proceedings of the Workshop on Building & Evaluating Resources for Biomedical Text Mining co-located with LREC-2008.
- Mark Liberman is with the Linguistic Data Consortium (LDC), which is based at the University of Pennsylvania.
- He suggested that the consensus now is that annotation effort is significant, particularly if you strive for high inter-annotator agreement.
- Annotation efforts can take over a year.
- He compared the challenge to the problem that natural language database query research encountered in the late 80s: it was bogged down by knowledge engineering requirements. However, it appears that for information extraction the payoff is, in some cases, large enough to justify the price.
- He suggested that we loosen the requirement for a gold-standard baseline, just as was done for machine translation.
- He suggested that we invest in ontologies.
- I asked where he saw the corpora for the biomedical domain in the next five years. He suggested that they would more likely still be distributed than be available for positive feedback. He offered LDC as a placeholder for the PPLRE data.
- The loosening of the gold-standard requirement is helpful to our PPLRE project because we have focused on more data rather than on very clean data.
(Alex et al., 2008) ⇒ Bea Alex, Claire Grover, Barry Haddow, Mijail Kabadjov, Ewan Klein, Michael Matthews, Stuart Roebuck, Richard Tobin and Xinglong Wang. (2008). “The ITI TXM Corpora: Tissue Expressions and Protein-Protein Interactions.” In: Proceedings of the Workshop on Building & Evaluating Resources for Biomedical Text Mining co-located with LREC-2008.
- This paper reports on a thorough annotation project of PubMed papers with several annotators working for over a year.
- The project resulted in two different corpora: one for protein-protein interactions (PPI), the other for tissue expression (TE).
- "The resulting corpus consists of 217 documents, 133 selected from PubMedCentral and 84 documents selected from the whole of PubMed. Document selection for the TE corpus was performed against PubMed."
- Annotation was performed by a group of nine biologists, all qualified to PhD level in biology, working under the supervision of an annotation manager (also a biologist) and collaborating with a team of NLP researchers.
- They have a nice characterization of relation attributes
- Before I publish the PPLRE dataset I may borrow their XML markup approach.
- I asked for the proportion of inter-sentential relations, and they provided an anecdotal estimate of 10% for PPI and 30% for TE.
- The authors claim that these are the first sizeable corpora for tissue expression (TE).
- The ITI TXM corpora were created as part of an ITI Life Sciences Scotland (http://www.itilifesciences.com) research programme with Cognia EU and the University of Edinburgh.
- The full talks and posters that made an impression on me are summarized below.
- Some of the talk sessions had names such as: Information Extraction & Question Answering; Language Resources for Specific Domains: Bio-Medicine and Chemistry; Biomedical Resources; Opinion Mining and Summarization; and Ontologies.
(Haddow and Alex, 2008) ⇒ Barry Haddow and Beatrice Alex. (2008). “Exploiting Multiply Annotated Corpora in Biomedical Information Extraction Tasks.” In: Proceedings of LREC-2008.
- This paper experiments with the performance increase that can be gained from having documents annotated by more than one person. It is common practice to have documents reviewed by more than one person/annotator in order to report the IAA (inter-annotator agreement). The IAA then becomes the upper bound expected from an automated solution.
- Interestingly, for their NER and Relation Detection tasks, it appeared to be more effective to have a person annotate a brand-new paper than to spend that time cleaning the data by having a second annotator redo an existing annotation.
- Figure 1 of the paper provides the empirical evidence. As more training records are added, performance improves more if the record is new rather than scrubbed. "Comparison of the improvement gained from adding further singly annotated data, versus further multiply annotated data, for (a) PPI and (b) TE named entity recognition."
- This result helps us because for the PPLRE annotation we did not perform a robust inter-annotator agreement analysis. While we did have much of the data reviewed twice or thrice, we did not analyze agreement. That is, we focused on annotating more papers rather than on very clean data. This paper provides some evidence for the appropriateness of this decision.
- This is another paper by the School of Informatics group at the University of Edinburgh.
- It would be interesting to present an analysis of how to estimate IAA from a few samples rather than to use the apparently wasteful exercise of having every record reviewed by another person/annotator.
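As a note to self, the sample-based IAA idea could be sketched roughly as follows. This is only an illustration under my own assumptions: it uses Cohen's kappa as the agreement measure (the paper may well report IAA differently, for example as F1), it assumes two annotators labelled the same small random sample of items, and the function names `cohen_kappa` and `bootstrap_interval` are hypothetical. The bootstrap interval gives a rough sense of how much the estimate could move with a sample of that size.

```python
import random
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


def bootstrap_interval(labels_a, labels_b, rounds=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa on the doubly annotated sample."""
    rng = random.Random(seed)
    n = len(labels_a)
    estimates = []
    for _ in range(rounds):
        # Resample items (not annotators) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            cohen_kappa([labels_a[i] for i in idx], [labels_b[i] for i in idx])
        )
    estimates.sort()
    lower = estimates[int((alpha / 2) * rounds)]
    upper = estimates[min(rounds - 1, int((1 - alpha / 2) * rounds))]
    return lower, upper
```

If the interval on a small doubly annotated sample is already tight enough, the remaining annotation budget could go to new documents instead of second passes.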
(Wang and Grover, 2008) ⇒ Xinglong Wang and Claire Grover. (2008). “[http://www.gabormelli.com/References/2000s/2008/2008_LearningSpeciesOfBiomedNEs/2008_LearningSpeciesOfBiomedNEs.pdf Learning the Species of Biomedical Named Entities from Annotated Corpora].” In: Proceedings of LREC-2008.
- The paper proposes a relation detection algorithm for the relation Organism_Component(Organism,Component), where Component is a set of entity types such as: Protein, Gene, and others. Note that the authors don't present their task as a relation detection one.
- The motivation for this work (as we have discovered in PPLRE) is that having this information improves the performance for grounding the mention X to some corresponding database/ontology record. The improvement is due to the ambiguity that can exist in entity names. For example, protein names are often reused for different species - e.g. mouse and human.
- The authors propose (as we have done in PPLRE) a heuristic to perform NER of the ORGANISM type.
- It was unclear what proportion of the relations were intersentential. I have sent them an email asking this question.
- Curiously, they do not treat the organism/species mentions as first-class entities. They instead refer to them as "species words".
- I was surprised to see that the binary relation between an Organism and its constituents was still an open problem!
- We could publish a paper simply on analyzing this binary relation in PPLRE.
- Our advantages include that:
- we treat the problem as a general relation extraction one.
- our features are neither custom nor manually constructed.
(Nguyen, Kim and Tsujii, 2008) ⇒ Ngan Nguyen, Jin-Dong Kim and Jun'ichi Tsujii. (2008). “[http://www.gabormelli.com/References/2000s/2008/2008_ChallengesInPronounResolutionForBiomedText/2008_ChallengesInPronounResolutionForBiomedText.pdf Challenges in Pronoun Resolution System for Biomedical Text].” In: Proceedings of LREC-2008.
- This paper covers the topic of pronoun resolution in Biomedical Text.
- Even though we have not performed this type of resolution in PPLRE we may be asked by reviewers to discuss the topic and this may be a good reference.
- Also, given Jian Su's challenge (see below) that anaphora resolution reduces the number of intersentential relations, we will likely revisit the topic of anaphora resolution.
(Sekine, 2008) ⇒ Satoshi Sekine. (2008). “[http://www.gabormelli.com/References/2000s/2008/2008_ExtendedNamedEntityOntologyWithAttribInf/2008_ExtendedNamedEntityOntologyWithAttribInf.pdf Extended Named Entity Ontology with Attribute Information].” In: Proceedings of LREC-2008.
- The author (Sekine) in previous work had extended the number of entity types from the small and simplistic set of PERSON, LOCATION, ORGANIZATION, TIME, MONEY to have 120 entity types (often these are subtypes of the larger types).
- In this paper the author reports the manual addition of attributes to the ~120 entity types (e.g. Person.DateOfBirth, Person.Gender, ...).
- Their work can be found at http://nlp.cs.nyu.edu/ene
- This work reminded me of the discussion with Martin about automatically discovering an entity's attributes from the text. The data reported in this paper could be our baseline.
- I asked the author whether he believed that the attributes he extracted could have been automatically extracted. He was not excited by the prospects of automated discovery.
- Discussion with Junichi Tsujii <is.s.u-tokyo.ac.jp@tsujii> about collaborating on Subcellular Locations
- I sent him a follow-up email about collaborating on the annotation of the subcellular location mentions and subcellular localization relations.
- I included links to:
- Discussion with Jian Su <lit.a-star.edu.sg@sujian> from the Institute for Infocomm Research, Singapore, about the proportion of inter-sentential relations.
- She suggested that a large proportion of these are due to coreference (e.g. pronouns).
- Her new paper at ACL appears to be on semi-supervised relation extraction.
- Discussion with Sophia Ananiadou <firstname.lastname@example.org>
- Her team is actively looking for senior researchers. I was tempted to submit a CV but opted to focus on my PhD. :-)
- She mentioned that the LREC journal is putting out a special issue on biomedical data.
- This may be a good venue for a paper on the PPLRE Corpus.
- BioNLP is another workshop relevant to PPLRE.
- This workshop series is co-located with ACL: http://compbio.uchsc.edu/BioNLP2008/
- The next workshop is on June 19, 2008, at ACL in Columbus, Ohio.
- Analyze the PPLRE relations to estimate the proportion of intersentential relations that would become intrasentential if anaphora resolution were accounted for. From the exercise we may also get a sense of how simple it is to perform pronoun resolution (there are challenges around explicit vs. implicit mentions). Afterwards, touch base with Jian Su about our results.
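A rough sketch of how this tally might be done once the relations have been reviewed by hand. Everything here is an assumption for illustration: the `RelationInstance` record, its field names, and the simplification to binary relations are hypothetical and do not reflect the actual PPLRE format. The idea is that each relation records the sentence index of each argument's explicit mention, plus the sentences where a pronoun or other anaphor was judged to refer to that argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationInstance:
    """One annotated binary relation (hypothetical record layout)."""
    arg1_sentence: int  # sentence index of the explicit mention of argument 1
    arg2_sentence: int  # sentence index of the explicit mention of argument 2
    # Sentences where an anaphor was manually judged to refer to each argument.
    arg1_anaphor_sentences: frozenset = frozenset()
    arg2_anaphor_sentences: frozenset = frozenset()


def is_intersentential(rel):
    """True when the two explicit mentions sit in different sentences."""
    return rel.arg1_sentence != rel.arg2_sentence


def becomes_intrasentential(rel):
    """True when resolving anaphors puts both arguments in one sentence."""
    if not is_intersentential(rel):
        return False
    arg1_sites = rel.arg1_anaphor_sentences | {rel.arg1_sentence}
    arg2_sites = rel.arg2_anaphor_sentences | {rel.arg2_sentence}
    return bool(arg1_sites & arg2_sites)


def summarize(relations):
    """Counts and proportions for the analysis described above."""
    total = len(relations)
    inter = [r for r in relations if is_intersentential(r)]
    recovered = [r for r in inter if becomes_intrasentential(r)]
    return {
        "total": total,
        "intersentential": len(inter),
        "recovered": len(recovered),
        "intersentential_share": len(inter) / total if total else 0.0,
        "recovered_share": len(recovered) / len(inter) if inter else 0.0,
    }


# Toy data only, not PPLRE figures.
sample = [
    RelationInstance(0, 0),
    RelationInstance(0, 1, arg1_anaphor_sentences=frozenset({1})),
    RelationInstance(2, 5),
    RelationInstance(3, 4),
]
stats = summarize(sample)
```

Here `recovered_share` is measured against the intersentential relations only, which is the number to compare with Jian Su's claim.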