PPLRE Research Topics - Relations across Multiple Sentence

From GM-RKB


  • Synopsis: Most current state-of-the-art Relation Recognition Algorithms only discover Semantic Relations that are contained within a single sentence. Recall performance could be improved by identifying relations that are expressed across multiple sentences. For example, in a biomedical document an organism is often identified early in the document and not explicitly restated in later sentences that mention one of its proteins. In general, a relation is more likely to spread across multiple sentences when it involves more than two entities (see Ternary Relations below). Possible approaches to this challenge include adding Anaphora Resolution and Coreference Resolution on named entities, and building a Text Graph that joins on these entities and then performing search on the graph.
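The Text Graph approach mentioned in the synopsis can be illustrated with a minimal sketch. It assumes the bracket tagging used in the example passages on this page (e.g. `[ORGANISM P. aeruginosa]`); the graph layout (sentence nodes linked to their entity nodes and to the neighboring sentence) and the function names `build_text_graph` and `shortest_path` are illustrative assumptions, not part of PPLRE.

```python
import re
from collections import defaultdict, deque

# Matches tags such as "[ORGANISM P. aeruginosa]" or "[PROTEINa virA]".
TAG = re.compile(r"\[([A-Z]+)[a-z]? ([^\]]+)\]")

def entities(sentence):
    """Return (type, text) pairs for every tagged entity in a sentence."""
    return [(m.group(1), m.group(2)) for m in TAG.finditer(sentence)]

def build_text_graph(sentences):
    """Undirected graph: each sentence node links to its entity nodes
    and to the preceding sentence node (assumed layout)."""
    graph = defaultdict(set)
    for i, sentence in enumerate(sentences):
        node = ("SENT", i)
        for ent in entities(sentence):
            graph[node].add(ent)
            graph[ent].add(node)
        if i > 0:
            prev = ("SENT", i - 1)
            graph[node].add(prev)
            graph[prev].add(node)
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

sentences = [
    "[ORGANISM P. aeruginosa] cells were grown in LB medium.",
    "The [LOCATION outer membrane] was analyzed by gel electrophoresis.",
    "Major proteins discovered included [PROTEIN OprM] and [PROTEIN OprN].",
]
graph = build_text_graph(sentences)
path = shortest_path(graph, ("ORGANISM", "P. aeruginosa"), ("PROTEIN", "OprM"))
```

Here the organism and the protein are connected only through a chain of three sentence nodes, so the path length gives a rough measure of how far apart the relation's arguments are. A real system would likely add coreference edges as well.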

Evidence

  • E1) In the current version of the curated data (PPLRE Curated Data v1.3) only 61.1% of the curated relations appear in a single sentence. The best possible Recall for an algorithm restricted to single sentences would therefore be 61.1%.
  • E2) It is common for one of the entities to be assumed during the later sentences of the discourse. For example, in the PPLRE Curated Corpus v1.3, 25.7% of all mentions of the Organism entity occur in the first sentence of the document, while only 10.2% of the protein mentions occur in the first sentence.
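The Recall ceiling in E1 is the fraction of curated relations whose arguments all co-occur in at least one sentence. A minimal sketch of that calculation follows; the data layout (each relation maps an argument role to the set of sentence indices where that argument is mentioned) and the toy values are assumptions for illustration, not the PPLRE curated data.

```python
def in_single_sentence(arg_sentences):
    """True if some sentence mentions every argument of the relation."""
    sentence_sets = list(arg_sentences.values())
    return bool(set.intersection(*sentence_sets))

def single_sentence_recall_ceiling(relations):
    """Fraction of relations recoverable by a single-sentence algorithm."""
    if not relations:
        return 0.0
    hits = sum(1 for r in relations if in_single_sentence(r))
    return hits / len(relations)

# Toy data (hypothetical): role -> sentence indices containing a mention.
toy_relations = [
    {"organism": {0}, "protein": {0, 2}, "location": {0}},  # all in sentence 0
    {"organism": {0}, "protein": {1}, "location": {1}},     # organism is apart
    {"organism": {3}, "protein": {3}, "location": {3, 4}},  # all in sentence 3
]
ceiling = single_sentence_recall_ceiling(toy_relations)
```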

Examples

Below are several sample passages in which the relation is spread over more than one sentence.


1) In this semi-idealized example each of the three entities is in a separate sentence.

This idealized scenario illustrates a natural sounding sequence of sentences in which all three sentences are required to recognize the relation.

  • "[ORGANISM P. aeruginosa] cells were grown in LB medium. The [LOCATION outer membrane] was analyzed by gel electrophoresis. Major proteins discovered included [PROTEIN OprM] and [PROTEIN OprN]."
  • The example was composed by manually stitching together three real-world sentences that each contained a single entity type.

2a) Organism not paired with Protein

Passage PPLRE Corpus 30.a.0-1 contains the three relations OPL(A. faecalis, amine dehydrogenase, periplasm), OPL(A. faecalis, azurin, periplasm), and OPL(A. faecalis, cytochrome c, periplasm).

  • "A lysozyme-osmotic shock method is described for fractionation of [ORGANISM Alcaligenes faecalis] which uses glucose to adjust osmotic strength and multiple osmotic shocks."
  • "During phenylethylamine-dependent growth, aromatic [PROTEINa amine dehydrogenase], [PROTEINb azurin], and a single [PROTEINc cytochrome c] were localized in the [LOCATION periplasm]."

2b) Organism not paired with Protein

Passage PPLRE Text 8611.a.0-1 contains two relations spread across two sentences.

    • "Plant signal molecules such as acetosyringone and certain monosaccharides induce the expression of [ORGANISM Agrobacterium tumefaciens] virulence (vir) genes, which are required for the processing, transfer, and possibly integration of a piece of the bacterial plasmid DNA (T-DNA) into the plant genome."
    • "Two of the vir genes, [PROTEINa virA] and [PROTEINb virG], belonging to the bacterial two-component regulatory system family, control the induction of vir genes by plant signals."
    • "[PROTEINa virA] encodes a [LOCATIONa membrane-bound] sensor kinase protein and [PROTEINb virG] encodes a [LOCATIONb cytoplasmic] regulator protein."

Notes:

  • Also illustrates the case where the organism is infrequently mentioned.
  • The organism and one of the proteins are mentioned in the title.
  • The relations are:
    • OPL(A. tumefaciens, virA, membrane)
    • OPL(A. tumefaciens, virG, cytoplasm).
  • However, the organism and the proteins are never mentioned in the same sentence.
  • The organism name is not even in a sentence that neighbors the sentence with the PL() relation.
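One simple way to bridge this gap is to carry the most recently mentioned organism forward to later sentences that contain a PL() pair. The sketch below is only a naive heuristic under stated assumptions: the per-sentence input (a list of organisms and a list of protein and location pairs) is assumed to come from an upstream tagger, and `attach_organism` is a hypothetical helper, not a PPLRE component.

```python
def attach_organism(sentences):
    """Turn sentence-level PL(protein, location) pairs into OPL triples by
    attaching the most recently mentioned organism (naive heuristic)."""
    current = None
    triples = []
    for sent in sentences:
        if sent["organisms"]:
            # Prefer the organism mentioned last in this sentence.
            current = sent["organisms"][-1]
        if current is None:
            continue  # no organism seen yet; nothing to attach
        for protein, location in sent["pl"]:
            triples.append((current, protein, location))
    return triples

# Passage 2b, reduced to its annotations (location strings as quoted).
passage_2b = [
    {"organisms": ["A. tumefaciens"], "pl": []},
    {"organisms": [], "pl": []},
    {"organisms": [], "pl": [("virA", "membrane-bound"), ("virG", "cytoplasmic")]},
]
triples = attach_organism(passage_2b)
```

The heuristic would mislabel documents that discuss several organisms, so in practice it would probably be combined with Coreference Resolution rather than used alone.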

2c) Organism not paired with Protein

The passage PPLRE Corpus 7181.a is an example where the organism and the protein are never mentioned in the same sentence.

  • "Phthiocerol dimycocerosate (PDIM), a surface-exposed polyketide lipid necessary for [ORGANISM Mycobacterium tuberculosis] virulence, is the product of several polyketide synthases including PpsE."
  • "Transport of [OTHER PDIM] requires [PROTEIN MmpL7], a member of the MmpL family of RND permeases."
  • "Overexpression of the interaction domain of [PROTEINa MmpL7] acts as a dominant negative to [LIPID PDIM] synthesis by poisoning the interaction between [PROTEINb synthase] and transporter."
  • "This suggests that [PROTEIN MmpL7] acts in complex with the synthesis machinery to efficiently transport [OTHER PDIM] across the [LOCATION cell membrane]."
  • "Coordination of synthesis and transport may not only be a feature of MmpL-mediated transport in [ORGANISM M. tuberculosis], but may also represent a general mechanism of polyketide export in many different microorganisms."

Notes:

  • The relation is:
    • OPL(M. tuberculosis, MmpL7, cell membrane).
  • The document title has the organism name but not the protein.
  • The document abstract only mentions the one organism.
  • A Relation Recognition Pattern for the PL() relation could be: <PROTEIN> acts in * to efficiently transport * across <LOCATION>
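As a rough illustration, the pattern above can be written as a regular expression over the bracket-tagged text. This is a sketch under assumptions: each `*` wildcard is treated as a non-greedy match within the sentence, an optional determiner ("the") is allowed before the location tag because the quoted sentence has one, and the names `PL_PATTERN` and `match_pl` are illustrative.

```python
import re

# <PROTEIN> acts in * to efficiently transport * across <LOCATION>
PL_PATTERN = re.compile(
    r"\[PROTEIN[a-z]? (?P<protein>[^\]]+)\] acts in .*? "
    r"to efficiently transport .*? across (?:the )?"
    r"\[LOCATION[a-z]? (?P<location>[^\]]+)\]"
)

def match_pl(sentence):
    """Return (protein, location) if the pattern fires, else None."""
    m = PL_PATTERN.search(sentence)
    if m is None:
        return None
    return (m.group("protein"), m.group("location"))

sentence = (
    "This suggests that [PROTEIN MmpL7] acts in complex with the synthesis "
    "machinery to efficiently transport [OTHER PDIM] across the "
    "[LOCATION cell membrane]."
)
pl = match_pl(sentence)
```

The match yields only the PL() pair; the organism still has to be supplied from another sentence of the passage.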

3) Location not paired with a Protein

This is an example where the location is not paired with the protein.

    • "The major protein band, of lower M(r), was detected in the [LOCATION periplasmic fraction]."
    • "It had the same M(r) as the [PROTEINa PS1] protein band detected in the supernatant of [ORGANISM C. glutamicum] cultures and presumably corresponds to the mature form of [PROTEINa PS1]."
  • Notes:
    • Two relevant patterns, one for each of the two sentences, are shown below. Notice that the concept "band" is a candidate connector between them.
      • PROTEIN <= band ⇒ detected ⇒ in ⇒ supernatant ⇒ of ⇒ cultures ⇒ ORGANISM
      • band <= detected ⇒ in ⇒ LOCATION
    • The two mentions of the identically named protein (PS1) can be resolved via simple Co-reference Resolution.
    • The title could be helpful indirectly because it suggests that the paper is making claims about the protein and organism. However, it also mentions a different localization destination (extracellular).
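The idea of using "band" as a connector can be sketched as a join over partial pattern matches. The representation below (each partial match records its sentence index, its connector concept, and the entity slots it fills) and the helper `join_on_connector` are assumptions for illustration; the output should be read as a candidate relation only, since a shared connector word does not guarantee that both sentences refer to the same thing.

```python
from collections import defaultdict

def join_on_connector(partials):
    """Merge partial matches from different sentences that share a
    connector concept into one candidate slot assignment."""
    by_connector = defaultdict(list)
    for partial in partials:
        by_connector[partial["connector"]].append(partial)

    candidates = []
    for group in by_connector.values():
        sentence_ids = {p["sentence"] for p in group}
        if len(sentence_ids) < 2:
            continue  # nothing to join across sentences
        merged = {}
        for p in group:
            merged.update(p["slots"])
        candidates.append(merged)
    return candidates

# Partial matches for the two patterns in the notes above.
partials = [
    {"sentence": 0, "connector": "band",
     "slots": {"LOCATION": "periplasmic fraction"}},
    {"sentence": 1, "connector": "band",
     "slots": {"PROTEIN": "PS1", "ORGANISM": "C. glutamicum"}},
]
candidates = join_on_connector(partials)
```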