PPLRE High-level System Design

From GM-RKB

This page introduces the proposed design of the PPLRE Project.

Overview

The system is designed as a pipeline composed of the following subsystems: a document/information retriever (IRetriever), a document preprocessor (Preprocessor), a natural language processing annotator (Annotator), and a relationship extractor (RelExtractor). The system output is a set of organism/protein/localization relationships, each with a pointer to the passage that supports it.

File:PPLREDesignEval.gif

Figure 1 -- High-level design of the PPLRE system. The IRetriever downloads a document; the Preprocessor converts the document into a common file format; the Annotator adds deep syntactic and shallow semantic information; finally, the RelExtractor identifies PLO relationships based on the REModel and appends them to a table.
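
The stage-by-stage flow in Figure 1 can be illustrated with a minimal sketch. The Document record, its field names, and the placeholder stage bodies are assumptions made for illustration only; they are not taken from the PPLRE code base.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Document:
    """Hypothetical record handed from one pipeline stage to the next."""
    doc_id: str
    text: str = ""
    annotations: Dict[str, list] = field(default_factory=dict)
    # Each relation: (organism, protein, localization, supporting passage)
    relations: List[Tuple[str, str, str, str]] = field(default_factory=list)


Stage = Callable[[Document], Document]


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        doc = stage(doc)
    return doc


def preprocessor(doc: Document) -> Document:
    # Placeholder: normalize whitespace instead of a real format conversion.
    doc.text = " ".join(doc.text.split())
    return doc


def annotator(doc: Document) -> Document:
    # Placeholder: whitespace tokenization instead of real NLP annotation.
    doc.annotations["tokens"] = doc.text.split()
    return doc


doc = run_pipeline(
    Document("doc1", "  protein1 is found\nin localization1 "),
    [preprocessor, annotator],
)
```

Modeling each subsystem as a function from Document to Document keeps the stages independent, so a stage could be swapped out without touching the others.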

PPLRE IRetriever

The IRetriever subsystem downloads abstracts and papers from PubMed and PubMed Central based on a set of PMIDs or PMCIDs. Each article is added to a locally maintained, file-system-based corpus.

  • Note: we plan to switch to the MEDLINE corpus.
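
As a rough sketch of the retrieval step, the code below builds a request for PubMed abstracts and caches each one in a local directory tree. It assumes the NCBI E-utilities efetch endpoint is used; the page does not say how the IRetriever actually downloads articles. The function names and corpus layout are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint (assumed; not stated in the design).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids):
    """Build an efetch URL that requests plain-text abstracts for the PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def corpus_path(corpus_root, pmid):
    """Hypothetical corpus layout: <corpus_root>/pubmed/<pmid>.txt"""
    return Path(corpus_root) / "pubmed" / f"{pmid}.txt"


def fetch_abstract(pmid, corpus_root):
    """Download one abstract unless it is already in the local corpus."""
    path = corpus_path(corpus_root, pmid)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        with urlopen(build_efetch_url([pmid])) as response:
            path.write_bytes(response.read())
    return path
```

Checking the local corpus before downloading means repeated runs over the same PMID list do not hit the network again.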

Preprocessor

The Preprocessor converts an input document into a format that facilitates further processing (e.g. PDF to text) and extracts information that can be inferred from the document's structure (e.g. title, author).

Annotator

The Annotator subsystem applies natural language processing techniques to a document. The steps currently performed are part-of-speech tagging, grammar parse-tree generation, semantic role labeling, and named-entity recognition (both generic and domain-specific).

RelExtractor

The RelExtractor subsystem reports the organism/protein/localization relationships that are extracted from a document. The output is placed into a table (see Figure 1).

Examples of the process are located in a section below entitled “Examples”.

REModel

Notice that the RelExtractor subsystem above includes an internal data structure named REModel. This structure is a predictive model containing the logic that identifies the desired relationship information in a document. The main challenge of the PPLRE task is to create an REModel that has high precision and relatively high recall.
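
For reference, precision and recall over extracted triples can be computed as in the sketch below. It assumes that predictions and expert-identified relationships are both represented as (organism, protein, localization) tuples and that a prediction counts as correct only on an exact match. The page does not specify the project's actual matching criteria.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for predicted triples against gold triples.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold relationships
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Placeholder triples in the style used on this page.
gold = [("O1", "P1", "L1"), ("O1", "P2", "L2"), ("O2", "P3", "L1"), ("O2", "P4", "L3")]
predicted = [("O1", "P1", "L1"), ("O1", "P9", "L9")]
precision, recall = precision_recall(predicted, gold)
```

Here one of the two predictions is correct, and one of the four gold relationships is found.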

Subjective REModel

One of the preliminary models was built by hand through trial and error, because that was the easiest way to get started. One of its patterns, for example, was: if a sentence contains a single organism, a single protein, and a single cell location, then predict that the triple is a relationship. As expected, this approach did not produce a model with the necessary recall.
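
The example pattern can be written down as a short function. This is a sketch only: it assumes the Annotator has already produced per-sentence lists of entity mentions, and it reads "single" as "exactly one distinct name", which is one interpretation of the rule. The function name is hypothetical.

```python
def single_mention_rule(organisms, proteins, locations):
    """Predict a triple only when the sentence names exactly one of each.

    Each argument is the list of mentions of that entity type found in a
    single sentence. Returns (organism, protein, location) or None.
    """
    distinct = [set(organisms), set(proteins), set(locations)]
    if all(len(names) == 1 for names in distinct):
        return (organisms[0], proteins[0], locations[0])
    return None


hit = single_mention_rule(["organism1"], ["protein1"], ["localization1"])
miss = single_mention_rule(["organism1"], ["protein1", "protein2"], ["localization1"])
```

The second call returns None because the sentence names two proteins. Sentences like that are common in practice, which is consistent with the low recall noted above.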

Learned REModel

In general, the REModel will not be built manually by a human expert. Instead, a learning algorithm will train the model on a set of articles that contain pre-identified organism/protein/localization relationships. The subsystem responsible for creating the model is called RelExtractorTrainer.

File:PPLREDesignTrain.gif

Figure 2 -- High-level design of the PPLRE training phase. In the training phase, the RelExtractorTrainer subsystem creates the model used by RelExtractor to make its predictions. The input to RelExtractorTrainer is the set of m documents with which the domain experts have associated a localization relationship. For example, if document doc1 contains the relationship [organism1, protein1, localization1], then the triple [O1, P1, L1] will be provided to the trainer.
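
One way the expert-provided triples might be turned into training data is sketched below. The approach is an assumption, not the documented behavior of RelExtractorTrainer: it enumerates every organism/protein/localization combination found in a document and labels a combination positive when the experts listed it. Both function names are hypothetical.

```python
from itertools import product


def candidate_triples(organisms, proteins, locations):
    """Enumerate every (organism, protein, location) combination."""
    return list(product(organisms, proteins, locations))


def label_candidates(candidates, gold_triples):
    """Pair each candidate with True if the experts listed it, else False."""
    gold = set(gold_triples)
    return [(triple, triple in gold) for triple in candidates]


# doc1 mentions one organism, two proteins, and one location; the experts
# associated only [O1, P1, L1] with it.
candidates = candidate_triples(["O1"], ["P1", "P2"], ["L1"])
examples = label_candidates(candidates, [("O1", "P1", "L1")])
```

Under this assumption, the unlisted combination (O1, P2, L1) becomes a negative example for the learner.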

Model Type 2 - Snowball derivative

Model Type 3 - Zhongmin's model

Model Type 4 - Gabor's model

Technical Design

  • Each subsystem in the pipeline can process a batch of items by distributing the work across the many nodes of the Buster computing cluster.
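
Distributing a batch across nodes starts with splitting it into per-node shards. The sketch below uses a simple round-robin split. How jobs are actually scheduled on the Buster cluster is not described on this page, so the function and its parameters are illustrative assumptions.

```python
def shard(items, num_nodes):
    """Split a batch into num_nodes shards using round-robin assignment.

    Item i goes to shard i % num_nodes, so shard sizes differ by at most one.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return [items[i::num_nodes] for i in range(num_nodes)]


# Seven placeholder document ids spread across three nodes.
shards = shard(list(range(7)), 3)
```

Because each document is processed independently by a given subsystem, each shard can be handed to a separate node and the results concatenated afterwards.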