PPLRE High-level System Design
This page introduces the proposed design of the PPLRE Project.
The system is designed as a pipeline composed of the following subsystems: a document/information retriever (IRetriever), a document preprocessor (Preprocessor), a natural language processing annotator (Annotator), and a relationship extractor (RelExtractor). The system output is a set of organism/protein/localization relationships, each with a pointer to the passage that supports it.
Figure 1 -- High-level design of the PPLRE system. The IRetriever downloads a document; the Preprocessor converts the document into a common file format; the Annotator adds deep syntactic and shallow semantic information; finally, the RelExtractor identifies PLO relationships based on the REModel and appends a table.
The IRetriever subsystem downloads abstracts and papers from PubMed and PubMedCentral based on a set of PMIDs or PMCIDs. Each article is added to a locally maintained, file-system-based corpus.
- Note: we plan to switch to the Medline corpus.
The Preprocessor converts an input document into a format that facilitates further processing (e.g. PDF to text) and extracts some of the information that can be inferred from its structure (e.g. title, author).
The Annotator subsystem applies natural language processing techniques to a document. The processing currently performed includes part-of-speech tagging, grammar parse-tree generation, semantic role labeling, and named-entity recognition (both generic and domain-specific).
The RelExtractor subsystem reports the PPL relationships that it extracts from a document. The output will be placed into a table (see Figure 1).
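The four-stage flow described above can be sketched as a chain of functions that each take a document and return it enriched. This is only a minimal illustration: the `Document` fields, the `run_pipeline` helper, and the placeholder stage bodies are assumptions made for this sketch, not the project's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Document:
    """Hypothetical container passed from one subsystem to the next."""
    doc_id: str
    text: str = ""
    annotations: Dict[str, list] = field(default_factory=dict)
    # Each relation: (organism, protein, localization, supporting passage)
    relations: List[Tuple[str, str, str, str]] = field(default_factory=list)


Stage = Callable[[Document], Document]


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        doc = stage(doc)
    return doc


# Placeholder stages: each stands in for the real subsystem of the same name.
def iretriever(doc: Document) -> Document:
    # The real IRetriever would download the article for doc.doc_id.
    return doc


def preprocessor(doc: Document) -> Document:
    # Stand-in for format conversion: normalize whitespace only.
    doc.text = " ".join(doc.text.split())
    return doc


def annotator(doc: Document) -> Document:
    # Stand-in for NLP annotation: naive whitespace tokenization only.
    doc.annotations["tokens"] = doc.text.split()
    return doc


def rel_extractor(doc: Document) -> Document:
    # The real RelExtractor would consult the REModel; this one adds nothing.
    return doc


PIPELINE = [iretriever, preprocessor, annotator, rel_extractor]
```

Keeping every stage behind the same signature means a stage can be replaced (for example, a different Annotator) without touching the others.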
Examples of the process are located in a section below entitled “Examples”.
Notice that the RelExtractor subsystem above includes an internal data structure named REModel. This structure is a predictive model that contains the logic for identifying the desired relationship information that a document may contain. The main challenge of the PPLRE task is to create an REModel that achieves high precision and relatively high recall.
One of the preliminary models was built by hand, because a simple trial-and-error approach made it easy to create. One of its patterns, for example, was: if a sentence contains a single organism, a single protein, and a single cell location, then predict that the triple is a relationship. As expected, this approach did not produce a model with the necessary recall.
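The single-entity pattern mentioned above is simple enough to write down directly. The sketch below assumes that the Annotator has already produced, for each sentence, a list of distinct entity mentions per type; the function name and the dictionary keys are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def single_entity_rule(entities: Dict[str, List[str]]) -> Optional[Tuple[str, str, str]]:
    """Predict a triple only when a sentence has exactly one entity of each type.

    `entities` maps an entity type ("organism", "protein", "location") to the
    distinct mentions of that type found in one sentence.
    """
    organisms = entities.get("organism", [])
    proteins = entities.get("protein", [])
    locations = entities.get("location", [])
    if len(organisms) == 1 and len(proteins) == 1 and len(locations) == 1:
        return (organisms[0], proteins[0], locations[0])
    # Any other combination yields no prediction, which is where recall is lost.
    return None
```

A sentence that mentions two proteins, or one that omits the organism, produces no prediction at all under this rule, which is consistent with the low recall noted above.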
The general method for creating the REModel is not manual construction by a human expert. Instead, a learning algorithm will train the model on the set of articles that contain pre-identified organism/protein/localization relationships. The subsystem responsible for creating the model is called RelExtractorTrainer.
Figure 2 – High-level design of the PPLRE training phase. In the training phase, the RelExtractorTrainer subsystem creates the model used by RelExtractor to make its predictions. The input to RelExtractorTrainer is the set of [math]m[/math] documents with which the domain experts have associated a localization relationship. For example, if document doc1 contains relationship [organism1, protein1, localization1], then the triple [O1, P1, L1] will be provided to the trainer.
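As an illustration of the trainer's input, the sketch below pairs each document with its expert-provided triples and derives sentence-level labels from them. The labeling heuristic (a sentence is positive if it mentions all three members of a known triple) is an assumption made for this example only; the page does not specify how RelExtractorTrainer uses the triples. The `TrainingDocument` class and `label_sentences` function are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (organism, protein, localization)


@dataclass
class TrainingDocument:
    """One of the m documents, together with its expert-provided triples."""
    doc_id: str
    sentences: List[str]
    relations: List[Triple]


def label_sentences(doc: TrainingDocument) -> List[Tuple[str, bool]]:
    """Mark a sentence positive if it mentions every member of a known triple.

    Assumed heuristic for illustration: plain case-insensitive substring
    matching, with no entity normalization.
    """
    labeled = []
    for sentence in doc.sentences:
        lowered = sentence.lower()
        positive = any(
            all(term.lower() in lowered for term in triple)
            for triple in doc.relations
        )
        labeled.append((sentence, positive))
    return labeled
```

The resulting (sentence, label) pairs are the kind of examples a learning algorithm could consume in place of hand-written patterns.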
Model Type 2 - Snowball derivative
Model Type 3 - Zhongmin's model
Model Type 4 - Gabor's model
- Each subsystem in the pipeline can process a batch of items by distributing the work across the many nodes of the Buster computing cluster.
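The batch distribution described above can be sketched as splitting a batch into chunks and handing each chunk to a worker. In this sketch a local thread pool merely stands in for cluster nodes; how jobs are actually submitted to the Buster cluster is not described here, and `split_batch` and `process_batch` are hypothetical helpers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def split_batch(items: Sequence[T], n_workers: int) -> List[List[T]]:
    """Split items into at most n_workers contiguous, non-empty chunks."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    if not items:
        return []
    size = -(-len(items) // n_workers)  # ceiling division
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def process_batch(items: Sequence[T], stage: Callable[[T], R], n_workers: int = 4) -> List[R]:
    """Run one pipeline stage over a batch, one chunk per worker.

    Results come back in the same order as the input items.
    """
    chunks = split_batch(items, n_workers)
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        chunk_results = pool.map(lambda chunk: [stage(x) for x in chunk], chunks)
        return [result for chunk in chunk_results for result in chunk]
```

Because each item is processed independently of the others, the same chunking carries over to any of the four subsystems.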