PPLRE Project Introduction

This page introduces the goals and challenges of the PPLRE Project.

Project Goal

The primary goal of the project is to semi-automatically populate a database that contains the Semantic Relations between Prokaryote organisms, their Proteins, and each protein's eventual Cellular Location, as reported in biomedical research papers. For example, a research paper that contains the sentence "Detergent fractionation of P. aeruginosa has demonstrated that OprM is located in the outer membrane." [PubMedID=7988873] would result in the following Data Record being added to the database: [P. aeruginosa | OprM | outer membrane].

Secondarily, the project will also advance the state of the art of Information Extraction Algorithms.


The primary motivation for this project is a desire to improve the accuracy of the PSORTb predictive model developed by the Brinkman Laboratory, and especially to extend it to a broader set of proteins. Given a protein’s sequence data, the PSORTb model predicts the protein's Localization. We expect that a significant increase in the number of known localizations will result in a significant improvement in the model's accuracy (ideally through higher Recall).
A secondary motivation of the project is to begin to create a similar database for Archaea, which could then also be used to create a predictive model for Archaea.
Finally, the project's longer-term motivation is to create a complete database of all of the known prokaryote subcellular localizations. Such a database will benefit further biomedical research and, in turn, humanity as a whole.

Ideal Outcome

The ideal outcomes of the project are:

  1. That the training dataset in ePSORTdb will grow enough to increase the recall of the PSORTb program.
  2. That we will identify a sufficient number of Archaeal protein localizations to create a useful predictive model.

Unfortunately, we do not know how many relationships we need to discover in order to accomplish this, nor do we know how many relationships are identifiable in the literature. As a starting point, our aim is to double the current number of localizations for Bacteria and to gather as many records on Archaea as there are currently records for Bacteria; that is, 2,000 localizations for Bacteria and 1,000 for Archaea.
Note that to accomplish this outcome we will have to significantly reduce the time a domain expert needs to assess the validity of each relationship. Our hope is that the review time will be approximately 5 person-minutes per valid localization.
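
To give a sense of scale, the targets above can be turned into a rough estimate of expert review effort. This is only a back-of-the-envelope sketch: the variable and function names are illustrative, and it assumes that the Archaea records all have to be found from scratch and that time spent on rejected candidates is already folded into the 5 person-minutes per valid localization.

```python
# Back-of-the-envelope review-effort estimate (illustrative names and assumptions).
CURRENT_BACTERIA = 1_000   # implied by "double the current number ... 2,000"
TARGET_BACTERIA = 2_000
TARGET_ARCHAEA = 1_000     # assumption: no Archaea records exist yet
MINUTES_PER_VALID = 5      # hoped-for review time per valid localization


def review_hours(new_localizations: int, minutes_each: int = MINUTES_PER_VALID) -> float:
    """Expert hours needed to validate the given number of new localizations."""
    return new_localizations * minutes_each / 60


new_bacteria = TARGET_BACTERIA - CURRENT_BACTERIA
new_total = new_bacteria + TARGET_ARCHAEA
print(f"{new_total} new localizations, about {review_hours(new_total):.0f} expert hours")
```

Under these assumptions the targets correspond to 2,000 new localizations, or roughly 167 hours of expert review.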


The project's approach is to apply advanced Information Extraction (IE) techniques to a Corpus of approximately 280,000 PubMed abstracts and articles that have been retrieved using Keywords likely to lead to such relationships.
The techniques will discover patterns based on:

  • A set of approximately 500 localizations that are already known or provided by our domain experts.
  • A set of positive and negative textual patterns that domain experts believe to be commonplace.
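
As a rough illustration of how positive and negative textual patterns might be combined, the sketch below applies one hand-written positive pattern and one negative cue list to a single sentence. It is not the project's actual algorithm: the regular expressions, the `Localization` type, and the `extract` function are all hypothetical, and a real system would use far richer patterns and name lexicons.

```python
import re
from typing import NamedTuple, Optional


class Localization(NamedTuple):
    """One candidate record: [organism | protein | location]."""
    organism: str
    protein: str
    location: str


# Toy sub-patterns (assumptions, not taken from the project).
ORGANISM = r"(?P<organism>[A-Z]\. [a-z]+)"            # abbreviated binomial, e.g. "E. coli"
PROTEIN = r"(?P<protein>[A-Z][a-z]{2}[A-Z0-9])"       # four-character names such as "OprM"
LOCATION = r"(?P<location>outer membrane|inner membrane|periplasm|cytoplasm)"

# Positive patterns assert a localization.
POSITIVE_PATTERNS = [
    re.compile(
        ORGANISM + r".*?\b" + PROTEIN
        + r" is (?:located|localized|found) in the " + LOCATION
    ),
]

# Negative patterns flag sentences that negate or merely speculate.
NEGATIVE_PATTERNS = [
    re.compile(r"\b(?:not|unlikely|hypothesi[sz]ed)\b", re.IGNORECASE),
]


def extract(sentence: str) -> Optional[Localization]:
    """Return a candidate record, or None if no pattern fires or a negative cue is present."""
    if any(p.search(sentence) for p in NEGATIVE_PATTERNS):
        return None
    for pattern in POSITIVE_PATTERNS:
        match = pattern.search(sentence)
        if match:
            return Localization(match["organism"], match["protein"], match["location"])
    return None


sentence = ("Detergent fractionation of P. aeruginosa has demonstrated "
            "that OprM is located in the outer membrane.")
print(extract(sentence))
```

For the example sentence this yields the record (P. aeruginosa, OprM, outer membrane), while a sentence such as "It was hypothesized that in P. aeruginosa OprM is located in the outer membrane." is rejected by the negative cue. Candidates produced this way would still go to a domain expert for validation.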