This page introduces the goals and challenges of the PPLRE Project.

Project Goal

The primary goal of the project is to semi-automatically populate a database of the Semantic Relations between Prokaryote organisms, their Proteins, and each protein's eventual Cellular Location, as reported in biomedical research papers. For example, a research paper that contains the sentence "Detergent fractionation of P. aeruginosa has demonstrated that OprM is located in the outer membrane." [PubMedID=7988873] would result in the following Data Record being added to the database: [P. aeruginosa | OprM | outer membrane].
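As a minimal sketch of what such a Data Record might look like in code, the following uses a small immutable class. The class name, field names, and the `as_row` helper are illustrative assumptions; the page does not describe the project's actual database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalizationRecord:
    """Hypothetical container for one organism/protein/location relation."""

    organism: str   # prokaryote organism named in the sentence
    protein: str    # protein named in the sentence
    location: str   # reported cellular location
    pubmed_id: str  # source of the supporting sentence

    def as_row(self) -> str:
        # Render the record in the bracketed form used on this page.
        return f"[{self.organism} | {self.protein} | {self.location}]"


record = LocalizationRecord(
    organism="P. aeruginosa",
    protein="OprM",
    location="outer membrane",
    pubmed_id="7988873",
)
print(record.as_row())
```

Keeping the PubMed identifier alongside the triple is one way to let a domain expert trace each record back to its supporting sentence during review.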

Secondarily, the project aims to advance the state of the art in Information Extraction Algorithms.

Motivation

The primary motivation for this project is a desire to improve the accuracy of the PSORTb predictive model developed by the Brinkman Laboratory, especially across a broader set of proteins. Given a protein's sequence data, the PSORTb model predicts the protein's Localization. We expect that a significant increase in the number of known localizations will significantly improve the model's accuracy (ideally through higher Recall).
A secondary motivation is to begin building a similar database for Archaea, which could then be used to create a predictive model for Archaea.
Finally, the project's longer-term motivation is to create a complete database of all known prokaryote subcellular localizations. Such a database will benefit further biomedical research and, in turn, humanity as a whole.

Ideal Outcome

The ideal outcomes of the project are:

That the training dataset in ePSORTdb will grow enough to increase the recall of the PSORTb program.
That we will identify a sufficient number of archaeal protein localizations to create a useful predictive model.

Unfortunately, we do not know how many relationships we need to discover to accomplish this, nor do we know how many relationships are identifiable in the literature. As a starting point, our aim is to double the current number of localizations for Bacteria and to gather as many records on Archaea as there are currently for Bacteria, that is, 2,000 localizations for Bacteria and 1,000 for Archaea.
Note that to accomplish this outcome we will have to significantly reduce the time a domain expert needs to assess the validity of each relationship. Our hope is that the review time will be approximately 5 person-minutes per valid localization.
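To put the review-time target in perspective, here is a rough back-of-the-envelope estimate. It assumes that doubling to 2,000 bacterial localizations means about 1,000 new bacterial records, that all 1,000 archaeal records are new, and that the 5-minute figure covers the full review cost per valid record; none of these assumptions is confirmed by the page.

```python
# Assumed inputs, derived from the targets stated above.
REVIEW_MINUTES_PER_VALID_LOCALIZATION = 5
new_bacterial_records = 2000 - 1000  # doubling implies roughly 1,000 new records
new_archaeal_records = 1000          # assumed to be entirely new records

total_new_records = new_bacterial_records + new_archaeal_records
total_review_minutes = total_new_records * REVIEW_MINUTES_PER_VALID_LOCALIZATION
total_review_hours = total_review_minutes / 60

print(f"{total_new_records} new records")
print(f"{total_review_minutes} person-minutes, about {total_review_hours:.0f} person-hours")
```

Under these assumptions the expert review alone comes to roughly 167 person-hours, which illustrates why the per-record review time matters so much.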

Approach

The project will apply advanced Information Extraction (IE) techniques to a Corpus of approximately 280,000 PubMed abstracts and articles that were retrieved using Keywords likely to lead to such relationships.
The techniques will discover patterns based on:

A set of approximately 500 localizations that are already known or provided by our domain experts.
A set of positive and negative textual patterns that domain experts believe to be commonplace.
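The idea of positive and negative textual patterns can be illustrated with a toy sketch. This is not the project's algorithm: the regular expressions, the tiny organism and location lexicons, and the `extract` function are all hypothetical, chosen only to show how one positive pattern and one negative pattern might yield or reject a candidate relation.

```python
import re

# Hypothetical lexicons; a real system would use far larger resources.
ORGANISMS = ["P. aeruginosa", "E. coli"]
LOCATIONS = ["outer membrane", "inner membrane", "cytoplasm", "periplasm"]

LOCATION_ALTERNATIVES = "|".join(re.escape(loc) for loc in LOCATIONS)

# Positive pattern: "<Protein> is located/localized in the <location>".
POSITIVE = re.compile(
    r"(?P<protein>[A-Z][A-Za-z0-9]+) is (?:located|localized) in the "
    r"(?P<location>" + LOCATION_ALTERNATIVES + r")"
)

# Negative pattern: an explicit negation of the same construction.
NEGATIVE = re.compile(r"\bis not (?:located|localized) in\b")


def extract(sentence):
    """Return (organism, protein, location), or None if no relation is found."""
    if NEGATIVE.search(sentence):
        return None
    match = POSITIVE.search(sentence)
    if match is None:
        return None
    organism = next((o for o in ORGANISMS if o in sentence), None)
    if organism is None:
        return None
    return (organism, match.group("protein"), match.group("location"))


sentence = (
    "Detergent fractionation of P. aeruginosa has demonstrated "
    "that OprM is located in the outer membrane."
)
print(extract(sentence))
print(extract("In E. coli, OprM is not located in the cytoplasm."))
```

Hand-written patterns like these are brittle, which is presumably why the project intends to discover patterns from known localizations and expert-supplied examples rather than enumerate them by hand.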
