070309 DrAmosBairoch


Subject Headings: SwissProt.



I met with Dr. Amos Bairoch, a Swiss-Prot expert, on 2007-March-09 to discuss the application of Swiss-Prot to the PPLRE project. Topics discussed included: Swiss-Prot, PPLRE, NER - Protein, Swissknife, IE - Life Sciences.
Tasks arising from the meeting:

  • Download the latest version of Swiss-Prot data (DONE)
  • Analyze the quantity of SUBCELLULAR LOCATION entries in Swiss-Prot. (DONE)
  • Follow up with Fiona about creating a mapping between PSORTdb's and Swiss-Prot's SCL terminology.
  • Analyze the overlap between ePSORTdb and Swiss-Prot
  • Extract the relevant information from Swiss-Prot.
  • Decide on how best to use the information (team meeting).
  • Move Swiss-Prot processing code to use Swissknife
  • Research the NER - Protein task again.

Subcellular Location Data in Swiss-Prot

Controlled Vocabulary for SUBCELLULAR LOCATION

  • By happy coincidence, Swiss-Prot is about to update the SUBCELLULAR LOCATION data field to align it with a Controlled Vocabulary.
  • He met earlier with Fiona and began discussions on creating a mapping to their terminology. (Task: chat with Fiona about this)
  • The SUBCELLULAR LOCATION field will still contain free form comments.

Swiss-Prot and ePSORTdb

I spent some time with him working through an example in order to be clear about how to align ePSORTdb data with Swiss-Prot data.

The Example: Ubiquinol oxidase

Location/Localization Terminology

  • Notice how the "COMMENTS" section of the Swiss-Prot record contains the entry:
    • SUBCELLULAR LOCATION: Cell inner membrane; multi-pass membrane protein.
  • On ePSORTdb the experimental_scl property reads:
    • CytoplasmicMembrane
  • This is an example where we need the mapping between the two projects' vocabularies.
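
The mapping could start as a simple lookup, sketched below in shell. The function name map_scl is hypothetical, and only the one pair from the Ubiquinol oxidase example is filled in; every other term is reported as UNMAPPED until the real mapping exists.

```shell
# Hypothetical lookup from a Swiss-Prot SUBCELLULAR LOCATION term to an
# ePSORTdb experimental_scl term. Only the pair seen in the example
# record is known; the full table is still to be agreed.
map_scl() {
    case "$1" in
        "Cell inner membrane") echo "CytoplasmicMembrane" ;;
        *)                     echo "UNMAPPED" ;;
    esac
}

# In the example record the location term comes before the first ";",
# so drop everything from that ";" onwards before the lookup.
term="Cell inner membrane; multi-pass membrane protein."
map_scl "${term%%;*}"
```

Keeping unmapped terms visible, rather than dropping them, should make it easy to list what still needs a decision.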

Matched References

  • Interestingly the single PMID reference in the ePSORTdb record (11017202) also exists in the Swiss-Prot record.
  • This is probably because few papers report experiments on this protein.


  • Notice that one of the comments in the Swiss-Prot record gives a "SUBCELLULAR LOCATION" of "Cell inner membrane; multi-pass membrane protein".
  • This differs from ePSORTdb's entry of CytoplasmicMembrane.
  • This is an example where the mapping between the two vocabularies is required.

  • Dr. Bairoch mentioned that when the location name is followed by anything in parentheses, for example "(By Similarity)" or "(Predicted)", the entry is not experimentally validated.
  • A quick test of the Swiss-Prot data for entries WITHOUT parentheses suggests that there are
    • 25,607 experimentally validated Bacteria proteins
    • 1,343 experimentally validated Archaea proteins.

% grep "SUBCELLULAR LOCATION" uniprot_sprot_bacteria.dat | grep '\-!-' | grep -v "(" | wc -l
25607

% grep "SUBCELLULAR LOCATION" uniprot_sprot_archaea.dat | grep '\-!-' | grep -v "(" | wc -l
1343

  • A quick test of the Swiss-Prot data for entries WITH parentheses suggests that there are
    • 31,689 NON-experimentally tested Bacteria proteins
    • 1,603 NON-experimentally tested Archaea proteins.

% grep "SUBCELLULAR LOCATION" uniprot_sprot_archaea.dat | grep '\-!-' | grep "(" | wc -l
1603

% grep "SUBCELLULAR LOCATION" uniprot_sprot_bacteria.dat | grep '\-!-' | grep "(" | wc -l
31689

  • NOTE: I may have overlooked something significant, but even in the worst case the actual numbers would be at most four times (4x) smaller. That is, my worst-case guess is about 6,000 validated proteins, which is still a good size.
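
One thing the grep pipelines above can overlook: they test only the line that contains the words SUBCELLULAR LOCATION, while a long comment wraps onto continuation lines, so a qualifier such as "(By similarity)" on a wrapped line goes unnoticed. The sketch below joins each comment before testing it. It assumes the flat-file layout in which a comment starts with "CC   -!- " and continuation lines start with "CC" followed by more spaces; the function name count_scl is hypothetical. It counts comments, which may differ slightly from the number of proteins.

```shell
# Count SUBCELLULAR LOCATION comments with and without a "(" qualifier,
# after joining wrapped continuation lines (a sketch, not validated
# against the full Swiss-Prot release).
count_scl() {
    awk '
        function classify() {
            if (text != "") {
                if (index(text, "(") > 0) qualified++; else plain++
            }
            text = ""
        }
        # A new comment block starts: close the previous one first.
        /^CC   -!- / {
            classify()
            if ($0 ~ /^CC   -!- SUBCELLULAR LOCATION:/) text = $0
            next
        }
        # Continuation line (CC plus four or more spaces) of the open comment.
        /^CC    / && text != "" { text = text " " $0; next }
        # Any other line ends the open comment.
        { classify() }
        END {
            classify()
            printf "plain=%d qualified=%d\n", plain, qualified
        }
    ' "$1"
}
```

Usage would be along the lines of: count_scl uniprot_sprot_bacteria.dat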


  • The Swiss-Prot record has many references for this protein: thirteen (13) of them.
  • Each reference is labelled with an indicator of the type of information that was extracted from the paper.
  • Notice how reference number ten (10) to PMID=16079137 is labeled with "SUBUNIT, AND SUBCELLULAR LOCATION".
  • Notice that this PMID differs from the one referenced in ePSORTdb. The reason for this is likely that it is a newer publication (2005).
  • We can use this label ourselves to get papers that contain experimentally validated OPLs.

  • A quick test of the Swiss-Prot data for references that are labeled "SUBCELLULAR LOCATION" suggests that:
    • There are 543 such references for Bacteria proteins
    • There are 13 such references for Archaea proteins

% grep "SUBCELLULAR LOCATION" uniprot_sprot_bacteria.dat | grep RP | wc -l
543

% grep "SUBCELLULAR LOCATION" uniprot_sprot_archaea.dat | grep RP | wc -l
13

  • Note: the number of papers is significantly smaller than the number of proteins (543 vs. 25,607) because the labeling of papers began later.
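
To go from these counts to an actual list of papers, the RP label can be tied to the PubMed identifier of the same reference. The sketch below assumes that each reference starts with an RN line, that the RP text may wrap over several RP lines, and that the RX line carries the identifier as "PubMed=<digits>;". The function name scl_pmids is hypothetical.

```shell
# Print the distinct PubMed IDs of references whose RP line mentions
# SUBCELLULAR LOCATION (a sketch under the layout assumptions above).
scl_pmids() {
    awk '
        # A new reference starts: forget the previous RP text.
        /^RN / { rp = "" }
        # Collect the RP text, which may wrap over several lines.
        /^RP / { rp = rp " " substr($0, 6) }
        # Emit the PubMed ID when the collected RP text has the label.
        /^RX / && rp ~ /SUBCELLULAR LOCATION/ {
            if (match($0, /PubMed=[0-9]+/))
                print substr($0, RSTART + 7, RLENGTH - 7)
        }
    ' "$1" | sort -u
}
```

Usage would be along the lines of: scl_pmids uniprot_sprot_bacteria.dat | wc -l. Because the same paper can be cited by several entries, the number of distinct PMIDs may come out lower than the RP line counts above.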

New Accession Number

  • Notice that the accession number used in ePSORTdb differs from the one in the Swiss-Prot record. This is a case where the Swiss-Prot record was split into two records two years ago, when they decided to make each entry specific to one organism. Here the protein was formerly shared between "E. coli" and "E. coli O6".
  • It is possible to perform the join by looking through Swiss-Prot's old accession numbers.
  • He took an action item to contact Fiona about updating ePSORTdb's Swiss-Prot reference.
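
The join through old accession numbers can be sketched as follows. It assumes that each entry lists its accessions on one or more AC lines, separated by ";", with the current (primary) accession first and the old (secondary) ones after it, and that "//" ends an entry. The function name old_to_primary is hypothetical.

```shell
# Print one "old<TAB>primary" pair for every secondary accession number
# (a sketch under the layout assumptions above).
old_to_primary() {
    awk '
        /^AC / {
            line = substr($0, 6)
            gsub(/;/, " ", line)
            n = split(line, acc, " ")
            for (i = 1; i <= n; i++) {
                # The first accession seen in an entry is the primary one.
                if (primary == "") primary = acc[i]
                else print acc[i] "\t" primary
            }
        }
        # End of entry: the next AC line starts a new primary accession.
        /^\/\// { primary = "" }
    ' "$1"
}
```

An accession taken from ePSORTdb could then be looked up in the first column of this output to find the record it now belongs to. After a split such as the one described above, an old accession may map to more than one current record, so the lookup can return several rows.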

Whole Paper

  • The paper referenced in Swiss-Prot for the localization of this protein (PMID=16079137) is interesting in that its abstract does not mention the protein.
  • The reason for this absence is that the paper contains a multitude of results: ~43 proteins in total.
  • Notice the wonderfully suggestive title: Protein Complexes of the Escherichia coli Cell Envelope
  • http://www.jbc.org/cgi/reprint/280/41/34409.pdf
  • (BTW, I manually experimented with the whole document. The first challenge is that the PDF is not in PubMed Central; the second is that even with Adobe's latest version of Acrobat the extracted text is still very noisy: many sentences are chopped up and spaces are missing. I.e. not a discouraging result)

General Recommendations for IE from BioMed papers

  • One way to improve performance will be to use the whole document not just the abstract.
  • His experience with text mining, however, also suggests that PDF-to-text conversion is problematic.

Other Candidate IE Tasks

His comments on other candidate IE tasks that came to mind:

Post Translational Modification (PTM)

Mutation and Variations

  • Another area that he thought relevant is "mutation and variations".
  • He pointed me to one of the earlier papers on the application of information extraction to this domain:

Transcription Regulation

Miscellaneous Notes