PPLRE Resources
This page provides an overview of the external resources that we plan to use for the PPLRE Project.
PSORTdb
See: PPLRE ePSORTdbOrganismProteinLocalization Table
See: PPLRE cPSORTdbOrganismProteinLocalization Table
PubMed
See: PubMed
See: PubMed Central
Stanford Parser
See: Stanford Parser
See: PPLRE Stanford Parser
NCBI
See: NCBI.
- It is the source of our OrganismID (See: “NCBIOrganismName and NCBIOrganismTreeNodes Tables” section of detailed data design)
Swiss-Prot
- See: Swiss-Prot.
- It is the data source for the SProtProteinProkaryote table. (see section 5.1.16)
TrEMBL (Translation of EMBL)
- “TrEMBL is automatically generated (from annotated EMBL coding sequences (CDS)) and annotated using software tools. Contains all of what is not in SWISS-PROT. SWISS-PROT + TrEMBL = all known protein sequences. Once in SWISS-PROT, the entry is no more in TrEMBL, but still in EMBL (archive).”
UniProtKB (Universal Protein Knowledge Base)
- “UniProtKB is the central hub for the collection of functional information on proteins, with accurate, consistent, and rich annotation. In addition to capturing the core data mandatory for each UniProtKB entry (principally, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and clear indications of the quality of annotation in the form of evidence attribution of experimental and computational data. Created by merging the data in Swiss-Prot, TrEMBL and PIR-PSD, individual UniProt Knowledgebase entries may contain more information than was available in any given separate source database. The UniProt Knowledgebase consists of two sections: a section containing manually-annotated records with information extracted from literature and curator-evaluated computational analysis, and a section with computationally analyzed records that await full manual annotation. For the sake of continuity and name recognition, the two sections are referred to as ‘Swiss-Prot’ and ‘TrEMBL’, respectively.”
- UniProt Knowledgebase Release 7.0: The UniProt consortium (European Bioinformatics Institute (EBI), Swiss Institute of Bioinformatics (SIB) and Protein Information Resource (PIR)) is pleased to announce UniProt Knowledgebase (UniProtKB) Release 7.0 (07-Feb-2006). UniProt (Universal Protein Resource) is a comprehensive catalog of information on proteins. UniProtKB Release 7.0 consists of 2,812,716 entries (UniProtKB/Swiss-Prot: 207,132 entries and UniProtKB/TrEMBL: 2,605,584 entries).
- UniProt databases can be accessed from the web at <a href="http://www.uniprot.org/">http://www.uniprot.org</a> and downloaded from <a href="http://www.uniprot.org/database/download.shtml">http://www.uniprot.org/database/download.shtml</a>. Detailed release statistics for the TrEMBL and Swiss-Prot sections of the UniProt Knowledgebase can be viewed at <a href="http://www.ebi.ac.uk/swissprot/sptr_stats/index.html">http://www.ebi.ac.uk/swissprot/sptr_stats/index.html</a> and <a href="http://www.expasy.org/sprot/relnotes/relstat.html">http://www.expasy.org/sprot/relnotes/relstat.html</a> respectively.
ExPASy (Expert Protein Analysis System)
- ExPASy interface. The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) is dedicated to the analysis of protein sequences and structures as well as 2-D PAGE.
=== GUI Examples ===
=== Data ===
- Complete non-redundant sets of complete proteomes in UniProtKB.
- ftp://ca.expasy.org/README
- ftp://ca.expasy.org/databases/complete_proteomes/entries/bacteria/PSEAE.dat (PSEAE appears to be a short form of PSEudomonas AEruginosa)
UMLS (Unified Medical Language System)
- The PPLRE project plans to use UMLS to assist with linguistic concept identification and Named Entity Recognition (NER). Specifically, the Annotator’s Conceptualizer module will make use of the Metathesaurus, the Semantic Network and the MMTx tool. UMLS may also become a source for organism instances.
- It is a very large ontology that covers more than 100 source vocabularies, including GO, NCBI-taxonomy, MeSH, HUGO, etc.
- It provides several linguistics-oriented tools. One of them, MMTx, performs named-entity annotation and is used as the pre-processing step of our NER module.
- Website: http://www.nlm.nih.gov/research/umls/
- “The purpose of the National Library of Medicine's (NLM’s) UMLS® is to facilitate the development of computer systems that behave as if they "understand" the meaning of the language of biomedicine and health. To that end, NLM produces and distributes the UMLS Knowledge Sources (databases) and associated software tools (programs) for use by system developers in building or enhancing electronic information systems that create, process, retrieve, integrate, and/or aggregate biomedical and health data and information, as well as in informatics research.” http://www.nlm.nih.gov/research/umls/about_umls.html.
- UMLS consists of three components:
=== Metathesaurus ===
- A large multi-lingual vocabulary database that includes biomedical and health related concepts, their various terms and the relationships among them. Includes more than 100 vocabulary sources, such as MeSH and GO.
- “The UMLS Metathesaurus is a very large, multi-purpose, and multi-lingual vocabulary database that contains information about biomedical and health related concepts, their various names, and the relationships among them. Designed for use by system developers, the Metathesaurus is built from the electronic versions of many different thesauri, classifications, code sets, and lists of controlled terms used in patient care, health services billing, public health statistics, indexing and cataloging biomedical literature, and/or basic, clinical, and health services research.” <a href="http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html">http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html</a>
=== Semantic Network ===
- An ontology of concepts and their relationships.
- “The Semantic Network consists of (1) a set of broad subject categories, or Semantic Types, that provide a consistent categorization of all concepts represented in the UMLS Metathesaurus®, and (2) a set of useful and important relationships, or Semantic Relations, that exist between Semantic Types. This section of the documentation provides an overview of the Semantic Network, and describes the files of the Semantic Network. Sample records illustrate structure and content of these files.” <a href="http://www.nlm.nih.gov/pubs/factsheets/umlssemn.html">http://www.nlm.nih.gov/pubs/factsheets/umlssemn.html</a>
=== SPECIALIST Lexicon ===
- Lexical information of names
- “The SPECIALIST lexicon has been developed to provide the lexical information needed for the SPECIALIST Natural Language Processing System (NLP). It is intended to be a general English lexicon that includes many biomedical terms. Coverage includes both commonly occurring English words and biomedical vocabulary. The lexicon entry for each word or term records the syntactic, morphological, and orthographic information needed by the SPECIALIST NLP System.” <a href="http://www.nlm.nih.gov/pubs/factsheets/umlslex.html">http://www.nlm.nih.gov/pubs/factsheets/umlslex.html</a>
- Some
- Organism names: UMLS has 383,064 organism names in its Metathesaurus. This
- Protein names: 330,192 names under semantic type "Amino Acid, Peptide or
- Prokaryote-protein relations: 40,263 pairs, most are co-occurrence
- UMLS has a web-based query interface <a href="http://umlsks.nlm.nih.gov/">http://umlsks.nlm.nih.gov/</a> (requires free registration, which takes a few days to process)
- MetaMap Transfer (MMTx)
Protein/Gene Named Entity Recognition, NLP Research
- A significant amount of recent research addresses the problem of correctly identifying genes and proteins within natural language text (see the sample listing of papers below). Unfortunately, most of these papers do not appear to be accompanied by openly available programs. So, instead of developing these research solutions from scratch, we plan to stick with freely available, executable programs (see GENIA above).
- Entity Types
1. GENIA Ontology
- <a href="http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/genia-ontology.html">http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/genia-ontology.html</a>
- PROTEIN
- domain or region of DNA
- CELL_COMPONENT
- Tasks/Datasets
1. Bio-Entity Recognition Task at BioNLP/NLPBA 2004
- <a href="http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ERtask/report.html">http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ERtask/report.html</a>
- Held from March to April 2004
- Demos:
1. <a href="http://nlp.i2r.a-star.edu.sg/demo_bioner.html">http://nlp.i2r.a-star.edu.sg/demo_bioner.html</a>
- Papers:
- Contextual weighting for Support Vector Machines in literature mining: an application to gene versus protein name disambiguation. T. Pahikkala, et al. <a href="http://www.biomedcentral.com/1471-2105/6/157">http://www.biomedcentral.com/1471-2105/6/157</a>
- Recognition of protein/gene names from text using an ensemble of classifiers. G. Zhou, et al. <a href="http://www.biomedcentral.com/1471-2105/6/S1/S7">http://www.biomedcentral.com/1471-2105/6/S1/S7</a>
- Exploring the boundaries: gene and protein identification in biomedical text. J. Finkel, et al. <a href="http://www.biomedcentral.com/1471-2105/6/S1/S5">http://www.biomedcentral.com/1471-2105/6/S1/S5</a>
- A simple approach for protein name identification: prospects and limits. Katrin Fundel, et al. <a href="http://www.biomedcentral.com/1471-2105/6/S1/S15">http://www.biomedcentral.com/1471-2105/6/S1/S15</a>
- ProMiner: rule-based protein and gene entity recognition. D. Hanisch, et al. <a href="http://www.biomedcentral.com/1471-2105/6/S1/S14">http://www.biomedcentral.com/1471-2105/6/S1/S14</a>
- Gene/protein name recognition based on support vector machine using dictionary as features. T. Mitsumori, et al. <a href="http://www.biomedcentral.com/1471-2105/6/S1/S8">http://www.biomedcentral.com/1471-2105/6/S1/S8</a>
- Using co-occurrence network structure to extract
Snowball
See: PPLRE Snowball
Genbank
- See: GenBank.
- One challenge with GenBank is that it contains many redundant entries and unconfirmed sequences. That said, GenBank IDs are used more often than TrEMBL IDs. The non-redundant curated set of IDs can be found within folders in /home/shared/NCBI_Genomes/curated/Bacteria. The .faa files contain the IDs and protein names/descriptions (along with protein sequences, which can be ignored).
- The other set of GI numbers (with redundancies), which people sometimes use in recent literature, can be found at ftp://ftp.ncbi.nih.gov/genbank/. It is unclear which files are the most useful for this project, though. The livelists folder contains a list of GI and accession numbers for all the entries in GenBank. There are also the gbbct1.seq.gz to gbbct13.seq.gz files, which contain too much information (full GenBank flat files; it is unclear whether these hold only DNA or DNA plus proteins). There are supposed to be index files that contain less information (just the accession and GI IDs), but according to the release notes, there was trouble generating them for this release (152).
Bio-Acronym Databases
Acronyms are regularly used in biomedical articles. The following datasets may help us resolve the meaning of the abbreviations that we encounter in our task.
- ARGH (<a href="http://invention.swmed.edu/argh/">invention.swmed.edu/argh</a>): about 221,000 unique acronyms. Zhongmin has the entire database (attributes: acronym, full form, accuracy, context, etc.)
- Acromed
- Stanford Abbr. (<a href="http://abbreviation.stanford.edu/">abbreviation.stanford.edu</a>): 2,074,367 abbreviations, program accessible. An example of searching "CPR":
Ind | Abbr. | Long Form | Quality (Score) | #Docs
1 | CPR | Cardio-Pulmonary Resuscitation | Excellent (0.91) | 1,154
2 | CPR | Computer Based Patient Records | Excellent (0.59) | 65
3 | CPR | C peptide immunoreactivity | Good (0.33) | 52
4 | CPR | Cefpirome | Good (0.34) | 32
5 | CPR | C-Peptide | Good (0.13) | 29
6 | CPR | Computerised Patient Record | Excellent (0.91) | 18
7 | CPR | chicken progesterone receptor | Excellent (0.91) | 14
8 | CPR | NADPH--cytochrome P450 reductase | Excellent (0.91) | 13
9 | CPR | C-peptide reactivity | Excellent (0.86) | 10
10 | CPR | Cefpirome sulfate | Good (0.07) | 10
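As a rough illustration of how such a lookup result could be used to resolve an abbreviation, the following Python sketch ranks the candidate long forms from the "CPR" search above. The names `Expansion` and `rank_expansions`, the 0.5 score cut-off and the choice to sort by document count are all assumptions made for this example; they are not part of the Stanford service.

```python
from typing import List, NamedTuple


class Expansion(NamedTuple):
    """One candidate long form for an abbreviation (hypothetical record type)."""
    abbr: str
    long_form: str
    score: float
    docs: int


# Rows copied from the "CPR" search result listed above.
CPR_ROWS = [
    Expansion("CPR", "Cardio-Pulmonary Resuscitation", 0.91, 1154),
    Expansion("CPR", "Computer Based Patient Records", 0.59, 65),
    Expansion("CPR", "C peptide immunoreactivity", 0.33, 52),
    Expansion("CPR", "Cefpirome", 0.34, 32),
    Expansion("CPR", "C-Peptide", 0.13, 29),
    Expansion("CPR", "Computerised Patient Record", 0.91, 18),
    Expansion("CPR", "chicken progesterone receptor", 0.91, 14),
    Expansion("CPR", "NADPH--cytochrome P450 reductase", 0.91, 13),
    Expansion("CPR", "C-peptide reactivity", 0.86, 10),
    Expansion("CPR", "Cefpirome sulfate", 0.07, 10),
]


def rank_expansions(rows: List[Expansion], min_score: float = 0.5) -> List[Expansion]:
    """Drop low-quality candidates, then order the rest by document count.

    The threshold and the ordering are illustrative assumptions only.
    """
    kept = [row for row in rows if row.score >= min_score]
    return sorted(kept, key=lambda row: row.docs, reverse=True)


ranked = rank_expansions(CPR_ROWS)
```

A frequency-based ranking like this favours the clinical sense of "CPR", so for our task a further filter on protein-related long forms would probably be needed.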
FASTA File Format
- <a href="http://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml">http://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml</a>
- “A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:”
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
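The format described above can be read with a few lines of Python. This is a minimal sketch, not the project's actual loader: the function name `read_fasta` is made up for illustration, and the embedded example keeps only the first and last sequence lines of the record above to stay short.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data continues over several lines
    if header is not None:
        yield header, "".join(chunks)


# Shortened version of the example record above (first and last sequence lines only).
EXAMPLE = (
    ">gi|532319|pir|TVFV2E|TVFV2E envelope protein\n"
    "ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT\n"
    "LAAVEAQQQMLKLTIWGVK\n"
)
records = list(read_fasta(io.StringIO(EXAMPLE)))
```

The same function would apply to the .faa files mentioned in the GenBank section, where only the description lines are of interest.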
- From <a href="http://en.wikipedia.org/wiki/FASTA_format">http://en.wikipedia.org/wiki/FASTA_format</a>: “The FASTA defline format is not formally
Database name                      Identifier syntax
GenBank                            gi|gi-number|gb|accession|locus
EMBL Data Library                  gi|gi-number|emb|accession|locus
DDBJ, DNA Database of Japan        gi|gi-number|dbj|accession|locus
NBRF PIR                           pir||entry
Protein Research Foundation        prf||name
SWISS-PROT                         sp|accession|name
Brookhaven Protein Data Bank (1)   pdb|entry|chain
Brookhaven Protein Data Bank (2)   entry:chain|PDBID|CHAIN|SEQUENCE
Patents                            pat|country|number
GenInfo Backbone Id                bbs|number
General database identifier        gnl|database|identifier
NCBI Reference Sequence            ref|accession|locus
Local Sequence identifier          lcl|identifier
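To make the identifier syntax above concrete, here is a hedged Python sketch that splits a defline into its tagged identifiers. The function name `parse_defline` and the `FIELD_COUNTS` table are assumptions for illustration; the field counts are simply read off the list above, and the second Brookhaven form (entry:chain|PDBID|CHAIN|SEQUENCE) is deliberately not handled.

```python
from typing import List, Tuple

# Number of "|"-separated fields that follow each database tag,
# taken from the identifier list above (illustrative, not exhaustive).
FIELD_COUNTS = {
    "gi": 1, "gb": 2, "emb": 2, "dbj": 2, "pir": 2, "prf": 2, "sp": 2,
    "pdb": 2, "pat": 2, "bbs": 1, "gnl": 2, "ref": 2, "lcl": 1,
}


def parse_defline(defline: str) -> Tuple[List[Tuple[str, List[str]]], str]:
    """Split a FASTA defline into (tag, fields) pairs plus the free-text description."""
    text = defline.lstrip(">").strip()
    id_part, _, description = text.partition(" ")
    parts = id_part.split("|")
    ids: List[Tuple[str, List[str]]] = []
    i = 0
    while i < len(parts):
        tag = parts[i]
        if tag == "":
            i += 1  # tolerate a trailing "|" after the last field
            continue
        if tag not in FIELD_COUNTS:
            raise ValueError("unknown database tag: %r" % tag)
        count = FIELD_COUNTS[tag]
        ids.append((tag, parts[i + 1:i + 1 + count]))
        i += 1 + count
    return ids, description.strip()


ids, description = parse_defline(">gi|532319|pir|TVFV2E|TVFV2E envelope protein")
```

Since the defline format is not formally standardised, real files will contain headers this sketch rejects; raising an error makes those cases visible instead of silently mis-parsing them.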
Gene Ontology (GO) Cellular Component Ontology
- “The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism. The three organizing principles of GO are molecular function, biological process and cellular component. … The cellular component ontology describes locations, at the levels of subcellular structures and macromolecular complexes. Examples of cellular components include nuclear inner membrane, with the synonym inner envelope, and the ubiquitin ligase complex, with several subtypes of these complexes represented. Generally, a gene product is located in or is a subcomponent of a particular cellular component. The cellular component ontology includes multi-subunit enzymes and other protein complexes, but not individual proteins or nucleic acids. Cellular component also does not include multicellular anatomical terms.”
- “GO Accession ID” is one of the fields in ePSORTdb (see ePSORTdb)
- <a href="http://www.geneontology.org/GO.component.guidelines.shtml">http://www.geneontology.org/GO.component.guidelines.shtml</a>
- The Gene Ontology (GO) is one of the more popular ontology sources used by biologists. E.g. the cellular component ontology is used by the following biology ontologies: BioPax <owl:ObjectProperty rdf:about="#CELLULAR-LOCATION"> <a href="http://www.biopax.org/release/biopax-level2.owl">http://www.biopax.org/release/biopax-level2.owl</a>; and INOH (Integrating Network Objects with Hierarchies).
- Example #1) extracellular region
- Accession: GO:0005576
- Ontology: cellular_component
- Synonyms:
1. exact: extracellular
- Definition: The space external to the
- Example #2) plasma membrane
- Accession: GO:0005886
- Ontology: cellular_component
- Synonyms:
1. related: plasma membrane cation-transporting ATPase
2. related: plasma membrane long-chain fatty acid transporter
3. narrow: bacterial inner membrane
4. exact: cell membrane
5. exact: cytoplasmic membrane
6. exact: plasmalemma
7. broad: juxtamembrane
- Definition:
1. The membrane surrounding a cell that separates the cell from its external environment. It consists of a phospholipid bilayer and associated proteins.
- Example #3) periplasmic space (sensu Proteobacteria)
- Accession: GO:0030288
- Ontology: cellular_component
- Synonyms:
- exact: periplasmic space (sensu Gram-negative bacteria)
- broad: periplasm
- broad: periplasmic space
- Definition: The region between the inner (cytoplasmic) membrane and outer membrane. As in, but not restricted to, the Gram-negative bacteria (Proteobacteria, ncbi_taxonomy_id:1224).
- Example #4) cytoplasm
- Accession: GO:0005737
- Ontology: cellular_component
- Synonyms: None
- Definition: All of
- Example #5) cellular_component
- Accession: GO:0005575
- Ontology: cellular_component
- Synonyms: None
- Definition: The part
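The GO examples above suggest a simple way to map a localization phrase found in text to a GO accession. The Python sketch below is only an illustration under stated assumptions: the `GoTerm` record, `build_index` and `lookup` are hypothetical names, only three of the terms listed above are included, and only the term name and "exact" synonyms are indexed (broad, narrow and related synonyms are ignored on purpose).

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# GO accessions are "GO:" followed by seven digits.
GO_ID = re.compile(r"^GO:\d{7}$")


@dataclass
class GoTerm:
    """Hypothetical in-memory record for one cellular component term."""
    accession: str
    name: str
    ontology: str = "cellular_component"
    synonyms: Dict[str, List[str]] = field(default_factory=dict)  # scope -> synonyms


# Three of the example terms listed above.
TERMS = [
    GoTerm("GO:0005576", "extracellular region", synonyms={"exact": ["extracellular"]}),
    GoTerm(
        "GO:0005886",
        "plasma membrane",
        synonyms={
            "exact": ["cell membrane", "cytoplasmic membrane", "plasmalemma"],
            "narrow": ["bacterial inner membrane"],
            "broad": ["juxtamembrane"],
        },
    ),
    GoTerm("GO:0005737", "cytoplasm"),
]


def build_index(terms: List[GoTerm]) -> Dict[str, str]:
    """Map lower-cased names and exact synonyms to accessions."""
    index: Dict[str, str] = {}
    for term in terms:
        if not GO_ID.match(term.accession):
            raise ValueError("malformed GO accession: %r" % term.accession)
        index[term.name.lower()] = term.accession
        for synonym in term.synonyms.get("exact", []):
            index[synonym.lower()] = term.accession
    return index


def lookup(index: Dict[str, str], phrase: str) -> Optional[str]:
    """Return the accession for a phrase, or None when it is not indexed."""
    return index.get(phrase.strip().lower())


INDEX = build_index(TERMS)
```

Restricting the index to exact synonyms avoids mapping a phrase such as "juxtamembrane" to a term it only loosely relates to.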
NLM (National Library of Medicine)
- “The National Library of Medicine (NLM), on the campus of the National Institutes of Health (NIH) in Bethesda, Maryland, is the world's largest medical library. The Library collects materials in all areas of biomedicine and health care, as well as works on biomedical aspects of technology, the humanities, and the physical, life, and social sciences. The collections stand at more than 7 million items--books, journals, technical reports, manuscripts, microfilms, photographs and images.”
- Participates in: NCBI, MeSH, UMLS
- <a href="http://www.nlm.nih.gov/">http://www.nlm.nih.gov/</a>
PDB (Protein Data Bank)
- “The Protein Data Bank (PDB) uses macromolecular Crystallographic Information File (mmCIF) data dictionaries to describe the information content of PDB entries. The RCSB PDB provides a variety of tools and resources for studying the structures of biological macromolecules and their relationships to sequence, function, and disease. The RCSB (Research Collaboratory for Structural Bioinformatics) is a member of the wwPDB whose mission is to ensure that the PDB archive remains an international resource with uniform data.”
- <a href="http://www.pdb.org/">http://www.pdb.org</a>
- <a href="ftp://ftp.rcsb.org/pub/pdb/">ftp://ftp.rcsb.org/pub/pdb/</a>
- The Worldwide Protein Data Bank (wwPDB) consists of three member organizations that act as deposition, data processing and distribution centers for PDB data. The founding members are RCSB PDB (USA), MSD-EBI (Europe) and PDBj (Japan). The mission of the wwPDB is to maintain a single Protein Data Bank Archive of macromolecular structural data that is freely and publicly available to the global community. H. Berman, et al. (2003): Announcing the worldwide Protein Data Bank. Nature Structural Biology 10 (12), p. 980. <a href="http://www.wwpdb.org/">http://www.wwpdb.org/</a>
SWISS-PROT Accession Format
Line code  Content                       Occurrence in an entry
ID         Identification                Once; starts the entry
AC         Accession number(s)           Once or more
DT         Date                          Three times
DE         Description                   Once or more
GN         Gene name(s)                  Optional
OS         Organism species              Once or more
OG         Organelle                     Optional
OC         Organism classification       Once or more
RN         Reference number              Once or more
RP         Reference position            Once or more
RC         Reference comment(s)          Optional
RX         Reference cross-reference(s)  Optional
RA         Reference authors             Once or more
RT         Reference title               Optional
RL         Reference location            Once or more
CC         Comments or notes             Optional
DR         Database cross-references     Optional
KW         Keywords                      Optional
FT         Feature table data            Optional
SQ         Sequence header               Once
(blanks)   Amino acid sequence           Once
//         Termination line              Once; ends the entry
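The line codes above lend themselves to a simple line-by-line reader. The following Python sketch groups the lines of one entry by their two-letter code. It assumes the usual Swiss-Prot flat-file layout (code in the first two columns, data starting at column 6, sequence lines beginning with blanks, "//" ending the entry); the function name `parse_entry` and the toy entry, including the accession P00000, are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_entry(lines: Iterable[str]) -> Tuple[Dict[str, List[str]], str]:
    """Group one flat-file entry by line code and collect its sequence."""
    fields: Dict[str, List[str]] = defaultdict(list)
    sequence: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        if line.startswith("//"):
            break  # termination line: the entry is complete
        code, data = line[:2], line[5:]
        if code.strip() == "":
            # Sequence lines carry no code; drop the spacing between residue blocks.
            sequence.append(line.replace(" ", ""))
        else:
            fields[code].append(data)
    return dict(fields), "".join(sequence)


# A made-up, heavily shortened entry used only to exercise the parser.
TOY_ENTRY = [
    "ID   TEST_ECOLI     STANDARD;      PRT;    10 AA.",
    "AC   P00000;",
    "DE   Hypothetical example protein.",
    "OS   Escherichia coli.",
    "SQ   SEQUENCE   10 AA;",
    "     MKTAYIAKQR",
    "//",
]
entry_fields, entry_sequence = parse_entry(TOY_ENTRY)
```

For our purposes the AC, DE, GN and OS lines are the most relevant, since they hold the accession, protein name, gene name and organism.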
PDF file conversion
About an hour of experimentation with a few tools led to the selection of Xpdf: <a href="http://www.foolabs.com/xpdf/">http://www.foolabs.com/xpdf/</a>
- Xpdf 3.01pl2 was released 2006-feb-08.
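As a sketch of how the conversion could be scripted, the Python snippet below wraps the `pdftotext` command-line tool that ships with Xpdf. The helper names `pdftotext_command` and `convert_pdf` are hypothetical, the default output name (same stem with a .txt suffix) is an assumption of this example, and `-layout` is the only option used.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def pdftotext_command(pdf_path: str, txt_path: Optional[str] = None,
                      layout: bool = False) -> List[str]:
    """Build the argument list for Xpdf's pdftotext without running it."""
    pdf = Path(pdf_path)
    txt = Path(txt_path) if txt_path else pdf.with_suffix(".txt")
    command = ["pdftotext"]
    if layout:
        command.append("-layout")  # keep the physical page layout
    command += [str(pdf), str(txt)]
    return command


def convert_pdf(pdf_path: str, txt_path: Optional[str] = None,
                layout: bool = False) -> Path:
    """Run pdftotext and return the path of the text file it wrote."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH; is Xpdf installed?")
    command = pdftotext_command(pdf_path, txt_path, layout)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return Path(command[-1])
```

Calling `convert_pdf("paper.pdf")` would, under these assumptions, write `paper.txt` next to the input file.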
Protein Information Resource (PIR) Center
- iProLink
- WebSite: <a href="http://pir.georgetown.edu/pirwww/iprolink/protname.shtml">http://pir.georgetown.edu/pirwww/iprolink/protname.shtml</a>
- Data:
- BTW,
Text Mining in Biomedicine
- <a href="http://blimp.cs.queensu.ca/cateR_1.html">http://blimp.cs.queensu.ca/cateR_1.html</a>
- Contains several review papers on text mining in biomedicine:
KDDCup08 Proposal
* KDDCup08 Proposal