Data types and mapping
Overview
This document describes the data types currently implemented within TPB, together with the source-dependent mappings from each data source to these types. For each data source (currently neXtProt, GPM, Human Protein Atlas and Gene Expression Barcode), it gives the mapping between the raw source data and the TPB data types, and the thresholds used to define the quality score (i.e. the definitions of the green, yellow, red and black traffic lights).
Current TPB data types
Protein Expression (PE)
Description:
Evidence for the presence of protein expression. It is a summary of numerous underlying data types, currently from MS- or antibody-based methods, or curated annotation. Note that it is not a quantitative measure of expression level.
Data level:
1
Parent data type:
N/A
Child data types:

Protein Expression by Mass Spectrometry (PE MS)
Description:
Direct MS-based evidence for protein expression. It is a summary of several underlying data types.
Data level:
2
Parent data type:
PE
Child data types:

Annotation of protein expression by Mass Spectrometry (PE MS ANN)
Description:
Annotated, indirect evidence for MS-based detection of protein expression.
Data level:
3
Parent data type:
Child data types:
None
Direct data sources:

Probability-based MS detection of protein expression (PE MS PROB)
Description:
Evidence for protein expression by MS, based upon the highest probability in a single analysis.
Data level:
3
Parent data type:
Child data types:
None
Direct data sources:
GPM

Frequency of MS detection (PE MS SAM)
Description:
Repeated detection of protein expression by MS.
Data level:
3
Parent data type:
Child data types:
None
Direct data sources:
GPM

Protein expression by antibody technologies (PE ANTI)
Description:
Antibody-based evidence for protein expression. It is a summary of several underlying data types.
Data level:
2
Parent data type:
PE
Child data types:

Annotation of antibodies (PE ANTI ANN)
Description:
Annotated availability of antibodies in the Human Protein Atlas.
Data level:
3
Parent data type:
Child data types:
None
Direct data sources:

Immunohistochemical detection of protein expression (PE ANTI IHC)
Description:
Detection of protein expression using immunohistochemical methods.
Data level:
3
Parent data type:
Child data types:

Immunohistochemical detection in normal tissues (PE ANTI IHC NORM)
Description:
Detection of protein expression in “normal” (non-diseased) tissue by immunohistochemical methods.
Data level:
4
Parent data type:
Child data types:
None
Direct data sources:

Other evidence for protein expression (PE OTH)
Description:
Any evidence for protein expression that is neither MS-based nor antibody-based.
Data level:
2
Parent data type:
PE
Child data types:

Curated annotation of protein expression (PE OTH CUR)
Description:
Curated annotation of protein expression.
Data level:
3
Parent data type:
Child data types:
None
Direct data sources:

Post Translational Modification (PTM)
Description:
Evidence for the presence of post-translational modifications. It is a summary of underlying data types.
Data level:
1
Parent data type:
N/A
Child data types:

Acetylation (PTM ACE)
Description:
Evidence for acetylation of a gene product (protein). It is a summary of underlying data types.
Data level:
2
Parent data type:
PTM
Child data types:

Lysine Acetylation (PTM ACE LYS)
Description:
Evidence for acetylation on one or more lysine residues of at least one gene product.
Data level:
3
Parent data type:
Child data types:
None
Direct data sources:

N-Terminal acetylation (PTM ACE NTA)
Description:
Evidence for acetylation on the N-terminus of at least one gene product.
Data level:
3
Parent data type:
Child data types:
None
Direct data sources:

Phosphorylation (PTM PHS)
Description:
Evidence for phosphorylation of a gene product (protein). It is a summary of underlying data types.
Data level:
2
Parent data type:
PTM
Child data types:

Phosphoserine (PTM PHS SER)
Description:
Evidence for phosphorylation on one or more serine residues of at least one gene product.
Data level:
3
Parent data type:
Child data types:
None
Direct data sources:

Phosphothreonine (PTM PHS THR)
Description:
Evidence for phosphorylation on one or more threonine residues of at least one gene product.
Data level:
3
Parent data type:
Child data types:
None
Direct data sources:

Phosphotyrosine (PTM PHS TYR)
Description:
Evidence for phosphorylation on one or more tyrosine residues of at least one gene product.
Data level:
3
Parent data type:
Child data types:
None
Direct data sources:

Transcript Expression (TE)
Description:
Evidence for the presence of transcript expression. It is a summary of underlying data types, currently from microarray methods. Note that it is not a quantitative measure of expression level.
Data level:
1
Parent data type:
N/A
Child data types:

Microarray Transcript Expression (TE MA)
Description:
Evidence for RNA transcript expression based on microarray technology. It is a summary of underlying data types.
Data level:
2
Parent data type:
TE
Child data types:

Microarray Sample Proportion (TE MA PROP)
Description:
Microarray-based, tissue-specific evidence for transcript expression from the Gene Expression Barcode. It is based on the proportion of samples from a tissue in which RNA expression is detected.
Data level:
3
Parent data type:
Child data types:
None
Direct data sources:

Other Transcript Expression (TE OTH)
Description:
Evidence for RNA transcript expression from other sources. It is a summary of underlying data types.
Data level:
2
Parent data type:
TE
Child data types:

Curated annotation of Transcript Expression (TE OTH CUR)
Description:
Curated annotation of transcript expression. Currently based on annotation from neXtProt (protein existence).
Data level:
3
Parent data type:
Child data types:
None
Direct data sources:

Source to data type and quality score mappings
Introduction
For each source repository, this section summarises the source files used, the data types derived from the source, and the colour mappings applied to each.
neXtProt
Source file format:
XML
Source repository:
 
Data mapping:
 
1. TPB data type:
Source data:
XPath: proteins/protein/proteinExistence@value
Quality score:
Based on direct mapping from the source data to the following colour levels.
4 (green):
“protein level”
3 (yellow):
N/A
2 (red):
N/A
1 (black):
“transcript level”, “homology”, “predicted” or “uncertain”

2. TPB data type:
Source data:
XPath: proteins/protein/xrefs/xref where @category=“Antibody databases”, @database=“HPA” and @accession starts with “CAB” or “HPA”
Quality score:
Based on the number of antibodies available. Note that there is only one entry per protein entry.
4 (green):
N/A
3 (yellow):
count > 1
2 (red):
count = 1
1 (black):
count = 0
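The antibody mapping above can be sketched in Python as a count followed by a threshold lookup. The XML fragment below is hypothetical: it is shaped after the XPath given in the text, its accession values are made up, and real neXtProt files may be structured differently. The function names are also illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the XPath in the text; accessions are invented.
SAMPLE = """
<proteins>
  <protein>
    <xrefs>
      <xref category="Antibody databases" database="HPA" accession="HPA000001"/>
      <xref category="Antibody databases" database="HPA" accession="CAB000002"/>
      <xref category="Proteomic databases" database="PRIDE" accession="X1"/>
    </xrefs>
  </protein>
</proteins>
"""


def count_hpa_antibodies(protein: ET.Element) -> int:
    """Count HPA antibody xrefs whose accession starts with CAB or HPA."""
    xrefs = protein.findall(
        "xrefs/xref[@category='Antibody databases'][@database='HPA']"
    )
    return sum(
        1 for xref in xrefs if xref.get("accession", "").startswith(("CAB", "HPA"))
    )


def antibody_score(count: int) -> int:
    """Map an antibody count to a colour level: 3 yellow, 2 red, 1 black."""
    if count > 1:
        return 3
    if count == 1:
        return 2
    return 1


root = ET.fromstring(SAMPLE.strip())
for protein in root.findall("protein"):
    n = count_hpa_antibodies(protein)
    print(n, antibody_score(n))
```

Green (4) is never returned, matching the N/A entry in the mapping.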

3. TPB data type:
Source data:
XPath: proteins/protein/xrefs/xref where @category=“Proteomic databases” and @database=“PeptideAtlas” or “PRIDE”
Quality score:
Based on the @database value.
4 (green):
N/A
3 (yellow):
PeptideAtlas
2 (red):
PRIDE
1 (black):
N/A

4. TPB data type:
Source data:
XPath: proteins/protein/proteinExistence@value
Quality score:
Based on direct mapping from the source data to the following colour levels.
4 (green):
“protein level” or “transcript level”
3 (yellow):
N/A
2 (red):
N/A
1 (black):
“homology”, “predicted” or “uncertain”
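Mappings 1 and 4 above read the same proteinExistence value but differ in which values count as green. A minimal Python sketch of that difference follows. The function names are hypothetical, and rejecting unlisted values with an error is an assumption, since the text does not say how unexpected values are handled.

```python
# Colour levels as used throughout this document.
GREEN, YELLOW, RED, BLACK = 4, 3, 2, 1

# The five proteinExistence values named in the text.
KNOWN_VALUES = {
    "protein level",
    "transcript level",
    "homology",
    "predicted",
    "uncertain",
}


def existence_score(value: str, green_values: set) -> int:
    """Return GREEN if the value is in green_values, otherwise BLACK."""
    if value not in KNOWN_VALUES:
        # Assumption: unknown values are treated as errors rather than scored.
        raise ValueError(f"unexpected proteinExistence value: {value!r}")
    return GREEN if value in green_values else BLACK


def mapping_1_score(value: str) -> int:
    """Mapping 1: only 'protein level' is green."""
    return existence_score(value, {"protein level"})


def mapping_4_score(value: str) -> int:
    """Mapping 4: 'protein level' and 'transcript level' are both green."""
    return existence_score(value, {"protein level", "transcript level"})


for v in sorted(KNOWN_VALUES):
    print(v, mapping_1_score(v), mapping_4_score(v))
```

The only value scored differently by the two mappings is “transcript level”.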

GPM (PE)
Source file format:
XML (customised by R. Beavis for TPB)
Source repository:
 
Data file(s):
URL to the current version is available in an RSS feed at http://gpmdb.thegpm.org/tpb/current.xml
Schema:
gpm2tpb_schema.xsd
Data mapping:
 
1. TPB data type:
Source data:
XPath: gpmdbsummary/protein/identification@beste
Quality score:
Based on the best log(e) score for each protein, i.e. the most negative value, which corresponds to the highest-probability identification.
4 (green):
less than or equal to -10
3 (yellow):
less than or equal to -3 and greater than -10
2 (red):
less than or equal to -1 and greater than -3
1 (black):
greater than -1
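The log(e) thresholds above form a simple cascade, which can be sketched in Python as follows. The function name is hypothetical, and the input is assumed to be the numeric value already read from the source attribute.

```python
def log_e_score(best_log_e: float) -> int:
    """Map a protein's log(e) value to a colour level (4 green .. 1 black)."""
    if best_log_e <= -10:
        return 4  # green
    if best_log_e <= -3:
        return 3  # yellow: greater than -10 and at most -3
    if best_log_e <= -1:
        return 2  # red: greater than -3 and at most -1
    return 1      # black: greater than -1


for value in (-12.4, -10, -5.0, -3, -2.1, -1, -0.5):
    print(value, log_e_score(value))
```

Each boundary value (-10, -3 and -1) falls into the better of its two neighbouring colours, as the "less than or equal to" wording requires.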

2. TPB data type:
Source data:
XPath: gpmdbsummary/protein/identification@samples
Quality score:
Based on number of samples in which the protein was detected.
4 (green):
100 or more samples
3 (yellow):
20 to 99 samples
2 (red):
1 to 19 samples
1 (black):
not detected
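As a rough illustration of the sample-count mapping, the Python sketch below reads a samples attribute from a small XML fragment and scores it. The fragment is hypothetical and modelled only on the XPath given above, so the real GPM file may differ. Treating a missing attribute as zero samples is also an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped after the XPath in the text; values are invented.
SAMPLE = """
<gpmdbsummary>
  <protein>
    <identification beste="-12.4" samples="150"/>
  </protein>
  <protein>
    <identification beste="-2.1" samples="7"/>
  </protein>
</gpmdbsummary>
"""


def sample_count_score(samples: int) -> int:
    """Map the number of samples with a detection to a colour level."""
    if samples < 0:
        raise ValueError("sample count cannot be negative")
    if samples >= 100:
        return 4  # green
    if samples >= 20:
        return 3  # yellow
    if samples >= 1:
        return 2  # red
    return 1      # black: not detected


def scores_from_summary(xml_text: str) -> list:
    """Return one colour level per identification element in the summary."""
    root = ET.fromstring(xml_text.strip())
    return [
        sample_count_score(int(node.get("samples", "0")))
        for node in root.findall("protein/identification")
    ]


print(scores_from_summary(SAMPLE))
```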

GPM (ACE)
Source file format:
Flat file (text)
Source repository:
 
Data mapping:
 
1. TPB data type:
Source data:
modified residue
Quality score:
Based on the number of modification observations on a particular residue.
4 (green):
>=50 observations
3 (yellow):
>=20 observations
2 (red):
>=5 observations
1 (black):
<5 observations

GPM (PTM)
Source file format:
Flat file (text)
Source repository:
 
Data mapping:
 
1. TPB data type:
Source data:
modified residue (S => PTM PHS SER, T => PTM PHS THR, Y => PTM PHS TYR)
Quality score:
Based on the number of modification observations on a particular residue.
4 (green):
>=50 observations
3 (yellow):
>=20 observations
2 (red):
>=5 observations
1 (black):
<5 observations
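The residue-to-type rule and the observation thresholds above can be combined in a short Python sketch. The layout of the flat file is not described in this document, so the sketch starts from an already parsed residue letter and observation count. The function names are hypothetical.

```python
# Residue letter to TPB data type, as stated in the source data line above.
RESIDUE_TO_TYPE = {
    "S": "PTM PHS SER",
    "T": "PTM PHS THR",
    "Y": "PTM PHS TYR",
}


def observation_score(observations: int) -> int:
    """Map a count of modification observations to a colour level."""
    if observations < 0:
        raise ValueError("observation count cannot be negative")
    if observations >= 50:
        return 4  # green
    if observations >= 20:
        return 3  # yellow
    if observations >= 5:
        return 2  # red
    return 1      # black: fewer than 5 observations


def classify_modification(residue: str, observations: int) -> tuple:
    """Return (TPB data type, colour level) for one modified residue."""
    return RESIDUE_TO_TYPE[residue], observation_score(observations)


print(classify_modification("S", 63))
print(classify_modification("Y", 4))
```

The same threshold function also fits the GPM (ACE) mapping, which uses identical observation cut-offs.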

Human Protein Atlas
Source file format:
XML
Source repository:
 
Data mapping:
 
1a. TPB data type:
Source data:
XPath: proteinAtlas/entry/tissueExpression@type=“APE”&@technology=“IH”
Quality score:
Based on reliability scores generated by HPA (documentation available at http://www.proteinatlas.org/about/quality+scoring#re), though with more stringent colour mappings. The following quality scores apply only to proteins with positive expression; any tissue that has no expression is given a quality score of 1 (black).
4 (green):
High
3 (yellow):
Medium
2 (red):
Low or Very Low
1 (black):
No protein expression in tissue

1b. TPB data type:
Source data:
XPath: proteinAtlas/entry/tissueExpression@type=“staining”&@technology=“IH”
Quality score:
Based on validation scores generated by HPA (documentation available at http://www.proteinatlas.org/about/quality+scoring#va), though with more stringent colour mappings. The following quality scores apply only to proteins with positive staining; any tissue that has no staining is given a quality score of 1 (black).
4 (green):
N/A
3 (yellow):
Supportive
2 (red):
Uncertain or Non-supportive
1 (black):
Negative for protein staining in tissue
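Mappings 1a and 1b above share one rule: a tissue without expression or staining is always black, and otherwise the HPA label decides the colour. A minimal Python sketch of that rule follows. The detected flag is a hypothetical input, since the text does not say how the source file marks a negative tissue, and the table and function names are illustrative.

```python
# HPA reliability labels (mapping 1a) to colour levels.
RELIABILITY_COLOURS = {"High": 4, "Medium": 3, "Low": 2, "Very Low": 2}

# HPA validation labels (mapping 1b) to colour levels; green is never assigned.
VALIDATION_COLOURS = {"Supportive": 3, "Uncertain": 2, "Non-supportive": 2}


def hpa_score(label: str, detected: bool, colours: dict) -> int:
    """Return the colour level for one tissue.

    detected is an assumed boolean meaning positive expression or staining.
    """
    if not detected:
        return 1  # black, whatever the HPA label says
    return colours[label]


print(hpa_score("High", True, RELIABILITY_COLOURS))
print(hpa_score("High", False, RELIABILITY_COLOURS))
print(hpa_score("Supportive", True, VALIDATION_COLOURS))
```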

Gene Expression Barcode
Source file format:
CSV
Source repository:
 
Data mapping:
 
1. TPB data type:
Source data:
Sample proportion for a probe/tissue pair
Quality score:
Based on the proportion of samples from a tissue in which the probe was detected, and on the uniqueness of the probe to the gene.
4 (green):
Proportion >= 0.8 AND probe unique to gene
3 (yellow):
Proportion >= 0.5
2 (red):
Proportion > 0
1 (black):
Proportion = 0
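The Gene Expression Barcode thresholds above can be sketched in Python as a cascade. Reading them as a cascade is an assumption: under it, a probe that is not unique to the gene falls to yellow even when its proportion is 0.8 or higher. The function and parameter names are hypothetical.

```python
def barcode_score(proportion: float, probe_unique: bool) -> int:
    """Map a probe/tissue sample proportion to a colour level.

    Assumes the thresholds are checked from green downwards, so a
    non-unique probe can reach yellow at best.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    if proportion >= 0.8 and probe_unique:
        return 4  # green
    if proportion >= 0.5:
        return 3  # yellow
    if proportion > 0:
        return 2  # red
    return 1      # black: not detected in any sample


for proportion, unique in ((0.9, True), (0.9, False), (0.6, True), (0.1, False), (0.0, True)):
    print(proportion, unique, barcode_score(proportion, unique))
```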

If you make use of TPB in your research, please cite: Goode, et al. "The proteome browser web portal", Journal of Proteome Research 2013, 12(1):172-8. (PMID:23215242)
This project is supported by the Australian National Data Service (ANDS). ANDS is supported by the Australian Government through the National Collaborative Research Infrastructure Strategy Program and the Education Investment Fund (EIF) Super Science Initiative.