PRINTS and InterPro – online resources that facilitate discovery of pharmaceutical and commercially relevant information in proteomic and genomic data-sets
Submitting Institution
University of ManchesterUnit of Assessment
Biological SciencesSummary Impact Type
TechnologicalResearch Subject Area(s)
Mathematical Sciences: Statistics
Biological Sciences: Genetics
Medical and Health Sciences: Pharmacology and Pharmaceutical Sciences
Summary of the impact
Automation of genomic data analysis has become essential. High-throughput
sequencing technologies are producing data faster than can be managed and
interpreted, meaning that much biomedical information remains unused.
Research led by Attwood introduced a unique method for protein sequence
characterisation and a derived database of diagnostic protein signatures
(PRINTS). This led directly to the development of a new database
(InterPro), now routinely used to annotate the world's largest protein
sequence archive (UniProt), and complete genomes and metagenomes. The
databases and their search tools have been exploited in the private sector
(including SMEs and multi-national pharmaceutical and agrichemical
companies), generating workflows that have yielded candidate drug targets
and provided insights into disease mechanisms.
Underpinning research
These impacts are explicitly based on research that was funded and
undertaken at the University of Manchester (UoM) from 1999 to date. Key
researchers were:
Professor Teresa Attwood (1999 to date)
Post-Doctoral Research Associates: Dr Alex Mitchell (2000-2011;
EBI InterPro Content Manager, 2011 to date), Dr Jane Mabey
(2001-2006), Dr Mike Croning (1999-2002), Dr Phil Scordis
(2000-2001)
Research Assistants: Mr Paul Bradley (2002-2005), Mr Ala
Uddin (2001-2003), Mr Julian Selley (1999-2002)
PhD students: Mr Neil Maudling (2000-2005), Ms George Moulton
(2000-2005), Ms Anna Gaulton (2000-2004), Mr Will Wright
(1999-2001)
The underpinning research aims to develop protein family databases, and
tools for family analysis and annotation. At one end of the spectrum is
high-throughput genome annotation and at the other is fine-tuned
functional characterisation of pharmaceutically relevant proteins. As
such, many of the team's research projects have been funded (~£359k) by
major pharmaceutical companies and SMEs (AstraZeneca, Pfizer, Cambridge
Drug Discovery, Roche Discovery, GlaxoSmithKline), highlighting the value
of the work in drug-discovery programmes.
The key steps at the UoM were:
- The motivation was to create a pipeline for automatic annotation of
the mass of raw genomic data entering the European Bioinformatics
Institute's (EBI's) TrEMBL database. InterPro was first released in
October 1999, with seed data from PROSITE, Pfam and PRINTS [1].
InterPro's power comes from integrating complementary analysis methods,
allowing it to provide more robust functional diagnoses (i.e.,
the diagnostic sum is more powerful than its component parts).
- Attwood and her team showed that their uniquely selective fingerprint
method for sequence analysis could uncover new, distantly related,
members of protein families (e.g., a range of new tubulins in
protozoal parasites was discovered [2]).
- Attwood's team then exploited the selective nature of the method to
establish an innovative hierarchical approach to protein family
fingerprinting, for the first time allowing extremely fine- grained
functional analyses. They showcased the approach using the
pharmaceutically important GPCRs, creating a compendium of >200
family- and subfamily-specific GPCR fingerprints [3].
The approaches have been extended to numerous families. Such work
directly augmented PRINTS [4,5], doubled its size, and provided >1,000
new family- and subfamily-specific signatures to InterPro. These strands
of research remain fruitful and continue to feed into the impact: PRINTS
is actively maintained and continues to supply InterPro with unique
hierarchical fingerprints [6].
References to the research
The research was published in leading life science journals (Trends in
Pharmacological Sciences, Current Biology) and in the top journal in
the field of bioinformatics for database publications (Nucleic Acids
Research).
1.
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney,
E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D.,
Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H.,
Hulo, N., Jonassen, I. et al. (2001) The InterPro
database, an integrated documentation resource for protein families,
domains and functional sites. Nucleic Acids Res. 29 (1). p. 37-40.
DOI: 10.1093/nar/29.1.37
2. Vaughan, S., Attwood, T.K., Navarro, M., Scott, V.,
McKean, P.,Gull, K. (2000) New
tubulins in protozoal parasites. Current Biology. 10 (7). p.
R258-259. DOI: 10.1016/S0960- 9822(00)00414-0
3. Attwood,
T.K. (2001) A compendium of specific motifs for diagnosing
GPCR subtypes. Trends Pharmacol.Sci. 22 (4). p. 162-165.
DOI:10.1016/S0165-6147(00)01658-8
4. Attwood, T.K., Croning, M.D., Flower, D.R., Lewis, A.P.,
Mabey, J.E., 2028Scordis, P., Selley, J., Wright, W. (2000) PRINTS-S:
the database formerly known as PRINTS. Nucleic Acids Res. 28
(1). p. 225-227. DOI: 10.1093/nar/28.1.225
5.
Attwood, T.K., Bradley, P., Flower, D.R., Gaulton,
A., Maudling, N., Mitchell, A.L., Moulton, G., Nordle, A.,
Paine, K., Taylor, P., Uddin, A., Zygouri, C. (2003) PRINTS
and its automatic supplement, prePRINTS. Nucleic Acids Res. 31
(1). p. 400-402. DOI: 10.1093/nar/gkg030
6. Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman,
A., Binns, D, Bork, P., Das, U., Daugherty, L., Duquenne L., Finn, R.,
Gough, J., Haft, D., Hulo, N., Kahn, D., Kelly, E., Laugraud, A., Letunic,
I., Lonsdale, D., Lopez, R., Madera, M., Maslen, J., McAnulla, C.,
McDowall, J., Mistry, J., Mitchell, A. et al. (2009)
InterPro: the integrative protein signature database. Nucleic Acids
Res. 37. p. D211-215. DOI: 10.1093/nar/gkn785
Details of the impact
Context
Before research at UoM, protein sequence annotation was largely done
manually, and the gold- standard repository, Swiss-Prot, had become a
bottleneck, preventing early access to genomic data. To tackle this
problem, TrEMBL was established at the EBI as an unannotated Swiss-Prot
supplement. Attwood and her research team established InterPro, the
world's first integrated protein family repository, following on from
PRINTS. Its creation made possible the implementation of robust,
high-throughput annotation workflows for TrEMBL and also for complete
genomes. In turn, this led to the development of new drug-discovery
pipelines and global usage of InterPro.
Pathways to impact
PRINTS has been maintained at UoM for 14 years. Attwood re-established
the umbrella website, DbBrowser (the access point for PRINTS and its
analysis/annotation tools) in March 1999, and set up an FTP site for their
anonymous download.
InterPro has grown in size and complexity, now having a dozen partner
databases focusing on gene- and domain-families, protein folds,
architectures and superfamilies. Hosted at the EBI, with continued input
of new diagnostic signatures from its partners (including PRINTS),
InterPro has become the pre-eminent instrument for analysis and functional
annotation of uncharacterised genomic data. Since 2008, InterPro has
featured in ~30 educational national and international workshops, and is
covered in two EBI online e-learning courses: InterPro: Quick tour
and Introduction to Protein Classification at the EBI. The
audience for these workshops includes clinicians, researchers,
technologists and industrialists. Various PRINTS- and InterPro-specific
modules are also available as part of GOBLET, a global, community-based
organisation providing a centralised public facility for sharing
educational materials. The creation of GOBLET was led by Attwood, who is
the current Chair.
Reach and significance of the impact
Establishing one of the world's largest resources for protein
family annotation:
InterPro was quickly adopted as part of the TrEMBL (now part of UniProt)
annotation pipeline, and continues to contribute to the rule-based
automatic annotation system for protein sequences in UniProtKB/TrEMBL [A].
Sequences in UniProt include fingerprint annotations, with hyperlinks back
to PRINTS, making PRINTS one of UoM's most heavily used resources, e.g.,
since 2008, traffic to PRINTS and its services has grown from ~2.5 million
hits/annum to >3.4 million. In 2010, InterPro accounted for 12% of all
Web visits to the EBI: it now draws ~45 million hits/annum from the
UK/Europe, USA, Canada, Japan, etc. (typically, ~3 million of
these from industry), and supports ~2 million searches/month.
New sequencing technologies will assure both a major ongoing impact on
InterPro usage statistics, and a continued role for this pivotal database
in genome- and proteome-annotation programmes [e.g., B-D]. In 2011,
InterPro contributed to the annotation of ~13 million proteins, making it
one of the world's largest, most successful publicly available protein
family annotation resources, on a par with the NCBI's Conserved Domain
Database, and resulting in its recognition as one of ELIXIR's core data
resources [D].
The advantages of using protein fingerprints:
Fingerprints uniquely provide hierarchical `superfamily to subfamily'
diagnoses, so users of PRINTS, InterPro and UniProt benefit from more
fine-grained functional insights than are given by InterPro's `catch-all'
methods: e.g., PRINTS classifies GPCRs and ion channels (the most
cited and downloaded 3D structures, owing to their pharmaceutical
relevance) into ~400 families and subfamilies. Such functional
`fine-tuning' improves the specificity of Gene Ontology (GO) mapping
within InterPro, mappings that are now cross-referenced >66 million
times in UniProt and provide functional terms for >11 million proteins.
GO annotation provided by InterPro is the largest source of automatic GO
annotation for proteins from all organisms [E].
Providing valuable information to pharmaceutical and commercial
organisations:
Around 10% of PRINTS >1,200 and InterPro's >3,100 cumulative
citations are by authors from commercial organisations (including
pharmaceutical companies like GSK, Merck, Novartis, Roche, Pfizer,
AstraZeneca, Intervet and agricultural, agrifood and agrichemical
companies like BASF). Approximately 1.3 million webpages have been served
to visitors from commercial domains from Jan-June 2013 [F].
InterPro developments have fuelled tangible outcomes for pharmaceutical
industries, including:
- Identifying specific genes and signalling pathways that may contribute
to motor neuron degeneration in amyotrophic lateral sclerosis [G]
(authors affiliated to Genentech).
- Identifying gene families that play a critical role in chronic skin
infections caused by Trichophyton rubrum (e.g., athlete's
foot) [H] (author affiliated to Proctor & Gamble).
- Providing evidence that combined inhibition of simultaneously active
receptor tyrosine kinases can lead to an added anti-cancer effect [I]
(author affiliated to Novartis).
One purpose of the American Recovery & Reinvestment Act of 2009
was to increase economic efficiency by spurring technological advances in
science and health. A project was developed to sequence individual human
genomes for $1,000, and hence translate sequencing into a viable clinical
tool, underpinning the worldwide goal of `personalised' medicine. Efforts
to achieve this goal are now generating data on an unimaginable scale,
most of it useless without annotation. InterPro harmonises world-wide
sequence-annotation projects and therefore provides an efficient platform
for transforming raw data into pharmaceutically and clinically useful
information; it thereby helps to realise both the global financial
investment in, and the future clinical outcomes of, these transformative
new sequencing technologies.
Sources to corroborate the impact
A. UniProt Consortium (2013) Update on activities at the Universal
Protein Resource (UniProt) in 2013. Nucleic Acids Res.,
41(Database issue), D43-47
B. Lamesch,
P., et al. (2012) The Arabidopsis Information Resource
(TAIR): improved gene annotation and new tools. Nucleic Acids Res.,
40, D1202-D1210
C. Schneider, M., et al. (2009) The
UniProtKB/Swiss-Prot knowledgebase and its Plant Proteome Annotation
Program. J. Proteomics, 72(3), 567-573. PMID: 19084081
D. ELIXIR: Overview, Progress & Futures. http://ec.europa.eu/research/biotechnology/eu-us-task-
force/pdf/20th-meeting/presentation_of_elixir_project_en.pdf
E. Burge, S., et al. (2012) Manual GO annotation of predictive
protein signatures: the InterPro approach to GO curation. Database,
Article ID bar068. DOI:10.1093/database/bar068
F. Letter from EMBL-EBI, corroborating the important contributions of
InterPro and PRINTS.
G. Phatnani, H.P., et al. (2013) Intricate interplay between
astrocytes and motor neurons in ALS. Proc. Natl. Acad. Sci. USA.
110(8), E756-765. DOI: 10.1073/pnas.1222361110
H. Martinez, D.A., et al. (2012) Comparative genome analysis of Trichophyton
rubrum and related dermatophytes reveals candidate genes involved in
infection. MBio. 3(5), e00259-12. DOI: 10.1128/mBio.00259-12
I. Harbinski, F., et al. (2012) Rescue screens with secreted
proteins reveal compensatory potential of receptor tyrosine kinases in
driving cancer growth. Cancer Discov.,
2(10), 948-959. DOI: 10.1158/2159-8290.CD-12-0237