The application of embedded analytics to hyper-scale and distributed data archives
Submitting Institution
University of LiverpoolUnit of Assessment
Communication, Cultural and Media Studies, Library and Information Management Summary Impact Type
TechnologicalResearch Subject Area(s)
Information and Computing Sciences: Artificial Intelligence and Image Processing, Information Systems
Summary of the impact
The research improves digital data archives by embedding computation into
the storage controllers that maintain the integrity of the data within the
archive. This opens up a number of possibilities:
- Data analysis can be automated and incorporated into the archiving
process;
- The approach improves the archiving of all types of digital objects,
from television broadcasts to genomes;
- The approach can be applied to distributed data and to datasets that
are too big for traditional approaches.
This has impact on three different classes of beneficiary:
- Providers of national data infrastructure in the UK and US, who are
incorporating Cheshire 3 into national data repositories;
- Data Users, such as Astra Zeneca, RAI, Sanger Institute, who are using
Cheshire 3 to extract valuable information from their data;
- Equipment vendors, such as NetApp, Xerox and Bellerophon Mobile, who
are developing commercial systems that will use Cheshire 3.
Underpinning research
The chief innovation at Liverpool University is the development of
"Cheshire 3", a data analysis and processing system in which the analysis
and processing of the data are integral to the data archive. Workflows to
analyse data and discover information are implemented in a virtual machine
that operates directly on the data archive.
This approach, of embedding the computer that does the data processing in
the data archive, has a big advantage. The traditional approach, of moving
the data to be processed from an archive to the computer that does the
processing, becomes impossible when very large amounts of data, petabytes
(millions of gigabytes) or exabytes (billions of gigabytes), are involved.
Storing such large quantities of data requires a great deal of hardware.
Moving the data from archive to computer consumes a great deal of
bandwidth.
The research began in 1997 as a collaboration between Paul Watry
(employed continuously at the University of Liverpool) and Ray Larson of
the University of California at Berkeley. The collaboration, which
continues to the present, was funded jointly by JISC and the National
Science Foundation and produced Cheshire 2, a prototype that worked with
text, numerical and geospatial data based on an earlier text-discovery
system developed by Larson.
The next phase, which was supported by a grant from JISC in 2004 to
Watry, developed the prototype of Cheshire 3, which has the added
capability of being able to deal with distributed data (see reference 6),
and is published as US Patent 20060277170, issued in December 2006.
Subsequent work developed capabilities for "big data" archive and
analysis and developed applications to support particular needs and
projects by integrating Cheshire 3 with different data management
technologies. The work was funded by a variety of collaborative grants
including the SHAMAN Framework 7 project, JISC funding under the "Digging
into Data" programme and the PERICLES Framework 7 project which is now
being prepared for industrial production.
The Cheshire 3 system is used internationally on a production basis and
is integrated with policy-based data management systems such as the
integrated Rule-Oriented Data System (iRODS). As discussed in section 4,
it now forms a significant part of the national information infrastructure
in the UK and the US.
References to the research
2. Sanderson, R., Watry, P. "Integrating data and text mining processes
for digital library applications". ACM/IEEE Joint Conferences on Digital
Libraries, JCDL2007, Vancouver, BC Canada. ISBN: 978-1-59593-644-8. DOI:
10.1145/1255175.1255188.
3. Watry, P, Larson, R., Sanderson, R. "Knowledge generation from digital
libraries and persistent archives", Research and Advanced Technology for
Digital Libraries, 10th European Conference ECDL 2006. Research and
Advanced Technology for Digital Libraries Lectures Notes in Computer
Science Volume 4172 (2006). DOI: 10.1007/11863878_54. ISBN
978-3-540-44638-5.
4. Watry, P., Sanderson, R. Larson, R. US Patent: "Digital Library
System", Patent number 20060277170, issued 7 December 2006. The published
disclosure relates to a digital library system that will operate in both
single processor and grid distributed computing requirements.
5. Watry, P. "Digital preservation theory and application:
Transcontinental persistent archives testbed activity", International
Journal of Digital Curation, vol. 2, No. 2 (2007), pp. 41-68. ISSN:
1746-8256. DOI: 10.2218.ijdc.v212.28.
6. Watry, P., Larson, R. "Cheshire 3 framework white paper: implementing
support for digital repositories in a data grid environment", Local to
Global Data Interoperability — Challenges and Technologies (2005), 60-64.
ISBN 0-7803-9228-0. DOI: 10.1109/LGDI.2005.1612466.
Collaborative research grants:
• The Sustaining Heritage Access through Multivalent ArchiviNg project
(SHAMAN), European Commission (Framework Programme 7 project), 2007-2011,
£1M
• Promoting and Enhancing the Reuse of Information throughout the Content
Lifecycle exploiting Evolving Semantics (PERICLES), European Commission
(Framework Programme 7 project), 2013-2017, £978k
• Integrating data mining and data management technologies for scholarly
enquiry, JISC funding under the "Digging into Data" programme, 2012-2013,
£81k
Details of the impact
Overview
Cheshire 3 has had high impact because its development has been guided by
rich 2-way interactions with different networks of potential
beneficiaries. The beneficiaries include the bodies that provide national
data infrastructures, data users, and equipment vendors. Our interactions
with them increase the reach and the significance of the impact in 3 ways:
- The interactions have ensured that Cheshire 3 is useful. Its
development has been guided by the needs of potential users. It supports
the latest policies, methods and practices for the analysis and sharing
of data and its technologies are closely integrated with the latest data
management technologies.
- The interactions ensure that public and private bodies that provide
data infrastructures are aware of and understand the advantages of
Cheshire 3.
- The interactions ensure that data users know that Cheshire 3 provides
the kinds of enhanced capabilities that they need in order to archive
their data reliably and to extract value from it.
Interactions leading to Impact
Our interactions with providers of national data infrastructure
include long-term collaboration with teams internationally responding to
needs assessment and requirements analysis; integration with other
projects already involved with "big data" management and analytics areas;
and long-term involvement with standards working groups and
implementation of the software at STFC for the Virtual Engineering Centre
at Daresbury. In 2009 the software was demonstrated to 13 Federal Agencies
as a primary technology, which resulted in the formation of the DataNet
Federation Consortium (DFC). This demonstration also formed the basis of
the SHAMAN
Integrated Project prototype (2008-2012), which served to integrate
the Cheshire system with the iRODS adaptive middleware, thus supporting
long-term curation and analysis of archived data.
Our interactions with data users include participation in groups
working to develop social consensus for the analysis and sharing of data,
policies, methods, and practice; and the interoperability mechanisms to
support technology integration and data analysis. This has involved
sustained interaction with emerging networks of expertise across digital
curation projects.
Our interactions with equipment vendors include discussions with
NetApp about the possible adoption of the software for storage devices,
working with corporate developers in London; employment of software
developers across SME (Archive Analytics, SpaceApps) to provide services
and support throughout the EU; use of the software by value added
resellers.
1) Impact on National Data Infrastructures
The software was developed at Liverpool and partly
funded by the NARA US National Archives as a means of supporting
long-term curation and analysis of digital data.
a. The software has supported services across multiple agencies, in the
US, including NSF, NARA, NASA, NIH, DOD, NHPRC, IMLS, and Europe,
including DOE, JISC, EU, EPSRC.
b. The software forms a foundational component of the US National Science
Foundation data infrastructure based at DataNet
Federation Consortium, for national (US) e-Science research
applications.
c. The software contributed to all of these services the ability to index
material within a data grid (cloud) and the ability to provide data
analytics workflows across active collections, across all these
organizations. Cheshire3 has provided this functionality to the DFC and
other NSF research projects that support the national data
cyberinfrastructure for NSF research projects in the United States and
which rely heavily on the extensive set of metadata that allows for the
discovery, analysis, and preservation of studies. The expanded ability of
these initiatives to federate using iRODS and Cheshire discovery services
means that scientific research datasets can be analysed and managed with
policy based rules that protect the authenticity, privacy, provenance,
context, and integrity of datasets.
d. The software is widely used on a service-oriented basis
internationally, and forms the infrastructure of multiple national digital
library services in the UK, including the
Archives Hub (JISC) (1999-present), and the
Incunabula Short Title Catalogue (British Library) (2005-present).
2) Impact on Data Users
a. The integrated software forms the basis of commercial prototypes for
AstraZeneca, Drexel, and Virtual Engineering Centre, for managing data
driven collections; and in 2012 the Virtual Engineering Centre commissioned
a prototype for use in the automotive and aerospace sectors at the Science
and Technology Facilities Council.
b. The integrated system is currently being rolled out for research
organizations, such as Sanger Institute, Science and Technology Facilities
Council (STFC), University of Edinburgh (EPCC), University of Liverpool
(N8 grid).
c. The system has been prototyped for use in managing research data for
the pharmaceutical industry (AstraZeneca) (2008-2010).
d. The research advances in distributed data analytics applied to
petascale (millions of gigabytes) collections have resulted in measurable
impact improving the analysis and management of scientific research data,
including research collaborations across different communities of
practice.
3) Impact on Equipment Developers
a. NetApp, DataDirectNetworks, and Xerox are promoting use of the system and
software for cloud appliance and managing "big data technologies" for
research or predictive use cases. The Business Development Manager for Cloud
Services at NetApp UK Ltd will confirm that the Cheshire technology
developed by Professor Watry's group will be embedded in upcoming NetApp
enterprise products on a commercial basis from 2014.
b. The software is driving investment within the SME communities, both in
the US and EU. The Founder and Senior Science Advisor of Bellerophon
Mobile in the United States, for example, states that "the use of
Cheshire3 workflows to aid discovery of distributed data could contribute
greatly to effectiveness of building mobile knowledge-based systems".
Sources to corroborate the impact
- The Business Development Manager for Cloud Services at NetApp UK Ltd
can be contacted to confirm the impact of Cheshire3 on Equipment
Developers (3a, 3b). He will confirm that NetApp are releasing
commercial products incorporating the technology.
- Head of Search, Taxonomy, and Enteprise Content Management at Astra
Zeneca Pharmaceuticals can be contacted to confirm the impact of
Cheshire3 on data users and its adoption by Astra Zeneca for managing
pharmaceutical data (2a, 2c).
- The Director of the Sustainable Archive Institute at the School of
Information and Library Science, University of North Carolina, can be
contacted to confirm the impact of Cheshire3 on the National data
infrastructure in the US and the UK and its impact on data users (1a-d;
2a, 2b, 2d).
- The Co-founder and Science Advisor of Bellerophon Mobile can be
contacted to confirm the impact of Cheshire3 on data users (2a, 2b, 2d)
and SME investment (3b).
- The Enterprise Architecture Area Manager of the Services Innovation
Laboratory, Xerox Research Centre Europe, has provided a statement to
confirm that Cheshire3 produces measurable impact on the analysis and
management of scientific data and that it is driving commercial
investment (2d, 3b).