Better and faster guarantees of respondent privacy when releasing public statistics
Submitting Institution
University of the West of England, Bristol
Unit of Assessment
General Engineering
Summary Impact Type
Technological
Research Subject Area(s)
Mathematical Sciences: Numerical and Computational Mathematics
Information and Computing Sciences: Computation Theory and Mathematics
Summary of the impact
Several National Statistics Agencies (NSAs) in Europe now use tools based
on UWE research to ensure published tables are protected from hacking
attempts to breach data privacy. Provision of high-quality data to policy
and decision makers is so important that supplying it to NSAs is often
mandatory for organisations and individuals. In return, NSAs, such as the
UK's Office for National Statistics (ONS), must guarantee a degree of
confidentiality. Our research has benefitted ONS, its clients and data
providers, by exposing serious flaws in existing methodologies and
techniques for protecting confidentiality and by creating tools for (i)
auditing and (ii) protecting large complex tables.
Underpinning research
National Statistics Agencies (NSAs) supply information that governments,
business and public services use to develop policies and make better
decisions. This information is derived from data provided by
"respondents": organisations, companies and people. Respondents' data must
not be published in a form vulnerable to hacking, that is, a form in which
data privacy (and/or anonymity) could be breached. This is a legal duty
upon NSAs. For tables
containing "magnitude" data (e.g., financial turnover), this means a
respondent's data cannot be calculated within a given margin of error
(typically 10%). The preferred approach is to leave blank certain
"sensitive" cells of the table, but this is almost never enough to
protect privacy: a table usually contains sub-totals, from which a blanked
value can often be recovered by subtraction. As a result, additional cells
must also be "suppressed" to prevent hacking via mathematical techniques.
However, this means that less data is published.
Guaranteeing protection while maximising information that is published is
known as the Cell Suppression Problem (CSP). While easy to solve for very
small tables through "exact" mathematical optimisation, increases in size
and complexity can rapidly cause this approach to fail, so in practice
NSAs tend to use rough-and-ready "heuristic" approaches. However, these
can greatly over-suppress. UWE research (a collaboration between staff in
this Unit and in Computer Science) has developed efficient methods that
protect large tables from mathematical hacking, and yet maximise the
information published.
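As a minimal illustration of why blanking a sensitive cell alone is not enough, the sketch below considers a single table row with a published total. The figures, the function names `attacker_interval` and `is_protected`, and the reading of the 10% rule (the attacker's feasible range must extend at least 10% either side of the true value) are assumptions made for this example, not ONS's or UWE's definitions.

```python
def attacker_interval(row, total, suppressed):
    """Range an attacker can infer for any one suppressed cell in a row.

    Assumes the attacker knows the row total, every published cell, and
    that all values are non-negative. `suppressed` is a set of indices.
    """
    published_sum = sum(v for i, v in enumerate(row) if i not in suppressed)
    residual = total - published_sum
    if len(suppressed) == 1:
        # A single blank is recovered exactly by subtraction.
        return residual, residual
    # Several blanks share the residual, so each lies between 0 and it.
    return 0, residual


def is_protected(true_value, lower, upper, margin=0.10):
    """Assumed rule: the feasible range must reach 10% below and above."""
    return lower <= true_value * (1 - margin) and upper >= true_value * (1 + margin)


# Hypothetical row of turnover figures; the value 45 is the sensitive cell.
row = [120, 45, 300, 35]
total = sum(row)

# Primary suppression only: the blank is recovered exactly.
lo, hi = attacker_interval(row, total, {1})
print(lo, hi, is_protected(45, lo, hi))

# Adding a secondary suppression (the cell holding 35) widens the range.
lo, hi = attacker_interval(row, total, {1, 3})
print(lo, hi, is_protected(45, lo, hi))
```

The second case publishes one cell fewer, which is the trade-off the CSP seeks to minimise. Real tables have many interlocking totals, so the attacker's range generally has to be computed by mathematical optimisation rather than by this single-row shortcut.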
The research was carried out at UWE by Dr Alistair Clark
(Principal Lecturer 1998-2011, Associate Professor 2011-present) and Dr
James Smith (Research Fellow/Senior Lecturer 1996-2007, Associate
Professor 2007-present). Clark has substantial experience of
mathematical optimisation applied to production scheduling, supply chains
and manpower rostering. He has long researched with Smith at the interface
of exact and heuristic approaches, hybridising mathematical and computing
techniques to create tools that produce high-quality near-best solutions.
In 2006, the Office for National Statistics (ONS) approached them to
undertake research into the CSP for large complex tables, subsequently
supported by EPSRC funding for Martin Serpell, now a permanent
early-career researcher at UWE whom they jointly supervised for a PhD.
Their collaboration resulted in the development of:
— "Unpicking" algorithms to rapidly "attack" protected complex
tables, revealing serious issues with existing tools used by ONS and
NSAs. Protection techniques not only over-suppressed, but also
consistently left complex tables vulnerable to hacking (unlike UWE's new
methods). This research resulted in section 4's impact points (a, c, f)
where ONS took action to improve its service to end-users.
— A more efficient mathematical formulation (R1), allowing larger
tables (up to 40,000 cells) to be protected (see impact in section
4(d)).
— Novel methods that pre-process tables to greatly reduce problem
size (R1,2), again allowing larger tables (up to 200,000 cells) to be
protected. Section 4 describes the impact of this improved ability,
detailing ONS initiatives to strengthen its statistical disclosure
control (SDC) methods, resulting in
impacts 4(d, e).
— New hybrid approaches combining mathematical methods with heuristic
approaches. These outperformed all existing approaches on the huge
variety of data tables published by ONS (R3) (impacts 4(b, d, e)).
— Evidence showing that the best choice among several alternative
methods depended on the type of table being protected, resulting in new
robust, flexible and best-performing methods (R3).
— Combined methods to better protect large complex tables while
publishing more information for end-users (R3) (impacts 4(d, e)).
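The idea of auditing a protected table by attacking it can be sketched with a deliberately simple attack. The code below is not UWE's unpicker algorithm, whose details are not given here; it is an assumed, illustrative subtraction attack on a 2D table with published row and column totals, and the function name `unpick` and all figures are hypothetical.

```python
def unpick(table, row_totals, col_totals):
    """Repeatedly fill any row or column that has exactly one hidden cell.

    `table` is a list of rows; hidden (suppressed) cells are None.
    Returns the partly recovered table and the cells still hidden.
    """
    cells = [row[:] for row in table]  # work on a copy
    n_rows, n_cols = len(cells), len(cells[0])
    progress = True
    while progress:
        progress = False
        for r in range(n_rows):
            hidden = [c for c in range(n_cols) if cells[r][c] is None]
            if len(hidden) == 1:
                known = sum(v for v in cells[r] if v is not None)
                cells[r][hidden[0]] = row_totals[r] - known
                progress = True
        for c in range(n_cols):
            hidden = [r for r in range(n_rows) if cells[r][c] is None]
            if len(hidden) == 1:
                known = sum(cells[r][c] for r in range(n_rows)
                            if cells[r][c] is not None)
                cells[hidden[0]][c] = col_totals[c] - known
                progress = True
    still_hidden = [(r, c) for r in range(n_rows) for c in range(n_cols)
                    if cells[r][c] is None]
    return cells, still_hidden


# Hypothetical true table with its published marginal totals.
true_table = [[20, 50, 10], [8, 19, 22], [17, 32, 12]]
row_totals = [sum(row) for row in true_table]
col_totals = [sum(col) for col in zip(*true_table)]

# Weak pattern: three blanks that can be unpicked one after another.
weak = [[None, None, 10], [None, 19, 22], [17, 32, 12]]
recovered, hidden = unpick(weak, row_totals, col_totals)
print(recovered == true_table, hidden)

# Rectangular pattern: every affected row and column has two blanks.
strong = [[None, None, 10], [None, None, 22], [17, 32, 12]]
recovered, hidden = unpick(strong, row_totals, col_totals)
print(hidden)
```

Surviving this simple attack does not show that a table is safe: an attacker can still bound each hidden value using optimisation, and a full audit must check that those bounds are wide enough.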
References to the research
Publications
R1 Serpell, M., Clark, A., Smith, J. and Staggemeier, A. (2008).
Pre-processing Optimisation Applied to the Classical Integer Programming
Model for Statistical Disclosure Control. Lecture Notes in Computer
Science, 5262, 24-36. http://dx.doi.org/10.1007/978-3-540-87471-3_3
R2 Serpell, M., Smith, J., Clark, A. and Staggemeier, A. (2013). A
Preprocessing Optimization applied to the Cell Suppression Problem in
Statistical Disclosure Control. Information Sciences, 238, 22-32.
http://dx.doi.org/10.1016/j.ins.2013.02.006
R3 Smith, J. E., Clark, A. R., Staggemeier, A. T. and Serpell, M. C.
(2012). A Genetic Approach to Statistical Disclosure Control. IEEE
Transactions on Evolutionary Computation, 16(3), 431-441. http://dx.doi.org/10.1109/TEVC.2011.2159271
Grants
Improvements to Cell Suppression in Statistical Disclosure Control,
PI Clark, CI Smith, Office for National Statistics, 2005-06, £20k
Evaluation of Heuristic Approaches to Statistical Disclosure Control,
PI Smith, CI Clark, Office for National Statistics, 2007, £5k
Mathematical Modelling of Statistical Disclosure Control, PI
Smith, CI Clark, PhD student Serpell, EPSRC Mathematical CASE award with
UK Office for National Statistics, 2007 - 2011, £72k
Details of the impact
Our research has benefitted the UK's Office for National Statistics
(ONS), its clients and data providers, by exposing serious flaws in
existing methodologies and techniques for protecting confidentiality and
by creating tools for (i) auditing and (ii) protecting large complex
tables. UWE's research enabled ONS to handle: (i) larger tables than
previously, for example two-dimensional (2D) tables with over 1,000,000
cells; and (ii) complex tables, for example 3D with up to 200,000 cells,
and smaller 4D tables.
UWE's research findings were verified in extensive testing at ONS and by
the developers of the tau-Argus SDC software in 2011 and 2012.
Their confirmation of the weaknesses in existing methodologies has led to
the following (S1):
a. The incorporation of UWE's unpicker algorithm as an auditing tool into
tau-Argus version 3 since September 2012. This tool is used by NSAs
worldwide. It is also the standard recommended by the ONS to UK data
providers such as the NHS.
b. A policy decision at ONS in 2012 to change their methodology for
protecting magnitude data.
c. A decision by ONS in 2012 to incorporate UWE's unpicker algorithm as
part of ONS's standard working practice to validate and protect tables.
d. A decision in 2012 to incorporate UWE's protection tool within ONS's
tool set, accompanied in spring 2013 by internal ONS funding for the
necessary development work, with initial deployment in 2013.
The research findings also motivated ONS to provide additional funding to
ensure further impact:
e. The award in May 2013 of internal ONS funding to complete the roll-out
of the full suite of UWE tools and an integrated work-flow that protects
the Business Related Employment Statistics from autumn 2013.
f. The award in spring 2013 of ONS internal funding for the development
work needed to complete the incorporation in early 2014 of UWE's unpicker
algorithm into the desktop workflow management system at ONS's Business
Statistics Methods Unit.
There are three groups of beneficiaries of UWE's research:
- Respondents to data surveys (such as individuals and
businesses), who supply the source information and who have now been
protected by guaranteed confidentiality. Fortunately, to date there have
been no (highly publicised) breaches, although increasing computer power
and sophistication make such breaches ever more likely.
- Data providers (not ONS and NSAs), such as local and central
government agencies, the NHS, businesses and other organisations. The
new ability to rapidly validate their tables prior to publication has
allowed them to meet their duty of trust with confidence. The release of
the updated tau-Argus software (and, in ONS's case, the direct
deployment of UWE's programs) meant that many of the world's NSAs are
now able to meet their legal obligation to protect the confidentiality
of the information provided to them (while noting that to date there has
never been a widely publicised problem that has undermined confidence).
Data providers now also benefit from faster automated tools for creating
protected tables that replace manual and/or ad-hoc methods used for
larger and/or more detailed datasets. This enables more frequent and
flexible release of valuable data into the public domain.
- Data users (for example strategic planners and policymakers in
business, health, and government), who now benefit from improved access
to more detailed information. Providing greater data availability is in
line with the National Data Strategy developed by the UK
Data Forum, which has stated that "the value to the UK of good data
has never been greater". This has also benefitted the official Open
Data Initiative that aims to make available as much government
data as possible.
The unpicker algorithm has now been deployed at ONS and has since 2012
also been incorporated within the tau-Argus tool maintained by the
Central Bureau of Statistics (CBS), Netherlands (S2), and used by
NSAs worldwide. This has had a significant trans-European and worldwide
impact given that tau-Argus is the most popular SDC tool in use
throughout the world. The tool was funded by Eurostat and is maintained
by Statistics Netherlands (CBS). It incorporates many cell suppression
methods and is free to use.
Strategically, the impact of UWE's research represents a major milestone
on the path to "information on demand", with the aim for a fundamental
shift in the relationship between data providers and clients. Improved
access to information enables business, government or other agencies to
make more informed choices and leads to more efficient and effective
planning and resource utilisation.
Sources to corroborate the impact
[text removed for publication]
Testimonial letters listed below are available from UWE, Bristol.
S1. Testimonial from Head of Collection and Editing Methods and
Statistical Computing Unit, Office for National Statistics, Newport.
S2. Testimonial from former tau-Argus project manager (now retired),
Statistics Netherlands. Corroborates that UWE research enabled Statistics
Netherlands to protect large tables, and that national statistics
agencies have been enabled to increase the amount of aggregated
information published whilst protecting it from attacks to breach
contributors' confidentiality.