Using the data to choose the best model for a statistical analysis, using Reversible Jump Markov chain Monte Carlo: generic model choice for an evidence-informed society
Submitting Institution
University of Bristol
Unit of Assessment
Mathematical Sciences
Summary Impact Type
Technological
Research Subject Area(s)
Mathematical Sciences: Statistics
Summary of the impact
Reversible Jump Markov chain Monte Carlo, introduced by Peter Green [1]
in 1995, was the first generic technique for conducting the computations
necessary for joint Bayesian inference about models and their parameters,
and it remains by far the most widely used, 18 years after its
introduction. The paper has been (by September 2013) cited over 3800 times
in the academic literature, according to Google Scholar, the vast majority
of the citing articles being outside statistics and mathematics. This case
study, however, focusses on substantive applications outside academic
research altogether, in the geophysical sciences, ecology and the
environment, agriculture, medicine, social science, commerce and
engineering.
Underpinning research
Statistical analysis of data is a central ingredient in
evidence-based decision making across virtually all fields of human
endeavour; most such analysis is based on statistical models, and much
of this either entails formal choice between models with differing numbers
of parameters, or requires models with variable-dimension parameters. The
research underpinning this impact case study consists of work carried out
from 1993 at the University of Bristol by Peter Green, culminating with
the 1995 publication of a paper [1] in Biometrika, which
introduced Reversible Jump Markov chain Monte Carlo (RJMCMC), a
simulation-based methodology for fitting Bayesian statistical models that
have variable-dimension parameters. Mathematically, Reversible Jump is
a formalism for Metropolis-Hastings MCMC on a general state space consisting
of a countable union of Euclidean spaces of differing dimensions. The
paper included three illustrative applications. Over the following few years,
Green developed many more substantial applications of RJMCMC in
collaborative research projects, and several resulting publications
[including 2-3] have themselves all stimulated further research and are
well-cited.
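To make the idea of a sampler that moves between spaces of differing dimension concrete, the following is a minimal sketch in Python. It is not taken from [1]: the two toy models, the data Y, the proposal distributions and all function names are assumptions chosen purely for illustration. Model M1 has no free parameter and model M2 has one (a mean mu), so a "birth" move adds a dimension and a "death" move removes it. Because the new parameter is set equal to the proposed random draw, the Jacobian term in the acceptance ratio is 1.

```python
import math
import random


def log_norm_pdf(x, mean, sd):
    """Log density of a Normal(mean, sd^2) distribution at x."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2


def log_lik(y, mu):
    """Log-likelihood of data y under y_i ~ Normal(mu, 1)."""
    return sum(log_norm_pdf(v, mu, 1.0) for v in y)


def rjmcmc_prob_m2(y, n_iter=20000, tau=1.0, seed=0):
    """Estimate P(M2 | y) by reversible jump MCMC (illustrative toy example).

    M1: y_i ~ N(0, 1), no free parameter.
    M2: y_i ~ N(mu, 1) with prior mu ~ N(0, tau^2).
    Both models are given prior probability 1/2 (an assumption).
    """
    rng = random.Random(seed)
    n = len(y)
    ybar = sum(y) / n
    q_sd = 1.0 / math.sqrt(n)   # sd of the proposal for mu in a birth move
    ll_m1 = log_lik(y, 0.0)     # M1 fixes the mean at 0
    model, mu = 1, None
    visits_m2 = 0

    def accept(log_ratio):
        # Metropolis-Hastings acceptance with probability min(1, ratio).
        return rng.random() < math.exp(min(0.0, log_ratio))

    for _ in range(n_iter):
        if rng.random() < 0.5:
            # Between-model (dimension-changing) move.
            if model == 1:
                # Birth: draw u from the proposal and set mu = u (Jacobian 1).
                u = rng.gauss(ybar, q_sd)
                log_ratio = (log_lik(y, u) + log_norm_pdf(u, 0.0, tau)
                             - ll_m1 - log_norm_pdf(u, ybar, q_sd))
                if accept(log_ratio):
                    model, mu = 2, u
            else:
                # Death: drop mu; the ratio is the reciprocal of the birth ratio.
                log_ratio = (ll_m1 + log_norm_pdf(mu, ybar, q_sd)
                             - log_lik(y, mu) - log_norm_pdf(mu, 0.0, tau))
                if accept(log_ratio):
                    model, mu = 1, None
        elif model == 2:
            # Within-model random-walk Metropolis update of mu.
            prop = mu + rng.gauss(0.0, 0.5)
            log_ratio = (log_lik(y, prop) + log_norm_pdf(prop, 0.0, tau)
                         - log_lik(y, mu) - log_norm_pdf(mu, 0.0, tau))
            if accept(log_ratio):
                mu = prop
        visits_m2 += (model == 2)
    return visits_m2 / n_iter


def exact_prob_m2(y, tau=1.0):
    """Closed-form P(M2 | y) for this conjugate toy problem, used as a check."""
    n, s = len(y), sum(y)
    log_bf = (-0.5 * math.log(1.0 + n * tau ** 2)
              + s ** 2 * tau ** 2 / (2.0 * (1.0 + n * tau ** 2)))
    bf = math.exp(log_bf)
    return bf / (1.0 + bf)


# Hypothetical data, invented for this sketch only.
Y = [0.8, 1.3, -0.2, 0.9, 1.7, 0.4, 1.1, -0.5, 0.6, 1.2]

if __name__ == "__main__":
    print("RJMCMC estimate of P(M2 | y):", rjmcmc_prob_m2(Y))
    print("Exact value of P(M2 | y):    ", exact_prob_m2(Y))
```

The fraction of iterations spent in M2 estimates its posterior probability, and in this conjugate toy case it can be compared with the closed-form value. In realistic applications no closed form is available, which is the situation the methodology is meant for.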
References to the research
*[1] Green, P. J. (1995). Reversible jump Markov chain Monte Carlo
computation and Bayesian model determination, Biometrika, 82,
711-732. DOI: 10.1093/biomet/82.4.711
*[2] Richardson, S. and Green, P. J. (1997). On Bayesian analysis of
mixtures with an unknown number of components (with discussion). Read to
the Royal Statistical Society on 15 January 1997. Journal of the Royal
Statistical Society (B), 59, 731-792. DOI:
10.1111/1467-9868.00095
*[3] Giudici, P. and Green, P. J. (1999). Decomposable graphical Gaussian
model determination, Biometrika, 86, 785-801. DOI:
10.1093/biomet/86.4.785
* reference that best indicates the quality of the underpinning research
Details of the impact
The paragraphs below describe in brief a wide range of non-academic
applications of Reversible jump MCMC, all starting or continuing after
2008. Each of these is corroborated by personal communications held on
file, and in some cases also by internal or published documents citing [1]
(and sometimes [2] or [3]).
A. Applications in Geophysical sciences
1. Geophysical source reconstruction. At Defence Research and
Development, Canada, key concepts enunciated in [1] have been used
to design an innovative Bayesian inference methodology to address the
problem of source reconstruction for the difficult case of multiple
sources when even the number of sources is unknown a priori. This effort
has had an impact within the realm of public safety and security as it
addresses a critical capability gap in current emergency and retrospective
management efforts for incidents that involve the covert release of chemical,
biological, or radiological agents into the atmosphere.
2. Geophysical electrical resistivity. The US Geological
Survey is using methodology built on [1] to explore the space of
subsurface electrical resistivity models that are consistent with airborne
geophysical data. Data is acquired using airborne geophysical instruments
that are sensitive to the spatial distribution of electrical resistivity
below ground to depths of ~100m that, in turn, can be interpreted in terms
of geologic or hydrologic properties. "There is real-world impact to this
work- we are using this algorithm to characterize important groundwater
aquifer systems and permafrost in various areas of the U.S." [a]
3. Ground flow models. The Belgian Nuclear Research Centre
(SCK•CEN) has developed an MCMC simulation of a highly parameterized
groundwater flow model, based on [1], for uncertainty quantification of
subsurface transport in the context of the Belgian nuclear waste disposal
program.
4. Air pollution, greenhouse gases, remote sensing. Shell
Research uses Bayesian inference, exploiting MCMC techniques
including [1], to estimate the characteristics of sources of airborne
species (gases, particulate matter, etc.) [b]. The main value of inference
from remote sensing of airborne species to Shell, and to society in
general, is to be able to quantify contributions to greenhouse gas
emissions from specific human activities over time. The technology is also
generally useful in detecting unknown or unanticipated sources ('leaks')
of species carried on the wind, and can lead to discovery of (e.g.) new
hydrocarbon reserves. A key general ingredient of useful statistical
solutions to real-world problems is the flexibility and scalability of
Bayesian inference using MCMC, e.g. allowing characterisation of
parameters previously consigned to the 'too difficult to measure or
estimate' box. Reversible jump MCMC extends this flexibility considerably
by allowing dimension-jumping.
5. Air pollution, change point models. Cox Associates
have used [1] "in advocating (to risk analysts and regulators, in various
forums) the importance and practicality of applying better statistical
methods for causal analysis of health effects of key regulations, such as
air pollution regulations in the U.S". They have testified on the
importance and practicality of using better methods of causal analysis in
air pollution health effects research before the Subcommittee on Energy
and Power of the House Energy and Commerce Committee of Congress on health
effects of air pollutants (2012) http://energycommerce.house.gov/hearings/hearingdetail.aspx?NewsID=9594.
[c]
6. Climate and land models. The Geophysical Fluid Dynamics
Laboratory (GFDL) of the US National Oceanic and Atmospheric
Administration (NOAA) uses [1] in development of land components for
climate and Earth System models. These models are needed to make climate
projections (e.g. Intergovernmental Panel on Climate Change and national
assessments) and seasonal climate predictions. This approach has been used
to estimate parameters of the phenology module which is incorporated into
a new land model. Furthermore, this new parameterization has been
incorporated into a new model of forest dynamics for forest management (a
collaborative project with the US Forest Service).
B. Applications in Ecology and the Environment
1. Phylogenetics and biodiversity. Modern molecular phylogenetic
inference is of central importance to monitoring species diversity within
changing environments. Several projects within Agriculture and
Agri-Food Canada aim to monitor and understand general features of
biodiversity, and therefore rely heavily on phylogenetic inference. Such
inference, when conducted probabilistically, rests on explicit models of
molecular evolution, and the approach in [1] helps to choose the most
appropriate one, or allows phylogenies to be based on a weighted average
over all possible models of a given class. Algorithms based on [1] are
implemented in several phylogenetic inference packages and continue to
stimulate applied research in their labs.
2. Phylogenetics and biodiversity. The Morton Arboretum
in Lisle, Illinois, uses [1] to characterize shifts in chromosome number
evolution as a way of understanding biodiversity shifts in sedges
("Chromosome number evolves independently of genome size in a clade with
non-localized centromeres"), as well as to understand macroevolutionary
shifts in decomposition rates on the tree of life. A lead scientist at
Morton Arboretum comments that "Both of these have profound implications
for management and conservation of biodiversity as well as ecosystem
processes".
3. Animal abundance. NOAA uses [1] to allow uncertainty
in the number of individuals when estimating the abundance of organisms in
line transect surveys with imperfect detection. It is currently using
this type of analysis "to estimate the number of seals in the Bering
Sea".
4. Wildlife ecology. At the US Geological Survey Patuxent
Wildlife Research Center, [1] has been applied "to analyses of the North
American Breeding Bird Survey, to toxicological studies, to basic
ecological work on life histories, and in demographic analyses".
5. Ecology of salmon. The US Fish and Wildlife Service
uses [1] in comparing models aimed at assessing the effect of transporting
(exporting) water from the Sacramento-San Joaquin Rivers Delta on the
survival of juvenile salmon during their out-migration from
freshwater to the ocean. Water is exported from the Delta for
agricultural, municipal, and personal needs and is thought to directly
affect over 25 million people in California. Coincident with the increase
in water exports over the last 50 years or so, there have been sizable
declines in the abundance of several fish species in the Sacramento and
San Joaquin river systems. Conflicts have arisen between various
stakeholders and interest groups regarding how the water is used and
divided up. There have been many lawsuits and court cases. This work was
discussed at length in the US Federal District Court (Fresno, California)
in April 2010.
6. Ecology, conservation, environment. For Land Care Research
NZ, [1] "plays an important role on analysing data that has a
downstream effect on evolution, ecology and conservation biology", and
hence on environmental protection.
C. Agricultural applications
Quantitative Trait Loci (QTLs) in agriculture. At the national
agricultural research centre of Japan (AFFRC), "in our ... genome
analysis of livestock and crops, some useful QTLs affecting traits of
agronomical importance were detected with the developed methods
implementing RJ-MCMC. A project to produce new cultivars of crop (tomato)
or breeds of pig with high genetic performance using the information from
the detected QTLs is now in progress". [d]
D. Medical applications
Protein-DNA interactions and medical implications. Projects at the
UC Denver Medical School utilizing techniques based on [1] include
(i) predicting which human variations or mutations are likely to impact
protein structure and function, thereby causing human disease. This is
particularly important in rare childhood developmental and neurological
diseases; (ii) understanding the relationships among humans in order to
improve interpretation of genome-wide association studies, finding genes
that are components of quantitative disease; (iii) understanding the role
of the interaction of T-cell receptors and major histocompatibility
complex (MHC) molecules in defending against disease, and also in causing
autoimmune disease when things go wrong; (iv) using these methods to
understand and predict transcription factor binding mutations, which can
also lead to disease and act as modifiers of disease or drug interactions; (v)
understanding the biology of transposable elements, which are often
heavily implicated in novel diseases, particularly neurological disease,
and can also be useful for predicting gene regions that are likely to
harbour disease-causing mutations.
E. Social and commercial applications
Exchange reserves and Criminology in India. (i) Models quantifying
sufficiency of foreign exchange reserves: the Reserve Bank of India
uses [1] for variable selection within a quantile regression model
framework for studying adequacy of foreign exchange reserves to meet US$
demand in India under stressful market conditions. Because these models
are in-house, they are not published or shared.
(ii) Models for studying crime rates in different states of India: a
project at the Reserve Bank is attempting to determine the relevance of
socio-economic variables in determining the level of crime in Indian states.
"Understanding the relevance of various factors in determining crime rate
is very important in controlling crime rate. For instance, lack of toilet
and drinking water facilities require women in India to go away from her
house/hutment, which increase rape rate. Thus, a positive association
between crime against women and lack of toilet/drinking water facilities
demand public policy in developing these basic necessities. Probabilistic
models [based on [1] are] likely to throw light on such aspects of crimes
in India".
F. Applied image analysis and computer vision
1. Computer vision: object tracking. At SORMEA, a French
company specializing in measurements and surveys, studies and modelling,
Geographic Information Systems, acoustics and product development to
improve road safety, an automatic vision-based multi-vehicle tracking
system for measuring vehicle flows on crossroads and roundabouts,
yielding provenance and destination statistics, has been constructed
using [1]. These statistics are requested by local communities in order to
decide whether or not to create, extend or modify crossroads and
roundabouts. The system is currently exploited commercially:
http://www.sormea.fr/fr/r-d-innovation-anacomda/anacomda-o-d.html
2. Imaging of geosynchronous orbits, managing space debris. The
Lawrence Livermore National Laboratory of the US Department of Energy
conducted a project to understand how conventional astronomical facilities
might aid in determining the distribution of space debris in
geosynchronous orbit. The principal application of this work is to protect
valuable space assets from collisions with debris. [1] was applied "to
select different possible pairings of orbital tracks seen in optical
telescopes in a Monte Carlo framework". [e]
G. Implementations of RJMCMC in Software
1. MrBayes: [1] is used as one of several standard techniques in
the software MrBayes [f]. The use of reversible jump MCMC was recently
expanded to integration over nucleotide substitution models, and this is
quickly becoming a standard procedure in analyses using the software. The
software is widely used across the life sciences for comparative genetics
and genomics studies, and more generally for studies in evolutionary
biology. The software has attracted more than 16,000 citations to date. It
is used widely in research but also in a number of applied contexts. One
applied context concerns the identification of strains of disease
organisms. Another focuses on phylogenetic studies (inference of
evolutionary trees). Among other things, the evolutionary trees form the
basis for classifications used in natural history museums around the world
and in a wide range of applications related to the environment.
2. WinBUGS: The WinBUGS system [g] is respected software for
Bayesian analysis, widely used by applied statisticians in both the
private and public sectors, and its scope has been recently extended to
support fitting of a wide range of trans-dimensional models, including
variable selection, automatic curve-fitting using splines, Bayesian
multivariate adaptive regression splines (MARS) and classification and
regression trees (CART), normal mixture analysis, spatial epidemiology
clustering models and variable-order Markov chains. All of these
additional functions are based on [1].
3. LIS: NASA's Land Information System (LIS) [h] is a software
framework for high performance land surface modelling and data
assimilation. LIS is led by the Hydrological Sciences Branch at NASA's
Goddard Space Flight Center. LIS software tools are used to develop
customized Land Data Assimilation Systems at NASA's Goddard Space Flight
Center, NOAA's National Centers for Environmental Prediction and the Air
Force Weather Agency. MCMC methods including [1] are currently being
implemented and incorporated into the system.
Sources to corroborate the impact
Major correspondents include:
[a] Research Geophysicist, USGS Crustal Geophysics and Geochemistry
Science Center, Denver, Colorado.
Corroborates item A2 in section 4.
[b] Scientist, Statistics & Chemometrics, Shell Technology Centre,
Chester, UK.
Corroborates item A4 in section 4.
[c] President of Cox Associates Consulting, Denver, Colorado.
Corroborates item A5 in section 4.
[d] Project leader, Agriculture, Forestry and Fisheries Research Council
(Japan).
Corroborates item C in section 4.
[e] Research scientist, Physics division, Lawrence Livermore National
Laboratory, California.
Corroborates item F2 in section 4.
Links to software cited:
[f] http://mrbayes.net
Corroborates item G1 in section 4.
[g] http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml
Corroborates item G2 in section 4.
[h] http://lis.gsfc.nasa.gov/
Corroborates item G3 in section 4.