### UOA10-10: Development and implementation of mathematical algorithms enhance performance of software libraries on GPUs

**Submitting Institution**

University of Oxford**Unit of Assessment**

Mathematical Sciences**Summary Impact Type**

Economic**Research Subject Area(s)**

Information and Computing Sciences: Artificial Intelligence and Image Processing, Computation Theory and Mathematics, Information Systems

**Download original**

PDF**Summary of the impact**

Many of the top supercomputers use Graphical Processing Units (GPUs) to accelerate scientific computing applications with less energy consumption and lower overall cost. GPUs achieve this by having comparatively large numbers of simple processing elements when compared against CPUs, which have fewer, more sophisticated, elements. However, to take full advantage of GPUs requires quite different algorithms and implementation techniques for mathematical software libraries. Researchers at the University of Oxford have developed a number of such algorithms and implementation techniques over the period 2008-2013, which have been incorporated into software libraries distributed by NAG, NVIDIA and the Apache Foundation and have enhanced the performance up to 150x compared with single thread CPU calculations and 20x relative to multithreaded CPU calculations. These libraries are used by large numbers of application developers worldwide.

**Underpinning research**

In 2007, Professor Mike Giles began research on the use of GPUs for Monte Carlo simulations. This exploited the new capability to use NVIDIA graphics cards for high performance computing (HPC) applications through writing programs using CUDA, NVIDIA's proprietary extension to the computer language C. The performance benefits proved to be substantial, with many of the newly developed algorithms exhibiting a speed-up of over 150x compared with single thread CPU execution [2].

This led to a collaboration with Professor Chris Holmes at the University of Oxford's Statistics Department and Professor Arnaud Doucet (who was then at University of British Columbia but moved to the University of Oxford in 2012), in which they demonstrated the performance that could be achieved for more challenging statistical applications such as particle filters; the primary challenge is the re-weighting of the particles which is not easily parallelised [1].

Part of Giles' research programme involved the massively parallel implementation of random number generators, including both L'Ecuyer's mrg32k3a pseudo-random generator, and Sobol's quasi-random generator. This work is documented in [2] which has co-authors from both NAG and NVIDIA; both companies have adopted these generators in their respective random number libraries (see below).

One standard method of converting uniform random numbers into Normal random numbers is through inverting the Normal cumulative distribution function. This is a simple affine transformation of the inverse error function which is a standard function of many mathematical libraries, but the standard way in which it is approximated performs very poorly on GPUs because of their vector computing nature. This led to Giles developing a new approximation which is detailed in [3].

Another key component in many engineering and scientific applications is sparse matrix-vector multiplication. It is easy to implement this efficiently on CPUs, but much harder on GPUs due to the very limited amount of level 1 cache (memory) available to the large number of compute threads. Giles addressed this in a novel way by using multiple compute threads to cooperate to compute each one of the output elements, resulting in a factor 2x speedup compared with NVIDIA's existing CUSPARSE implementation [4].

**References to the research**

* [1] A. Lee, C. Yau, M.B. Giles, A. Doucet, C.C. Holmes. 'On the utility
of graphics cards to perform massively parallel simulation of advanced
Monte Carlo methods'. *Journal of Computational and Graphical
Statistics*, 19(4): 769-789, 2010.

DOI: 10.1198/jcgs.2010.10039 (Google Scholar: 94 citations, Web of
Knowledge: 17 citations)

* [2] T. Bradley, J. du Toit, M.B. Giles, R. Tong, P. Woodhams. 'Parallelisation techniques for random number generators'. pp.231-246 in GPU Computing Gems, Emerald Edition, Morgan Kaufmann, 2011. ISBN: 0123849888

* [3] M.B. Giles. 'Approximating the erfinv function'. pp.109-116 in GPU Computing Gems, Jade Edition, Morgan Kaufmann, 2011. ISBN: 978-0-12-385963-1

[4] I. Reguly, M.B. Giles. 'Efficient sparse matrix-vector multiplication on cache-based GPUs' IEEE Refereed Proceedings of Innovative Parallel Computing Conference, 2012. DOI: 10.1109/InPar.2012.6339602

The three asterisked outputs best indicate the quality of the underpinning research. [1] is in a major international journal, while [2] and [3] are chapters in research monographs instigated by NVIDIA.

**Details of the impact**

Since 2008, the University of Oxford's work on GPUs has had both economic impact and impact on practitioners and professional services, via improvement of existing software and the provision of consulting services. The beneficiaries are NAG, NVIDIA, the Apache Foundation, and the large number of people who use the improved software produced by these companies.

The impact has been achieved through the transfer of software developed by Giles as part of his research. This has gone into libraries developed and maintained by NAG, NVIDIA, and Apache. With NAG and NVIDIA, this came about through long-standing research collaborations and personal contacts. In the case of the Apache Foundation (which develops open-source software),the organisation asked for Giles' software as a result of reading his papers.

The first impact of this research was on the Numerical Algorithms Group
(NAG) and through their subsequent dissemination of the software to the
financial industry. NAG is an Oxford-based world leader in the development
of mathematical computer software libraries, whose products are used
worldwide in both academia and industry. NAG and Giles have long-standing
connections, including the recent development of a wholly new GPU-based
library [B] targeted at the needs of the finance industry. The Vice
President of Sales at NAG states [A] "*Thanks to your contributions, we
developed the GPU random number generation library quite quickly; I
think you were responsible for over half of the original code before it
went into our quality assurance process*". He further states that "*It*
[the NAG GPU library] *is being used by two major banks and two others
have purchased related consultancy services and training....We have also
benefited indirectly from this project, for example one major Tier 1
bank made a significant licence upgrade of ~£100,000 and this upgrade
only became possible by NAG establishing new contacts within the bank
through our GPU work. It is important that we are seen by our customers
as being at the cutting edge of scientific computing research, and our
work in areas like the GPU library is key to that and does help us to
keep existing customers and bring in new ones; over the past 3 years the
percentage of top banks who use our software has increased to 60%.*"
NAG report [C] that the French bank BNP Paribas reported excellent speed
up results (between 150x and 240x relative to a single-threaded CPU
simulation, depending on the number of simulations).

Giles has contributed fundamental software components to two of the NVIDIA's mathematical libraries for sparse linear algebra (CUSPARSE) and random number generation (CURAND) [D]. Both are integral to many scientific applications using NVIDIA's GPUs, without which the scientific applications would be unable to take advantage of the acceleration hardware in the leading supercomputers. The inverse error function routine (erfinv) is now part of their standard mathematics library, while the Sobol quasi-random generator is part of the CURAND random number generation library. The implementation of the mrg32k3a pseudo-random generator is based on [2] referenced above, and the sparse matrix-vector product (spMV) routine was put into the CUSPARSE library to replace the previous version developed internally by NVIDIA. Both of these algorithm libraries were developed and analysed at Oxford. The spMV routine is also a foundation for NVIDIA's new NVAMG solver, an algebraic multigrid solver which is the basis for a new GPU version of Ansys' Fluent computational fluid dynamics (CFD) software, in turn probably the leading commercial CFD solver worldwide. This illustrates the layered nature of software development, with high-level packages addressing specific applications layered on top of lower- level, more generic, more fundamental software.

The Senior Manager for CUDA Libraries and Algorithms at NVIDIA states in
his support letter [D] "*It is clear that high quality efficient
software libraries are important to users implementing their algorithms
on our hardware, and if we did not have our CUDA libraries then NVIDIA
would not hold the powerful position within HPC which it does. This is
well illustrated by the fact that the Titan system at Oak Ridge National
Laboratory, the top supercomputer in the world according to the Top500
list, is based on our new Kepler GPUs. Other indications, quoting from
our CEO's keynote presentation in the 2013 GTC conference* [E],*
are that in 2012 we sold 100M CUDA-capable GPUs, and the CUDA
development kit, including all of the libraries, was downloaded 1.6M
times.....Although it is hard to quantify the impact, indications of our
appreciation of the impact of your work are that we made you one of our
inaugural CUDA Fellows in 2008 (there are still only 11 worldwide and
you are the only one in the UK), and we made Oxford University a CUDA
Centre of Excellence (CCoE) in 2012*". This award included a donation
of $100k to support undergraduate research internships, and hardware
donations with a value of approximately another $100k, with further
donations likely in future years.

Finally, the inverse error function approximation software has also been adopted by the Apache Software Foundation for its Java-based Apache Commons Math library [F]. This has widespread use across numerous sectors, where it is used to convert uniformly-distributed random numbers into normally-distributed random numbers for stochastic simulations written in Java, in application areas as diverse as financial option pricing, biochemical reaction modelling, engineering uncertainty quantification, and the simulation of groundwater flow in nuclear waste repositories.

**Sources to corroborate the impact **

[A] Letter from Vice President of Sales, Numerical Algorithms Group (NAG), dated 12 June 2013, detailing the significance of GPU computing and Giles' influence on GPU computing at NAG. Copy held by the University of Oxford.

[B] Information on GPU computing on the NAG website which mentions Mike
Giles by name:

http://www.nag.co.uk/numeric/GPUs/index.asp

[C] NAG presentation detailing performance results generated by BNP
Paribas

http://www.nag.co.uk/numeric/gpus/FinanceNVIDIA.pdf

[D] Letter from the Senior Manager, CUDA Libraries and Algorithms, NVIDIA, dated 27 March 2013, detailing Giles' contribution to CUDA libraries and their impact on NVIDIA. Copy held by the University of Oxford

[E] 2013 presentation by NVIDIA CEO Jen-Hsun Huang.

http://www.ustream.tv/recorded/30095793
confirms the number of downloads of the library

[F] Email from Independent Contributor to Apache Commons Math library, confirming their use of Giles' implementation of erfinv. Copy held by University of Oxford