Submitting Institution
University of Aberdeen
Unit of Assessment
Computer Science and Informatics
Summary Impact Type
Technological
Research Subject Area(s)
Information and Computing Sciences: Artificial Intelligence and Image Processing
Language, Communication and Culture: Linguistics
Summary of the impact
Data-to-text utilises Natural Language Generation (NLG) technology that
allows computer systems to generate narrative summaries of complex data
sets. These summaries help experts, professionals and managers to understand
the information contained within large and complex data sets better and more
quickly. The technology has been developed since 2000 by Prof Reiter and
Dr Sripada at the University of Aberdeen, supported by several EPSRC
grants. The impact from the research has two dimensions.
As economic impact, a spinout company, Data2Text (www.data2text.com),
was created in late 2009 to commercialise the research. As of May 2013,
Data2Text had 14 employees. Much of Data2Text's work is collaborative with
another UK company, Arria NLG (www.arria.com),
which as of May 2013 had about 25 employees, most of whom were involved in
collaborative projects with Data2Text.
As impact on practitioners and professional services, case
studies have been developed in the oil & gas sector, in weather
forecasting, and in healthcare, where NLG provides tools to rapidly
develop narrative reports to facilitate planning and decision making,
introducing benefits in terms of improved access to information and
resultant cost and/or time savings. In addition, the research led to the
creation of simplenlg (http://simplenlg.googlecode.com/),
an open-source software package which performs some basic natural language
generation tasks. The simplenlg package is used by several
companies, including Agfa, Nuance and Siemens as well as Data2Text and
Arria NLG.
Underpinning research
Data-to-text technology was developed by Professor Reiter, Dr Sripada
(originally Research Fellow, now Senior Lecturer), and Professor Jim
Hunter, at the University of Aberdeen from 2000. The work arose from
Reiter's interest in "language and the world": he wanted to explore how
real-world information is mapped onto language, and focused on situations
where human authors wrote English-language narrative summaries of complex
numeric data sets. Although the initial motivation was basic research in
computer and cognitive science, it became apparent that there was
considerable commercial interest in software which could automate the task
of writing such summaries.
The research began with the EPSRC project SumTime: Generating
Summaries of Time-Series Data (2000-2003) [GR/M76881/01]. Reiter and
Hunter were investigators, with Sripada employed as a research fellow. The
primary focus of SumTime was to generate automated narrative
weather forecasts from numerical weather prediction data, although the
group also worked on summarising data from gas turbines, and from
electronic patient records in a hospital. One striking scientific finding
was the large variation in the words and language used by different human weather
forecasters. Another major outcome was that an evaluation of the automated
weather forecast generator (which was operationally deployed at the
Aberdeen office of Weathernews) showed that many forecast users preferred
computer-generated texts over human-written forecasts, in part because of
greater consistency in their language and word use. SumTime was
followed by an EPSRC CASE studentship (Ross Turner, supervised by Sripada,
by then a lecturer at Aberdeen, and Reiter), which developed Roadsafe,
a more sophisticated system which generated weather forecasts over a
spatial region (instead of just at one point); this was in conjunction
with another company, Aerospace and Marine International. This project
produced major advances in theory and algorithms for generating spatial
reference. The SumTime and Roadsafe work formed the basis
of commercial weather-forecasting work done by Data2Text, which now
employs Dr Turner.
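The general idea of turning a numeric time series into a short narrative can be illustrated with a minimal sketch. This is not the SumTime algorithm: the function names, the 2-knot threshold and the phrase choices are all hypothetical, chosen only to show how a fixed lexicon gives the consistent word use that forecast users valued.

```python
from typing import List, Tuple


def change_verb(delta: float, threshold: float = 2.0) -> str:
    """Map a change in wind speed to a fixed phrase (hypothetical lexicon).

    The same size of change always yields the same wording, which is the
    consistency property discussed in the text.
    """
    if delta > threshold:
        return "rising to"
    if delta < -threshold:
        return "easing to"
    return "holding near"


def summarise_wind(readings: List[Tuple[int, float]]) -> str:
    """Summarise (hour, speed in knots) pairs, given in time order."""
    if not readings:
        raise ValueError("need at least one reading")
    first_hour, first_speed = readings[0]
    parts = [f"Wind {round(first_speed)} knots at {first_hour:02d}00"]
    previous = first_speed
    for hour, speed in readings[1:]:
        # Describe each reading relative to the one before it.
        parts.append(f"{change_verb(speed - previous)} {round(speed)} by {hour:02d}00")
        previous = speed
    return ", ".join(parts) + "."


print(summarise_wind([(6, 10.0), (12, 18.0), (18, 17.0)]))
```

A real system would also select which changes are worth reporting and merge similar segments; the sketch reports every reading.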
After spending time on an ESRC-funded Paccit-Link project (Automatic
Generation of Personalised Basic Skills Summary Reports)
[L328253023], which summarised education assessment data for adults with
poor basic skills, the group then embarked on the Babytalk project
(2006-2012), supported by three EPSRC grants
[EP/D049520/1,EP/D05057X/1,EP/H042938/1] and two EPSRC DTA studentships;
this was a collaboration with NHS Lothian. Reiter, Hunter, and Sripada
were investigators on the main grants. Babytalk's goal was to
develop software which could automatically generate summaries of clinical
data about premature infants in neonatal intensive care for use by
doctors, nurses and parents. Babytalk was much more challenging
than SumTime, because the data was more complex (it included event
data such as medical interventions, as well as time-series data, and also
required extracting information from free text). The summaries also had to
be generated, using the same underlying architecture, for three very
different audiences. Probably the most important scientific finding of Babytalk
was that it proved the task could be done: it was possible to build, and
deploy in a hospital environment, a data-to-text system which could
generate useful summaries of a complex heterogeneous data set for diverse
audiences. Another result of this work was the development of an
architecture and software framework that could be used to generate
summaries of heterogeneous data sets across a range of potential
application areas. Thus Babytalk provided a basis for Data2Text's
work in the oil & gas industry.
Babytalk also allowed the group to refine an open source software
product simplenlg into a commercially valuable form. Earlier
versions of simplenlg had been used in Aberdeen since the late 1990s, but
during Babytalk simplenlg was much improved (in robustness and
documentation as well as functionality), and then released on an
open-source basis for free use for both academic research and commercial
applications.
The group also did some work on applying data-to-text technology to
assist people with disabilities; this work was supported by two EPSRC grants
[EP/F066880/1, EP/H022376/1]. The How Was School Today project,
which helped non-speaking children write stories about their school day,
was the subject of an EPSRC Impact study and was also mentioned in several
EPSRC reports.
References to the research
** These papers describe key findings from the SumTime project. The first
focuses specifically on differences in word usage between different
authors, looking at several domains. The second focuses on the weather
forecast generation domain, including a more detailed analysis of
differences in word and language usage between different weather
forecasters, and an evaluation which showed that forecast users in some
cases preferred SumTime texts to texts written by human forecasters. Papers
[R1] and [R2] above best indicate the quality of underpinning research.
[R3] A Law, Y Freer, J Hunter, R Logie, N McIntosh, J Quinn (2005). A
Comparison of Graphical and Textual Presentations of Time Series Data to
Support Medical Decision Making in the Neonatal Intensive Care Unit. Journal
of Clinical Monitoring and Computing 19:183-194.
http://dx.doi.org/10.1007/s10877-005-0879-3
** This paper, which laid the foundation for much of the Babytalk work,
shows that doctors and nurses make better decisions from textual summaries
of clinical data than from visualisations, at least in some contexts.
[R4] F Portet, E Reiter, A Gatt, J Hunter, S Sripada, Y Freer, C Sykes
(2009). Automatic Generation of Textual Summaries from Neonatal Intensive
Care Data. Artificial Intelligence 173:789-816. [Sripada1
in the REF2 for this unit.]
[R5] J Hunter, Y Freer, A Gatt, E Reiter, S Sripada, C Sykes (2012).
Automatic generation of natural language nursing shift summaries in
neonatal intensive care: BT-Nurse. Artificial Intelligence in Medicine
56:157-172. [Reiter2 in the REF2 for this unit.]
** These papers describe key findings of the Babytalk doctor and nurse
systems (the parent system is still being evaluated), including
architecture and on-ward evaluation of the nurse system.
Paper [R4] above best indicates quality of underpinning research.
[R6] R Black, A Waller, R Turner, E Reiter (2012). Supporting Personal
Narrative for Children with Complex Communication Needs. ACM
Transactions on Computer-Human Interaction 19(2), Article
15. [Reiter4 in the REF2 for this unit.]
** This paper describes key findings of the How Was School Today?
project.
Details of the impact
Following interest in the results of the research, in late 2009 the
decision was taken to create a spinout company, Data2Text, to
commercialise the underpinning research. Data2Text essentially develops
bespoke data-to-text software applications for large organisations, based
on a generic data-to-text software library. The technology and expertise
developed in SumTime and Babytalk are very much at the
core of Data2Text's activities. In addition to its commercial contracts
(which cannot be described in detail here because of commercial
confidentiality [S2,S3]), Data2Text was awarded a Smart Award from
Scottish Enterprise. The company currently employs 14 staff and has a
turnover of approximately £1M/year. In 2012 Data2Text formed a partnership
with Arria NLG, who acquired a minority shareholding in Data2Text [S1]. In
October 2013 Arria NLG acquired Data2Text, and in November 2013 an
application was made to the Alternative Investment Market (AIM) through
the London Stock Exchange for an Initial Public Offering for shares in
Arria NLG Plc. The expected size of this offer was £6.1M [S7], with a
likely valuation of £102M. Admission to AIM is expected in early December 2013.
Beyond the economic impact of creating the spinout company, NLG
technology is also having an impact both economically and on practitioners
and professional services through commercial partnerships between
Data2Text/Arria NLG and other organisations. The Arria NLG website refers
to a number of case studies [S6], and emphasises the impact
derived from the original research at the University of Aberdeen. Also see
the description of the November 2013 Initial Public Offering on the London
Stock Exchange [S7], which states that "[t]he Group's core product is
known as the Arria NLG Engine, which originates from research at the
University of Aberdeen". NLG products are currently developed through
Data2Text and Arria NLG in two key application areas: oil & gas and
weather forecasting. The companies are also exploring applications in
financial services and healthcare.
In oil & gas, NLG applications have been used to monitor alerts
produced by rotating equipment on oil platforms. Because this equipment
operates continuously throughout the year, any breakdown results in lost
production time for oil & gas operators. A Data2Text/Arria NLG system is being
used by a multinational oil company to automatically produce situation
analyses, based on data streams from turbines, compressors, pumps,
generators and engines, when a surveillance alert is triggered [S2].
Manually writing such analyses can take an experienced engineer several
hours; the NLG software does it in minutes. As mentioned above, this
system is inspired by the Babytalk research project; essentially Babytalk-inspired
ideas are used to monitor equipment on oil platforms instead of premature
infants. This has current and potential economic impacts on the
multinational oil company partner, Shell. Shell has stated that "by adding the
Arria NLG Engine to traditional surveillance technologies to monitor their
global rotating equipment assets, a one percentage point uptick in
production uptime could be achieved". Further, "if the Arria NLG Engine
can be added to all of their global production assets [...this...] could
equate to billions of dollars of increased production per year" [S2,S6].
The description of Arria NLG in the IPO made in November 2013 states that
the software "is being used to analyse the performance data of large scale
industrial machinery located on [...] Oil & Gas platforms in the Gulf
of Mexico, producing real-time written reports for engineers at the [...]
centre for surveillance of offshore operations."
Application of NLG technology in weather forecasting began as part of the
research programme in 2000-9 with the Aberdeen offices of Weathernews and
Aerospace and Marine International. Since 2009, this has extended to
collaboration with a leading weather service, the Met Office. To date the
NLG technology has been used to generate site specific, on-demand,
detailed weather forecasts. The technology is capable of preparing
detailed 3-day weather forecasts for 5,000 different locations in less
than one minute; a human forecaster would need about one and a half
months to produce the equivalent. "Right now, our NLG software tackles tasks that would
be impossible for forecasters to complete manually. It would take a
forecaster 1.5 months to create the equivalent of our system's one-minute
output" (detailed, 3-day weather forecasts for 5,000 different locations)
[S3,S6]. In this way, the Met Office can offer additional services to a
wide range of customers.
Researchers at the University of Aberdeen also created and released the
open-source data-to-text resource simplenlg. simplenlg is
a Java software library (currently in version 4.4) that performs some Natural
Language Generation processing (surface realisation and a small amount of
microplanning) [S8]. It was initially released purely for research use,
but since 2010 (from version 4.0) it has been updated with significant
enhancements and documentation and made available for commercial use on an
open-source basis (simplenlg.googlecode.com).
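Surface realisation, the main task the library addresses, means turning an abstract description of a sentence into grammatical text. The following minimal sketch illustrates the idea only; it is not the simplenlg API (which is a Java library), and the function names and the regular-verb-only morphology are assumptions made for illustration.

```python
def inflect(verb: str, tense: str) -> str:
    """Inflect a regular English verb for a third-person singular subject.

    Hypothetical helper: handles only regular verbs with simple endings
    (no irregular forms, no consonant doubling, no y -> ies).
    """
    if tense == "past":
        return verb + "d" if verb.endswith("e") else verb + "ed"
    if tense == "present":
        if verb.endswith(("s", "sh", "ch", "x", "z", "o")):
            return verb + "es"
        return verb + "s"
    if tense == "future":
        return "will " + verb
    raise ValueError(f"unsupported tense: {tense}")


def realise_clause(subject: str, verb: str, obj: str, tense: str = "present") -> str:
    """Combine subject, verb and object into a capitalised, punctuated sentence."""
    text = f"{subject} {inflect(verb, tense)} {obj}"
    return text[0].upper() + text[1:] + "."


print(realise_clause("the system", "generate", "a forecast", "past"))
print(realise_clause("the engineer", "check", "the turbine", "present"))
```

The point of separating the abstract clause from its realisation is that an application can decide what to say (subject, verb, object, tense) without handling inflection, capitalisation or punctuation itself.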
Since the release of version 4.2 (April 2011), almost 2,000 copies of simplenlg
have been downloaded, and an indicator of the increasing size of the user
community is the increase in download statistics through these versions:
version 4.2, 228; version 4.3, 733; and version 4.4, 1032. The simplenlg
library is currently being used by a number of commercial companies (as
well as academic groups) around the world, especially in
healthcare-related applications. A good example of an Open Source App that
uses simplenlg is the Augmentative and Alternative Communication
(AAC) Speech Communicator, "an Android application for people with speech
disabilities that forms sentences from a list of pictograms" [S9].
According to Google Play this App has been installed over 10,000 times,
and has an aggregate rating of 4/5 (25 reviews). Good examples of how simplenlg
has been integrated into products include Agfa, where it is used in a
cardiology clinical reporting application [S5], and Siemens, where it is
used in healthcare software sold to US hospitals [S4]. It is also being
used commercially for assistive technology; for example Technabling
(another Aberdeen Computing Science spinout) use it in their portable sign
language translator [S10], which is to be launched as a product in Autumn
2013.
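The download figures quoted above can be checked with a short calculation. The counts are taken directly from the text; the variable and function names are illustrative only.

```python
# Download counts per simplenlg version, as stated in the text.
downloads = {"4.2": 228, "4.3": 733, "4.4": 1032}


def total_downloads(counts: dict) -> int:
    """Sum the downloads across all listed versions."""
    return sum(counts.values())


def growth_ratios(counts: dict) -> list:
    """Ratio of each version's downloads to the previous version's."""
    values = list(counts.values())
    return [later / earlier for earlier, later in zip(values, values[1:])]


total = total_downloads(downloads)
print(total)  # 1993, consistent with "almost 2,000"
print([round(r, 2) for r in growth_ratios(downloads)])
```

Each release was downloaded more often than the one before it, which is the growth in the user community that the text refers to.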
Sources to corroborate the impact
[S1] President, Arria NLG — will corroborate the partnership between
Data2Text and Arria NLG, the integration of underpinning research in the
Arria NLG engine, and economic impact
(including staff, turnover and acquisition).
[S2] Senior Surveillance Engineering Specialist, Shell — will corroborate
the economic impact and impact on practice due to the use of Data2Text
technologies.
[S3] Head of Customer Applications, Met Office — will corroborate the
economic impact and impact on practice due to the use of Data2Text
technologies.
[S4] Software Developer, Siemens Medical Solutions — will corroborate the
impact on practice due to the use of Data2Text technologies.
[S5] Software Engineer, Agfa Healthcare — will corroborate the impact on
practice due to the use of Data2Text technologies.
[S6] Arria NLG case studies: https://www.arria.com/case-studies-A230.php
— will corroborate breadth of sectors in which underpinning research is
applied through the Arria NLG engine.
[S7] Initial Public Offering of Arria NLG: http://www.londonstockexchange.com/exchange/prices-and-markets/stocks/new-and-recent-issues/new-recent-issue-details.html?issueId=8923
- corroborates the fact that the impact originates from research at the
University of Aberdeen, the acquisition of Data2Text by Arria NLG and the
IPO being issued, the deployment of the software on Oil & Gas
platforms in the Gulf of Mexico, and the size of the economic impact made
through the commercialisation of this research.
[S8] Simplenlg website: http://simplenlg.googlecode.com/
(includes link to simplenlg discussion group: http://groups.google.com/group/simplenlg)
— will corroborate size of user community and engagement of that community
in using simplenlg in practice.
[S9] The AAC Speech Communicator (Open Source App that uses simplenlg):
http://aacspeech.org/ — will
corroborate the use of simplenlg in this App.
[S10] The portable sign language translator: http://www.pslt.org/
— will corroborate a commercial use of simplenlg.