Based in the School of English, the Research and Development Unit for English Studies (RDUES) conducts research in the field of corpus linguistics and develops innovative software tools to allow a wide range of external audiences to locate, annotate and use electronic data more effectively. This case study details work carried out by the RDUES team (Matt Gee, Andrew Kehoe, Antoinette Renouf) in building large-scale corpora of web texts, from which examples of language use have been extracted, analysed, and presented in a form suitable for teaching and research across and beyond HE, including collaboration with commercial partners.
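As an illustration of the extraction step described above, the sketch below builds a minimal keyword-in-context (KWIC) concordance, the standard way corpus linguists present examples of language use drawn from a corpus. The tokeniser, window size and sample text are illustrative assumptions, not the RDUES team's actual pipeline.

```python
# Minimal keyword-in-context (KWIC) concordancer: a sketch of the kind of
# example-extraction step described above. Window size and sample text are
# invented for demonstration.
import re

def kwic(text, keyword, window=4):
    """Return (left context, keyword, right context) tuples for each hit."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

corpus = ("Web texts provide examples of language use. "
          "Language change is visible in web data.")
for left, kw, right in kwic(corpus, "language"):
    print(f"{left:>30} [{kw}] {right}")
```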
The Statistical Cybermetrics Research Group (SCRG) has developed web-based indicators and methods for use in research policy and research evaluation for governmental bodies and non-governmental organisations. The research has impact by providing policy makers and decision makers with tools and new types of indicators for policy-relevant evaluations. The research itself includes (a) the direct production and implementation of new indicators and (b) theoretical research into indicator foundations and tool performance, such as that of the web search engines used for indicator construction. The research has impact on policy making within the United Nations Development Programme by aiding evaluations of its initiatives, and within Oxfam and the BBC World Service Trust. It has impact on policy making at the national and international levels by helping to direct funding effectively towards knowledge production. It also has impact on public services by helping Nesta and Jisc to evaluate the success of some of their initiatives.
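To make the idea of a web-based indicator concrete, here is a minimal sketch of one classic webometric measure, the Web Impact Factor (links pointing at a site divided by the pages it hosts). The site names and counts are invented for demonstration; real studies obtain such counts from crawlers or from the web search engines whose performance the group has also evaluated.

```python
# Illustrative webometric indicator: the Web Impact Factor (WIF), the ratio
# of inlinks to pages for a website. All counts below are made up.

def web_impact_factor(inlink_count: int, page_count: int) -> float:
    if page_count == 0:
        raise ValueError("site has no indexed pages")
    return inlink_count / page_count

# Hypothetical counts for two fictional university websites.
sites = {"univ-a.example": (12_400, 3_100),
         "univ-b.example": (9_800, 1_400)}
for site, (inlinks, pages) in sites.items():
    print(f"{site}: WIF = {web_impact_factor(inlinks, pages):.2f}")
```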
COnnecting REpositories (CORE) is a system for aggregating, harvesting and semantically enriching documents. As of July 2013, CORE contained more than 15 million open access research papers from worldwide repositories and journals, on any topic and in more than 40 languages, and in that month recorded over 500,000 visits from more than 90,000 unique visitors. By processing both full text and metadata, CORE serves four communities: researchers searching for research materials; repository managers needing analytical information about their repositories; funders wanting to evaluate the impact of funded projects; and developers of new knowledge-mining technologies. The CORE semantic recommender has been integrated with digital libraries and repositories of cultural institutions, including the European Library and UNESCO. CORE has been selected as the metadata aggregator for the UK's national open access services.
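Aggregators such as CORE typically harvest repository metadata over OAI-PMH, the standard protocol that open access repositories expose. The sketch below shows a minimal harvest of Dublin Core records; the endpoint URL is a placeholder, and the ListRecords verb, oai_dc prefix and XML namespaces are the standard OAI-PMH ones rather than anything CORE-specific.

```python
# Sketch of metadata harvesting over OAI-PMH, the standard protocol used by
# aggregators to collect records from repositories. The base URL below is a
# hypothetical placeholder.
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.org/oai"           # placeholder
OAI = "{http://www.openarchives.org/OAI/2.0/}"            # OAI-PMH namespace
DC = "{http://purl.org/dc/elements/1.1/}"                 # Dublin Core namespace

url = f"{BASE_URL}?verb=ListRecords&metadataPrefix=oai_dc"
with urllib.request.urlopen(url) as resp:
    tree = ET.parse(resp)

# Print the Dublin Core title of each harvested record.
for record in tree.iter(f"{OAI}record"):
    title = record.find(f".//{DC}title")
    if title is not None:
        print(title.text)
```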
Research carried out at Sussex into the automatic grammatical analysis of English text has enabled and enhanced a range of commercial text-processing applications and services. These include an automatic SMS question-answering service and a computer system that grades essays written by learners of English as a second language. Over the REF period there has been substantial economic impact on a spin-out company, whose viability has been established through revenue of around £500k from licensing, development and maintenance contracts for these applications.
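Automatic grammatical analysis of the kind described can be illustrated with a toy context-free grammar and NLTK's chart parser, as below. This is a deliberately simplified stand-in for demonstration, not the robust statistical parsing technology developed at Sussex.

```python
# Toy illustration of automatic grammatical analysis: parsing a sentence
# with a hand-written context-free grammar. Real systems of the kind
# described above use broad-coverage statistical grammars instead.
import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the' | 'a'
    N  -> 'system' | 'essay'
    V  -> 'grades'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the system grades the essay".split()):
    tree.pretty_print()
```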
Biak (West Papua, Indonesia) is an endangered language with no previously established orthography. Dalrymple and Mofu's ESRC-supported project created the first online database of digital audio and video Biak texts with linguistically analysed transcriptions and translations (one of the first ever for an endangered language), making these materials available for future generations and aiding the sustainability of the language. Biak schoolchildren can now use educational materials, including dictionaries, based on project resources. The project also trained local researchers in best practice in language documentation, enabling others to replicate these methods and empowering local communities to save their own endangered languages.
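As a hypothetical sketch of what one record in such a database might hold, the structure below pairs media files with an interlinearly glossed transcription and a free translation. The field names are assumptions for illustration, not the project's actual schema.

```python
# Hypothetical record structure for a documented text in a language archive.
# Field names are illustrative assumptions, not the project's real schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GlossedLine:
    transcription: str   # utterance in the established Biak orthography
    gloss: str           # morpheme-by-morpheme linguistic analysis
    translation: str     # free translation of the line

@dataclass
class DocumentedText:
    text_id: str
    audio_file: str                    # path or URL to the audio recording
    video_file: Optional[str] = None   # not every text has video
    lines: list[GlossedLine] = field(default_factory=list)
```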
Extracting information and meaning from natural language text is central to a wide variety of computer applications, ranging from social media opinion mining to the processing of patient health-care records. Sentic Computing, pioneered at the University of Stirling, underpins a unique set of related tools for incorporating emotion and sentiment analysis in natural language processing. These tools are being employed in commercial products, with reported improvements of up to 20% in the accuracy of textual analysis, matching or even exceeding human performance (Zoral Labs). Current applications include social media monitoring as part of a web content management system (Sitekit Solutions Ltd), personal photo management systems (HP Labs India) and patient opinion mining (Patient Opinion Ltd). Impact has also been achieved through direct collaboration with other commercial partners such as Microsoft Research Asia, TrustPilot and Abies Ltd. Moreover, international organisations such as the Brain Sciences Foundation and the A*Star Institute for High Performance Computing have realised major impact by drawing on our research.
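As a rough illustration of what sentiment analysis produces, the sketch below scores the polarity of short texts with NLTK's VADER analyser. This is a plainly simpler, lexicon-based stand-in; Sentic Computing itself goes further, reasoning at the level of concepts and emotions rather than word-level cues.

```python
# Simple polarity scoring with NLTK's VADER sentiment analyser, shown as a
# basic stand-in for the richer concept-level analysis described above.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
for opinion in ["The staff were wonderful and attentive.",
                "I waited three hours and nobody apologised."]:
    scores = sia.polarity_scores(opinion)
    # 'compound' ranges from -1 (most negative) to +1 (most positive).
    print(f"{scores['compound']:+.2f}  {opinion}")
```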
Worldwide impact on language learners and others has been generated by the development at Lancaster of a ground-breaking natural language processing tool (CLAWS4) and an associated unique collection of natural language data (the British National Corpus, or BNC).
The pathways to impact have been primarily via consultancy and via licensing of software IP. The impact itself is largely on language learners, i.e. users of products built on these resources. There is a secondary economic impact on a UK SME which has licensed our software.
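The BNC was part-of-speech tagged with CLAWS, and the resulting CLAWS C5 tags can be read with NLTK's BNC corpus reader, as in the sketch below. The local path and fileid pattern are assumptions; the BNC XML edition must be obtained separately under its own licence.

```python
# Reading CLAWS-tagged BNC data with NLTK's BNC corpus reader. The root
# path and fileid pattern below are assumptions; adjust them to wherever
# a local copy of the BNC XML edition is installed.
from nltk.corpus.reader.bnc import BNCCorpusReader

bnc = BNCCorpusReader(root="/data/BNC/Texts",        # hypothetical path
                      fileids=r"[A-K]/\w*/\w*\.xml")

# c5=True returns the CLAWS C5 part-of-speech tags assigned to the corpus.
for word, c5_tag in bnc.tagged_words(c5=True)[:10]:
    print(f"{word}\t{c5_tag}")
```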
GATE (General Architecture for Text Engineering; see http://gate.ac.uk/) is an experimental apparatus, R&D platform and software suite with very wide impact in society and industry. There are many examples of applications: the UK National Archives uses it to provide sophisticated search mechanisms over its .gov.uk holdings; Oracle includes it in its semantics offering; Garlik Ltd. uses it to mine the web for data that might lead to identity theft; Innovantage uses it in intelligent recruiting products; Fizzback uses it for customer feedback analysis; the British Library uses it for environmental science literature indexing; and the Stationery Office uses it for value-added services on top of its legal databases. It has been adopted as a fundamental piece of web infrastructure by major organisations such as the BBC, Euromoney and the Press Association, enabling them to integrate huge volumes of data with up-to-the-minute currency at an affordable cost, delivering cost savings and new products.
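The sketch below illustrates the kind of annotation pipeline GATE provides (tokenise, tag parts of speech, recognise named entities), written here with NLTK as a stand-in rather than GATE's own Java API.

```python
# A GATE-style annotation pipeline sketched with NLTK: tokenise, tag, then
# recognise named entities. NLTK stands in here for GATE's Java API.
import nltk

for pkg in ("punkt", "averaged_perceptron_tagger",
            "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = "Garlik Ltd. mines the web for data held by the British Library."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

# Pull out the entity annotations, much as a pipeline would emit them.
for subtree in tree.subtrees(lambda t: t.label() != "S"):
    entity = " ".join(word for word, tag in subtree.leaves())
    print(subtree.label(), "->", entity)
```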
Professor David Crystal's world-leading research on language policy, diversity and usage, conducted at Bangor since 2000, has transformed public and political attitudes, both nationally and internationally, towards the nature and use of language in public and private discourse. In particular, the research has led, since 2008, to an increased awareness of linguistic diversity, changes to governmental policies on language, and the development of the world's first targeted online advertising technology, which today indexes billions of impressions across 11 languages to provide real-time data services in the emerging online advertising world.
The Natural Language Toolkit (NLTK) is a widely adopted Python library for natural language processing, run as an open source project. Its three project leaders, Steven Bird (University of Melbourne), Edward Loper (BBN, Boston) and Ewan Klein (University of Edinburgh), provide the project's strategic direction.
NLTK has been widely used in academia, in commercial and non-profit organisations, and in public bodies, including Stanford University and the Educational Testing Service (ETS), which administers widely recognised tests across more than 180 countries. NLTK has played an important role in making core natural language processing techniques easy to grasp, easy to integrate with other software tools, and easy to deploy.
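A minimal session showing the ease of use the summary refers to, assuming the relevant NLTK data packages can be downloaded: tokenisation, stopword filtering and a frequency count.

```python
# A minimal NLTK session: tokenise a text, drop stopwords, count the rest.
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist

for pkg in ("punkt", "stopwords"):
    nltk.download(pkg, quiet=True)

text = ("The Natural Language Toolkit makes natural language "
        "processing techniques easy to learn and easy to deploy.")
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
content = [t for t in tokens if t not in stopwords.words("english")]
print(FreqDist(content).most_common(5))
```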