UCREL (the University Research Centre for Computer Corpus Research on Language) has been pioneering advances in corpus linguistics for over 40 years, providing users with corpora (collections of written or spoken material) and the software to exploit them. Drawing together 8 researchers from the Department of Linguistics and English Language and 1 from the School of Computing and Communications at Lancaster University, it has enabled the UK English Language Teaching (ELT) industry to produce innovative materials which have helped the profitability and competitiveness of that industry, and assisted other, principally commercial, users to innovate in product design and development.

Underpinning research

The British National Corpus (BNC) Consortium worked from 1991-1994 to produce a 100-million word corpus of modern British English, for use in commercial, educational and academic research. The BNC contains 90% written and 10% spoken text, representing a wide cross-section of current British English. UCREL's work on the BNC was conducted by a team of 15-20 researchers led by Prof Geoffrey Leech (emeritus since 2002) and Roger Garside (senior lecturer, retired 2008) [R1]. The BNC consortium included Oxford University Press, Longman, Chambers Harrap, the British Library, and Oxford University Computing Services (OUCS). The corpus was published in 1994, and a revised version was released worldwide in 2001. The BNC XML Edition (2007) is the version currently distributed by OUCS.

A major contribution of UCREL to the BNC was the development of (a) many levels of corpus annotation — interpretative information (e.g. syntactic, semantic, pragmatic, anaphoric) in the form of searchable codes or tags; and (b) software which can (semi-)automatically add annotation to a corpus [R2]. This annotation allows users to take advantage of the levels of meaning in a corpus, as well as the levels of form. For example, "set" as a noun has rather different meanings from "set" as a verb, so limiting a search by part-of-speech makes it possible to tap into different groups of meanings. The BNC was grammatically annotated at Lancaster University [G1, G2] using CLAWS (Constituent Likelihood Automatic Word-tagging System), which UCREL has developed continuously since the early 1980s. CLAWS consistently achieves 96-97% accuracy across various types of English text.
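The "set" example above can be sketched in a few lines of Python. This is only an illustration of why part-of-speech annotation makes a search more selective; it is not CLAWS, and it does not use the BNC itself. The toy sentences, the `search` function and the variable names are all hypothetical, and the tags are hand-assigned in the style of the C5 tagset rather than produced by any tagger.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Hand-tagged toy sentences: hypothetical data, not CLAWS output.
# Tags imitate the C5 style (NN1 = singular noun, VVD = past-tense verb).
TAGGED_CORPUS: List[List[Token]] = [
    [("She", "PNP"), ("bought", "VVD"), ("a", "AT0"), ("chess", "NN1"), ("set", "NN1")],
    [("They", "PNP"), ("set", "VVD"), ("the", "AT0"), ("table", "NN1")],
    [("The", "AT0"), ("set", "NN1"), ("was", "VBD"), ("complete", "AJ0")],
]


def search(corpus: List[List[Token]], word: str, tag_prefix: str = "") -> List[str]:
    """Return each sentence in which `word` occurs with a tag starting with `tag_prefix`.

    An empty prefix matches every tag, i.e. a plain word-form search.
    """
    hits = []
    for sentence in corpus:
        if any(w.lower() == word.lower() and t.startswith(tag_prefix)
               for w, t in sentence):
            hits.append(" ".join(w for w, _ in sentence))
    return hits


noun_hits = search(TAGGED_CORPUS, "set", "NN")  # "set" as a noun only
verb_hits = search(TAGGED_CORPUS, "set", "VV")  # "set" as a lexical verb only
all_hits = search(TAGGED_CORPUS, "set")         # any occurrence of the form "set"
```

A search on the bare word form returns all three sentences, whereas restricting by tag prefix separates the noun uses from the verb use, which is the distinction between form and meaning that the annotation is intended to expose.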

The experience of creating and annotating the BNC fed into UCREL's central role in establishing cross-linguistic standards for corpora and multiple levels of corpus annotation, as seen in Leech's leading contribution (with researcher Andrew Wilson — now senior lecturer) to the European Commission's 1993-1999 EAGLES initiative [R3].

From inception to completion, UCREL's research has been shaped by engagement with non-academic users, especially in education. Our research underpinning dictionaries, grammars [R4] and learning materials has involved collaboration with publishers including Cambridge University Press, Oxford University Press and Pearson. More recently, we have also worked with companies delivering multilingual dictionary content on new media, notably mobile phones [G4], and with Trinity to help them develop their language testing business [G6].

From the late 1990s, UCREL researchers led by Tony McEnery (lecturer, now Professor) have also extended best practices developed while working with the BNC to corpora of other languages. Notable amongst these are (i) the EMILLE corpus, containing written and spoken data for fourteen South Asian languages [R5][G3] (developed by McEnery with Baker, lecturer now Professor, and Hardie, researcher now senior lecturer) and (ii) the Lancaster Corpus of Mandarin Chinese (developed by McEnery with Xiao, researcher now lecturer), which extends the model of carefully representative corpus design to Mandarin [R6]. The Lancaster Edition of the Callhome Chinese Corpus is one of very few openly available corpora of spoken Chinese. More recently, UCREL contributed expertise (McEnery and Hardie) to the creation and annotation of the Nepali National Corpus [G5].

Details of the impact

Corpora developed at Lancaster have had a notable impact on the English Language Teaching (ELT) industry. Most of this impact has been achieved via the BNC. The UK ELT industry has been estimated by the Department for Business, Innovation and Skills to be worth £1,996.2 million annually to the British economy while the ELT publishing industry adds another £304 million to that sum [C1]. The impact of our work on ELT is substantial and includes:

  1. The Oxford Advanced Learner's Dictionary (OALD), a major basis for which is the BNC, has sold over 35 million copies worldwide and is the world's best-selling advanced learner's dictionary. It remains the best-selling title at Oxford University Press [C2]. OALD is now available as an app, website and CD-ROM and is in its 8th edition, with two million (hard) copies sold since the edition's 2010 release. The BNC is also used as a basis for OUP's web-based teaching resources.
  2. All Longman dictionaries are compiled using the Longman Corpus Network (which includes the BNC and the Lancaster/Longman corpus). This corpus network has been massively influential across the full range of Longman's ELT material, including their learner dictionaries and Longman Language Activator series [C3]. Longman see the corpora as essential to their range, saying that they provide "the wealth of information for writing coursebooks and dictionaries that both accurately represent the English language and satisfy students' needs at every level." [C4] Longman-Pearson's ELT products are important to the company: in 2012 they led to its International Education division being the source of both its (i) highest growth in sales (10%) and (ii) greatest increase in operating profit (21%) [C5]. Between 1980 and 2011 Geoffrey Leech was the Vice Chair of Pearson's LINGLEX advisory group, which advises Longman-Pearson on their ELT research and publication strategy.
  3. The BNC was used by Cambridge University Press to update its Cambridge ESOL Business English, Key English and Preliminary English Test Vocabulary List. These lists are vital preparation for students taking the Cambridge English exams [C6], an important part of the UK ELT industry — 2010 saw entries for the Cambridge English exams grow to nearly 3.5 million, with more than 11,500 universities, employers and government departments worldwide recognising and using Cambridge English qualifications.

Fostering innovation in other, principally commercial, users, including:

  1. The BNC is licensed to 1,581 institutions, including the following non-academic users: Toyota Motor Europe, Budweiser Budvar UK, Deutsche Bank, Canon Information Technology (Beijing) Co., BAE Systems and Ordnance Survey. Additionally 1,811 users access the BNC via Lancaster's online interface, BNCweb. 192 of these users originate from non-academic bodies, including Microsoft, Sony, HSBC and the British Council.
  2. In 2012 the Indian government launched a series of bilingual dictionaries which couple English with one of the following South Asian languages: Bengali, Hindi, Kannada, Malayalam, Oriya and Tamil. The 12,000 words and phrases covered by the dictionaries have been culled from the BNC. The South Asian language examples are furnished by corpora provided by the Central Institute of Indian Languages, who were helped in building the corpora by the Enabling Minority Language Engineering project at Lancaster. [G3][C7]
  3. Kielikone Oy, a Finnish company, has customers in over 100 countries and is one of the world's leading providers of digital dictionary solutions [C8]. Kielikone uses the semantic taggers developed by Lancaster in almost all of its current services, including its MOT dictionary [C9]. This software has also been utilised in mobile applications of the dictionary [C10].
  4. McEnery and Hardie have recently undertaken contracted work for RIM, manufacturers of the BlackBerry device, to solve specific language-related problems (further details cannot be disclosed due to a commercial confidentiality agreement).
  5. The Nelralec project [G5] has had an impact parallel to these commercial impacts, producing a popular dictionary of Nepali. Given that Nepal is a developing country, this activity has been third-sector, not commercial.
  6. The Lancaster Edition of the Callhome Chinese Corpus has been distributed via the Linguistic Data Consortium to industrial research laboratories between 2009 and 2011 to enable the development of Natural Language Processing involving Chinese (e.g. machine translation). These users are based around the world and include: Autonomy Systems Ltd., UK; IBM, USA; NTT Communication Science Laboratories, Japan; and Loquendo SpA, Italy.

