Corpora for Academic English
Improve language use through study of authentic texts

This page outlines some important corpora (plural of corpus) for academic English, which have been used elsewhere on the site. It begins with a brief explanation of what a corpus is, then gives an overview of three corpora: BNC Baby (a subsection of the BNC, i.e. the British National Corpus), BAWE (British Academic Written English corpus), and BASE (British Academic Spoken English corpus).


What is a corpus?

A corpus (plural corpora) is a collection of authentic texts, usually taken from a wide range of sources in order to give a representative sample of language. Texts can be written or spoken, and can be academic (reports, essays, lectures, seminars) or non-academic (fiction, recipes, tweets).


Corpora are usually large (commonly millions of words) and therefore require computer analysis. This is most often done using a concordancer, which allows users to view words in context and to extract information about frequency, range (i.e. how many different texts a word is used in), collocation (word combinations) and grammar. This allows students, teachers and researchers to make decisions based on actual usage, rather than relying on intuition.
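
The kinds of information a concordancer extracts can be illustrated with a minimal Python sketch. This is not the concordancer used on this site or any particular tool: the function names, the simple tokeniser and the three sample sentences are all assumptions made up for illustration, and real tools handle tokenisation, lemmas and grammar far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens (a deliberately crude tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(texts, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens either side of each hit."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def frequency(texts, keyword):
    """Total number of times the keyword occurs across all texts."""
    return sum(tokenize(text).count(keyword) for text in texts)


def text_range(texts, keyword):
    """Number of different texts in which the keyword occurs at least once."""
    return sum(1 for text in texts if keyword in tokenize(text))


def collocates(texts, keyword, window=2):
    """Count the words appearing within `window` tokens of the keyword."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                counts.update(tokens[max(0, i - window):i])
                counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Invented sample "corpus" of three very short texts (illustration only).
sample_texts = [
    "The data suggest that the method is reliable.",
    "These data were collected in 2005. The data are incomplete.",
    "No figures were reported in this study.",
]

for line in concordance(sample_texts, "data"):
    print(line)
print("frequency:", frequency(sample_texts, "data"))
print("range:", text_range(sample_texts, "data"))
print("top collocates:", collocates(sample_texts, "data").most_common(3))
```

In this toy example the word "data" occurs three times (frequency) but in only two of the three texts (range), which shows why the two measures are reported separately.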

BNC Baby

The BNC Baby is a four-million-word sample taken from the 100-million-word BNC (British National Corpus), and is divided into four parts: academic writing, imaginative writing (i.e. fiction), spoken conversation and newspaper texts. The sample texts were chosen so that each sub-corpus was approximately equal in size, i.e. each contains approximately one million words.


The academic corpus consists of 30 texts, chosen randomly from different subject areas, and comprises journal articles as well as material from books.


The fiction corpus comprises 25 texts taken from books published between 1985 and 1994, written for an adult audience.


The spoken corpus contains 30 spoken texts, with a range of speakers in different situations. Just over half of the texts are for speakers aged 25-44, with the younger age range 0-24 comprising just under 20%, and the older age range of 45 and over comprising just under 30%. 59% of the speakers are female, 41% male.


The news section of the BNC Baby comprises a mix of national newspapers (60%) and local newspapers (40%), covering a wide range of topics and a range of dates in order to maximise variation. Because news articles are shorter than other types of text, the number of texts is significantly larger (97 texts in total).


All four sections of the BNC Baby are used in the word profiler in order to compare frequencies across different types of text. The academic sub-corpus of the BNC Baby is used in the concordancer.
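
Comparing frequencies across text types is usually done with normalised rather than raw counts, so that sub-corpora of slightly different sizes can be compared fairly. The Python sketch below shows one common normalisation, occurrences per million words. It is an assumption about how such a comparison might be made, not a description of how the word profiler on this site works, and the function names and toy token lists are invented for illustration.

```python
def per_million(count, corpus_size):
    """Scale a raw count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


def compare(word, sub_corpora):
    """Map each sub-corpus name to the word's frequency per million tokens."""
    return {
        name: per_million(tokens.count(word), len(tokens))
        for name, tokens in sub_corpora.items()
    }


# Invented toy sub-corpora (a few tokens each); real sub-corpora of the
# BNC Baby contain approximately one million words each.
toy = {
    "academic": "the results indicate that the hypothesis holds".split(),
    "fiction": "she said that the door was open and left".split(),
}

for name, rate in compare("the", toy).items():
    print(f"{name}: {rate:.0f} per million")
```

With sub-corpora of roughly one million words each, as in the BNC Baby, raw and per-million counts are close, but normalising still removes the effect of small differences in size.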


BAWE (British Academic Written English) corpus

The British Academic Written English (BAWE) corpus was developed at the Universities of Warwick, Reading and Oxford Brookes under the directorship of Hilary Nesi, with Sheena Gardner, Paul Thompson and Paul Wickens, and funding from the ESRC (RES-000-23-0800).


The BAWE contains 2,761 pieces of proficient assessed student writing (6,506,995 words), which range in length from 500 to 5,000 words (average length 2,357 words). Texts were written between 2000 and 2007.


The writing is evenly distributed across four levels of study (i.e. three years of undergraduate study and taught master's level study). It covers a total of 35 academic disciplines within four disciplinary areas, namely Arts and Humanities, Social Sciences, Life Sciences, and Physical Sciences. The texts are evenly distributed among those four areas.


The full corpus can be downloaded from the Oxford Text Archive (download link in the References below).

BASE (British Academic Spoken English) corpus

The British Academic Spoken English (BASE) corpus was developed at the Universities of Warwick and Reading, under the directorship of Hilary Nesi, with Paul Thompson, and funding from the Arts and Humanities Research Board (RE/AN6806/APN13545).


The BASE is a spoken companion to the BAWE and consists of 160 lectures and 39 seminars (1,644,942 words) recorded in a variety of university departments. As with the BAWE, texts are evenly distributed across four disciplinary areas, namely Arts and Humanities, Life and Medical Sciences, Physical Sciences, and Social Studies and Sciences. Each area contributes 40 lectures and 10 seminars, except for Physical Sciences, which has 9 seminars.


The full corpus can be downloaded from the Oxford Text Archive (download link in the References below).


References

Coventry University (n.d.a) (BASE) British Academic Spoken English Corpus. Available from: https://www.coventry.ac.uk/base (Access date: 9 October, 2022).


Coventry University (n.d.b) (BAWE) British Academic Written English Corpus. Available from: https://www.coventry.ac.uk/bawe (Access date: 9 October, 2022).


Lancaster University (2011) Part 1: Corpus Linguistics. Available from: http://corpora.lancs.ac.uk/clmtp/main-1.php (Access date: 9 October, 2022).


Lancaster University (n.d.) Unit 1 Corpus linguistics: the basics. Available from: https://www.lancaster.ac.uk/fass/projects/corpus/ZJU/xCBLS/chapters/A01.pdf (Access date: 9 October, 2022).


Lund University (2021) What is a corpus? Available from: https://www.awelu.lu.se/language/corpora-resources-for-writer-autonomy/what-is-a-corpus/ (Access date: 9 October, 2022).


Oxford Text Archive (2019a) British Academic Spoken English Corpus [ox.ac.uk]. Available from: https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2525 (Access date: 9 October, 2022).


Oxford Text Archive (2019b) British Academic Written English Corpus [ox.ac.uk]. Available from: https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2539 (Access date: 9 October, 2022).






Author: Sheldon Smith    ‖    Last modified: 19 October 2022.

Sheldon Smith is the founder and editor of EAPFoundation.com. He has been teaching English for Academic Purposes since 2004. Find out more about him in the about section and connect with him on Twitter, Facebook and LinkedIn.


