Dataset

Word Frequencies in Written & Spoken English from British National Corpus (100M-word)

Added By mrflip

by Geoffrey Leech, Paul Rayson, Andrew Wilson

Books of English word frequencies have in the past suffered from severe limitations of sample size and breadth. They have also tended to be restricted to word forms alone. Most importantly, almost all have dealt only with written language. This book overcomes these limitations. It is derived from the British National Corpus – a 100,000,000-word electronic databank sampled from the whole range of present-day English, spoken and written – and makes use of the grammatical information that has been added to each word in the corpus.

Key features:

  • Includes frequencies for present-day speech (including everyday conversation) as well as for writing
  • Rank-ordered and alphabetical frequency lists for the whole corpus and for various subdivisions: e.g. informative vs. imaginative writing, conversational vs. other varieties of speech
  • Entries take account of grammatical parts of speech (e.g. round as a preposition is listed separately from round as an adjective)
  • Includes discussions of a number of thematic frequency lists such as colour terms, female vs. male terms, etc.

Here we provide plain text versions of the frequency lists contained in the book “Word Frequencies in Written and Spoken English: based on the British National Corpus”. These are raw, unedited frequency lists produced by our software and do not contain the many additional notes supplied in the book itself. The lists are tab-delimited plain text, so they can be imported into your preferred spreadsheet format. For the main lists we provide a key to the columns. More details on the process undertaken in the preparation of the lists can be found in the introduction to the book.
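Because the lists are tab-delimited, they can also be read with a few lines of code instead of a spreadsheet. The following is a minimal Python sketch, not the authors' tooling. The column names (`word`, `pos`, `per_million`), the helper `read_frequency_list`, and the sample rows are all assumptions made for illustration; the real column layout is given by the key supplied with each list and may differ.

```python
import csv
import io

# Made-up sample rows for illustration only; these values are NOT quoted
# from the lists. The real column order comes from each list's key.
SAMPLE = "cat\tNoC\t50\ndog\tNoC\t100\n"


def read_frequency_list(handle, columns):
    """Read tab-delimited rows into dicts keyed by the given column names."""
    # QUOTE_NONE keeps apostrophes and quotation marks in word forms intact.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        # Skip blank lines and rows made up only of whitespace.
        if not fields or all(not f.strip() for f in fields):
            continue
        rows.append(dict(zip(columns, (f.strip() for f in fields))))
    return rows


# Hypothetical column names; adjust them to match the key for your list.
rows = read_frequency_list(io.StringIO(SAMPLE), ["word", "pos", "per_million"])
for row in rows:
    row["per_million"] = float(row["per_million"])
```

To read a downloaded list instead of the in-memory sample, pass an open file handle, for example `open(path, encoding="utf-8", newline="")`, where `path` and the encoding are again assumptions to check against the actual file.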

These lists are licensed under a Creative Commons Attribution-Share Alike 2.0 UK: England & Wales License.

The frequency data is based on the British National Corpus. The BNC project was carried out and is managed by an industrial/academic consortium led by Oxford University Press. The other members are the major dictionary publishers Addison-Wesley Longman and Larousse Kingfisher Chambers; the academic research centres at Oxford University Computing Services and Lancaster University’s Centre for Computer Corpus Research on Language; and the British Library’s Research and Innovation Centre.

These lists show dispersion values ranging between 0 and 1, rather than between 0 and 100 as in the book. For reasons of space, the book shows each dispersion value multiplied by 100 and rounded to zero decimal places. Log likelihood values are shown here to one decimal place, rather than to zero decimal places as in the book.
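To compare a value from these lists with the printed book, the conversion described above can be sketched as follows. The function names are hypothetical, and the handling of exact halves is an assumption: Python's `round` rounds halves to the nearest even number, which may not match the rounding used when the book was prepared.

```python
def to_book_dispersion(dispersion):
    """Convert a 0-1 dispersion value to the book's 0-100 whole-number scale."""
    if not 0.0 <= dispersion <= 1.0:
        raise ValueError("dispersion must lie between 0 and 1")
    # Multiply by 100, then round to zero decimal places.
    return int(round(dispersion * 100))


def to_book_log_likelihood(log_likelihood):
    """Round a one-decimal log likelihood value to zero decimal places."""
    return int(round(log_likelihood))
```

For example, a dispersion of 0.97 in these lists corresponds to 97 in the book.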

Please note, all frequencies are per million words. There are some extra notes explaining the dummy values (:, @, and %) in the lemmatised lists.
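Since the frequencies are normalised per million words, an approximate raw count can be recovered by scaling by the size of the text sample. The sketch below assumes the nominal 100,000,000-word size quoted for the whole corpus; the function name is hypothetical, and the result is only an estimate because the published values are rounded and the true corpus size is not exactly the nominal figure.

```python
# Nominal whole-corpus size as quoted in the description (an approximation).
CORPUS_SIZE_WORDS = 100_000_000


def estimated_count(per_million, corpus_size=CORPUS_SIZE_WORDS):
    """Estimate a raw occurrence count from a per-million-words frequency."""
    return per_million * corpus_size / 1_000_000
```

A word listed at 25 per million would therefore occur roughly 2,500 times in the whole corpus. For lists covering only part of the corpus (for example speech or writing alone), pass that part's size as `corpus_size`; those sizes are not stated here.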

CHAPTER 1: Frequencies in the Whole Corpus (Spoken and Written English)

  • List 1.1: Alphabetical frequency list of the whole corpus (lemmatized)
  • List 1.2: Rank frequency list for the whole corpus (not lemmatized)

CHAPTER 2: Spoken and Written English

  • List 2.1: Alphabetical frequency list: speech v. writing (lemmatized)
  • List 2.2: Rank frequency order: spoken English (not lemmatized)
  • List 2.3: Rank frequency order: written English (not lemmatized)
  • List 2.4: Distinctiveness list: contrasting speech and writing (ordered by log likelihood)

CHAPTER 3: Two Main Varieties of Spoken English Compared

  • List 3.1: Alphabetical frequency list: conversational v. task-oriented speech (lemmatized)
  • List 3.2: Distinctiveness list: contrasting conversational v. task-oriented speech (not lemmatized)

CHAPTER 4: Two Main Varieties of Written English Compared

  • List 4.1: Alphabetical frequency list: imaginative v. informative writing (lemmatized)
  • List 4.2: Distinctiveness list: imaginative v. informative writing (not lemmatized)

CHAPTER 5: Rank Frequency Lists of Words within Word Classes (Parts of Speech) in the Whole Corpus

  • List 5.1: Frequency list of nouns (by lemma)
  • List 5.2: Frequency list of verbs (by lemma)
  • List 5.3: Frequency list of adjectives (by lemma)
  • List 5.4: Frequency list of adverbs (not lemmatized)
  • List 5.5: Frequency list of pronouns (not lemmatized)
  • List 5.6: Frequency list of determiners
  • List 5.7: Frequency list of determiner/pronouns
  • List 5.8: Frequency list of prepositions
  • List 5.9: Frequency list of conjunctions
  • List 5.10: Frequency list of interjections and discourse particles

CHAPTER 6: Frequency Lists of Grammatical Word Classes (based on the Sampler Corpus)

  • List 6.1.1: Alphabetical list: the whole sampler corpus (spoken and written English)
  • List 6.1.2: Rank frequency list: the whole sampler corpus
  • List 6.2.1: Alphabetical list: spoken v. written English
  • List 6.2.2: Rank frequency list: spoken English compared with written English
  • List 6.2.3: Rank frequency list: written English compared with spoken English
  • List 6.2.4: Distinctiveness list: spoken v. written English
  • List 6.3.1: Alphabetical list: conversation v. task-oriented speech
  • List 6.3.2: Distinctiveness list: conversation v. task-oriented speech
  • List 6.4.1: Alphabetical list: imaginative v. informative writing
  • List 6.4.2: Distinctiveness list: imaginative v. informative writing

Geoffrey Leech, Paul Rayson, Andrew Wilson (2001) pp. 320, Longman, London. ISBN 0582-32007-0 (Paperback)

License

Creative Commons SA
