CommGap International Language Services


The Corpus: A Great Resource for Linguists

May 6th, 2014

If you’re a translator or an interpreter, you need to be an expert on how people use words every day. In many cases, simple accuracy in your translation is not enough; you must carefully choose your words according to current and common usage. Even though several words or word combinations might have similar meanings (like eccentric, strange, and quirky), they differ in the connotations they carry, the register they suit, and the tone they convey.


While usage dictionaries help in many cases, there is another great tool on the Internet designed to help linguists like you know how words are being used. It’s called a corpus. Latin for “body,” a corpus is essentially a collection of text that can be analyzed statistically. Let’s say you’re doing a translation for a newspaper, and you need to know whether newspapers say “to try to…” or “to try and…” more frequently. What could help you decide which to use, aside from your gut feeling? Or how about “sheer/utter brilliance” versus “sheer/utter nonsense”? You could sit down and read all the newspapers written in the last five or so years, tally what you find, and then use the combination that appears most frequently. Obviously, that option is impractical.


This is where a corpus comes in handy. You can search for these combinations and compare how often each one appears, either overall or within a particular type of source. The Corpus of Contemporary American English (COCA) contains 450 million words used from 1990 through 2012. Every word is tagged by part of speech and by the source it came from, whether a newspaper, a book, a scholarly article, or some other source. And if that sample size isn’t big enough, or you need to check usage internationally, you can use COCA’s international counterpart, GloWbE, which contains 1.9 billion words derived from 1.8 million different texts.
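To make the idea of a frequency comparison concrete, here is a minimal Python sketch. It does not query COCA or GloWbE; the sample sentences, the SAMPLE_TEXTS name, and both helper functions are invented purely for illustration, and a real corpus search covers vastly more text.

```python
import re

# Hypothetical mini-corpus (an assumption for this sketch, not real corpus data).
SAMPLE_TEXTS = [
    "The council will try to pass the budget this week.",
    "Residents try to avoid the bridge at rush hour.",
    "We'll try and get there early, she said.",
    "Officials try to reassure investors after the vote.",
]


def count_phrase(phrase, texts):
    """Count case-insensitive, whole-word occurrences of a phrase across texts."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in texts)


def compare_phrases(phrases, texts):
    """Return a dictionary mapping each phrase to its total count."""
    return {phrase: count_phrase(phrase, texts) for phrase in phrases}


counts = compare_phrases(["try to", "try and"], SAMPLE_TEXTS)
for phrase, total in counts.items():
    print(f"{phrase!r}: {total}")
```

On this toy sample the tally favors “try to,” but four sentences prove nothing about real usage. The point is only that a corpus does this kind of counting for you across hundreds of millions of words.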


Using these resources would tell you that “to try to” is much more standard and common than “to try and,” which is colloquial and not appropriate for published text. Using a corpus would also show you where and how the words eccentric, strange, and quirky are used. Analyzing collocates, the words commonly used around a given word (eccentric millionaire, quirky nerd), also gives you insight into the connotations certain words carry in certain contexts. Corpus research tells us that sheer most commonly goes with brilliance, while utter most commonly goes with nonsense.
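Collocate analysis can be sketched in the same spirit. The short Python example below counts the word that immediately follows a given word in a handful of made-up sentences. The sentences and the right_collocates function are assumptions for illustration only; real corpus tools typically look several words to the left and right and work with tagged text.

```python
import re
from collections import Counter

# Hypothetical sample sentences (invented for this sketch, not corpus data).
SAMPLE_TEXTS = [
    "It was sheer brilliance from start to finish.",
    "Critics called the plan utter nonsense.",
    "The report is utter nonsense, he said.",
    "Her solo was sheer brilliance.",
    "It was sheer luck that we met.",
]


def right_collocates(node, texts):
    """Count the words that appear immediately after `node` in the texts."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and split it into simple word tokens.
        words = re.findall(r"[a-z']+", text.lower())
        for current, following in zip(words, words[1:]):
            if current == node:
                counts[following] += 1
    return counts


sheer = right_collocates("sheer", SAMPLE_TEXTS)
utter = right_collocates("utter", SAMPLE_TEXTS)
print("sheer:", sheer.most_common())
print("utter:", utter.most_common())
```

Because the sample was written to mirror the pairings mentioned above, it shows how the counting works rather than offering evidence of its own.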


The best part about using a corpus is that it’s free. And it isn’t only in English; if you need to check your Spanish or your Portuguese, there are corpora for those languages too. The resources below have been made publicly available by BYU professor Mark Davies, and they should be made an integral part of every translator’s toolbox.


The Corpus of Contemporary American English (COCA):


Corpus of Global Web-based English (GloWbE):


Corpus Main Page:
