Anne Hargreaves
- Mar 12, 2021
- 4 min read

Using corpora to inform translation

Updated: Mar 14, 2021

TL;DR: Translators can query large collections of natural language text to reveal patterns of language use, whether to search for terminology, to characterise a particular text type, or to contrast usage between the languages of a pair. Or just for fun and interest - corpus rocks!

What is a corpus (plural corpora)?

"A corpus is a collection of texts, selected and compiled according to specific criteria. The texts are held in electronic format, i.e. as computer files, so that various kinds of corpus tools, i.e. software, can be used to carry out analysis on them." (From "Introducing Corpora in Translation Studies " by Maeve Olohan (2004)). Texts must be ‘naturally occurring’, that is, they are taken from real examples of language in use, and in order to provide meaningful results and take advantage of the features of computer processing, they must be too large for manual analysis. Using corpus analysis software, patterns can be revealed in the texts being analysed which can assist a translator in various ways.

Where can I find corpora to try?

Corpora are available online - some are free to use, with or without registration, while others require payment. The language and type of text vary according to the particular corpus. For example, the British National Corpus (BNC) "is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written." Corpora also exist covering specialist fields, such as the Michigan Corpus of Academic Spoken English and the Brigham Young University specialist collection.

For translators working into English, these corpora are a useful supplement to the internet search that is the usual resource for the recurrent questions "do we really say this?" and "how is this normally expressed?". Online corpora such as the BNC don't require specialist software for querying, as they provide a web-based interface with instructions for basic usage.

Many "parallel corpora", which hold matched translations of the same text in two languages, are also available online. In these, instances of the same word or phrase can be compared side by side. One example is Linguateca COMPARA, a Portuguese and English parallel corpus that is free to use on the web. Opus is an open-source collection of parallel corpora, tools and information provided by European universities and researchers, free to use but perhaps not for the beginner!

TenT/CAT tools and corpora

You have corpora already! By matching segments and adding them to a translation memory (TM), your translation environment tool (TenT) creates a parallel corpus. The more material you add to a TM in a particular subject area, the bigger your corpus and the more useful it is for corpus queries. Most TenTs have the ability to produce wordlists or termlists in various ways and do concordance searches (finding all instances of a word or phrase in context). The volume of text that you have translated, though, won't match up to the huge quantities in specially-produced corpora.
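For readers who like to tinker, a concordance search can be sketched in a few lines of Python. This is only a minimal illustration of the idea (a keyword shown with a fixed window of context on each side), not how any particular TenT implements it. The function name `concordance`, the `width` parameter and the sample sentences are all invented for the example.

```python
import re


def concordance(text, phrase, width=30):
    """Return one keyword-in-context line for every occurrence of phrase."""
    lines = []
    # Match the phrase as whole words only, ignoring case.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text, standing in for a much larger corpus.
sample = (
    "The shaft is rotatably mounted in the housing. "
    "A lever is rotatably connected to the shaft."
)
hits = concordance(sample, "rotatably", width=20)
for line in hits:
    print(line)
```

Lining the keyword up in a column is what makes recurring patterns on either side of it easy to spot by eye.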

Tools to use

A number of tools are available for querying corpora. A popular web-based tool is Sketch Engine, which provides access to a range of corpora in various languages, along with instructions and assistance for building your own. It has a YouTube channel with instruction videos and is available on a 30-day trial or, for freelancers, for a small monthly subscription.

WordSmith is a desktop corpus tool for one-off purchase, which also has some getting-started videos. AntConc is a freeware desktop tool. Corpus-Analysis.com gives a list of tools for corpus linguistics.

Make your own corpus

If you are working in a specialist field, or want to study one with the aim of specialising, it might be worth the time and effort involved to create your own small corpus of texts in the specialist area. There is no set size; the upper limits might be practical ones such as availability of suitable texts, time, or computing resources. It's important to choose texts that reflect the specialism of interest and, of course, to make sure you have permission to use them for the purpose you have in mind. For my MSc dissertation, I made a patent corpus with French and English texts from Espacenet, the European Patent Office's public patent database, which I queried to demonstrate some of the particular features of patent texts.

Examples from a specialist corpus of mechanical patent texts

The tool used was WordSmith - here the French and English corpora are loaded together to investigate the English verb that corresponds in use to the French verb "représenter". In English "illustrate" and "show" are used (although not "represent"). In French only one verb is used, "représenter". Of course in this case, the phrases seen are not direct translations of one another, but examples of usage. In the case of a parallel corpus, as with TM segments, matched translation pairs can be shown.
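As a rough illustration of what "matched translation pairs" means in practice, the sketch below searches a tiny list of aligned segments and returns each source segment together with its translation. It is not taken from WordSmith or any TenT; the function `parallel_search` and the three segment pairs are invented for the example, and the matching is a simple case-insensitive substring test.

```python
def parallel_search(pairs, term):
    """Return the aligned (source, target) pairs whose source contains term."""
    term = term.lower()
    return [(src, tgt) for src, tgt in pairs if term in src.lower()]


# Invented French-English segment pairs, standing in for TM segments.
pairs = [
    ("La figure 1 représente une vue en coupe.", "Figure 1 shows a sectional view."),
    ("L'arbre est monté en rotation.", "The shaft is rotatably mounted."),
    ("La figure 2 représente le levier.", "Figure 2 illustrates the lever."),
]

matches = parallel_search(pairs, "représente")
for src, tgt in matches:
    print(src, "->", tgt)
```

Because the pairs stay together, each hit shows how the term was actually rendered in that context rather than leaving you to guess at an equivalent.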

With a single language loaded at a time, these two screenshots show a comparison of the use of "en rotation" and "rotatably" in French and English. "Rotatably" is one of the oddities of mechanical patent texts in English, being used almost nowhere else.

The two lists below use the Word List function to show comparative frequency in the mechanical patent corpus compared with the BNC corpus of general texts.

Listing word frequency can give interesting insight into the particular characteristics of a specialist text.
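The idea behind such a comparison can be sketched in Python: count the words in each corpus, scale the counts to occurrences per million words so that corpora of different sizes are comparable, and rank words by how much more frequent they are in the specialist corpus. This is a deliberately simple ratio, not the statistic that WordSmith or any other tool actually uses, and the function names and the two one-sentence "corpora" are invented for the example.

```python
import re
from collections import Counter


def word_list(text):
    """Count lower-cased word tokens in a text."""
    return Counter(re.findall(r"\w+", text.lower()))


def per_million(counts):
    """Scale raw counts to occurrences per million words."""
    total = sum(counts.values())
    return {word: n * 1_000_000 / total for word, n in counts.items()}


def compare(specialist, reference, top=5):
    """Rank words by how over-represented they are in the specialist text."""
    spec = per_million(word_list(specialist))
    ref = per_million(word_list(reference))
    # Add 1 to the reference frequency so unseen words do not divide by zero.
    ratios = {w: f / (ref.get(w, 0) + 1) for w, f in spec.items()}
    return sorted(ratios, key=ratios.get, reverse=True)[:top]


# Invented one-sentence stand-ins for a specialist and a general corpus.
patent_text = "the shaft is rotatably mounted and the lever is rotatably connected"
general_text = "the cat is on the mat and the dog is in the garden"

distinctive = compare(patent_text, general_text)
print(distinctive)
```

Common words such as "the" drop down the ranking because they are frequent in both texts, while words that are rare or absent in the general text rise to the top.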

Patent corpus

BNC

A flavour of what you can do with corpora

I hope these examples have given a snapshot of some of the ways that corpus analysis can help you analyse a text type and show language in use, either in a single language or as a comparison between two languages.

If anyone is interested in having a look at my dissertation, it's here:

AnneHargreavesMScDissertation.pdf (PDF, 1.09MB)