Parallel corpus

From Glottopedia


A parallel corpus is a corpus that contains a collection of original texts in language L1 and their translations into a set of languages L2 ... Ln. In most cases, parallel corpora contain data from only two languages.

Closely related to parallel corpora are 'comparable corpora', which consist of texts from two or more languages that are similar in genre, topic, register, etc. but do not contain the same content.

Types and Compilation

Types of parallel corpora

Parallel corpora can be bilingual or multilingual, i.e. they consist of texts in two languages or in more than two. They can be unidirectional (e.g. English texts translated into German), bidirectional (e.g. English texts translated into German and German texts translated into English), or multidirectional (e.g. an English text such as an EU regulation translated into German, Spanish, French, etc.).

Compilation of parallel corpora

The texts of a corpus are chosen according to specific criteria which depend on the purpose for which it is created. In particular, compilers have to decide whether to include a static or a dynamic collection of texts, and whether to include entire texts or text samples. Questions of authorship, size, topic, genre, medium and style have to be considered as well. In any case, a corpus is intended to comply with the following requirements: (i) it should contain authentic (naturally occurring) language data; (ii) it should be representative, i.e. it should contain data from different types of discourse.

Alignment of a parallel corpus

In order to use a parallel corpus properly, it is necessary to align the source text and its translation(s). This means identifying the pairs or sets of sentences, phrases and words in the original text and their correspondences in the other languages. Alignment is necessary because, during the translation process, the translator may split, merge, delete, insert or reorder sentences in order to create a natural translation in the target language. To compare the original text and its translation(s), the correspondences between the texts must therefore be (re-)established. In the process of alignment, anchor points such as proper names, numbers, quotation marks, etc. are often used as points of orientation. The degree of correspondence between the texts of a parallel corpus varies depending on the text type. For example, a fictional text may allow the translator greater freedom than a legal one.
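The idea of sentence alignment can be illustrated with a small sketch. It is not the method of any particular corpus project: it is a simplified length-based dynamic-programming aligner (loosely in the spirit of length-based aligners such as Gale and Church's, but without a probabilistic model) that also rewards shared anchor points. The function names, the bead inventory and the weight 0.5 are assumptions chosen for illustration, and the example sentences are invented.

```python
import re

# Allowed alignment "beads": (source sentences, target sentences) consumed.
# 1-1 match, 1-0 deletion, 0-1 insertion, 2-1 merge, 1-2 split.
BEADS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]

def anchors(text):
    """Crude anchor points: numbers and capitalised tokens."""
    tokens = re.findall(r"\w+", text)
    return {t for t in tokens if t.isdigit() or t[0].isupper()}

def cost(src_group, tgt_group):
    """Lower is better: relative length difference minus an anchor bonus."""
    src, tgt = " ".join(src_group), " ".join(tgt_group)
    length_penalty = abs(len(src) - len(tgt)) / max(len(src), len(tgt), 1)
    shared = len(anchors(src) & anchors(tgt))
    return length_penalty - 0.5 * shared  # 0.5 is an arbitrary weight

def align(src, tgt):
    """Return a monotone alignment as a list of (source group, target group)."""
    inf = float("inf")
    n, m = len(src), len(tgt)
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                total = best[i][j] + cost(src[i:ni], tgt[j:nj])
                if total < best[ni][nj]:
                    best[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the chosen beads.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]

english = ["The committee met on 12 May.",
           "It adopted Regulation 45.",
           "The vote was unanimous."]
german = ["Der Ausschuss tagte am 12. Mai und nahm die Verordnung 45 an.",
          "Die Abstimmung war einstimmig."]

for source_group, target_group in align(english, german):
    print(source_group, "<->", target_group)
```

On this toy input the first two English sentences are grouped with the single German sentence that merges them, because the lengths match and the numbers 12 and 45 act as shared anchors. The sketch only handles monotone alignments, so reordered sentences would need a different approach.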


Applications

Parallel corpora can be used for various practical purposes.

Contrastive linguistics

Parallel corpora are used to compare linguistic features and their frequencies in two languages that are subject to a contrastive analysis. They are also used to investigate similarities and differences between the source and the target language, making systematic, text-based contrastive studies at different levels of analysis possible. In this way, parallel corpora can provide new insights into the languages compared concerning language-specific, typological and cultural differences and similarities, and allow for quantitative methods of analysis.
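A frequency comparison of this kind can be sketched in a few lines. The example below is only an illustration under simple assumptions: tokens are whatever the regular expression `\w+` matches, frequencies are normalised per 1,000 tokens so that texts of different length can be compared, and the two mini-texts and the lists of modal verb forms are invented.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-cased word tokens; a deliberately naive tokeniser."""
    return re.findall(r"\w+", text.lower())

def per_thousand(text, forms):
    """Occurrences of any of the given word forms per 1,000 tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[form] for form in forms)
    return 1000 * hits / len(tokens)

# Invented toy data: an English text and its German translation.
english_text = "You must go. We must stay."
german_text = "Du musst gehen. Wir müssen bleiben."

english_rate = per_thousand(english_text, {"must"})
german_rate = per_thousand(german_text, {"muss", "musst", "müssen"})
print(f"English 'must': {english_rate:.1f} per 1,000 tokens")
print(f"German 'müssen': {german_rate:.1f} per 1,000 tokens")
```

In a real study the word forms would normally come from a lemmatised or annotated corpus rather than from a hand-written set.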

Translation studies

Closely related to the use of parallel corpora in contrastive linguistics is their application in translation studies. Parallel corpora may help translators to find translational equivalents between the source and the target language. They provide information on the frequency of words, specific uses of lexical items, and collocational and syntactic patterns. This information may help translators to develop systematic translation strategies for words or phrases which have no direct equivalent in the target language. On this basis, sets of possible translations can be identified and the translator can choose a translation strategy according to the specific register, topic and genre.
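How candidate equivalents might be extracted from an aligned corpus can be sketched as follows. This is a minimal illustration, not a description of any existing tool: it assumes the corpus is already sentence-aligned, counts each word once per sentence, and ranks target-language words by the Dice coefficient of their co-occurrence with the source word. The function names and the sentence pairs are invented.

```python
import re
from collections import Counter

def word_set(sentence):
    """The set of lower-cased word tokens in a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))

def equivalent_candidates(pairs, source_word, top=3):
    """Rank target words by Dice association with source_word.

    pairs is a list of (source sentence, target sentence) tuples.
    """
    source_count = 0          # sentence pairs containing source_word
    target_counts = Counter() # sentence pairs containing each target word
    joint_counts = Counter()  # pairs containing both words
    for source_sentence, target_sentence in pairs:
        target_words = word_set(target_sentence)
        target_counts.update(target_words)
        if source_word in word_set(source_sentence):
            source_count += 1
            joint_counts.update(target_words)
    scores = {
        word: 2 * joint / (source_count + target_counts[word])
        for word, joint in joint_counts.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top]

# Invented, already aligned English-German sentence pairs.
aligned_pairs = [
    ("The house is old.", "Das Haus ist alt."),
    ("She bought a house.", "Sie kaufte ein Haus."),
    ("The car is old.", "Das Auto ist alt."),
    ("The house was sold.", "Das Haus wurde verkauft."),
]

for word, score in equivalent_candidates(aligned_pairs, "house"):
    print(word, round(score, 2))
```

The Dice coefficient is used here because raw co-occurrence counts would favour frequent function words such as articles. The resulting list is only a set of candidates from which a translator would still choose according to register, topic and genre.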

More recently, parallel corpora have increasingly been used to develop resources for machine translation systems.


Language teaching

Teachers are increasingly using parallel corpora in the classroom. In doing so, they can determine the most frequent patterns of occurrence, enrich their personal knowledge of the language, design teaching materials and provide authentic data in their teaching. Parallel corpora may also be helpful in planning teaching units and in identifying specific, potentially problematic patterns of use, and are thus useful tools for syllabus design.

Moreover, parallel corpora can be used to identify translation difficulties and false friends. False friends are words or expressions of the target language that are similar in form to their counterpart in the source language but convey a different meaning. Even if words of the two languages have a similar meaning, they might belong to different registers or contexts, so that complete translational equivalence between source and target text is rare.

Teachers are increasingly encouraging students to make use of parallel corpora themselves in order to become aware of nuances of usage and subtle differences in meaning.


Lexicography

Parallel corpora are increasingly used to compile corpus-based (bilingual) dictionaries.

Examples of parallel corpora

  • English-German Translation Corpus
  • English-Norwegian Parallel Corpus (ENPC)
  • English-Swedish Parallel Corpus (ESPC)
    • cf. 'Contrastive linguistics and corpora' by S. Johansson
    • cf. The website of the English-Norwegian Parallel Corpus
    • started in 1993
    • has become an important resource for contrastive studies of English and Swedish
    • contains 64 English texts + translations, 72 Swedish texts + translations
    • contains 2.8 million words
    • contains a wide range of text types, authors, translators
    • texts have been matched as far as possible in terms of text type, subject, register
    • can therefore be used as a bidirectional parallel corpus and as a comparable corpus
    • current research: epistemic modality and adverbial connectors in English and Swedish
  • The International Telecommunications Union Corpus (English-Spanish)
  • The Intersect Parallel Corpus (English-French)
  • The Multilingual Parallel Corpus (Danish, English, French, German, Greek, Italian, Finnish, Portuguese, Spanish, Swedish texts)

See also

  • Learner's corpus
  • Comparable corpus

