From: Ralf Steinberger <ralf.steinberger <at> jrc.ec.europa.eu>
Subject: Release 2015 of DGT-TM (parallel corpus in 24 languages)
Newsgroups: gmane.science.linguistics.corpora
Date: Monday 4th May 2015 08:45:09 UTC
We are happy to announce that the 2015 update release of the
DGT-Translation Memory (DGT-TM) is now available for download.

 

DGT-TM is an extraction of the translation memory of the European
Institutions for all 24 official EU languages, produced by the European
Commission’s Directorate General for Translation (DGT) and distributed by
the Joint Research Centre (JRC). Translation memories are sentences and
their manually produced translations.

 

The new release is called DGT-TM-2015. It follows the previous releases,
DGT-TM (2007), DGT-TM-2011, DGT-TM-2012, DGT-TM-2013 and DGT-TM-2014.
DGT-TM-2015 adds 4.7 million translation units, resulting in almost 90
million translation units in total (1.47 billion words). 

 

New features of DGT-TM-2015 are:

 

·  The amount of Croatian (HR) data almost doubles.

·  For most languages, about 200K new translation units are added.

·  More data for language pairs involving Maltese is available on
   request.

 

Languages:  All 276 language pairs involving the following 24 languages:

                 Bulgarian, Croatian, Czech, Danish, Dutch, English,
                 Estonian, German, Greek, Finnish, French, Irish,
                 Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish,
                 Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish.
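
The figure of 276 is the number of unordered pairs that can be formed from
24 languages (24 x 23 / 2). A minimal Python sketch of that count follows;
the two-letter ISO 639-1 codes are an assumed shorthand for the languages
listed above and are not taken from the release itself:

```python
from itertools import combinations

# Assumed ISO 639-1 codes for the 24 official EU languages listed above.
LANGUAGES = [
    "bg", "hr", "cs", "da", "nl", "en", "et", "de", "el", "fi", "fr", "ga",
    "hu", "it", "lv", "lt", "mt", "pl", "pt", "ro", "sk", "sl", "es", "sv",
]


def language_pairs(languages):
    """Return every unordered pair of distinct languages, sorted by code."""
    return list(combinations(sorted(languages), 2))


pairs = language_pairs(LANGUAGES)
print(len(LANGUAGES), "languages,", len(pairs), "language pairs")
```

Each pair is counted once, so English-French and French-English are the
same pair.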

            

URL:       https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory

Creator:   European Commission - Directorate General for Translation (DGT)
           <http://ec.europa.eu/dgs/translation/index_en.htm>

 

 

WHAT IS DGT-TM

 

The ‘Acquis Communautaire’
<http://europa.eu/abc/eurojargon/index_en.htm> is the entire body of
European legislation, comprising all the treaties, regulations and
directives adopted by the European Union (EU). Since each new country
joining the EU is required to accept the whole Acquis Communautaire, this
body of legislation has been translated into 23 official languages. For the
24th official EU language, Irish, the Acquis has not been translated on a
regular basis, which is why DGT-TM includes less data in Irish. The Acquis
Communautaire was split into sentences and aligned automatically at
sentence level, resulting in the DGT translation memory, DGT-TM. The text
data is accompanied by software that allows users to extract all sentences
and their translations for any of the 276 possible language pairs.
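
To give an idea of what such an extraction involves, here is a minimal
Python sketch. It is not the software shipped with DGT-TM. It assumes the
data is stored in TMX, the common XML exchange format for translation
memories; the sample document, the function name extract_pairs and the
language codes are made up for illustration:

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined "xml:" prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample in TMX style: one <tu> per translation unit,
# one <tuv> per language version, the text itself inside <seg>.
SAMPLE_TMX = """<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Article 1</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Article premier</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Artikel 1</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-GB"><seg>This Regulation shall enter into force.</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Diese Verordnung tritt in Kraft.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract_pairs(tmx_text, source, target):
    """Return (source, target) sentence pairs from units containing both."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for unit in root.iter("tu"):
        segments = {}
        for version in unit.iter("tuv"):
            # Accept both "xml:lang" and a plain "lang" attribute.
            code = version.get(XML_LANG) or version.get("lang") or ""
            seg = version.find("seg")
            if seg is not None:
                # Key on the primary language subtag, e.g. "EN" from "EN-GB".
                segments[code.split("-")[0].upper()] = "".join(seg.itertext())
        if source in segments and target in segments:
            pairs.append((segments[source], segments[target]))
    return pairs


print(extract_pairs(SAMPLE_TMX, "EN", "FR"))
print(extract_pairs(SAMPLE_TMX, "EN", "DE"))
```

Units that lack one of the two requested languages are skipped, which is
why a language with less data, such as Irish, yields fewer pairs.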

 

 

MOTIVATION FOR THIS RELEASE

 

The public data release is in line with the general effort of the European
Commission to support multilingualism, language diversity and the re-use of
Commission information. It follows the release of a number of further
multilingual data sets:

 

·  the JRC-Acquis parallel corpus in 2006 (over 1 billion words in
   22 languages),

·  the DGT-TM Translation Memory in 2007,

·  the multilingual named entity resource JRC-Names in 2011,

·  the multilingual multi-label classification tool (and accompanying
   text data) JRC EuroVoc Indexer (JEX) (22 languages) in 2012,

·  the ECDC-TM Translation Memory in 2012 (domain: Public Health),

·  the DGT-Acquis parallel corpus in 2012,

·  the EAC-TM Translation Memory in 2013 (domain: Education and
   Culture),

·  the DCEP (Digital Corpus of the European Parliament) in 2014,

·  and further smaller multilingual resources.

 

See https://ec.europa.eu/jrc/en/language-technologies for more information
on these resources.

 

 

WHAT DGT-TM CAN BE USED FOR

                

DGT-TM can be fed into translation memory software to support human
translators in their work. As it is a large parallel corpus in electronic
form, it can furthermore be used by specialists in computational
linguistics to train statistical machine translation software, to generate
multilingual dictionaries, to train and test multilingual information
extraction software, and more.

 

 

MORE INFORMATION ON DGT-TM 

 

At http://langtech.jrc.ec.europa.eu/JRC_Publications.html you can find
detailed publications on the JRC’s multilingual language technology
activities. For details specifically on DGT-TM, you can read:

 

      Steinberger Ralf, Andreas Eisele, Szymon Klocek, Spyridon Pilos
      & Patrick Schlüter (2012).
      DGT-TM: A freely Available Translation Memory in 22 Languages.
      Proceedings of the 8th international conference on Language
      Resources and Evaluation (LREC'2012), Istanbul, 21-27 May 2012.
      http://langtech.jrc.ec.europa.eu/Documents/2012_LREC_DGT-TM_Final.pdf

 

The following recent article compares all freely available Language
Technology resources distributed by the JRC and provides comparative
background information:

 

      Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel
      Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe
      Gilbro (2014).
      An overview of the European Union's highly multilingual parallel
      corpora.
      Language Resources and Evaluation Journal (LRE).
      DOI: 10.1007/s10579-014-9277-0.
      http://link.springer.com/article/10.1007/s10579-014-9277-0
      Manuscript:
      http://langtech.jrc.ec.europa.eu/Documents/2014_08_LRE-Journal_JRC-Linguistic-Resources_Manuscript.pdf

 

 

 

Ralf Steinberger  <https://ec.europa.eu/jrc/en/person/ralf-steinberger>
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy

URL – Resources:     https://ec.europa.eu/jrc/en/language-technologies
URL – Publications:  http://langtech.jrc.ec.europa.eu/JRC_Publications.html
 