
From: Rockwood, Trent R <trockwood <at> mitre.org>
Subject: ATA-AMTA Workshop on Users and Uses for Parallel Corpora
Newsgroups: gmane.science.linguistics.corpora
Date: Monday 14th June 2010 18:37:09 UTC
Call for Papers:
Workshop:  Uses and Users for Parallel Corpora in the Translation Process
Association for Machine Translation in the Americas (AMTA)
November 4-5, 2010 Denver, Colorado
(in conjunction with the American Translators Association Conference)

The purpose of this workshop is to explore how the translation community
currently uses, and will use, parallel corpora.  A
parallel corpus generally refers to a large collection of translated text. 
These texts are often aligned at the sentence or phrase level and annotated
with a specific task in mind, motivating a markup schema.  Bilingual
parallel texts are referred to as bitext, whereas parallel corpora can be
multilingual (e.g., the many translations of the Bible).
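
As a rough illustration of the distinction above, a sentence-aligned
multilingual corpus can be pictured as a list of units keyed by language
code, from which a bitext for any one language pair is extracted.  This is
only a sketch: the sample sentences and the `bitext` helper are hypothetical
and not drawn from any particular collection.

```python
# Each unit holds one aligned sentence in several languages, keyed by language code.
# The sentences are made-up sample data.
corpus = [
    {"en": "Good morning.", "fr": "Bonjour.", "de": "Guten Morgen."},
    {"en": "Thank you.", "fr": "Merci."},
]


def bitext(units, src, tgt):
    """Return (source, target) pairs for the units that contain both languages."""
    return [(u[src], u[tgt]) for u in units if src in u and tgt in u]


en_fr = bitext(corpus, "en", "fr")  # both units have English and French
en_de = bitext(corpus, "en", "de")  # only the first unit has German
```

Units that lack one of the two languages are simply skipped, which is one
of several possible ways to handle gaps in multilingual alignment.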

Submissions will address and explore the many reasons why people create
corpora, what corpora they would like to see created, how translators are
making use of corpora, how translation systems are utilizing corpora
according to type and structure, and what privacy and copyright issues
accompany the many uses, by both machines and people.

Collections of parallel corpora abound, whereas definitions and structuring
of corpora seem to vary across sites[1].  Examples of the kinds of
differences involve source text markup, transliteration, target text
markup, methods of associating source and target, and alignment.

Processing needed for different applications varies widely according to
context and function; for example, how granular do associations between
source and target need to be, how much tagging needs to occur
(morphological, syntactic, semantic), what types of alignment are needed
for which purposes, and how much of the markup is manual or automatic. 
Furthermore, given the wide range of preprocessing needs, what is the
quality check process as part of the overall workflow?

When using corpora to aid in human translation, especially in conjunction
with Translation Memory software, which representation standards are
being, or should be, applied (for example TMX, TBX, SRX, xml:tm, etc.), and
what are some of the compatibility issues encountered?
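
For readers less familiar with these formats, the following sketch shows
how aligned sentence pairs might be written out as a minimal TMX document
using Python's standard library.  The `build_tmx` helper and the header
values (tool name, version, original format) are illustrative assumptions,
not a recommendation of any particular tool or profile.

```python
import xml.etree.ElementTree as ET

# ElementTree writes attributes in this namespace with the "xml:" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Serialize (source, target) sentence pairs as a minimal TMX 1.4 string."""
    tmx = ET.Element("tmx", version="1.4")
    # Header attribute values below are placeholders chosen for illustration.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


document = build_tmx([("Hello.", "Bonjour.")], "en", "fr")
```

Real translation memories usually carry more metadata (dates, authorship,
inline formatting), which is where many of the compatibility issues
between tools tend to arise.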

Finally, what are the standard existing uses for various kinds of parallel
corpora, and what are some of the nascent needs that could only be explored
once massive amounts of data are collected?  Some of these uses and users
may simply need smaller amounts of data, but still require backup corpora
for validation and extension of data.  What can translators expect from
parallel corpora?  Of what use are these resources to others in the
translation field, whether in government, industry, or academia?

Two of the issues addressed only gingerly in the translation community are
those of privacy and permissions for copyrighted text, particularly when
dealing with limited extraction of, say, technical terms and their
translations that could in no way be used to reconstruct the sources. A
liberal interpretation might claim that such extraction does not constitute
an invasion of privacy (in corpora that consist of logs, chats, emails,
etc.), nor is it an infringement of copyright. On the other hand, a more
conservative interpretation of privacy or infringement might claim that
this use does constitute misuse. Most people either overlook these issues
or find their progress blocked by the more cautious approach.

The types of questions that this workshop will address include:

*         How do you create parallel corpora as part of your workflow?

*         In what ways do you use parallel corpora?

*         What techniques do you use to evaluate usefulness (for people or

*         How are your corpora processed (Aligned?  Markup? Standards?)

*         What kinds of quality ratings do you use?

*         What are your lessons learned?

Proposers are encouraged to participate by sharing their experiences,
projects, needs and findings as a single contributor or as a member of a
panel. Of interest is the workflow process of creating or finding,
processing, standardizing and using parallel or comparable corpora for
improving language training of humans and machines.  Furthermore, if
participants have developed a financial return on investment scenario for
using parallel corpora, those insights and justifications are also welcome
as presentation topics.


Judith L. Klavans - U.S. Government and University of Maryland
Elizabeth McGrath - MITRE Corporation
Trent Rockwood - MITRE Corporation

Important Dates and Schedule:

June 14, 2010 - send out call for papers
July 20, 2010 - papers due
August 20, 2010 - send reviews back to submitters
September 10, 2010 - revisions due back to AMTA for printing
November 4-5, 2010 - workshop dates

6 page max, 11pt minimum, 2 column, ACM format:

Submissions and questions to: Trent Rockwood,
[email protected]


[1] A few examples of major collections include the Linguistic Data
Consortium (http://www.ldc.upenn.edu), the British National Corpus
(bnc.org), the JRC-Acquis corpora (http://wt.jrc.it/lt/Acquis/), the
ELRA MLCC Multilingual and Parallel Corpora (http://catalog.elra.info),
Japanese-Chinese corpora (http://www.nict.go.jp/), and many others.