Date | Update |
---|---|
2007-03-21 10:33 | Validation of Chinese translation updated, Chinese reference files updated, Chinese scores updated |
2007-02-23 9:59 | Validation of translations updated |
2007-02-19 20:35 | Reference files available |
2007-02-19 20:35 | Results updated |
2007-02-09 20:40 | Preliminary results available |
2007-02-02 12:30 | English ASR test file updated |
2007-02-01 13:27 | Spanish ASR test file updated |
2007-01-31 10:37 | Chinese Verbatim test file updated |
2007-01-31 10:00 | Test data available |
This section is only a summary of the more complete SLT Evaluation Plan document, which can be found here.
TC-STAR Evaluation Run #3 for SLT will take place from January 31, 2007 to February 7, 2007. The development data is already available (see the development data section). The complete schedule can be seen here, but the important dates for SLT are outlined below:
Before the evaluation run itself, participants have access to training data and development data composed of parallel texts and transcriptions. See the training data and development data sections below.
The SLT evaluation will be run in three translation directions: English to Spanish, Spanish to English, and Mandarin Chinese to English.
For each translation direction, three kinds of text data are used as input: the final text edition (Text), verbatim transcriptions (Verbatim), and automatic transcriptions (ASR output). An example of each kind of input is shown below:
Input | Example |
---|---|
Text | I am starting to know what Frank Sinatra must have felt like, |
Verbatim | I'm I'm I'm starting to know what Frank Sinatra must have felt like |
ASR output | and i'm times and starting to know what frank sinatra must have felt like |
English to Spanish and Spanish to English are run on transcriptions of recordings from the European Parliament Plenary Sessions (EPPS), while Chinese to English is run on transcriptions of Voice of America broadcasts.
For the Spanish to English direction, test data from both the European Parliament (EPPS) and the Spanish Parliament (Cortes) is used. However, no distinction between the two sources by means of document ID tags is allowed.
We propose the following tracks, which determine the training data allowed:
Participants are encouraged to take part in the Public Track.
A submission guideline is available for participants.
Direction | Input | Participants |
---|---|---|
Zh-->En (VoA) | Single-best ASR | IRST, RWTH, UKA |
Zh-->En (VoA) | Verbatim | IRST, RWTH, UKA |
Es-->En (EPPS + PARL) | Text | IBM, IRST, RWTH, UKA, UPC |
Es-->En (EPPS + PARL) | Single-best ASR | IBM, IRST, LIMSI, RWTH, UKA, UPC |
Es-->En (EPPS + PARL) | Verbatim | IBM, IRST, LIMSI, RWTH, UKA, UPC |
En-->Es (EPPS) | Text | IBM, IRST, RWTH, UKA, UPC |
En-->Es (EPPS) | Single-best ASR | IBM, IRST, LIMSI, RWTH, UKA, UPC |
En-->Es (EPPS) | Verbatim | IBM, IRST, LIMSI, RWTH, UKA, UPC |
There is no text condition for Mandarin.
IBM: International Business Machines, Germany
IRST: Il Centro per la Ricerca Scientifica e Tecnologica, Italy
LIMSI: Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur, France
RWTH: Rheinisch-Westfälische Technische Hochschule, Germany
UKA: Universität Karlsruhe, Germany
UPC: Universitat Politècnica de Catalunya, Spain
External Participants
Direction | Input | Participants |
---|---|---|
Zh-->En (VoA) | Single-best ASR | ICT, NICT, UDS, XMU |
Zh-->En (VoA) | Verbatim | ICT, NICT, UDS, XMU |
Es-->En (EPPS + PARL) | Text | JHU, NICT, Translendium, UDS |
Es-->En (EPPS + PARL) | Single-best ASR | JHU, NICT, UDS |
Es-->En (EPPS + PARL) | Verbatim | JHU, NICT, UDS |
En-->Es (EPPS) | Text | UDS |
En-->Es (EPPS) | Single-best ASR | UDS |
En-->Es (EPPS) | Verbatim | UDS |
ICT: Institute of Computing Technologies - Chinese Academy of Sciences, China
JHU: The Johns Hopkins University, United States
XMU: Institute of Artificial Intelligence - Xiamen University, China
Translendium: Translendium SL, Spain
UDS: Universität des Saarlandes, Germany
NICT-ATR: National Institute of Information and Communications Technology - Advanced Telecommunications Research Institute International, Japan
The training data are the same as for Run #1 and Run #2.
Direction | Description | Reference | Amount | IPR-owner | IPR-distrib | Comment |
---|---|---|---|---|---|---|
Zh->En | FBIS Multilanguage Texts | LDC2003E14 | | LDC | research | LDC membership 03 required |
Zh->En | UN Chinese English Parallel Text Version 2 | LDC2004E12 | | LDC | research | LDC membership 04 required |
Zh->En | Hong Kong Parallel Text | LDC2004T08 | | LDC | research | LDC membership 04 required |
Zh->En | English Translation of Chinese Treebank | LDC2002E17 | | LDC | research | LDC membership 02 required |
Zh->En | Xinhua Chinese-English Parallel News Text Version 1.0 beta 2 | LDC2002E18 | | LDC | research | LDC membership 02 required |
Zh->En | Chinese English Translation Lexicon version 3.0 | LDC2002L27 | | LDC | research | LDC membership 02 required |
Zh->En | Chinese-English Name Entity Lists version 1.0 beta | LDC2003E01 | | LDC | research | LDC membership 03 required |
Zh->En | Chinese English News Magazine Parallel Text | LDC2005E47 | | LDC | research | LDC membership 05 required |
Zh->En | Multiple-Translation Chinese (MTC) Corpus | LDC2002T01 | | LDC | research | LDC membership 02 required |
Zh->En | Multiple Translation Chinese (MTC) Part 2 | LDC2003T17 | | LDC | research | LDC membership 03 required |
Zh->En | Multiple Translation Chinese (MTC) Part 3 | LDC2004T07 | | LDC | research | LDC membership 04 required |
Zh->En | Chinese News Translation Text Part 1 | LDC2005T06 | | LDC | research | LDC membership 05 required |
Zh->En | Chinese Treebank 5.0 | LDC2005T01 | | LDC | research | LDC membership 05 required |
Zh->En | Chinese Treebank English Parallel Corpus | LDC2003E07 | | LDC | research | LDC membership 03 required |
Es->En | EPPS Spanish verbatim transcriptions May - Jan 2005 | | | UPC | ELRA | Transcribed by UPC |
Es->En | EPPS Spanish final text edition April 1996 to Sept 2004 | | | EC | RWTH | Provided to TC-STAR by RWTH |
Es->En | EPPS Spanish final text edition Dec 2004 - May 2005 | | | EC | ELRA | English and Spanish parallel texts are aligned. Verbatim transcriptions are also aligned with the FTE by RWTH. |
Es->En | EPPS Spanish final text edition Dec 2005 - May 2006 | | | EC | ELRA | English and Spanish parallel texts are aligned. Verbatim transcriptions are also aligned with the FTE by RWTH. |
En->Es | EPPS English verbatim transcriptions May 2004 - Jan 2005 | | | RWTH | ELRA | Transcribed by RWTH |
En->Es | EPPS English final text edition April 1996 to Sept 2004 | | | EC | RWTH | Provided to TC-STAR by RWTH |
En->Es | EPPS English final text edition Dec 2004 - May 2005 | | | EC | ELRA | English and Spanish parallel texts are aligned. Verbatim transcriptions are also aligned with the FTE by RWTH. |
En->Es | EPPS English final text edition Dec 2005 - May 2006 | | | EC | ELRA | English and Spanish parallel texts are aligned. Verbatim transcriptions are also aligned with the FTE by RWTH. |
En<->Es | EU Bulletin Corpus | | | | ELRA | "The Bulletin of the European Union provides an insight into the activities of the European Commission and the other Community institutions." It is published on a monthly basis, and parallel versions in Spanish and English are available up to 2004. The corpus is available in a raw version, with the HTML documents as downloaded from the pages of the European Union, or as a sentence-aligned version. Sentence alignment provided by RWTH. |
En<->Es | JRC-Acquis Multilingual Parallel Corpus | | | | research | "Before joining the European Union (EU), the new Member States (NMS) needed to translate and approve the existing EU legislation, consisting of selected texts written between the 1950s and 2005. This body of legislative text, which consists of approximately eight thousand documents and which covers a variety of domains, is called the Acquis Communautaire (AC)." The original version of the corpus (in several languages) can be downloaded directly from the link above, as well as a tool for paragraph alignment. RWTH carried out an additional sentence-level alignment of the corpus. Provided by RWTH. |
En<->Es | UN Parallel Corpus | | | LDC | | "The text files published in this corpus were provided to the LDC by the United Nations in New York, for use by the research community in developing machine translation technology. This material has been drawn from the UN's electronic text archives covering the period between 1988 and (portions of) 1993." We are not allowed to distribute this data; therefore a set of tools has been made available for carrying out the sentence alignment by each partner. Alignment tools provided by RWTH. |
English and Spanish
You can use any of the training resources listed in the table above, in addition to the EPPS training sets. To obtain the EPPS training sets on DVD, please contact Christian Gollan at RWTH.
Chinese
You can use any of the training resources listed in the table above, except the TDT3 audio files and transcriptions for December 1998 (the development and test sets will be built from these files).
The EPPS verbatim transcriptions are shared with the ASR evaluation. The difference is that only 2 files are used for SLT (instead of 3 for ASR), as only 25,000 words are needed.
You can also find development data of the 2005 SLT evaluation on the SLT Run #1 page.
Direction | Files |
---|---|
Es-->En (Cortes + EPPS) | |
En-->Es (EPPS) | |
Zh-->En (VoA) | |
The translation guidelines for the translation agencies are available here (MS Word document).
Word graphs/lattices are regularly posted on WP2's own web page.
The scoring tools proposed by ELDA are Perl scripts. You can download them as a zip archive containing:
In this package, sample files are provided to check that the installation is correct:
For the ASR task, the alignment tool from RWTH is available here.
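As a complement to these tools, the hypothetical sketch below (not one of the ELDA scripts) shows one way a site might run a quick sanity check on its output before scoring: it assumes both files follow the NIST MT format described in the submission section below and compares the number of segments per document between a source set and a translated file.

```python
# Hypothetical sanity check (not an official tool): compare the number of
# segments per document between a source set and a submission, assuming both
# files follow the NIST MT format described in the submission section below.
import re

DOC_RE = re.compile(r'<DOC\s+DocID="([^"]+)"[^>]*>(.*?)</DOC>', re.I | re.S)
SEG_RE = re.compile(r'<SEG\b', re.I)

def segments_per_doc(path, encoding="utf-8"):
    """Map DocID -> number of <SEG> elements (DocID assumed to be the first attribute)."""
    with open(path, encoding=encoding) as f:
        text = f.read()
    return {doc_id: len(SEG_RE.findall(body)) for doc_id, body in DOC_RE.findall(text)}

def check_submission(source_path, submission_path, src_encoding="utf-8"):
    src = segments_per_doc(source_path, encoding=src_encoding)
    sub = segments_per_doc(submission_path)
    for doc_id, n_src in src.items():
        n_sub = sub.get(doc_id, 0)
        if n_sub != n_src:
            print(f"{doc_id}: {n_sub} segments submitted, {n_src} expected")
    for doc_id in set(sub) - set(src):
        print(f"{doc_id}: not present in the source set")

# check_submission("..._SRC.TXT", "..._ORG-PRIMARY-system1.TXT")  # placeholder file names
```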
Source files
Direction | Input | Files |
---|---|---|
En-->Es (EPPS) | Final Text Edition | |
En-->Es (EPPS) | Verbatim | English source set (NIST MT format) |
En-->Es (EPPS) | ASR | |
Es-->En (EPPS + PARL) | Final Text Edition | Spanish source set (NIST MT format) |
Es-->En (EPPS + PARL) | Verbatim | Spanish source set (NIST MT format) |
Es-->En (EPPS + PARL) | ASR | |
Zh-->En (VoA) | Verbatim | Chinese source set (NIST MT format, GB2312 encoded) (updated 2007/01/31 - 10:37) |
Zh-->En (VoA) | ASR | |
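Note that the Chinese source set is GB2312 encoded, whereas the reference files are UTF-8 encoded. The minimal sketch below shows one way a site might convert the Chinese source to UTF-8 for uniform handling; the file names are placeholders, and this step is not required by the evaluation itself.

```python
# Hypothetical helper: re-encode the GB2312 Chinese source set to UTF-8 so it
# can be processed with the same settings as the UTF-8 reference files.
def gb2312_to_utf8(src_path, dst_path):
    with open(src_path, encoding="gb2312") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)

# gb2312_to_utf8("TC-STAR_..._ZHEN_SRC.TXT", "zhen_src_utf8.txt")  # placeholder paths
```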
Submissions must be sent by email to hamon@elda.org before Wednesday, February 7, 2007, 23:59 CET.
Submitted files should use the NIST MT format for TSTSET:
<TSTSET SetID="..." SrcLang="..." TrgLang="...">
<DOC DocID="..." SysID="...">
<SEG id="1">
TRANSLATED TEXT
</SEG>
...
</DOC>
...
</TSTSET>
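As an illustration, the minimal Python sketch below shows how a site might assemble a submission in this markup; the set, document, and system identifiers used here are placeholders, not official values, and the exact identifiers to use are defined in the evaluation plan.

```python
# Hypothetical sketch (not an official tool): wrap translated segments in the
# TSTSET markup shown above. All identifiers are placeholders chosen for the
# example; consult the evaluation plan for the actual values.

def write_tstset(path, set_id, src_lang, trg_lang, docs):
    """docs: iterable of (doc_id, sys_id, list_of_segment_strings)."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(f'<TSTSET SetID="{set_id}" SrcLang="{src_lang}" TrgLang="{trg_lang}">\n')
        for doc_id, sys_id, segments in docs:
            out.write(f'<DOC DocID="{doc_id}" SysID="{sys_id}">\n')
            for i, text in enumerate(segments, start=1):
                # Segment ids are numbered from 1, following the example above.
                out.write(f'<SEG id="{i}">\n{text.strip()}\n</SEG>\n')
            out.write('</DOC>\n')
        out.write('</TSTSET>\n')

# Example call with placeholder identifiers:
write_tstset("ORG-PRIMARY-system1.txt", "tcstar07_epps_fte_enes", "en", "es",
             [("doc01", "ORG-PRIMARY-system1",
               ["Primera frase traducida.", "Segunda frase traducida."])])
```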
Output files and source files share the same format, with the following exceptions:
Recommendations:
Submit one test set per file, i.e. one file for English verbatim transcripts, one file for English ASR output, one file for English FTE, etc.
The SysID attribute must identify the organisation, the condition, and the system. For instance, if the organisation ORG submits one primary condition and two secondary conditions for the English verbatim transcripts (one using the same system as the primary condition and another using a different system or system version), then 3 files will be sent for this SetID, with the following SysID values:
In the same manner, output file names must identify the organisation, with the same constraints. For instance, if the organisation ORG translates the file "TC-STAR_RUN2_TEST06_EPPS_FTE_ENES_SRC.TXT", the translated file should be renamed "TC-STAR_RUN2_TEST06_EPPS_FTE_ENES_ORG-PRIMARY-system1.TXT" ("system1" can be omitted if there is only one system per condition), as sketched below.
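For illustration, the short Python sketch below applies this naming convention; the organisation name "ORG", the condition label, and "system1" are placeholders to be replaced by each site's own identifiers.

```python
# Minimal sketch of the naming convention described above; "ORG" and the
# condition/system labels are placeholders for a site's own identifiers.
def make_sysid(org, condition="PRIMARY", system=None):
    """e.g. make_sysid("ORG", "PRIMARY", "system1") -> "ORG-PRIMARY-system1"."""
    parts = [org, condition]
    if system:  # "system1" may be omitted when there is only one system per condition
        parts.append(system)
    return "-".join(parts)

def submission_filename(source_filename, sys_id):
    """Rename a source file by replacing its "SRC" tag with the SysID."""
    return source_filename.replace("SRC", sys_id)

print(submission_filename("TC-STAR_RUN2_TEST06_EPPS_FTE_ENES_SRC.TXT",
                          make_sysid("ORG", "PRIMARY", "system1")))
# -> TC-STAR_RUN2_TEST06_EPPS_FTE_ENES_ORG-PRIMARY-system1.TXT
```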
For the ASR task, the SetID attribute should be:
System descriptions:
For each experiment, a one-page system description must be provided, describing the data used, the approaches (algorithms), the configuration, the processing time, etc. The document should also contain references. The file should be named "<SysID>.txt".
Submission:
Submissions must be sent by email to hamon@elda.org,
with the subject "[TC-STAR] Submission <SysID>",
and with the archived files attached.
The deadline is Wednesday, February 7, at 23:59 CET (5:59 pm in Pittsburgh and Yorktown).
A return receipt will be sent within 24 hours.
Direction | Input | Files | Validation (Ref1) | Validation (Ref2) |
---|---|---|---|---|
En-->Es (EPPS) | Final Text Edition | | OK | OK |
En-->Es (EPPS) | Verbatim / ASR | Spanish reference translations set (NIST MT format, utf-8 encoded, 2 reference translations) | OK | OK |
Es-->En (EPPS + PARL) | Final Text Edition | English reference translations set (NIST MT format, utf-8 encoded, 2 reference translations) | OK | OK |
Es-->En (EPPS + PARL) | Verbatim / ASR | English reference translations set (NIST MT format, utf-8 encoded, 2 reference translations) | OK | OK |
Es-->En (EPPS) | Final Text Edition | English reference translations set (NIST MT format, utf-8 encoded, 2 reference translations) | OK | OK |
Es-->En (EPPS) | Verbatim / ASR | English reference translations set (NIST MT format, utf-8 encoded, 2 reference translations) | OK | OK |
Es-->En (PARL) | Final Text Edition | English reference translations set (NIST MT format, utf-8 encoded, 2 reference translations) | OK | OK |
Es-->En (PARL) | Verbatim / ASR | English reference translations set (NIST MT format, utf-8 encoded, 2 reference translations) | OK | OK |
Zh-->En (VoA) | Verbatim / ASR | English reference translations set (NIST MT format, utf-8 encoded, 2 reference translations) (updated 2007/03/21) | OK | OK |
Preliminary results are available: