This section only summarises the more complete SLT Evaluation Plan document to be found here.
TC-STAR Evaluation Run #1 for SLT took place from March 21, 2005 to April 8, 2005. The complete schedule can be seen here, but we can outline the important dates for SLT:
- Mid-February: ELDA disseminates SLT development data set through this web site
- March 11, 2005: ASR team prepare word graphs for SLT team
- March 18, 2005: End of SLT development phase
- March 21, 2005: Beginning of SLT run - ELDA sends source files to participants
- March 25, 2005: End of SLT run - translations are sent back to ELDA
- March 31, 2005: End of automatic scoring phase by ELDA - initial results
- April 1, 2005: Beginning of adjudication phase
- April 8, 2005: End of adjudication phase - results are definitive
Before the evaluation run proper, participants have access to training and development data composed of parallel texts and transcriptions; see the Training Data and Development Data sections below.
The SLT evaluation was run in 3 translation directions: English to Spanish, Spanish to English and Mandarin Chinese to English.
English to Spanish and Spanish to English were run on transcriptions of recordings from the European Parliament Plenary Sessions (EPPS), while Chinese to English was run on transcriptions of Voice of America recordings.
External Participants
- Declare your interest to the project's coordinator.
- ELDA sends you an End-User Agreement giving access to the data. You must fill in, sign and return this agreement to ELDA.
- Get access to training and development data:
  - EPPS training and development transcriptions can be downloaded (see the Training Data and Development Data sections).
  - Chinese training data can be any of the sets listed as training data in the Resources section.
  - Chinese development transcriptions can be downloaded (see the Development Data section).
- ELDA will contact you, like any other participant, for the evaluation run.
- You must release the data after the end of the evaluation run, as stated in the agreement.
Direction | Input | Participants
Zh-->En | Text | IBM, IRST, RWTH, UKA
Zh-->En | Single-best ASR | IBM, IRST, RWTH, UKA
Zh-->En | Verbatim | IBM, IRST, RWTH, UKA
Es-->En | Text | IBM, IRST, RWTH, UKA, UPC
Es-->En | Single-best ASR | IBM, IRST, RWTH, UKA
Es-->En | Verbatim | IBM, IRST, RWTH, UKA, UPC
En-->Es | Text | IBM, RWTH, UKA, UPC
En-->Es | Single-best ASR | IBM, RWTH, UKA
En-->Es | Verbatim | IBM, RWTH, UKA, UPC
External Participants
Direction | Input | Participants
Zh-->En | Text | ATR
Zh-->En | Single-best ASR | ATR
Zh-->En | Verbatim | ATR
Es-->En | Text | UPV
Es-->En | Single-best ASR | UPV
Es-->En | Verbatim |
En-->Es | Text | UPV
En-->Es | Single-best ASR | UPV
En-->Es | Verbatim |
Johns Hopkins has not yet confirmed its commitment.
Training Data
English and Spanish
You can use any of the training resources listed in the table above in addition to the EPPS training sets. To get the EPPS training sets on DVD, please contact Christian Gollan at RWTH.
Chinese
You can use any of the training resources listed in the table above, except the TDT3 audio files and transcriptions for December 1998 (the development and test sets will be built from these files).
Development Data
The EPPS verbatim transcriptions are shared with the ASR evaluation. The difference is that only 2 files are used for SLT (instead of 3 for ASR), as only 25,000 words are needed.
The latest release of the transcriptions is 27jan2005. All files have been converted to UTF-8.
About the Final Text Edition (FTE) transcriptions published by the EC: we provide the original HTML file (covering the whole day), an HTML file cut from the original and covering only the portion we are interested in, and a raw text version of that excerpt (saved in UTF-8). We also provide another UTF-8 text file generated from the HTML excerpt following the NIST translation guidelines, which NIST gives to translation agencies together with the source documents. Please note that this is a format for human translators; it is more relaxed than the NIST MT SGML format, which we will ultimately use.
Data Set | Files
EPPS English Verbatim Transcriptions |
- Complete archive (tgz) - same as ASR dev data - use only:
- 20041026_1505_1700_EN_SAT
- 20041027_1210_1310_EN_SAT
|
EPPS English Final Text Edition |
- Complete archive (tgz) contains:
- original HTML file from EC
- excerpt in HTML for the pertinent time range
- raw text (UTF-8) excerpt for the time range
|
EPPS Spanish Verbatim Transcriptions |
- Complete archive (tgz) - same as ASR dev data - use only:
- 20041026_1505_1700_ES_SAT
- 20041027_1210_1310_ES_SAT
|
EPPS Spanish Final Text Edition |
- Complete archive (tgz) contains:
- original HTML file from EC
- excerpt in HTML for the pertinent time range
- raw text (UTF-8) excerpt for the time range
|
EPPS English->Spanish Reference Translations |
- English source set (NIST MT format): SRCSET containing 4 documents:
- 20041026_1505_1700_EN_SAT
- 20041027_1210_1310_EN_SAT
- 20041026_1505_1700_EN_FTE
- 20041027_1210_1310_EN_FTE
- Spanish reference translations (NIST MT format): REFSET containing 2 reference translations (different sysid) for each document (i.e. 8 documents):
- 20041026_1505_1700_EN_SAT / tcstar-run1-epps-dev-enes-ref1
- 20041026_1505_1700_EN_SAT / tcstar-run1-epps-dev-enes-ref2
- 20041027_1210_1310_EN_SAT / tcstar-run1-epps-dev-enes-ref1
- 20041027_1210_1310_EN_SAT / tcstar-run1-epps-dev-enes-ref2
- 20041026_1505_1700_EN_FTE / tcstar-run1-epps-dev-enes-ref1
- 20041026_1505_1700_EN_FTE / tcstar-run1-epps-dev-enes-ref2
- 20041027_1210_1310_EN_FTE / tcstar-run1-epps-dev-enes-ref1
- 20041027_1210_1310_EN_FTE / tcstar-run1-epps-dev-enes-ref2
|
EPPS Spanish->English Reference Translations |
- Spanish source set (NIST MT format): SRCSET containing 4 documents:
- 20041026_1505_1700_ES_SAT
- 20041027_1210_1310_ES_SAT
- 20041026_1505_1700_ES_FTE
- 20041027_1210_1310_ES_FTE
- English reference translations (NIST MT format): REFSET containing 2 reference translations (different sysid) for each document (i.e. 8 documents):
- 20041026_1505_1700_ES_SAT / tcstar-run1-epps-dev-enes-ref1
- 20041026_1505_1700_ES_SAT / tcstar-run1-epps-dev-enes-ref2
- 20041027_1210_1310_ES_SAT / tcstar-run1-epps-dev-enes-ref1
- 20041027_1210_1310_ES_SAT / tcstar-run1-epps-dev-enes-ref2
- 20041026_1505_1700_ES_FTE / tcstar-run1-epps-dev-enes-ref1
- 20041026_1505_1700_ES_FTE / tcstar-run1-epps-dev-enes-ref2
- 20041027_1210_1310_ES_FTE / tcstar-run1-epps-dev-enes-ref1
- 20041027_1210_1310_ES_FTE / tcstar-run1-epps-dev-enes-ref2
|
EPPS English-Spanish DevData |
|
|
|
VOA Chinese Transcriptions |
- Chinese source set (NIST MT format, GB2312 encoded): SRCSET containing 3 documents:
- 19981201_0700_0730_VOA_MAN
- 19981202_0900_0930_VOA_MAN
- 19981203_0700_0730_VOA_MAN
- English reference translations (NIST MT format, UTF-8 encoded): REFSET containing 2 reference translations (different sysid) for each document (i.e. 6 documents):
- 19981201_0700_0730_VOA_MAN / tcstar-run1-voa-dev-ref1
- 19981201_0700_0730_VOA_MAN / tcstar-run1-voa-dev-ref2
- 19981202_0900_0930_VOA_MAN / tcstar-run1-voa-dev-ref1
- 19981202_0900_0930_VOA_MAN / tcstar-run1-voa-dev-ref2
- 19981203_0700_0730_VOA_MAN / tcstar-run1-voa-dev-ref1
- 19981203_0700_0730_VOA_MAN / tcstar-run1-voa-dev-ref2
|
VOA Chinese Text Version |
- Chinese source set (NIST MT format, GB2312 encoded): SRCSET containing 3 documents:
- 19981201_0700_0730_VOA_MAN_text
- 19981202_0900_0930_VOA_MAN_text
- 19981203_0700_0730_VOA_MAN_text
- English reference translations (NIST MT format, UTF-8 encoded): REFSET containing 2 reference translations (different sysid) for each document (i.e. 6 documents):
- 19981201_0700_0730_VOA_MAN_text / tcstar-run1-voa-dev-ref1
- 19981201_0700_0730_VOA_MAN_text / tcstar-run1-voa-dev-ref2
- 19981202_0900_0930_VOA_MAN_text / tcstar-run1-voa-dev-ref1
- 19981202_0900_0930_VOA_MAN_text / tcstar-run1-voa-dev-ref2
- 19981203_0700_0730_VOA_MAN_text / tcstar-run1-voa-dev-ref1
- 19981203_0700_0730_VOA_MAN_text / tcstar-run1-voa-dev-ref2
|
The translation guidelines for the translation agencies are available here (MS Word document).
Word graphs/lattices are regularly posted on WP2's own web page.
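The VOA development sets above mix encodings: the Chinese source sets are GB2312 encoded, while the English references are UTF-8. For tools that expect a single encoding, a conversion along the following lines may help. This is only a sketch; the function name and the file paths are hypothetical and not part of the distributed data.

```python
from pathlib import Path


def gb2312_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a GB2312 file as UTF-8 without touching its markup."""
    # Decode strictly, so that a file in an unexpected encoding raises
    # UnicodeDecodeError instead of being silently corrupted.
    text = src.read_bytes().decode("gb2312")
    dst.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    source = Path("voa_dev_src.gb2312.sgm")
    if source.exists():
        gb2312_to_utf8(source, Path("voa_dev_src.utf8.sgm"))
```

Working on bytes rather than in text mode keeps line endings exactly as they were in the source file.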
Scoring Tools
The scoring tools proposed by ELDA are Perl scripts. You can download them in this zip, which contains:
- BLEU/NIST v11a
- mWER: multiple reference word error rate
- mPER: multiple reference position-independent word error rate
- mCER: multiple reference character error rate
- WNM: Weighted N-gram Model
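To give an idea of what a multiple reference error rate measures, here is a minimal Python sketch of mWER. It is not taken from the ELDA Perl scripts, which remain the reference implementation. It assumes whitespace tokenisation and assumes that, for each segment, the error count is the edit distance to the closest reference, normalised by the length of that reference; the actual scripts may tokenise and normalise differently.

```python
from typing import List, Sequence


def edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,           # drop hypothesis token h
                            curr[j - 1] + 1,       # add reference token r
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def mwer(hyps: List[str], refs: List[List[str]]) -> float:
    """Multiple reference WER over a list of segments.

    refs[i] holds all reference translations of segment i.
    Assumption: each segment is scored against its closest reference.
    """
    errors = 0
    ref_words = 0
    for hyp, seg_refs in zip(hyps, refs):
        hyp_tokens = hyp.split()
        # (distance, reference length) of the closest reference;
        # ties go to the shorter reference.
        dist, length = min(
            (edit_distance(hyp_tokens, r.split()), len(r.split()))
            for r in seg_refs
        )
        errors += dist
        ref_words += length
    return errors / ref_words if ref_words else 0.0
```

Under the same assumptions, a character-level variant in the spirit of mCER can be obtained by passing `list(text)` instead of `text.split()` to `edit_distance`.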
Test Data
Source files for SLT evaluation are to be downloaded here.
Submission guidelines:
Submitted files should use the NIST MT format for TSTSET:
<TSTSET setid="..." srclang="..." trglang="...">
<DOC docid="..." sysid="...">
<SEG id="1"> TRANSLATED TEXT </SEG>
...
</DOC>
...
</TSTSET>
Submit one test set per file, i.e. one file for English verbatim transcripts, one file for English ASR output, one file for English FTE, etc.
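As an illustration of the TSTSET layout above, the following Python sketch builds one submission file from a mapping of document ids to translated segments. The function name, the attribute values in the usage note and the choice to escape `&`, `<` and `>` as SGML entities are assumptions made for this example; check them against the scoring scripts before relying on them.

```python
from typing import Dict, List
from xml.sax.saxutils import escape, quoteattr


def build_tstset(setid: str, srclang: str, trglang: str, sysid: str,
                 docs: Dict[str, List[str]]) -> str:
    """Return the text of one TSTSET file.

    docs maps each docid to its translated segments, in source order.
    Segment ids are numbered from 1 within each document.
    """
    lines = [
        f"<TSTSET setid={quoteattr(setid)} "
        f"srclang={quoteattr(srclang)} trglang={quoteattr(trglang)}>"
    ]
    for docid, segments in docs.items():
        lines.append(f"<DOC docid={quoteattr(docid)} sysid={quoteattr(sysid)}>")
        for seg_id, text in enumerate(segments, start=1):
            # Assumption: markup characters in the translation are escaped.
            lines.append(f'<SEG id="{seg_id}"> {escape(text)} </SEG>')
        lines.append("</DOC>")
    lines.append("</TSTSET>")
    return "\n".join(lines) + "\n"
```

For example, `build_tstset("some-setid", "English", "Spanish", "ORG-PRIMARY-system1", {"20041116_1505_1800_EN_SAT_verbatim": ["..."]})` returns the text of a single-document test set; the setid and language values here are placeholders and should be copied from the corresponding source file.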
The sysid attribute must identify the organisation, the condition, and the system. For instance, if the organisation ORG submits one primary condition and two secondary conditions for the English verbatim transcripts (one with the same system as the primary and the other with another system or system version), it will send 3 files for this setid, with the following sysid values:
- ORG-PRIMARY-system1
- ORG-SECONDARY-system1
- ORG-SECONDARY-system2
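The naming pattern in the example above can be generated programmatically to avoid typos across files. The helper below is a hypothetical convenience, not part of the evaluation tools; it assumes that the only conditions are PRIMARY and SECONDARY, as in the example, and that the three parts are joined with hyphens.

```python
ALLOWED_CONDITIONS = {"PRIMARY", "SECONDARY"}  # assumption based on the example


def make_sysid(org: str, condition: str, system: str) -> str:
    """Build a sysid of the form ORG-CONDITION-system."""
    condition = condition.upper()
    if condition not in ALLOWED_CONDITIONS:
        raise ValueError(f"unknown condition: {condition!r}")
    if not org or not system:
        raise ValueError("organisation and system names must not be empty")
    return f"{org}-{condition}-{system}"


# The three submissions from the example above.
sysids = [
    make_sysid("ORG", "primary", "system1"),
    make_sysid("ORG", "secondary", "system1"),
    make_sysid("ORG", "secondary", "system2"),
]
```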
Source files
Data Set | Files
EPPS English Verbatim Transcriptions |
- English source set (NIST MT format, UTF-8 encoded, without punctuation): SRCSET containing 3 documents
- 20041116_1505_1800_EN_SAT_verbatim
- 20041117_0905_1240_EN_SAT_verbatim
- 20041118_1000_1225_EN_SAT_verbatim
|
EPPS English ASR Output (single-best) |
- English source set (NIST MT format, UTF-8 encoded, without punctuation, case-insensitive): SRCSET containing 3 documents
- 20041116_1505_1800_EN_SAT_asr
- 20041117_0905_1240_EN_SAT_asr
- 20041118_1000_1225_EN_SAT_asr
- For systems needing case-sensitivity, here is an alternative English source set produced by LIMSI (NIST MT format, UTF-8 encoded, without punctuation, case-sensitive): SRCSET containing 3 documents
- 20041116_1505_1800_EN_SAT_limsi
- 20041117_0905_1240_EN_SAT_limsi
- 20041118_1000_1225_EN_SAT_limsi
|
EPPS English Final Text Edition |
- English source set (NIST MT format, UTF-8 encoded): SRCSET containing 4 documents
- 20041115_1505_1735_EN_FTE
- 20041116_1505_1800_EN_FTE
- 20041117_0905_1240_EN_FTE
- 20041117_1500_1835_EN_FTE
|
EPPS Spanish Verbatim Transcriptions |
- Spanish source set (NIST MT format, UTF-8 encoded, without punctuation): SRCSET containing 5 documents
- 20041115_1705_1735_ES_SAT_verbatim
- 20041116_1505_1800_ES_SAT_verbatim
- 20041117_1500_1835_ES_SAT_verbatim
- 20041118_1000_1225_ES_SAT_verbatim
- 20041118_1500_1600_ES_SAT_verbatim
|
EPPS Spanish ASR Output (single-best) |
- Spanish source set (NIST MT format, UTF-8 encoded, without punctuation, case-insensitive): SRCSET containing 5 documents
- 20041115_1705_1735_ES_SAT_asr
- 20041116_1505_1800_ES_SAT_asr
- 20041117_1500_1835_ES_SAT_asr
- 20041118_1000_1225_ES_SAT_asr
- 20041118_1500_1600_ES_SAT_asr
Warning: some segments may be empty for alignment reasons with verbatim transcripts.
- For systems needing case-sensitivity, here is an alternative Spanish source set produced by LIMSI (NIST MT format, UTF-8 encoded, without punctuation, case-sensitive): SRCSET containing 5 documents
- 20041115_1705_1735_ES_SAT_limsi
- 20041116_1505_1800_ES_SAT_limsi
- 20041117_1500_1835_ES_SAT_limsi
- 20041118_1000_1225_ES_SAT_limsi
- 20041118_1500_1600_ES_SAT_limsi
|
EPPS Spanish Final Text Edition |
- Spanish source set (NIST MT format, UTF-8 encoded): SRCSET containing 4 documents
- 20041115_1505_1735_ES_FTE
- 20041116_1505_1800_ES_FTE
- 20041117_0905_1240_ES_FTE
- 20041117_1500_1835_ES_FTE
|
|
|
VOA Chinese ASR Output (single best) |
- Chinese source set (NIST MT format, GB2312 encoded, without punctuation): SRCSET containing 3 documents:
- 19981214_0700_0730_VOA_MAN_asr
- 19981215_0900_0930_VOA_MAN_asr
- 19981216_0700_0730_VOA_MAN_asr
Warning: some segments may be empty for alignment reasons with verbatim transcripts.
|
VOA Chinese Verbatim Transcriptions |
- Chinese source set (NIST MT format, GB2312 encoded, without punctuation): SRCSET containing 3 documents:
- 19981214_0700_0730_VOA_MAN_verbatim
- 19981215_0900_0930_VOA_MAN_verbatim
- 19981216_0700_0730_VOA_MAN_verbatim
|
VOA Chinese Text Version |
- Chinese source set (NIST MT format, GB2312 encoded): SRCSET containing 3 documents:
- 19981214_0700_0730_VOA_MAN_text
- 19981215_0900_0930_VOA_MAN_text
- 19981216_0700_0730_VOA_MAN_text
|
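Because some ASR segments may be empty (see the warnings above), a system should still return one output segment per input segment so that segment ids stay aligned with the references. The sketch below reads segments with regular expressions and passes empty ones through untranslated. It assumes that SRCSET files use the same DOC/SEG layout as the TSTSET format shown in the submission guidelines; `translate` is a placeholder for the actual system, and the file content must already be decoded (UTF-8 or GB2312, depending on the set).

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed layout: <DOC docid="..."> ... <SEG id="N"> text </SEG> ... </DOC>
DOC_RE = re.compile(r'<DOC\s+[^>]*docid="([^"]*)"[^>]*>(.*?)</DOC>',
                    re.IGNORECASE | re.DOTALL)
SEG_RE = re.compile(r'<SEG\s+id="?(\d+)"?\s*>(.*?)</SEG>',
                    re.IGNORECASE | re.DOTALL)

Segments = List[Tuple[int, str]]


def read_segments(sgml: str) -> Dict[str, Segments]:
    """Map each docid to its (segment id, text) pairs, empty ones included."""
    return {
        docid: [(int(seg_id), text.strip())
                for seg_id, text in SEG_RE.findall(body)]
        for docid, body in DOC_RE.findall(sgml)
    }


def translate_document(segments: Segments,
                       translate: Callable[[str], str]) -> Segments:
    """Translate non-empty segments; keep empty ones as empty output."""
    return [(seg_id, translate(text) if text else "")
            for seg_id, text in segments]
```

Keeping the empty segments in the output, rather than dropping them, preserves the one-to-one correspondence between source and translated segment ids.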
Reference Translations
Data Set | Files
EPPS English Verbatim Transcriptions |
|
EPPS English ASR Output (ROVERed) |
|
EPPS English ASR Output (LIMSI) |
|
EPPS English Final Text Edition |
|
|
|
EPPS Spanish Verbatim Transcriptions |
|
EPPS Spanish ASR Output (ROVERed) |
|
EPPS Spanish ASR Output (LIMSI) |
|
EPPS Spanish Final Text Edition |
|
|
|
VOA Chinese ASR Output (single best) |
|
VOA Chinese Verbatim Transcriptions |
|
VOA Chinese Text Version |
|
Output on test data
 | En->Es | Es->En | Zh->En
ASR | | |
text | | |
Verbatim | | |
You can download all the output data in this zip file.