TC-STAR Evaluation Information (WP4)


ASR Evaluation - Run #1

How To Participate?

This section is only a summary of the ASR Evaluation Plan; the complete document can be found here.

TC-STAR Evaluation Run #1 for ASR will take place from March 1, 2005 to March 25, 2005. The complete schedule can be seen here; the key dates for ASR are:

  • Last week of February: ELDA sends a DVD with audio test data to participants.
  • March 1, 2005: Beginning of ASR run by participants
  • March 11, 2005: End of ASR run - participants send their output to ELDA
  • March 18, 2005: End of ASR scoring by ELDA - results are sent to participants
  • March 19, 2005: Beginning of adjudication phase
  • March 25, 2005: End of adjudication phase - results are definitive

Before the evaluation run proper, participants have access to training and development data consisting of audio files [DVD sent by RWTH? Sent by ELDA? or download?] and transcriptions. See the Training Data and Development Data sections below.

The ASR evaluation will be run in 3 languages: English, Spanish and Mandarin Chinese.

English and Spanish are run on recordings of the European Parliament Plenary Sessions (EPPS), while Mandarin Chinese is run on recordings of Voice of America.

External Participants

  1. Declare your interest to the project's coordinator.
  2. ELDA sends you an End-User Agreement giving access to the data. You must fill in, sign and return this agreement to ELDA.
  3. Get access to training and development data:
    • EPPS training and development audio files will be sent to you by RWTH
    • EPPS training and development transcriptions can be downloaded (see the Training Data and Development Data sections)
    • Chinese training data can be any of the resources listed under Training in the ASR Resources section.
    • Chinese development and test audio files must be purchased from LDC.
    • Chinese development transcriptions can be downloaded (see the Development Data section)
  4. ELDA will contact you, like any other participant, for the evaluation run.
  5. You must release the data after the end of the evaluation run, as stated in the agreement.


ASR Participants

Columns: EPPS | BN | Open | Public | Restricted
LIMSI: En, Es | Zh | X | X | X
UKA: En | Zh | X | X
SONY: En | En | X | ? | ?
IBM: En, Es | X | X
NOKIA: En | X
IRST: En, Es | X | X
RWTH: En, Es | ? | ? | X


ASR Resources

Language | Reference | Amount | IPR-owner | IPR-distrib | IPR-granted use | IPR-royalty | Actors / comments
Training
Zh | Mandarin 1997 BN (Hub4-NE) LDC98S73 (audio) & LDC98T24 (transcr) | ~30h | ? | LDC | research | LDC membership 98 required |
 | Mandarin 2001 Call (Hub5) LDC98S69, LDC98T26 (transcr) | ~40h | ? | LDC | research | LDC membership 98 required |
 | Mandarin TDT2 LDC2001S93 & LDC2001T57 (transcr) | | ? | LDC | research | LDC membership 01 required |
 | Mandarin TDT3 LDC2001S95 & LDC2001T58 | | ? | LDC | research | LDC membership 01 required | BLACKOUT ON DEC 98!!!
 | Mandarin Chinese News Text LDC95T13 | 250M words | ? | LDC | research | LDC membership 95 required |
 | Mandarin CALLHOME LDC96S34, LDC96T16 (transcr) | | ? | LDC | research | LDC membership 96 required |
 | Chinese Gigaword LDC2003T09 | 1.1G words | ? | LDC | research | LDC membership 03 required |
 | Hong Kong News Parallel Text LDC2000T46 (Zh/En) | 18147 articles | ? | LDC | research | LDC membership 00 required |
Es | EPPS_SP (audio): 3 May - 14 Oct 2004 | 40h | CE | ELRA | research | free in TCSTAR | Transcribed by UPC
 | EPPS_SP (transcriptions): 3 May - 14 Oct 2004 | 40h transcribed | UPC | ELRA | research | nominal fee (UPC) | Provided to TCSTAR by UPC
 | EPPS_SP (text): Apr 1996 - Jun 2004 | >36M words | RWTH | ELRA | research | nominal fee (RWTH) | Provided to TCSTAR by RWTH
 | TC-STAR_P Spanish BN | 10h transcribed | ? | UPC | research | free in TCSTAR | Provided to TCSTAR by UPC
 | Spanish LDC 1997, BN speech (Hub4-NE), LDC98S74 | | ? | LDC | research | LDC 98 membership required |
 | Spanish LDC CallHome, LDC96S35 | | ? | LDC | research | LDC 96 membership required |
En | EPPS_EN (audio): 3 May - 14 Oct 2004 | 40h | CE | ELRA | research | free in TCSTAR | Transcribed by RWTH
 | EPPS_EN (transcriptions): 3 May - 14 Oct 2004 | 40h transcribed | RWTH | ELRA | research | nominal fee (RWTH) | Provided to TCSTAR by RWTH
 | EPPS_EN (text): Apr 1996 - Jun 2004 | >36M words | RWTH | ELRA | research | nominal fee (RWTH) | Provided to TCSTAR by RWTH
 | TC-STAR_P English BN | 10h transcribed | RFI | ELRA | research | free in TCSTAR | Distributed by ELDA
 | English LDC 1995 (CSR-IV Hub 4 Marketplace LDC96S31), 1996, 1997, official NIST Hub4 training sets, LDC97S44 and LDC98S71, USC Marketplace Broadcast News Speech (LDC99S82) | | ? | LDC | research | LDC 96, 98 and 99 membership required |
 | English LDC TDT2 and TDT3 data with closed-captions, about 2000h, LDC99S84 and LDC2001S94 | | ? | LDC | research | LDC 99 and 01 membership required |
 | English LDC Switchboard 1, 2-I, 2-II, 2-III, LDC97S62, LDC98S75, LDC99S79 | | ? | LDC | research | LDC 98, 98 and 99 membership required |
 | English LDC Callhome, LDC97S42, LDC2004S05, LDC2004S09 | | ? | LDC | research | LDC 97 and 04 membership required |
 | English LDC Meeting corpora, ICSI LDC2004S02, ISL LDC2004S05, NIST LDC2004S09 | | ? | LDC | research | LDC 04 membership required |
Development
Zh | Mandarin TDT3 1 Dec 98 to 11 Dec 98 (LDC2001S95 & LDC2001T58) | ~4h | ? | LDC | research | LDC membership 01 required | selected by ELDA
Es | EPPS_SP (audio + transcriptions): 25-28 Oct 2004 | ~4h | CE | ELRA | research | free in TCSTAR | selected by ELDA
 | EPPS Final Text Edition for sessions 25-28 Oct 2004 | | ELRA | ELRA | research | free in TCSTAR | selected by ELDA
En | EPPS_EN dev (audio + transcriptions): 25-28 Oct 2004 | ~3h | CE | ELRA | research | free in TCSTAR | selected by ELDA
 | EPPS Final Text Edition for sessions 25-28 Oct 2004 | | CE | ELRA | research | free in TCSTAR | selected by ELDA
Test
Zh | Mandarin TDT3 14 Dec 98 to 22 Dec 98 (LDC2001S95 & LDC2001T58) | ~4h | ? | LDC | research | LDC membership 01 required | selected by ELDA
Es | EPPS_SP (audio): 15-18 Nov 2004 | ~4h | CE | ELRA | research | free in TCSTAR | selected by ELDA
En | EPPS_EN (audio): 15-18 Nov 2004 | ~3h | CE | ELRA | research | free in TCSTAR | selected by ELDA


ASR Download area

Training Data

English and Spanish

You can use any of the training resources listed in the table above in addition to the EPPS training sets. To obtain the EPPS training sets on DVD, please contact Christian Gollan at RWTH.

Chinese

You can use any of the training resources listed in the table above, except the TDT3 audio files and transcriptions for December 1998 (the development and test sets are built from these files).
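The exclusion rule for TDT3 can be expressed as a small date check. The following is a minimal sketch, not part of the evaluation tools: the function name `tdt3_role` is hypothetical, and the development and test date ranges are those given in the resources table, assumed here to be inclusive.

```python
from datetime import date

# Date ranges from the resources table; treating them as inclusive is an assumption.
DEV_RANGE = (date(1998, 12, 1), date(1998, 12, 11))
TEST_RANGE = (date(1998, 12, 14), date(1998, 12, 22))


def tdt3_role(recording_date: date) -> str:
    """Classify a Mandarin TDT3 recording date (hypothetical helper).

    Returns 'development', 'test', 'excluded' or 'training'.
    """
    if DEV_RANGE[0] <= recording_date <= DEV_RANGE[1]:
        return "development"
    if TEST_RANGE[0] <= recording_date <= TEST_RANGE[1]:
        return "test"
    # The rest of December 1998 is also off-limits for training.
    if (recording_date.year, recording_date.month) == (1998, 12):
        return "excluded"
    return "training"


if __name__ == "__main__":
    for d in (date(1998, 11, 30), date(1998, 12, 5),
              date(1998, 12, 12), date(1998, 12, 20)):
        print(d.isoformat(), tdt3_role(d))
```

A training pipeline could call such a check on each TDT3 file before adding it to the training pool; how the recording date is obtained from a file is left open here.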


Development Data

To get the corresponding audio files on DVD, contact Christian Gollan at RWTH.

The verbatim transcriptions of EPPS are shared with the SLT evaluation. The difference is that ASR uses 3 files in total (i.e. 1 more than SLT), in order to reach around 4 hours of audio plus transcriptions.

The latest release of the transcriptions is 27jan2005.

Data Set Files

EPPS English Verbatim Transcriptions

  • 2004-10-26/satellite/en/1505-1700
  • 2004-10-27/satellite/en/1210-1310
  • 2004-10-28/satellite/en/1000-1135

Complete archive (tgz)

EPPS Spanish Verbatim Transcriptions

  • 2004-10-26/satellite/es/1505-1700
  • 2004-10-27/satellite/es/1210-1310
  • 2004-10-28/satellite/es/1000-1135

Complete archive (tgz)

EPPS English + Spanish Verbatim Transcriptions
Complete archive (tgz)
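The session identifiers listed above appear to follow a date/channel/language/time-span pattern. The sketch below splits an identifier into those parts; the field names, the `Session` class and the reading of the last component as HHMM-HHMM start and end times are assumptions made for illustration, not something the evaluation plan states.

```python
from dataclasses import dataclass


def _to_minutes(hhmm: str) -> int:
    """Convert an 'HHMM' string to minutes since midnight."""
    return int(hhmm[:2]) * 60 + int(hhmm[2:])


@dataclass(frozen=True)
class Session:
    day: str       # e.g. "2004-10-26"
    channel: str   # e.g. "satellite"
    language: str  # e.g. "en"
    start: str     # assumed to be HHMM
    end: str       # assumed to be HHMM

    @property
    def minutes(self) -> int:
        """Nominal span in minutes, assuming start and end are on the same day."""
        return _to_minutes(self.end) - _to_minutes(self.start)


def parse_session(identifier: str) -> Session:
    """Split an identifier such as '2004-10-26/satellite/en/1505-1700'."""
    day, channel, language, span = identifier.split("/")
    start, end = span.split("-")
    return Session(day, channel, language, start, end)


ENGLISH_DEV = [
    "2004-10-26/satellite/en/1505-1700",
    "2004-10-27/satellite/en/1210-1310",
    "2004-10-28/satellite/en/1000-1135",
]

if __name__ == "__main__":
    for identifier in ENGLISH_DEV:
        session = parse_session(identifier)
        print(session.day, session.language, session.minutes, "min")
```

The span computed this way is only the nominal length of the recording slot; it need not equal the amount of transcribed audio.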

The data can also be downloaded from RWTH's web site together with training data.


Test Data

To automatically produce WAV files that follow the new filename convention from the DVD distributed by ELDA, use the following procedure under Linux:

  • download FLAC, uncompress it and make sure that its bin directory appears in your $PATH environment variable;
  • download this Perl script. Launch it using perl norm_tcstar_dvd.pl <srcdir> <destdir> where "srcdir" is the directory where your DVD is mounted (e.g. "/mnt/cdrom") and "destdir" is an existing directory in which to store the WAV files.
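The prerequisites of this procedure can be checked before the script is launched. The following is a hedged sketch of a small Python wrapper: the function names are hypothetical, it assumes that the FLAC executable is called `flac` and that norm_tcstar_dvd.pl sits in the current directory, and it only builds and runs the command line given above without knowing anything about the script's internals.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, List, Optional


def build_command(
    srcdir: str,
    destdir: str,
    script: str = "norm_tcstar_dvd.pl",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Check the prerequisites and return the command line to run.

    `which` is injectable so the PATH lookup can be replaced in tests.
    """
    # Assumption: the FLAC decoder is available under the name "flac".
    if which("flac") is None:
        raise RuntimeError("flac not found in PATH")
    # srcdir is where the DVD is mounted, e.g. "/mnt/cdrom".
    if not Path(srcdir).is_dir():
        raise NotADirectoryError(srcdir)
    # destdir must already exist; it is not created here.
    if not Path(destdir).is_dir():
        raise NotADirectoryError(destdir)
    return ["perl", script, srcdir, destdir]


def convert_dvd(srcdir: str, destdir: str) -> None:
    """Run the Perl script and raise if it exits with an error."""
    subprocess.run(build_command(srcdir, destdir), check=True)
```

For example, `convert_dvd("/mnt/cdrom", "/data/tcstar_wav")` would run the same command as the manual step, where `/data/tcstar_wav` is a placeholder for your own destination directory.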

Download the PEM files that define the segmentation for ASR Run #1.

Download the CTM file to generate the English ROVER.

Download the 2004-11-18 English audio file (SPEX).
