TC-Star Evaluation Information (WP4)


ASR Evaluation - Run #2

 

TC-STAR Evaluation Run #2 for ASR will take place from Feb. 1, 2006 to Feb. 24, 2006. The complete schedule can be seen here; the key dates for ASR are:

  • Feb. 1, 2006: ELDA sends the test data to participants
  • Feb. 12, 2006 (23h59 Continental time): deadline for submitting results to ELDA
  • Feb. 15, 2006: ELDA sends preliminary results to participants with the reference
  • Feb. 24, 2006: ELDA sends FINAL results to participants with the final reference

Here is the submission protocol for sending your results.

Before the evaluation run itself, participants have access to training data and development data consisting of audio files and transcriptions. See the Training data and Development Data sections below.

The ASR evaluation will be run in 3 languages: English, Spanish and Mandarin Chinese.

English and Spanish are evaluated on recordings from the European Parliament Plenary Sessions (EPPS), while Chinese is evaluated on recordings from Voice of America.


ASR Participants

 

  EPPS ENGLISH EPPS SPANISH CORTES SPANISH Broadcast news Mandarin Punctuation
LIMSI
X
X
X
X
Yes
UKA
X
X
?
SONY
X
?
IBM
X
X
X
?
NOKIA
X
?
IRST
X
X
X
Yes
RWTH
X
X
X
Yes

 

External Participants

  EPPS ENGLISH EPPS SPANISH CORTES SPANISH Broadcast news Mandarin Punctuation
University of Vigo
X
X
?


ASR Resources

Training data

The TC-STAR 2006 audio training set is composed of transcribed and untranscribed data. The total amount of data is 176 hours for English and 230 hours for Spanish. For Mandarin, no specific training data was produced within TC-STAR.

For the RESTRICTED CONDITION, the audio material that can be used is listed here.

If you are participating in the second ASR evaluation campaign and want a copy of the audio data, send an email to Djamel Mostefa.

 

 

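Source  Language  Transcribed (politicians)  Transcribed (interpreters)  Total transcribed  Untranscribed  Total
EPPS    English   21 h                       70 h                        101 h              75 h           176 h
EPPS    Spanish   10 h                       51 h                        100 h              90 h           190 h
PARL    Spanish   38 h                       0

(The Spanish totals appear to cover the EPPS and PARL rows together.)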

 

The following table gives a list of other resources helpful for training purposes.

Language Reference Amount IPR-owner IPR-distrib IPR-granted use IPR-royalty Actors / comments
Training
Zh Mandarin 1997 BN (Hub4-NE) LDC98S73 (audio) & LDC98T24 (transcr) ~30h ? LDC research LDC membership 98 required  
  Mandarin 2001 Call (Hub5) LDC98S69, LDC98T26 (transcr) ~40h ? LDC research LDC membership 98 required  
  Mandarin TDT2 LDC2001S93 & LDC2001T57 (transcr)   ? LDC research LDC membership 01 required  
  Mandarin TDT3 LDC2001S95 & LDC2001T58   ? LDC research LDC membership 01 required BLACKOUT ON DEC 98!!!
  Mandarin Chinese News Text LDC95T13 250M words ? LDC research LDC membership 95 required  
  Mandarin CALLHOME LDC96S34, LDC96T16 (transcr)   ? LDC research LDC membership 96 required  
  Chinese Gigaword LDC2003T09 1.1G words ? LDC research LDC membership 03 required  
  Hong Kong News Parallel Text LDC2000T46 (Zh/En) 18147 articles ? LDC research LDC membership 00 required  
ES EPPS_SP (text): Apr 1996 - May 2005 >36M words RWTH ELRA research nominal fee (RWTH) Provided to TCSTAR by RWTH
  TC-STAR_P Spanish BN 10h transcribed ? UPC research free in TCSTAR Provided to TCSTAR by UPC
  Spanish LDC 1997, BN speech (Hub4-NE), LDC98S74   ? LDC research LDC 98 membership required  
  Spanish LDC CallHome, LDC96S35   ? LDC research LDC 96 membership required  
En EPPS_EN (text): Apr 1996 - May 2005 >36M words RWTH ELRA research nominal fee (RWTH) Provided to TCSTAR by RWTH
  TC-STAR_P English BN 10h transcribed RFI ELRA research free in TCSTAR Distributed by ELDA
  English LDC 1995 (CSR-IV Hub 4 Marketplace LDC96S31), 1996, 1997, official NIST Hub4 training sets, LDC97S44 and LDC98S71, USC Marketplace Broadcast News Speech (LDC99S82)   ? LDC research LDC 96, 98 and 99 membership required  
  English LDC TDT2 and TDT3 data with closed-captions, about 2000h, LDC99S84 and LDC2001S94   ? LDC research LDC 99 and 01 membership required  
  English LDC Switchboard 1, 2-I, 2-II, 2-III, LDC97S62, LDC98S75, LDC99S79   ? LDC research LDC 98, 98 and 99 membership required  
  English LDC Callhome, LDC97S42, LDC2004S05, LDC2004S09   ? LDC research LDC 97 and 04 membership required  
  English LDC Meeting corpora, ICSI LDC2004S02, ISL LDC2004S05, NIST LDC2004S09   ? LDC research LDC 04 membership required  


Development Data

To get the corresponding audio files on DVD, contact Djamel Mostefa.

The EPPS verbatim transcriptions are shared with the SLT evaluation.

Data Set Files

EPPS English Verbatim Transcriptions:

20050606_1700_1915_OR_SAT
20050607_0900_1215_OR_SAT
20050607_1500_1840_OR_SAT
20050608_0900_1300_OR_SAT
20050608_1505_1815_OR_SAT
20050609_1000_1205_OR_SAT
20050609_1500_1650_OR_SAT

English development package version 3

(updated on January 17)

Validation report from SPEX

Statistics

EPPS Spanish Verbatim Transcriptions:

20050606_1700_1915_OR_SAT
20050607_0900_1215_OR_SAT
20050607_1500_1840_OR_SAT
20050608_0900_1300_OR_SAT
20050608_1505_1815_OR_SAT
20050609_1000_1205_OR_SAT
20050609_1500_1650_OR_SAT
20050704_1705_1915_OR_SAT
20050705_0900_1130_OR_SAT
20050705_1505_1920_OR_SAT
20050706_0900_1230_OR_SAT
20050706_1500_1755_OR_SAT
20050707_1000_1215_OR_SAT
20050707_1545_1750_OR_SAT

CORTES Spanish Parliament:

PARL_041201_01_ES
PARL_041201_02_ES
PARL_041201_03_ES
PARL_041201_04_ES
PARL_041201_05_ES
PARL_041202_01_ES

Spanish development package version 4

(updated on March 29)

Validation report from SPEX

Statistics
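The EPPS file names listed above look like they encode a session date followed by start and end times, for example `20050606_1700_1915_OR_SAT`. The source does not document this layout, so the sketch below treats the `YYYYMMDD_HHMM_HHMM` reading as an assumption and keeps the trailing `OR_SAT` part as an opaque suffix. The names `Session` and `parse_session_name` are hypothetical helpers, not part of any TC-STAR tooling, and the CORTES names (`PARL_041201_01_ES`) follow a different pattern that this function does not handle.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Session(NamedTuple):
    """One EPPS recording session, as read from its file name (assumed layout)."""
    start: datetime
    end: datetime
    suffix: str  # trailing part such as "OR_SAT"; meaning not stated in the source

    @property
    def duration(self) -> timedelta:
        # Length of the session according to the times in the file name.
        return self.end - self.start


def parse_session_name(name: str) -> Session:
    """Split an EPPS-style name into date, start time, end time and suffix."""
    date_part, start_part, end_part, suffix = name.split("_", 3)
    start = datetime.strptime(date_part + start_part, "%Y%m%d%H%M")
    end = datetime.strptime(date_part + end_part, "%Y%m%d%H%M")
    return Session(start, end, suffix)


session = parse_session_name("20050606_1700_1915_OR_SAT")
print(session.start.isoformat(), session.duration)
```

Under this reading the first session runs from 17:00 to 19:15 on 6 June 2005. These are nominal session lengths taken from the file names, not the amount of transcribed speech in each file.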


 

Data statistics

Here are some statistics about the development and test sets for English and Spanish.

Statistics on the English dev set

             TOTAL   MALE NATIVE   MALE NONNATIVE   FEMALE NATIVE   FEMALE NONNATIVE
# Speakers   41      26            6                6               3
Duration     3h      31.34%        43.92%           18.75%          5.99%
Perplexity   20.51
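To relate the duration percentages of the English dev set to actual speech time, they can be converted into minutes. This is only a rough sketch: it assumes the percentages are shares of the total duration and that the total is exactly 3 hours, whereas the figure given is rounded. `share_to_minutes` and the constant names are hypothetical.

```python
def share_to_minutes(total_hours: float, percent: float) -> float:
    """Convert a percentage share of a total duration (in hours) into minutes."""
    return total_hours * 60.0 * percent / 100.0


# Duration shares of the English dev set, copied from the statistics above.
ENGLISH_DEV_SHARES = {
    "male native": 31.34,
    "male nonnative": 43.92,
    "female native": 18.75,
    "female nonnative": 5.99,
}
TOTAL_HOURS = 3.0  # assumption: the rounded "3h" total is taken at face value

for group, percent in ENGLISH_DEV_SHARES.items():
    print(f"{group}: {share_to_minutes(TOTAL_HOURS, percent):.1f} min")
```

The four shares add up to 100%, so the per-group minutes add up to the assumed 180-minute total.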

Statistics on the English test set

 

 

             TOTAL   MALE NATIVE   MALE NONNATIVE   FEMALE NATIVE   FEMALE NONNATIVE
# Speakers   ~40     ~20           ~5               ~10             ~5
Duration     3h      60%           12%              21%             6%
Perplexity   29.36

Statistics on the Spanish dev set

 

             TOTAL   MALE SPEAKERS   FEMALE SPEAKERS
# Speakers   61      44              17
Duration     5.8h    79.69%          20.31%
Perplexity   34.12


Results

Preliminary results are available for English and Spanish:

  Results reference STM file
ENGLISH  English results (updated March 13th)
SPANISH  Spanish results (updated March 23rd)
CHINESE  eval06zh.tgz

 

Update History

2005-Nov-10 Development data available for EPPS English and EPPS Spanish
2005-Dec-15 EPPS development data updated; Spanish Parliament (Cortes Madrid) dev data added to the Spanish dev package

2005-Dec-22 Statistics on the dev./test data composition
2006-Jan-17 Scoring software updated in each devset package to handle punctuation and case sensitivity
2006-Feb-24 English and Spanish results available
2006-Mar-10 Final results for English available
2006-Mar-20 Chinese results available
2006-Mar-29 Spanish development results updated for the Spanish Parliament only
