Update History
TC-STAR Evaluation Run #3 for ASR will take place from Jan. 21, 2007 to Jan. 28, 2007. The development data is already available (see the Development Data section). The complete schedule can be seen here; the important dates for ASR are:
- ELDA sends the test data to participants: Jan. 21, 2007
- Deadline for submitting results to ELDA: Jan. 28, 2007 (23h59 Central European Time)
- ELDA sends preliminary results to participants with reference: Jan. 31, 2007
- ELDA sends FINAL results to participants with final reference: Feb. 14, 2007
The 2007 evaluation protocol is now available.
Here is the submission protocol for sending your results.
Before the evaluation run proper, participants have access to training data and development data consisting of audio files and transcriptions. See the Training data and Development Data sections below.
The ASR evaluation will be run in 3 languages: English, Spanish and Mandarin Chinese.
English is run on recordings from the European Parliament Plenary Sessions (EPPS).
Spanish is run on recordings from the European Parliament Plenary Sessions (EPPS) and from the CORTES Spanish parliament.
Chinese is run on recordings of Voice of America.
| Participant | EPPS ENGLISH | EPPS SPANISH | CORTES SPANISH | Broadcast news Mandarin | Punctuation |
| --- | --- | --- | --- | --- | --- |
| LIMSI | X | X | X | X | Yes |
| UKA | X | | | X | ? |
| IBM | X | X | X | | ? |
| IRST | X | X | X | | Yes |
| UPC | | X | X | | ? |
| RWTH | X | X | X | | Yes |
This table has to be confirmed by TC-STAR partners.
External Participants
| Participant | EPPS ENGLISH | EPPS SPANISH | CORTES SPANISH | Broadcast news Mandarin | Punctuation |
| --- | --- | --- | --- | --- | --- |
| ATR | X | | | X | ? |
| LIUM | X | X | X | | |
| DAEDALUS | X | X | X | | |
| UPV/EHU | | X | X | | |
ATR: Advanced Telecommunications Research Institute International, Japan
LIUM: Laboratoire d'Informatique de l'Université du Maine, France
DAEDALUS: Data, Decisions and Language, S.A., Spain
UPV/EHU: Universidad del País Vasco, Spain
Submissions
| Participant | English | Spanish | Mandarin |
| --- | --- | --- | --- |
| ATR | | | |
| DAEDALUS | | 2P | |
| IBM | 1O+1P+1R | 1O+1R | |
| ITC-irst | 4P+1R | 2P+2R | |
| LIMSI | 1P* | 1R* | 1O |
| LIUM | 1P+1R+1P* | 1R | |
| RWTH | 2P+2R | 2R | |
| UKA | 6P | | 1O |
| UPC | | 1R | |
| UPV | | | |
| TC-STAR | 1P* | 2P* | |

O = Open; P = Public; R = Restricted
* = late submission
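The submission codes in the table can be read mechanically. The following is a minimal Python sketch, not part of the evaluation tooling: the function name `parse_submissions` is hypothetical, and it assumes that each `+`-separated term is a count followed by a condition letter, and that an asterisk marks only the term it follows as late.

```python
import re
from collections import Counter

# Legend from the table: O = Open, P = Public, R = Restricted, * = late.
CONDITIONS = {"O": "Open", "P": "Public", "R": "Restricted"}
# One term such as "4P" or "1P*": a count, a condition letter, optional "*".
TERM = re.compile(r"(\d+)\s*([OPR])\s*(\*?)")


def parse_submissions(cell):
    """Split a table cell into (on_time, late) counts per condition.

    Hypothetical helper: assumes "*" applies only to the term it follows.
    """
    on_time, late = Counter(), Counter()
    cell = cell.strip()
    if not cell:
        # An empty cell means no submission for that language.
        return on_time, late
    for term in cell.split("+"):
        match = TERM.fullmatch(term.strip())
        if match is None:
            raise ValueError(f"unrecognised submission term: {term!r}")
        count, letter, star = match.groups()
        target = late if star else on_time
        target[CONDITIONS[letter]] += int(count)
    return on_time, late


if __name__ == "__main__":
    on_time, late = parse_submissions("1P+1R+1P*")
    print(dict(on_time), dict(late))
```

Under these assumptions, the LIUM English entry `1P+1R+1P*` reads as one public and one restricted submission on time, plus one late public submission.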
Training data
The TC-STAR 2007 audio training corpus is made up of transcribed and untranscribed data. The total amount of data is about 300 hours for English and 330 hours for Spanish. For Mandarin, no specific training data was produced within TC-STAR.
For the RESTRICTED CONDITION, the audio material that can be used for training is listed here:
For Spanish: EPPS07ES_TRAIN
For English: EPPS07EN_TRAIN
The HANSARD text corpus consists of debates of the U.K. Parliament from Nov 1999 to May 2006 and can be used for language models (see README).
| Corpus | Transcribed: Politicians | Transcribed: Interpreters | Total transcribed | Untranscribed | Total |
| --- | --- | --- | --- | --- | --- |
| EPPS English | 21 h | 70 h | 101 h | 200 h | 301 h |
| EPPS Spanish | 10 h | 51 h | 100 h | 230 h | 330 h |
| PARL Spanish | 38 h | 0 | | | |
The following table gives a list of other resources helpful for training purposes.
| Language | Reference | Amount | IPR-owner | IPR-distrib | IPR-granted use | IPR-royalty | Actors / comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Training | | | | | | | |
| Zh | Mandarin 1997 BN (Hub4-NE) LDC98S73 (audio) & LDC98T24 (transcr) | ~30h | ? | LDC | research | LDC membership 98 required | |
| | Mandarin 2001 Call (Hub5) LDC98S69, LDC98T26 (transcr) | ~40h | ? | LDC | research | LDC membership 98 required | |
| | Mandarin TDT2 LDC2001S93 & LDC2001T57 (transcr) | | ? | LDC | research | LDC membership 01 required | |
| | Mandarin TDT3 LDC2001S95 & LDC2001T58 | | ? | LDC | research | LDC membership 01 required | BLACKOUT ON DEC 98!!! |
| | Mandarin Chinese News Text LDC95T13 | 250M words | ? | LDC | research | LDC membership 95 required | |
| | Mandarin CALLHOME LDC96S34, LDC96T16 (transcr) | | ? | LDC | research | LDC membership 96 required | |
| | Chinese Gigaword LDC2003T09 | 1.1G words | ? | LDC | research | LDC membership 03 required | |
| | Hong Kong News Parallel Text LDC2000T46 (Zh/En) | 18147 articles | ? | LDC | research | LDC membership 00 required | |
| ES | EPPS_SP (text): Apr 1996 - May 2006 | >36M words | RWTH | ELRA | research | nominal fee (RWTH) | Provided to TCSTAR by RWTH |
| | EPPS Verbatim transcriptions May 2004 - January 2005 | 102h | | | | | Transcribed by UPC |
| | EPPS untranscribed data February 2005 - May 2006 | 160h | | | | | |
| | TC-STAR_P Spanish BN | 10h transcribed | ? | UPC | research | free in TCSTAR | Provided to TCSTAR by UPC |
| | Spanish LDC 1997, BN speech (Hub4-NE), LDC98S74 | | ? | LDC | research | LDC 98 membership required | |
| | Spanish LDC CallHome, LDC96S35 | | ? | LDC | research | LDC 96 membership required | |
| En | EPPS_EN (text): Apr 1996 - May 2006 | >36M words | RWTH | ELRA | research | nominal fee (RWTH) | Provided to TCSTAR by RWTH |
| | EPPS Verbatim transcriptions May 2004 - January 2005 | 100h | | | | | Transcribed by RWTH |
| | EPPS untranscribed data February 2005 - May 2006 | 215h | | | | | |
| | TC-STAR_P English BN | 10h transcribed | RFI | ELRA | research | free in TCSTAR | Distributed by ELDA |
| | English LDC 1995 (CSR-IV Hub 4 Marketplace LDC96S31), 1996, 1997, official NIST Hub4 training sets, LDC97S44 and LDC98S71, USC Marketplace Broadcast News Speech (LDC99S82) | | ? | LDC | research | LDC 96, 98 and 99 membership required | |
| | English LDC TDT2 and TDT3 data with closed-captions, about 2000h, LDC99S84 and LDC2001S94 | | ? | LDC | research | LDC 99 and 01 membership required | |
| | English LDC Switchboard 1, 2-I, 2-II, 2-III, LDC97S62, LDC98S75, LDC99S79 | | ? | LDC | research | LDC 98, 98 and 99 membership required | |
| | English LDC Callhome, LDC97S42, LDC2004S05, LDC2004S09 | | ? | LDC | research | LDC 97 and 04 membership required | |
| | English LDC Meeting corpora, ICSI LDC2004S02, ISL LDC2004S05, NIST LDC2004S09 | | ? | LDC | research | LDC 04 membership required | |
| | HANSARD TEXT CORPUS | 48 M words | | ELRA | research | | |
Development Data
To get the corresponding audio files on DVD, contact Djamel Mostefa.
The EPPS verbatim transcriptions are shared with the SLT evaluation.
Data set: EPPS English Verbatim Transcriptions
20050606_1700_1915_OR_SAT
20050607_0900_1215_OR_SAT
20050607_1500_1840_OR_SAT
20050608_0900_1300_OR_SAT
20050608_1505_1815_OR_SAT
20050609_1000_1205_OR_SAT
20050609_1500_1650_OR_SAT
Files: English development package version 3; Validation report from SPEX; Statistics

Data set: EPPS Spanish Verbatim Transcriptions
20050606_1700_1915_OR_SAT
20050607_0900_1215_OR_SAT
20050607_1500_1840_OR_SAT
20050608_0900_1300_OR_SAT
20050608_1505_1815_OR_SAT
20050609_1000_1205_OR_SAT
20050609_1500_1650_OR_SAT
20050704_1705_1915_OR_SAT
20050705_0900_1130_OR_SAT
20050705_1505_1920_OR_SAT
20050706_0900_1230_OR_SAT
20050706_1500_1755_OR_SAT
20050707_1000_1215_OR_SAT
20050707_1545_1750_OR_SAT
Data set: CORTES Spanish Parliament
PARL_041201_01_ES
PARL_041201_02_ES
PARL_041201_03_ES
PARL_041201_04_ES
PARL_041201_05_ES
PARL_041202_01_ES
Files (EPPS Spanish and CORTES): Spanish development package version 4; Validation report from SPEX; Statistics
Data statistics
Here are some statistics about the development and test sets for English and Spanish.
Statistics on the English dev set
| | TOTAL | MALE NATIVE | MALE NONNATIVE | FEMALE NATIVE | FEMALE NONNATIVE |
| --- | --- | --- | --- | --- | --- |
| # Speakers | 41 | 26 | 6 | 6 | 3 |
| Duration | 3h | 31.34% | 43.92% | 18.75% | 5.99% |
| Perplexity | 20.51 | | | | |
Statistics on the Spanish dev set
| | TOTAL | MALE SPEAKERS | FEMALE SPEAKERS |
| --- | --- | --- | --- |
| # Speakers | 61 | 44 | 17 |
| Duration | 5.8h | 79.69% | 20.31% |
| Perplexity | 34.12 | | |
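The duration rows give a total in hours and per-group shares in percent. As a rough illustration, the shares can be converted to approximate hours per group. This is a small Python sketch with a hypothetical helper name, `share_to_hours`; because the published totals (3h, 5.8h) are rounded, the results are only approximate.

```python
def share_to_hours(total_hours, shares_percent):
    """Convert percentage shares of a total duration into hours per group.

    Hypothetical helper for illustration; inputs are the rounded figures
    from the statistics tables, so the outputs are approximations.
    """
    return {group: total_hours * pct / 100.0
            for group, pct in shares_percent.items()}


# Spanish dev set: 5.8 h total, split by speaker gender.
spanish = share_to_hours(5.8, {"male": 79.69, "female": 20.31})

# English dev set: 3 h total, split by gender and nativeness.
english = share_to_hours(3.0, {
    "male native": 31.34,
    "male nonnative": 43.92,
    "female native": 18.75,
    "female nonnative": 5.99,
})

for name, table in (("Spanish", spanish), ("English", english)):
    for group, hours in table.items():
        print(f"{name} {group}: {hours:.2f} h")
```

With these figures the Spanish dev set comes to roughly 4.6 h of male speech and 1.2 h of female speech.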