- 20 speeches of 3 minutes
- 1 TC-STAR speech, 1 interpreter speech
- 20 assessors who evaluate Adequacy (comprehension test) and Fluency (subjective test)
The complete protocol can be found here.
Subjective test:
Test | Question | Scale |
---|---|---|
Understanding | Do you think you understood the message? | 1: No, not at all => 5: Yes, completely |
Fluency | Is the system output fluent? | 1: No, it is very bad! => 5: Yes, it is perfect Spanish! |
Effort | Rate the effort required while listening | 1: very high => 5: very low, it is natural speech |
Overall Quality | Rate the overall quality of the translation system | 1: Very bad, unusable => 5: It is very useful |
Test data
Component | Input |
---|---|
ASR | ROVER |
SLT | RWTH |
TTS | ITP |
UPC |
Preliminary results are available (access is restricted to participants only):
Subjective evaluation
System | Audio | Understanding (1: very bad; 5: perfect) | Fluency (1: very bad; 5: perfect) | Effort (1: very bad; 5: perfect) | Overall Quality (1: very bad; 5: perfect) |
---|---|---|---|---|---|
ITP | Audio 1 | 5 | 5 | 4 | 4 |
ITP | Audio 2 | 4 | 3 | 2 | 4 |
ITP | Audio 3 | 5 | 5 | 5 | 4 |
ITP | Audio 3 | 4 | 5 | 4 | 5 |
ITP | Audio 4 | 4 | 5 | 4 | 5 |
ITP | Audio 5 | 3 | 3 | 3 | 3 |
ITP | Audio 5 | 3 | 5 | 3 | 4 |
ITP | Audio 6 | 2 | 1 | 1 | 1 |
ITP | Audio 6 | 1 | 1 | 1 | 1 |
ITP | Audio 7 | 2 | 3 | 3 | 2 |
ITP | Audio 7 | 3 | 3 | 2 | 4 |
ITP | Audio 8 | 4 | 4 | 4 | 5 |
ITP | Audio 9 | 2 | 2 | 2 | 2 |
ITP | Audio 10 | 5 | 5 | 4 | 5 |
ITP | Audio 11 | 3 | 4 | 2 | 3 |
ITP | Audio 12 | 2 | 1 | 5 | 1 |
ITP | Audio 12 | 3 | 3 | 4 | 4 |
ITP | Audio 13 | 3 | 1 | 3 | 2 |
ITP | Audio 13 | 2 | 4 | 2 | 3 |
ITP | Audio 14 | 3 | 3 | 3 | 3 |
ITP | Audio 14 | 3 | 2 | 1 | 2 |
ITP | Audio 15 | 4 | 4 | 4 | 5 |
ITP | Audio 15 | 5 | 5 | 5 | 5 |
ITP | Audio 16 | 3 | 1 | 2 | 2 |
ITP | Audio 16 | 4 | 4 | 3 | 4 |
ITP | Audio 17 | 4 | 4 | 4 | 4 |
ITP | Audio 17 | 5 | 5 | 5 | 5 |
ITP | Audio 18 | 3 | 4 | 4 | 4 |
ITP | Audio 19 | 4 | 4 | 3 | 4 |
ITP | Audio 20 | 5 | 5 | 4 | 5 |
ITP | Audio 20 | 4 | 4 | 3 | 4 |
ITP | mean | 3.45 | 3.48 | 3.19 | 3.52 |
TC-STAR | Audio 1 | 3 | 1 | 2 | 2 |
TC-STAR | Audio 2 | 3 | 5 | 3 | 4 |
TC-STAR | Audio 2 | 1 | 1 | 1 | 1 |
TC-STAR | Audio 3 | 1 | 2 | 1 | 1 |
TC-STAR | Audio 4 | 1 | 2 | 1 | 2 |
TC-STAR | Audio 4 | 2 | 1 | 1 | 1 |
TC-STAR | Audio 5 | 3 | 2 | 1 | 2 |
TC-STAR | Audio 5 | 3 | 2 | 3 | 3 |
TC-STAR | Audio 6 | 3 | 1 | 2 | 1 |
TC-STAR | Audio 7 | 4 | 4 | 3 | 4 |
TC-STAR | Audio 8 | 4 | 3 | 2 | 2 |
TC-STAR | Audio 9 | 1 | 2 | 1 | 1 |
TC-STAR | Audio 9 | 2 | 1 | 1 | 1 |
TC-STAR | Audio 10 | 2 | 3 | 2 | 2 |
TC-STAR | Audio 11 | 4 | 3 | 2 | 4 |
TC-STAR | Audio 12 | 2 | 1 | 1 | 2 |
TC-STAR | Audio 13 | 3 | 1 | 1 | 1 |
TC-STAR | Audio 14 | 2 | 2 | 1 | 1 |
TC-STAR | Audio 14 | 1 | 1 | 1 | 1 |
TC-STAR | Audio 15 | 2 | 1 | 1 | 2 |
TC-STAR | Audio 16 | 3 | 2 | 3 | 2 |
TC-STAR | Audio 16 | 2 | 2 | 1 | 2 |
TC-STAR | Audio 17 | 2 | 1 | 1 | 1 |
TC-STAR | Audio 17 | 1 | 1 | 1 | 1 |
TC-STAR | Audio 18 | 3 | 2 | 2 | 3 |
TC-STAR | Audio 18 | 2 | 2 | 1 | 2 |
TC-STAR | Audio 19 | 2 | 2 | 1 | 2 |
TC-STAR | Audio 19 | 3 | 3 | 3 | 3 |
TC-STAR | Audio 20 | 3 | 2 | 1 | 2 |
TC-STAR | mean | 2.34 | 1.93 | 1.55 | 1.93 |
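As a sanity check on the table above, the reported means can be recomputed from the individual rating rows. The sketch below is only illustrative: the names `CRITERIA`, `ITP_RATINGS` and `column_means` are hypothetical, and the ratings are transcribed by hand from the ITP rows of the table, one tuple per rating row.

```python
from statistics import mean

# Column order used in the subjective evaluation table.
CRITERIA = ("Understanding", "Fluency", "Effort", "Overall Quality")

# One tuple per ITP rating row, in table order (some audio files have two rows).
ITP_RATINGS = [
    (5, 5, 4, 4),                # Audio 1
    (4, 3, 2, 4),                # Audio 2
    (5, 5, 5, 4), (4, 5, 4, 5),  # Audio 3
    (4, 5, 4, 5),                # Audio 4
    (3, 3, 3, 3), (3, 5, 3, 4),  # Audio 5
    (2, 1, 1, 1), (1, 1, 1, 1),  # Audio 6
    (2, 3, 3, 2), (3, 3, 2, 4),  # Audio 7
    (4, 4, 4, 5),                # Audio 8
    (2, 2, 2, 2),                # Audio 9
    (5, 5, 4, 5),                # Audio 10
    (3, 4, 2, 3),                # Audio 11
    (2, 1, 5, 1), (3, 3, 4, 4),  # Audio 12
    (3, 1, 3, 2), (2, 4, 2, 3),  # Audio 13
    (3, 3, 3, 3), (3, 2, 1, 2),  # Audio 14
    (4, 4, 4, 5), (5, 5, 5, 5),  # Audio 15
    (3, 1, 2, 2), (4, 4, 3, 4),  # Audio 16
    (4, 4, 4, 4), (5, 5, 5, 5),  # Audio 17
    (3, 4, 4, 4),                # Audio 18
    (4, 4, 3, 4),                # Audio 19
    (5, 5, 4, 5), (4, 4, 3, 4),  # Audio 20
]


def column_means(rows):
    """Return the mean of each criterion over all rating rows, rounded to 2 decimals."""
    columns = zip(*rows)  # transpose: one sequence of scores per criterion
    return {name: round(mean(col), 2) for name, col in zip(CRITERIA, columns)}


itp_means = column_means(ITP_RATINGS)
print(itp_means)
```

Averaging over all 31 rating rows (rather than first averaging per audio file) gives 3.45, 3.48, 3.19 and 3.52, which matches the ITP mean row.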
Comprehension evaluation
System | Audio (mean) | E2E Evaluation (0: bad; 1: good) | ITP / TTS (0: bad; 1: good) | SLT (0: bad; 1: good) | ASR (0: bad; 1: good) | Only ITP = 1.00 (0: bad; 1: good) |
---|---|---|---|---|---|---|
ITP | Audio 1 | 0.70 | 0.90 | -- | -- | 1.00 |
ITP | Audio 2 | 0.20 | 0.40 | -- | -- | 1.00 |
ITP | Audio 3 | 0.70 | 0.70 | -- | -- | 1.00 |
ITP | Audio 4 | 0.60 | 0.80 | -- | -- | 1.00 |
ITP | Audio 5 | 0.35 | 0.60 | -- | -- | 1.00 |
ITP | Audio 6 | 0.30 | 0.50 | -- | -- | 1.00 |
ITP | Audio 7 | 0.20 | 0.60 | -- | -- | 1.00 |
ITP | Audio 8 | 0.40 | 0.70 | -- | -- | 1.00 |
ITP | Audio 9 | 0.30 | 0.80 | -- | -- | 1.00 |
ITP | Audio 10 | 0.70 | 0.90 | -- | -- | 1.00 |
ITP | Audio 11 | 0.40 | 0.50 | -- | -- | 1.00 |
ITP | Audio 12 | 0.30 | 0.90 | -- | -- | 1.00 |
ITP | Audio 13 | 0.25 | 0.70 | -- | -- | 1.00 |
ITP | Audio 14 | 0.35 | 0.60 | -- | -- | 1.00 |
ITP | Audio 15 | 0.75 | 0.80 | -- | -- | 1.00 |
ITP | Audio 16 | 0.65 | 0.80 | -- | -- | 1.00 |
ITP | Audio 17 | 0.75 | 0.80 | -- | -- | 1.00 |
ITP | Audio 18 | 0.80 | 0.80 | -- | -- | 1.00 |
ITP | Audio 19 | 0.40 | 0.50 | -- | -- | 1.00 |
ITP | Audio 20 | 0.75 | 1.00 | -- | -- | 1.00 |
ITP | mean | 0.50 | 0.72 | -- | -- | 1.00 |
TC-STAR | Audio 1 | 0.80 | 1.00 | 1.00 | 1.00 | 1.00 |
TC-STAR | Audio 2 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 |
TC-STAR | Audio 3 | 0.50 | 0.90 | 0.90 | 1.00 | 0.86 |
TC-STAR | Audio 4 | 0.55 | 0.90 | 0.90 | 0.90 | 0.88 |
TC-STAR | Audio 5 | 0.70 | 0.90 | 0.90 | 1.00 | 1.00 |
TC-STAR | Audio 6 | 0.70 | 0.90 | 0.90 | 0.90 | 1.00 |
TC-STAR | Audio 7 | 0.50 | 0.80 | 0.90 | 0.90 | 0.83 |
TC-STAR | Audio 8 | 0.80 | 0.90 | 0.90 | 1.00 | 0.88 |
TC-STAR | Audio 9 | 0.30 | 0.90 | 0.90 | 1.00 | 0.88 |
TC-STAR | Audio 10 | 0.50 | 0.50 | 0.60 | 0.60 | 0.56 |
TC-STAR | Audio 11 | 0.35 | 0.90 | 0.90 | 0.90 | 1.00 |
TC-STAR | Audio 12 | 0.50 | 0.90 | 0.90 | 0.90 | 1.00 |
TC-STAR | Audio 13 | 0.60 | 0.60 | 0.60 | 0.60 | 0.88 |
TC-STAR | Audio 14 | 0.55 | 0.60 | 0.60 | 0.70 | 0.67 |
TC-STAR | Audio 15 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
TC-STAR | Audio 16 | 0.60 | 0.70 | 1.00 | 1.00 | 1.00 |
TC-STAR | Audio 17 | 0.25 | 0.70 | 0.70 | 0.80 | 0.88 |
TC-STAR | Audio 18 | 0.65 | 0.80 | 0.90 | 0.90 | 1.00 |
TC-STAR | Audio 19 | 0.60 | 0.70 | 0.80 | 1.00 | 0.80 |
TC-STAR | Audio 20 | 0.40 | 0.90 | 0.90 | 1.00 | 0.90 |
TC-STAR | mean | 0.58 | 0.83 | 0.86 | 0.91 | 0.90 |
The columns show the following information:
- 2 evaluated systems: ITP for the interpreter version and TC-STAR for the automatic speech-to-speech translation system
- the identifier of the audio file (corresponding data for interpreter and TC-STAR)
- E2E Evaluation: the evaluation was done by the same assessors who did the subjective evaluation.
- ITP / TTS: since it was not expected that TC-STAR would score better than ITP, the audio files were validated to check whether they contained the answers to the questions. Two first conclusions can be drawn: the assessors found it difficult to find the answers (were the questions too hard?), and because the interpreter selects and reformulates the information, omitting some details, some questions become too specific and are no longer appropriate.
- TTS, SLT, ASR: to determine where information was lost in the TC-STAR system, the output files of each component (recognized files for ASR, translated files for SLT, synthesized files for TTS) were checked. The overall loss is 15% of the information, with 5% lost at each step.
- Only ITP: finally, only the questions whose answers were present in the interpreter files were used. On this subset, the TC-STAR system lost 10% of the information relative to the ITP evaluation (instead of 15%).
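The per-component loss described in the last items can be derived from the TC-STAR mean row of the comprehension table. The following is a minimal sketch under two assumptions: the ITP / TTS column holds the TTS score for TC-STAR, and loss is measured as the drop in mean score from one pipeline stage to the next. The names `MEANS`, `ONLY_ITP_MEAN` and `stepwise_loss` are hypothetical.

```python
# TC-STAR mean comprehension scores, copied from the "mean" row of the table.
MEANS = {"ASR": 0.91, "SLT": 0.86, "TTS": 0.83}
ONLY_ITP_MEAN = 0.90  # mean of the "Only ITP = 1.00" column


def stepwise_loss(means, order=("ASR", "SLT", "TTS")):
    """Return the drop in mean score introduced at each pipeline stage."""
    losses = {}
    previous = 1.0  # a perfect score before any component has run
    for stage in order:
        losses[stage] = round(previous - means[stage], 2)
        previous = means[stage]
    return losses


losses = stepwise_loss(MEANS)
overall_loss = round(1.0 - MEANS["TTS"], 2)     # loss after the full pipeline
only_itp_loss = round(1.0 - ONLY_ITP_MEAN, 2)   # loss on the ITP-answerable subset

print(losses, overall_loss, only_itp_loss)
```

With the table means this gives drops of 0.09 (ASR), 0.05 (SLT) and 0.03 (TTS), 0.17 overall, and 0.10 on the Only ITP subset. The last figure matches the 10% quoted above; the others are close to, though not identical with, the rounded 15% and 5% figures.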