TC-STAR WP4 - E2E Evaluation Run #2

Protocol

- 20 speeches of 3 minutes

- each speech in 2 versions: 1 TC-STAR speech, 1 interpreter speech

- 20 assessors, who evaluate Adequacy (comprehension test) and Fluency (subjective test)

The complete protocol can be found here.

Subjective test:

Test
Understanding	Do you think you understood the message?	1: No, not at all => 5: Yes, completely
Fluency	Is the system output fluent?	1: No, it is very bad! => 5: Yes, it is perfect Spanish!
Effort	Rate the effort required while listening	1: very high => 5: very low, it is natural speech
Overall Quality	Rate the overall quality of the translation system	1: Very bad, unusable => 5: It is very useful

Data

Test data

Component	Input
ASR	ROVER
SLT	RWTH
TTS	ITP
TTS	UPC

Results

Preliminary results are available (access is restricted to participants only):

Subjective evaluation

System	Audio	Understanding (1: very bad; 5: perfect)	Fluency (1: very bad; 5: perfect)	Effort (1: very bad; 5: perfect)	Overall Quality (1: very bad; 5: perfect)
ITP	Audio 1	5	5	4	4
	Audio 2	4	3	2	4
	Audio 3	5	5	5	4
		4	5	4	5
	Audio 4	4	5	4	5
	Audio 5	3	3	3	3
		3	5	3	4
	Audio 6	2	1	1	1
		1	1	1	1
	Audio 7	2	3	3	2
		3	3	2	4
	Audio 8	4	4	4	5
	Audio 9	2	2	2	2
	Audio 10	5	5	4	5
	Audio 11	3	4	2	3
	Audio 12	2	1	5	1
		3	3	4	4
	Audio 13	3	1	3	2
		2	4	2	3
	Audio 14	3	3	3	3
		3	2	1	2
	Audio 15	4	4	4	5
		5	5	5	5
	Audio 16	3	1	2	2
		4	4	3	4
	Audio 17	4	4	4	4
		5	5	5	5
	Audio 18	3	4	4	4
	Audio 19	4	4	3	4
	Audio 20	5	5	4	5
		4	4	3	4

	mean	3.45	3.48	3.19	3.52

TC-STAR	Audio 1	3	1	2	2
	Audio 2	3	5	3	4
		1	1	1	1
	Audio 3	1	2	1	1
	Audio 4	1	2	1	2
		2	1	1	1
	Audio 5	3	2	1	2
		3	2	3	3
	Audio 6	3	1	2	1
	Audio 7	4	4	3	4
	Audio 8	4	3	2	2
	Audio 9	1	2	1	1
		2	1	1	1
	Audio 10	2	3	2	2
	Audio 11	4	3	2	4
	Audio 12	2	1	1	2
	Audio 13	3	1	1	1
	Audio 14	2	2	1	1
		1	1	1	1
	Audio 15	2	1	1	2
	Audio 16	3	2	3	2
		2	2	1	2
	Audio 17	2	1	1	1
		1	1	1	1
	Audio 18	3	2	2	3
		2	2	1	2
	Audio 19	2	2	1	2
		3	3	3	3
	Audio 20	3	2	1	2

	mean	2.34	1.93	1.55	1.93
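
As a cross-check of the "mean" rows, the sketch below recomputes the Understanding mean for both systems. It assumes the mean is taken over all rows of the table (an audio file may have one or two rows) rather than per audio file; the names `ITP_UNDERSTANDING`, `TCSTAR_UNDERSTANDING` and `column_mean` are illustrative and not part of the evaluation tooling.

```python
from statistics import mean

# Understanding ratings transcribed from the table above, one value per row.
# Assumption: every row counts once, even when an audio file has two rows.
ITP_UNDERSTANDING = [5, 4, 5, 4, 4, 3, 3, 2, 1, 2, 3, 4, 2, 5, 3, 2,
                     3, 3, 2, 3, 3, 4, 5, 3, 4, 4, 5, 3, 4, 5, 4]
TCSTAR_UNDERSTANDING = [3, 3, 1, 1, 1, 2, 3, 3, 3, 4, 4, 1, 2, 2, 4,
                        2, 3, 2, 1, 2, 3, 2, 2, 1, 3, 2, 2, 3, 3]


def column_mean(ratings):
    """Average a column of 1-5 ratings, rounded to two decimals as in the table."""
    return round(mean(ratings), 2)


print(column_mean(ITP_UNDERSTANDING))     # 3.45
print(column_mean(TCSTAR_UNDERSTANDING))  # 2.34
```

Under this assumption both values agree with the Understanding means reported in the table.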

Comprehension evaluation

System	Audio (mean)	E2E Evaluation (0: bad; 1: good)	ITP / TTS (0: bad; 1: good)	SLT (0: bad; 1: good)	ASR (0: bad; 1: good)	Only ITP = 1.00 (0: bad; 1: good)
ITP	Audio 1	0.70	0.90	--	--	1.00
	Audio 2	0.20	0.40	--	--	1.00
	Audio 3	0.70	0.70	--	--	1.00
	Audio 4	0.60	0.80	--	--	1.00
	Audio 5	0.35	0.60	--	--	1.00
	Audio 6	0.30	0.50	--	--	1.00
	Audio 7	0.20	0.60	--	--	1.00
	Audio 8	0.40	0.70	--	--	1.00
	Audio 9	0.30	0.80	--	--	1.00
	Audio 10	0.70	0.90	--	--	1.00
	Audio 11	0.40	0.50	--	--	1.00
	Audio 12	0.30	0.90	--	--	1.00
	Audio 13	0.25	0.70	--	--	1.00
	Audio 14	0.35	0.60	--	--	1.00
	Audio 15	0.75	0.80	--	--	1.00
	Audio 16	0.65	0.80	--	--	1.00
	Audio 17	0.75	0.80	--	--	1.00
	Audio 18	0.80	0.80	--	--	1.00
	Audio 19	0.40	0.50	--	--	1.00
	Audio 20	0.75	1.00	--	--	1.00

	mean	0.50	0.72	--	--	1.00

TC-STAR	Audio 1	0.80	1.00	1.00	1.00	1.00
	Audio 2	0.90	1.00	1.00	1.00	1.00
	Audio 3	0.50	0.90	0.90	1.00	0.86
	Audio 4	0.55	0.90	0.90	0.90	0.88
	Audio 5	0.70	0.90	0.90	1.00	1.00
	Audio 6	0.70	0.90	0.90	0.90	1.00
	Audio 7	0.50	0.80	0.90	0.90	0.83
	Audio 8	0.80	0.90	0.90	1.00	0.88
	Audio 9	0.30	0.90	0.90	1.00	0.88
	Audio 10	0.50	0.50	0.60	0.60	0.56
	Audio 11	0.35	0.90	0.90	0.90	1.00
	Audio 12	0.50	0.90	0.90	0.90	1.00
	Audio 13	0.60	0.60	0.60	0.60	0.88
	Audio 14	0.55	0.60	0.60	0.70	0.67
	Audio 15	1.00	1.00	1.00	1.00	1.00
	Audio 16	0.60	0.70	1.00	1.00	1.00
	Audio 17	0.25	0.70	0.70	0.80	0.88
	Audio 18	0.65	0.80	0.90	0.90	1.00
	Audio 19	0.60	0.70	0.80	1.00	0.80
	Audio 20	0.40	0.90	0.90	1.00	0.90

	mean	0.58	0.83	0.86	0.91	0.90

The columns show the following information:

- 2 evaluated systems: ITP for the interpreter version and TC-STAR for the automatic speech-to-speech translation system

- the identifier of the audio file (corresponding data for interpreter and TC-STAR)

- E2E Evaluation: the evaluation was done by the same assessors who did the subjective evaluation.

- ITP / TTS: because TC-STAR had not been expected to score higher than ITP, the audio files were checked to see whether they actually contained the answers to the questions. Two first conclusions can be drawn: the assessors had difficulty finding the answers (questions too hard?), and because the interpreter selects and reformulates the information, dropping some details, some questions become too specific to be appropriate.

- TTS, SLT, ASR: to determine where information was lost in the TC-STAR system, the output files of each component (recognized files for ASR, translated files for SLT, synthesized files for TTS) were checked. The overall loss is about 15% of the information, roughly 5% being lost at each step.

- Only ITP: finally, only the questions whose answers were present in the interpreter files were kept. On this subset, the TC-STAR system lost 10% of the information relative to the ITP evaluation (instead of 15%).
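
The per-component loss described above can be read off the TC-STAR "mean" row. The sketch below is a minimal illustration, assuming the ASR, SLT and TTS columns are cumulative along the pipeline (ASR, then SLT, then TTS) and that a perfect score is 1.00; `STAGE_MEANS` and `stage_losses` are hypothetical names. With the table means, the losses come out at 0.09, 0.05 and 0.03, or 0.17 overall, which is close to, but not identical with, the rounded figures quoted in the text.

```python
# Mean comprehension scores for TC-STAR, copied from the "mean" row,
# listed in pipeline order (assumption: each column includes earlier losses).
STAGE_MEANS = [("ASR", 0.91), ("SLT", 0.86), ("TTS", 0.83)]


def stage_losses(stage_means, start=1.0):
    """Return the score drop introduced at each stage, rounded to two decimals."""
    losses = {}
    previous = start
    for name, score in stage_means:
        losses[name] = round(previous - score, 2)
        previous = score
    return losses


losses = stage_losses(STAGE_MEANS)
total_loss = round(1.0 - STAGE_MEANS[-1][1], 2)

print(losses)      # {'ASR': 0.09, 'SLT': 0.05, 'TTS': 0.03}
print(total_loss)  # 0.17
```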
