
E–vectors: JFA and i–vectors revisited / Cumani, Sandro; Laface, Pietro. - PRINT. - (2017), pp. 5435-5439. (Paper presented at the IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2017, held in New Orleans, USA, March 5-9, 2017).

E–vectors: JFA and i–vectors revisited

Cumani, Sandro; Laface, Pietro
2017

Abstract

Systems based on i–vectors represent the current state–of–the–art in text–independent speaker recognition. In this work we introduce a new compact representation of a speech segment, similar to the speaker factors of Joint Factor Analysis (JFA) and to i–vectors, that we call “e–vector”. The e–vectors derive their name from the eigenvoice space of the JFA speaker modeling approach. Our working hypothesis is that JFA estimates a more informative speaker subspace than the “total variability” i–vector subspace, because the latter is obtained by considering each training segment as belonging to a different speaker. We thus propose a simple “i–vector style” modeling and training technique that exploits this observation and estimates a more accurate subspace than the one provided by the classical i–vector approach, as confirmed by the results of a set of tests performed on the extended core NIST 2012 Speaker Recognition Evaluation dataset. Simply replacing i–vectors with e–vectors, we obtain an approximately 10% average improvement in the Cprimary cost function, using different systems and classifiers. These performance gains come without any additional memory or computational costs with respect to the standard i–vector systems.
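The abstract notes that e–vectors are extracted exactly like i–vectors (hence the identical memory and computational cost); the two differ only in how the low-rank subspace is trained. As a minimal numpy sketch, assuming the standard i–vector point-estimate formula w = (I + TᵀΣ⁻¹NT)⁻¹ TᵀΣ⁻¹F with a diagonal UBM covariance, the extraction step both representations share might look like this. All dimensions, parameters, and statistics below are illustrative placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only): C Gaussians, F-dim features,
# R-dim speaker subspace.
C, F, R = 8, 4, 5
CF = C * F  # supervector dimension

# Hypothetical model parameters: a low-rank subspace T (the matrix whose
# training differs between i-vectors and e-vectors) and the diagonal UBM
# covariance Sigma, stored as a vector of variances.
T = rng.standard_normal((CF, R)) * 0.1
Sigma = np.ones(CF)

# Baum-Welch statistics for one segment (random stand-ins here):
N = rng.uniform(1.0, 10.0, C)      # zero-order (occupation) counts per Gaussian
F_stats = rng.standard_normal(CF)  # centered first-order statistics

# Posterior mean of the latent factor w:
#   w = (I + T' Sigma^-1 N T)^-1  T' Sigma^-1 F
N_expanded = np.repeat(N, F)  # expand per-Gaussian counts to supervector size
precision = np.eye(R) + T.T @ ((N_expanded / Sigma)[:, None] * T)
w = np.linalg.solve(precision, T.T @ (F_stats / Sigma))

print(w.shape)  # the compact R-dimensional segment representation
```

Because only T changes between the two models, swapping i–vectors for e–vectors leaves this extraction step, and its cost, untouched, which matches the abstract's claim of no additional overhead.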
ISBN: 9781509041169

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2670013