Technical Program

Paper Detail

Paper ID	E-2-1.6
Paper Title	CROSS-LINGUAL VOICE CONVERSION USING A CYCLIC VARIATIONAL AUTO-ENCODER AND A WAVENET VOCODER
Authors	Hikaru Nakatani, Patrick Lumban Tobing, Kazuya Takeda, Tomoki Toda, Nagoya University, Japan
Session	E-2-1: Music Information Processing 2, Voice Conversion
Time	Wednesday, 09 December, 12:30 - 14:00
Presentation Time	Wednesday, 09 December, 13:45 - 14:00
	All times are in New Zealand Time (UTC +13)
Topic	Speech, Language, and Audio (SLA):
Abstract	We propose a novel cross-lingual voice conversion (VC) method using a cyclic variational auto-encoder (CycleVAE). Voice conversion transforms the voice of one speaker into that of another speaker; cross-lingual VC performs this conversion between speakers who speak different languages. VC methods based on parallel learning require accented speech uttered by the source or target speaker using the pronunciation system of that speaker's mother tongue. In contrast, VC methods that use a non-parallel learning approach can use natural speech data from both the source and target speakers, produced in their own native languages. These methods must, however, deal with time-alignment and language mismatches. To address these issues, we apply CycleVAE, a sophisticated non-parallel VC method, to cross-lingual VC. We also apply the WaveNet vocoder in the waveform generation process of CycleVAE-VC to improve overall conversion quality. Objective and subjective experiments on cross-lingual VC from a native English speaker to a native Japanese speaker confirm that the proposed method achieves higher naturalness and speaker similarity than a conventional RNN-based parallel VC method using accented speech.