Paper ID | E-2-1.6 |
Paper Title |
CROSS-LINGUAL VOICE CONVERSION USING A CYCLIC VARIATIONAL AUTO-ENCODER AND A WAVENET VOCODER |
Authors |
Hikaru Nakatani, Patrick Lumban Tobing, Kazuya Takeda, Tomoki Toda, Nagoya University, Japan |
Session |
E-2-1: Music Information Processing 2, Voice Conversion |
Time | Wednesday, 09 December, 12:30 - 14:00 |
Presentation Time: | Wednesday, 09 December, 13:45 - 14:00 |
|
All times are in New Zealand Time (UTC +13) |
Topic |
Speech, Language, and Audio (SLA): |
Abstract |
We propose a novel cross-lingual voice conversion (VC) method using a cyclic variational auto-encoder (CycleVAE). Voice conversion transforms the voice of one speaker into the voice of another speaker, and cross-lingual VC performs this conversion between speakers who speak different languages. VC methods based on parallel learning require accented speech uttered by the source or target speaker using the pronunciation system of that speaker's mother tongue. VC methods based on non-parallel learning, on the other hand, can use natural speech data from both the source and target speakers, produced in their own native languages. They must then deal, however, with time-alignment and language mismatches. To address these issues, we apply CycleVAE to cross-lingual VC as a sophisticated non-parallel VC method. We also apply the WaveNet vocoder in the waveform generation process of CycleVAE-VC to improve overall conversion quality. Objective and subjective experimental results for cross-lingual VC from a native English speaker to a native Japanese speaker confirm that the proposed method achieves a higher level of naturalness and speaker similarity than a conventional RNN-based parallel VC method using accented speech. |