Paper Detail

Paper ID: F-3-2.6
Paper Title: PERSONALIZED END-TO-END MANDARIN SPEECH SYNTHESIS USING SMALL-SIZED CORPUS
Authors: Chenhan Yuan, Virginia Polytechnic Institute and State University, China; Yi-Chin Huang, National Pingtung University, Taiwan
Session F-3-2: Speech Synthesis
Time: Thursday, 10 December, 15:30 - 17:15
Presentation Time: Thursday, 10 December, 16:45 - 17:00
All times are in New Zealand Time (UTC +13)
Topic: Speech, Language, and Audio (SLA)
Abstract: Conventionally, voice conversion techniques are based on the source-filter model, which extracts acoustic features and transforms the spectral distribution from the source speaker to the target. Parallel corpora are usually required to learn the transformation, and the phone units have to be aligned manually to obtain the optimal conversion. These requirements are hard to meet in daily use. Therefore, we proposed an end-to-end method for building a personalized speech synthesis system, combining several ideas to tackle these problems and to make the data collection task attainable. We integrated the linguistic/acoustic feature extraction of the speech corpus by adopting suitable neural networks, so that the traditional linguistic feature extraction module, which relies on expert knowledge to build, could be replaced. For the personalized acoustic model, we adopted the variational auto-encoder, which separates speaker-related properties, such as timbre and speaker identity, from the underlying latent code, which is assumed to be related to phoneme identity. In this way, the requirements of manual alignment and a parallel corpus could be removed. Finally, experimental results showed that the proposed system is indeed useful for personalized speech synthesis and provides performance comparable to the conventional system while being easier to build.
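The disentanglement described in the abstract can be sketched as a speaker-conditioned variational auto-encoder: an encoder maps acoustic frames to a latent code assumed to carry phoneme-related content, while a separate speaker embedding carries timbre/identity, and the decoder reconstructs frames from both. The sketch below is a minimal illustrative assumption, not the authors' actual architecture; the class name `SpeakerVAE`, the layer sizes, and the mel-spectrogram-frame input are all hypothetical.

```python
import torch
import torch.nn as nn


class SpeakerVAE(nn.Module):
    """Toy VAE that separates a content latent from a speaker embedding (illustrative only)."""

    def __init__(self, n_mels=80, latent_dim=16, n_speakers=4, spk_dim=8):
        super().__init__()
        # Encoder: acoustic frames -> hidden -> (mu, logvar) of the content latent.
        self.encoder = nn.Sequential(nn.Linear(n_mels, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)
        # Speaker identity is modeled as a learned embedding, kept apart from the latent.
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)
        # Decoder: (content latent, speaker embedding) -> reconstructed frames.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + spk_dim, 64), nn.ReLU(), nn.Linear(64, n_mels)
        )

    def forward(self, frames, speaker_id):
        # frames: (T, n_mels); speaker_id: scalar LongTensor.
        h = self.encoder(frames)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # Broadcast the single speaker vector across all T frames.
        spk = self.spk_emb(speaker_id).expand(frames.size(0), -1)
        recon = self.decoder(torch.cat([z, spk], dim=-1))
        return recon, mu, logvar


model = SpeakerVAE()
frames = torch.randn(5, 80)  # 5 dummy mel frames
recon, mu, logvar = model(frames, torch.tensor(1))
```

Under this kind of setup, voice conversion needs no parallel corpus or manual alignment: one encodes frames from the source speaker but decodes with the target speaker's embedding, e.g. `model(frames, torch.tensor(2))`.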