Paper ID | F-3-2.5
Paper Title | ExcitGlow: Improving a WaveGlow-based Neural Vocoder with Linear Prediction Analysis
Authors | Suhyeon Oh, Hyungseob Lim, Kyungguen Byun, Yonsei University, Korea (South); Min-Jae Hwang, Search Solutions, Incorporated, Korea (South); Eunwoo Song, Naver Corporation, Korea (South); Hong-Goo Kang, Yonsei University, Korea (South)
Session | F-3-2: Speech Synthesis
Time | Thursday, 10 December, 15:30 - 17:15
Presentation Time | Thursday, 10 December, 16:30 - 16:45
All times are in New Zealand Time (UTC +13)
Topic | Speech, Language, and Audio (SLA)
Abstract |
In this paper, we propose ExcitGlow, a vocoder that incorporates the source-filter model of voice production theory into a flow-based deep generative model. By targeting the distribution of the excitation signal instead of the speech waveform itself, we significantly reduce the size of the flow-based generative model. To further reduce the number of parameters, we apply a parameter sharing technique in which a single affine coupling layer is used for several flow layers. To avoid quality degradation, we also introduce a closed-loop training framework to optimize the flow model for both the speech and excitation signal generation processes. Specifically, we choose negative log-likelihood (NLL) loss for the excitation signal and multi-resolution spectral distance for the speech signal. As a result, we are able to reduce the model size from 87.73M to 15.60M parameters while maintaining the perceptual quality of synthesized speech. |
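The abstract does not give the exact form of the multi-resolution spectral distance, so the following is a minimal PyTorch sketch of one common variant (spectral convergence plus log-STFT-magnitude L1, as used in Parallel WaveGAN-style vocoders). The function names, FFT/hop/window sizes, and the equal weighting of the two terms are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def stft_magnitude(x, n_fft, hop, win):
    """Magnitude spectrogram of a batch of waveforms with shape (B, T)."""
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_resolution_spectral_distance(y_hat, y,
                                       resolutions=((512, 128, 512),
                                                    (1024, 256, 1024),
                                                    (2048, 512, 2048))):
    """Average of spectral-convergence and log-magnitude L1 terms over several
    STFT resolutions; the resolution set above is an illustrative assumption."""
    loss = 0.0
    for n_fft, hop, win in resolutions:
        S_hat = stft_magnitude(y_hat, n_fft, hop, win)
        S = stft_magnitude(y, n_fft, hop, win)
        sc = torch.norm(S - S_hat, p="fro") / torch.norm(S, p="fro")  # spectral convergence
        mag = F.l1_loss(torch.log(S_hat), torch.log(S))               # log-magnitude distance
        loss = loss + sc + mag
    return loss / len(resolutions)
```

In the closed-loop training described above, a term of this kind would be combined with the flow model's negative log-likelihood on the excitation signal, e.g. `total_loss = nll + lambda_spec * multi_resolution_spectral_distance(speech_hat, speech)`, where `lambda_spec` is a hypothetical weighting factor.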