Paper ID | E-3-2.1
Paper Title |
INTEGRATION OF SEMI-BLIND SPEECH SOURCE SEPARATION AND VOICE ACTIVITY DETECTION FOR FLEXIBLE SPOKEN DIALOGUE |
Authors |
Masaya Wake, Graduate School of Informatics, Kyoto University, Japan; Masahito Togami, LINE Corporation, Japan; Kazuyoshi Yoshii, Tatsuya Kawahara, Graduate School of Informatics, Kyoto University, Japan |
Session |
E-3-2: Speech Separation 2, Sound source separation |
Time | Thursday, 10 December, 15:30 - 17:15 |
Presentation Time | Thursday, 10 December, 15:30 - 15:45
All times are in New Zealand Time (UTC +13) |
Topic |
Speech, Language, and Audio (SLA)
Abstract |
Conventionally, speech separation (SS) and voice activity detection (VAD) have been investigated separately, with different criteria. In natural dialogue systems such as conversational robots, however, it is critical to accurately separate and detect user utterances even while the system is speaking. This study addresses the integration of semi-blind SS and VAD using a single recurrent neural network, under the condition that the speech source and voice activity of the system are given. We investigate three integrated network architectures in which SS and VAD are performed either simultaneously or sequentially, prioritizing one task or the other. The proposed methods take as input a single-channel microphone observation spectrum, the system’s speech source spectrum, and the system’s voice activity, and output the user’s speech source spectrum and voice activity. Each network adopts long short-term memory (LSTM) to account for the temporal dependency of speech. An experimental evaluation on a dataset of recorded dialogues between a user and the android ERICA shows that the method performing the two tasks sequentially, with SS first, achieves the best performance on both SS and VAD.
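The sequential SS-first scheme described in the abstract can be illustrated with a toy sketch. This is not the paper's model: it is a minimal NumPy mock-up, assuming magnitude-spectrum masking, a hand-rolled single-layer LSTM with random (untrained) weights, and toy dimensions. It only shows the data flow: the SS stage consumes the microphone spectrum together with the known system spectrum and system voice activity and emits a soft mask, and the VAD stage then predicts frame-wise user activity from the separated user spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal NumPy LSTM cell (input/forget/cell/output gates stacked in one matrix)."""
    def __init__(self, in_dim, hid_dim):
        self.W = rng.standard_normal((4 * hid_dim, in_dim + hid_dim)) * 0.1
        self.b = np.zeros(4 * hid_dim)
        self.hid_dim = hid_dim

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

def run_lstm(cell, seq):
    """Run the cell over a (T, in_dim) sequence; return (T, hid_dim) hidden states."""
    h, c = np.zeros(cell.hid_dim), np.zeros(cell.hid_dim)
    out = []
    for x in seq:
        h, c = cell.step(x, h, c)
        out.append(h)
    return np.stack(out)

F, H, T = 8, 16, 5  # toy sizes: frequency bins, hidden units, time frames

# Stage 1 (semi-blind SS): input = [mic spectrum, system spectrum, system VAD]
ss_lstm = LSTMCell(2 * F + 1, H)
ss_head = rng.standard_normal((F, H)) * 0.1   # hypothetical per-bin mask head

# Stage 2 (VAD): input = separated user spectrum
vad_lstm = LSTMCell(F, H)
vad_head = rng.standard_normal(H) * 0.1       # hypothetical frame-activity head

mic = np.abs(rng.standard_normal((T, F)))       # observed magnitude spectra
sys_spec = np.abs(rng.standard_normal((T, F)))  # known system speech spectra
sys_vad = np.ones((T, 1))                       # known system voice activity

x = np.concatenate([mic, sys_spec, sys_vad], axis=1)
mask = sigmoid(run_lstm(ss_lstm, x) @ ss_head.T)   # soft mask in (0, 1)
user_spec = mask * mic                             # separated user spectrum

user_vad = sigmoid(run_lstm(vad_lstm, user_spec) @ vad_head)  # frame-wise activity

print(user_spec.shape, user_vad.shape)  # → (5, 8) (5,)
```

The key design point the abstract highlights is visible here: because the VAD stage sees the already-separated user spectrum rather than the raw mixture, its decision is not confounded by the system's own speech.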