Paper ID | E-3-2.1
Paper Title |
INTEGRATION OF SEMI-BLIND SPEECH SOURCE SEPARATION AND VOICE ACTIVITY DETECTION FOR FLEXIBLE SPOKEN DIALOGUE |
Authors |
Masaya Wake, Graduate School of Informatics, Kyoto University, Japan; Masahito Togami, LINE Corporation, Japan; Kazuyoshi Yoshii, Tatsuya Kawahara, Graduate School of Informatics, Kyoto University, Japan |
Session |
E-3-2: Speech Separation 2, Sound source separation |
Time | Thursday, 10 December, 15:30 - 17:15 |
Presentation Time | Thursday, 10 December, 15:30 - 15:45
All times are in New Zealand Time (UTC +13) |
Topic |
Speech, Language, and Audio (SLA)
Abstract |
Conventionally, speech separation (SS) and voice activity detection (VAD) have been investigated separately, with different criteria. In natural dialogue systems such as conversational robots, however, it is critical to accurately separate and detect user utterances even while the system is speaking. This study addresses the integration of semi-blind SS and VAD using a single recurrent neural network, under the condition that the speech source and voice activity of the system are given. We investigate three integrated network architectures in which SS and VAD are performed either simultaneously or sequentially, prioritizing one task or the other. The proposed methods take as input a single-channel microphone observation spectrum, the system’s speech source spectrum, and the system’s voice activity, and output the user’s speech source spectrum and voice activity. Each network adopts long short-term memory (LSTM) to account for the temporal dependency of speech. An experimental evaluation on a dataset of recorded dialogues between a user and the android ERICA shows that the method performing the two tasks sequentially, with SS first, achieves the best performance on both SS and VAD.
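The sequential SS-first scheme described in the abstract can be illustrated with a toy sketch. This is not the paper's model: it is a minimal NumPy mock-up, assuming magnitude-spectrum masking, a hand-rolled single-layer LSTM with random (untrained) weights, and toy dimensions. It only shows the data flow: the SS stage consumes the microphone spectrum together with the known system spectrum and system voice activity and emits a soft mask, and the VAD stage then predicts frame-wise user activity from the separated user spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal NumPy LSTM cell (input/forget/cell/output gates stacked in one matrix)."""
    def __init__(self, in_dim, hid_dim):
        self.W = rng.standard_normal((4 * hid_dim, in_dim + hid_dim)) * 0.1
        self.b = np.zeros(4 * hid_dim)
        self.hid_dim = hid_dim

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

def run_lstm(cell, seq):
    """Run the cell over a (T, in_dim) sequence; return (T, hid_dim) hidden states."""
    h, c = np.zeros(cell.hid_dim), np.zeros(cell.hid_dim)
    out = []
    for x in seq:
        h, c = cell.step(x, h, c)
        out.append(h)
    return np.stack(out)

F, H, T = 8, 16, 5  # toy sizes: frequency bins, hidden units, time frames

# Stage 1 (semi-blind SS): input = [mic spectrum, system spectrum, system VAD]
ss_lstm = LSTMCell(2 * F + 1, H)
ss_head = rng.standard_normal((F, H)) * 0.1   # hypothetical per-bin mask head

# Stage 2 (VAD): input = separated user spectrum
vad_lstm = LSTMCell(F, H)
vad_head = rng.standard_normal(H) * 0.1       # hypothetical frame-activity head

mic = np.abs(rng.standard_normal((T, F)))       # observed magnitude spectra
sys_spec = np.abs(rng.standard_normal((T, F)))  # known system speech spectra
sys_vad = np.ones((T, 1))                       # known system voice activity

x = np.concatenate([mic, sys_spec, sys_vad], axis=1)
mask = sigmoid(run_lstm(ss_lstm, x) @ ss_head.T)   # soft mask in (0, 1)
user_spec = mask * mic                             # separated user spectrum

user_vad = sigmoid(run_lstm(vad_lstm, user_spec) @ vad_head)  # frame-wise activity

print(user_spec.shape, user_vad.shape)  # → (5, 8) (5,)
```

The key design point the abstract highlights is visible here: because the VAD stage sees the already-separated user spectrum rather than the raw mixture, its decision is not confounded by the system's own speech.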