Technical Program

Paper Detail

Paper ID	F-2-2.5
Paper Title	A PITCH-AWARE SPEAKER EXTRACTION SERIAL NETWORK
Authors	Yu Jiang, Meng Ge, Longbiao Wang, Tianjin University, China; Jianwu Dang, Japan Advanced Institute of Science and Technology & Tianjin University, Japan; Kiyoshi Honda, Tianjin University, China; Sulin Zhang, Bo Yu, Automotive Data of China Co., Ltd, China
Session	F-2-2: Speaker Recognition 2, Sound Classification
Time	Wednesday, 09 December, 15:30 - 17:00
Presentation Time:	Wednesday, 09 December, 16:30 - 16:45
	All times are in New Zealand Time (UTC +13)
Topic	Speech, Language, and Audio (SLA):
Abstract	Although deep learning performs well in monaural speaker extraction, extracting the target speaker remains challenging when the speakers in the mixture are of the same gender, i.e., male-male and female-female. On the other hand, pitch tracking has been shown to be effective for same-gender speech separation. In this study, we propose a pitch-aware speaker extraction serial network (PSESNet) to improve extraction performance. We design a serial system and compare it with multi-task learning. Rather than feeding the target speaker’s pitch information into the extraction network as input, we use it to optimize the loss function. The extraction part uses SpeakerBeam-FE (SBF) with magnitude and temporal spectrum approximation loss (MTSAL) and speaker embedding concatenation. After extracting the spectrogram of the target speaker, we use that spectrogram to predict the pitch information for further optimization. Experimental results show that the serial system performs better than multi-task learning and that the proposed method improves performance in both same-gender and opposite-gender conditions. On average, PSESNet achieves 4.7% and 3.8% relative improvements in signal-to-distortion ratio (SDR) over the SBF-MTSAL-Concat baseline on the WSJ0 dataset under the closed and open conditions, respectively.
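
As a reading aid for the SDR figures quoted in the abstract, the following is a minimal sketch of how a signal-to-distortion ratio and a relative improvement over a baseline might be computed. It assumes a simplified energy-ratio definition of SDR, not necessarily the evaluation toolkit used by the authors. The function names `sdr_db` and `relative_improvement_pct`, the toy signals, and the baseline value of 10.0 dB are all hypothetical and are not taken from the paper.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: a plain energy ratio, not the full BSS-Eval decomposition.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


def relative_improvement_pct(baseline, proposed):
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline


# Toy example (hypothetical data): the estimate is the reference scaled
# by 0.9, so the error carries 1% of the signal energy.
reference = [math.sin(0.1 * n) for n in range(200)]
estimate = [0.9 * s for s in reference]
toy_sdr = sdr_db(reference, estimate)

# Hypothetical SDR values in dB, chosen only to illustrate the arithmetic.
toy_gain = relative_improvement_pct(10.0, 10.47)
```

Under this definition, a relative improvement of 4.7% means the proposed system's SDR is 1.047 times the baseline SDR; the absolute gain in dB therefore depends on the baseline value, which the abstract does not state.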