Paper ID: F-3-1.5
Paper Title: An Integrated CNN-GRU Framework for Complex Ratio Mask Estimation in Speech Enhancement
Authors: Mojtaba Hasannezhad, Zhiheng Ouyang, Wei-Ping Zhu, Concordia University, Canada; Benoit Champagne, McGill University, Canada
Session: F-3-1: Speech Enhancement 3
Time: Thursday, 10 December, 12:30 - 14:00
Presentation Time: Thursday, 10 December, 13:30 - 13:45
All times are in New Zealand Time (UTC +13)
Topic: Speech, Language, and Audio (SLA)
Abstract: In this paper, we propose a novel neural network-based speech enhancement approach, in which a convolutional neural network (CNN) and a gated recurrent unit (GRU) are integrated to estimate a modified complex ratio mask (MCRM). The new CNN structure, composed of frequency-dilated convolution layers, is employed to extract speech features while benefiting from the global contextual information of the input speech. The CNN incorporates skip connections and residual learning to facilitate training and accelerate convergence. The GRU network is exploited to map the CNN-extracted features to the MCRM, which is used to enhance both the magnitude and the phase of the input speech. We compare the enhancement performance of the proposed method, using features extracted by the CNN, with that of the GRU network using conventional acoustic features, showing the advantage of the proposed CNN-GRU model. We also demonstrate that, within the proposed model, the GRU outperforms other recurrent neural network variants for mask estimation in terms of separated speech quality, memory footprint, and number of model parameters in the presence of highly non-stationary noise.
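To make the core idea concrete, the sketch below illustrates generic complex ratio masking with NumPy: a complex-valued mask applied element-wise to a noisy STFT modifies both the magnitude and the phase of the signal, unlike a real-valued magnitude mask, which leaves the noisy phase untouched. This is a minimal illustration of the masking principle only; the paper's modified CRM (MCRM) and the CNN-GRU network that estimates it are not reproduced here, and the function name is our own.

```python
import numpy as np

def apply_complex_mask(noisy_stft: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Element-wise complex multiplication: S_hat = M * Y.

    Because M is complex, it rescales |Y| and rotates the phase of Y,
    so both magnitude and phase of the noisy speech are enhanced.
    (Illustrative helper, not the authors' MCRM implementation.)
    """
    return mask * noisy_stft

# Toy example: a single time-frequency bin of the noisy spectrum.
Y = np.array([1.0 + 1.0j])   # magnitude sqrt(2), phase +45 degrees
M = np.array([0.5 - 0.5j])   # a complex mask, as a network might estimate

S_hat = apply_complex_mask(Y, M)   # (0.5 - 0.5j)(1 + 1j) = 1.0 + 0j

print(np.abs(S_hat)[0])    # magnitude scaled: |M| * |Y| = 1.0
print(np.angle(S_hat)[0])  # phase rotated: +45 deg + (-45 deg) = 0.0 rad
```

A real-valued mask could only scale `np.abs(Y)`; the nonzero imaginary part of `M` is what lets the model correct the phase as well, which is the motivation for estimating a complex ratio mask rather than a magnitude mask.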