Paper ID: F-3-1.5
Paper Title: An Integrated CNN-GRU Framework for Complex Ratio Mask Estimation in Speech Enhancement
Authors: Mojtaba Hasannezhad, Zhiheng Ouyang, Wei-Ping Zhu, Concordia University, Canada; Benoit Champagne, McGill University, Canada
Session: F-3-1: Speech Enhancement 3
Time: Thursday, 10 December, 12:30 - 14:00
Presentation Time: Thursday, 10 December, 13:30 - 13:45
All times are in New Zealand Time (UTC +13)
Topic: Speech, Language, and Audio (SLA)
Abstract: In this paper, we propose a novel neural network-based speech enhancement approach, in which a convolutional neural network (CNN) and a gated recurrent unit (GRU) are integrated to estimate a modified complex ratio mask (MCRM). The new CNN structure, composed of frequency-dilated convolution layers, is employed to extract speech features while benefiting from the global contextual information of the input speech. The CNN incorporates skip connections and residual learning to facilitate training and accelerate convergence. The GRU network is exploited to map the CNN-extracted features to the MCRM, which is used to enhance both the magnitude and the phase of the input speech. We compare the enhancement performance of the proposed method, using features extracted by the CNN, with that of the GRU network using conventional acoustic features, showing the advantage of the proposed CNN-GRU model. We also demonstrate that, within the proposed model, the GRU outperforms other recurrent neural network variants for mask estimation in terms of separated speech quality, memory footprint, and number of model parameters in the presence of highly non-stationary noise.
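To make the core idea concrete, the sketch below illustrates generic complex ratio masking with NumPy: a complex-valued mask applied element-wise to a noisy STFT modifies both the magnitude and the phase of the signal, unlike a real-valued magnitude mask, which leaves the noisy phase untouched. This is a minimal illustration of the masking principle only; the paper's modified CRM (MCRM) and the CNN-GRU network that estimates it are not reproduced here, and the function name is our own.

```python
import numpy as np

def apply_complex_mask(noisy_stft: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Element-wise complex multiplication: S_hat = M * Y.

    Because M is complex, it rescales |Y| and rotates the phase of Y,
    so both magnitude and phase of the noisy speech are enhanced.
    (Illustrative helper, not the authors' MCRM implementation.)
    """
    return mask * noisy_stft

# Toy example: a single time-frequency bin of the noisy spectrum.
Y = np.array([1.0 + 1.0j])   # magnitude sqrt(2), phase +45 degrees
M = np.array([0.5 - 0.5j])   # a complex mask, as a network might estimate

S_hat = apply_complex_mask(Y, M)   # (0.5 - 0.5j)(1 + 1j) = 1.0 + 0j

print(np.abs(S_hat)[0])    # magnitude scaled: |M| * |Y| = 1.0
print(np.angle(S_hat)[0])  # phase rotated: +45 deg + (-45 deg) = 0.0 rad
```

A real-valued mask could only scale `np.abs(Y)`; the nonzero imaginary part of `M` is what lets the model correct the phase as well, which is the motivation for estimating a complex ratio mask rather than a magnitude mask.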