
Paper Detail

Paper ID: F-1-3.4
Paper Title: FREQUENCY GATING: IMPROVED CONVOLUTIONAL NEURAL NETWORKS FOR SPEECH ENHANCEMENT IN THE TIME-FREQUENCY DOMAIN
Authors: Koen Oostermeijer, Qing Wang, Jun Du, University of Science and Technology of China, China
Session F-1-3: Speech Enhancement 1
Time: Tuesday, 08 December, 17:15 - 19:15
Presentation Time: Tuesday, 08 December, 18:00 - 18:15
All times are in New Zealand Time (UTC +13)
Topic: Speech, Language, and Audio (SLA)
Abstract: One of the strengths of traditional convolutional neural networks (CNNs) is their inherent translational invariance. However, for the task of speech enhancement in the time-frequency domain, this property cannot be fully exploited due to a lack of invariance in the frequency direction. In this paper, we propose to remedy this inefficiency by introducing a method, which we call Frequency Gating, that computes multiplicative weights for the kernels of the CNN in order to make them frequency dependent. Several mechanisms are explored: temporal gating, in which the weights depend on prior time frames; local gating, whose weights are generated from a single time frame and the ones adjacent to it; and frequency-wise gating, where each kernel is assigned a weight independent of the input data. Experiments with an autoencoder neural network with skip connections show that both local and frequency-wise gating outperform the baseline and are therefore viable ways to improve CNN-based speech enhancement neural networks. In addition, a loss function based on the extended short-time objective intelligibility (ESTOI) score is introduced, which we show to outperform the standard mean squared error (MSE) loss function.
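
The sketch below illustrates the simplest variant described in the abstract, frequency-wise gating, where each kernel receives a learned weight that varies along the frequency axis but does not depend on the input. The module name, gate placement after the convolution, sigmoid bounding, and all shapes are assumptions made for illustration; this is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class FrequencyWiseGatedConv2d(nn.Module):
    """Minimal sketch of input-independent, frequency-wise gating.

    Each output channel of a standard Conv2d is scaled by a learned,
    sigmoid-bounded gate that varies over frequency bins and is shared
    across all time frames (an assumption for illustration only).
    """

    def __init__(self, in_channels, out_channels, kernel_size, n_freq_bins):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, padding="same")
        # One learnable gate value per (output channel, frequency bin).
        self.gate_logits = nn.Parameter(torch.zeros(1, out_channels, 1, n_freq_bins))

    def forward(self, x):
        # x: (batch, in_channels, time, freq)
        y = self.conv(x)
        return y * torch.sigmoid(self.gate_logits)


if __name__ == "__main__":
    # Toy usage: 4 single-channel spectrogram patches,
    # 100 time frames x 257 frequency bins.
    layer = FrequencyWiseGatedConv2d(1, 16, kernel_size=3, n_freq_bins=257)
    out = layer(torch.randn(4, 1, 100, 257))
    print(out.shape)  # torch.Size([4, 16, 100, 257])
```

The temporal and local gating variants mentioned in the abstract would replace the static `gate_logits` parameter with gates computed from prior or neighboring time frames of the input, respectively.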