Paper ID | F-1-3.4
Paper Title | FREQUENCY GATING: IMPROVED CONVOLUTIONAL NEURAL NETWORKS FOR SPEECH ENHANCEMENT IN THE TIME-FREQUENCY DOMAIN
Authors | Koen Oostermeijer, Qing Wang, Jun Du, University of Science and Technology of China, China
Session | F-1-3: Speech Enhancement 1
Time | Tuesday, 08 December, 17:15 - 19:15
Presentation Time | Tuesday, 08 December, 18:00 - 18:15
All times are in New Zealand Time (UTC +13) |
Topic | Speech, Language, and Audio (SLA)
Abstract |
One of the strengths of traditional convolutional neural networks (CNNs) is their inherent translational invariance. However, for the task of speech enhancement in the time-frequency domain, this property cannot be fully exploited, since speech spectra lack invariance in the frequency direction. In this paper we propose to remedy this inefficiency with a method, which we call Frequency Gating, that computes multiplicative weights for the kernels of the CNN in order to make them frequency dependent. Several mechanisms are explored: temporal gating, in which the weights depend on prior time frames; local gating, in which the weights are generated from a single time frame and those adjacent to it; and frequency-wise gating, in which each kernel is assigned a weight independent of the input data. Experiments with an autoencoder neural network with skip connections show that both local and frequency-wise gating outperform the baseline and are therefore viable ways to improve CNN-based speech enhancement networks. In addition, a loss function based on the extended short-time objective intelligibility (ESTOI) score is introduced, which we show to outperform the standard mean squared error (MSE) loss.
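To make the gating idea concrete, below is a minimal NumPy sketch of the simplest variant described in the abstract, frequency-wise gating, where each kernel's output is scaled by a learnable weight per frequency bin, independent of the input. The function name, tensor shapes, and the sigmoid squashing are our illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def frequency_wise_gate(feature_map, gate_logits):
    """Apply input-independent, frequency-dependent gates to a CNN feature map.

    feature_map : (channels, freq_bins, time_frames) output of a conv layer
    gate_logits : (channels, freq_bins) learnable parameters, one gate per
                  kernel and frequency bin

    Each (channel, frequency) slice is multiplied by sigmoid(gate_logit),
    so the same kernel can be emphasised or suppressed at different
    frequencies, compensating for the lack of frequency invariance.
    """
    gates = sigmoid(gate_logits)[:, :, np.newaxis]  # broadcast over time
    return feature_map * gates

# Hypothetical usage: 2 channels, 4 frequency bins, 3 time frames.
features = np.ones((2, 4, 3))
logits = np.zeros((2, 4))           # sigmoid(0) = 0.5 for every gate
gated = frequency_wise_gate(features, logits)
```

In the temporal and local variants sketched in the abstract, `gate_logits` would instead be produced by a small network from prior or adjacent time frames, making the gates input-dependent.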