Paper Detail

Paper ID: E-3-2.4
Paper Title: Self-Attention for Multi-Channel Speech Separation in Noisy and Reverberant Environments
Authors: Conggui Liu, Yoshinao Sato, Fairy Devices, Japan
Session: E-3-2: Speech Separation 2, Sound source separation
Time: Thursday, 10 December, 15:30 - 17:15
Presentation Time: Thursday, 10 December, 16:15 - 16:30
All times are in New Zealand Time (UTC +13)
Topic: Speech, Language, and Audio (SLA)
Abstract: Despite recent advances in speech separation technology, much remains to be explored in this field, especially in the presence of noise and reverberation. One significant difficulty is that the locations where relevant contextual information must be incorporated vary along the time, frequency, and channel directions. To overcome this problem, we investigated the use of self-attention for multi-channel speech separation with time-frequency masking. Our base model is a temporal convolutional network identical to Conv-TasNet, except that it operates in the frequency domain using the short-time Fourier transform and its inverse. We combined this base model with a self-attention network and explored nine different types of self-attention networks for this purpose. To investigate the effects of the self-attention networks, we evaluated the performance of the proposed model, which we refer to as a confluent self-attention convolutional temporal audio separator network (CACTasNet), on a noisy and reverberant version of the wsj0-2mix dataset. We found that several of the self-attention networks substantially improved performance as measured by the scale-invariant signal-to-noise ratio and the signal-to-distortion ratio. The results indicate that a self-attention mechanism can efficiently locate context information relevant to speech separation.
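The abstract describes the architecture only at a high level; as a rough illustration of the general idea rather than the authors' CACTasNet, the following PyTorch sketch combines a small dilated convolutional stack with one self-attention layer over time frames to estimate per-speaker time-frequency masks from a magnitude STFT. The class name, layer sizes, and the single-channel, magnitude-only setup are hypothetical simplifications, not the paper's model.

    import torch
    import torch.nn as nn

    class AttentiveMaskNet(nn.Module):
        """Toy frequency-domain mask estimator: a dilated Conv1d stack plus
        one self-attention layer over time frames (illustrative only)."""

        def __init__(self, n_freq=257, hidden=256, n_heads=4, n_speakers=2):
            super().__init__()
            # Dilated 1-D convolutions over time, with frequency bins as channels
            self.encoder = nn.Sequential(
                nn.Conv1d(n_freq, hidden, kernel_size=3, padding=1, dilation=1),
                nn.PReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=2, dilation=2),
                nn.PReLU(),
            )
            # Self-attention across time frames; expects input of shape [B, T, hidden]
            self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
            self.mask_head = nn.Conv1d(hidden, n_freq * n_speakers, kernel_size=1)
            self.n_freq, self.n_speakers = n_freq, n_speakers

        def forward(self, mag):                      # mag: [B, F, T] magnitude STFT
            h = self.encoder(mag)                    # [B, H, T]
            q = h.transpose(1, 2)                    # [B, T, H]
            a, _ = self.attn(q, q, q)                # self-attention over frames
            h = h + a.transpose(1, 2)                # residual fusion of attention
            masks = torch.sigmoid(self.mask_head(h)) # [B, F * n_speakers, T]
            return masks.view(mag.size(0), self.n_speakers,
                              self.n_freq, mag.size(-1))

    # Example: estimate per-speaker masks from one channel's magnitude spectrogram
    mixture = torch.randn(1, 16000)                  # 1 s of audio at 16 kHz
    spec = torch.stft(mixture, n_fft=512, hop_length=128, return_complex=True)
    masks = AttentiveMaskNet()(spec.abs())           # [1, 2, 257, T]
    separated = masks * spec.abs().unsqueeze(1)      # masked magnitudes per speaker

In the paper's setting the masking is applied in the STFT domain and the result is mapped back to the time domain with the inverse transform; the sketch above stops at masked magnitudes and omits the multi-channel input, the Conv-TasNet-style TCN depth, and the nine self-attention variants that the abstract mentions.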