Paper ID: E-3-2.4
Paper Title: Self-Attention for Multi-Channel Speech Separation in Noisy and Reverberant Environments
Authors: Conggui Liu, Yoshinao Sato (Fairy Devices, Japan)
Session: E-3-2: Speech Separation 2, Sound source separation
Session Time: Thursday, 10 December, 15:30 - 17:15
Presentation Time: Thursday, 10 December, 16:15 - 16:30
All times are in New Zealand Time (UTC +13)
Topic: Speech, Language, and Audio (SLA)

Abstract
Despite recent advances in speech separation technology, much remains to be explored in this field, especially in the presence of noise and reverberation. One significant difficulty is that the locations where relevant context information should be incorporated vary along the time, frequency, and channel directions. To overcome this problem, we investigated the use of self-attention for multi-channel speech separation with time-frequency masking. Our base model is a temporal convolutional network identical to Conv-TasNet, except that it operates in the frequency domain using the short-time Fourier transform and its inverse. We combined this base model with a self-attention network, exploring nine different types of self-attention networks for this purpose. To investigate their effects, we evaluated the performance of the proposed model, which we refer to as a confluent self-attention convolutional temporal audio separator network (CACTasNet), on a noisy and reverberant version of the wsj0-2mix dataset. We found that several of the self-attention networks substantially improved performance as measured by scale-invariant signal-to-noise ratio and signal-to-distortion ratio. The results indicate that a self-attention mechanism can efficiently locate context information relevant to speech separation.
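The abstract does not specify the architecture of CACTasNet, but the core mechanism it builds on, self-attention over a time-frequency representation, can be sketched in a few lines. The following is a minimal, illustrative NumPy example (not the authors' implementation): scaled dot-product self-attention over STFT frames, where each frame vector is assumed to hold features pooled across channels and frequency bins. The projection matrices are randomly initialized stand-ins for learned weights.

```python
import numpy as np

def self_attention(x, seed=0):
    """Scaled dot-product self-attention over time frames.

    x: (T, D) array of per-frame features, e.g. magnitude spectra
    from several channels flattened into one vector per frame.
    Illustrative sketch only; CACTasNet's actual layers are not
    described in the abstract.
    """
    T, D = x.shape
    rng = np.random.default_rng(seed)
    # Hypothetical learned projections, randomly initialized here.
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(D)                 # (T, T) frame-to-frame affinities
    scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over key frames
    return attn @ V                               # context-weighted features

# Example: 100 STFT frames, 2 channels x 129 frequency bins flattened.
frames = np.random.default_rng(1).standard_normal((100, 258))
out = self_attention(frames)
print(out.shape)  # (100, 258)
```

Because every output frame is a weighted sum over all input frames, the mechanism can pull in context from arbitrary time positions, which is one plausible reading of how it helps locate the varying context information the abstract describes.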