Paper ID | E-2-3.3
Paper Title | ATTENTIVE FUSION ENHANCED AUDIO-VISUAL ENCODING FOR TRANSFORMER BASED ROBUST SPEECH RECOGNITION
Authors | Liangfa Wei, Jie Zhang, Junfeng Hou, Lirong Dai, University of Science and Technology of China, China
Session | E-2-3: Speech Recognition
Session Time | Wednesday, 09 December, 17:15 - 19:15
Presentation Time | Wednesday, 09 December, 17:45 - 18:00
All times are in New Zealand Time (UTC +13)
Topic | Speech, Language, and Audio (SLA)
Abstract |
Audio-visual information fusion improves speech recognition in complex acoustic scenarios, e.g., noisy environments. This calls for an effective audio-visual fusion strategy that accounts for audio-visual alignment and modality reliability. Unlike previous end-to-end approaches, where audio-visual fusion is performed after each modality has been encoded, in this paper we propose to integrate an attentive fusion block into the encoding process itself. The proposed audio-visual fusion in the encoder module is shown to enrich the audio-visual representations, as it leverages the relevance between the two modalities. In line with the transformer-based architecture, we implement the embedded fusion block as a multi-head attention based audio-visual fusion with one-way or two-way interactions. The proposed method combines the two streams more thoroughly and weakens the over-reliance on the audio modality. Experiments on the LRS3-TED dataset demonstrate that, compared to the state-of-the-art approach, the proposed method increases the recognition rate by 0.55%, 4.51% and 4.61% on average under the clean, seen-noise and unseen-noise conditions, respectively.
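The one-way cross-modal attention underlying the fusion block described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: audio frames act as queries attending to visual keys/values, and all shapes, the random stand-in projection weights, and the residual-style combination at the end are assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(query_feats, kv_feats, num_heads=4, rng=None):
    """One-way cross-modal multi-head attention.

    `query_feats` (e.g. audio frames, shape [T_q, d]) attends to `kv_feats`
    (e.g. visual frames, shape [T_kv, d]).  The projection weights here are
    random placeholders; in a real model they are learned.  Returns fused
    features of shape [T_q, d].
    """
    rng = rng or np.random.default_rng(0)
    T_q, d = query_feats.shape
    assert d % num_heads == 0, "model dim must split evenly across heads"
    d_h = d // num_heads
    out = np.zeros_like(query_feats)
    for h in range(num_heads):
        # Per-head projections (random stand-ins for learned weights).
        W_q, W_k, W_v = (rng.standard_normal((d, d_h)) / np.sqrt(d)
                         for _ in range(3))
        Q = query_feats @ W_q                      # [T_q,  d_h]
        K = kv_feats @ W_k                         # [T_kv, d_h]
        V = kv_feats @ W_v                         # [T_kv, d_h]
        attn = softmax(Q @ K.T / np.sqrt(d_h))     # [T_q, T_kv]
        out[:, h * d_h:(h + 1) * d_h] = attn @ V
    # Attentive fusion inside the encoder: add the cross-modal context back
    # onto the query stream (residual-style); the paper's exact combination
    # rule may differ.
    return query_feats + out

# Toy audio/visual feature sequences (audio at a higher frame rate).
audio = np.random.default_rng(1).standard_normal((20, 64))
video = np.random.default_rng(2).standard_normal((5, 64))
fused = multi_head_cross_attention(audio, video)
print(fused.shape)  # (20, 64)
```

A two-way interaction would apply the same block symmetrically, with the roles of the audio and visual streams swapped, so that each modality's encoding is conditioned on the other.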