Paper ID | E-1-2.6
Paper Title | Deep Semantic Encoder-Decoder Network for Acoustic Scene Classification with Multiple Devices
Authors | Xinxin Ma, Jiangsu Normal University, China; Yunfei Shao, Tsinghua University, China; Yong Ma, Jiangsu Normal University, China; Wei-Qiang Zhang, Tsinghua University, China
Session | E-1-2: Music Information Processing 1, Audio Scene Classification
Session Time | Tuesday, 08 December, 15:30 - 17:00
Presentation Time | Tuesday, 08 December, 16:45 - 17:00 (all times are in New Zealand Time, UTC+13)
Topic | Speech, Language, and Audio (SLA)
Abstract | In this paper, we propose Mini-SegNet, a simplified encoder-decoder SegNet model that captures deep semantic information in sound events. This semantic information can effectively discriminate acoustic segments from different scenes. We also apply spectrum correction to combat mismatched frequency responses across recording devices. To prevent overfitting, we adopt mixup, ImageDataGenerator, and temporal crop augmentation. Our best single system achieved an average accuracy of 65.15% across devices on the DCASE2020 Development dataset, more than 10% above the baseline system. The results indicate that our approach achieves strong classification performance without using any supplementary data from outside the official challenge dataset.
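The mixup augmentation mentioned in the abstract blends pairs of training examples and their labels with a coefficient drawn from a Beta distribution. A minimal sketch of the standard technique is below; the function name, the `alpha=0.2` default, and the NumPy-array interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Standard mixup: convex combination of two examples and their
    one-hot labels. alpha controls how far the mix strays from the
    endpoints (illustrative default; the paper's setting is not given here)."""
    # Mixing coefficient lambda ~ Beta(alpha, alpha), so lam is in [0, 1]
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y
```

In acoustic scene classification the inputs `x1`, `x2` would typically be log-mel spectrograms and `y1`, `y2` one-hot scene labels; the mixed label keeps the per-class weights summing to one.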