Paper ID | C-3-3.1
Paper Title | Speaker Verification System Based on Deformable CNN and Time-Frequency Attention
Authors | Yiming Zhang, Beijing University of Posts and Telecommunications, China; Hong Yu, Ludong University, China; Zhanyu Ma, Beijing University of Posts and Telecommunications, China
Session | C-3-3: Machine Learning for Small-sample Data Analysis
Time | Thursday, 10 December, 17:30 - 19:30
Presentation Time | Thursday, 10 December, 17:30 - 17:45
All times are in New Zealand Time (UTC +13) |
Topic | Machine Learning and Data Analytics (MLDA): Special Session: Machine Learning for Small-sample Data Analysis
Abstract | Speaker verification (SV), especially short-utterance SV, needs to be robust under complex noisy and far-field conditions. The majority of recent works apply an attention mechanism to the aggregation of frame-level speaker embeddings extracted by a deep neural network. In this paper, a novel speaker verification system based on a deformable convolution module and a time-frequency attention module is proposed. In the deformable convolution module, the convolutional sampling locations are adaptively adjusted by additional offsets learnt from the spectrogram. Meanwhile, to extract more effective speaker-discriminative information from short utterances, the time-frequency attention module helps the system focus on the important regions of an utterance along both the time and frequency dimensions. Experiments on the HI-MIA database show that the proposed modules improve the equal error rate (EER) of the speaker verification system by a relative 24% compared with the baseline model, achieving an EER of 8.51%.
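The deformable convolution idea described in the abstract, sampling locations shifted by per-position learned offsets with bilinear interpolation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the 3x3 single-channel setting, the function names, and the dense `offsets` array are illustrative assumptions; a real system would learn the offsets with a separate convolutional branch and run on GPU tensors.

```python
import numpy as np

def bilinear_sample(x, r, c):
    """Bilinearly sample map x (H, W) at fractional coords (r, c), zero-padded."""
    H, W = x.shape
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    val = 0.0
    for dr in (0, 1):
        for dc in (0, 1):
            rr, cc = r0 + dr, c0 + dc
            if 0 <= rr < H and 0 <= cc < W:
                # Weight each corner by its distance to the sampling point.
                val += (1 - abs(r - rr)) * (1 - abs(c - cc)) * x[rr, cc]
    return val

def deformable_conv2d(x, weight, offsets):
    """3x3 deformable convolution on a single-channel map (illustrative only).

    x: (H, W) input, e.g. a spectrogram patch.
    weight: (9,) kernel taps in row-major order.
    offsets: (H, W, 9, 2) per-location (dr, dc) shifts of the 9 taps;
             in the paper these would be predicted from the spectrogram.
    """
    H, W = x.shape
    out = np.zeros((H, W))
    taps = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
    for r in range(H):
        for c in range(W):
            acc = 0.0
            for k, (i, j) in enumerate(taps):
                dr, dc = offsets[r, c, k]
                # Sample at the regular grid point plus its learned offset.
                acc += weight[k] * bilinear_sample(x, r + i + dr, c + j + dc)
            out[r, c] = acc
    return out
```

With all offsets set to zero this reduces to an ordinary zero-padded 3x3 convolution, which is a convenient sanity check.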
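The time-frequency attention module can likewise be sketched as two softmax weightings of a spectrogram, one over frames and one over frequency bins. This is a simplified stand-in under stated assumptions: the projection vectors `w_t` and `w_f` are hypothetical placeholders for the learned layers in the paper, and the multiplicative reweighting is one plausible way to combine the two attention maps.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def time_freq_attention(spec, w_t, w_f):
    """Reweight a (F, T) spectrogram along time and frequency.

    spec: (F, T) log-spectrogram.
    w_t:  (F,) projection producing one score per time frame (assumption).
    w_f:  (T,) projection producing one score per frequency bin (assumption).
    """
    time_attn = softmax(w_t @ spec)   # (T,) attention over frames
    freq_attn = softmax(spec @ w_f)   # (F,) attention over bins
    # Emphasise time-frequency regions that score highly on both axes.
    return spec * freq_attn[:, None] * time_attn[None, :]
```

Because both attention vectors are softmax-normalised, each sums to one, so the module rescales rather than amplifies the overall energy; this is one design choice among several the paper could have made.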