Paper ID | F-1-1.3 |
Paper Title |
SPEAKER-INVARIANT PSYCHOLOGICAL STRESS DETECTION USING ATTENTION-BASED NETWORK |
Authors |
Hyeon-Kyeong Shin, Hyewon Han, Kyunggeun Byun, Hong-Goo Kang, Yonsei University, Korea (South) |
Session |
F-1-1: Emotion, Dialect, and Age Recognition |
Time | Tuesday, 08 December, 12:30 - 14:00 |
Presentation Time: | Tuesday, 08 December, 13:00 - 13:15 |
|
All times are in New Zealand Time (UTC +13) |
Topic |
Speech, Language, and Audio (SLA): |
Abstract |
When people get stressed in nervous or unfamiliar situations, their speaking styles or acoustic characteristics change. These changes are particularly emphasized in certain regions of speech, so a model that automatically computes temporal weights for components of the speech signals that reflect stress-related information can effectively capture the psychological state of the speaker. In this paper, we propose an algorithm for psychological stress detection from speech signals using a deep spectral-temporal encoder and multi-head attention with domain adversarial training. To detect long-term variations and spectral relations in the speech under different stress conditions, we build a network by concatenating a convolutional neural network (CNN) and a recurrent neural network (RNN). Then, multi-head attention is utilized to further emphasize stress-concentrated regions. For speaker-invariant stress detection, the network is trained with adversarial multi-task learning by adding a gradient reversal layer. We show the robustness of our proposed algorithm in stress classification tasks on the Multimodal Korean stress database acquired in [1] and the authorized stress database Speech Under Simulated and Actual Stress (SUSAS) [2]. In addition, we demonstrate the effectiveness of multi-head attention and domain adversarial training with visualized analysis using the t-SNE method. |
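The gradient reversal layer mentioned in the abstract can be illustrated with a minimal, hand-differentiated sketch. This is not the authors' implementation: the scalar encoder weight `w`, the head weights `a` and `b`, the squared-error losses, and the reversal strength `lam` are all assumptions chosen to keep the arithmetic visible. The layer acts as the identity in the forward pass and flips (and scales) the gradient coming back from the speaker branch, so the shared encoder is pushed away from features that help identify the speaker.

```python
# Toy gradient reversal layer (GRL), written with plain Python lists.
# All names and the loss choice are hypothetical, for illustration only.

def grl_forward(features):
    """Forward pass: pass the features through unchanged."""
    return list(features)


def grl_backward(grad_output, lam):
    """Backward pass: multiply the incoming gradient by -lam."""
    return [-lam * g for g in grad_output]


def encoder_gradient(x, w, a, b, y_stress, y_speaker, lam):
    """Gradient of the combined objective w.r.t. the encoder weight w.

    Shared feature: f = w * x
    Stress head:    (a * f - y_stress) ** 2
    Speaker head:   (b * GRL(f) - y_speaker) ** 2
    """
    f = w * x

    # Forward pass through both heads.
    stress_pred = a * f
    speaker_pred = b * grl_forward([f])[0]

    # Gradients of each squared-error loss w.r.t. the shared feature f.
    g_stress = 2.0 * (stress_pred - y_stress) * a
    g_speaker = 2.0 * (speaker_pred - y_speaker) * b

    # The speaker gradient is reversed before it reaches the encoder.
    g_speaker_reversed = grl_backward([g_speaker], lam)[0]

    # Chain rule through f = w * x.
    return (g_stress + g_speaker_reversed) * x
```

With `lam = 0` the encoder only sees the stress gradient; increasing `lam` makes the reversed speaker gradient count for more in the encoder update, while the speaker head itself would still be trained with its ordinary, unreversed gradient.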