Technical Program

Paper Detail

Paper ID E-3-1.5
Paper Title IMPACT OF MINIMUM HYPERSPHERICAL ENERGY REGULARIZATION ON TIME-FREQUENCY DOMAIN NETWORKS FOR SINGING VOICE SEPARATION
Authors Neil Shah, Dharmeshkumar Agrawal, TCS Research, Tata Consultancy Services Pvt. Ltd., Pune, India
Session E-3-1: Speech Separation 1
Time Thursday, 10 December, 12:30 - 14:00
Presentation Time: Thursday, 10 December, 13:30 - 13:45
All times are in New Zealand Time (UTC +13)
Topic Speech, Language, and Audio (SLA):
Abstract The task of singing voice separation requires the model to maintain a trade-off between signal quality, interference introduced by the music accompaniment, and algorithmic artifacts. A time-domain singing voice separation system is challenging to design for low latency and low computational cost. To overcome this problem, we propose to use Gammatone auditory features for the Time-Frequency (T-F) mask-based singing voice separation task. Minimum Hyperspherical Energy (MHE) regularization in a time-domain network recently produced the state-of-the-art result in singing voice separation (our baseline). In this work, we apply MHE to T-F domain networks. The MHE-regularized T-F domain network significantly improves separation performance over the baseline. The MHE-regularized Wasserstein Generative Adversarial Network (GAN) achieves a 0.21 dB improvement in mean Signal-to-Distortion Ratio (SDR) over the baseline. Our best-performing T-F domain un-regularized GAN provides improvements of 0.75 dB and 0.63 dB in SDR over the baseline and the GAN-MHE, respectively. We experimentally show that MHE-regularized T-F domain networks underperform their un-regularized versions, which shows the need for a suitable adversarial objective function. We report that modifying the GAN-MHE's objective function with a reconstruction loss and adapting the Wasserstein GAN results in a 0.45 dB improvement in mean SDR over its un-regularized version.
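As a rough illustration of the regularizer named in the abstract, the sketch below computes a hyperspherical energy for one layer's weight matrix: each neuron's weight vector is projected onto the unit hypersphere, and an inverse-power (Riesz) kernel is averaged over all pairs, so that vectors crowded together yield a high energy. The abstract does not say which MHE variant, exponent, or weighting the authors use, so the function name `hyperspherical_energy`, the default `s=2`, the `eps` constant, and the averaging over pairs are all assumptions made for this example.

```python
import numpy as np

def hyperspherical_energy(weights, s=2.0, eps=1e-8):
    """Mean pairwise energy of the rows of `weights` on the unit hypersphere.

    Illustrative only: `s`, `eps`, and the averaging are assumptions,
    not values taken from the paper.
    """
    w = np.asarray(weights, dtype=np.float64)
    # Project each neuron's weight vector (one row) onto the unit hypersphere.
    w_hat = w / (np.linalg.norm(w, axis=1, keepdims=True) + eps)
    # Pairwise Euclidean distances between the normalized vectors.
    diff = w_hat[:, None, :] - w_hat[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Count each unordered pair once (strict upper triangle).
    i, j = np.triu_indices(w.shape[0], k=1)
    d = dist[i, j] + eps
    if s == 0:
        energy = np.log(1.0 / d)   # logarithmic kernel
    else:
        energy = d ** (-s)         # inverse-power kernel
    return energy.mean()

# Two weight sets: one well spread, one nearly collapsed onto a single direction.
spread = np.array([[1.0, 0.0], [-1.0, 0.0]])
clustered = np.array([[1.0, 0.0], [1.0, 0.1]])
print(hyperspherical_energy(spread), hyperspherical_energy(clustered))
```

In training, a term of this kind would typically be added to the separation loss with a small weight, for example `total_loss = task_loss + lam * hyperspherical_energy(W)`, where `lam` is a hypothetical hyperparameter; minimizing it pushes the neurons' directions apart.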