DEMO-1A: Show and Tell Demos I-A
Tue, 16 Apr, 13:10 - 15:10 (UTC +9)
Location: Hall D2: Podium Pitch Room A

DEMO-1A.1: Any-to-Many Streaming Voice Conversion with Interactive Speaker Interpolation

Hyeongju Kim, Hyeong-Seok Choi
We showcase a cutting-edge voice conversion demonstration operating seamlessly in real time. The main features of our demonstration are as follows:
- Any-to-many voice conversion: anyone's voice is transformed into the tone of a chosen target speaker.
- Speaker interpolation: a novel voice identity can be generated by interpolating the source speaker and the target speaker in real time; the interpolation ratio can be tuned within the range of 0 to 1 (see the sketch below).
- Output pitch control: the pitch level of the output voice is adjustable.
- Low latency: real-time conversion is performed with an algorithmic latency of 47 ms.
- High quality: high-quality conversion is achieved by employing various discriminators.
- Noise robustness: the model operates effectively even in noisy environments.
To implement these features, we relied on the following techniques:
- Large amount of data: over 10,000 hours of speech and singing data were gathered from open-source datasets and our internal database.
- Synthetic target data: a pre-trained zero-shot voice conversion model was employed to produce speaker-interpolated voices.
- Pitch augmentation of target data: the pitch statistics of the target data were perturbed to enhance the pitch controllability of the output voice.
- Diverse discriminators: multiple discriminators were used to improve the fidelity of the generated voice output.
- Degenerative augmentation of input data: diverse augmentation techniques were employed to degrade input waveforms.
Because the model is trained with corrupted inputs, the proposed system performs consistently in real-world environments, even with common input devices. We will offer a real-time interactive program that demonstrates these attractive functionalities to visitors.
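The speaker interpolation feature can be pictured as blending speaker embeddings before conditioning the conversion model. Below is a minimal, hypothetical sketch of that step; the embedding-based conditioning and the function shown here are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def interpolate_speaker(src_emb: np.ndarray, tgt_emb: np.ndarray, ratio: float) -> np.ndarray:
    """Blend source and target speaker embeddings.

    ratio = 0.0 keeps the source identity, ratio = 1.0 yields the target identity,
    and intermediate values produce a novel interpolated voice.
    """
    ratio = float(np.clip(ratio, 0.0, 1.0))
    mixed = (1.0 - ratio) * src_emb + ratio * tgt_emb
    # Re-normalize so the blended embedding stays on the unit hypersphere,
    # which many speaker encoders assume.
    return mixed / (np.linalg.norm(mixed) + 1e-8)
```

The blended embedding would then condition the streaming conversion model in place of a fixed target-speaker embedding.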

DEMO-1A.2: Low-Bitrate Redundancy Coding of Speech for Packet Loss Concealment in Teleconferencing

Marcin Ciołek, Michał Sulewski, Mihailo Kolundzija, Rafał Pilarczyk, Raul Casas, Samer Hijazi
The proposed system significantly improves quality and intelligibility under packet loss in video-conferencing applications. We introduce a novel neural codec for low-bitrate speech coding at 6 kbit/s, with long 1 kbit/s redundancy, that also enhances speech by suppressing noise and reverberation. Transmitting large amounts of redundant information allows speech to be reconstructed on the receiver side during severe packet loss; see ICASSP paper ID 7175, “Ultra low bitrate loss resilient neural speech enhancing codec”. The novelty of the proposed demo is combining the neural codec with the Viterbi algorithm and entropy coding to compress the redundant information by 45%, down to ~0.55 kbit/s, with minimal loss in audio quality. The codec comprises three neural components: an encoder, a vector quantizer, and a decoder. The vector quantizer outputs a sequence of symbols, and high compression is achieved by applying entropy coding to this sequence after it has been modified by the Viterbi algorithm. The efficiency of the proposed scheme comes from incorporating transition probabilities between symbols. Objective and subjective metrics confirm only a minor difference in audio quality between the 1 kbit/s and 0.55 kbit/s schemes. The demo is related to the ICASSP Audio Deep Packet Loss Concealment Grand Challenge and fits the theme of Signal Processing: the Foundation of True Intelligence. Interaction with the ICASSP audience will be based on capturing audio live, simulating packet loss on it, and demonstrating the capability of the proposed codec and compression scheme to mend speech during long packet losses. The demo software will process noisy audio recorded on the spot; everyone can then listen to their enhanced voice transmitted through a lossy network channel and compare it to the input. Our demo will be supported with a poster and a videocast. We want to emphasize that the demo prepared by our R&D team is not a commercial product.
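The gain from exploiting transition probabilities can be illustrated with a back-of-the-envelope entropy estimate. The sketch below compares a zeroth-order (i.i.d.) entropy model with a first-order (Markov) model on a synthetic symbol stream; the alphabet size, frame rate, and symbol statistics are made-up placeholders, not the codec's actual parameters.

```python
import numpy as np

def bits_per_symbol_iid(symbols, alphabet_size):
    """Zeroth-order entropy: ignores inter-symbol dependencies."""
    counts = np.bincount(symbols, minlength=alphabet_size) + 1e-12
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def bits_per_symbol_markov(symbols, alphabet_size):
    """First-order conditional entropy H(X_t | X_{t-1}); entropy coding driven by
    transition probabilities can approach this rate."""
    trans = np.full((alphabet_size, alphabet_size), 1e-12)
    for a, b in zip(symbols[:-1], symbols[1:]):
        trans[a, b] += 1.0
    p_joint = trans / trans.sum()
    p_cond = p_joint / p_joint.sum(axis=1, keepdims=True)
    return float(-(p_joint * np.log2(p_cond)).sum())

# Synthetic "sticky" symbol stream: each symbol repeats with probability 0.7.
rng = np.random.default_rng(0)
symbols = [0]
for _ in range(4999):
    symbols.append(symbols[-1] if rng.random() < 0.7 else int(rng.integers(0, 16)))
symbols = np.asarray(symbols)

frame_rate = 50  # symbols per second (placeholder)
print(f"i.i.d. model:  {frame_rate * bits_per_symbol_iid(symbols, 16):.0f} bit/s")
print(f"Markov model: {frame_rate * bits_per_symbol_markov(symbols, 16):.0f} bit/s")
```

With correlated symbols the Markov estimate is substantially lower than the i.i.d. one, which is the same kind of redundancy the Viterbi-plus-entropy-coding scheme exploits to shrink the redundancy stream.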

DEMO-1A.3: Demonstration of Amphion - An Open-Source Audio, Music and Speech Generation Toolkit

Xueyao Zhang, Liumeng Xue, Yuancheng Wang, Yicheng Gu, Chaoren Wang, Jun Han, Haizhou Li, Zhizheng Wu
Amphion is an open-source toolkit designed for audio, music, and speech generation. It aims to facilitate reproducible research and aid education in these fields. Amphion's north-star objective is to provide a comprehensive platform for studying how various inputs are transformed into audio. It currently supports several generation tasks, including text-to-speech (TTS), text-to-audio (TTA), and singing voice conversion (SVC), and comes equipped with multiple vocoders and evaluation metrics to ensure high-quality generation. This demonstration provides a high-level overview of Amphion's design. It will also walk through, step by step, how to perform reproducible research and how to use Amphion for scientific research and product model training. The demonstration includes four interactive Gradio demos built on top of Amphion: singing voice conversion, text-to-speech, zero-shot text-to-speech synthesis, and text-to-audio generation. Preliminary demos are already available on Hugging Face Spaces to try out, and we are working on better pretrained models for a better experience at ICASSP 2024.
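As a rough idea of how such an interactive demo can be served, the sketch below wraps a synthesis call in a Gradio interface; the `synthesize` function is a hypothetical stand-in and is not part of Amphion's API.

```python
import gradio as gr
import numpy as np

SAMPLE_RATE = 24000  # placeholder sample rate

def synthesize(text: str):
    """Hypothetical stand-in for a TTS inference call; returns (sample_rate, waveform)."""
    waveform = np.zeros(SAMPLE_RATE, dtype=np.float32)  # replace with the actual model output
    return SAMPLE_RATE, waveform

demo = gr.Interface(
    fn=synthesize,
    inputs=gr.Textbox(label="Text to synthesize"),
    outputs=gr.Audio(label="Generated speech"),
    title="Text-to-speech demo (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```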

DEMO-1A.4: Fast Streaming Zero-shot Keyword Spotting System: Inference using a Web Browser

Jong Gak Seo, Yong-Hyeok Lee, Namhyun Cho, Sun Min Kim
Here, we propose a zero-shot keyword spotting demo that runs in a web browser and operates in both Korean and English. Anyone can enter a desired keyword on-site and immediately see whether it can be detected. A novel zero-shot user-defined keyword spotting system is presented in this study [1], aiming to address several challenges that arise when such systems are deployed on personalized platforms like digital humans. These challenges include a lack of training data for user-defined keywords, the need to minimize computation and latency, and the inability to immediately train and test the model. To tackle these issues, the system employs a zero-shot approach based on a speech embedding model, a streaming design that eliminates redundant computations, and real-time inference using WebAssembly in a web browser. The streaming speech embedding model [2] reduces redundant calculations in overlapping frames by about 90%, resulting in a notable 33.4% increase in CPU processing speed; this improvement yields an additional speedup of approximately 340 times over the non-streaming model. Transitioning from a few-shot to a zero-shot model aims to enhance the user experience by removing the need for user speech recordings and few-shot learning when keywords change, enabling an instant response. The model exploits the audio-phoneme relationship, merging utterance- and phoneme-level information, incorporates advanced architectures, and shows strong performance across diverse pronunciation scenarios. In experiments, it outperforms baseline models and competes effectively with full-shot keyword spotting, improving EER and AUC by 67% and 80%, respectively, across datasets with varied word types and pronunciations.
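At detection time, a zero-shot system of this kind can be thought of as comparing streaming speech embeddings against an embedding of the typed keyword. The sketch below shows only that generic matching step; the encoders, threshold, and shapes are assumptions, not the authors' model.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def detect_keyword(frame_embeddings, keyword_embedding, threshold=0.7):
    """Yield (window index, score) whenever a speech window matches the enrolled keyword.

    frame_embeddings:  iterable of per-window speech embeddings from the streaming encoder.
    keyword_embedding: embedding of the user-typed keyword (text/phoneme side).
    """
    for t, emb in enumerate(frame_embeddings):
        score = cosine(emb, keyword_embedding)
        if score >= threshold:
            yield t, score
```

Because the keyword side is derived from text alone, changing the keyword only requires recomputing one embedding; no recordings or retraining are needed.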

DEMO-1A.5: Real-Time Polyphonic Sound Event Localization and Detection on Raspberry Pi 4

Ee-Leng Tan, Jun-Wei Yeow, Santi Peksi, Jisheng Bai, and Woon-Seng Gan
Polyphonic sound event localization and detection (SELD) has many practical applications in acoustic sensing and monitoring. However, the realization of real-time SELD systems has been limited by the demanding computational requirements of most recent SELD systems. In this show-and-tell proposal, a real-time demo of polyphonic SELD running on a Raspberry Pi 4 Model B is detailed. One key aspect of our implementation is the selection of a computationally efficient and effective feature for polyphonic SELD. The selected feature is the Spatial Cue-Augmented Log-SpectrogrAm (SALSA)-Lite, a lightweight variation of the previously proposed SALSA feature for polyphonic SELD [1]. SALSA-Lite adopts normalized inter-channel phase differences as spatial features and achieves a 30-fold speedup in feature computation compared to SALSA. SALSA-Lite is combined with a SELD network architecture based on ResNet22, a two-layer BiGRU, and fully connected layers. To achieve real-time inference on an embedded platform, each ResNet block is replaced with a computationally more efficient ResNet bottleneck block, and the convolution layers in each ResNet block are replaced with depthwise and pointwise convolutions. This optimized model is then deployed on the Raspberry Pi using the Open Neural Network Exchange (ONNX) runtime engine. Quantization is performed on the ONNX model to reduce its memory requirements, which also speeds up inference. This series of optimizations, combined with SALSA-Lite, produces an over 60-fold speedup compared to the original SALSA feature when running on the Raspberry Pi at a clock speed of 1.5 GHz, achieving real-time inference with minimal degradation in SELD performance. [1] T. N. T. Nguyen et al., “SALSA-Lite: A Fast and Effective Feature for Polyphonic Sound Event Localization and Detection with Microphone Arrays,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, May 2022.
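As an illustration of the deployment step, the sketch below applies ONNX Runtime dynamic quantization to an exported model and runs it on the CPU; the file names and the input shape are placeholders, not the actual exported SELD network.

```python
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the exported float32 model to 8-bit weights (placeholder file names).
quantize_dynamic(
    model_input="seld_salsa_lite.onnx",
    model_output="seld_salsa_lite_int8.onnx",
    weight_type=QuantType.QUInt8,  # smaller model, faster CPU inference
)

# Run the quantized model with the ONNX Runtime CPU execution provider.
session = ort.InferenceSession("seld_salsa_lite_int8.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
features = np.random.randn(1, 7, 200, 191).astype(np.float32)  # placeholder SALSA-Lite-like input
outputs = session.run(None, {input_name: features})
```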

DEMO-1A.6: Real-Time Low-Latency Music Source Separation

Satvik Venkatesh, Arthur Benilov, Philip Coleman, Frederic Roskam
In recent years, deep learning approaches have significantly improved the quality of Music Source Separation (MSS). However, little attention has been given to how these neural networks can be adapted for real-time low-latency applications, which could be helpful for hearing aids, remixing audio streams, and live shows. In a paper accepted to ICASSP 2024, we explored deep learning for real-time low-latency demixing; the paper ID is 3927 and the title is ‘Real-Time Low-Latency Music Source Separation using Hybrid Spectrogram-TasNet’. In this demonstration, we present our real-time MSS models as a VST plugin hosted in a Digital Audio Workstation (DAW). We demonstrate the feasibility of real-time MSS and address various challenges involved in translating deep learning research into production. All demixing models are causal and have a latency of 23 ms (1024 samples) at a sampling rate of 44.1 kHz. They are exported to ONNX format and loaded through a C++ runtime. Participants can select demixing models from a drop-down menu and compare their separation quality in terms of audio quality, artifacts, and interference between sources. We present causal low-latency adaptations of existing architectures such as CrossNet-Open-Unmix (X-UMX) and TasNet, as well as our proposed model, Hybrid Spectrogram-TasNet (HS-TasNet). Furthermore, individuals can connect their MP3 players, share their own WAV files, or use locally available songs to stream into the real-time demixer.
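To make the latency figure concrete, here is a hedged block-processing sketch: a 1024-sample hop at 44.1 kHz gives roughly 23.2 ms of algorithmic latency. The model call is a placeholder, not the exported HS-TasNet.

```python
import numpy as np

BLOCK = 1024   # samples per block
SR = 44100     # sampling rate in Hz
print(f"Algorithmic latency: {BLOCK / SR * 1000:.1f} ms")  # ~23.2 ms

def demix_block(block: np.ndarray, state: dict):
    """Placeholder for the causal ONNX model call; returns (stems, updated state)."""
    stems = np.tile(block, (4, 1)) / 4.0  # e.g. vocals / drums / bass / other
    return stems, state

def stream(audio: np.ndarray):
    """Feed the demixer one block at a time, as a plugin's audio callback would."""
    state = {}
    for start in range(0, len(audio) - BLOCK + 1, BLOCK):
        stems, state = demix_block(audio[start:start + BLOCK], state)
        yield stems
```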

DEMO-1A.7: Target Speech Spotting and Extraction Based on ConceptBeam

Yasunori Ohishi, Marc Delcroix, Tsubasa Ochiai, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Akisato Kimura, Noboru Harada, Kunio Kashino
We demonstrate a target speech spotting and extraction system based on ConceptBeam. ConceptBeam is a technology that extracts, from a multi-talker mixture, the speech signal that matches a target concept, or topic of interest, specified by the user by means of spoken words, images, or their combination. It was first presented at ACM Multimedia 2022 [1], where it received the Best Paper Runner-Up Award [2]. Traditionally, sound source separation methods have mainly been based on the physical properties of a signal, such as the direction of arrival, the fundamental frequency, and the independence of the signals. In contrast, ConceptBeam uses semantic clues: it measures the semantic distance between each section of the input signal and potentially multimodal semantic target information (termed here the concept) given by the user to locate and extract the relevant signal. In this live demonstration, the input signal is chosen from pre-recorded mixed and concatenated speech signals containing multiple speakers and a variety of topics. We then take (or select) a photo, or record (or select) spoken audio, to specify the target concept. When the start button is pressed, the system searches for the relevant sections and extracts the target speech from the input, if any. The system runs on a laptop PC on the fly and takes a few seconds to show the result. ConceptBeam solves an interesting signal processing problem, but the proposed interactive demo has not yet been presented at ICASSP or other conferences. We believe that the Show and Tell session will be a good opportunity to introduce its behavior to the signal processing community in an interactive manner and to discuss technical details as well as potential innovative applications. [1] Y. Ohishi, M. Delcroix, T. Ochiai, S. Araki, D. Takeuchi, D. Niizumi, A. Kimura, N. Harada, and K. Kashino, “ConceptBeam: Concept Driven Target Speech Extraction,” in Proc. ACM Multimedia, 2022. [2] https://2022.acmmm.org/best-paper-award/
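Conceptually, the spotting step compares section-wise embeddings of the mixture against the user-supplied concept embedding. The sketch below shows only this generic similarity-based selection under assumed encoders and a made-up threshold; it is not the ConceptBeam implementation.

```python
import numpy as np

def spot_segments(segment_embs: np.ndarray, concept_emb: np.ndarray, threshold: float = 0.6):
    """Return indices of signal sections that are semantically close to the concept.

    segment_embs: (num_segments, dim) embeddings of sections of the input mixture.
    concept_emb:  (dim,) embedding derived from a photo, spoken words, or both.
    """
    seg = segment_embs / (np.linalg.norm(segment_embs, axis=1, keepdims=True) + 1e-8)
    cpt = concept_emb / (np.linalg.norm(concept_emb) + 1e-8)
    scores = seg @ cpt                     # cosine similarity per section
    return np.where(scores >= threshold)[0], scores
```

The real system integrates this kind of matching with the extraction itself; the sketch only conveys the spotting idea.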

DEMO-1A.8: Preprocessing solution for real-world robust speech recognition combining multichannel acoustic echo cancellation and beamforming

Byung Joon Cho, Chang-Min Kim, and Hyung-Min Park
While the performance of speech recognition (SR) models has improved dramatically with the advancement of artificial intelligence, it is still difficult to achieve high distant-SR performance in real-world applications such as smart homes and smart cars. This is because, in real-world noisy environments, speech captured far from the microphone is vulnerable to distortion from noise and reverberation. Robust SR (RSR) therefore remains an important challenge in the field of SR. In this demo, we present an RSR preprocessing solution for real-world complex noisy environments. Specifically, the system employs an integrated algorithm that seamlessly combines multichannel acoustic echo cancellation (MAEC), to remove monaural or multichannel echo signals, with beamforming, to selectively enhance the target speech. The MAEC technology eliminates the need for commonly used double-talk detection, and the beamforming technology eliminates the need for re-tuning whenever the number of microphones or their positions change, since the algorithm automatically estimates the optimized parameters from the microphone input signals. In addition, the processing introduces so little speech distortion that recognition performance improves even for signals acquired in quiet environments, so it can be used with a variety of SR models without tuning. Above all, since it is an online method with very fast convergence that can operate in real time even on a resource-constrained system such as a Raspberry Pi 4, it reliably delivers high recognition performance in various environments, as demonstrated through comparative experiments. In addition, USB-interface hardware has been developed to effectively handle multichannel reference inputs and multiple microphone signals. The RSR preprocessing solution has been developed in a user-friendly manner so that users can simply apply it when implementing a real-world SR system, without worrying about the preprocessing.
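The abstract does not disclose the underlying algorithm, so as a generic illustration of the beamforming stage the sketch below computes standard MVDR weights from a noise covariance estimate and applies them to the echo-cancelled multichannel signal; it is not the authors' integrated MAEC-beamforming method, and all shapes are assumptions.

```python
import numpy as np

def mvdr_weights(noise_cov: np.ndarray, steering: np.ndarray) -> np.ndarray:
    """MVDR beamformer: w = R_n^{-1} d / (d^H R_n^{-1} d), computed per frequency bin.

    noise_cov: (F, M, M) noise spatial covariance matrices.
    steering:  (F, M) steering vectors toward the target speaker.
    """
    F, M = steering.shape
    w = np.empty((F, M), dtype=complex)
    for f in range(F):
        rn_inv_d = np.linalg.solve(noise_cov[f], steering[f])
        w[f] = rn_inv_d / (steering[f].conj() @ rn_inv_d)
    return w

def apply_beamformer(w: np.ndarray, stft_mics: np.ndarray) -> np.ndarray:
    """stft_mics: (F, T, M) multichannel STFT after echo cancellation -> (F, T) output."""
    return np.einsum("fm,ftm->ft", w.conj(), stft_mics)
```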

DEMO-1A.9: Audio Fingerprinting: Implementing a Music Recognition System

Christos Nikou, Antonia Petrogianni, Theodoros Psallidas, Ellie Vakalaki, Theodoros Giannakopoulos
Song identification is one of the oldest and perhaps one of the most popular Music Information Retrieval tasks. It has received a lot of attention in both academic research and industry. One of the most well-known commercial applications of song identification is Shazam. Shazam's algorithm is based on a technique called audio fingerprinting: it extracts a compact representation by hashing the spectral peaks in the spectrogram of audio fragments. This way, Shazam achieves a representation that is robust to the noise distortions that may occur in realistic scenarios where music is captured with portable devices. A non-exhaustive search technique based on inverted lists allows fast and efficient retrieval. However, in 2017, Google launched a quite different music recognition system built on deep learning: instead of hand-crafting sophisticated audio representations, they trained a deep neural network to automatically extract robust and meaningful audio representations. In this show-and-tell demo, we delve into the details of how one can implement a music recognition system based on these ideas. We present all the key functionalities of such a system: feature extraction from the raw audio signals, the model architecture, the training pipeline with contrastive learning, and the indexing of the database using approximate nearest neighbor search and product quantization to reduce the memory footprint and allow fast and efficient retrieval. At the same time, we introduce a novel data augmentation pipeline that further improves the performance of the overall system compared to existing systems employing this approach. This makes our system well suited for real-time applications. We showcase the effectiveness of our system in identifying query audio fragments against a database of more than 20,000 songs in a live demonstration.
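As an illustration of the indexing side, the sketch below builds an inverted-file product-quantization index with FAISS over hypothetical fingerprint embeddings; the dimensions, list counts, and database size are illustrative, not the demo's actual configuration.

```python
import faiss
import numpy as np

d = 128                        # embedding dimension (illustrative)
nlist, m, nbits = 256, 16, 8   # IVF cells, PQ sub-vectors, bits per sub-code

# Hypothetical fingerprint embeddings for database song segments.
rng = np.random.default_rng(0)
db = rng.standard_normal((100_000, d)).astype("float32")
faiss.normalize_L2(db)         # unit-norm vectors: L2 ranking then matches cosine similarity

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(db)                # learn coarse centroids and PQ codebooks
index.add(db)
index.nprobe = 8               # number of IVF cells visited per query (speed/recall trade-off)

query = rng.standard_normal((1, d)).astype("float32")
faiss.normalize_L2(query)
distances, ids = index.search(query, 5)   # top-5 candidate segments
```

Product quantization stores each embedding as a handful of compact sub-codes, which keeps the memory footprint small for a database of this scale, while the inverted-file structure keeps retrieval fast.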

DEMO-1A.10: Audio Processing for Real-Time Speech Communication in the Wild

Sebastian Braun, Hannes Gamper, David Johnston, Ivan Tashev
The user experience and meeting productivity of real-time speech communication largely depend on the quality of the transmitted and reproduced audio. We showcase two research prototypes for improving users' audio experience.
Demo 1: DNN-based speech enhancement and acoustic echo cancellation. The adoption of deep learning techniques for audio signal processing has led to tremendous performance improvements. We show a real-time prototype of a modular neural network system that separately cancels acoustic echo and removes background noise and reverberation. By designing our pipeline to resemble traditional modular processing blocks, crafting a strong, large-scale training data simulation, and optimizing the network architecture, the resulting neural network system runs with a low compute footprint (~1% standard CPU usage) in real time (10 ms algorithmic latency) while still delivering outstanding audio quality. In the demo, people will be able to listen to the live processed sound in the noisy demo environment and hear the effect of switching noise suppression and echo cancellation on and off.
Demo 2: Spatial audio for multi-party video calls. The meeting user experience can be enhanced by playing back the audio of remote meeting participants from their corresponding spatialized locations, aligning auditory and visual perception. This can reduce listening effort and alleviate so-called “meeting fatigue”. The demo showcases delivering spatial audio within the constraints of a video conferencing system, that is, with minimal overhead in terms of compute and battery power, supporting the large variety of audio scenarios in real meetings in terms of speech quality and number of voices, and optimized for both headphone and open-speaker playback. In the demo, listeners can turn spatial audio rendering on and off to experience the effect of spatially separating voices to align audio and video playback in a video call.