ST-2: Show and Tell Demo 2
Tue, 5 May, 16:30 - 18:30 (UTC +2)
Location: Exhibition Hall
ST-2.1: Flow Matching for Real-Time Joint Speech Enhancement and Bandwidth Extension
Diffusion-based speech enhancement is a popular and active research topic. In our demo, we present a real-time generative system for joint speech enhancement and bandwidth extension with flow matching, a method closely related to diffusion. The system runs on a consumer GPU laptop and receives a noisy, reverberant single-channel input, which can optionally be low-pass filtered at a configurable cutoff frequency before being fed to the system. With an efficiently cached frame-wise inference scheme and an optimized causal DNN, our system achieves a total latency of only 48 ms (32 ms algorithmic latency + 16 ms computational latency), bringing low-latency, high-quality speech restoration with generative flow matching models to consumer hardware for the first time. The underlying real-time flow matching backbone is described in our accepted 2026 ICASSP paper (ID: 16059).
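To make the inference scheme concrete, here is a minimal sketch of flow matching sampling by Euler integration of a learned velocity field; `velocity_net`, its conditioning interface, and the tensor shapes are hypothetical stand-ins, not the architecture from the paper.

```python
import torch

@torch.no_grad()
def flow_matching_sample(velocity_net, cond, shape, n_steps=1):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (clean estimate).

    With n_steps=1 this collapses to a single Euler step, which is why the
    demo can expose one vs. multiple sampling steps as a runtime choice.
    velocity_net is a hypothetical stand-in for the trained causal DNN.
    """
    x = torch.randn(shape)                     # start from Gaussian noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full(shape[:1], k * dt)      # current integration time
        x = x + dt * velocity_net(x, t, cond)  # explicit Euler update
    return x
```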
We combine a predictive network with a generative flow network in a joint predictive-generative scheme, outputting a clean, bandwidth-extended speech estimate with up to 24 kHz bandwidth (48 kHz sampling rate). The graphical user interface offers three interactive controls: (1) the flow network can be toggled on/off to switch between predictive and predictive-generative speech restoration; (2) using a graphical slider, attendees can set the cutoff frequency of the low-pass filter (4-16 kHz) to simulate a lower sampling frequency; (3) one or multiple generative sampling steps can be chosen to show how the generative model behaves in each scenario.
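As an illustration of control (2), a configurable low-pass filter such as the sketch below can simulate reduced input bandwidth; the Butterworth design and filter order are our assumptions, not necessarily the demo's implementation.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def simulate_bandlimited_input(x, fs=48000, cutoff_hz=8000, order=8):
    """Low-pass filter the 48 kHz input to mimic a lower effective bandwidth.

    cutoff_hz corresponds to the GUI slider (4-16 kHz in the demo); the
    Butterworth design and order used here are illustrative assumptions.
    """
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfilt(sos, x)
```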
The demo lets attendees switch between unprocessed speech and three possible variants of enhanced speech on the fly, allowing them to explore the advantages and downsides of predictive and generative speech restoration in a real-time setting. We use one omnidirectional microphone placed in an open conference area and run our models on a laptop with an NVIDIA RTX 5090 Laptop GPU. The laptop is connected to a soundcard and headphone amplifier. Up to five active noise-canceling headphones can be connected, so that multiple attendees can listen and interact simultaneously.
Our demonstration offers an interactive experience, illustrating how modern generative methods can be used for real-time single-channel speech enhancement and bandwidth extension in a real conference environment, and how they differ qualitatively from predictive methods.
ST-2.2: NPU-Accelerated Real-Time Voice Conversion for Customizable Digital Identities
Real-time voice conversion has seen widespread adoption by millions of users within gaming and digital identity ecosystems. Historically, however, these systems have been restricted to low-complexity models due to limited CPU headroom and the need to maintain stability alongside concurrent, high-demand applications.
To overcome these limits, we present a high-fidelity, low-latency voice conversion system optimized for Neural Processing Units (NPUs). This solution leverages Transformer-based models specifically architected for edge-device acceleration, moving beyond cloud or GPU reliance.
Our demo prototype showcases two recent research directions of our team: moving towards larger models that leverage NPUs on-device, and controlling speaker identity and voice characteristics with higher-level, intuitive controls (such as age, gender, depth, and breathiness) via sliders or text prompts.
Main Novelty and Innovations
While NPUs are becoming standard in modern chipsets, their application to real-time, stream-based signal processing is a new frontier. Our innovation lies in:
- Architecture Optimization: A transition from low-complexity recurrent models suitable for CPUs to highly parallelizable, high-complexity models suitable for NPUs. This approach maps workloads directly to NPU instruction sets to maximize throughput while offering superior synthesis quality.
- Accelerator-Agnostic Inference: Transitioning complex neural audio tasks from GPUs to dedicated NPU silicon, enabling professional-grade AI audio on consumer desktops, laptops, and mobile platforms (see the sketch after this list).
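The following minimal sketch shows the accelerator-agnostic idea with ONNX Runtime execution providers and CPU fallback; the model file, tensor name, input layout, and QNN backend path are assumptions for illustration, and the demo's actual deployment stack is not reproduced here.

```python
import numpy as np
import onnxruntime as ort

# Prefer an NPU execution provider when available, fall back to CPU.
# The QNN backend path is platform-specific (Qualcomm NPUs) and assumed here.
providers = [
    ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),
    "CPUExecutionProvider",
]
session = ort.InferenceSession("vc_model.onnx", providers=providers)

# Stream one hypothetical feature frame through the exported model
# (single-output model assumed; names and shapes are placeholders).
frame = np.zeros((1, 80, 1), dtype=np.float32)
(converted,) = session.run(None, {"features": frame})
```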
Impact on Signal Processing
This demonstrator also attempts to bridge the gap between describing voices via perceptual characteristics that can be extracted with DSP algorithms and common approaches in deep-learning-based generative voice conversion models (e.g., a speaker embedding that is learned or estimated from audio).
By using signal processing-based annotators to map speakers into a 5-D space, we provide a framework for parametrized exploration of voice timbre with perceptually meaningful controls over neural vocal transformation in timing-critical environments.
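As a rough illustration of such signal processing-based annotators, the sketch below derives simple acoustic proxies with librosa; the chosen features and any mapping onto the five perceptual axes are our assumptions, not the demo's actual annotators.

```python
import librosa
import numpy as np

def describe_voice(path):
    """Map a recording to DSP-derived proxies for perceptual voice axes.

    Median pitch, spectral centroid, and a harmonicity ratio are illustrative
    stand-ins for the demo's 5-D annotators (e.g., depth, breathiness).
    """
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    median_f0 = float(np.nanmedian(f0))        # proxy for depth / gender
    brightness = float(librosa.feature.spectral_centroid(y=y, sr=sr).mean())
    harmonic = librosa.effects.harmonic(y)
    noise = y - harmonic
    harmonicity = float(np.sum(harmonic**2) / (np.sum(noise**2) + 1e-9))
    return {"median_f0": median_f0, "brightness": brightness,
            "harmonicity": harmonicity}        # low harmonicity ~ breathier
```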
Interactivity for Attendees
- Generative Voice Design: Create bespoke identities using 5-D descriptors or text prompts (e.g., "a deep, gravelly, yet smooth voice").
- Real-Time Identity Swap: Experience live vocal transformation with an algorithmic latency of 45 ms for the low-complexity model, ensuring a seamless feedback loop without a disorienting delay between speaking and hearing the converted voice.
ST-2.3: Speaking Rate Control in the Stream
We introduce an online speaking-rate control mechanism for streaming text-to-speech (TTS) that adjusts duration at the frame level while audio is being generated. A continuous control signal is provided as an additional model input and is consumed causally, enabling the system to smoothly speed up or slow down the emitted speech. The controller supports gradual transitions, so rate changes do not introduce audible discontinuities. Unlike prior duration-control approaches that work only offline or use post-processing, the proposed method changes the speaking rate online as frames are being produced, enabling true speaking-rate control for streaming TTS.
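A minimal sketch of causal, frame-level rate control: predicted frame durations are rescaled by the control signal at emission time, so changes affect only frames not yet produced. The frame interface and duration units here are simplifying assumptions, not the paper's exact mechanism.

```python
def stream_with_rate_control(frames, durations_ms, rate_signal):
    """Causally rescale predicted frame durations while emitting audio.

    frames       : iterable of synthesized acoustic frames
    durations_ms : per-frame durations (ms) predicted by the model
    rate_signal  : callable returning the current user rate (1.0 = normal);
                   it is read once per frame, so adjustments apply causally
                   without touching frames that were already emitted.
    """
    for frame, dur_ms in zip(frames, durations_ms):
        rate = max(rate_signal(), 1e-3)    # e.g., read from a GUI slider
        yield frame, dur_ms / rate         # higher rate -> shorter frame
```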
This work contributes to the signal processing community in two ways. First, it introduces causal, frame-level duration control for streaming TTS, enabling low-latency, real-time adaptation of speaking rate. The system can dynamically slow down or speed up based on user preference or text buffer size, mimicking how humans regulate speech flow under different conditions. This enables new research on adaptive, feedback-driven audio generation under strict latency constraints.
Second, the method improves rate-dependent speech realism. Our analysis shows that speaking rate affects not only timing but also content and articulation: slow speech includes fillers (e.g., “uhm,” “yeah”), while fast speech reduces fillers and increases articulation speed. These effects are often overlooked in modern TTS systems. By enabling online rate control, our approach helps close this gap and moves streaming synthesis closer to natural human speech.
The demo is implemented as a Gradio app. Users upload a short reference clip (3-5 s) of a target speaker and enter text. The system begins emitting an audio stream after a short initial delay (~150 ms). While audio is playing, users can repeatedly adjust the speaking rate (speed up or slow down) and immediately hear the effect on the continuing stream, evaluating naturalness and voice similarity across rates. Any language can be used for the reference voice.
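A minimal Gradio sketch of such a streaming setup, assuming a hypothetical `tts_stream` generator in place of the actual VoXtream model; Gradio plays chunks as they are yielded when the output `Audio` component has `streaming=True`. The live rate slider and its causal plumbing are omitted here for brevity.

```python
import numpy as np
import gradio as gr

SR = 24000  # assumed output sampling rate

def tts_stream(reference_audio, text):
    """Hypothetical stand-in for the streaming TTS: yields 100 ms chunks."""
    for _ in range(50):
        chunk = np.zeros(SR // 10, dtype=np.float32)  # silence placeholder
        yield SR, chunk                               # (rate, samples) chunk

with gr.Blocks() as demo:
    ref = gr.Audio(label="Reference clip (3-5 s)", type="numpy")
    text = gr.Textbox(label="Text to synthesize")
    out = gr.Audio(label="Streamed speech", streaming=True, autoplay=True)
    gr.Button("Speak").click(tts_stream, inputs=[ref, text], outputs=out)

demo.launch()
```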
This demo extends ICASSP 2026 paper 4854, “VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency,” recently accepted for presentation.
ST-2.4: Real-Time Demo of Single-Channel Target Speaker Extraction Using State-Space Modeling
Target speaker extraction (TSE) aims to extract the voice of a pre-enrolled speaker from a single-channel audio mixture that may contain competing talkers and background noise. While recent TSE models demonstrate strong offline performance, practical deployment is often constrained by latency, compute, and stability under continuously changing acoustic conditions. This demo showcases an on-device, real-time single-channel TSE system that runs entirely on a laptop CPU and produces low-latency enhanced audio suitable for live listening. The core novelty lies in the adoption of state-space sequence modeling for streaming acoustic modeling. Specifically, we introduce a new state-space model (SSM)-based architecture into Conv-TasNet-based TSE; SSMs have been shown to efficiently capture long-term temporal dependencies. By leveraging SSMs, the proposed model requires fewer dilated convolutional layers to model temporal context, resulting in a reduction in overall model complexity. Consequently, the proposed method achieves a more favorable trade-off between computational efficiency and extraction performance.
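The streaming appeal of SSMs can be seen in a minimal sketch: a discretized linear state-space layer carries all past context in a fixed-size state, so each new sample costs O(1) regardless of context length. The matrices below are random placeholders, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16                                    # hidden state size (placeholder)
A = np.diag(rng.uniform(0.5, 0.99, N))    # stable diagonal dynamics
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))

def ssm_stream(samples):
    """Filter a live stream one sample at a time with constant memory.

    The fixed-size state x summarizes the entire past, which is how SSM
    layers can replace long stacks of dilated convolutions for context.
    """
    x = np.zeros((N, 1))
    for u in samples:
        x = A @ x + B * u                 # O(1) state update per sample
        yield (C @ x).item()              # one output sample per input
```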
Demo description: In the demo, we perform online TSE using real recordings captured during the demonstration session. First, the target speaker is enrolled by recording a short voice prompt of approximately 10 seconds. Then, the target speaker talks into the microphone while an interferer speaks nearby, and ambient noise from the demo environment is simultaneously captured. The system processes audio in real time and outputs the extracted target speech to headphones. Participants can directly compare the processed and unprocessed audio streams in real time. In addition, an optional visualization panel displays input and output waveforms as well as basic runtime statistics (e.g., real-time factor / latency) to facilitate understanding of the relationship between perceptual quality and system behavior.
Interactivity: Participants can actively vary conditions such as speaking style, distance to the microphone, and overlap ratio, and immediately hear how extraction quality changes. This hands-on experience fosters deeper discussion on the current state of TSE technology and highlights the gap between academic benchmark evaluations and real-world streaming constraints.
Impact: Overall, this demo provides a concrete reference for the signal processing community regarding the current practicality of TSE systems, and is expected to stimulate further discussion on streaming architectures and on-device efficiency.
ST-2.5: Semantic-Aware Speech Anonymization via Neural Codec Editing
This demo presents a novel Content Speaker Anonymization Pipeline designed to redact Personally Identifiable Information (PII) from speech while preserving prosodic continuity and naturalness. The system integrates an efficient Whisper-based Automatic Speech Recognition (ASR) module, leveraging precise word-level forced alignment, with a robust BERT-based Named Entity Recognition (NER) system to locate the timestamps of semantically sensitive information. Once identifiable information is detected, our system, unlike traditional obfuscation methods that rely on destructive signal masking (e.g., beeping) or artifact-prone copy-paste concatenation, uses a neural codec language model for speech editing. This architecture treats speech synthesis as a token prediction task, allowing the system to generate pseudonymized speech segments that blend seamlessly with the unedited surrounding context. The pipeline supports flexible replacement strategies, allowing users to switch between rule-based substitution and generative infilling, effectively editing the audio waveform through text manipulation.
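A minimal sketch of the PII localization stage, assuming openai-whisper for word-level timestamps and the public dslim/bert-base-NER checkpoint; the demo's actual ASR, alignment, and codec-editing components are not reproduced here.

```python
import whisper
from transformers import pipeline

asr = whisper.load_model("base")
ner = pipeline("token-classification", model="dslim/bert-base-NER",
               aggregation_strategy="simple")

def locate_pii(audio_path):
    """Return (entity_text, start_s, end_s) spans for named entities."""
    result = asr.transcribe(audio_path, word_timestamps=True)
    words = [w for seg in result["segments"] for w in seg["words"]]
    # Rebuild the transcript while remembering each word's character span.
    text, spans = "", []
    for w in words:
        start = len(text)
        text += w["word"]
        spans.append((start, len(text), w["start"], w["end"]))
    hits = []
    for ent in ner(text):
        # Any word overlapping the entity's character span contributes time.
        times = [(t0, t1) for (c0, c1, t0, t1) in spans
                 if c0 < ent["end"] and c1 > ent["start"]]
        if times:
            hits.append((ent["word"],
                         min(t0 for t0, _ in times),
                         max(t1 for _, t1 in times)))
    return hits
```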
For the signal processing community, this system directly addresses the critical privacy-utility trade-off in creating public datasets. By removing sensitive semantic content without degrading signal coherence, it enables the ethical sharing of speech data for downstream tasks such as ASR training and sentiment analysis. It demonstrates a shift from signal-level anonymization to semantic-level editing, setting a new standard for intelligibility in privacy-preserving speech processing.
The demonstration offers a real-time, hands-on experience. Attendees will be invited to record live speech containing mock sensitive information (e.g., names, locations, phone numbers). They will visualize the pipeline in action via a dashboard that displays the ASR transcription and highlights detected PII entities. Users can then interactively select a replacement method (e.g., manually type in the replacement or let the system decide) and immediately listen to the anonymized output. This allows for a direct comparison between our neural editing approach and traditional baselines, showcasing the system’s ability to maintain smooth transitions and high speech quality. The demo currently supports two languages, Korean and English, for users to choose from.
ST-2.6: Electrolaryngeal Speech Enhancement Based on Any-to-Many Voice Conversion
A common vocalization alternative for laryngectomees (individuals who have undergone laryngectomy) is the electrolarynx (EL), a handheld device that generates mechanical vibrations to enable speech production. However, EL speech sounds unnatural due to its monotonous pitch and mechanical excitation, which reduces communicative efficiency. In this demo, we will show an online (on-site) electrolaryngeal-to-normal (EL2NL) voice conversion (VC) system based on any-to-many DNN-based VC. We fine-tuned an existing VC model on a small amount of collected EL speech so that it can synthesize EL speech from large-scale normal (NL) speech datasets. Using the synthetic EL speech and its NL counterparts as pairs, we then fine-tuned an NL2NL VC model to adapt it to EL2NL conversion (a schematic sketch follows below). The resulting system can restore reasonably natural intonation and improve the intelligibility of EL speech. Participants may try the EL to experience our EL2NL VC system.
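A schematic sketch of the two-stage adaptation described above; every function and corpus name is a hypothetical placeholder that only illustrates the data flow, not any released code.

```python
# Hypothetical stand-ins: in the real system these are DNN-based models.
def finetune(model, data):
    return model                      # placeholder: adaptation happens here

def base_synthesis_model(utterance):
    return utterance                  # placeholder EL-speech synthesis

def nl2nl_vc_model(utterance):
    return utterance                  # placeholder NL-to-NL voice conversion

small_el_corpus = ["el_0001.wav"]                        # collected EL speech
large_nl_corpus = ["nl_%04d.wav" % i for i in range(3)]  # NL speech dataset

# Stage 1: adapt a synthesis model on the small EL corpus so it can render
# any NL utterance as pseudo-EL speech.
el_synth = finetune(base_synthesis_model, small_el_corpus)

# Stage 2: convert the large NL corpus to pseudo-EL, pair with the originals,
# and fine-tune an NL2NL VC model on those pairs to obtain EL2NL conversion.
pairs = [(el_synth(nl), nl) for nl in large_nl_corpus]
el2nl_vc = finetune(nl2nl_vc_model, pairs)
```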