ST-3: Show and Tell Demo 3
Wed, 6 May, 09:00 - 11:00 (UTC +2)
Location: Exhibition Hall

ST-3.1: Seamlessly Upgrading On-Device Speech Recognition System with More Recent Foundation Models

Sheng Li (Institute of Science Tokyo)
Recent advances in automatic speech recognition (ASR) have been driven by foundation models with very large parameter counts trained on massive datasets. However, deploying these models on edge AI devices, especially robotic platforms, remains a significant challenge due to limited computational resources. Existing solutions often rely on cloud-based APIs or traditional DNN-HMM frameworks, which may raise privacy concerns or fall short of state-of-the-art performance. This demonstration presents a novel solution that enables ASR decoders, originally designed for GMM-HMM, DNN-HMM, or end-to-end CTC-attention architectures, to support modern foundation models such as wav2vec2, HuBERT, Whisper, and recent speech LLMs. Our approach enables seamless integration of cutting-edge, highly accurate speech recognition capabilities into edge AI systems, including ROS-based robotic and smart-glasses platforms, without compromising user privacy or sacrificing performance. We will provide two interactive demonstrations: one on a robotic platform and one on a smart-glasses platform. This work is a joint project between the School of Engineering, Institute of Science Tokyo (formerly Tokyo Tech), and the Department of Informatics, Kyoto University, aiming to integrate recent edge AI and spoken-language processing technologies.
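The abstract does not disclose the bridging mechanism, but a minimal sketch of the general idea, exposing frame-level CTC posteriors from a foundation model in the interface a legacy CTC decoder expects, might look as follows (using the Hugging Face transformers API; the greedy decoder here is only a stand-in for the actual on-device decoder):

```python
# Sketch: expose frame-level CTC log-posteriors from a wav2vec2 foundation
# model so an existing CTC-style decoder can consume them. The greedy
# decoder is a stand-in; the demo's real bridging code is not published here.
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

def frame_posteriors(waveform_16k: np.ndarray) -> torch.Tensor:
    """Return (T, V) log-posteriors, the interface a CTC decoder expects."""
    inputs = processor(waveform_16k, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (1, T, V)
    return torch.log_softmax(logits, dim=-1).squeeze(0)

def greedy_ctc_decode(log_probs: torch.Tensor) -> str:
    # Collapse repeated frames, then drop CTC blanks (the pad token id).
    ids = torch.unique_consecutive(log_probs.argmax(dim=-1))
    ids = ids[ids != model.config.pad_token_id]
    return processor.decode(ids)
```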

ST-3.2: NVIDIA NeMo Voice Agent: An Open-Source, Multi-Model Framework for Building Your Own Real-Time Conversational AI

Taejin Park (NVIDIA), He Huang (NVIDIA), Kunal Dhawan (NVIDIA), Jagadeesh Balam (NVIDIA) and Boris Ginsburg (NVIDIA)
We propose a demonstration of the NVIDIA NeMo Voice Agent framework, a comprehensive open-source toolkit for building low-latency, real-time voice-to-voice agents. While traditional voice systems often rely on fragmented pipelines, this framework provides a unified, modular architecture that orchestrates the entire conversational loop, from STT (ASR) and speaker diarization to LLM and TTS, within a single high-performance ecosystem. A core innovation is the framework's commitment to a plug-and-play open ecosystem. Beyond support for NVIDIA's models, it is fully compatible with widely adopted open-source models, including LLaMA and Qwen for text-based instruction following, and Kokoro and ChatterBox for TTS. On the ASR side, we provide a rich selection of NeMo endpointing and speaker diarization tools. This flexibility ensures developers can swap components to meet specific performance or hardware needs. In addition, we support a simulated human evaluator for benchmarking other voice agents. Users can fully customize this evaluator by plugging in specific ASR, LLM, and TTS models, allowing for automated, end-to-end testing tailored to unique domain requirements or real-life scenarios, as well as comparisons with other industry-leading voice agents. Crucially, this initiative democratizes voice AI, ensuring that advanced conversational technology is no longer the exclusive domain of large corporations. By providing this open framework, we empower small startups, students, and researchers to experience, investigate, and innovate on an equal footing with industry giants, lowering the barrier to entry for high-quality voice agent development.

Interactive component: attendees will interact with a live, ultra-low-latency agent and view a real-time "Under the Hood" dashboard. Participants in our Show and Tell session can expect the following:
- Live model interoperability: experience the plug-and-play nature of the framework by hot-swapping models on the fly to hear immediate changes in latency and reasoning.
- Task-oriented agent-to-agent evaluation: evaluations that focus on achieving real-life tasks, highlighting the practical deployment of voice agents.
- Architectural transparency: students and researchers can inspect how the system handles complex audio signals and manages state across the unified pipeline.
- Hardware-agnostic insights: view performance data across a range of profiles, from local consumer-grade GPUs to data-center infrastructure, offering wide scope for academic experimentation.
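As an illustration of the plug-and-play design described above (not the framework's actual API; the Protocol names below are hypothetical), a voice agent whose stages can be hot-swapped might be sketched as:

```python
# Illustrative sketch only: the NeMo Voice Agent's real interfaces live in
# the open-source toolkit. These Protocols just show the plug-and-play idea
# of swapping pipeline stages without touching the orchestration loop.
from typing import Protocol

class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def respond(self, text: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class VoiceAgent:
    """One conversational turn: audio in -> text -> response -> audio out."""
    def __init__(self, stt: STT, llm: LLM, tts: TTS):
        self.stt, self.llm, self.tts = stt, llm, tts

    def turn(self, audio: bytes) -> bytes:
        return self.tts.synthesize(self.llm.respond(self.stt.transcribe(audio)))

# Hot-swapping a stage is then a single assignment, e.g.:
# agent.tts = KokoroTTS()   # hypothetical adapter class
```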

ST-3.3: Lightweight End-to-End Spoken Language Understanding System for Speech-Controlled Video Games

Alex Peiró-Lilja (Barcelona Supercomputing Center and Universitat de Barcelona), Rodolfo Zevallos (Barcelona Supercomputing Center), Iván Cobos (Universitat Politècnica de Catalunya), Xin Lu (Universitat Politècnica de Catalunya), Javier Hernando (Universitat Politècnica de Catalunya and Barcelona Supercomputing Center)
We are developing a cross-platform, voice-controlled video game in which the player is a construction-site manager who must instruct worker robots to install the ornamental elements of a chapel's façade in the correct positions. Players are encouraged to speak commands naturally and can even adapt their speaking style. The game engine uses a spoken language understanding (SLU) module to obtain labels, enabling the robots to obey commands accordingly and respond with synthetic voices trained on Catalan. To map natural speech to specific labels, we previously used a cascaded SLU system based on Whisper-Large and a BERT model, both adapted to Catalan. This system was computationally expensive, and the video game had to remain constantly connected to an inference server. To address this, we fine-tuned Whisper-Tiny as an end-to-end SLU system, achieving a solution that is more than 20 times lighter while maintaining similar performance. This allows us to integrate the SLU locally, so devices can run inference on their own. The game mechanics are original, so no existing data was available to train the model. To solve this, we designed combinations of natural sentences useful for the game mechanics, based on samples created by a group of humans, and used a text-to-speech system trained on Catalan to synthesize these sentences with different voices. In total, more than 475k labeled samples were generated. In the demo, the video game will be ready to play fully offline, using only a laptop and a headset with a microphone to interact with the robots. If a player is not a Catalan speaker, we can assist by translating the player's intended command. Moreover, we will share the knowledge required to reproduce the system for other video games or interactive applications that use speech in any language.
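A minimal sketch of the end-to-end SLU idea, fine-tuning Whisper-Tiny to emit label strings rather than transcripts, assuming the Hugging Face transformers API (the label format shown is hypothetical; the game's actual label inventory is not given in the abstract):

```python
# Sketch: treat SLU as sequence generation by training Whisper-Tiny to
# output intent/slot labels instead of a transcript. The label string
# "PLACE gargoyle LEFT_TOWER" is a hypothetical example.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

def training_step(audio_16k, label_string: str) -> torch.Tensor:
    feats = processor(audio_16k, sampling_rate=16_000,
                      return_tensors="pt").input_features
    labels = processor.tokenizer(label_string, return_tensors="pt").input_ids
    return model(input_features=feats, labels=labels).loss  # backprop this

# At inference, model.generate(input_features) decodes the label string
# directly from speech, with no intermediate transcript or BERT stage.
```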

ST-3.4: Toward Realistic Multimodal Speech Processing Benchmarks Using a Multi-Talker Audio-Visual Conversational Corpus

Bryony Buck (Edinburgh Napier University and University of Dundee), Lorena Aldana (University of Edinburgh), Ondrej Klejch (University of Edinburgh), Peter Bell (University of Edinburgh), Michael Akeroyd (University of Nottingham)
This demonstration presents a novel audio-visual testbed corpus designed for realistic evaluation of multimodal speech enhancement, speech separation, and speech intelligibility systems. The corpus consists of free-flowing three-person conversations recorded under controlled quiet and noisy conditions, including both normal-hearing participants and experienced hearing aid users. Sessions were captured using synchronised lapel microphones and multi-angle video, enabling multimodal signal processing, feature fusion, and joint audio-visual analysis in complex multi-talker environments. Attendees will engage in immersive speech-in-noise intelligibility assessment scenarios derived from the corpus, experiencing conversational excerpts with varying acoustic interference and audiovisual cues. The demonstration will showcase the application of an established keyword-based data mining evaluation framework (Valentini-Botinhao et al., 2023), previously applied to scripted speech with synthetic noise (Blanco et al., 2023), extending it to spontaneous conversational speech recorded in real noisy environments. This enables scalable intelligibility evaluation without reliance on scripted material while directly testing model robustness and generalisation to ecologically valid conditions. Comparative examples using established speech corpora will illustrate improvements in lexical diversity, reduced repetition, and increased conversational realism afforded by the presented dataset. The presented corpus is the first audio-visual dataset of free-flowing small-group conversations recorded directly in realistic noisy environments with mixed hearing abilities among interlocutors. Unlike conventional scripted or synthetically corrupted datasets commonly used for benchmarking, it captures natural turn-taking, overlapping speech, lexical variability, and visual articulatory cues critical for multimodal signal processing under real-world communication conditions. The demonstration further introduces a novel layered real-world soundscape, incorporating competing talkers and multi-level environmental interference known to challenge hearing-impaired listeners. As such, the corpus provides ecologically valid, highly realistic validation conditions for multimodal speech enhancement and separation algorithms. This demo addresses a critical gap in current evaluation practices (see Buck et al., 2024 for review) by providing realistic, out-of-domain conversational data for benchmarking and generalisation testing of multimodal systems. Attendees can actively engage in immersive listening tasks, exploring the influence of audiovisual cues on intelligibility first-hand. They are invited to provide feedback on dataset usability and evaluation design, contributing to future corpus development and supporting community-driven, realistic benchmarking standards for multimodal communication technology evaluation.
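As a rough illustration of keyword-based intelligibility scoring in the spirit of the cited framework (the actual mining and scoring procedure is described in Valentini-Botinhao et al., 2023; this sketch is only a hypothetical simplification), one could compute keyword recall per excerpt:

```python
# Hypothetical sketch: score a listener's (or ASR system's) response by the
# fraction of reference keywords it contains, enabling intelligibility
# evaluation without scripted material.
import re

def keyword_recall(reference_keywords: set[str], response: str) -> float:
    heard = set(re.findall(r"[a-z']+", response.lower()))
    hits = sum(1 for kw in reference_keywords if kw in heard)
    return hits / len(reference_keywords) if reference_keywords else 0.0

# e.g. keyword_recall({"coffee", "meeting", "tomorrow"},
#                     "did you say the meeting is tomorrow?")  -> approx. 0.67
```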

ST-3.5: Modular, Safe Granite Speech Conversation with Multiple Speakers

IBM Research: Nathaniel Mills, George Saon, Zvi Kons, Hagai Aronowitz, Avihu Dekel, Slava Shechtman, Raul Fernandez, David Haws, Samuel Thomas, Alexander Brooks, Sashi Novitasari, Tohru Nagano, Takashi Fukuda, Ron Hoory, Brian Kingsbury, Luis Lastras. IBM Software: Richie Verma, Leonid Rachevsky
This demonstration presents the Granite Speech framework as a novel, modular platform for conversational voice interaction, highlighting its ability to support fluid, natural, and contextually grounded exchanges between humans and AI systems. The framework tightly orchestrates high-quality, low-latency Granite Speech-based transcription, language reasoning with integrated safety guardrails, and expressive speech synthesis within a coordinated runtime. This design enables responsive, low-latency interactions while maintaining the flexibility and interpretability characteristic of modular architectures. The demonstration aims to illustrate how such an approach can deliver an end-to-end conversational experience comparable to monolithic speech-to-speech models, yet retain the adaptability needed for the continuous integration of innovations across individual components, such as improved ASR, enhanced LLM reasoning, updated guardrails, or more advanced speech generation. One such innovation highlighted in the demonstration is the system's ability to support structured multiparty dialogue through speaker-attributed automatic speech recognition (SA-ASR), a Granite Speech capability that appends explicit speaker labels to the ASR transcript. This mechanism enables the framework to manage conversations involving multiple human speakers while preserving clear attribution, continuity, and contextual grounding across turns. In a typical demonstration scenario, the interaction begins with an initial exchange in which each participant introduces themselves. As the dialogue progresses, the system continues to detect the inputs of individual speakers and generates responses tailored to the appropriate participant, for example by referencing the corresponding speaker's name. Prompt-based contextual biasing, a new Granite Speech capability that injects bias keywords into the ASR prompt, can improve the recognition of foreign or otherwise rare names.

Impact on signal processing communities:
1. Contributing mindshare on the design of modern modular spoken conversation systems and their components.
2. Demonstrating the use of open-weights models such as Granite Speech, released on Hugging Face, and providing guidance on how to employ these models and their newly released features.

This will be an interactive demonstration in which participants converse with an AI system. Most of the interaction will be conducted by IBM demonstrators, with attendee participation enabled when technically feasible.
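To illustrate the two capabilities named above (the "speaker_N:" label convention and prompt wording below are hypothetical stand-ins, not Granite Speech's documented formats), a dialogue manager might consume SA-ASR output and build a biasing prompt like this:

```python
# Illustrative sketch only: Granite Speech's actual SA-ASR output format and
# biasing prompt syntax ship with the Hugging Face release; this shows the
# general pattern of keeping per-speaker context and priming rare names.
import re

def parse_sa_asr(transcript: str) -> list[tuple[str, str]]:
    """Split a speaker-attributed transcript into (speaker, utterance) turns."""
    turns = re.findall(r"(speaker_\d+):\s*(.*?)(?=speaker_\d+:|$)",
                       transcript, flags=re.S)
    return [(spk, text.strip()) for spk, text in turns]

def biasing_prompt(keywords: list[str]) -> str:
    """Inject bias keywords so the recognizer is primed toward rare names."""
    return "Transcribe the audio. Likely names: " + ", ".join(keywords)

turns = parse_sa_asr("speaker_1: hi, I'm Zvi. speaker_2: and I'm Hagai.")
# -> [('speaker_1', "hi, I'm Zvi."), ('speaker_2', "and I'm Hagai.")]
```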

ST-3.6: Reliable Real-Time Meeting Transcription through Multimodal Speaker Detection and Emotion Recognition

Ran Han (Electronics and Telecommunications Research Institute), Jeom-ja Kang (Electronics and Telecommunications Research Institute), Kiyoung Park (Electronics and Telecommunications Research Institute), Woo Yong Choi (Electronics and Telecommunications Research Institute), Changhan Oh (University of Science and Technology, Electronics and Telecommunications Research Institute), Yeeun Jo (University of Science and Technology, Electronics and Telecommunications Research Institute), Hwa Jeon Song (Electronics and Telecommunications Research Institute)
We present a reliable real-time meeting transcription system that integrates multimodal speaker detection and emotion recognition using video and circular array microphone signals. The proposed Show & Tell demo targets realistic multi-party meeting environments, where background noise and overlapping speech often degrade conventional speech-only transcription systems. By combining acoustic processing from a circular array microphone with visual information from a 360-degree camera, the system enables robust speaker diarization and real-time speaker-aware transcription. The system processes synchronized audio-visual streams captured during meetings. On the acoustic side, beamforming and noise reduction enhance target speech signals recorded by a six-channel circular array microphone. In parallel, a video-based Active Speaker Detection (ASD) module estimates speaker locations and speaking activity using visual cues. The integration of acoustic and visual modalities allows recognized speech segments to be accurately associated with individual speakers, even under overlapping speech and spatial ambiguity. Beyond speaker-aware transcription, the system incorporates a multimodal emotion recognition module that analyzes visual cues, acoustic characteristics, and linguistic context derived from recognized speech. Facial expressions, acoustic features, and semantic information are jointly used to estimate the emotional states underlying each utterance. This allows users to understand not only what was said and by whom, but also how participants expressed themselves emotionally during the meeting. To support an interactive Show & Tell experience, the demo provides a live visualization interface where attendees can observe, in real time, who is speaking and how emotional states are reflected alongside the transcribed speech. Participants can engage in spontaneous discussions and observe how speaker overlaps, turn-taking dynamics, and emotional shifts are captured and visualized by the system. After the meeting concludes, users can generate a summary of the entire conversation upon request, including speaker-wise summaries for structured review of individual contributions. By integrating array microphone–based acoustic processing, video-based speaker detection, automatic speech recognition, multimodal emotion recognition, and post-meeting summarization into a unified pipeline, the proposed demo demonstrates the feasibility of multimodal signal processing for real-time meeting transcription and offers an intuitive Show & Tell experience for robust multimodal meeting understanding in multi-speaker environments.
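The abstract does not specify the beamformer used; as one minimal, conventional possibility for a six-channel uniform circular array, a frequency-domain delay-and-sum sketch (array radius is an illustrative assumption) could look like this:

```python
# Minimal delay-and-sum beamformer sketch for a 6-channel circular array.
# The demo's actual front end is described only as "beamforming and noise
# reduction"; the geometry constants here are illustrative assumptions.
import numpy as np

FS = 16_000                                   # sample rate (Hz)
RADIUS = 0.05                                 # assumed array radius (m)
C = 343.0                                     # speed of sound (m/s)
MIC_ANGLES = np.deg2rad(np.arange(6) * 60.0)  # uniform circular layout

def delay_and_sum(x: np.ndarray, steer_angle_rad: float) -> np.ndarray:
    """x: (6, T) multichannel audio; returns (T,) enhanced signal."""
    n = x.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / FS)
    # Compensating delay per mic for a far-field source at steer_angle_rad:
    # mics nearer the source hear the wavefront earlier and are delayed more.
    align = RADIUS * np.cos(MIC_ANGLES - steer_angle_rad) / C
    X = np.fft.rfft(x, axis=1)
    phase = np.exp(-2j * np.pi * freqs[None, :] * align[:, None])
    return np.fft.irfft((X * phase).mean(axis=0), n=n)
```

The steering angle would come from the video-based Active Speaker Detection module, which is what lets the visual modality guide the acoustic enhancement in the pipeline described above.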