ST-4: Show and Tell Demo 4
Wed, 6 May, 14:00 - 16:00 (UTC +2)
Location: Exhibition Hall

ST-4.1: DeepAudioX: An Open-Source Python Library for Audio Learning and Rapid Prototyping with Pretrained Models

Christos Nikou, National Centre for Scientific Research Demokritos; Stefanos Vlachos, National Centre for Scientific Research Demokritos; Ellie Vakalaki, National Centre for Scientific Research Demokritos; Theodoros Giannakopoulos, National Centre for Scientific Research Demokritos
We present DeepAudioX, an open-source PyTorch-based library that enables rapid development, training, evaluation, and deployment of audio classification systems using pretrained audio foundation models as feature extractors. Unlike existing toolkits that require extensive boilerplate code or impose rigid workflows, DeepAudioX combines plug-and-play pretrained backbones, modular pooling and classifier components, and unified high-level training and evaluation loops, offering an end-to-end solution from model development to deployment. Additionally, the library is designed to be easily extensible and customizable, enabling users to integrate their own backbones, datasets, and pooling methods while leveraging the remaining pipelines of the framework. In this way, DeepAudioX meets the needs of both novice and expert users. The demo will feature a series of interactive Jupyter notebooks showcasing end-to-end workflows on representative benchmark tasks, including speech emotion recognition, music genre classification, language identification, and sound event classification. Attendees will observe and directly interact with dataset construction, integration of pretrained backbones (e.g., BEATs), classifier configuration, and training and evaluation through concise APIs. Live performance metrics and efficiency indicators — including coding effort, training time, and accuracy — will be displayed to illustrate the effectiveness of pretrained audio representations combined with modular pooling and classifier architectures. Source code and documentation are publicly available at: https://github.com/magcil/deepaudio-x
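
A minimal sketch of the general pattern described above, a frozen pretrained backbone used as a feature extractor followed by a swappable pooling step and a lightweight classifier head, is given below in plain PyTorch. The class name and tensor shapes are illustrative assumptions, not the DeepAudioX API; see the linked repository for the library's actual interfaces.

    # Illustrative pattern only (not the DeepAudioX API): frozen backbone -> pooling -> classifier head.
    import torch
    import torch.nn as nn

    class PooledClassifier(nn.Module):
        def __init__(self, backbone, feat_dim, num_classes):
            super().__init__()
            self.backbone = backbone                  # any model mapping waveform -> (batch, time, feat_dim)
            for p in self.backbone.parameters():
                p.requires_grad = False               # use the foundation model purely as a feature extractor
            self.head = nn.Linear(feat_dim, num_classes)

        def forward(self, waveform):
            with torch.no_grad():
                feats = self.backbone(waveform)       # frame-level embeddings, shape (batch, time, feat_dim)
            pooled = feats.mean(dim=1)                # mean pooling over time; a swappable component
            return self.head(pooled)                  # logits for, e.g., emotion or genre classes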

ST-4.2: Smart Passive Acoustic Monitoring: Embedding a Classifier on AudioMoth Microcontroller

Louis Lerbourg, CEA Grenoble, UGA; Paul Peyret, Biophonia; Juliette Linossier, Biophonia; Marielle Malfante, CEA Grenoble, UGA
Passive Acoustic Monitoring (PAM) is an efficient and non-invasive method for monitoring ecosystems, allowing the acquisition of large bioacoustic datasets during lengthy deployment campaigns. The AudioMoth is a standard among autonomous recorders for such studies, providing the means to record data in situ at a reduced cost. In this demonstration, we show two concepts of a smart AudioMoth, extending the original system’s capabilities with AI functions without adding any hardware components. Specifically, the firmware of the first device is updated to allow continuous, real-time analysis of the soundscape alongside its recording. The second device is flashed with a firmware that starts a recording only if the targeted bird species has been detected in the soundscape. To the best of the authors’ knowledge, such a contribution, demonstrating real-time classification of the soundscape directly on the AudioMoth, has never been published. The same neural network is used in both cases and fits within the remaining 10 kB of RAM available on the AudioMoth. It is based on a 1D-CNN architecture (12 layers) and is trained on more than 10,000 calls of Scopoli’s shearwater (male, female, and chick), plus background noise and non-target species, recorded at 24 kHz with 16-bit resolution using ten different recorders deployed on the Pelagie Islands. The model is validated with 91% accuracy on the test set and 100% accuracy in experimental conditions. Both devices can be manipulated by the audience. Their analysis capabilities are illustrated by playing real-world recordings within earshot of the devices. Furthermore, the energy consumption and latency are visible in real time to the audience via the experimental set-up built for this purpose. A five-minute video illustrating the demonstration is also available. Our aim with this demonstration is to show the signal processing community the possibility not only to develop but also to deploy models performing continuous analysis of acoustic data, in real time and under strong memory, computation, and energy constraints. The success of this contribution lies in the multidisciplinary skills of the authors, spanning signal processing and machine learning, data and eco-acoustics, and embedded devices.
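
The exact 12-layer network is not reproduced here, but the kind of narrow 1-D convolutional stack and parameter-budget check one would perform before porting such a model to a microcontroller can be sketched as follows; all layer widths, kernel sizes, and the int8 storage assumption are illustrative, not the authors’ design.

    # Hypothetical sketch: a narrow 1-D CNN and a rough memory estimate for an MCU port.
    import torch.nn as nn

    def tiny_block(c_in, c_out):
        return nn.Sequential(nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                             nn.ReLU(), nn.MaxPool1d(2))

    model = nn.Sequential(
        *[tiny_block(1 if i == 0 else 8, 8) for i in range(6)],   # six conv+pool blocks (12 layers)
        nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        nn.Linear(8, 4),    # e.g., male, female, chick, background/non-target
    )

    n_params = sum(p.numel() for p in model.parameters())
    print(f"{n_params} parameters, about {n_params / 1024:.1f} kB if stored as int8 weights")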

ST-4.3: An Interactive Music Analysis Platform for Pedagogy and Audio Organization

Parampreet Singh: Indian Institute of Technology Kanpur; Sumit Kumar: Indian Institute of Technology Kanpur; Vipul Arora: Indian Institute of Technology Kanpur, Katholieke Universiteit Leuven
We present an interactive platform for Indian Art Music (IAM) that is useful for music pedagogy and for automatic analysis of music audio in terms of aspects such as raga, ornamentation, and melody. Pedagogy: The teaching module [1] enables music teachers to digitally record structured lessons by selecting a specific tonic and tāla. The complete lesson package can then be shared with students. Students can import lessons via a dropdown menu, set their preferred tonic and tempo, practice repeatedly, and submit their final recording to the teacher. An AI-based automatic mistake recognition module compares the student’s performance with the teacher’s reference audio, highlights mistakes, and assigns an overall score. Organisation: The analysis module enables users to either upload an audio file or provide a YouTube link. The system analyses the audio and identifies its raga [2], various ornamentations [3], and main melody [4]. Some components enable interactive corrections. Novelty: The demo uniquely integrates novel pedagogical and analysis workflows with signal-processing-driven analysis, bridging education, MIR, and explainable AI. Impact: The platform demonstrates how signal processing and machine learning can be embedded into culturally rich, real-world music analysis and education workflows, opening new directions for applied MIR, educational signal processing, and human-centered AI systems. Interactivity: The booth will feature (i) a fully interactive "Music Classroom" where attendees can wear headsets and act as students or teachers; they can listen and sing, and the system will provide immediate visual feedback on their singing, highlighting specific mistakes and assigning a score; and (ii) an analysis system where attendees can provide IAM clips from YouTube or their own devices and obtain musicological aspects such as raga, melody contours, and ornamentations in real time. References: [1] doi.org/10.36227/techrxiv.23269502.v2 [2] doi.org/10.1109/TASLPRO.2025.3574839 [3] doi.org/10.1109/TASLPRO.2025.3639738 [4] doi.org/10.1109/TASLP.2024.3399614
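
The authors’ mistake-recognition module is AI-based and its internals are not described here; purely as an illustrative signal-processing baseline for the same task, the sketch below aligns a student’s pitch contour to the teacher’s with dynamic time warping and flags large deviations. The file names and the 100-cent threshold are assumptions.

    # Illustrative baseline only (not the authors' method): DTW-align pitch contours and flag deviations.
    import numpy as np
    import librosa

    def contour_cents(path, sr=22050):
        y, _ = librosa.load(path, sr=sr)
        f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                fmax=librosa.note_to_hz("C7"), sr=sr)
        f0 = np.nan_to_num(f0, nan=1.0)          # placeholder value for unvoiced frames
        return 1200.0 * np.log2(f0)              # pitch contour in cents

    teacher = contour_cents("teacher.wav")       # hypothetical file names
    student = contour_cents("student.wav")
    _, wp = librosa.sequence.dtw(teacher[np.newaxis, :], student[np.newaxis, :])
    deviation = np.abs(teacher[wp[:, 0]] - student[wp[:, 1]])
    flagged = wp[deviation > 100]                # aligned frames off by more than ~1 semitone
    print(f"{len(flagged)} aligned frames flagged as potential mistakes")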

ST-4.4: A System-Integrated Parametric Array Loudspeaker Prototype for Controllable and Localized Sound Field Regulation

Jun Yang, Institute of Acoustics, Chinese Academy of Sciences; Yunxi Zhu, Institute of Acoustics, Chinese Academy of Sciences; Xiaoyi Shen, Institute of Acoustics, Chinese Academy of Sciences
This Show & Tell demo showcases a parametric array loudspeaker (PAL)-based sound field control system developed by the research group led by Prof. Jun Yang at the Institute of Acoustics, Chinese Academy of Sciences. The demo is rooted in the group’s long-term, systematic research on the theoretical analysis and engineering implementation of PAL technology, findings comprehensively summarized in their recent monograph Parametric Array Loudspeakers: From Theory to Application (Yang & Ji, 2025, Springer Nature), and presents a fully integrated PAL sound field control prototype. The prototype encompasses core components including ultrasonic transducer arrays, driving electronics, modulation and control modules, as well as real-time measurement and visualization tools. It enables the generation of highly directional audible sound and localized sound field regulation in free space, exhibiting practical sound field control capabilities that surpass conventional loudspeaker systems. The key innovation of this demo lies in its systematic integration of PAL theory and engineering applications. Leveraging rigorous nonlinear acoustic modeling, the demo illustrates how theoretical findings are translated into actionable engineering design decisions, such as modulation strategies, carrier frequency selection, and array configuration optimization. Distinct from typical PAL demonstrations that focus merely on perceptual effects, this work emphasizes stable, controllable, and repeatable sound field regulation, an indicator of the maturity of its underlying theoretical and engineering framework. This integration addresses the limitations of traditional discrete PAL setups, which often suffer from instability and poor repeatability, and establishes a standardized engineering paradigm for directional sound technology. This demo provides the signal processing community with a representative case study on the practical implementation of nonlinear acoustic signal processing and array-based control techniques in PAL systems. It serves as a valuable reference for researchers and engineers engaged in spatial audio, sound field control, and the engineering application of advanced acoustic signal processing. During the demonstration, attendees will have the opportunity to interact with the system by adjusting modulation parameters and playback signals. They can observe real-time variations in sound directivity and spatial confinement through both perceptual experience and on-site acoustic measurements, fostering an intuitive understanding of PAL-based sound field control technology.
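
The core signal-processing operation behind any PAL, modulating the audible signal onto an ultrasonic carrier so that nonlinear self-demodulation in air reproduces it along a narrow beam, can be sketched as follows. The 40 kHz carrier, modulation depth, and square-root envelope preprocessing (a common distortion-reduction strategy in the PAL literature) are illustrative choices, not the prototype’s actual design parameters.

    # Minimal sketch of ultrasonic carrier modulation for a parametric array loudspeaker.
    # Carrier frequency, sample rate, and modulation depth are assumed values.
    import numpy as np

    fs = 192_000                                     # sample rate high enough for the ultrasonic carrier
    fc = 40_000                                      # assumed carrier frequency (Hz)
    m = 0.8                                          # modulation depth
    t = np.arange(0, 1.0, 1 / fs)

    audio = np.sin(2 * np.pi * 1_000 * t)            # 1 kHz test tone, normalized to [-1, 1]
    envelope = np.sqrt(1 + m * audio)                # square-root preprocessing to reduce distortion
                                                     # of the self-demodulated audible signal
    primary = envelope * np.sin(2 * np.pi * fc * t)  # ultrasonic primary wave driving the transducer array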

ST-4.5: Sci-Phi: A Spatial Audio Language Model That Understands Real Multi-Source Audio Scenes

Sebastian Braun (Microsoft), Dimitra Emmanouilidou (Microsoft), David Johnston (Microsoft), Xilin Jiang (Columbia University New York), Hannes Gamper (Microsoft)
Spatial audio records sound sources and their directions, allowing humans and machines to hear not just what happens and when, but also where. Audio-language models (ALMs) play a crucial role in bridging the gap between audio and language understanding, expanding the modalities of human-computer interaction. While most established ALMs are only monaural, offering no spatial understanding, our IEEE OJSP paper presented here at ICASSP 2026, “Sci-Phi: A Large Language Model Spatial Audio Descriptor”, introduces the first ALM with spatial audio support that generalizes to real recordings beyond synthetic audio data. Sci-Phi describes sound events, their direction, time of occurrence, and loudness, as well as acoustic attributes such as reverberation and room characteristics. In this demo, we showcase Sci-Phi with an improved and more robust spatial audio encoder and extensive question-answering (Q&A) capabilities trained on a new Q&A dataset. Attendees will see and hear real spatial sound scenes recorded with a first-order Ambisonics microphone array (and a 3D camera for reference). The analysis output of the Sci-Phi ALM is overlaid onto the panoramic video at the corresponding time and location, while the model input can be heard binaurally via headphones. The interactive Q&A mode allows attendees to query the language model about specific sound objects and their relations in the scene. This demo illustrates the capabilities and limitations of spatial audio understanding and aims to spark future developments. Spatial ALMs can have great impact on the signal processing community as generalist tools for data analysis and curation, and as enablers of spatially aware agents and hearing-assistive or augmented-hearing technologies.
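
To give a sense of how direction can be extracted from a first-order Ambisonics recording (which is what allows spatial output to be overlaid at the right location in the panoramic video), the sketch below applies the textbook acoustic-intensity method to the B-format channels W, X, Y, Z. This is a generic illustration, not Sci-Phi’s spatial audio encoder.

    # Generic per-frame direction-of-arrival estimate from first-order Ambisonics (B-format),
    # via the acoustic intensity vector; not the Sci-Phi encoder.
    import numpy as np

    def foa_doa(w, x, y, z):
        """Return (azimuth, elevation) in degrees for one short frame of B-format samples."""
        ix = np.mean(w * x)                       # active intensity components
        iy = np.mean(w * y)
        iz = np.mean(w * z)
        azimuth = np.degrees(np.arctan2(iy, ix))
        elevation = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
        return azimuth, elevation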

ST-4.6: VisionSFX: Cross-Shot Consistent Video-to-Audio Generation with Depth-Aware Binaural Audio Rendering

Dayeon Ku, Gwangju Institute of Science and Technology; Jung Hyuk Lee, Gwangju Institute of Science and Technology; Hwa-Young Park, Gwangju Institute of Science and Technology; Jongyeon Park, Gwangju Institute of Science and Technology; Hong Kook Kim, Gwangju Institute of Science and Technology
Problem definition. As Video-to-Audio (V2A) technology advances toward real-world deployment in film post-production and automated video editing, perceptual consistency across shots becomes critical. However, current V2A methods [1] operate on a shot-by-shot basis, treating each shot as an isolated unit. This leads to two critical limitations: (1) ambient sounds disappear in shots where the source is not visible, and (2) the same sound effects (SFX) exhibit inconsistent characteristics across different shots. Moreover, existing methods ignore spatial positioning, producing audio without directional correspondence to on-screen objects. These limitations severely degrade perceptual continuity, spatial realism, and overall production quality. Key Challenges. (1) Cross-shot consistency: grouping shots that share the same physical space despite different camera angles, ensuring consistent ambient sound. (2) Binaural audio rendering: estimating sound source positions and the listener perspective from visual cues alone, without explicit depth or 3D scene data. Methodology. We demonstrate VisionSFX, a working system that addresses both challenges. For cross-shot consistency, our system feeds all shot keyframes to a vision-language model (VLM) at once, detecting objects and their visual attributes [2]. The VLM groups shots by shared visual features, identifies sound-producing objects, and generates SFX for each using TangoFlux [3]. For binaural audio rendering, each video frame is converted to a monocular depth map to estimate where sound sources are located in 3D space. The system then binaurally renders each sound, anchoring dialogue to the center and placing SFX across the binaural field based on their on-screen positions. System Pipeline. Our system comprises two stages. For SFX generation: (1) TransNetV2 [4] for shot boundary detection, and (2) parallel VLM inference for shot grouping. For binaural audio rendering: (3) Grounded-SAM [5] for object detection, (4) Depth Anything 3 [6] for depth estimation, and (5) HRTF-based binaural rendering [7]. What to Show at ICASSP. We demonstrate VisionSFX running fully offline on a laptop and an edge device. Attendees can explore the pipeline with prepared videos or upload their own to experience cross-shot consistency and binaural audio rendering firsthand. Additional materials are available at https://drive.google.com/drive/folders/10BqWfxXRZTwPcwy_3lX2ObSaLSW4vLx3?usp=sharing. [1] https://github.com/hkchengrex/MMAudio [2] https://doi.org/10.48550/arXiv.2511.21631 [3] https://github.com/declare-lab/TangoFlux [4] https://github.com/soCzech/TransNetV2 [5] https://github.com/IDEA-Research/Grounded-Segment-Anything [6] https://github.com/ByteDance-Seed/Depth-Anything-3 [7] https://www.isca-archive.org/interspeech_2019/lee19b_interspeech.html
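
As a rough illustration of pipeline step (5), the sketch below maps a detected object’s horizontal image position and estimated depth to an azimuth and applies simple interaural level and time differences to a mono SFX track. A real renderer would convolve with measured HRTFs as in [7]; the field of view, head radius, and gain law used here are assumptions.

    # Simplified stand-in for HRTF-based rendering: image position + depth -> crude binaural placement.
    import numpy as np

    def place_sfx(mono, sr, x_norm, depth_m, fov_deg=90.0):
        """x_norm in [0, 1] is the object's horizontal position; returns (left, right) channels."""
        azimuth = np.radians((x_norm - 0.5) * fov_deg)     # image position -> azimuth
        itd = 0.09 * np.sin(azimuth) / 343.0               # interaural time difference (s), ~9 cm head radius
        shift = int(round(abs(itd) * sr))                  # delay in samples for the far ear
        gain = 1.0 / max(depth_m, 1.0)                     # farther objects rendered quieter
        ild = 0.5 * (1.0 + np.sin(azimuth))                # 0 = fully left, 1 = fully right
        left = np.pad(mono, (shift if azimuth > 0 else 0, 0)) * gain * (1.0 - ild)
        right = np.pad(mono, (shift if azimuth < 0 else 0, 0)) * gain * ild
        n = max(len(left), len(right))
        return np.pad(left, (0, n - len(left))), np.pad(right, (0, n - len(right)))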