ST-1: Show and Tell Demo 1
Tue, 5 May, 14:00 - 16:00 (UTC +2)
Location: Exhibition Hall

ST-1.1: Nkululeko 1.0: A Python package to predict speaker characteristics with a high-level interface

Felix Burkhardt: audEERING; Bagus Tris Atmaja: NAIST; Florian Eyben: audEERING; Björn Schuller: audEERING, TUM, Imperial College London
The Nkululeko demo showcases a cutting-edge, open-source Python toolkit designed to simplify audio-based machine learning tasks, particularly in speech processing. Aimed at users with varying levels of expertise, Nkululeko eliminates the need for coding by leveraging a command-line interface (CLI) and configuration files. Built on scikit-learn and PyTorch, it provides a powerful yet user-friendly framework for training, evaluating, and analyzing speech databases using advanced machine learning methods and acoustic features.

Novelty and Innovations
The key innovation of Nkululeko lies in its ability to empower users—whether novices or experienced researchers—to easily experiment with speech processing tasks without deep technical knowledge. With version 1.0, Nkululeko introduces several significant advancements, such as:
* Transformer model fine-tuning: Users can now fine-tune pre-trained transformer models, enabling them to achieve state-of-the-art performance with minimal data and computation.
* Ensemble learning: This feature allows users to combine multiple models to improve prediction accuracy and robustness.
* Linguistic feature modeling: Nkululeko also supports advanced linguistic feature extraction, enabling the incorporation of higher-level language characteristics into speech analysis.
These innovations make it an invaluable tool for quickly testing hypotheses and deploying machine learning models, especially for those working with speech data and acoustic features.

Impact on Signal Processing Communities
Nkululeko has the potential to make a significant impact on various fields within the signal processing and machine learning communities. By simplifying complex workflows, it lowers the barrier to entry for speech processing research and application, making it accessible to a broader range of users, from educators to researchers. Additionally, its ability to detect biases in speech data (e.g., correlations between speaker characteristics and target labels) provides a novel approach to addressing fairness in AI-driven speech processing.

Interactivity for Attendees
At the ICASSP demo session, attendees will have the opportunity to engage with live demonstrations of key Nkululeko features, including model training, database analysis, and bias detection. The demo is designed to be highly interactive, allowing participants to explore various machine learning experiments in real time on a laptop. This hands-on experience will give attendees a practical understanding of how Nkululeko can be used in both academic and industry settings.
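To make the "no coding required" workflow concrete, the sketch below writes a minimal INI-style configuration and launches it through Nkululeko's CLI from Python. It is a sketch only: the section and option names follow the project's documented examples as best recalled and may differ between versions, and the database path, labels, and experiment name are placeholders.

```python
# Minimal sketch of driving a Nkululeko experiment: write a configuration file,
# then invoke the documented CLI entry point. Paths, labels, and names below
# are placeholders; consult the Nkululeko documentation for the exact options
# supported by your installed version.
import subprocess
import sys
from pathlib import Path

config = """
[EXP]
root = ./results/
name = demo_experiment
[DATA]
# placeholder: a CSV or audformat database registered under the name 'mydb'
databases = ['mydb']
mydb = ./data/mydb
target = emotion
labels = ['anger', 'happiness', 'neutral', 'sadness']
[FEATS]
# openSMILE acoustic features are one of the documented feature sets
type = ['os']
[MODEL]
type = xgb
"""

Path("demo.ini").write_text(config)
# Run the experiment through the command-line interface described in the abstract.
subprocess.run(
    [sys.executable, "-m", "nkululeko.nkululeko", "--config", "demo.ini"],
    check=True,
)
```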

ST-1.2: Interactive Spectrogram-Based Rhythm and Melody Annotation for Speech Analysis

Shreevatsa G. Hegde, Department of Computing and Software Systems, University of Washington Bothell; Min Chen, Department of Computing and Software Systems, University of Washington Bothell
This demo presents MeTILDA (https://metilda.net/), an interactive, cloud-based, and open-access signal processing platform for endangered language documentation and education. The system supports human-centered speech signal analysis by enabling hands-on exploration of rhythm and melody through direct interaction with audio representations. The demo showcases a complete, end-to-end speech analysis workflow, including spectrogram-based rhythm annotation, melody analysis, and pitch visualization. It also demonstrates the integration of our proposed MeT perceptual pitch scale, a key innovation that allows users to focus on relative melodic contours by normalizing speaker-dependent pitch variation caused by age, gender, or physiological differences.

Attendees can inspect spectral content while controlling audio playback for improved perception of rhythmic and melodic features. Interactive zooming and time navigation enable close inspection of short temporal regions, supporting precise analysis of rapidly changing acoustic events. For rhythm analysis, attendees can place vertical markers as taps on the spectrogram to annotate perceived rhythmic boundaries. The system provides multiple playback modes with rate control, enabling focused exploration of temporal structure and alignment of rhythmic annotations with acoustic events.

The demo also highlights melody analysis workflows in which users select regions of the spectrogram and apply different pitch extraction strategies, including region-based averaging, contour-based extraction, and manual frequency selection. Each method generates pitch data that are mapped to interactive Pitch Art visualizations, which abstract pitch movement patterns while remaining grounded in the underlying signal representation. Users can label syllables, apply time normalization, vertically center pitch ranges, and mark primary and secondary accent positions. The system further supports multi-speaker prosodic analysis by overlaying pitch representations from multiple speakers within a single Pitch Art chart, enabling direct visual comparison of pronunciation and intonation patterns.

The main novelty of the demo lies in its integration of interactive spectrogram manipulation, rhythm annotation, and melody visualization within a single end-to-end, human-in-the-loop workflow. The demo is highly interactive, with attendees directly manipulating speech signals and receiving immediate auditory and visual feedback. It demonstrates the broader impact of interactive and perceptually grounded signal processing tools for researchers in speech and audio processing, prosody analysis, signal visualization, and human-centered and explainable signal processing systems.
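The sketch below illustrates two of the ideas described above, region-based pitch averaging and speaker-dependent pitch normalization for multi-speaker overlay. It is not MeTILDA's implementation, and the proposed MeT scale is not reproduced here; a simple semitone offset from each speaker's mean pitch stands in for the normalization step, and the pitch contours are toy placeholders.

```python
# Sketch (not MeTILDA's code): region-based pitch averaging and a placeholder
# speaker normalization so that contours from different speakers can be overlaid
# in a single chart, as in the Pitch Art visualizations described above.
import numpy as np

def region_average_pitch(f0_hz: np.ndarray, start: int, end: int) -> float:
    """Average the voiced pitch samples (Hz) inside a selected frame region."""
    region = f0_hz[start:end]
    voiced = region[region > 0]          # frames with f0 == 0 are treated as unvoiced
    return float(voiced.mean()) if voiced.size else float("nan")

def normalize_contour(f0_hz: np.ndarray) -> np.ndarray:
    """Express a contour in semitones relative to the speaker's mean voiced pitch,
    removing speaker-dependent register so relative melodic movement is comparable.
    (Illustrative stand-in for the MeT perceptual scale, whose formula is not given here.)"""
    voiced = f0_hz[f0_hz > 0]
    ref = voiced.mean()
    out = np.full_like(f0_hz, np.nan, dtype=float)
    mask = f0_hz > 0
    out[mask] = 12.0 * np.log2(f0_hz[mask] / ref)
    return out

# Toy example: two speakers produce the same rising contour in different registers;
# after normalization the contours overlap and can be drawn in one Pitch Art chart.
speaker_a = np.array([110.0, 115.0, 0.0, 125.0, 130.0])   # lower-pitched voice
speaker_b = np.array([220.0, 230.0, 0.0, 250.0, 260.0])   # higher-pitched voice
print(region_average_pitch(speaker_a, 0, 3))
print(normalize_contour(speaker_a))
print(normalize_contour(speaker_b))
```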

ST-1.3: An Interactive Demonstration of the Open ASR Leaderboard

Eric Bezzam (Hugging Face), Steven Zheng (Hugging Face), Eustache Le Bihan (Hugging Face)
With the proliferation of automatic speech recognition (ASR) systems, selecting the right model for a given application can be challenging. We present a live demonstration of the Open ASR Leaderboard, a community-driven benchmarking platform that enables transparent, reproducible, and continuously updated comparison of ASR systems: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard

The leaderboard evaluates a wide range of ASR models across standardized datasets and metrics, and aggregates both open and closed-source systems. For open-source models, links to their Hugging Face model cards provide example code and implementation details. For closed-source systems, links point to the corresponding API documentation. In addition, an open-source GitHub repository provides evaluation scripts to reproduce leaderboard results: https://github.com/huggingface/open_asr_leaderboard

The Open ASR Leaderboard has seen strong adoption across academia and industry, with participation from major speech toolkits and companies (SpeechBrain, NVIDIA NeMo, ElevenLabs, IBM, Microsoft, etc.), and significant community engagement (550K+ total visits, 37K+ in the last month, 48 merged GitHub PRs).

To move beyond static benchmarking, the ICASSP demonstration will include a Reachy Mini desktop companion robot, enabling live speech interaction. Leveraging the open-source and rapid-prototyping nature of Reachy Mini, different ASR and text-to-speech (TTS) models can be used interchangeably. This allows attendees to directly experience how offline benchmark scores translate into perceptual quality, latency, and robustness in realistic human-machine interactions.

Main Novelty and Innovations
- A unified, public benchmark comparing open and closed-source ASR systems.
- A community-driven, continuously evolving evaluation framework.
- Human-in-the-loop, embodied evaluation that translates offline metrics into live interactions with Reachy Mini.

Impact on the Signal Processing Community
- Promotes transparent and reproducible evaluation practices.
- Provides a shared reference point for comparing ASR systems across academia and industry.

Interactivity for Attendees
- Attendees can speak directly to the Reachy Mini robot, select different ASR/TTS backends, and observe real-time transcriptions, latency differences, and qualitative behavior. They can compare these observations with leaderboard results, making the leaderboard metrics more tangible.
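The sketch below shows the kind of normalized word error rate (WER) comparison the leaderboard standardizes. It is not the leaderboard's own evaluation code (that lives in the linked GitHub repository); it assumes the Hugging Face transformers ASR pipeline, the datasets library, and jiwer, and the model and tiny test set are placeholders chosen for illustration.

```python
# Sketch of leaderboard-style ASR evaluation: transcribe a small public test set
# with a pretrained model and report WER. Not the Open ASR Leaderboard's scripts;
# model and dataset identifiers are illustrative placeholders.
from datasets import load_dataset
from transformers import pipeline
import jiwer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Tiny public sample set used only to keep the example fast.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

references, hypotheses = [], []
for sample in ds:
    result = asr(sample["audio"])                # dict with "array" and "sampling_rate"
    references.append(sample["text"].lower())
    hypotheses.append(result["text"].lower())

# Aggregate WER over the subset. The leaderboard additionally applies a shared text
# normalizer so that all systems are scored under identical conditions.
print(f"WER: {jiwer.wer(references, hypotheses):.3f}")
```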

ST-1.4: Speech Enhancement Intelligence - Inspecting a Model Under Controlled Degradation

Yair Amar (Technion - Israel Institute of Technology), Amir Ivry (Technion - Israel Institute of Technology), Israel Cohen (Technion - Israel Institute of Technology)
This Show and Tell demonstration presents an interactive system for speech enhancement intelligence: observing, probing, and interpreting how a speech enhancement model responds as noise conditions change. Rather than treating the model as a black box, the demo provides an interface that exposes how internal representations evolve under increasing noise, controlled by the user.

Attendees begin by speaking a short utterance into a microphone. This recording is treated as a clean reference. Artificial noise is then added in a controlled manner using an SNR slider, allowing users to smoothly move from clean to highly noisy conditions while keeping the underlying speech fixed. At each noise level, the clean and noisy signals are processed through a speech enhancement model, and internal activations from selected layers are extracted. The interface visualizes how the activations evolve under increasing noise and evaluates how closely the model’s representations under noise resemble those elicited by clean speech. These similarities are shown layer by layer using Centered Kernel Alignment (CKA), revealing which parts of the model remain stable, which become noise-sensitive, and which recover as noise conditions improve. These measures are summarized by fitting a linear trend to CKA as a function of SNR. Alongside these internal indicators, standard enhancement performance metrics such as PESQ, STOI, and SI-SDR are updated in real time.

By interacting with the noise controls, attendees can observe how internal representation stability degrades and recovers, and how these internal changes align with variations in output quality. This enables inspection of model behavior beyond post-hoc evaluation of enhanced signals alone. The demo offers an intuitive, hands-on view of how speech enhancement models internally respond to noise. It is relevant to the ICASSP community, as it illustrates how signal processing, learning-based models, and interpretability tools can be combined to better understand the internal behavior of modern speech systems.
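A generic sketch of the two computations described above follows: mixing noise into clean speech at a user-chosen SNR, and linear CKA between clean and noisy layer activations (the standard linear formulation). It is not the demo's code, and the activation matrices and the CKA-versus-SNR sweep are toy placeholders standing in for features extracted from an enhancement model.

```python
# Sketch (not the demo's implementation): SNR-controlled noise mixing and linear
# Centered Kernel Alignment (CKA) between clean and noisy activation matrices.
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR (dB) relative to `clean`."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (frames, features)."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    xty = np.linalg.norm(x.T @ y, "fro") ** 2
    xtx = np.linalg.norm(x.T @ x, "fro")
    yty = np.linalg.norm(y.T @ y, "fro")
    return float(xty / (xtx * yty + 1e-12))

rng = np.random.default_rng(0)

# Waveform example: a synthetic tone stands in for the recorded utterance.
t = np.linspace(0, 1, 16000, endpoint=False)
speech = np.sin(2 * np.pi * 220 * t)
noisy = mix_at_snr(speech, rng.standard_normal(t.size), snr_db=5.0)

# Toy activation example: sweep SNR and summarize the CKA-vs-SNR trend with a
# linear fit, mirroring how the demo condenses per-layer stability into a slope.
clean_act = rng.standard_normal((200, 64))          # placeholder "clean" activations
snrs = np.array([0.0, 5.0, 10.0, 20.0])
ckas = [linear_cka(clean_act, clean_act + rng.standard_normal(clean_act.shape) / (1 + s))
        for s in snrs]
slope, intercept = np.polyfit(snrs, ckas, 1)
print(dict(zip(snrs.tolist(), np.round(ckas, 3))), "slope:", round(slope, 4))
```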

ST-1.5: SCRIBAL: A Multilingual Transcription Platform for Academic Lectures and Impaired Speech Accessibility

Pol Pastells (1,2), Javier Román (1), Mauro Vázquez (1), Clara Puigventós (1), Montserrat Nofre (1), Mariona Taulé (1,2), Mireia Farrús (1,2)
1 - Centre de Llenguatge i Computació (CLiC), Universitat de Barcelona, Spain
2 - Institut de Recerca en Sistemes Complexos (UBICS), Universitat de Barcelona, Spain
SCRIBAL is a comprehensive web-based transcription and translation ecosystem comprising three integrated products. SCRIBAL provides real-time multilingual transcription and translation for university lectures and conferences, supporting most major languages through Whisper-based models, with specialized domain-optimized terminology currently available for Catalan. SCRIBAL-Social specializes in transcribing and translating impaired speech from Catalan speakers with Down syndrome and cerebral palsy, addressing critical accessibility needs. Additionally, the platform offers file-based transcription for post-processing scenarios.

SCRIBAL exemplifies how speech processing can bridge digital divides across linguistic and ability spectrums. It demonstrates practical solutions for low-resource language ASR, domain adaptation in academic contexts, and impaired speech transcription. The key innovation is a modular architecture that seamlessly integrates general-purpose multilingual ASR, domain-adapted academic models, and specialized impaired speech recognition within a unified platform. The impaired speech component represents pioneering work in Catalan speech processing, an under-resourced language for accessibility applications. This dual-focus approach—combining broad multilingual coverage with deep specialization for underserved populations—sets SCRIBAL apart from conventional transcription services while maintaining real-time performance suitable for live deployment.

Participants will actively engage with all three SCRIBAL modalities through multiple interaction modes. Using either our demonstration laptop or their own smartphones, attendees can: speak directly into the system in their native language to experience live multilingual transcription, upload audio files to test batch processing capabilities, and observe specialized impaired speech recognition through pre-recorded Catalan samples. This hands-on experience allows attendees to compare transcription accuracy across academic and general domains, experiment with various acoustic conditions and speaking styles, and discuss potential deployment strategies for their own institutions or research applications.

This work has been funded by the Generalitat de Catalunya (2024 PROD 00016 grant). It is also part of the FairTransNLP-Language project (PID2021-124361OB-C33), funded by MICIU/AEI/10.13039/501100011033/FEDER, UE.
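The sketch below illustrates the file-based Whisper workflow that underlies SCRIBAL's batch transcription mode. It is not SCRIBAL's implementation (the platform's domain-adapted Catalan and impaired-speech models are not assumed here); it uses the public Hugging Face transformers ASR pipeline, and the model name, language setting, and file path are placeholders.

```python
# Sketch (not SCRIBAL's code): file-based transcription and English translation of
# an uploaded recording with a generic Whisper-family model.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",   # placeholder; SCRIBAL uses domain-adapted models
    chunk_length_s=30,              # long lecture recordings are processed in chunks
)

audio_file = "lecture.wav"          # placeholder path to an uploaded recording

# Transcription in the speaker's language (assumed here to be Catalan).
transcript = asr(audio_file, generate_kwargs={"language": "catalan", "task": "transcribe"})
print(transcript["text"])

# Whisper-family models can also translate the same speech into English.
translation = asr(audio_file, generate_kwargs={"language": "catalan", "task": "translate"})
print(translation["text"])
```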

ST-1.6: Tahlil: An Interactive Toolkit for Standardized ASR Evaluation and Error Analysis

Yousseif Alshahawy, Daniel Izham, Aljawharah Bin Tamran, Ahmed Ali; HUMAIN, Riyadh, Saudi Arabia
Demo Overview
Tahlil is a stand-alone web application for standardized automatic speech recognition (ASR) evaluation and error analysis, designed to improve the transparency, interpretability, and reproducibility of reported ASR results. The system provides both single-utterance inspection and large-scale batch evaluation, with asynchronous processing to support realistic experimental workflows. Its architecture combines a Nuxt 4 frontend with a FastAPI backend, enabling responsive interaction and scalable evaluation. A key component of Tahlil is a custom Rust-based alignment module that enables efficient token-level alignment, detailed error inspection, and confusion statistics. By unifying ASR hypotheses and human annotations within a single evaluation framework, Tahlil allows systematic comparison across annotators, models, datasets, and normalization settings. The toolkit was initially motivated by inconsistencies observed in Arabic ASR reporting, where divergent text normalization practices, such as diacritic handling and letter-form variants, can substantially influence reported error rates. However, the framework itself is language-agnostic and applicable to a wide range of ASR evaluation scenarios.

Novelty and Innovation
Tahlil transforms ASR evaluation from a single aggregate score into a structured and reproducible analysis workflow. It integrates a custom Rust extension into the JiWER evaluation stack, enabling fast and consistent alignment with optional custom-cost or weighted alignment strategies. The resulting RapidFuzz-compatible opcode streams form a single source of truth from which all metrics, visualizations, and confusion statistics are derived. This design allows users to directly trace how normalization choices, alignment parameters, and input annotations affect final WER/CER values and error distributions. In addition, Tahlil provides built-in tools for text cleaning and normalization, supports both single and batch evaluation with asynchronous job tracking, and enables export of analysis artifacts. Together, these features standardize evaluation practices across models, datasets, and annotators.

Impact on the Signal Processing and ASR Community
By combining standardized scoring with interactive, alignment-based error analysis, Tahlil enables researchers and engineers to move beyond reporting a single WER or CER figure. The toolkit facilitates identification of systematic failure patterns, such as consistent substitutions, deletion bursts, and normalization-sensitive errors, in a manner that is transparent and easy to communicate. Its support for batch evaluation and self-contained deployment enables scalable, reproducible comparisons across ASR systems, improving experimental reliability and accelerating iteration cycles. In multilingual and morphologically rich settings, Tahlil provides a shared reference point for fairer benchmarking and clearer reporting of ASR performance.

Interactivity for Attendees
During the demo, attendees will interactively upload ASR hypotheses and references, adjust normalization and alignment settings in real time, visualize token-level errors and confusion statistics, compare multiple systems side by side, and export reproducible evaluation artifacts.
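The sketch below illustrates the two effects the abstract highlights: how a normalization choice such as Arabic diacritic handling shifts WER, and how token-level alignment opcodes expose individual error types. It is not Tahlil's code; jiwer computes the scores, Python's difflib stands in for the Rust-based aligner, and the diacritic-stripping normalizer is an illustrative assumption rather than the toolkit's built-in cleaning rules.

```python
# Sketch (not Tahlil's implementation): normalization-sensitive WER and
# token-level alignment opcodes for error inspection.
import difflib
import unicodedata
import jiwer

def strip_diacritics(text: str) -> str:
    """Remove combining marks (e.g. Arabic short vowels) before scoring.
    Illustrative normalizer only; Tahlil's own cleaning rules may differ."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

reference = "كَتَبَ الوَلَدُ الدَّرْسَ"      # diacritized reference transcript
hypothesis = "كتب الولد درس"               # undiacritized ASR hypothesis with one error

print("WER, raw:        ", jiwer.wer(reference, hypothesis))
print("WER, normalized: ", jiwer.wer(strip_diacritics(reference), strip_diacritics(hypothesis)))

# Token-level error inspection: align word sequences and report edit operations,
# standing in for the RapidFuzz-compatible opcode streams described above.
ref_words = strip_diacritics(reference).split()
hyp_words = strip_diacritics(hypothesis).split()
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, ref_words, hyp_words).get_opcodes():
    if tag != "equal":
        print(tag, ref_words[i1:i2], "->", hyp_words[j1:j2])
```

Run as-is, the raw score counts every word as wrong because diacritics differ, while the normalized score isolates the single genuine substitution, which is exactly the kind of normalization-driven discrepancy Tahlil is designed to surface.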