IET-3: Speech and Audio AI Systems
Wed, 6 May, 14:00 - 16:00 (UTC +2)
Location: Auditorium

IET-3.1: Why Blind Audio Processing Fails: Edge Intelligence for Content-Aware Audio Processing in Streaming Media

Sunil Bharitkar, Samsung
Streaming platforms such as YouTube, Vimeo, and Youku host a diverse mix of content, including movies, music videos, news, documentaries, and advertisements, with hundreds of hours of video uploaded every minute. While this diversity enables scale, it introduces a fundamental challenge for consumer electronics (CE) devices: content-agnostic audio and video post-processing can degrade user experience and violate artistic intent. A concrete example arises in audio rendering. Movies are typically authored in multichannel formats such as 5.1 or immersive 7.1.4, while music content is intentionally produced in stereo. To conserve transmission bandwidth, multichannel movie audio is often downmixed to stereo for streaming. Edge devices such as TVs, soundbars, smartphones/tablets, and audio-video receivers rely on post-processing DSP pipelines to upmix stereo movie audio back to spatial audio. However, when stereo music is blindly upmixed using the same processing chain, audible artifacts are introduced, and artistic intent is compromised. This is just one example of content-agnostic signal processing that can degrade the quality of experience. It motivates the need for real-time multimedia content classification directly on edge devices: by identifying the content type, the device can guide appropriate post-processing decisions. This talk presents an industry-driven view of multimedia classification for edge deployment, focusing on real-world constraints rather than algorithmic benchmarks alone. We will briefly review current state-of-the-art deep learning approaches for audio-visual classification and explain why many frame-level, audio- or vision-centric models, while accurate, are impractical for deployment on CE hardware due to latency, memory footprint, and power constraints. Model compression techniques such as pruning and quantization help, but often at the cost of degraded classification reliability in real-time settings. We also address why server-side classification is not a viable alternative at scale: embedding content class metadata upstream would require changes to existing MPEG standards and would be incompatible with billions of legacy decoders already deployed worldwide. These realities shift the problem decisively toward edge-based intelligence. The core of the presentation introduces a low-latency, low-memory edge deep-learning classifier that leverages linguistic metadata in the MPEG standard, specifically video titles, rather than raw audio or video frames. This approach achieves high classification accuracy at a fraction of the computational cost of conventional deep learning pipelines. We will also present the latest extensions that enable multilingual support via neural machine translation, allowing the solution to interface with the DSP audio signal processing chain and to deploy across global streaming ecosystems. The session concludes with a video demonstration of the classifier deployed on a TV connected to the YouTube streaming service, performing content-aware processing to improve the quality of experience in practice. Attendees will leave with a concrete understanding of how edge intelligence and signal processing can be co-designed to improve the quality of experience while readily scaling to billions of CE devices.
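
To make the title-based idea concrete, below is a minimal, hypothetical sketch of predicting a content class from a video title alone; the label set, hashed vocabulary, and model dimensions are illustrative assumptions, not the classifier presented in the talk.

import torch
import torch.nn as nn

# Assumed label set and sizes, chosen only for illustration.
CLASSES = ["movie", "music", "news", "sports", "advertisement"]
VOCAB_BUCKETS = 2 ** 15   # hashed word vocabulary keeps the embedding table small
EMB_DIM = 64

class TitleClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.EmbeddingBag(VOCAB_BUCKETS, EMB_DIM, mode="mean")
        self.head = nn.Linear(EMB_DIM, len(CLASSES))

    def forward(self, token_ids, offsets):
        return self.head(self.emb(token_ids, offsets))

def hash_tokens(title):
    # Hash words into fixed buckets so no vocabulary file is stored on the device
    # (a stable hash function would be used in practice).
    return torch.tensor([hash(w) % VOCAB_BUCKETS for w in title.lower().split()],
                        dtype=torch.long)

model = TitleClassifier().eval()
titles = ["Dune Part Two Official Trailer", "Top 40 Pop Hits 2024 Mix"]
tokens = [hash_tokens(t) for t in titles]
offsets = torch.tensor([0] + [len(t) for t in tokens[:-1]]).cumsum(0)
with torch.no_grad():
    logits = model(torch.cat(tokens), offsets)
print([CLASSES[i] for i in logits.argmax(dim=1)])   # untrained here, so outputs are arbitrary

A model of this size occupies well under ten megabytes of parameters, which illustrates why title-level classification is attractive under the latency, memory, and power budgets described above.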

IET-3.2: How to Build Realistic Acoustic Datasets for Audio AI Training Using Simulated Data: From Validation to Large-Scale Datasets

Steinar Guðjónsson, Treble Technologies
This presentation demonstrates a practical, end-to-end workflow for building realistic acoustic datasets using modern simulation tools, emphasizing validation, efficiency, and scalability. Training audio AI on realistic acoustic datasets, as opposed to simplified empty shoebox rooms, has been shown to yield 25% lower word error rates (WER) through improved speech enhancement. The talk begins with validation of the Treble SDK simulation engine by comparing simulated results against measurements from the Benchmark for Room Acoustic Simulations (BRAS) database. A controlled single-reflection scenario is analyzed across multiple boundary conditions to establish physical accuracy and confidence in the underlying solver. Next, a full Head-Related Transfer Function (HRTF) set is simulated by importing a 3D scan of a KEMAR mannequin into Treble SDK. The audience will see how dense, high-quality HRTFs can be generated rapidly and efficiently. Building on this foundation, a realistic room environment is created, and the simulated HRTF is used to render binaural room impulse responses at arbitrary listener positions. These results are validated through direct comparison with measured data. Finally, the presentation scales these techniques to large dataset production. Using 1,000 procedurally generated living room models, each containing five source locations and fifty receiver positions, the pipeline produces a total of 250,000 binaural impulse responses. This final step illustrates how physically validated simulation can enable diverse, large-scale datasets suitable for training and evaluating spatial audio and machine learning systems. These large datasets can then be used to build realistic audio scenes, including multiple speakers and background noises, ideal for training and evaluating complex audio AI enhancement algorithms. Attendees will gain concrete insights into validated simulation workflows, practical strategies for generating realistic acoustic data at scale, and an understanding of why that matters.
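
To illustrate the final step, the following is a minimal sketch, written against plain NumPy/SciPy rather than the Treble SDK API, of how pre-simulated binaural room impulse responses (BRIRs) might be convolved with dry speech and noise to assemble a single training scene; the file layout, channel ordering, and SNR mixing rule are assumptions made for the example.

import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

def render_scene(speech_paths, brir_paths, noise_path, noise_brir_path, snr_db=10.0):
    """Convolve each dry source with its BRIR, sum, then add reverberant noise at snr_db."""
    mix, fs = None, None
    for speech_path, brir_path in zip(speech_paths, brir_paths):
        dry, fs = sf.read(speech_path)            # mono dry speech
        brir, _ = sf.read(brir_path)              # shape (taps, 2): left/right ear
        wet = np.stack([fftconvolve(dry, brir[:, ch]) for ch in range(2)], axis=1)
        mix = wet if mix is None else mix[:len(wet)] + wet[:len(mix)]

    noise, _ = sf.read(noise_path)
    noise_brir, _ = sf.read(noise_brir_path)
    wet_noise = np.stack([fftconvolve(noise, noise_brir[:, ch]) for ch in range(2)], axis=1)

    n = min(len(mix), len(wet_noise))
    mix, wet_noise = mix[:n], wet_noise[:n]
    # Scale the noise so the scene hits the requested signal-to-noise ratio.
    gain = np.sqrt(np.mean(mix ** 2) / (np.mean(wet_noise ** 2) * 10 ** (snr_db / 10)))
    return mix + gain * wet_noise, fs

Because each scene reduces to a handful of FFT convolutions, sweeping the 250,000 simulated BRIRs against a speech and noise corpus parallelizes naturally, which is what makes dataset generation at this scale practical.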

IET-3.3: From Text to Talk: How New Speech LLMs Will Make Conversations with Technology More Natural

Kyu J. Han, Oracle Cloud Infrastructure (OCI)
The rapid rise of Large Language Models (LLMs) has redefined the boundaries of natural language understanding and generation, propelling advances across machine learning, conversational AI, and human-computer interaction. However, LLMs, while remarkable in text-based tasks, inherently overlook the vibrant complexity of spoken communication, where meaning is interwoven with emotion, prosody, timbre, and speaker individuality. For the ICASSP community, which sits at the forefront of speech, signal processing, and audio research, the evolution from text-only LLMs to models natively bridging speech and language stands as a defining technical frontier. This talk spotlights Speech Large Language Models (SpeechLLMs), a novel class of models that move beyond the traditional ASR → LLM → TTS pipeline by directly learning from and generating speech waveforms. SpeechLLMs fuse the representational power of LLMs with the rich acoustic and prosodic information in speech. This paradigm shift resolves persistent bottlenecks faced by cascaded systems: information loss during conversion, compounded errors across modules, and latency that limits real-time interaction. By integrating raw audio processing with end-to-end, context-aware generation, SpeechLLMs capture nuances such as emotion, speaker traits, and conversational dynamics, enabling new forms of expressive, natural dialogue. The technical content will delve into the architectures and training strategies that empower SpeechLLMs, from self-supervised audio representation learning and sequence-to-sequence modeling to tokenization techniques that merge acoustic and semantic information. Real-world case studies will illustrate how SpeechLLMs enable capabilities such as real-time speaker turn-taking, emotion tracking, and cross-lingual voice interaction. The talk will also review pioneering benchmarks and evaluation frameworks, offering a candid look at open research questions around scalability, bias, and robustness. ICASSP attendees will see how SpeechLLMs not only push the envelope in foundational areas like neural signal processing, end-to-end modeling, and multimodal learning, but also open up broader interdisciplinary collaborations across audio, NLP, and user experience design. The relevance and novelty for ICASSP are clear: as generative models become increasingly universal, the integration of speech into the LLM ecosystem will fuel fundamentally new applications in assistive technology, global communication, accessibility, and interactive media. Participants will gain technical insights into this fast-emerging landscape as well as concrete inspiration. The session is designed to motivate not just application-oriented engineers, but also researchers interested in core algorithms, data representation, and theoretical challenges. By surfacing both the current limitations and the promise of SpeechLLMs, this talk invites ICASSP’s diverse audience to join in shaping the future of conversational AI, one that doesn’t just generate text, but listens, responds, and connects through speech as naturally as humans do. Join us to explore how SpeechLLMs will unlock the next frontier of conversational intelligence, energizing future research and redefining how we interact with and through technology.
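
As a rough illustration of the tokenization idea above, the sketch below quantizes continuous encoder features into discrete speech tokens that share one vocabulary with text tokens, so a single causally masked backbone can predict both; the sizes, the nearest-neighbor codebook, and the toy backbone are illustrative assumptions, not any published SpeechLLM.

import torch
import torch.nn as nn

TEXT_VOCAB, SPEECH_CODES, D_MODEL = 32000, 1024, 512   # assumed sizes

class TinySpeechLM(nn.Module):
    def __init__(self):
        super().__init__()
        # One shared vocabulary: text tokens in [0, TEXT_VOCAB), speech tokens after them.
        self.embed = nn.Embedding(TEXT_VOCAB + SPEECH_CODES, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)   # causal mask below makes this decoder-style
        self.lm_head = nn.Linear(D_MODEL, TEXT_VOCAB + SPEECH_CODES)
        self.codebook = nn.Parameter(torch.randn(SPEECH_CODES, D_MODEL))  # stand-in for a learned quantizer

    def speech_to_tokens(self, feats):
        # feats: (frames, D_MODEL) from a self-supervised audio encoder (placeholder here).
        return TEXT_VOCAB + torch.cdist(feats, self.codebook).argmin(dim=-1)

    def forward(self, token_ids):
        # token_ids: (batch, seq) mixing text and speech tokens in one sequence.
        n = token_ids.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.backbone(self.embed(token_ids), mask=causal)
        return self.lm_head(h)                                 # next-token logits over text and speech units

model = TinySpeechLM().eval()
speech_feats = torch.randn(50, D_MODEL)                        # placeholder encoder output
prompt = torch.cat([torch.randint(0, TEXT_VOCAB, (10,)),       # "text prompt"
                    model.speech_to_tokens(speech_feats)])     # "listened" speech
with torch.no_grad():
    logits = model(prompt.unsqueeze(0))                        # (1, 60, TEXT_VOCAB + SPEECH_CODES)

Training a single model over such mixed sequences is what lets prosody and speaker cues survive end to end, rather than being discarded at the ASR-to-text boundary of a cascaded pipeline.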