DEMO-1B: Show and Tell Demos I-B
Tue, 16 Apr, 13:10 - 15:10 (UTC +9)
Location: Hall D2: Podium Pitch Room B

DEMO-1B.1: w2v2viz: An Interactive Transformer Probe Visualisation Toolkit

Patrick Cormac English, Erfan A. Shams, John D. Kelleher, Julie Carson-Berndsen
State-of-the-art transformer-based models like wav2vec 2.0 have led to performance improvements in speech processing tasks. However, model explainability remains limited, especially regarding fine-grained phonetic information encoded in speech embeddings. This demonstration presents a novel analysis framework to probe wav2vec 2.0 for articulatory transitions using a limited amount of labeled data. We developed an interactive visualisation tool called w2v2viz that generates 3D terrains depicting probe outputs over time for manual inspection. In our ICASSP 2024 paper "Following the Embedding: Identifying Transition Phenomena in wav2vec 2.0 Representations of Speech Audio" (paper ID #6920), we train multilayer perceptron (MLP) models on phone-averaged embeddings from TIMIT data to predict manner of articulation (MOA), place of articulation (POA), and other articulatory features. Our analysis of test-set embeddings, represented as 25 ms audio frames, shows that frame-level articulatory information can be identified without direct annotations. Users can navigate audio embeddings using slider controls for model layer and audio time selection. Additionally, in the extended version, users can observe the spectrogram of the audio as well as the previous and next frame probabilities. Peaks on the probability terrain reveal frames with high confidence for specific articulatory feature pairs, such as manner and place of articulation in the case of consonants. The tool also shows target phone labels and descriptive text outputs for additional detected features. Our analysis framework and w2v2viz visualisation demo contribute to transformer model explainability by probing speech embeddings for sub-phonemic articulatory transitions. We demonstrate how a limited amount of labeled data can uncover frame-level phonetic knowledge encoded by wav2vec 2.0, even without direct supervision at this temporal resolution.
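A minimal sketch (not the authors' released code) of the probing and terrain idea: a small MLP probe is trained on averaged embeddings, then applied frame-by-frame and rendered as a 3D probability surface over time. The names `train_X`, `train_y`, and `frame_embs` are assumptions standing in for data prepared from TIMIT and wav2vec 2.0.

```python
# Sketch: MLP probe on wav2vec 2.0 embeddings + 3D probability "terrain".
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

def make_probe(dim=768, n_classes=8, hidden=256):
    # Small MLP classifier over a single embedding vector.
    return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))

def train_probe(probe, train_X, train_y, epochs=20, lr=1e-3):
    # train_X: (N, 768) phone-averaged embeddings, train_y: articulatory class ids.
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(train_X), train_y)
        loss.backward()
        opt.step()
    return probe

def plot_terrain(probe, frame_embs, class_names):
    # Probability surface: x = time (25 ms frames), y = articulatory class, z = probability.
    with torch.no_grad():
        probs = torch.softmax(probe(frame_embs), dim=-1).numpy()   # (T, C)
    T, C = probs.shape
    t, c = np.meshgrid(np.arange(T), np.arange(C), indexing="ij")
    ax = plt.figure().add_subplot(projection="3d")
    ax.plot_surface(t, c, probs, cmap="viridis")
    ax.set_xlabel("frame (25 ms)")
    ax.set_yticks(range(C))
    ax.set_yticklabels(class_names)
    ax.set_zlabel("probe probability")
    plt.show()
```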

DEMO-1B.2: Multilingual speech intelligibility testing in crowdsourced and laboratory settings

Nerio Moran, Ginette Prato, Miguel Plaza, Shirley Pestana, Daniel Arismendi, Jose Kordahi, Cyprian Wronka, Laura Lechler, Kamil Wojcicki
Advancements in generative algorithms promise new heights in what can be achieved, for example, in the speech enhancement domain. Beyond the ubiquitous noise reduction, destroyed speech components can now be restored, something not previously achievable. These emerging advancements create both opportunities and risks, as speech intelligibility can be impacted in a multitude of beneficial and detrimental ways. As such, there exists a need for methods, materials and tools that enable rapid and effective assessment of speech intelligibility. Yet, the well-established laboratory-based measures are costly and do not scale well. Furthermore, public availability of multilingual test materials with associated software is lacking. The 2024 ICASSP paper #9588 “Crowdsourced multilingual speech intelligibility testing” aims to address some of the above challenges. This includes a public release of multilingual recordings and software for test survey creation. While the novelty of our approach rests primarily on the adaptation to crowdsourcing, this by no means limits its applicability to in-laboratory environments. The proposed “Show and Tell” aims at demonstrating the above contributions to the speech research community at the 2024 ICASSP conference. It will include an overview of the test method, the public test data release, and the open-source test software, along with a demonstration of a test setup. The audience will have the opportunity to take a short version of the test via noise-cancelling headphones and hence experience the test first-hand in an interactive way. We note that the demo and associated releases will be useful for the evaluation of, for example, speech enhancement and neural codec models, but also other technologies, including those that produce or require their own audio recordings, such as text-to-speech, voice conversion, or accent correction systems. As such, the proposed "Show and Tell" will be of broad interest to the conference audience.
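As a purely illustrative sketch (not part of the released test software), one common way to score an individual intelligibility trial is the fraction of reference keywords a listener reproduces in their typed response:

```python
# Illustrative keyword-scoring of a single intelligibility trial.
import re

def keyword_score(reference: str, response: str) -> float:
    """Fraction of reference words found in the listener's response."""
    norm = lambda s: re.sub(r"[^\w\s]", "", s.lower()).split()
    ref_words = norm(reference)
    resp_words = set(norm(response))
    if not ref_words:
        return 0.0
    hits = sum(1 for w in ref_words if w in resp_words)
    return hits / len(ref_words)

# Example: a partially intelligible trial
print(keyword_score("the birch canoe slid on the smooth planks",
                    "the canoe slid on smooth planks"))  # 0.875
```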

DEMO-1B.3: MeetEval, Show Me the Errors! Interactive Visualization of Transcript Alignments for the Analysis of Conversational ASR

Thilo von Neumann, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach
The importance of conversational speech recognition has proliferated in recent years. It is the task of recognizing the speech of all speakers in a conversation, estimating utterance start and end times, and attributing each utterance to the speaker who produced it. It is a much more challenging problem than (conventional) single-speaker speech recognition, and detailed analysis of system performance is crucial for efficient development. We present an interactive visualization tool developed in MeetEval (https://github.com/fgnt/meeteval) to facilitate error analysis and the spotting of error locations. The tool displays alignments between ground-truth and estimated transcripts for multiple speakers and long recordings. It highlights and summarizes different error types, such as word insertions, deletions, and substitutions, so that problematic regions with high error density can be identified easily. Where automatic detection of error types is unreliable, e.g., for temporal misalignment, speaker changes, or leakage, the visualization allows a (human) user to spot issues. Various tools already exist for the visualization of speech recognition errors, e.g., included in Kaldi. They are, however, designed for classical single-utterance recognizers and are thus not suitable for conversational ASR. To the best of our knowledge, no tools exist that easily visualize transcription errors for conversational speech recognition. We will demonstrate the tool using examples from various transcription systems and datasets, including the latest CHiME (Computational Hearing in Multisource Environments) challenge. We will show how our tool enables easy analysis and reveals previously undetected issues. Participants will be able to test and use the tool themselves through examples accessible via a QR code, and can enter custom transcripts for visualization during the session. An early work-in-progress example of the tool is available at https://groups.uni-paderborn.de/nt/meeteval/viz.html.
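For illustration, and independently of MeetEval's actual API, the kind of word-level alignment such a visualization is built on can be sketched as a Levenshtein alignment whose backtrace labels each position with an error type:

```python
# Sketch: edit-distance alignment between reference and hypothesis words,
# labelled with 'correct', 'substitution', 'deletion', or 'insertion'.
def align_words(ref, hyp):
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimal edit cost to align ref[:i] with hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace to recover the aligned word pairs with error labels.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            label = "correct" if ref[i - 1] == hyp[j - 1] else "substitution"
            out.append((ref[i - 1], hyp[j - 1], label))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            out.append((ref[i - 1], None, "deletion"))
            i -= 1
        else:
            out.append((None, hyp[j - 1], "insertion"))
            j -= 1
    return out[::-1]

print(align_words("show me the errors".split(), "show me errors".split()))
# -> [('show', 'show', 'correct'), ('me', 'me', 'correct'),
#     ('the', None, 'deletion'), ('errors', 'errors', 'correct')]
```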

DEMO-1B.4: Towards Streaming Speech-to-Avatar Synthesis

Tejas S. Prabhune, Peter Wu, Bohan Yu, Gopala K. Anumanchipalli
Speech-driven avatar animation is useful for many applications in speech and linguistics. Specifically, it can facilitate second language (L2) pronunciation learning via visual feedback and aid hearing-impaired individuals in lip-reading when only an audio signal is available during communication. Previous works focusing on offline synthesis have achieved success using face scans as well as various input modalities like magnetic resonance imaging (MRI) and/or electromagnetic articulography (EMA) to model the movement of vocal articulators. To the best of our knowledge, these advances in offline animation methods have not yet been extended to streaming solutions. In this show-and-tell, we demonstrate a system connecting recent advances in deep articulatory inversion to real-time speech-driven facial and tongue animation. Audience members will be able to speak into a microphone and watch their speech drive the 3D avatar's face and inner mouth movements in real time. The proposed articulatory streaming architecture uses an acoustic-to-articulatory inversion (AAI) process converting speech to EMA, followed by a mapping between each EMA vocal tract feature and a joint on the 3D face. We first train a state-of-the-art six-layer Transformer model, prepended with three residual convolutional blocks, on the HPRC dataset. The model uses the 10th layer of WavLM for speech representations and outputs EMA, tract variables, phonemes, and pitch simultaneously. Additionally, we build a 3D facial model and joint-based rig for our avatar. We create joint systems for each EMA feature and add appropriate skin weights to map areas of the face and inner mouth to joints. During the streaming task, we use an audio stream and a voice activity detector (VAD) to convert all speech within a rolling window into the corresponding EMA features using the AAI model. We then map the EMA data for each frame to the facial model, creating real-time animation of speech.
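A rough sketch of the streaming loop described above; the names `aai_model`, `vad_is_speech`, `mic_stream`, and `avatar.set_joint` are placeholders rather than the authors' implementation, and the channel-to-joint mapping and 20 ms hop are assumptions.

```python
# Sketch: rolling-window streaming from microphone audio to avatar joints.
import numpy as np

SAMPLE_RATE = 16000
WINDOW_SEC = 1.0          # rolling analysis window fed to the AAI model
HOP_SEC = 0.02            # assumed frame rate of the EMA output

EMA_TO_JOINT = {          # each EMA channel drives one joint on the rig (illustrative)
    "tongue_tip": "joint_tongue_tip",
    "tongue_body": "joint_tongue_body",
    "lower_lip": "joint_lower_lip",
    "jaw": "joint_jaw",
}

def stream_to_avatar(mic_stream, aai_model, avatar, vad_is_speech):
    buf = np.zeros(int(WINDOW_SEC * SAMPLE_RATE), dtype=np.float32)
    for chunk in mic_stream:                     # each chunk holds the newest samples
        buf = np.concatenate([buf[len(chunk):], chunk])   # keep a fixed-length rolling window
        if not vad_is_speech(chunk):
            continue                             # keep the avatar still during silence
        ema = aai_model(buf)                     # (frames, channels) articulator trajectories
        latest = ema[-1]                         # newest frame drives the rig
        for i, (channel, joint) in enumerate(EMA_TO_JOINT.items()):
            avatar.set_joint(joint, latest[i])   # map articulator position to joint pose
```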

DEMO-1B.5: Phone-aid: An innovative language-independent pronunciation assessment tool

Mostafa Shahin, Beena Ahmed
Phone-aid is an innovative tool designed to assist non-native speakers in learning and improving their pronunciation of a target language by evaluating it at both the phonemic and phonological levels. The tool features a two-tier assessment system. The first tier compares the recognized phoneme sequence of the user’s speech against the expected sequence to identify any discrepancies, such as added, omitted, or replaced sounds. The second tier evaluates 35 speech attributes, covering the manners and places of articulation as well as other phonological features like voicing, identifying where these features are correctly or incorrectly applied. It generates a sequence indicating the presence (+) or absence (-) of each attribute, which is then compared to the canonical attribute sequence for that word or sentence. Phone-aid incorporates this sophisticated technology into an easy-to-use web interface that enables users to practice their pronunciation. Users can record their speech, which is then analyzed to highlight errors at both the phoneme and phonological levels. The tool allows practice with any word or short phrase, which the user enters in a text box before recording their speech. It also provides the option to use either Arpabet or International Phonetic Alphabet (IPA) symbols for reference. Phone-aid stands out as the first tool to assess pronunciation in terms of phonological features, providing not only a granular phoneme-level analysis but also the ability to identify changes in each attribute during pronunciation. The language-independent nature of its phonological features positions phone-aid as a trailblazer, laying the foundation for a universal pronunciation learning tool that transcends language barriers, contributing to a more inclusive and effective language education landscape.
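A simplified sketch of the two-tier comparison (illustrative only; the phoneme labels, attribute names, and sequences below are invented examples, not Phone-aid's internal format):

```python
# Sketch: tier-1 phoneme comparison and tier-2 attribute (+/-) comparison.
from difflib import SequenceMatcher

def phoneme_feedback(expected, recognized):
    """Tier 1: report added, omitted, and replaced phonemes."""
    feedback = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, expected, recognized).get_opcodes():
        if tag == "replace":
            feedback.append(f"replaced {expected[i1:i2]} with {recognized[j1:j2]}")
        elif tag == "delete":
            feedback.append(f"omitted {expected[i1:i2]}")
        elif tag == "insert":
            feedback.append(f"added {recognized[j1:j2]}")
    return feedback

def attribute_feedback(expected_attrs, detected_attrs):
    """Tier 2: per-attribute +/- sequences compared against the canonical one."""
    issues = {}
    for attr, expected_seq in expected_attrs.items():
        detected_seq = detected_attrs.get(attr, "")
        diffs = [i for i, (e, d) in enumerate(zip(expected_seq, detected_seq)) if e != d]
        if diffs:
            issues[attr] = diffs             # positions where the attribute deviates
    return issues

print(phoneme_feedback(["TH", "IH", "NG", "K"], ["S", "IH", "NG", "K"]))
# -> ["replaced ['TH'] with ['S']"]
print(attribute_feedback({"voicing": "-++-"}, {"voicing": "-+++"}))
# -> {'voicing': [3]}
```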

DEMO-1B.6: A web demo for real-time affect prediction using active data representation of acoustic features

Fasih Haider, Saturnino Luz
The COVID-19 pandemic has led to unprecedented restrictions on individuals' lifestyles, which have had a profound effect on their mental well-being. In the 2018/2019 period, around 12% of adults in Scotland reported symptoms associated with depression, while approximately 14% reported symptoms related to anxiety. Enabling remote emotion recognition through home devices would simplify the detection of these often neglected challenges, which, if left unattended, can lead to temporary or chronic disability. To address this problem, we developed an affect recognition system. This system uses a novel machine learning approach to identify and analyse emotions. It was trained using audio recordings collected during the winter lockdown period of COVID-19 in Scotland. The features are extracted from voice recordings acquired from household and portable devices such as phones and tablets. Hence, it provides valuable insights into the feasibility of remotely, automatically, and comprehensively monitoring individuals' mental well-being. The proposed model exhibits good predictive performance for affect, attaining a concordance correlation coefficient of 0.4230 for arousal with a Random Forest model, and 0.3354 for valence with Decision Trees. The front- and back-end are implemented using the Django web framework. The back-end of the web demo is based on voice activity detection (VAD), openSMILE, and active data representation. Upon receiving an audio recording, the system uses VAD to extract voice segments and openSMILE to extract eGeMAPS audio features, and then applies active data representation to extract compound features for affect prediction. The front-end is a graphical representation of affect on an arousal-valence plane, with specific moods indicated. A dot is plotted based on the predicted arousal and valence values. The system also provides users with an option to record an audio file or upload an existing recording for processing.
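A condensed sketch of the back-end pipeline, assuming the open-source opensmile Python package for eGeMAPS extraction; the VAD, the active data representation transform, and the trained regressors are placeholders for the authors' components.

```python
# Sketch: VAD -> openSMILE eGeMAPS functionals -> active data representation -> affect.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # eGeMAPS functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

def predict_affect(wav_path, vad, adr_transform, arousal_model, valence_model):
    segments = vad(wav_path)                            # voiced (start, end) times in seconds
    feats = [smile.process_file(wav_path, start=s, end=e) for s, e in segments]
    compound = adr_transform(feats)                     # active data representation features
    arousal = arousal_model.predict(compound)[0]        # Random Forest in the paper
    valence = valence_model.predict(compound)[0]        # Decision Tree in the paper
    return arousal, valence                             # plotted as a dot on the arousal-valence plane
```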

DEMO-1B.7: Enhancing Human-AI Interaction through Emotionally Responsive Voice Recognition

Soroosh Mashal, Mehrzad Mashal, Alican Akman, Dagmar Schuller, Felix Burkhardt, Florian Eyben, Bjoern Schuller
In the rapidly evolving field of artificial intelligence, the integration of emotional intelligence in human-AI interactions represents a significant leap forward. This demo aims to showcase an innovative Emotionally Responsive Assistant that elevates the standard of communication between humans and AI-powered agents. This assistant uniquely combines Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Large Language Model (LLM) technologies with audEERING’s Emotion Recognition model for detecting emotions from voice. Its dual-running mode processes conversations both with standard ASR, TTS, and LLM technologies alone and with emotion detection added. This approach lets participants compare the impact of emotion recognition in AI interactions, showcasing enhanced AI responsiveness and empathy. This technology is especially relevant in areas like customer service, therapy, education, and human-computer interaction research. The emotion recognition model is trained on a large dataset of annotated audio clips, allowing the deep-learning transformer models to learn the complex patterns and nuances that characterize different emotions. The use of annotated datasets is a critical aspect of the Emotion AI system, as it enables the models to achieve high accuracy in classifying emotions. In conclusion, our demonstration at ICASSP 2024 offers attendees a unique opportunity to experience the future of AI-human interaction. It showcases how integrating emotional intelligence into AI can significantly enhance its ability to interact with humans in a more natural, understanding, and empathetic manner. This emotionally responsive assistant stands as a testament to the potential of AI in revolutionizing the way we communicate and interact with technology.
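Schematically, the dual-running mode can be sketched as follows; all components (ASR, emotion model, LLM, TTS) are abstracted as callables, and the prompt format is an assumption for illustration, not audEERING's implementation.

```python
# Sketch: the same conversation handled with and without emotion conditioning.
def respond(audio, asr, emotion_model, llm, tts, emotion_aware=True):
    text = asr(audio)                        # transcribe the user's turn
    if emotion_aware:
        emotion = emotion_model(audio)       # e.g. {"arousal": 0.8, "valence": 0.2}
        prompt = (f"The user sounds {emotion}. "
                  f"Reply empathetically to: {text}")
    else:
        prompt = text                        # baseline mode: ASR + LLM + TTS only
    reply = llm(prompt)
    return tts(reply)                        # synthesized spoken response
```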

DEMO-1B.8: Real-Time Audio Deepfake Detection on Limited-Resource Devices

Hung Dinh-Xuan, Thien-Phuc Doan, Long Nguyen-Vu, Kihun Hong, Souhwan Jung
In the rapidly evolving field of AI, we have seen an explosion of artificially generated multimedia content, highlighting how authentic synthetic examples have become. Yet, alongside generative AI’s advantages, it raises significant concerns about the misuse of technology to impersonate members of the public, spread disinformation, and commit fraud. Traditionally, synthetic voice research has focused on batch inference models. These models are inadequate for providing immediate warnings against fraudulent calls. Therefore, we urgently need a real-time detection system that integrates essential features:
• Fast inference speed, with a latency under 20 milliseconds.
• The ability to process short audio segments.
• A streamlined model architecture to ensure minimal overhead in environments with constrained resources.
• Performance on par with larger and more complex models.
Our approach is based on Knowledge Distillation (KD) and Quantization to pursue the objectives above. We adopt a state-of-the-art model from prior work as the teacher, which enables the transfer of knowledge to a more streamlined model. This compact model results from combining distillW2v2 and AASIST-L. Lastly, we leverage Quantization to further compress the model by reducing the bit precision of the weight parameters. This optimized model is deployed within a mobile application that supports streaming inference. Upon activation, the application continuously analyzes the streamed audio from the phone’s microphone in real time. It makes decisions regarding the authenticity of the speech signal and whether it is machine-generated. Our methodology represents one of the first attempts at providing a real-time audio deepfake detection solution.
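A generic sketch of the two compression steps in PyTorch; the student and teacher here are placeholders rather than the distillW2v2/AASIST-L checkpoints, and the temperature and loss weighting are illustrative choices.

```python
# Sketch: knowledge-distillation loss plus post-training dynamic quantization.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target KL term (knowledge transferred from the large teacher) plus
    # the usual cross-entropy on the bona fide / spoof labels.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

def quantize_for_mobile(student: nn.Module) -> nn.Module:
    # Post-training dynamic quantization: linear layers drop to int8 weights,
    # shrinking the model for on-device streaming inference.
    return torch.quantization.quantize_dynamic(student, {nn.Linear}, dtype=torch.qint8)
```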

DEMO-1B.9: Brain-Computer Interface on HoloLens 2 platform

Ivan Tashev, David Johnston
The input modalities of the user interface in augmented and virtual reality glasses are usually voice and gestures. Under high noise levels, or when the wearer's hands are busy holding equipment or covered by protective gloves, neither input modality can be used. In such conditions a Brain-Computer Interface (BCI) can make the device usable. In this demo we propose coarse-to-fine selection of the action button. A button group, represented by a single-color dot or marker, is first selected using gaze. Once the button group is selected, it activates, and several flashing buttons appear. The user pays attention to one of the buttons, which is detected by processing the electroencephalographic (EEG) signals acquired by electrodes retrofitted into the device. This method is known as Steady State Visual Evoked Potentials (SSVEP) and has the main disadvantage of constantly flashing buttons, which users find annoying. Another problem with SSVEP is the limited number of buttons, owing to the useful range of flashing frequencies, usually between 10 and 15 Hz. Coarse-to-fine selection addresses both issues: only a few flashing frequencies are needed, and the number of button groups is limited only by their separation in 3D space. The demo consists of a HoloLens 2 augmented reality device equipped with eight EEG electrodes and control software. The scenario is a maintenance worker selecting various technical documents for viewing while servicing machinery with both hands busy, in presumably noisy conditions. A laptop screen shows the wearer's point of view with all the buttons and selected documents. The BCI system achieves, in real time, a reaction time close to one second with detection accuracy above 90%.
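For context, a common SSVEP decoding recipe scores the canonical correlation between an EEG window and sinusoidal references at each flashing frequency and picks the best-matching one; the sketch below illustrates that idea and is not necessarily the classifier used in this demo.

```python
# Sketch: SSVEP frequency detection via canonical correlation analysis (CCA).
import numpy as np
from sklearn.cross_decomposition import CCA

def ssvep_detect(eeg, fs, freqs, n_harmonics=2):
    """eeg: (samples, channels) window; returns the flashing frequency the user attends to."""
    t = np.arange(eeg.shape[0]) / fs
    scores = []
    for f in freqs:
        # Reference set: sines and cosines at the target frequency and its harmonics.
        ref = np.column_stack(
            [np.sin(2 * np.pi * f * h * t) for h in range(1, n_harmonics + 1)] +
            [np.cos(2 * np.pi * f * h * t) for h in range(1, n_harmonics + 1)]
        )
        cca = CCA(n_components=1)
        x_c, y_c = cca.fit_transform(eeg, ref)
        scores.append(np.corrcoef(x_c[:, 0], y_c[:, 0])[0, 1])
    return freqs[int(np.argmax(scores))]

# e.g. ssvep_detect(window, fs=250, freqs=[10, 12, 15]) -> the attended frequency
```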