Organized by: Lies Bollens, Corentin Puffay, Bernd Accou, Hugo Van hamme, and Tom Francart
Various neuroimaging techniques can be used to investigate how the brain processes sound. Electroencephalography (EEG) is popular because it is relatively easy to conduct and has a high temporal resolution. Besides fundamental neuroscience research, EEG-based measures of auditory processing in the brain are also helpful in detecting or diagnosing potential hearing loss. They enable differential diagnosis of populations that cannot otherwise be tested, such as young children or people with mental disabilities. In addition, there is a growing field of research in which auditory attention is decoded from the brain, with potential applications in smart hearing aids. An increasingly popular method in these fields is to relate a person’s EEG to a feature of the natural speech signal they were listening to. This is typically done using linear regression to predict the EEG signal from the stimulus or to decode the stimulus from the EEG. Given the very low signal-to-noise ratio of the EEG, this is a challenging problem, and several non-linear methods have been proposed to improve upon the linear regression methods. In the Auditory-EEG challenge, teams will compete to build the best model to relate speech to EEG. We provide a large auditory EEG dataset containing data from 105 subjects who each listened to an average of 108 minutes of single-speaker stimuli, for a total of around 200 hours of data. We define two tasks:
Task 1 (match-mismatch): given 5 segments of speech and a segment of EEG, which speech segment matches the EEG?
Task 2 (regression): reconstruct the mel spectrogram from the EEG. We provide the dataset, code for preprocessing the EEG and for creating commonly used stimulus representations, and two baseline methods. A minimal sketch of the linear approach is shown below.
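By way of illustration only, the following minimal sketch shows one form the linear decoding approach mentioned above can take: a ridge-regression backward model that reconstructs a speech envelope from time-lagged EEG. The arrays, lags, and regularisation value are hypothetical placeholders, and this is not the challenge's provided baseline code.

```python
import numpy as np

def lagged(eeg, max_lag):
    """Stack time-lagged copies of the EEG channels (lags 0..max_lag samples)."""
    T, C = eeg.shape
    X = np.zeros((T, C * (max_lag + 1)))
    for lag in range(max_lag + 1):
        X[lag:, lag * C:(lag + 1) * C] = eeg[:T - lag]
    return X

def fit_ridge_decoder(eeg, envelope, max_lag=16, alpha=1e3):
    """Ridge-regression decoder that reconstructs the speech envelope from EEG."""
    X = lagged(eeg, max_lag)
    XtX = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ envelope)

def decode(eeg, weights, max_lag=16):
    return lagged(eeg, max_lag) @ weights

# Hypothetical usage: random arrays stand in for preprocessed EEG and the speech
# envelope; the correlation between the decoded and true envelope is a common
# measure of how well such linear models capture neural tracking of speech.
rng = np.random.default_rng(0)
eeg = rng.standard_normal((3200, 64))     # e.g. 50 s of 64-channel EEG at 64 Hz
envelope = rng.standard_normal(3200)      # matching speech envelope
w = fit_ridge_decoder(eeg, envelope)
r = np.corrcoef(decode(eeg, w), envelope)[0, 1]
```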
Organized by: P. Filntisis and N. Efthymiou
The 2nd e-Prevention challenge (https://robotics.ntua.gr/icassp2024-eprevention-spgc/) aims to stimulate innovative research on the prediction and identification of mental health relapses via the analysis and processing of the digital phenotype of patients in the psychotic spectrum. The challenge will offer participants access to long-term continuous recordings of raw biosignals captured from wearable sensors, namely accelerometers, gyroscopes, and heart rate monitors embedded in a smartwatch, as well as supplemental data such as sleep schedules, daily step counts, and demographics.
Participants will be evaluated on their ability to use this data to extract digital phenotypes that can effectively quantify behavioral patterns and traits. This will be assessed across two distinct tasks: 1) Detection of non-psychotic relapses, and 2) Detection of psychotic relapses, both in patients within the psychotic spectrum.
The extensive data that will be used in this challenge have been sourced from the e-Prevention project (https://eprevention.gr/), an innovative integrated system for medical support that facilitates effective monitoring and relapse prevention in patients with mental disorders.
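As a rough, hypothetical illustration of how such digital phenotypes might be used, the sketch below summarises one day of wearable data into a small feature vector and scores new days against a patient's stable period with a Mahalanobis-style distance. The features, data shapes, and random placeholders are invented for the example and do not reflect the challenge's protocol or baselines.

```python
import numpy as np

def daily_features(acc, hr, steps):
    """Summarise one day of hypothetical wearable data into a feature vector."""
    return np.array([
        np.mean(np.linalg.norm(acc, axis=1)),  # mean accelerometer magnitude
        np.std(hr),                            # heart-rate variability proxy
        steps,                                 # daily step count
    ])

def anomaly_scores(train_days, test_days):
    """Score test days by their Mahalanobis-like distance from the stable period."""
    mu = train_days.mean(axis=0)
    cov = np.cov(train_days, rowvar=False) + 1e-6 * np.eye(train_days.shape[1])
    inv = np.linalg.inv(cov)
    diff = test_days - mu
    return np.einsum('ij,jk,ik->i', diff, inv, diff)

# Hypothetical usage with simulated sensor streams (one sample every 10 s over a
# day); days with high scores could be flagged as possible relapse days.
rng = np.random.default_rng(1)
def fake_day():
    return daily_features(rng.standard_normal((8640, 3)),      # accelerometer
                          70 + 5 * rng.standard_normal(8640),  # heart rate
                          int(rng.integers(2000, 12000)))      # step count
stable = np.stack([fake_day() for _ in range(30)])
candidate = np.stack([fake_day() for _ in range(7)])
scores = anomaly_scores(stable, candidate)
```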
Organized by: Lei Xie, Eng Siong Chng, Zhuo Chen, Jian Wu, Longbiao Wang, Hui Bu, Xin Xu, Binbin Zhang, Wei Chen, Pan Zhou, He Wang, Pengcheng Guo, and Sun Li
As cars become an indispensable part of daily life, a safe and comfortable driving environment is increasingly desirable. Traditional touch-based interaction in the cockpit easily distracts drivers, leading to inefficient operation and potential safety risks. As a natural user interface (NUI), speech-based interaction has therefore attracted growing attention. In-car speech interaction aims to create a seamless driving and cabin experience for drivers and passengers through various speech processing applications, such as speech recognition for command control, entertainment, navigation, and more. Unlike the automatic speech recognition (ASR) systems commonly deployed in household or meeting scenarios, in-car systems face unique challenges. Nevertheless, the lack of publicly available real-world in-car speech data has been a major obstacle to progress in the field. We therefore launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge (ICMC-ASR), tailored to speech recognition in complex driving conditions. In this challenge, we will release over 1,000 hours of real-world recorded, multi-channel, multi-speaker, in-car conversational Mandarin speech data, which includes far-field data collected by distributed microphones placed in the car as well as near-field data collected by each participant’s headset microphone. Additionally, over 400 hours of real in-car noise recorded by the far-field microphones will be available for participants to explore data simulation techniques. The challenge consists of two tracks, automatic speech recognition (ASR) and automatic speech diarization and recognition (ASDR), aiming to promote in-car speech recognition research and to explore the corresponding challenging research problems. After the challenge, we plan to open-source the data to provide the research community with a lasting resource in this area.
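To indicate one way the released noise recordings might be used for data simulation, here is a minimal sketch that mixes clean (e.g. near-field) speech with recorded in-car noise at a chosen signal-to-noise ratio. The arrays are placeholders for audio loaded from the actual corpus, and the SNR value is arbitrary; this is not the challenge's official simulation pipeline.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add recorded in-car noise to clean speech at a target SNR (in dB)."""
    noise = noise[:len(speech)]                      # crop noise to the utterance
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Hypothetical usage with placeholder arrays standing in for loaded WAV data.
rng = np.random.default_rng(2)
clean = rng.standard_normal(16000 * 5)               # 5 s of "speech" at 16 kHz
car_noise = rng.standard_normal(16000 * 10)          # recorded in-car noise
noisy = mix_at_snr(clean, car_noise, snr_db=5.0)
```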
Organized by: Prasanta Kumar Ghosh and Philipp Olbrich
This challenge is a continuation of LIMMITS'23 (ICASSP 2023 SPGC) and aims to make further progress in multi-speaker, multi-lingual TTS by extending the problem statement to voice cloning. Enabling voice cloning in multilingual TTS systems expands the possibilities for cross-lingual synthesis for target speakers. In this challenge, participants have the opportunity to perform TTS voice cloning with a multilingual base model of 14 speakers. We further extend this scenario by allowing training with additional multi-speaker corpora such as VCTK and LibriTTS. Finally, we also present a scenario for zero-shot voice conversion. Towards these goals, we share 560 hours of studio-quality TTS data in 7 Indian languages. Evaluation will be performed on monolingual as well as cross-lingual synthesis, with subjective tests of naturalness and speaker similarity.
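The official evaluation relies on subjective listening tests, but during development teams often track speaker similarity with an objective proxy such as the cosine similarity between speaker embeddings of the cloned and reference speech. The sketch below assumes embeddings from some pre-trained speaker-verification model; it is not part of the challenge's evaluation pipeline.

```python
import numpy as np

def speaker_similarity(emb_synth, emb_target):
    """Cosine similarity between speaker embeddings of synthesised and target speech."""
    a = emb_synth / np.linalg.norm(emb_synth)
    b = emb_target / np.linalg.norm(emb_target)
    return float(a @ b)

# Hypothetical usage: embeddings would come from any pre-trained speaker
# verification model; random vectors are used here as placeholders.
rng = np.random.default_rng(3)
sim = speaker_similarity(rng.standard_normal(192), rng.standard_normal(192))
```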
Organized by: Konstantinos Plataniotis, Juwei Lu, Pai Chet Ng and Zhixiang Chi
This ICASSP 2024 SPGC competition aims to reconstruct skin spectral reflectance in the visible (VIS) and near-infrared (NIR) spectral ranges from RGB images captured by everyday cameras, offering a transformative approach for cosmetic and beauty applications. By reconstructing skin spectral reflectance across both the VIS and NIR ranges, the competition aims to make rich hyperspectral information accessible on consumer devices. The reconstructed skin spectra pave the way for personalized beauty and skincare solutions delivered directly through consumers’ smartphones and other accessible devices. With the goal of democratizing skin analysis and advancing the field of beauty technology, this competition invites computer vision researchers, machine learning experts, and cosmetic professionals to contribute to a future where personalized beauty and skincare are accessible to all.
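For orientation only, a classical baseline for spectral reconstruction is a regularised linear mapping from RGB values to reflectance spectra, as sketched below. The number of spectral bands, the regularisation strength, and the random training data are hypothetical; the challenge defines its own spectral sampling, data, and evaluation.

```python
import numpy as np

def fit_rgb_to_spectrum(rgb, spectra, alpha=1e-3):
    """Ridge-regression mapping from RGB triplets to reflectance spectra."""
    X = np.hstack([rgb, np.ones((rgb.shape[0], 1))])   # add a bias term
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ spectra)

def predict_spectrum(rgb, W):
    X = np.hstack([rgb, np.ones((rgb.shape[0], 1))])
    return X @ W

# Hypothetical usage: 41 spectral bands (VIS plus NIR) are assumed purely for
# illustration, with random values standing in for measured reflectances.
rng = np.random.default_rng(4)
train_rgb = rng.random((500, 3))
train_spectra = rng.random((500, 41))
W = fit_rgb_to_spectrum(train_rgb, train_spectra)
pred = predict_spectrum(rng.random((10, 3)), W)
```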
Organized by: Tejas Jayashankar, Binoy Kurien, Alejandro Lancho, Gary Lee, Yury Polyanskiy, Amir Weiss and Gregory Wornell
This challenge requires developing an engine for signal separation of radio-frequency (RF) waveforms. At inference time, a superposition of a signal of interest (SOI) and an interfering signal will be fed to the engine, which should recover the SOI by performing sophisticated interference cancellation. The SOI is a digital communication signal whose complete description is available (modulation, pulse shape, timing, frequency, etc.). However, the structure of the interference will need to be learned from data. We expect successful contributions to adapt existing machine learning (ML) methods and/or propose new ones from areas such as generative modeling, variational auto-encoders, and U-Nets.
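The following sketch illustrates the mixture model underlying the task: a known SOI is superimposed with an interfering waveform at a target signal-to-interference ratio, and a candidate separation could be judged by MSE against the SOI (and by bit error rate after demodulation). The QPSK symbol stream and coloured-noise interference are stand-ins; the challenge supplies its own signal definitions and interference data.

```python
import numpy as np

def make_mixture(soi, interference, sinr_db):
    """Superimpose an interfering waveform on the SOI at a target SINR (dB)."""
    p_soi = np.mean(np.abs(soi) ** 2)
    p_int = np.mean(np.abs(interference) ** 2) + 1e-12
    k = np.sqrt(p_soi / (p_int * 10 ** (sinr_db / 10)))
    return soi + k * interference

# Hypothetical usage: a unit-power QPSK symbol stream stands in for the fully
# specified SOI, and coloured complex noise stands in for the unknown interference.
rng = np.random.default_rng(5)
symbols = (rng.integers(0, 2, 4096) * 2 - 1
           + 1j * (rng.integers(0, 2, 4096) * 2 - 1)) / np.sqrt(2)
interference = np.convolve(rng.standard_normal(4096) + 1j * rng.standard_normal(4096),
                           np.ones(8) / 8, mode='same')
mixture = make_mixture(symbols, interference, sinr_db=0.0)
# A separation engine would estimate the SOI from `mixture`; the MSE of that
# estimate against `symbols` is a natural evaluation measure.
mse = np.mean(np.abs(mixture - symbols) ** 2)
```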
Organized by: Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Marco Siniscalchi and Odette Scharenborg
Speech-enabled systems often experience performance degradation in real-world scenarios, primarily due to adverse acoustic conditions and interactions among multiple speakers. Enhancing the front-end speech processing technology is vital for improving the performance of the back-end systems. However, most existing front-end techniques are based solely on the audio modality and have reached performance plateaus. Building upon the observation that visual cues can aid human speech perception, the Multimodal Information Based Speech Processing (MISP) 2023 Challenge focuses on the Audio-Visual Target Speaker Extraction (AVTSE) problem, which aims to extract the target speaker’s speech from mixtures containing multiple speakers and background noise. The MISP 2023 challenge addresses this problem explicitly in a real scenario with a complex acoustic environment. It provides a benchmark dataset collected in home TV environments, reflecting the challenges of such settings. In addition, to explore the impact of AVTSE on the back-end task, we use a pre-trained speech recognition model to evaluate the performance of the AVTSE systems.
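During development, target-speaker extraction systems are often sanity-checked with signal-level metrics such as the scale-invariant signal-to-distortion ratio (SI-SDR) sketched below; note that the challenge's evaluation instead feeds the extracted speech to the fixed pre-trained ASR model mentioned above. The signals here are random placeholders.

```python
import numpy as np

def si_sdr(estimate, target):
    """Scale-invariant signal-to-distortion ratio (dB) between estimate and target."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, target) / (np.dot(target, target) + 1e-12)
    projection = scale * target
    noise = estimate - projection
    return 10 * np.log10((np.sum(projection ** 2) + 1e-12)
                         / (np.sum(noise ** 2) + 1e-12))

# Hypothetical usage with placeholder signals standing in for the true target
# speech and the output of an AVTSE system.
rng = np.random.default_rng(9)
target = rng.standard_normal(16000)
estimate = target + 0.1 * rng.standard_normal(16000)
score = si_sdr(estimate, target)
```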
Organized by: Ander Biguri and Subhadip Mukherjee
The proposed challenge seeks to push the limits of deep learning algorithms for 3D cone beam computed tomography (CBCT) reconstruction from low-dose projection data (sinograms). The key objective in medical CT imaging is to reduce the X-ray dose while maintaining image fidelity for accurate and reliable clinical diagnosis. In recent years, deep learning has been shown to be a powerful tool for tomographic image reconstruction, leading to images of higher quality than those obtained with classical, purely model-based variational approaches. Notwithstanding their impressive empirical success, the best-performing deep learning methods for CT (e.g., algorithm unrolling techniques such as learned primal-dual) do not scale to real-world clinical CBCT data. Moreover, the academic literature on deep learning for CT generally reports image recovery performance on the 2D reconstruction problem (on a slice-by-slice basis) as a proof of concept. Therefore, in order to fairly assess the applicability of these methods to real-world 3D clinical CBCT, it is imperative to set a benchmark on an appropriately curated medical dataset. The main goal of the challenge is to encourage deep learning practitioners and clinical experts to develop novel deep learning methodologies (or test existing ones) for clinical low-dose 3D CBCT imaging at different dose levels.
We will use an instance of the LIDC-IDRI public dataset for the challenge. This dataset contains 1010 3D CT images (obtained with a helical fan-beam CT) of chest scans of patients with lung nodules. We will provide simulated CBCT sinograms generated with our custom forward operator based on the ASTRA toolbox and a custom CT noise simulator. The noise simulator accounts for photon counts, flat fields, electronic sources, and detector cross-talk as sources of noise added to the simulated sinograms, so that it provides a fairly accurate model of scanner noise. Reconstructions using the FDK algorithm (the cone-beam equivalent of FBP) will also be provided. Reconstructed scans and sinograms corresponding to 50%, 25%, 10%, and 5% of the approximate clinical dose will be provided. The winner of the challenge will be decided based on the lowest average mean-squared error (MSE) of the reconstructed 3D volumes measured against the corresponding ground-truth (normal-dose) test scans.
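The ranking metric itself is straightforward; a minimal sketch of the per-volume MSE computation is shown below, with placeholder arrays and an invented volume size standing in for the actual reconstructed and normal-dose test volumes.

```python
import numpy as np

def volume_mse(reconstruction, ground_truth):
    """Mean-squared error between a reconstructed 3D volume and the normal-dose reference."""
    return float(np.mean((reconstruction - ground_truth) ** 2))

# Hypothetical usage: the challenge ranks submissions by the average of this
# error over the test scans; the volume size here is purely illustrative.
rng = np.random.default_rng(6)
gt = rng.random((64, 256, 256))              # (slices, rows, cols)
recon = gt + 0.01 * rng.standard_normal(gt.shape)
score = volume_mse(recon, gt)
```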
Organized by: Michael Akeroyd, Scott Bannister, Jon Barker, Trevor Cox, Bruno Fazenda, Simone Graetzer, Alinka Greasley, Graham Naylor, Gerardo Roa, Rebecca Vos, and William Whitmer
Someone with a hearing loss is listening to music via their hearing aids or headphones. The challenge is to develop a signal processing system that allows a personalised rebalancing of the music to improve the listening experience, for example by amplifying the vocals relative to the sound of the band. One approach would be to demix the music and then apply gains to the separated tracks to change the balance when the music is downmixed to stereo. There is a global challenge of an ageing population, which will contribute to 1 in 10 people having disabling hearing loss by 2050. Hearing loss causes problems when listening to music. It can make picking out lyrics more difficult, and music can become duller as high frequencies disappear. This reduces the enjoyment of music and can lead to disengagement from listening and music-making, reducing the health and well-being benefits people otherwise get from music. We want more of the ICASSP community to consider diverse hearing, so that those with a hearing loss can benefit from the latest signal processing advances.
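A minimal sketch of the demix-then-remix idea described above is given below: per-stem gains are applied before downmixing back to stereo. The stem names, gain values, and simple peak protection are illustrative assumptions, not the challenge's prescribed processing chain.

```python
import numpy as np

def rebalance(stems, gains_db):
    """Apply per-stem gains (dB) and downmix the separated stems back to stereo."""
    out = np.zeros_like(next(iter(stems.values())))
    for name, audio in stems.items():          # audio: (samples, 2) stereo stem
        out += audio * 10 ** (gains_db.get(name, 0.0) / 20)
    peak = np.max(np.abs(out)) + 1e-12
    return out / peak if peak > 1.0 else out   # simple peak protection

# Hypothetical usage: stems would come from a music source-separation model;
# here the listener preference boosts vocals by 6 dB relative to the band.
rng = np.random.default_rng(7)
stems = {name: 0.1 * rng.standard_normal((44100, 2))
         for name in ("vocals", "drums", "bass", "other")}
stereo = rebalance(stems, {"vocals": 6.0})
```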
Organized by: Ross Cutler, Ando Saabas, Lorenz Diener, and Solomiya Branets
Packet loss concealment (PLC) is an important part of audio telecommunications technology and codec development, and machine learning approaches to PLC are now becoming viable for practical use. Packet loss, whether through missing packets or high packet jitter, is one of the main causes of speech quality degradation in Voice over IP calls.
With the ICASSP 2024 Audio Deep Packet Loss Concealment Challenge, we intend to stimulate research in audio packet loss concealment. Initially, we will provide a validation set of audio files with high rates of packet loss, degraded by removing segments corresponding to lost packets from real recordings of packet loss events, together with lost-packet annotations and the corresponding clean reference files. Participants will be able to use this dataset to validate their approaches. Towards the end of the challenge, we will provide a blind test set constructed in the same way (audio files and lost-packet annotations, without references). Building on the previous PLC Challenge at INTERSPEECH 2022, this challenge will feature an overall harder task and an improved evaluation procedure.
Submissions will have to fill the gaps in the test set audio files using at most 20 milliseconds of look-ahead, mirroring the tight requirements of real-time voice communication. We will evaluate each submission’s performance on the blind test set based on crowdsourced ITU-T P.804 mean opinion scores as well as speech recognition rate, and the three approaches with the best weighted average scores will be declared the winners.
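For context, the sketch below shows a trivial, zero look-ahead concealment baseline that simply repeats the last received frame over each lost frame, given a per-frame loss annotation. The frame length and loss pattern are assumptions for the example; competitive entries would use learned concealment methods instead.

```python
import numpy as np

def conceal_by_repetition(audio, lost, frame_len):
    """Replace each lost frame with a copy of the last received frame (zero look-ahead)."""
    out = audio.copy()
    last_good = np.zeros(frame_len, dtype=audio.dtype)
    for i, is_lost in enumerate(lost):
        frame = slice(i * frame_len, (i + 1) * frame_len)
        if is_lost:
            out[frame] = last_good[:out[frame].shape[0]]
        else:
            last_good = out[frame]
    return out

# Hypothetical usage: 20 ms frames at 16 kHz and a random per-frame loss
# annotation, as a simple reference point rather than a competitive method.
rng = np.random.default_rng(8)
audio = rng.standard_normal(16000)
lost = rng.random(50) < 0.2                    # 50 frames of 320 samples each
concealed = conceal_by_repetition(audio, lost, frame_len=320)
```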
Organized by: Ross Cutler, Ando Saabas, Babak Naderi, Nicolae-Catalin Ristea, Robert Aichner, and Sebastian Braun
The ICASSP 2024 Speech Signal Improvement Challenge is intended to stimulate research in improving the speech signal quality in communication systems. Speech signal quality, which can be measured with the SIG score in ITU-T P.835, is still a top issue in audio communication and conferencing systems. To improve SIG, the following speech impairment areas must be addressed: coloration, discontinuity, loudness, and reverberation. This is the second Speech Signal Improvement challenge, following the first held at ICASSP 2023. We improve on that challenge by providing a dataset synthesizer that allows all teams to start at a higher baseline, an objective metric for our extended P.804 tests, and the addition of Word Accuracy (WAcc) as a metric. We provide a real test set for this challenge, and the winners will be determined by a subjective test (using an extended crowdsourced implementation of ITU-T P.804) and WAcc.
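WAcc is simply one minus the word error rate; a minimal sketch of its computation from a Levenshtein alignment is shown below. The example strings are invented, and the challenge computes WAcc with its own ASR system and reference transcripts.

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy = 1 - word error rate, computed from the Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 1.0 - d[-1][-1] / max(len(ref), 1)

# Hypothetical usage with an invented reference/hypothesis pair.
wacc = word_accuracy("turn on the microphone", "turn the microphone")
```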