IEEE ICASSP 2026 || Barcelona, Spain || 4-8 May 2026

MMSP-P23.1: Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing

Yaru Chen, University of Surrey, United Kingdom of Great Britain and Northern Ireland; Ruohao Guo, Peking University, China; Liting Gao, Yang Xiang, Qingyu Luo, University of Surrey, United Kingdom of Great Britain and Northern Ireland; Zhenbo Li, China Agricultural University, China; Wenwu Wang, University of Surrey, United Kingdom of Great Britain and Northern Ireland

MMSP-P23.2: SOUNDING HIGHLIGHTS: DUAL-PATHWAY AUDIO ENCODERS FOR AUDIO-VISUAL VIDEO HIGHLIGHT DETECTION

Seohyun Joo, Gwangju Institute of Science and Technology, Korea, Republic of; Yoori Oh, Seoul National University, Korea, Republic of

MMSP-P23.3: CONSTRUCTING COMPOSITE FEATURES FOR INTERPRETABLE MUSIC-TAGGING

Chenhao Xue, University of Oxford, United Kingdom of Great Britain and Northern Ireland; Weitao Hu, Independent Researcher, United Kingdom of Great Britain and Northern Ireland; Joyraj Chakraborty, Zhijin Guo, Kang Li, University of Oxford, United Kingdom of Great Britain and Northern Ireland; Tianyu Shi, University of Toronto, Canada; Martin Reed, Nikolaos Thomos, University of Essex, United Kingdom of Great Britain and Northern Ireland

MMSP-P23.4: An End-to-End Multimodal System for Subtitle Recognition and Chinese-Japanese Translation in Short Dramas

Jing An, Beijing International Studies University, China; Haofei Chang, Renmin University of China, China; Rui-Yang Ju, Kyoto University, Japan; Jinhua Su, Renmin University of China / Simashuhui Ltd., China; Yanbing Bai, Renmin University of China, China; Xin Qu, Beijing International Studies University, China

MMSP-P23.5: Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs

Han Yin, Jung-Woo Choi, Korea Advanced Institute of Science and Technology, Korea, Republic of

MMSP-P23.6: ROVLM: REGION-AWARE OPTIMAL VISION-LANGUAGE ALIGNMENT FOR ZERO-SHOT RECOGNITION

Feng Guo, Zhongshu Chen, Yunqian Yu, Mengmeng Jing, Lin Zuo, University of Electronic Science and Technology of China, China

MMSP-P23.7: GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining

Shentong Mo, Carnegie Mellon University, United States of America; Zehua Chen, Jun Zhu, Tsinghua University, China

MMSP-P23.8: REALCOUNT: ROBUST OPEN-WORLD OBJECT COUNTING VIA DUPLEX CONTRASTIVE LEARNING

Ziqiang Shi, Rujie Liu, Fujitsu Research & Development Center Co., Ltd., China; Jun Takahashi, Shan Jiang, Fujitsu Limited, Japan

MMSP-P23.9: AVO-65: A LARGE-SCALE HIERARCHICAL AUDIO-VISUAL OBJECT DATASET

Zehao Yao, Guanghui Zhang, Lei Wang, Dongchen Zhu, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, China

MMSP-P23.10: HARMONET: MUSIC GROUNDING BY SHORT VIDEO VIA HARMONIC RESAMPLE AND DYNAMIC SPARSE ALIGNMENT

Yaomin Shen, Nanchang Research Institute, Zhejiang University, China; Wei Fan, Independent Researcher, China; Haichuan Hu, Alibaba Cloud, China; Xinqi Liu, The University of Hong Kong, Hong Kong; Min Yang, Nanchang Research Institute, Zhejiang University, China; Rui Jia, East China Normal University, China; Junbiao Cai, Independent Researcher, China