AASP-P30.10

SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models

Qiaolin Wang, Xilin Jiang, Linyang He, Columbia University, United States of America; Junkai Wu, University of Washington, United States of America; Nima Mesgarani, Columbia University, United States of America

Session:
AASP-P30: Audio for Video and Multimedia Poster

Track:
Audio and Acoustic Signal Processing [AA]

Location:
Poster Area 25

Presentation Time:
Fri, 8 May, 14:00 - 16:00

Session AASP-P30
AASP-P30.1: Temporally Heterogeneous Graph Contrastive Learning for Multimodal Acoustic Event Classification
Yuanjian Chen, Harbin University of Science and Technology, China; Yang Xiao, The University of Melbourne, Australia; Jinjie Huang, Harbin University of Science and Technology, China
AASP-P30.2: StereoFoley: Object-Aware Stereo Audio Generation from Video
Tornike Karchkhadze, UC San Diego, United States of America; Kuan-Lin Chen, Mojtaba Heydari, Robert Henzel, Alessandro Toso, Mehrez Souden, Joshua Atkins, Apple, United States of America
AASP-P30.3: Learning What to Hear: Boosting Sound-Source Association for Robust Audiovisual Instance Segmentation
Jinbae Seo, Hyeongjun Kwon, Kwonyoung Kim, Yonsei University, Korea, Republic of; Jiyoung Lee, Ewha Womans University, Korea, Republic of; Kwanghoon Sohn, Yonsei University, Korea, Republic of
AASP-P30.4: Efficient Audio-Visual Inference via Token Clustering and Modality Fusion
Chenjie Pan, Yi Zhu, Songkai Ning, Xiangyang Liu, Weiping Zhen, Chenyou Fan, South China Normal University, China
AASP-P30.5: V2A-DPO: Omni-Preference Optimization for Video-to-Audio Generation
Nolan Chan, The Chinese University of Hong Kong, Hong Kong; Timmy Gang, National Research Council Canada, Canada; Yongqian Wang, The University of Warwick, United Kingdom of Great Britain and Northern Ireland; Yuzhe Liang, Shanghai Jiao Tong University, China; Dingdong Wang, The Chinese University of Hong Kong, Hong Kong
AASP-P30.6: AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation
Le Wang, China University of Mining and Technology, China; Jun Wang, Chunyu Qiang, Feng Deng, Chen Zhang, Kun Gai, Kuaishou Technology, China
AASP-P30.7: SIREN: Spatially-Informed Reconstruction of Binaural Audio with Vision
Mingyeong Song, Seoyeon Ko, Junhyug Noh, Ewha Womans University, Korea, Republic of
AASP-P30.8: Asynchrony-Aware Decoupled Multimodal Control for Cued Speech Video Generation
Fengji Ma, Hong Kong University of Science and Technology (Guangzhou), China; Xiao-Ping Zhang, Tsinghua Berkeley Shenzhen Institute, China; Li Liu, Hong Kong University of Science and Technology (Guangzhou), China
AASP-P30.9: Visual Keys to Symphonies: Latent Diffusion for Multi-Scene Video-to-Music Generation
Chiu Fai Ng, Karsper So, Jing Yang, Patricio Ovalle, Simon Lui, Fan Fan, Central Media Technology Institute, Huawei, Hong Kong; Yuhan Dong, Shenzhen International Graduate School, Tsinghua University, China
AASP-P30.10: SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models
Qiaolin Wang, Xilin Jiang, Linyang He, Columbia University, United States of America; Junkai Wu, University of Washington, United States of America; Nima Mesgarani, Columbia University, United States of America