MMSP-P17.4

mCoT-VLA: Towards Robust Vision-Language-Action Models via Multimodal Chain-of-Thought

Huazhen Huang, Juncai Zhang, Jianbo Zhao, Fangyu Liu, Hao Wang, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China

Session:
MMSP-P17: Vision-Language Models: Reasoning, Benchmarks, and Adaptation Poster

Track:
Multimedia Signal Processing [MM]

Location:
Poster Area 42

Presentation Time:
Thu, 7 May, 09:00 - 11:00

Session MMSP-P17
MMSP-P17.1: GEOMETRY-AWARE RECONSTRUCTION OF LARGE VISION-LANGUAGE MODELS FROM DENSE INTO MIXTURE-OF-EXPERTS
Heng Zhang, South China Normal University, China; Haichuan Hu, Alibaba Cloud, China; Lubin Gan, University of Science and Technology of China, China; Haochen You, Columbia University, United States of America; Weihao Yu, Research Institute of China Telecom Corporate Ltd, China; Jin Huang, South China Normal University, China
MMSP-P17.2: CAN VISION LANGUAGE MODELS PERCEIVE GRAPHS ACCURATELY? A VISUAL GRAPH PERCEPTION EVALUATION BENCHMARK
Ruiqi Zhou, Yudong Li, Shiqi Yan, Tsinghua University, China; Guoliang Ma, Xinjiang University, China; Yongfeng Huang, Tsinghua University, China
MMSP-P17.3: ENCORE: ENTROPY-GUIDED CROPPING AND ATTENTION REGULARIZATION FOR ROBUST VISION–LANGUAGE UNDERSTANDING
Yuanhao Sun, Huawei Ji, Jiaxin Ding, Luoyi Fu, Xinbing Wang, Shanghai Jiao Tong University, China
MMSP-P17.4: mCoT-VLA: Towards Robust Vision-Language-Action Models via Multimodal Chain-of-Thought
Huazhen Huang, Juncai Zhang, Jianbo Zhao, Fangyu Liu, Hao Wang, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
MMSP-P17.5: MULTI-OS: MULTIMODAL OOD SYNTHESIS ENHANCES OUT-OF-DISTRIBUTION DETECTION FOR VISION-LANGUAGE MODELS
Weizhi Wu, Jinlun Ye, Jiankang Chen, Dexia Chen, Ruixuan Wang, Sun Yat-sen University, China
MMSP-P17.6: MAPROUTE-BENCH: EVALUATING SPATIAL REASONING ON TOP-VIEW MAPS IN VISION-LANGUAGE MODELS
Ao Cheng, National University of Defense Technology, China; Jianguo Ma, Intelligent Game and Decision Lab, China; Fei Yang, National University of Defense Technology, China; Zhengding Luo, Nanyang Technological University, Singapore; Jiyuan Chen, Chunping Qiu, Intelligent Game and Decision Lab, China
MMSP-P17.7: DYNAMIC LANGUAGE ADAPTATION AND COLLABORATIVE MEMORY MODELING FOR VISION-LANGUAGE TRACKING
Guomao Guo, Gu Geng, Youqiang Xiong, Huayi Zhu, Pengfei Wei, Rui Chen, Di Yuan, Xidian University, China
MMSP-P17.8: MULTI-TURN PHYSICS-INFORMED VISION-LANGUAGE MODEL FOR PHYSICS-GROUNDED ANOMALY DETECTION
Yao Gu, ShanghaiTech University, China; Xiaohao Xu, University of Michigan, United States of America; Yingna Wu, ShanghaiTech University, China
MMSP-P17.9: FASTAV: EFFICIENT TOKEN PRUNING FOR AUDIO-VISUAL LARGE LANGUAGE MODEL INFERENCE
Chaeyoung Jung, Youngjoon Jang, Seungwoo Lee, Joon Son Chung, Korea Advanced Institute of Science & Technology, Republic of Korea
MMSP-P17.10: FEATURE PROJECTION LEARNING FOR BETTER VISION-LANGUAGE REASONING
Yi Zhang, Weicheng Lin, Liang-Jie Zhang, Shenzhen University, China
MMSP-P17.11: DISTILLING SYNERGISTIC KNOWLEDGE FROM A FUSION TEACHER FOR SAR OBJECT DETECTION
Jialei Ni, Yinghua Wang, Hongwei Liu, Xidian University, China
MMSP-P17.12: EMODRIVE: AN EMOTION-AWARE VISION-LANGUAGE MODEL FOR HUMAN-CENTRIC AUTONOMOUS DRIVING
Xiangwen Zhang, Beijing Technology and Business University, China; Zeke Zexi Hu, The University of Sydney, Australia; Chen Wang, Xiaoming Chen, Beijing Technology and Business University, China; Qiang Qu, The University of Sydney, Australia
MMSP-P17.13: CHROMOUVQA: BENCHMARKING VISION-LANGUAGE MODELS UNDER CHROMATIC CAMOUFLAGED IMAGES
Yunfei Zhang, Amazon, United States of America; Yizhuo He, Google, United States of America; Yuanxun Shao, MurcuryMind, United States of America; Zhengtao Yao, Haoyan Xu, University of Southern California, United States of America; Junhao Dong, Nanyang Technological University, Singapore; Zhen Yao, Lehigh University, United States of America; Zhikang Dong, Stony Brook University, United States of America
MMSP-P17.14: TRAINING-FREE TEST-TIME ADAPTATION WITH BROWNIAN DISTANCE COVARIANCE IN VISION-LANGUAGE MODELS
Yi Zhang, Shenzhen University, China; Chun-Wun Cheng, University of Cambridge, United Kingdom of Great Britain and Northern Ireland; Angelica I. Aviles-Rivero, Tsinghua University, China; Zhihai He, Southern University of Science and Technology, China; Liang-Jie Zhang, College of Computer Science and Software Engineering, China
MMSP-P17.15: LATENT DOMAIN PROMPT LEARNING FOR VISION-LANGUAGE MODELS
Zhixing Li, Arsham Gholamzadeh Khoee, Yinan Yu, Chalmers University of Technology, Sweden