MMSP-L1.3

TOWARDS PRACTICAL AND EFFICIENT IMAGE-TO-SPEECH CAPTIONING WITH VISION-LANGUAGE PRE-TRAINING AND MULTI-MODAL TOKENS

Minsu Kim, Jeongsoo Choi, KAIST, Republic of Korea; Soumi Maiti, Carnegie Mellon University, United States of America; Jeong Hun Yeo, KAIST, Republic of Korea; Shinji Watanabe, Carnegie Mellon University, United States of America; Yong Man Ro, KAIST, Republic of Korea

Session:
MMSP-L1: Multimodal Processing: Vision + Language 1 Lecture

Track:
Multimedia Signal Processing

Location:
Room 201

Presentation Time:
Tue, 16 Apr, 17:10 - 17:30 (UTC +9)

Session Co-Chairs:
Jin Zeng, Tongji University, Shanghai, China and Fernando Pereira, IST, Portugal