Paper Detail

Paper ID D-3-3.6
Paper Title TEMPORAL ATTENTION FEATURE ENCODING FOR VIDEO CAPTIONING
Authors Nayoung Kim, Ewha W. University, Korea (South); Seong Jong Ha, NCSOFT, Korea (South); Jewon Kang, Ewha W. University, Korea (South)
Session D-3-3: Image and video processing based on deep learning
Time Thursday, 10 December, 17:30 - 19:30
Presentation Time: Thursday, 10 December, 18:45 - 19:00
All times are in New Zealand Time (UTC +13)
Topic Image, Video, and Multimedia (IVM): Special Session: Image and video processing based on deep learning
Abstract In this paper, we propose a novel video captioning algorithm comprising a feature encoder (FENC) and a decoder architecture to provide a more accurate and richer representation. Our network model incorporates feature temporal attention (FTA) to efficiently embed important events into a feature vector. In FTA, the proposed feature is given as the weighted fusion of the video features extracted from a 3D CNN, which allows the decoder to know when the feature is activated. In the decoder, feature word attention (FWA) similarly weights elements of the encoded feature vector, determining which elements should be activated to generate the appropriate word. Training is further facilitated by a new loss function that reduces the variance of word frequencies. Experimental results demonstrate that the proposed algorithm outperforms conventional algorithms on VATEX, a recent large-scale dataset for long-term video sentence generation.
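The mechanisms named in the abstract (temporal attention over 3D-CNN features, element-wise word attention, and a word-frequency variance penalty) can be sketched in code. The PyTorch snippet below is a minimal, hypothetical illustration, not the authors' implementation: the module names, dimensions, scoring networks, and the `balanced_caption_loss` regularizer are assumptions about one plausible form of FTA, FWA, and the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureTemporalAttention(nn.Module):
    """Hypothetical FTA sketch: attention-weighted fusion of 3D-CNN
    clip features over time, conditioned on the decoder state."""
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim + hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, clip_feats: torch.Tensor, dec_state: torch.Tensor):
        # clip_feats: (B, T, feat_dim) features from a 3D CNN
        # dec_state:  (B, hidden_dim) current decoder hidden state
        query = dec_state.unsqueeze(1).expand(-1, clip_feats.size(1), -1)
        e = self.score(torch.cat([clip_feats, query], dim=-1)).squeeze(-1)  # (B, T)
        alpha = F.softmax(e, dim=1)  # temporal attention weights
        fused = torch.bmm(alpha.unsqueeze(1), clip_feats).squeeze(1)  # (B, feat_dim)
        return fused, alpha

class FeatureWordAttention(nn.Module):
    """Hypothetical FWA sketch: element-wise gate selecting which
    components of the fused feature drive the next-word prediction."""
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(feat_dim + hidden_dim, feat_dim)

    def forward(self, fused_feat: torch.Tensor, dec_state: torch.Tensor):
        beta = torch.sigmoid(self.gate(torch.cat([fused_feat, dec_state], dim=-1)))
        return beta * fused_feat  # gated feature for the word predictor

def balanced_caption_loss(logits, targets, lam: float = 0.1):
    """One plausible reading of the variance-reducing loss (an assumption,
    not the paper's formula): cross-entropy plus a penalty on the variance
    of the average predicted word distribution."""
    ce = F.cross_entropy(logits, targets)
    mean_probs = F.softmax(logits, dim=-1).mean(dim=0)  # average usage per word
    return ce + lam * mean_probs.var()

# Example shapes: batch of 2 videos, 8 clips, 1024-d features, 512-d decoder.
fta = FeatureTemporalAttention(1024, 512)
fwa = FeatureWordAttention(1024, 512)
feats, state = torch.randn(2, 8, 1024), torch.randn(2, 512)
fused, alpha = fta(feats, state)
gated = fwa(fused, state)  # (2, 1024), fed to the decoder's word layer
```

Read this way, FTA compresses a variable-length clip sequence into one vector per decoding step, FWA acts as a per-dimension gate on that vector, and the variance term discourages the decoder from over-using a few frequent words.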