Technical Program

Paper Detail

Paper ID	E-2-3.6
Paper Title	Effects of End-to-end ASR and Score Fusion Model Learning for Improved Query-by-example Spoken Term Detection
Authors	Takumi Kurokawa, Atsuhiko Kai, Hiroki Kondo, Shizuoka University, Japan
Session	E-2-3: Speech Recognition
Time	Wednesday, 09 December, 17:15 - 19:15
Presentation Time:	Wednesday, 09 December, 18:30 - 18:45
	All times are in New Zealand Time (UTC +13)
Topic	Speech, Language, and Audio (SLA):
Abstract	Query-by-example spoken term detection (STD) systems can make effective use of automatic speech recognition (ASR), especially when recognition accuracy is high. However, the out-of-vocabulary (OOV) problem at the ASR stage has a significant impact on STD performance in speech retrieval, and it often arises for query terms. Recent studies have shown that end-to-end (E2E) ASR systems can achieve performance competitive with conventional DNN-HMM-based ASR systems and can reduce the impact of the OOV problem by adopting characters or subwords as output units. This paper proposes applying an E2E ASR system in an STD method that considers acoustic similarity at the sub-phone level, and combining it with DNN-HMM-based ASR and auxiliary information through a score fusion method. Experimental results on the NTCIR-12 SpokenQuery&Doc-2 task showed that the STD method using the hybrid CTC/Transformer E2E ASR improved search performance over the STD method using the DNN-HMM-based ASR. The best detection performance was obtained with a score fusion model, demonstrating that combining E2E ASR and auxiliary information with DNN-HMM-based ASR is effective for both known-word and OOV-word queries.
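
The abstract does not specify the form of the score fusion model, so the following is only an illustrative sketch of the general idea: per-candidate detection scores from two STD systems plus one auxiliary feature are combined by a learned model into a single detection score. Logistic regression is assumed here purely for illustration; the function names (`train_fusion`, `fuse`), the feature layout, and the toy data are all hypothetical and are not taken from the paper.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fuse(weights, bias, features):
    """Return a fused detection score in (0, 1) for one candidate."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def train_fusion(samples, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression fusion weights by batch gradient descent."""
    n_feat = len(samples[0])
    weights, bias = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(samples, labels):
            err = fuse(weights, bias, x) - y  # gradient of log loss w.r.t. z
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias

# Made-up toy data (assumption, not from the paper).
# Each row: [e2e_asr_score, dnn_hmm_score, auxiliary_feature]; label 1 = true hit.
samples = [[0.9, 0.8, 0.5], [0.8, 0.3, 0.6], [0.2, 0.7, 0.4], [0.85, 0.9, 0.7],
           [0.1, 0.2, 0.5], [0.3, 0.1, 0.3], [0.2, 0.3, 0.6], [0.15, 0.25, 0.4]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train_fusion(samples, labels)
hit = fuse(weights, bias, [0.9, 0.85, 0.5])    # both systems score high
miss = fuse(weights, bias, [0.1, 0.15, 0.5])   # both systems score low
```

In this toy setup a candidate supported by either system can still receive a high fused score, which is one plausible way a fusion model could help when one ASR system fails on an OOV query term.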