Technical Program

Paper Detail

Paper ID	E-2-2.3
Paper Title	Phoneme Embeddings on Predicting Fundamental Frequency Pattern for Electrolaryngeal Speech
Authors	Mohammad Eshghi, Kazuhiro Kobayashi, Nagoya University, Japan; Kou Tanaka, Hirokazu Kameoka, Nippon Telegraph and Telephone Corporation, Japan; Tomoki Toda, Nagoya University, Japan
Session	E-2-2: Speech Analysis
Time	Wednesday, 09 December, 15:30 - 17:00
Presentation Time:	Wednesday, 09 December, 16:00 - 16:15
	All times are in New Zealand Time (UTC +13)
Topic	Speech, Language, and Audio (SLA):
Abstract	Electrolaryngeal (EL) speech has a robotic quality owing to its constant fundamental frequency (F0) patterns. In existing F0 pattern prediction frameworks, acoustic models are trained on spectral features of a large corpus of healthy speech. However, the spectrogram of EL speech does not carry any useful information about F0. Moreover, creating datasets with a reasonably large number of EL utterances for training neural networks is very time-consuming. Hence, F0 prediction based on other features that can be shared between EL and normal speech must be investigated. In this study, we investigate F0 prediction based on clustering of phoneme embeddings. Phoneme labels are extracted for a dataset consisting of utterances of both speech types. These phoneme labels are then used to learn phoneme embeddings in a common 2-D space. By clustering the learned phoneme embeddings, new one-hot features are created for F0 prediction. Experimental results show that, for training sets consisting of mixed EL and normal speech utterances, the new features improve F0 prediction accuracy. Moreover, accurate F0 patterns can be predicted even from lower-dimensional features corresponding to a small number of clusters. This could simplify the structure of the recognition system required to extract phoneme labels from EL speech.
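The feature construction described in the abstract (2-D phoneme embeddings, clustering, one-hot cluster features) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the embedding coordinates are invented, k-means with a farthest-first initialization stands in for whatever clustering method the authors used, and the names `kmeans`, `one_hot`, and `frame_labels` are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iters=20):
    """Lloyd's k-means on 2-D points; returns one cluster index per point.

    Assumption: k-means is used only as an example clustering method.
    Centers are initialized deterministically (farthest-first) so the
    sketch gives the same result on every run.
    """
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return assign


def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


# Hypothetical 2-D phoneme embeddings (invented values, not from the paper).
embeddings = {
    "a": (0.9, 1.0), "i": (1.1, 0.8), "u": (1.0, 1.2),
    "k": (-1.0, -0.9), "t": (-0.8, -1.1), "s": (-1.2, -1.0),
}

K = 2  # number of clusters = dimensionality of the new one-hot feature
phonemes = list(embeddings)
clusters = kmeans([embeddings[p] for p in phonemes], K)
phoneme_to_cluster = dict(zip(phonemes, clusters))

# Hypothetical frame-level phoneme labels for one utterance.
frame_labels = ["k", "a", "a", "t", "u"]
features = [one_hot(phoneme_to_cluster[p], K) for p in frame_labels]
```

In this toy setup the one-hot feature has only K dimensions rather than one per phoneme, which mirrors the abstract's point that a small number of clusters yields lower-dimensional input features for F0 prediction.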