Training candidate selection for effective out-of-set rejection in robust open-set language identification

J Acoust Soc Am. 2018 Jan;143(1):418. doi: 10.1121/1.5017608.

Abstract

Research in open-set language identification (LID) generally focuses on in-set language modeling rather than out-of-set (OOS) language rejection. However, rejection of unknown/OOS languages is essential for effective speech and language pre-processing. To address this, an approach for OOS language selection is proposed. Using probe OOS data, three effective OOS candidate selection methods are developed to achieve universal OOS language coverage. The selected OOS candidates are expected to reflect the entire OOS language space within a state-of-the-art i-vector LID system followed by a Gaussian back-end. Two of the proposed selection strategies operate on front-end features: (i) unsupervised k-means clustering and (ii) complementary candidate selection. A third, (iii) general candidate selection, exploits language relationships observed at the score level. All methods are evaluated on a large-scale corpus (LRE-09) containing 40 languages. The proposed selection methods reduce OOS training data diversity by 86% while achieving performance similar to that of a closed-set baseline trained on all probe OOS data. The proposed methods also show clear benefits over random candidate selection (i.e., they sustain performance while employing a minimum number of effective OOS language candidates). To the best of our knowledge, this is the first major effort on effective OOS language selection and enhancement for improved OOS rejection in open-set LID.
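
As a rough illustration of strategy (i), the Python sketch below clusters probe OOS languages in i-vector space and keeps the language nearest each cluster centroid as a training candidate. The function name select_oos_candidates, the per-language mean i-vector representation, the use of scikit-learn's KMeans, and the cluster count are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_oos_candidates(language_ivectors, n_clusters=5, seed=0):
    """Pick one representative OOS candidate language per k-means cluster.

    language_ivectors: dict mapping language name -> (n_utts, dim) array
        of i-vectors extracted from probe OOS data.
    Returns a list of selected candidate language names.
    """
    # Represent each probe OOS language by its mean i-vector (an assumed
    # summary; other per-language embeddings would also work).
    names = sorted(language_ivectors)
    means = np.stack([language_ivectors[l].mean(axis=0) for l in names])

    # Cluster the language-level means; each cluster groups languages
    # that occupy a similar region of the i-vector space.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(means)

    # From each cluster, keep only the language closest to the centroid,
    # so the selected candidates still span the full OOS space.
    selected = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(means[idx] - km.cluster_centers_[c], axis=1)
        selected.append(names[idx[np.argmin(dists)]])
    return selected

# Usage with random stand-in "i-vectors" (600-dim, a common choice):
rng = np.random.default_rng(0)
probe = {f"lang{i:02d}": rng.normal(size=(50, 600)) for i in range(20)}
print(select_oos_candidates(probe, n_clusters=5))
```

Strategy (iii) would apply the same representative-selection idea to Gaussian back-end score vectors rather than front-end i-vectors, so that candidate languages are chosen by how the in-set models score them rather than by their acoustic-feature geometry.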

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.
  • Research Support, Non-U.S. Gov't