
Research On Utterance Representation In Language Identification

Posted on: 2017-05-14
Degree: Master
Type: Thesis
Country: China
Candidate: R L Cui
GTID: 2308330485451802
Subject: Information and Communication Engineering
Abstract/Summary:
Language identification (LID) is the automatic process of identifying the language of a given speech utterance, and is a pattern recognition problem over speech utterances. A key question in LID is therefore how to obtain a representation that describes an utterance. Because it draws on core speech signal processing technologies such as feature extraction and speech recognition, LID is of scientific interest. As a front-end processing step, it also has extensive applications in multilingual speech recognition, cross-lingual communication systems, and military monitoring.

Traditional LID methods are mainly based on phoneme matching and acoustic features. Although existing methods have made significant progress on long-duration utterances, performance is still far from satisfactory for confusable dialects and short-duration utterances. The development and successful application of deep neural networks (DNNs) have opened a new research direction for language identification. On the front-end feature extraction side, Deep Bottleneck Feature-Total Variability (DBF-TV), which is based on the bottleneck hidden layer of a DNN, has been applied to LID successfully. On the back-end modeling side, the discriminative modeling capability of the DNN makes it possible to exploit its output layer, as in the DNN/i-Vector method, which uses the senone posteriors of the DNN output layer to estimate the universal background model (UBM). However, the DNN is trained with acoustic features as input and senone posteriors as output, so we believe that, from the input layer to the output layer, it successively reflects relatively complete information about speech, from acoustic characteristics to phoneme-related semantics, and that the information in different layers is complementary.
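The role of senone posteriors in a DNN/i-Vector system can be illustrated with a small sketch. It assumes the commonly described setup in which each frame's senone posterior acts as a soft alignment when accumulating zeroth- and first-order statistics over the frame-level features (here, the DBF). The function name `accumulate_stats` and the toy data are hypothetical; the thesis does not specify its implementation.

```python
from typing import List, Tuple

Vector = List[float]


def accumulate_stats(
    posteriors: List[Vector], features: List[Vector]
) -> Tuple[Vector, List[Vector]]:
    """Accumulate posterior-weighted statistics for one utterance.

    posteriors[t][c] is the (assumed) DNN posterior of senone c at frame t.
    features[t] is the frame-level feature vector, e.g. a DBF.
    Returns (zeroth, first), where
      zeroth[c] = sum_t posteriors[t][c]
      first[c]  = sum_t posteriors[t][c] * features[t]
    """
    if not posteriors or len(posteriors) != len(features):
        raise ValueError("need the same non-zero number of frames in both inputs")

    num_senones = len(posteriors[0])
    dim = len(features[0])
    zeroth = [0.0] * num_senones
    first = [[0.0] * dim for _ in range(num_senones)]

    for gamma, x in zip(posteriors, features):
        for c, weight in enumerate(gamma):
            zeroth[c] += weight
            for d, value in enumerate(x):
                first[c][d] += weight * value
    return zeroth, first


# Toy example: 2 frames, 2 senones, 2-dimensional features (made-up values).
toy_posteriors = [[1.0, 0.0], [0.5, 0.5]]
toy_features = [[2.0, 4.0], [1.0, 3.0]]
zeroth_stats, first_stats = accumulate_stats(toy_posteriors, toy_features)
```

In a full system these statistics would feed the total variability model; that step is omitted here.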
Therefore, this thesis derives utterance representations from different layers of the same DNN, specifically the intermediate bottleneck layer and the output layer.

Firstly, the frame-level senone features extracted from the DNN output layer can be regarded as a sequence of phoneme states, and their statistics can be computed as an utterance representation. Because the resulting representation is a vector, a discriminative model can classify it directly; specifically, we use an SVM with a kernel function suited to the characteristics of the representation. Because the information in different layers of the same DNN is complementary, fusing this method with DBF-TV improves the performance of the language recognition system.

Secondly, we build a DBF-based DNN/i-Vector baseline system using a DNN with an intermediate bottleneck layer, which simultaneously extracts the DBF from the bottleneck layer and clusters the DBF according to senones. This is a fusion in the model domain. Specifically, DNN/i-Vector uses the senone posteriors together with the DBF to calculate the UBM. Applying an Acoustic Factor Analysis (AFA) model to the DBF-based DNN/i-Vector system describes the feature space better and further improves recognition performance.

Finally, because the senone features are frame-level features extracted from the DNN output layer, a statistical model can also describe their distribution in the frame-level feature space to obtain the utterance representation, as with DBF. However, these features are generally high-dimensional, so they need to be analyzed in a low-dimensional subspace. We use Mixtures of Factor Analyzers (MFA), which combine dimension reduction and clustering, to study the characteristics of the features in the low-dimensional subspace. This is equivalent to clustering first and then using factor analysis within each cluster to map the features into a low-dimensional subspace. Compared with the statistical representation from the DNN output layer, this method yields some improvement, especially for short-duration utterances.
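The first approach, turning frame-level senone posteriors into a fixed-length utterance vector, can be sketched as follows. The abstract does not say which statistics are used, so this example assumes the simplest choice, the per-senone mean posterior over all frames. The function name `mean_posterior` and the toy values are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_posterior(frames: List[Vector]) -> Vector:
    """Average senone posteriors over frames to get one vector per utterance.

    frames[t][c] is the (assumed) DNN posterior of senone c at frame t.
    The output length equals the number of senones, whatever the utterance
    duration, so it can be passed to a vector classifier such as an SVM.
    """
    if not frames:
        raise ValueError("utterance has no frames")

    num_senones = len(frames[0])
    totals = [0.0] * num_senones
    for frame in frames:
        if len(frame) != num_senones:
            raise ValueError("all frames must have the same number of senones")
        for c, p in enumerate(frame):
            totals[c] += p
    return [total / len(frames) for total in totals]


# Toy example: a 2-frame utterance over 2 senones (made-up values).
utterance_vector = mean_posterior([[0.2, 0.8], [0.6, 0.4]])
```

Utterances of any length map to vectors of the same dimension, which is what allows a discriminative classifier to be applied directly.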
Keywords/Search Tags:Language Identification, Utterance Representation, Deep Neural Network, Senone Posteriors, Deep Bottleneck Feature