
Deep Neural Network Based Acoustic Feature Extraction For LVCSR Systems

Posted on: 2015-01-26    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y B Bao    Full Text: PDF
GTID: 1268330428999925    Subject: Signal and Information Processing
Abstract/Summary:
In recent years, the resurgence of deep neural networks (DNNs) has had a strong impact on many research areas and attracted growing attention. In speech recognition, DNN techniques have been reported to significantly improve the performance of acoustic models, making them a new research hotspot. When applied to acoustic modeling, a DNN is typically used either to (1) build a hybrid architecture with the HMM (i.e., DNN-HMM), replacing the GMM for computing the state emission probabilities, or to (2) act as a front-end acoustic feature extractor that provides more effective acoustic features for the traditional GMM-HMM modeling framework. This thesis focuses on the latter, i.e., DNN-based acoustic feature extraction and its application in large vocabulary continuous speech recognition (LVCSR) systems. It presents our research and innovations on DNN-based Tandem (probabilistic) feature extraction and bottleneck feature extraction.

Firstly, this thesis constructs a new set of phoneme modeling units for Mandarin LVCSR systems. The construction involves segmenting the Final part of Initial/Final units into several phoneme units, revising the phoneme units and building the corresponding lexicon according to prior knowledge, and, for the first time, designing the question set based on the extended vowel triangle. The new phoneme unit set is more compact, with reduced redundancy and overlap and enhanced discrimination between distinct units. When a neural network is used to extract Tandem features, the number of output-layer nodes can be effectively reduced because of the relatively small number of phoneme units. This lowers the perplexity of the network's output layer and benefits Tandem feature extraction. Our experimental results suggest that the new phoneme unit set outperforms the conventional Initial/Final unit set in recognition accuracy, both for baseline GMM-HMM modeling and for Tandem feature extraction.

Secondly, this thesis builds a DNN-based bottleneck feature extraction baseline system and improves it with several heuristic techniques. To extract bottleneck features, a hidden layer with a very small number of nodes, often equal to the dimension of the MFCC or PLP features, is placed in the middle of the DNN. This hidden layer is aptly called the bottleneck layer; the DNN with this special structure is called a bottleneck DNN, and the outputs of the bottleneck layer are the baseline bottleneck features. By applying heuristic techniques such as de-correlation with the linear transformation method PCA, augmenting the bottleneck features with their first- and second-order derivatives, and re-adjusting the relative weight of acoustic and language model scores with an acoustic scaling factor during decoding, the performance of the bottleneck features is improved significantly in our experiments and becomes comparable to state-of-the-art DNN-HMM hybrid models. Among these techniques, the acoustic scaling factor is the most important. A minimal sketch of this extraction pipeline follows.
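The extraction pipeline just described can be summarized in a short numpy sketch. This is an illustrative sketch, not the thesis's exact implementation: the parameter lists `weights`/`biases`, the index `bn_layer`, the sigmoid hidden nonlinearity, and the choice of taking the bottleneck outputs before the nonlinearity are all assumptions made here for concreteness.

    import numpy as np

    def extract_bottleneck(x, weights, biases, bn_layer):
        # x: (frames, input_dim) acoustic frames; weights/biases: per-layer
        # parameters of a trained bottleneck DNN. Forward through sigmoid
        # hidden layers and return the bottleneck layer's linear outputs.
        h = x
        for i, (W, b) in enumerate(zip(weights, biases)):
            h = h @ W + b
            if i == bn_layer:
                return h                       # bottleneck outputs = features
            h = 1.0 / (1.0 + np.exp(-h))       # sigmoid hidden nonlinearity
        raise ValueError("bn_layer exceeds network depth")

    def pca_decorrelate(feats):
        # De-correlate features with a full-rank PCA rotation (no reduction).
        centered = feats - feats.mean(axis=0)
        _, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
        return centered @ eigvecs

    def add_derivatives(feats):
        # Augment with first- and second-order time derivatives (deltas).
        d1 = np.gradient(feats, axis=0)
        d2 = np.gradient(d1, axis=0)
        return np.hstack([feats, d1, d2])

    # Hypothetical usage on a batch of frames:
    # feats = add_derivatives(pca_decorrelate(
    #             extract_bottleneck(frames, weights, biases, bn_layer=2)))

The third heuristic, the acoustic scaling factor, enters only at decoding time, where it reweights the acoustic log-likelihood against the language model score before the best path is selected.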
Thirdly, this thesis proposes two novel incoherent training methods for the bottleneck DNN. The first minimizes the coherence of the weight matrix of the bottleneck layer, while the second minimizes the correlation coefficients of the bottleneck features computed over each mini-batch during DNN training. The basic idea of both incoherent training methods is to add a regularization term to the original objective function of DNN training. The regularization term directly measures and controls the correlation among the activation signals in the bottleneck layer so as to derive better de-correlated bottleneck features, which are even more suitable for subsequent GMM-HMM modeling with diagonal covariance matrices (a minimal sketch of this penalty is given at the end of this abstract). Experimental results show that the two proposed incoherent training methods yield further gains over the baseline bottleneck features, and that discriminatively trained GMM-HMM models using the incoherently trained bottleneck features consistently surpass the popular hybrid DNN-HMM models on all evaluated LVCSR tasks.

Lastly, this thesis leverages sequential discriminative training (SDT) to improve the bottleneck DNN and derive better bottleneck features. SDT has brought significant improvements within the GMM-HMM framework. The inter-frame sequential information captured by the SDT objective function is important for speech recognition, yet it is lacking in traditional DNN training methods (mainly the frame-based cross-entropy method). We therefore use SDT to further optimize the parameters of the bottleneck DNN, and at the same time adopt two novel bottleneck DNN structures for better performance. In both structures the bottleneck layer is placed at the last hidden layer; the first keeps all other hidden layers at the same width, while the second alternates wide and narrow hidden layers. The experiments show that SDT helps produce better bottleneck features, and that performance improves further when the bottleneck layer is moved to the last hidden layer. Moreover, the second proposed structure helps reduce the computational cost of bottleneck feature extraction caused by moving the bottleneck layer backward, with almost no loss in performance.
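For concreteness, here is a minimal numpy sketch of the regularizer used by the second incoherent training method, computed on the bottleneck activations of one mini-batch. The exact penalty form (sum of squared off-diagonal correlation coefficients) and the trade-off weight `lam` are illustrative assumptions; in training, this term is added to the cross-entropy objective and backpropagated like any other loss component.

    import numpy as np

    def incoherence_penalty(H, eps=1e-8):
        # H: (batch, bn_dim) bottleneck activations for one mini-batch.
        # Build the correlation-coefficient matrix of the bottleneck units
        # and penalize its off-diagonal entries, pushing the features
        # toward de-correlation.
        Hc = H - H.mean(axis=0)
        Hn = Hc / (Hc.std(axis=0) + eps)       # unit-variance columns
        corr = (Hn.T @ Hn) / H.shape[0]        # correlation matrix
        off_diag = corr - np.diag(np.diag(corr))
        return np.sum(off_diag ** 2)

    # Regularized mini-batch objective (lam is a tunable weight):
    # loss = cross_entropy(outputs, targets) + lam * incoherence_penalty(H)

Driving the off-diagonal correlations toward zero is what makes the resulting features a better match for GMMs with diagonal covariance matrices, which implicitly assume uncorrelated feature dimensions.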
Keywords/Search Tags: Deep Neural Network (DNN), Large Vocabulary Continuous Speech Recognition (LVCSR), Tandem Feature, Bottleneck Feature, Incoherent Training, Discriminative Training