Research On Discriminative Techniques Of Feature Extraction And Acoustic Model Training In Continuous Speech Recognition

Posted on: 2016-05-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Chen
Full Text: PDF
GTID: 1108330482979240
Subject: Signal and Information Processing
Abstract/Summary:
In a traditional speech recognition system, cepstral features are extracted and the acoustic model is trained under the maximum likelihood criterion. However, the temporal dynamics captured by cepstral features and their differential features are limited, and the confusion information among models is not used effectively, so the features are insufficiently discriminative. Maximum likelihood training optimizes the parameters of each class individually under certain assumptions and ignores the relationships among classes, which makes it difficult to obtain the best acoustic model. Discriminative techniques can effectively alleviate these problems. Discriminative feature extraction uses the confusion information to extract long-term features that are more discriminative and robust. Discriminative acoustic model training takes full account of the interaction between models and focuses on adjusting the decision surfaces between them, which usually reduces error rates and improves system performance.

This dissertation studies discriminative techniques for feature extraction and acoustic model training in continuous speech recognition.

Discriminative feature extraction is studied in the feature space and in the model space, yielding the following three contributions:

(1) Feature-space discriminative feature extraction is studied. To improve performance when the data distribution is complex, a linear discriminant analysis method based on the minimum classification error criterion is proposed and applied to feature transformation in continuous speech recognition. The probability distribution of the data is estimated using non-parametric kernel density estimation.
Given the estimated distribution, a gradient-descent-based linear search is performed to obtain the discriminant analysis transformation matrix under the minimum classification error criterion. The transformation matrix is then used to reduce the dimensionality of the super-vector formed by concatenating the Mel filter bank outputs of adjacent frames, which yields the time-frequency feature. Experimental results show that the new method achieves higher recognition accuracy than traditional methods.

(2) Further, to extract stable long-term features from insufficient data, a group-Lasso-based mixture discriminant analysis method is proposed. First, a Gaussian mixture model is used to describe the data distribution, and the objective function of group-Lasso-based mixture discriminant analysis is derived from the quadratic variational form of the group Lasso. Next, by defining a blurred response matrix, the discriminant analysis transform matrix is computed with the optimal scoring method. Finally, the super-vector is formed by concatenating the Mel filter bank outputs of adjacent frames, and the time-frequency feature is extracted by reducing the dimensionality of the super-vector with the transform matrix. Experimental results show that the new method achieves even higher recognition performance in noisy environments and when training data are scarce.

(3) Model-space discriminative feature extraction is studied. A segment-based discriminative feature transform method is presented to improve on the stability of the frame-based method.
Feature transformation is treated as a sparse high-dimensional approximation problem with an over-complete dictionary, which is constructed from the feature transforms obtained by tied-state-based training of RDLT (Region Dependent Linear Transform) and mean-offset fMPE (feature Minimum Phone Error). Using matching pursuit for iterative optimization, the transforms and coefficients of each speech segment obtained by forced alignment are determined automatically under the maximum likelihood criterion, and a correlation measure is introduced to remove correlated feature bases during the iterations. To obtain a more precise transformation, the result of matching pursuit is used as the initial value, and an appropriate regularization term is added to the likelihood objective function. The optimal transform matrix and its coefficients for each test speech segment are then chosen automatically using the fast iterative shrinkage-thresholding algorithm. The segment-based transformed feature is then combined with the frame-based bottleneck feature, and the acoustic model is trained on the combined feature. Experimental results show that the new method achieves better recognition performance than the traditional RDLT method and is more robust.

Discriminative acoustic model training is studied in terms of training criteria, training data selection, and complementary system construction, yielding the following three contributions:

(4) Discriminative training criteria for the acoustic model are studied, and a generalized-margin-based discriminative training criterion is proposed. The different discriminative training criteria are unified in one theoretical framework, and two novel discriminative training objective functions are designed.
By analyzing the relationship between the different discriminative training objective functions, with MMI (Maximum Mutual Information) as the separation measure, these objective functions are unified into a discriminative training criterion based on a generalized margin. The weighting function in the criterion is then examined, leading to two discriminative objective functions. When each candidate path is weighted by a combination of a boosting factor and the number of misrecognized words in the path, the SBMMI (Soft Boosted MMI) objective function is obtained. When each individual candidate word is weighted dynamically in exponential form, with the misrecognition rate of each training utterance defined by the posterior probability of the single candidate, the VWMMI (Variable Weighting MMI) objective function is obtained. Experimental results show that SBMMI achieves higher recognition accuracy than soft margin estimation and boosted maximum mutual information, and that VWMMI yields a further improvement over SBMMI.

(5) Discriminative training data selection is studied. To select training data effectively and reduce the computational cost of the speech recognition system, a training data selection method based on variable weighting is proposed. First, the lattice is pruned using a posterior-probability-based beam algorithm, and each individual candidate word is then given a variable weight according to the misrecognition rate of each training utterance, which is defined by the posterior probability of the single candidate. Second, the phone accuracy is computed with penalty weights that are assigned to confusable phones according to the degree of confusion of each phone pair. Third, after the distribution of the expected phone accuracy of the candidate arcs is estimated, all arcs are soft-weighted with a Gaussian function.
Finally, the data are selected by combining posterior probability with phone accuracy. Experimental results show that, compared with the minimum phone error criterion, the variable weighting method achieves higher recognition accuracy and effectively reduces training time.

(6) Discriminative complementary system construction is studied. Existing methods for building complementary systems lack a strong theoretical basis and do not describe the differences among complementary systems accurately. A complementary system generation method based on confusion information weighting is therefore proposed within the framework of discriminative training. First, each pair of confusable phones is weighted dynamically according to the phone confusion information, and the weighted phone accuracy is computed with the three best hypothesis paths of the base system as the reference. Meanwhile, the standard phone accuracy is computed with the true transcription as the reference. A model-space complementary system is then constructed by maximizing the weighted phone accuracy while simultaneously minimizing the standard phone accuracy. Furthermore, by combining the model-space complementary system generation method with the RDLT feature transform process, a feature-space complementary system is constructed. Experimental results show that the proposed method increases the diversity among the complementary systems. Compared with the complementary minimum phone error criterion, combining the base system with the model-space complementary system increases the recognition rate. The largest performance gain is obtained when the base system is combined with both the feature-space and model-space complementary systems.
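
As an illustration of the super-vector step used in contributions (1) and (2), the following is a minimal sketch of concatenating the Mel filter bank outputs of adjacent frames and reducing the dimensionality with a linear transform. It is not the dissertation's implementation: the function names `splice_frames` and `project`, the edge-padding at utterance boundaries, and the sizes (40 filter bank bins, 5 frames of context on each side, 39 output dimensions) are assumptions, and the transform matrix here is a random placeholder standing in for the matrix that the discriminant analysis would learn.

```python
import numpy as np

def splice_frames(fbank, context):
    """Stack each frame with `context` neighbours on each side into a super-vector.

    fbank: array of shape (num_frames, num_bins), e.g. log Mel filter bank outputs.
    Returns an array of shape (num_frames, (2 * context + 1) * num_bins).
    """
    num_frames = fbank.shape[0]
    # Assumption: repeat the first and last frames so that every frame
    # has a full context window at the utterance boundaries.
    padded = np.pad(fbank, ((context, context), (0, 0)), mode="edge")
    # Shifted views: the i-th view holds frame t - context + i for each frame t.
    windows = [padded[i:i + num_frames] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)

def project(supervectors, transform):
    """Apply a linear dimensionality-reducing transform to each super-vector.

    transform: array of shape (out_dim, supervector_dim).
    """
    return supervectors @ transform.T

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
fbank = rng.standard_normal((100, 40))       # 100 frames, 40 filter bank bins
supervectors = splice_frames(fbank, context=5)
# Placeholder for a learned discriminant analysis matrix (random here).
transform = rng.standard_normal((39, supervectors.shape[1]))
features = project(supervectors, transform)
print(supervectors.shape, features.shape)    # (100, 440) (100, 39)
```

In a real system the projection matrix would come from the training procedure described in the abstract; only the splicing and projection mechanics are shown here.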
Keywords/Search Tags: Continuous Speech Recognition, Acoustic Model, Discriminative Training, Linear Discriminant Analysis, Feature Transform, Regularization Method, Region Dependent, System Combination