
Research On Speaker Adaptation Of Neural Network Acoustic Models For Speech Recognition

Posted on: 2019-11-02  Degree: Master  Type: Thesis
Country: China  Candidate: D Gu  Full Text: PDF
GTID: 2428330542499278  Subject: Electronic Science and Technology

Abstract/Summary:
In recent years, deep neural networks (DNNs) have made great progress in automatic speech recognition, significantly improving recognition performance over the traditional Gaussian mixture model. However, DNNs, like other supervised-learning approaches, are susceptible to performance degradation caused by the mismatch between training and test conditions. Speaker adaptation is used to deal with the mismatch caused by speaker variability, and speaker adaptation of neural network acoustic models has become an active research direction in speech recognition. This thesis focuses on feature-based and model-based adaptation approaches, combines them to improve performance, and also investigates adaptation approaches in the convolutional neural network architecture. The main research contents are as follows:

First, to address the DNN acoustic model's weak discrimination of speaker information, an auxiliary-features approach is introduced in which the acoustic feature vectors are augmented with additional speaker-specific features as input to the DNN, enhancing the network's awareness of speaker information. The experimental results show that this feature augmentation improves the discrimination of the network and reduces inter-speaker differences, thereby lowering the system's word error rate.

Second, the model-based method Learning Hidden Unit Contributions (LHUC) is investigated in depth, and two strategies are adopted to improve the original method. Considering the complementarity between adaptation methods, LHUC is combined with auxiliary features to further improve system performance. In addition, to address data sparsity in the adaptation stage, multi-task learning (MTL) is employed to adapt the LHUC parameters by adding auxiliary phone classification as a second task. The experimental results show that the fused method effectively improves how well the model matches a specific speaker and further reduces the system's word error rate, while MTL-LHUC expands the coverage of the acoustic space to deal with the unseen-state problem and performs better when only a limited amount of adaptation data is available.

Finally, speaker adaptation is explored in the convolutional neural network (CNN) framework. To retain the advantages of the model-based LHUC method, it is extended to CNN-based acoustic models; experiments comparing adaptive layers placed in the convolution layers, the pooling layers, and even the input layer confirm that the method works. Meanwhile, an i-vector-based method is proposed that inserts the i-vector into the convolution layers through a transform matrix, yielding a new convolutional layer structure that reduces inter-speaker differences while still extracting local information. The experiments demonstrate that this method effectively reduces the word error rate relative to the baseline system at the cost of only a small number of additional parameters. Moreover, fusing the two methods further improves system performance and achieves the best result in the unsupervised setting.
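To make the auxiliary-features idea concrete, the following minimal PyTorch sketch appends a per-speaker i-vector to every acoustic frame before the first hidden layer. The layer sizes, the i-vector dimension, and the sigmoid activations are illustrative assumptions, not the configuration used in the thesis.

```python
import torch
import torch.nn as nn

class AugmentedDNN(nn.Module):
    """Feed-forward acoustic model whose input is each acoustic frame
    concatenated with a fixed speaker embedding such as an i-vector.
    All dimensions here are illustrative assumptions."""

    def __init__(self, feat_dim=40, ivec_dim=100, hidden_dim=1024, num_senones=3000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + ivec_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, num_senones),  # senone logits; softmax lives in the loss
        )

    def forward(self, feats, ivector):
        # feats: (num_frames, feat_dim); ivector: (ivec_dim,), one per speaker
        ivec = ivector.expand(feats.size(0), -1)           # repeat for every frame
        return self.net(torch.cat([feats, ivec], dim=-1))  # speaker-aware input
```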
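LHUC itself can be sketched in a few lines: each hidden unit's output is rescaled by a speaker-dependent amplitude 2*sigmoid(r), with r initialised to zero so the speaker-independent model is reproduced exactly, and only r is updated during adaptation. Again, a minimal illustration rather than the thesis's implementation.

```python
import torch
import torch.nn as nn

class LHUCLayer(nn.Module):
    """Hidden layer with Learning Hidden Unit Contributions: a per-speaker
    vector r rescales each hidden unit by 2*sigmoid(r), which equals 1.0 at
    the initial value r = 0, so the unadapted network is recovered exactly."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.r = nn.Parameter(torch.zeros(out_dim))  # speaker-dependent scalers

    def forward(self, x):
        return 2.0 * torch.sigmoid(self.r) * torch.sigmoid(self.linear(x))

def freeze_all_but_lhuc(model):
    # Adaptation updates only the LHUC scalers; all SI weights stay fixed.
    for name, p in model.named_parameters():
        p.requires_grad = name.split(".")[-1] == "r"
```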
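The MTL-LHUC strategy can likewise be sketched as a shared trunk with a senone head and an auxiliary monophone head whose cross-entropy losses are interpolated; the interpolation weight and output sizes below are assumed for illustration only.

```python
import torch.nn as nn
import torch.nn.functional as F

class MTLAcousticModel(nn.Module):
    """LHUC-adapted trunk with two heads: the main senone classifier and an
    auxiliary monophone classifier. Adapting the LHUC scalers on the summed
    loss lets frames whose senones are unseen in the adaptation data still
    contribute through the much smaller phone inventory."""

    def __init__(self, trunk, hidden_dim=1024, num_senones=3000, num_phones=120):
        super().__init__()
        self.trunk = trunk                    # e.g. a stack of LHUCLayer blocks
        self.senone_head = nn.Linear(hidden_dim, num_senones)
        self.phone_head = nn.Linear(hidden_dim, num_phones)

    def forward(self, x):
        h = self.trunk(x)
        return self.senone_head(h), self.phone_head(h)

def mtl_loss(senone_logits, phone_logits, senones, phones, lam=0.3):
    # lam is an illustrative interpolation weight, not a value from the thesis
    return (1.0 - lam) * F.cross_entropy(senone_logits, senones) \
           + lam * F.cross_entropy(phone_logits, phones)
```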
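For the i-vector-based CNN method, one plausible reading of "inserting the i-vector into the convolution layers through a transform matrix" is to project the i-vector to a per-channel offset that is added to the convolution output at every time-frequency position. The sketch below implements that reading and should not be taken as the thesis's exact layer structure.

```python
import torch
import torch.nn as nn

class IVectorConv2d(nn.Module):
    """Convolutional layer that injects speaker information: the i-vector is
    mapped by a learned transform matrix to one offset per output channel and
    added across the whole feature map, so the layer still extracts local
    patterns while compensating for inter-speaker differences."""

    def __init__(self, in_ch, out_ch, kernel_size, ivec_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding="same")
        self.proj = nn.Linear(ivec_dim, out_ch, bias=False)  # transform matrix

    def forward(self, x, ivector):
        # x: (batch, in_ch, time, freq); ivector: (batch, ivec_dim)
        bias = self.proj(ivector)                       # (batch, out_ch)
        return torch.relu(self.conv(x) + bias[:, :, None, None])
```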
Keywords/Search Tags:speech recognition, speaker adaptation, deep neural network, auxiliary features, LHUC, multi-task learning, convolutional neural network, i-vector