
Research on Deep Learning Based Low-Resource Speech Recognition

Posted on: 2017-12-25  Degree: Master  Type: Thesis
Country: China  Candidate: C X Qin  Full Text: PDF
GTID: 2428330596459981  Subject: Military Intelligence
Abstract/Summary:
With the development of speech recognition, higher demands are being placed on recognition systems. Low-resource speech recognition, a typical type of constrained speech recognition, has become a popular research topic because of its low recognition accuracy and high application value. Under low-resource conditions, both feature extraction and acoustic modeling are severely affected. Deep learning is commonly used to improve the performance of speech recognition systems, but how to improve the training of deep learning based models, and how to improve the models themselves, remain the two main open problems under low-resource conditions. To address these problems, this thesis studies feature extraction and acoustic modeling for low-resource speech recognition and makes the following contributions:

(1) A novel approach for extracting Deep Neural Network (DNN) based features is proposed. In the Tandem system, the Bottle-Neck (BN) layer degrades the classification accuracy of the DNN when BN features are extracted, so a new high-level feature extraction approach that does not change the DNN training structure is proposed. A DNN is first trained without a BN layer; then the Non-negative Matrix Factorization (NMF) algorithm is applied to the weight matrix of a chosen hidden layer. The resulting basis matrix is used as the weight matrix of the newly formed feature layer, and a new type of low-dimensional high-level feature is extracted by forward-passing the input data, with no bias vector in the new feature layer. Experiments show that the feature performs stably across different recognition tasks. With sufficient English training data, the proposed features achieve almost the same recognition performance as conventional BN features. In a low-resource setting with only 1 hour of Czech training data, the new Tandem system clearly outperforms both the DNN hybrid system and the BN-Tandem system in recognition accuracy.

(2) To alleviate the performance degradation that deep neural network based features suffer when transcribed training data is insufficient, two deep neural network based feature extraction approaches for low-resource speech recognition are proposed. First, high-resource corpora are used to help train a BN deep neural network with a shared-hidden-layer structure, and dropout, maxout, and rectified linear units (ReLU) are exploited to address the overfitting caused by the irregular distributions of multi-stream training samples, while also reducing the number of network parameters and the training time. Second, to further enhance multilingually trained DNN-based features, a method combining the Convex Non-negative Matrix Factorization (CNMF) algorithm with multilingual training is proposed: a shared-hidden-layer multilingual DNN is first trained, then the weight matrix of a chosen shared hidden layer is factorized and the resulting basis matrix is used as the weight matrix of the newly formed feature layer. Experiments on 1 hour of Vystadial 2013 Czech low-resource training data show that, with the help of 26.7 hours of English training data, the system obtains a 7.0% relative recognition accuracy improvement over the baseline when dropout and ReLU are applied, and a 12.6% relative improvement when dropout and maxout are applied, while reducing the network parameters by 62.7% and the training time by 25% relative to the other proposed systems. CNMF-based features outperform bottleneck features in both low-resource monolingual and multilingual training settings, and gain 0.8% to 3.4% word accuracy over state-of-the-art deep neural network hidden Markov model (DNN-HMM) hybrid systems.
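A minimal sketch of the factorized feature-layer idea behind contributions (1) and (2), using NumPy and scikit-learn. The layer sizes, the 40-dimensional target, the sigmoid nonlinearity, and the non-negative stand-in weight matrix are illustrative assumptions, not the thesis configuration; plain NMF requires a non-negative input, whereas the thesis's CNMF variant also handles mixed-sign weight matrices.

```python
# Sketch (assumed shapes and layer choice): factorize a hidden-layer weight
# matrix and use the basis matrix as the weights of a new, bias-free feature layer.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Stand-in for the trained DNN weight matrix of a chosen hidden layer
# (previous-layer units x this-layer units). A non-negative stand-in is used
# because plain NMF cannot factorize mixed-sign matrices.
W = rng.random((1024, 1024))

k = 40                                            # assumed feature dimension
nmf = NMF(n_components=k, init="nndsvda", max_iter=500)
B = nmf.fit_transform(W)                          # basis matrix, shape (1024, k)

# The basis matrix becomes the weight matrix of the new feature layer (no bias);
# features are obtained by a forward pass of the preceding layer's activations.
h_prev = rng.random((10, 1024))                   # stand-in activations for 10 frames
features = 1.0 / (1.0 + np.exp(-(h_prev @ B)))    # low-dimensional high-level features
print(features.shape)                             # (10, 40)
```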
(3) To improve the Convolutional Neural Network (CNN) acoustic model under low-resource conditions, a multi-stream feature based CNN acoustic modeling approach is proposed. The CNN acoustic model outperforms the DNN acoustic model in low-resource speech recognition, but its network parameters are insufficiently trained when training data is limited. To exploit more of the acoustic information available in the limited data, parallel convolutional sub-networks are built over multi-stream features extracted from the low-resource data, and fully connected layers are added on top to form a new CNN structure. Experiments show that the parallel convolutional sub-networks make the different feature spaces more similar, and the proposed model gains 3.3% and 2.1% recognition accuracy over the traditional multi-feature splicing training approach and the baseline CNN system, respectively. When multilingual training is further introduced on top of the proposed approach, the recognition accuracy improves by 5.7% and 4.6%, respectively.
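A minimal PyTorch sketch of the parallel sub-network structure described in contribution (3). The choice of feature streams, filter counts, context window, and output size are illustrative assumptions rather than the thesis's actual configuration; each stream gets its own convolutional front end, and the flattened outputs are merged by shared fully connected layers.

```python
# Sketch (assumed dimensions): one convolutional sub-network per feature stream,
# merged by fully connected layers that predict tied HMM state posteriors.
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """Convolutional sub-network applied to a single feature stream."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                       # halves the time x frequency map
        )

    def forward(self, x):                          # x: (batch, 1, frames, bins)
        return self.conv(x).flatten(1)

class MultiStreamCNN(nn.Module):
    """Parallel sub-networks over multiple streams, merged by fully connected layers."""
    def __init__(self, n_streams=2, frames=11, bins=40, n_states=1000):
        super().__init__()
        self.streams = nn.ModuleList(StreamCNN() for _ in range(n_streams))
        conv_out = 32 * (frames // 2) * (bins // 2)
        self.fc = nn.Sequential(
            nn.Linear(n_streams * conv_out, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_states),             # state posteriors for the HMM
        )

    def forward(self, xs):                         # xs: list of per-stream tensors
        merged = torch.cat([net(x) for net, x in zip(self.streams, xs)], dim=1)
        return self.fc(merged)

model = MultiStreamCNN()
stream_a = torch.randn(8, 1, 11, 40)               # stand-in feature stream 1
stream_b = torch.randn(8, 1, 11, 40)               # stand-in feature stream 2
print(model([stream_a, stream_b]).shape)            # torch.Size([8, 1000])
```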
Keywords/Search Tags:Low-resource Speech Recognition, Deep Neural Network, Bottle-Neck Features, Non-negative Matrix Factorization, Multilingual Training, Convolutional Neural Network, Multi-stream Features