
Time Delay Neural Network Based Automatic Speech Recognition

Posted on: 2021-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: X R Huang
GTID: 2428330611966447
Subject: Signal and Information Processing
Abstract/Summary:
With the development of deep learning, time delay neural network (TDNN) based models have become one of the mainstream approaches to speech recognition. However, some problems remain in current research. On the one hand, there has been little work on deepening the TDNN. On the other hand, end-to-end speech recognition easily runs into data shortage in low-resource scenarios, which degrades performance. To address these problems, the main research contents and results of this thesis are as follows:

1. A stochastic depth based factorized time delay neural network is proposed. The traditional TDNN uses only one weight matrix for feature extraction at each context splicing layer. To address this, this thesis deepens the network behind each context splicing layer of the TDNN. On top of this deepening, residual connections are introduced to improve the convergence of the model, and stochastic depth training is used to improve its generalization ability. However, deepening the model inevitably increases the number of parameters. To limit this growth, the added feedforward layers are initialized with a singular value decomposition (SVD) structure, and one of the weight matrices is constrained to be positive semi-definite to keep training stable, which makes the model more practical. Experimental results show that the proposed model outperforms the factorized time delay neural network on the English AMI and SWBD data sets and achieves results on par with recurrent neural networks. The proposed model also has fewer parameters and does not depend on recurrent neural networks, so in practical applications it offers low latency, easy convergence, and efficient computation.

2. For end-to-end low-resource speech recognition, to address the shortage of data, this thesis proposes a single-step weight transfer method for knowledge transfer, which yields improvements in the English-to-English and English-to-Czech low-resource scenarios. For the output modeling units, where individual units lack training data, this thesis proposes a two-phoneme binding algorithm suited to low-resource speech recognition. It is implemented by setting a minimum count threshold for left-context phonemes and a minimum count threshold for single phonemes, which both reduces the number of model parameters and alleviates the shortage of data per output unit. For acoustic modeling, this thesis proposes a convolutional neural network based stochastic depth factorized time delay neural network. Its deep feedforward layers are initialized with an SVD structure, and stochastic depth training, a convolutional neural network input layer, and a positive semi-definite constraint on the deep feedforward layers are used to improve the generalization ability of the model in low-resource scenarios. Finally, compared with several mainstream end-to-end models, the proposed approach significantly improves both the parameter count and the performance of the model in the end-to-end low-resource speech recognition scenario.
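
The two generic building blocks named in the abstract, a low-rank layer initialized from an SVD and a residual block trained with stochastic depth, can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the thesis's implementation: the function names `svd_factorize` and `stochastic_depth_block`, the ReLU nonlinearity, the equal input and output dimensions, and the `survival_prob` parameter are all hypothetical choices made for the example. The constraint the abstract places on one weight matrix is not modeled here.

```python
import numpy as np

def svd_factorize(W, rank):
    """Split W (out x in) into A (out x rank) and B (rank x in) via truncated SVD.

    Hypothetical helper: A @ B is the best rank-`rank` approximation of W.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # absorb the singular values into the first factor
    B = Vt[:rank, :]            # rows of B are orthonormal
    return A, B

def stochastic_depth_block(x, A, B, survival_prob, training, rng):
    """Residual block y = x + relu(A @ B @ x) with stochastic depth.

    During training the residual branch is dropped with probability
    1 - survival_prob; at inference it is always kept and scaled by
    survival_prob. ReLU is an assumption made for this sketch.
    """
    if training:
        if rng.random() >= survival_prob:
            return x  # branch skipped: the block acts as the identity
        return x + np.maximum(A @ (B @ x), 0.0)
    return x + survival_prob * np.maximum(A @ (B @ x), 0.0)
```

With a hypothetical 8 x 8 layer and `rank=2`, the two factors hold 8*2 + 2*8 = 32 weights instead of 64, which illustrates how factorization can offset the parameter growth caused by deepening.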
Keywords/Search Tags: Speech Recognition, Time Delay Neural Network, End-to-End Speech Recognition, Low Resource Speech Recognition, Modeling Unit