
Research On Deep Neural Networks Based Models For Speech Recognition

Posted on: 2018-04-13 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: S L Zhang | Full Text: PDF
GTID: 1318330512482669 | Subject: Information and Communication Engineering
Abstract/Summary:
Speech, as the most natural and effective means of communication, has long been one of the most active research fields in human-machine interaction. Automatic speech recognition is a key technology for achieving human-machine interaction: it lets a computer "understand" human speech by converting the speech signal into text. The acoustic model (AM) and the language model (LM) are the two core modules of a speech recognition system. Traditionally, acoustic models based on the hybrid Gaussian Mixture Model and Hidden Markov Model (GMM-HMM) and n-gram language models have been widely used. In recent years, with the rise of deep learning, acoustic and language models based on deep neural networks have significantly outperformed the traditional GMM-HMM and n-gram models, respectively. Against this background, this thesis concentrates on the model structure of deep neural networks. By optimizing existing models and exploiting the characteristics of speech and text signals, novel neural network models are proposed to improve both the recognition performance and the training speed of speech recognition systems.

Firstly, we investigate acoustic modeling with feedforward fully-connected deep neural networks (DNNs). We study two types of DNNs for large-vocabulary continuous speech recognition: DNNs with sigmoid activation functions (sigmoid-DNN) and DNNs with rectified linear units (RL-DNN). For the traditional sigmoid-DNN, we analyze the sparsity of the hidden-layer weights and propose a shrinking hidden-layer network structure. Experimental results show that it reduces the model to 45% of its original size and roughly doubles training and test speed without losing recognition accuracy. Moreover, we use dropout as a preconditioner (DAP) to initialize the sigmoid-DNN prior to back-propagation (BP) for better recognition accuracy. For the RL-DNN, we find that, with a reasonable parameter configuration, it can be trained with a very large batch size in stochastic gradient descent (SGD). The SGD learning can therefore be easily parallelized across multiple computing units for much better training efficiency. We also propose a tied-scalar regularization technique to make large-batch SGD training of RL-DNNs more stable.

Secondly, we propose a fixed-size ordinally forgetting encoding (FOFE) method for language modeling. FOFE can almost uniquely encode any variable-length sequence of words into a fixed-size representation. It models the word order in a sequence using a simple ordinally forgetting mechanism based on the positions of words. In this work, we apply FOFE to feedforward neural network language models (FOFE-FNNLM). Experimental results show that, without any recurrent feedback, the FOFE-FNNLM significantly outperforms not only the standard fixed-input FNNLM but also the popular RNNLM.

Thirdly, we propose a novel neural network structure, the feedforward sequential memory network (FSMN), to model long-term dependencies in time series without recurrent feedback. The proposed FSMN is a standard fully-connected feedforward neural network equipped with learnable memory blocks in its hidden layers. The memory blocks use a tapped-delay-line structure to encode long context information into a fixed-size representation as a short-term memory mechanism. We evaluate FSMNs on several standard benchmark tasks, including speech recognition and language modeling. Experimental results show that, for modeling sequential signals such as speech or language, FSMNs outperform conventional recurrent neural networks (RNNs) while being trainable much more reliably and quickly. Moreover, we propose a compact feedforward sequential memory network (cFSMN) by combining the FSMN with low-rank matrix factorization, and slightly modify the encoding method used in FSMNs to further simplify the network architecture. Furthermore, we add shortcut connections between the memory blocks of the cFSMN, which alleviates the vanishing-gradient problem during training and allows deeper cFSMNs to be trained. We evaluate the proposed methods on the Switchboard and Fisher acoustic-modeling tasks. Experimental results on the Fisher task show a relative word error rate (WER) reduction of about 13.8% compared to the popular BLSTM system.

Finally, we propose a novel model for high-dimensional data, the Hybrid Orthogonal Projection and Estimation (HOPE) model, which combines a linear orthogonal projection and a finite mixture model under a unified generative modeling framework. The HOPE model can be learned unsupervised from unlabelled data using maximum likelihood estimation, as well as discriminatively from labelled data. More interestingly, we show that the proposed HOPE models are closely related to neural networks (NNs), in the sense that each hidden layer can be reformulated as a HOPE model. As a result, the HOPE framework can serve as a novel tool to probe why and how NNs work. In this work, we use the HOPE framework to learn NNs for several standard tasks, including image recognition on MNIST and speech recognition on TIMIT. Experimental results show that the HOPE framework yields significant performance gains over current state-of-the-art methods in various NN learning problems, including unsupervised feature learning and supervised or semi-supervised learning.
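The FOFE idea above can be sketched in a few lines: each word is represented as a one-hot vector, and the encoding of a sequence is a forgetting-weighted sum of those vectors, so earlier words contribute with exponentially smaller weight. This is a minimal illustrative sketch, not the thesis implementation; the function name `fofe_encode` and the forgetting factor value are assumptions for the example.

```python
import numpy as np

def fofe_encode(word_ids, vocab_size, alpha=0.5):
    """Fixed-size ordinally forgetting encoding (illustrative sketch).

    Recursion: z_t = alpha * z_{t-1} + one_hot(w_t), with z_0 = 0.
    Returns z_T, a fixed-size vector regardless of sequence length.
    """
    z = np.zeros(vocab_size)
    for w in word_ids:
        z = alpha * z          # ordinally forget older words
        z[w] += 1.0            # add the current word's one-hot vector
    return z

# Example: encode the 3-word sequence [0, 2, 1] over a 4-word vocabulary.
# Word 0 (oldest) is weighted alpha^2, word 2 is weighted alpha, word 1 by 1.
print(fofe_encode([0, 2, 1], vocab_size=4, alpha=0.5))  # [0.25 1.  0.5 0. ]
```

The near-uniqueness property reported in the thesis holds for suitable choices of the forgetting factor, which is why a plain feedforward network on top of this fixed-size vector can compete with recurrent models.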
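The FSMN memory block described above can likewise be sketched schematically: at each time step, the block forms a weighted sum of the current and the previous N hidden activations through a tapped-delay line with learnable coefficients. The helper below is a simplified, loop-based illustration (names and shapes are assumptions); it shows the scalar-coefficient variant, ignores any look-ahead taps, and omits the surrounding feedforward layers.

```python
import numpy as np

def fsmn_memory(h, a):
    """Tapped-delay-line memory block (simplified sketch).

    h: (T, dim) array of hidden activations over T time steps.
    a: (N+1,) learnable tap coefficients; a[i] weights the state i steps back.
    Returns an (T, dim) array: out[t] = sum_i a[i] * h[t - i].
    """
    T, _ = h.shape
    N = a.shape[0] - 1
    out = np.zeros_like(h)
    for t in range(T):
        for i in range(min(t, N) + 1):   # clip the window at sequence start
            out[t] += a[i] * h[t - i]
    return out

# Example: 3 time steps, 2-dim states, one history tap with weight 0.5.
h = np.arange(6, dtype=float).reshape(3, 2)
print(fsmn_memory(h, np.array([1.0, 0.5])))
```

Because the memory is a finite feedforward window rather than a recurrent loop, gradients flow through a fixed number of taps, which is the structural reason FSMNs train more reliably than RNNs on long sequences.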
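The core decomposition behind the HOPE model can be illustrated numerically: a row-orthonormal matrix U projects high-dimensional data into a low-dimensional latent signal, and the residual is exactly orthogonal to that signal subspace. The snippet below only demonstrates this projection/residual split with a random orthonormal U (obtained via QR, an assumption for the example); fitting the finite mixture model on the latent signal is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 8, 3                       # input dim D, latent dim M < D

# Build a row-orthonormal projection U (M x D) so that U @ U.T = I_M.
A = rng.standard_normal((D, M))
U = np.linalg.qr(A)[0].T

x = rng.standard_normal(D)        # one high-dimensional observation
z = U @ x                         # latent signal (modeled by a mixture in HOPE)
residual = x - U.T @ z            # noise component

# The residual carries no signal: projecting it back gives zero.
print(np.round(U @ residual, 12))
```

This orthogonality is what lets the two parts (projection and mixture estimation) be combined in one generative framework, and it mirrors how a NN hidden layer can be read as a projection followed by a nonlinearity.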
Keywords/Search Tags: Speech Recognition, Deep Learning, Deep Neural Networks, Hybrid Orthogonal Projection and Estimation, Fixed-size Ordinally Forgetting Encoding, Feedforward Sequential Memory Networks