
Research On Several Modeling Problems In Deep Learning Speech Recognition Systems

Posted on: 2021-01-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Tang
Full Text: PDF
GTID: 1368330605979415
Subject: Signal and Information Processing
Abstract/Summary:
As a natural means of human communication, speech has inherent advantages as a medium for human-machine interaction. Automatic speech recognition (ASR), which transforms a speech signal into text, is a key technology for achieving human-machine communication. With the development of deep learning (DL), DNN-based ASR has become mainstream. Two model frameworks currently dominate ASR: hybrid and end-to-end (E2E). In this context, this dissertation focuses on several modeling problems under these two frameworks. We study practical problems in the hybrid architecture, such as time delay and noise robustness, to reduce the impact of real application conditions on performance. We also address shortcomings of end-to-end ASR by studying the rationality of the end-to-end model (e.g., the optimization of the attention vector and the use of multi-level labels in modeling). By applying low-cost but effective information (such as posterior information and multi-level label information) in end-to-end modeling, we propose novel neural network models that improve performance.

Firstly, we investigate long short-term memory (LSTM) based acoustic modeling. To solve the high-latency problem of bidirectional LSTM acoustic models, we propose a novel attention-based LSTM (ALSTM) layer. An ALSTM layer consists of an LSTM and an attention mechanism: the LSTM encodes the past context, while future context information is obtained through the attention mechanism. By constructing a network from ALSTM layers, we obtain an acoustic model with controllable delay and high performance. Experimental results on Switchboard show that an acoustic model composed of multiple ALSTM layers achieves nearly BLSTM performance while keeping the delay controllable.

Secondly, we propose a Densely Connected Residual Network, termed DenseRNet, for acoustic modeling. This architecture can be regarded as the integration
of residual basic components and dense blocks. Because DenseRNet can exploit rich multi-resolution feature maps, it yields a noise-robust acoustic model. Experimental results demonstrate that DenseRNet is robust to beamforming-enhanced speech as well as to near-field and far-field speech.

Thirdly, this work explores the use of posterior information for attention modeling in ASR. We show that directly applying posterior information gives rise to two deficiencies: the limited context of the information used, and additional mismatch introduced between training and inference. To counter the first deficiency, we present an encoder modification that introduces additional context information into the output prediction computation. The second deficiency is overcome through two solutions: a mismatch penalty term and an alternate learning strategy (ALS). The former applies a divergence-based loss to correct the mismatched bias distribution, while the latter employs a novel update strategy that introduces iterative inference steps alongside each training step. Experiments show that the resulting system, called Extended Posterior Attention Modeling (EPAM), achieves significant improvement.

Finally, we propose a multi-granularity sequence alignment (MGSA) approach that exploits cross-sequence interactions for encoder-decoder based ASR. Specifically, a decoder module is designed to generate multi-granularity sequence predictions. By exploiting the latent alignment mapping among units of different granularities, the decoded multi-level sequences serve as complementary information for the final inference. The cross-sequence interactions can also be applied to re-calibrate the output probabilities at the inference stage. Experimental results on both the WSJ-80hrs and Switchboard-300hrs benchmark datasets show the superiority of the proposed method over the traditional multi-task method
and a baseline system with a single granularity unit.
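The latency-control idea behind the ALSTM layer can be illustrated in isolation: a unidirectional recurrence supplies past context, and attention is restricted to a window of at most d future frames, so the per-frame delay is bounded by d. The following is a minimal NumPy sketch of such a windowed attention step, not the dissertation's actual ALSTM implementation; the function name, the scaled dot-product scoring, and the projection matrices `W_q`/`W_k` are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def latency_controlled_attention(h, d, W_q, W_k):
    """For each frame t, attend only over frames [t, t+d], so the output
    at t depends on at most d future frames (bounded latency).

    h:   (T, H) hidden states, e.g. from a unidirectional LSTM encoder
         that already carries the past context.
    d:   maximum number of future frames visible to each output frame.
    Returns a (T, H) array of context vectors.
    """
    T, H = h.shape
    q = h @ W_q  # query from the current frame
    k = h @ W_k  # keys for the (limited) future frames
    out = np.zeros_like(h)
    for t in range(T):
        end = min(T, t + d + 1)
        scores = q[t] @ k[t:end].T / np.sqrt(H)
        w = softmax(scores)          # attention weights over the window
        out[t] = w @ h[t:end]        # weighted sum of windowed states
    return out
```

Setting d = 0 reduces this to a purely causal model, while letting the window cover the whole utterance approximates the unbounded future context of a BLSTM; the trade-off between delay and accuracy is then tuned through d.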
Keywords/Search Tags: Automatic Speech Recognition, Deep Learning, End-to-end Modeling, Attention Mechanism, Residual Connection, Dense Connection, Posterior Attention Model, Multi-granularity Target Sequence