
Mandarin Automatic Speech Recognition Based On Transformer

Posted on: 2022-10-30    Degree: Master    Type: Thesis
Country: China    Candidate: C Zhang    Full Text: PDF
GTID: 2518306569472904    Subject: Signal and Information Processing
Abstract/Summary:
Speech recognition is developing rapidly. Owing to its simpler structure and unified objective function, end-to-end speech recognition has reached a level comparable to traditional speech recognition systems. Among end-to-end approaches, the Transformer-based framework has been widely adopted for offline speech recognition because of its strong modeling capability, but several problems remain in current research. The strong performance of the Transformer framework comes from the global modeling capability of its self-attention module, yet the global attention mechanism is not monotonic and lacks the ability to model the local temporal dependence of time-series signals. In addition, its batch-oriented, highly parallel structure not only speeds up training but also increases exposure bias, so improving local modeling ability and robustness is particularly important. Furthermore, decoding with the Transformer framework exhibits large fluctuations in duration, and the significant variation in sentence length and latency leads to performance degradation. To address these problems, the main research contents and results are as follows:

1. A Transformer model based on local temporal dependence is proposed. To address the encoder's lack of local modeling of the speech feature sequence, a local dense synthesizer attention mechanism is used to restrict the attention range to a local window; combined with the global attention mechanism, this effectively improves modeling capability. To address the decoder's lack of local modeling of the target text sequence, a loss-adaptive partial masking sampling algorithm is proposed, which reduces exposure bias and strengthens local modeling of common Chinese word formation. Incorporating these modules into the Transformer structure yields accuracy improvements of about 13.8% and 9.3% on the Chinese datasets AISHELL-1 and AISHELL-2.

2. A decoding-speed optimization algorithm for Transformer-based speech recognition is proposed, covering model inference acceleration and search optimization. Model inference acceleration targets the encoder-decoder attention module and the self-attention module, and reduces decoding latency by 25% relatively without any loss of accuracy. Search optimization comprises two parts: beam search optimization and non-autoregressive decoding acceleration. The beam search optimization introduces static and dynamic threshold decoding algorithms that, analogous to the traditional beam search decoding process, effectively prune decoding paths with low confidence; combined with the acceleration algorithms above, decoding time is reduced by at least 45%. Second, the connectionist temporal classification (CTC) loss function is introduced into the Transformer framework, and the Transformer decoding score is integrated into its prefix prediction results. The resulting non-autoregressive decoding algorithm can replace Transformer autoregressive decoding; compared with the combination of model inference acceleration and beam search optimization, it nearly doubles the decoding speed while achieving better performance.
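The local attention idea in contribution 1 can be illustrated with a minimal PyTorch sketch: each query attends only to positions within a fixed window of itself, and the local branch is then combined with an ordinary global attention branch. This is an assumption-laden illustration, not the thesis's local dense synthesizer attention module; the window size and the simple averaging of the two branches are hypothetical choices.

```python
import torch
import torch.nn.functional as F

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask keeping only positions within +/- `window` of each query."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window  # (seq_len, seq_len)

def scaled_dot_attention(q, k, v, keep_mask=None):
    """Scaled dot-product attention with an optional boolean keep-mask."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if keep_mask is not None:
        scores = scores.masked_fill(~keep_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: combine a local branch and a global branch by simple averaging.
T, d = 50, 64
q = k = v = torch.randn(1, T, d)
local_out = scaled_dot_attention(q, k, v, local_attention_mask(T, window=4))
global_out = scaled_dot_attention(q, k, v)
combined = 0.5 * (local_out + global_out)
```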
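The static-threshold pruning described for beam search optimization can be sketched as follows: in addition to the usual top-k cut, hypotheses whose cumulative score falls more than a fixed margin below the current best are dropped. The function name, the threshold value, and the assumption that per-step token log-probabilities are already available are all illustrative; the thesis's dynamic-threshold variant is not reproduced here.

```python
import math

def beam_search_with_threshold(step_log_probs, beam_size=4, threshold=5.0):
    """
    Beam search over a list of per-step log-prob dicts {token: log_prob}.
    Hypotheses whose score drops more than `threshold` below the current
    best are pruned, in addition to the usual top-k cut.
    """
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for log_probs in step_log_probs:
        candidates = [
            (seq + [tok], score + lp)
            for seq, score in beams
            for tok, lp in log_probs.items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        best = candidates[0][1]
        # static-threshold pruning: drop low-confidence paths early
        beams = [c for c in candidates if best - c[1] <= threshold][:beam_size]
    return beams[0]

# Toy usage with two decoding steps over a three-token vocabulary.
steps = [
    {"a": math.log(0.7), "b": math.log(0.2), "c": math.log(0.1)},
    {"a": math.log(0.1), "b": math.log(0.8), "c": math.log(0.1)},
]
print(beam_search_with_threshold(steps))
```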
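The non-autoregressive decoding idea, integrating the Transformer decoding score into CTC prefix predictions, can be hinted at with a toy rescoring sketch: each CTC candidate prefix is re-ranked by a weighted sum of its CTC score and a decoder score. The candidate list, the interpolation weight `alpha`, and the placeholder `decoder_score_fn` are assumptions for illustration only; a real system would obtain both scores from trained models.

```python
def rescore_ctc_candidates(candidates, ctc_scores, decoder_score_fn, alpha=0.6):
    """
    Combine CTC prefix scores with a Transformer decoder score for each
    candidate transcript and return the best one. `decoder_score_fn` stands
    in for a full autoregressive scoring pass: any callable mapping a token
    sequence to a log-probability.
    """
    best, best_score = None, float("-inf")
    for cand, ctc_lp in zip(candidates, ctc_scores):
        score = alpha * ctc_lp + (1.0 - alpha) * decoder_score_fn(cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

# Toy usage with made-up candidates and scores.
cands = [[5, 12, 7], [5, 12, 9]]
ctc = [-3.2, -3.5]
fake_decoder = lambda seq: -0.1 * len(seq)  # placeholder for a real decoder pass
print(rescore_ctc_candidates(cands, ctc, fake_decoder))
```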
Keywords/Search Tags:Transformer Framework, Self-attention Mechanisms, Local Time Series Modeling, Decoding Speed Optimization