People receive information in daily life through advanced perceptual abilities such as vision and hearing, and the brain processes it efficiently into a form they can understand. Computers have remarkable storage and computing capabilities, but they cannot directly understand received video and audio information the way humans do. Speech is the most common and effective means of communication in daily life. Speech recognition technology was developed to allow computers to "understand" what people say and convert it into text symbols with a higher level of abstraction. Thanks to improvements in computer performance, recognition of clean speech has now reached and even surpassed human capability. Under external interference, however, the transcription error rate rises sharply, so improving recognition accuracy in noisy conditions is the key to this technology. This thesis studies a robust transcription scheme for ambient interference noise. Speech enhancement and text transcription are studied separately, and neural networks are used to learn the nonlinear mapping from noisy speech to transcribed text, reducing the transcription error rate under noisy conditions. The scheme is applied to an apron control project. The main work is as follows:

(1) Research on end-to-end algorithms for speech enhancement. To address the severe loss of phase information at low signal-to-noise ratios, an end-to-end modeling approach is adopted that models the time-domain audio signal directly. A UNet structure is used to capture more detailed information: stacked multi-scale blocks mine local information from high-dimensional features at different receptive fields, and the evaluation metrics are incorporated into the training process to obtain clearer speech.

(2) Research on acoustic modeling methods for speech recognition. The acoustic model uses multi-layer stacked CBRD units, which reduces the computational load and facilitates structural pruning. On this basis, a language model is built to capture the logical context of the text, so that the recognition and transcription results can be corrected and made more coherent.

(3) A method for fusing the enhancement model and the recognition model. Starting from the already trained enhancement and recognition models, transfer learning and joint training are applied on the noisy data set, so that the neural network adapts from the existing weights, gets the most out of both models, and achieves higher accuracy.

(4) Research on the specialized recognition of land-air communication in apron control tasks. Special pronunciation mappings are constructed according to the land-air communication standards to achieve effective recognition of land-air communication commands. To better put the algorithm results into practice and display them visually, a suitable prototype system was developed.
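
The abstract refers to the transcription error rate without defining it. A minimal sketch of one common definition is given below, assuming the metric is the token-level edit distance between reference and hypothesis divided by the reference length (word error rate for word tokens, character error rate for character tokens). The function names and the example phrases are hypothetical illustrations, not taken from the thesis, and the thesis may use a different metric.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences.

    Counts the minimum number of substitutions, deletions, and
    insertions needed to turn `hyp` into `ref`.
    """
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalised by reference length (assumed metric)."""
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical command, for illustration only.
    reference = "taxi to runway two seven".split()
    hypothesis = "taxi to runway two eleven".split()
    print(error_rate(reference, hypothesis))
```

Passing lists of characters instead of words gives a character error rate, which is the usual choice for languages written without word boundaries.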