
Research On Speech Recognition Model Based On Conference Scene

Posted on: 2024-05-21
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Jiang
Full Text: PDF
GTID: 2568307157981589
Subject: Master of Electronic Information (Professional Degree)

Abstract/Summary:
Speech recognition, as a primary mode of human-computer interaction, has broad prospects in the conference scene. As meetings grow in number and take increasingly complex and diverse forms, higher demands are placed on speech recognition. Commonly used speech recognition models face several problems in practice: speech enhancement models are usually trained on noisy speech from fixed scenes, so the trained models have low robustness; existing speech training sets are small, which makes overfitting likely during training; and adding a speech enhancement network to a speech recognition model increases its complexity, which hinders hardware implementation. To address these problems, this thesis aims to obtain a lightweight, high-accuracy speech recognition model under small-sample training. It studies a highly robust speech enhancement network, a speech recognition network trained on a small sample set, and a lightweight, high-accuracy speech recognition network, and builds a conference transcription speech recognition system. The main research contents are as follows:

A spectrum-mapping network based on a GAN is proposed to address the low robustness of speech enhancement models. First, the Fourier transform magnitude is fed into the generative adversarial network, and the current learning parameter is added as a monitoring function to prevent the loss of speech frames during compression, which improves the stability of the model. Meanwhile, the phase information is fed into a DNN to train a phase generator. Finally, the generated magnitude and phase are combined to produce the enhanced speech output. Experimental results show that, in various scenarios with SNR = 10, this model outperforms Wiener filtering, subspace denoising, and other speech enhancement algorithms, and that speech intelligibility improves over the original SEGAN algorithm by 16% in PESQ, 18% in COVL, and 20% in SSNR, providing more accurate speech features for subsequent recognition.

To address the overfitting caused by training on small data sets and the low robustness of models trained under noisy conditions, a densely connected pre-convolutional network is proposed. The network extracts deep speech features and, using a residual module, convolves the extracted features with the original input feature parameters and convolves deep features with shallow features; this effectively expands the data set and avoids overfitting.

To address the increased model complexity caused by adding a pre-enhancement network to the above models, an end-to-end convolutional network based on IRM mask mapping is designed. Trained on small samples, it uses the magnitude and phase information of clean speech to compute the IRM of the speech signal as the input for neural network training, reduces the number of network layers, and trains an IRM predictor. The noisy speech signal is reconstructed from the IRM prediction to achieve noise reduction. At the same time, the predicted IRM is convolved with the true IRM to expand the data set. The model is tested on the TIMIT data set in a multi-speaker conversation scenario; the results show that its recognition accuracy improves by 20% over the traditional LAS algorithm.
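As a rough illustration of the IRM-based masking idea described above (a minimal sketch with toy signals; the exact mask definition, features, and network in the thesis may differ), one common form of the ideal ratio mask can be computed from the clean-speech and noise spectrograms and applied to the noisy mixture:

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)      # toy "clean speech": a 440 Hz tone
noise = 0.5 * rng.standard_normal(fs)    # additive broadband noise
noisy = clean + noise

# Short-time Fourier transforms of clean, noise, and noisy signals
_, _, S = stft(clean, fs=fs, nperseg=512)
_, _, N = stft(noise, fs=fs, nperseg=512)
_, _, Y = stft(noisy, fs=fs, nperseg=512)

# Ideal ratio mask (a common power-ratio definition; values lie in [0, 1])
irm = np.sqrt(np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-10))

# Apply the mask to the noisy spectrogram and reconstruct the waveform
_, enhanced = istft(irm * Y, fs=fs, nperseg=512)

# Masking should bring the noisy signal closer to the clean reference
n = min(len(enhanced), len(clean))
err_noisy = np.mean((noisy[:n] - clean[:n]) ** 2)
err_enh = np.mean((enhanced[:n] - clean[:n]) ** 2)
assert err_enh < err_noisy
```

In a trained system the oracle mask above is unavailable at test time, which is why the thesis trains an IRM predictor: the network estimates the mask from the noisy input alone, and the predicted mask plays the role of `irm` in the reconstruction step.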
Keywords/Search Tags: Speech enhancement, speech recognition, conference scene, small sample set, transcription system