
Monaural Speech Enhancement Based On Time-frequency Mask In Complex Noise Environments

Posted on: 2021-01-20
Degree: Master
Type: Thesis
Country: China
Candidate: S Cheng
Full Text: PDF
GTID: 2518306290997009
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of artificial intelligence, speech communication and speech interaction are widely used in mobile phones, smart homes, intelligent in-vehicle equipment, and other fields. However, speech signals are inevitably corrupted by environmental noise. Driven by the demand for high-quality speech communication and efficient human-computer interaction in different scenarios, speech enhancement research faces new opportunities and challenges. Most monaural speech enhancement algorithms perform well at high signal-to-noise ratio (SNR) and under stationary noise, but their performance degrades markedly in complex environments with low SNR and non-stationary noise. To address speech enhancement under low SNR and real non-stationary noise, this thesis adopts two time-frequency (TF) mask based methods. The main work and contributions are as follows:

(1) A TF mask estimation method for speech enhancement based on time-varying filtering is proposed. The robust time-varying filtering (RTVF) method is effective at filtering and separating non-stationary signals in low SNR environments, which motivates its application to speech enhancement. First, initial instantaneous frequency (IF) information is estimated from the noisy speech by combining speech characteristics with image processing methods applied to the TF distribution. Next, based on the IF information, a reconstructed speech signal with less noise is obtained via RTVF. Finally, a TF binary mask is predicted from the TF characteristics of the reconstructed signal, and the clean speech spectrum is estimated by multiplying the mask with the noisy speech spectrum, which completes the enhancement. The experimental results show that the proposed method outperforms two classical methods, multi-band spectral subtraction and minimum mean square error short-time spectral amplitude estimation, in noise suppression and speech quality improvement, especially in low SNR environments.

(2) An audio-visual speech enhancement method based on the complex ratio mask (CRM) is also proposed. The CRM enhances the magnitude and phase spectra simultaneously, which improves speech quality. Moreover, in complex noise environments, video information helps distinguish speech segments from non-speech segments, which improves performance. Building on the advantages of audio-visual information as the network input and the CRM as the prediction target, this thesis designs an audio-visual speech enhancement network consisting of an audio encoder, a video encoder, a feature fusion module, and an audio decoder. To determine the fusion ratio between video and audio features, an attention mechanism in the feature fusion module uses the correlation between video and audio features to assign an appropriate weight to the video features, so that the video information is fused effectively. In addition, a residual structure is introduced between the encoder and the decoder to reduce the loss of low-level speech detail caused by stacking multiple convolutional layers and to improve network performance. The experimental results indicate that the proposed method suppresses various kinds of real noise in low SNR environments from -5 dB to 5 dB and effectively improves speech quality and intelligibility.
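The two mask types named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the rule the thesis uses to derive the binary mask from the RTVF-reconstructed signal, nor the network that predicts the CRM, so everything below is an assumption for illustration: the function names, the local-SNR threshold rule, and the `threshold_db` parameter are hypothetical, and the CRM is computed from a known clean spectrum (as one would for a training target) rather than predicted.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log of zero


def apply_binary_mask(noisy_spec, reference_spec, threshold_db=0.0):
    """Binary TF mask from a less-noisy reference spectrum (assumed rule).

    A bin is kept when the reference energy exceeds the energy of the
    residual (noisy minus reference) by more than threshold_db.
    Returns the mask and the masked noisy spectrum.
    """
    ref_power = np.abs(reference_spec) ** 2
    residual_power = np.abs(noisy_spec - reference_spec) ** 2
    local_snr_db = 10.0 * np.log10((ref_power + EPS) / (residual_power + EPS))
    mask = (local_snr_db > threshold_db).astype(float)
    return mask, mask * noisy_spec


def complex_ratio_mask(clean_spec, noisy_spec):
    """Complex ratio mask M such that M * noisy_spec approximates clean_spec.

    Because M is complex, multiplying by it changes both the magnitude
    and the phase of each TF bin.
    """
    return clean_spec * np.conj(noisy_spec) / (np.abs(noisy_spec) ** 2 + EPS)


# Toy spectra with shape (frequency bins, frames); values are synthetic.
rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
noise = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
noisy = clean + noise

crm = complex_ratio_mask(clean, noisy)
recovered = crm * noisy  # recovers the clean spectrum up to EPS

mask, enhanced = apply_binary_mask(noisy, clean)
```

In this sketch the binary mask only scales each bin by 0 or 1 and therefore keeps the noisy phase, whereas the CRM is a complex multiplier, which is the property the abstract refers to when it says the CRM enhances magnitude and phase together.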
Keywords/Search Tags: Monaural Speech Enhancement, Complex Noise Environment, Time-Frequency Mask, Audio-Visual Processing