
Research On Multi-Channel Speech Enhancement Algorithm Based On Time-Frequency Mask

Posted on: 2023-11-02
Degree: Master
Type: Thesis
Country: China
Candidate: S H Chen
Full Text: PDF
GTID: 2558306914471104
Subject: Information and Communication Engineering
Abstract/Summary:
With the development of technology, intelligent speech devices are becoming increasingly popular. The need for front-end enhancement algorithms is also growing, since noise and reverberation in real-world scenes degrade the quality and intelligibility of speech and hence the overall performance of speech processing systems. In recent years, mask-based beamforming methods have shown good performance; however, most of them fail to make full use of the information contained in the speech signal during the mask estimation stage. Moreover, in practice, data mismatch arises from the large differences in noise and speaker type between the training set and the real acoustic environment, leading to inaccurate mask estimation, which in turn degrades the beamforming. To address these problems, the main research of this paper is as follows:

(1) A mask estimation method based on cross-domain feature fusion is investigated. This paper proposes adding hand-crafted spatial features in the time and frequency domains, together with features related to human auditory perception, to the input signal, so that the network can make full use of the time-, frequency- and space-domain information in the signal when estimating the mask. Multi-domain optimization of the model is achieved using a loss function that combines the mean square error of the time-frequency mask with the time-domain SI-SNR loss. The loss function is further improved by incorporating STOI and PESQ to alleviate the mismatch between the loss function and the evaluation metrics. A complex-valued ratio mask is used instead of an ideal binary mask as the training target, so that phase information is fully exploited to characterize the presence probability of the target source. Experimental results show that the network fusing time-frequency-domain spatial features and auditory-perception features improves SI-SNR, PESQ and STOI by 0.5, 0.1 and 0.01 on simulated data, respectively, and improves MOSNet scores by 0.2 and 0.4 on the real validation and test sets, respectively, relative to the method that estimates the mask from the spectrum alone. This verifies the effectiveness of cross-domain feature fusion for the speech enhancement task and shows that hand-crafted features can improve the robustness of the network to a certain extent.

(2) A joint mask estimation approach combining deep learning and spatial clustering is investigated. The improved network is first used to obtain a relatively accurate initial mask estimate; this output is used to calculate the speech presence probability (SPP), which initializes the parameters of a statistical model, and the network output is then corrected by clustering the observations at each frequency. Results were compared when using the neural network alone, different mixture models (complex Gaussian mixture model, complex Watson mixture model, or complex angular central Gaussian mixture model, cACGMM), different initialization methods (random initialization, real-valued mask, or SPP), and different numbers of iterations. The results show that initializing the cACGMM with the SPP works best: compared with using the neural network alone, MOSNet improves by 0.255 and 0.2 on the validation and test sets, respectively. The approach also solves the frequency permutation problem that arises when the clustering method is used alone, and achieves better results with fewer iterations. The proposed method improves real-time performance and robustness in real scenarios.
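The time-domain SI-SNR term of the combined loss in (1) follows the standard scale-invariant SNR definition. A minimal NumPy sketch of that metric is shown below; this is an illustration of the standard formula, not code from the thesis, and a training loss would use its negative:

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between a 1-D estimate and target."""
    # Remove the mean so DC offsets do not affect the score
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target: the scale-invariant "signal" part
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    e_noise = estimate - s_target
    return 10 * np.log10(np.sum(s_target ** 2) / (np.sum(e_noise ** 2) + eps))
```

Because the estimate is projected onto the target, rescaling the estimate leaves the score essentially unchanged, which is what makes the loss robust to gain differences between network output and reference.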
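The complex-valued mask target in (1) corresponds, in standard formulations, to a complex ratio mask M = S / Y computed per time-frequency bin from the clean STFT S and the mixture STFT Y. The sketch below shows this unbounded textbook form; the function name and the lack of compression are illustrative assumptions, not the thesis's exact definition (in practice the mask is usually compressed or bounded before regression):

```python
import numpy as np

def complex_ratio_mask(stft_clean, stft_mix, eps=1e-8):
    """Per T-F bin complex mask M such that M * Y ~= S (i.e. M = S / Y)."""
    denom = stft_mix.real ** 2 + stft_mix.imag ** 2 + eps  # |Y|^2
    # Real and imaginary parts of S * conj(Y) / |Y|^2
    mask_r = (stft_clean.real * stft_mix.real + stft_clean.imag * stft_mix.imag) / denom
    mask_i = (stft_clean.imag * stft_mix.real - stft_clean.real * stft_mix.imag) / denom
    return mask_r + 1j * mask_i
```

Unlike a binary or magnitude mask, this target also encodes the phase correction from mixture to clean speech, which is the property the abstract's training target exploits.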
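The joint approach in (2) can be illustrated with a per-frequency cACGMM EM refinement in which the network-derived SPP supplies the initial class posteriors, so the clustering starts from a consistent speaker assignment at every frequency (avoiding the permutation problem). This is a simplified sketch of the standard cACGMM updates under assumed names and shapes, not the thesis implementation:

```python
import numpy as np

def cacgmm_refine(Z, gamma_init, n_iter=10, eps=1e-8):
    """Refine time-frequency masks at one frequency bin with a complex
    angular central Gaussian mixture model (cACGMM).

    Z          : (T, D) complex multichannel observations (T frames, D mics)
    gamma_init : (T, K) initial class posteriors, e.g. the network's SPP
    returns    : (T, K) refined posteriors (per-class T-F masks)
    """
    T, D = Z.shape
    K = gamma_init.shape[1]
    Z = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + eps)  # keep directions only
    gamma = gamma_init.copy()
    B_inv = np.tile(np.eye(D, dtype=complex), (K, 1, 1))
    log_p = np.empty((T, K))
    for _ in range(n_iter):
        pi = gamma.mean(axis=0)  # class priors
        for k in range(K):
            # quadratic form z^H B_k^{-1} z with the previous shape matrix
            q = np.einsum('td,de,te->t', Z.conj(), B_inv[k], Z).real + eps
            # M-step: reweighted scatter matrix update of B_k
            B = D * np.einsum('t,td,te->de', gamma[:, k] / q, Z, Z.conj())
            B /= gamma[:, k].sum() + eps
            B += eps * np.eye(D)  # regularize before inversion
            B_inv[k] = np.linalg.inv(B)
            _, logdet = np.linalg.slogdet(B)
            q = np.einsum('td,de,te->t', Z.conj(), B_inv[k], Z).real + eps
            # E-step numerator: log pi_k - log det B_k - D log(z^H B_k^{-1} z)
            log_p[:, k] = np.log(pi[k] + eps) - logdet - D * np.log(q)
        log_p -= log_p.max(axis=1, keepdims=True)  # stabilize the softmax
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```

Running this independently per frequency with a shared SPP initialization is what lets the statistical model sharpen the network's masks in a few iterations while keeping class labels aligned across frequencies.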
Keywords/Search Tags:multi-channel speech enhancement, cross-domain feature fusion, time-frequency mask, multi-channel