As intelligent technology is increasingly applied in daily life and work, speaker voice recognition is widely used in intelligent remote control, text conversion, and other fields. In practice, the speaker's voice is inevitably corrupted by background noise during transmission, which severely degrades the accuracy of speaker voice recognition. Denoising and enhancing the speaker's voice is therefore particularly important. Since speech is the most common external expression of the speaker's voice, this paper focuses on the speech signal.

Existing speech denoising and enhancement algorithms fall into traditional and deep learning methods, but relatively few approaches combine the two. This paper therefore studies speech denoising and enhancement using a model that combines a masking-and-phase dual-branch communication deep neural network with the traditional spectral subtraction method. The work consists of two main parts.

First, a speech enhancement model combining dual-branch communication and spectral subtraction is proposed. The model pairs the widely used masking approach from deep neural networks with the classical spectral subtraction method from traditional signal processing. In the combined model, two branches predict the amplitude mask and the phase respectively; frequency transform blocks in the mask-prediction branch extract harmonics, capturing global correlations along the frequency axis and further guiding the phase estimate. Information is exchanged between the two branches, and this exchange improves the accuracy of the extracted speech features. Finally, spectral subtraction is applied as post-processing to remove residual noise and further improve the enhancement effect.

Second, in order to improve the accuracy and generalization
performance of the model, and in response to the small size of publicly available speech enhancement and denoising datasets, the existing datasets are filtered and expanded so that the model adapts to both hybrid and type-specific datasets. The experimental data are divided into two parts. The first part uses the commonly used public datasets Voice Bank and DEMAND as the basic experimental dataset for the model. This dataset contains a large number of utterances and is a hybrid dataset in which many noise types are mixed in at various signal-to-noise ratios. The second part is a type-specific dataset constructed in this paper by expanding Voice Bank with Noise-92; the purpose of the expansion is to make the model more targeted. In this dataset, four common noise types are mixed in at three signal-to-noise ratios (high, medium, and low), yielding 12 sets in total. Considering time and space costs, the most straightforward SNR-based mixing method is used in this paper.
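The SNR-based mixing used to build the expanded dataset can be sketched as follows. This is a minimal illustration of the standard technique, not the paper's actual data pipeline; the function name and parameters are assumptions for the example.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a noise waveform into a clean waveform at a target SNR (dB).

    The noise is tiled/truncated to the clean signal's length, then
    scaled so that the clean-to-noise power ratio equals snr_db.
    """
    reps = -(-len(clean) // len(noise))          # ceiling division
    noise = np.tile(noise, reps)[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale factor so that p_clean / (scale^2 * p_noise) = 10^(snr_db/10)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Repeating this for each clean/noise pair at high, medium, and low SNRs yields the type-specific mixtures described above.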
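For reference, the classical spectral subtraction used here as the post-processing stage can be sketched as below. This is a generic textbook formulation, not the paper's exact configuration; the frame size, over-subtraction factor, and spectral floor are illustrative assumptions.

```python
import numpy as np

def spectral_subtraction(noisy, noise, frame=512, hop=256, alpha=2.0, beta=0.01):
    """Classical magnitude spectral subtraction.

    noisy: 1-D noisy waveform; noise: a noise-only segment used to
    estimate the average noise magnitude spectrum.
    alpha: over-subtraction factor; beta: spectral floor fraction.
    """
    win = np.hanning(frame)

    def stft(x):
        n = 1 + (len(x) - frame) // hop
        frames = np.stack([x[i * hop:i * hop + frame] * win for i in range(n)])
        return np.fft.rfft(frames, axis=1)

    noise_mag = np.abs(stft(noise)).mean(axis=0)   # average noise magnitude
    spec = stft(noisy)
    mag, phase = np.abs(spec), np.angle(spec)
    # Subtract the scaled noise estimate, flooring to avoid negative magnitudes
    clean_mag = np.maximum(mag - alpha * noise_mag, beta * mag)
    frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame, axis=1)

    # Windowed overlap-add resynthesis
    out = np.zeros((len(frames) - 1) * hop + frame)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame] += f * win
        norm[i * hop:i * hop + frame] += win ** 2
    return out / np.maximum(norm, 1e-8)
```

In the proposed model this step runs after the dual-branch network, targeting whatever residual noise the network leaves behind.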