As the most common and important means of information exchange in daily life, speech has become a prominent topic in intelligent interaction with the rapid development of artificial intelligence. The problem of speech separation originates from the classic "cocktail party effect": people can easily select and track the voice of a specific speaker in a complex acoustic environment, yet this remains a very challenging task for machines. Extracting a clean, clear signal for a target speaker from a complex mixture has therefore become a pressing problem in the development of speech technology. Single-channel speech separation in particular has become a research focus because it is easy to deploy and low in cost. Speech separation based on deep learning has made significant progress over methods built on traditional speech signal processing, but separation quality still needs improvement. To address the difficulties machines face in single-channel speech separation, this thesis improves deep-learning-based speech separation algorithms. The main work is as follows (minimal illustrative code sketches of the key steps follow this summary):

(1) A deep clustering speech separation method based on an auditory modulation mechanism is proposed to address the poor separation caused by overlapping time-frequency points of different speakers when a single-channel method takes spectrogram features as input. The method first computes modulation signals through frequency-band division and envelope detection, and extracts the modulation amplitude spectrum with a Fourier transform; it then extracts embedding features from the modulation amplitude spectrum with a BLSTM network combined with a self-attention mechanism; finally, a self-organizing map clusters the embedded features to obtain mask matrices for the different speakers and reconstruct their speech signals. Experiments show that this method reaches a PESQ of 3.03 and an SDRi of 9.41 dB on the WSJ0-2mix dataset.

(2) To address the difficulty the above algorithm has in distinguishing the features of different speakers when data are insufficient or imbalanced, and to further improve separation performance, a speech separation model fusing Siamese (twin) neural networks and deep clustering is proposed. It balances the proportion of hard and non-hard samples in the training data through hard-sample mining and resampling; a Siamese network based on a co-attention mechanism is designed to extract discriminative features for different speakers, where the co-attention captures the interaction among triplet features and explores their correlation in the feature embedding space; finally, an improved self-organizing map is constructed to obtain mask matrices for the different speakers, enabling separation for any number of speakers. Experiments show that the proposed model reaches a PESQ of 3.38 and an SDRi of 12.62 dB on WSJ0-2mix, which are 4.32% and 5.17% higher than current state-of-the-art methods.

(3) A speech open-platform system is designed and implemented. With speech separation as its core function, the separated target speaker's voice is fed to voiceprint recognition, keyword spotting, and other operations to extract useful information from speech. The system integrates the speech separation algorithm based on the fusion of Siamese neural networks and deep clustering, using the trained model to separate test mixtures and present the results in a concise and convenient way.
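The front end of method (1) is the modulation amplitude spectrum. Below is a minimal sketch of the band-division, envelope-detection, and Fourier-transform steps; the Butterworth filterbank, Hilbert-transform envelope, log-spaced bands, and every function name and parameter here are illustrative assumptions rather than the thesis implementation.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def modulation_amplitude_spectrum(x, fs, n_bands=8, frame_len=1024, hop=256):
    """Sketch: band division -> envelope detection -> framewise FFT
    of each band envelope, giving a modulation amplitude spectrum."""
    # Log-spaced band edges loosely mimic auditory frequency resolution.
    edges = np.logspace(np.log10(80.0), np.log10(0.95 * fs / 2), n_bands + 1)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        env = np.abs(hilbert(sosfilt(sos, x)))        # envelope detection
        # Frame the envelope and take the magnitude FFT per frame.
        n_frames = 1 + (len(env) - frame_len) // hop
        frames = np.stack([env[i * hop:i * hop + frame_len]
                           for i in range(n_frames)])
        bands.append(np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1)))
    return np.stack(bands)   # (n_bands, n_frames, modulation_bins)
```

A gammatone or mel filterbank would be an equally plausible choice for the band split.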
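For the embedding stage of method (1), the following sketches a BLSTM-plus-self-attention network that maps each time-frequency bin of the modulation spectrum to a unit-norm embedding, trained with the standard deep clustering affinity loss ||VV^T - YY^T||_F^2 (expanded algebraically so the large affinity matrices are never materialized). Layer sizes and the use of multi-head attention are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BLSTMSelfAttnEmbedder(nn.Module):
    """BLSTM over feature frames, self-attention across time, then a
    projection to a D-dimensional embedding per time-frequency bin."""
    def __init__(self, n_freq=129, hidden=300, embed_dim=20, heads=4):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_freq * embed_dim)
        self.embed_dim = embed_dim

    def forward(self, feats):                          # (B, T, F)
        h, _ = self.blstm(feats)                       # (B, T, 2H)
        h, _ = self.attn(h, h, h)                      # self-attention over time
        v = self.proj(h).view(feats.size(0), -1, self.embed_dim)  # (B, T*F, D)
        return F.normalize(v, dim=-1)

def deep_clustering_loss(V, Y):
    """||VV^T - YY^T||_F^2 for embeddings V (B, N, D) and one-hot
    speaker assignments Y (B, N, C), via the expanded identity."""
    VtV, YtY = V.transpose(1, 2) @ V, Y.transpose(1, 2) @ Y
    VtY = V.transpose(1, 2) @ Y
    return VtV.pow(2).sum() + YtY.pow(2).sum() - 2 * VtY.pow(2).sum()
```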
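The clustering stage replaces the usual k-means with a self-organizing map. A hand-rolled sketch that treats the nodes of a small one-dimensional SOM as speaker centroids over the per-bin embeddings; the node count, learning-rate and neighborhood schedules, and the reading-off of binary masks are all assumptions (the thesis's improved SOM is not reproduced here).

```python
import numpy as np

def som_cluster_masks(V, n_speakers=2, iters=2000, lr0=0.5, sigma0=1.0, seed=0):
    """Cluster per-bin embeddings V (N, D) with a 1-D self-organizing map
    and return one binary mask row per speaker, shape (n_speakers, N)."""
    rng = np.random.default_rng(seed)
    nodes = V[rng.choice(len(V), n_speakers, replace=False)].copy()
    for t in range(iters):
        lr = lr0 * (1.0 - t / iters)                        # decaying learning rate
        sigma = max(sigma0 * (1.0 - t / iters), 1e-3)       # shrinking neighborhood
        x = V[rng.integers(len(V))]
        bmu = np.argmin(np.linalg.norm(nodes - x, axis=1))  # best matching unit
        h = np.exp(-(np.arange(n_speakers) - bmu) ** 2 / (2 * sigma ** 2))
        nodes += lr * h[:, None] * (x - nodes)
    labels = np.argmin(np.linalg.norm(V[:, None, :] - nodes[None], axis=-1), axis=1)
    return np.stack([(labels == k).astype(float) for k in range(n_speakers)])
```

Reshaping each mask row back to the (T, F) grid, applying it to the mixture spectrogram, and inverting the transform reconstructs each speaker's signal.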
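For method (2), the sketch below shows a weight-shared (Siamese) encoder whose branches interact through co-attention, paired with a batch-hard triplet loss as one common form of hard-sample mining. The GRU encoder, the attention form, and the margin are stand-in assumptions; the thesis's resampling scheme is not modeled here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    """One branch's features attend over the other branch's, so triplet
    members interact in the shared embedding space."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, a, b):                              # (B, T, D) each
        w = torch.softmax(self.q(a) @ self.k(b).transpose(1, 2)
                          / a.size(-1) ** 0.5, dim=-1)
        return a + w @ self.v(b)

class SiameseSpeakerEncoder(nn.Module):
    """The same weights encode anchor, positive, and negative inputs."""
    def __init__(self, n_feat=129, hidden=128, out_dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_feat, hidden, batch_first=True, bidirectional=True)
        self.coattn = CoAttention(2 * hidden)
        self.head = nn.Linear(2 * hidden, out_dim)

    def forward(self, x, ref):                            # both (B, T, F)
        h, _ = self.rnn(x)
        r, _ = self.rnn(ref)                              # shared weights: Siamese
        h = self.coattn(h, r)                             # cross-branch interaction
        return F.normalize(self.head(h.mean(1)), dim=-1)

def batch_hard_triplet_loss(emb, labels, margin=0.3):
    """Hard-sample mining: per anchor, the farthest positive and the
    nearest negative in the batch define the triplet."""
    d = torch.cdist(emb, emb)
    same = labels[:, None] == labels[None, :]
    hardest_pos = (d * same.float()).max(dim=1).values
    hardest_neg = d.masked_fill(same, float("inf")).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()
```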
In summary, this thesis carries out work in the field of speech separation, proposing a deep clustering separation method based on an auditory modulation mechanism and a separation algorithm fusing Siamese neural networks and deep clustering, and building a speech open platform around these algorithms. The experimental results show that the proposed methods improve on existing speech separation models and significantly raise separation quality.