In complex acoustic environments where multiple speakers talk at the same time, the human auditory system can readily separate the voices of different speakers and focus on the speech of a target speaker. Computer systems, however, find it difficult to track the target speaker's speech the way humans do, which seriously degrades the accuracy of downstream speech tasks such as speech recognition and speaker diarization. Improving the speech separation ability of computer systems is therefore one of the key problems in speech signal processing. According to the modality of the input data, speech separation can be divided into audio-only speech separation and audio-visual speech separation. In audio-only speech separation, a key issue is how to make full use of the contextual information in long speech sequences. In audio-visual speech separation, the main difficulty is how to effectively fuse the audio and visual modalities and exploit their complementary information. The main contributions of this thesis are as follows.

First, to strengthen the context-aware ability of audio-only speech separation models and to model both the global and local information of long speech sequences, this thesis proposes a dual-path network speech separation method based on multi-scale feature fusion. A multi-scale feature fusion module applied to inter-chunk elements of the dual-path network improves the model's global perception while retaining its ability to perceive local information. In addition, by introducing a star topology into the chunk-processing Transformer, a sparse attention mechanism reduces the computational complexity of the model while preserving its context-aware ability.

Second, to capture the complex correlations between audio and visual features in audio-visual speech separation and to make full use of both modalities, this thesis proposes an audio-visual speech separation framework based on factorized bilinear pooling fusion. Audio and visual features are first extracted by a speech feature extraction network and a lip feature extraction network, respectively; the features of the two modalities are then fused by a factorized bilinear pooling module, and their complementary information is exploited to effectively improve the performance of multi-modal speech separation.

Third, to better exploit the relationship between the audio and visual modalities, this thesis proposes a factorized bilinear pooling audio-visual speech separation algorithm based on an attention mechanism, in which the speech features are enhanced by attention. To give full play to the guiding role of visual features in speech separation, lip motion features and facial features are extracted as the visual features, and an attention weight matrix is applied to them to form visual attention features that are more closely related to the auditory features, thereby strengthening the modeling of the relationship between the two modalities.
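To make the dual-path formulation concrete, the sketch below shows how a long encoded speech sequence can be segmented into overlapping chunks so that intra-chunk (local) and inter-chunk (global) processing can alternate. This is only a minimal illustration of the generic dual-path segmentation step, not the implementation used in this thesis; the chunk length, hop size, and feature dimension are placeholder values.

```python
import torch

def segment_into_chunks(x, chunk_len, hop):
    """Split a long feature sequence (batch, T, D) into overlapping chunks
    (batch, n_chunks, chunk_len, D) so that a dual-path model can alternate
    intra-chunk (local) and inter-chunk (global) processing.
    Illustrative only; padding and overlap-add reconstruction are omitted."""
    chunks = x.unfold(dimension=1, size=chunk_len, step=hop)  # (batch, n_chunks, D, chunk_len)
    return chunks.permute(0, 1, 3, 2).contiguous()

x = torch.randn(1, 16000, 64)             # a long encoded speech sequence
chunks = segment_into_chunks(x, 250, 125)  # 50% overlap between chunks
print(chunks.shape)                        # torch.Size([1, 127, 250, 64])
```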
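The factorized bilinear pooling fusion in the second contribution can be sketched as follows. This is a minimal PyTorch illustration of the general technique (element-wise interaction of two linear projections, sum pooling over a factor dimension, then power and L2 normalization); the module name, feature dimensions, and factor size are assumptions made for the example and do not come from the thesis.

```python
import torch
import torch.nn as nn

class FactorizedBilinearPooling(nn.Module):
    """Minimal factorized bilinear pooling fusion of an audio feature vector
    and a visual feature vector. Dimensions are illustrative placeholders."""

    def __init__(self, audio_dim=256, visual_dim=512, fused_dim=256, factor=4):
        super().__init__()
        self.factor = factor
        # Project each modality into a shared (fused_dim * factor)-dimensional space.
        self.audio_proj = nn.Linear(audio_dim, fused_dim * factor)
        self.visual_proj = nn.Linear(visual_dim, fused_dim * factor)

    def forward(self, audio, visual):
        # audio: (batch, audio_dim), visual: (batch, visual_dim)
        joint = self.audio_proj(audio) * self.visual_proj(visual)      # element-wise interaction
        joint = joint.view(-1, joint.size(1) // self.factor, self.factor)
        fused = joint.sum(dim=2)                                       # sum-pool over the factor dimension
        # Power normalization followed by L2 normalization, common for bilinear pooling.
        fused = torch.sign(fused) * torch.sqrt(torch.abs(fused) + 1e-8)
        return nn.functional.normalize(fused, dim=1)

# Example: fuse a 256-d speech embedding with a 512-d lip embedding.
fusion = FactorizedBilinearPooling()
z = fusion(torch.randn(8, 256), torch.randn(8, 512))  # -> (8, 256)
```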
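Similarly, the attention-weighted visual features in the third contribution can be illustrated with a generic cross-modal attention sketch, in which audio frames act as queries over a sequence of visual (lip motion and face) embeddings, producing visual attention features aligned with the auditory features. All names and dimensions here are illustrative assumptions, not the thesis's configuration.

```python
import torch
import torch.nn as nn

class AudioGuidedVisualAttention(nn.Module):
    """Illustrative cross-modal attention: audio frames query a sequence of
    visual features, and the resulting attention weight matrix re-weights the
    visual features toward the audio content."""

    def __init__(self, audio_dim=256, visual_dim=512, attn_dim=128):
        super().__init__()
        self.query = nn.Linear(audio_dim, attn_dim)
        self.key = nn.Linear(visual_dim, attn_dim)
        self.value = nn.Linear(visual_dim, attn_dim)

    def forward(self, audio_seq, visual_seq):
        # audio_seq: (batch, T_a, audio_dim), visual_seq: (batch, T_v, visual_dim)
        q = self.query(audio_seq)                  # (batch, T_a, attn_dim)
        k = self.key(visual_seq)                   # (batch, T_v, attn_dim)
        v = self.value(visual_seq)                 # (batch, T_v, attn_dim)
        # Attention weight matrix relating every audio frame to every visual frame.
        weights = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return weights @ v                         # (batch, T_a, attn_dim): audio-aligned visual features

attn = AudioGuidedVisualAttention()
out = attn(torch.randn(2, 100, 256), torch.randn(2, 25, 512))  # -> (2, 100, 128)
```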