Music is an art form with a long history and a medium that people frequently encounter in daily life. In recent years, the rapid development of deep learning has brought new research directions and challenges to traditional signal processing. This thesis focuses on music signal processing, with the goal of applying deep learning to two tasks: the stereo reconstruction of monophonic music based on visual information, and automatic music transcription.

The key to visually informed stereo reconstruction is effectively extracting spatial information from the video and fusing it into the audio signal, so that the monophonic signal acquires a stereophonic effect. A survey of related work shows that the audio-visual feature fusion algorithms proposed by existing research are relatively simple and introduce excessive noise into the audio signal, limiting the quality of the resulting stereo. To address this problem, this thesis designs an audio-visual feature fusion algorithm based on the self-attention mechanism, which retains important visual features while filtering out irrelevant information. The algorithm injects the important spatial information into the audio signal while introducing as little noise as possible, ensuring a high-quality stereo signal. In addition, inspired by related research on audio source separation, an iterative network structure is designed to further improve the quality of the generated stereo. Comparison with the experimental results of previous studies shows that the proposed algorithm achieves state-of-the-art performance.

For automatic music transcription, previous work generally downmixed stereo signals to mono, which fails to make full use of stereo information and limits the accuracy of transcription models. To transcribe stereo music directly, a stereo feature enhancement module is designed to fully extract the correlation and difference information between the two stereo channels, improving the performance and robustness of the transcription model. In addition, a temporal convolutional module is designed to model the temporal structure of music while maintaining the model's runtime efficiency and transcription quality. Experiments on the relevant datasets show that the proposed algorithms are effective. Finally, based on a summary of this work, suggestions for future research and innovation directions are put forward.
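To make the central fusion idea concrete, the following is a minimal, illustrative sketch of cross-modal attention in which audio frames (queries) attend over visual features (keys/values), so that each audio frame selectively absorbs the most relevant spatial cues while down-weighting irrelevant ones. This is not the thesis's actual architecture; the function name, feature dimensions, and the residual injection step are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_visual_attention(audio_feats, visual_feats):
    """Illustrative cross-modal attention (not the thesis's exact model).

    audio_feats:  (T_audio, d) audio frame features, used as queries
    visual_feats: (T_visual, d) visual features, used as keys and values
    Returns audio features with attended visual context injected residually.
    """
    d = audio_feats.shape[-1]
    # scaled dot-product scores: how relevant each visual token is to each frame
    scores = audio_feats @ visual_feats.T / np.sqrt(d)      # (T_audio, T_visual)
    weights = softmax(scores, axis=-1)                       # rows sum to 1
    context = weights @ visual_feats                         # (T_audio, d)
    # residual injection: keep the audio signal, add only the selected cues
    return audio_feats + context

# toy example: 8 audio frames and 4 visual tokens, 16-dim features
rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 16))
visual = rng.standard_normal((4, 16))
fused = audio_visual_attention(audio, visual)
print(fused.shape)  # (8, 16)
```

The softmax weighting is what realizes the "retain important features, filter irrelevant information" behavior described above: visual tokens with low relevance scores contribute almost nothing to the fused representation.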