As material living standards improve, people pay increasing attention to their cultural and spiritual lives, and more and more of them take up music education. Piano education, one of the most mature branches of music education, has attracted many students. Automatic Music Transcription (AMT) converts a piano performance into symbolic form: it detects the notes being played and outputs the pitch, start time, and end time of each note, which helps performers record their playing and improve it. This thesis studies and implements an automatic piano transcription system that takes the audio or video of a piano performance as input and detects the performance information of each note (pitch, start time, and end time) from the image or the sound. The main contributions are as follows. (1) To address the lack of audio-visual fusion datasets, we created the Play Dataset. This research proposes, for the first time, a practice mode and a performance mode of piano playing. Player characteristics, difficulty, lighting conditions, and the features of video and audio transcription were all taken into account when constructing the dataset. (2) We improved the existing audio transcription system and video transcription system. For audio transcription, we propose an energy balance algorithm that strengthens weak onset features; the F1 score on the first 30 s of MAPS ENSTDkCl is 88.38%. For video transcription, we propose dual-camera recording, which solves the problem of low recognition accuracy for keys perpendicular to the camera, and achieves an F1 score of 93.5% on the Play Dataset. (3) We designed and implemented two audio-video fusion transcription systems: a logical fusion system built on the single-modality audio and video transcription systems, and a network fusion system based on a CNN. Each has advantages in different application scenarios. Logical fusion is better suited to rapid system construction, while network fusion is suited to systems that require higher accuracy and robustness. On the Play Dataset, the logical fusion system achieved an F1 score of 94.5% and the network fusion system achieved 96.8%. Systematic experiments show that the audio-visual fusion systems implemented in this thesis are more accurate and more robust than existing piano transcription systems, and that the CNN-based network fusion system gives the best transcription results and can support piano teaching.