
Design And Research Of Intelligent Conference System Based On Multimode

Posted on: 2021-04-26    Degree: Master    Type: Thesis
Country: China    Candidate: J X Guo    Full Text: PDF
GTID: 2518306503464874    Subject: Computer Science and Technology
Abstract/Summary:
The intelligent conference system is an important part of modern enterprise operation. However, existing intelligent conference systems record and summarize meetings using only the text modality, leaving the video and audio information produced during a meeting largely unused. In addition, existing meeting records lack structured labels, so staff cannot use them to classify, archive, and retrieve the meeting for review afterwards. Effectively exploiting and structurally integrating the multimodal information of a meeting is therefore of great significance and application value.

Existing multimodal models can visually locate a sound source in video through audio-visual fusion. In a meeting scene, however, the confined space leads to low speaker-localization accuracy and strong voice reverberation, and these factors make it difficult for a multimodal intelligent conference system to integrate high-precision localization, identity recognition, and speech recognition. Addressing the low accuracy and difficult integration of multimodal information in intelligent conference systems, this thesis carries out the following research.

First, this thesis proposes a method for improving speaker-localization accuracy based on a multimodal model. Inaccurate speaker localization in conference scenes stems from the low quality and insufficient fusion of the audio-visual modal features in existing multimodal models. This thesis uses long short-term memory (LSTM) networks to enrich the fused audio-visual feature information and combines it with the facial modal information obtained by object detection to locate the speaker in the meeting scene accurately. Compared with the original model, the improved multimodal network raises the accuracy of the audio-visual action-recognition task by 4.6%, lowers the sound-source miss-detection rate by 4.53%, and reaches 90.8% accuracy in locating the speaker in the meeting scene.

Second, this thesis proposes speech segmentation and voiceprint filtering based on speaker recognition. An acoustic model built on speaker identity vectors is established to recognize and segment the conference audio, embed speaker identity tags into it, and classify and store it automatically. After speech segmentation, the false rejection rate for participant identity is 6.78% and the false acceptance rate is 9.09%, lower than the 16.67% and 23.08% obtained by Bayesian-rule same-speaker speech segmentation. Building on speaker recognition and speech segmentation, and targeting the low accuracy of speaker-recognition models in scenes where several people speak simultaneously in the venue, this thesis further proposes an audio filtering technique that identifies the target speaker by an embedded speaker code in a multi-speaker environment. A speaker encoding network and an audio filtering network are built, and on this basis the directional separation of mixed speech signals is completed; the signal-to-distortion ratio of the separated speech reaches 12.6 dB, four times better than the unprocessed audio signal.

Finally, this thesis proposes a method for building a multimodal conference dataset. A reverse-transformation algorithm for the panoramic camera is proposed that both improves the efficiency of collecting participants' audio-visual actions and effectively removes the facial distortion introduced by spherical panoramic recording. In addition, the high-precision speaker identity tags obtained from voiceprint recognition are visualized and embedded into the conference video processed by the multimodal model, annotating each speaker's face and identity on screen. After annotation and segmentation, the conference video contains participant identities, facial annotations of participants, speech content, and other modal information, while the conference audio and video automatically receive labels such as the identity of the conference speaker and the date of the meeting. The resulting multimodal conference dataset not only contains the audio and video information of the meeting but also gives the conference information structure, facilitating post-meeting summarization and retrieval.
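The LSTM-based audio-visual fusion described above can be sketched in minimal NumPy. This is an illustrative stand-in, not the thesis's actual architecture: concatenation-based fusion, a hand-rolled single-layer LSTM cell, and a softmax head over spatial grid cells are all assumptions, and every name and shape is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_fuse(audio_feats, visual_feats, params):
    """Run a single-layer LSTM over concatenated per-frame audio and
    visual features; return the final hidden state as the fused code."""
    x_seq = np.concatenate([audio_feats, visual_feats], axis=1)  # (T, Da+Dv)
    H = params["W_f"].shape[0]
    h, c = np.zeros(H), np.zeros(H)
    for x in x_seq:
        z = np.concatenate([x, h])                    # input + recurrence
        f = sigmoid(params["W_f"] @ z + params["b_f"])  # forget gate
        i = sigmoid(params["W_i"] @ z + params["b_i"])  # input gate
        o = sigmoid(params["W_o"] @ z + params["b_o"])  # output gate
        g = np.tanh(params["W_g"] @ z + params["b_g"])  # candidate cell
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

def locate_speaker(h, W_loc):
    """Project the fused state to a softmax over G spatial grid cells,
    giving a probability that the speaker sits in each cell."""
    logits = W_loc @ h
    p = np.exp(logits - logits.max())
    return p / p.sum()
```

In practice the grid cell with the highest probability would be intersected with the face boxes from the object detector to pin the location to a specific participant.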
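Speech segmentation driven by speaker identity vectors can be sketched as follows, under the common assumption that each audio window is embedded into a vector and matched against enrolled participants by cosine similarity; the threshold value and the window-merging rule are illustrative, not the thesis's actual acoustic model.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two speaker embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def segment_by_speaker(window_embs, enrolled, threshold=0.6):
    """Assign each embedding window to the best-matching enrolled
    speaker (or 'unknown' below the threshold), then merge consecutive
    windows with the same label into (label, start, end) segments."""
    labels = []
    for e in window_embs:
        scores = {name: cosine(e, v) for name, v in enrolled.items()}
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] >= threshold else "unknown")
    segments = []
    for idx, lab in enumerate(labels):
        if segments and segments[-1][0] == lab:
            segments[-1] = (lab, segments[-1][1], idx)  # extend segment
        else:
            segments.append((lab, idx, idx))            # open new segment
    return segments
```

The segment labels are exactly the speaker identity tags the abstract describes embedding into the conference audio for automatic classification and storage.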
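The separation objective behind the audio filtering network can be illustrated without the trained model itself: a per-bin soft mask (here an ideal ratio mask standing in for the mask a speaker-conditioned filter network would emit) is applied to the mixture spectrogram, and a signal-to-distortion ratio in dB (presumably what the abstract's "12.6 dB deviation ratio" measures) quantifies the improvement. Everything here is a simplified stand-in for the thesis's networks.

```python
import numpy as np

def ideal_ratio_mask(target_spec, interference_spec):
    """Oracle per-bin mask; a trained filter network approximates this
    from the mixture plus the target speaker's embedding."""
    return target_spec / (target_spec + interference_spec + 1e-8)

def apply_speaker_mask(mix_spec, mask):
    """Apply the soft mask to the mixture magnitude spectrogram."""
    return mix_spec * mask

def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB: reference energy over the
    energy of the residual error."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))
```

With overlapping speakers the mixture's SDR against the target is low; masking suppresses the interfering bins and raises it, which is the directional-separation effect the abstract reports.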
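The panoramic reverse transformation can be sketched as sampling a rectilinear (perspective) view out of an equirectangular panorama, which is what removes the spherical distortion around a participant's face. The projection math below is standard; the function name, parameters, and nearest-neighbour sampling are illustrative assumptions, not the thesis's algorithm.

```python
import numpy as np

def equirect_to_perspective(pano, fov_deg, yaw_deg, pitch_deg, out_w, out_h):
    """Sample a virtual pinhole-camera view (pointing at yaw/pitch with
    the given horizontal FOV) from an equirectangular panorama."""
    H, W = pano.shape[:2]
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2)   # focal length in px
    # pixel grid of the virtual camera, centred on the optical axis
    xs = np.arange(out_w) - out_w / 2
    ys = np.arange(out_h) - out_h / 2
    x, y = np.meshgrid(xs, ys)
    z = np.full_like(x, f, dtype=float)
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    # rotate each viewing ray: pitch about the x-axis, then yaw about y
    y2 = y * np.cos(pitch) - z * np.sin(pitch)
    z2 = y * np.sin(pitch) + z * np.cos(pitch)
    x3 = x * np.cos(yaw) + z2 * np.sin(yaw)
    z3 = -x * np.sin(yaw) + z2 * np.cos(yaw)
    # ray direction -> spherical longitude/latitude
    lon = np.arctan2(x3, z3)
    lat = np.arctan2(y2, np.sqrt(x3 ** 2 + z3 ** 2))
    # spherical coords -> equirectangular pixel coords (nearest neighbour)
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return pano[v, u]
```

Pointing the virtual camera at each detected face yields an undistorted crop suitable for the face-annotation and dataset-collection steps described above.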
Keywords/Search Tags: Intelligent Conference System, Multi-modal, Voiceprint Recognition, Directional Vocal Separation