
Multi-speaker Recognition Based On Audio Video Information Fusion In Meeting Room Environment

Posted on: 2012-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: P Pan
Full Text: PDF
GTID: 2178330335466988
Subject: Signal and Information Processing
Abstract/Summary:
With the steady development of sensor technology and the continual advancement of audio-visual processing algorithms, speaker recognition by audio-visual fusion has become an important technology in the identity recognition field. A typical application is speaker recognition in a meeting room environment. This thesis used the audio-visual meetings in the AMI Corpus as experimental material and recognized multiple speakers in a meeting room environment using an audio-visual fusion method. The details are as follows.

First, this thesis identified the most dominant person in a meeting using single features such as speaking length, speaking energy, and number of speaking turns, as well as combinations of these features. The effectiveness of each single feature and each feature combination was then analyzed and ranked. Meetings with several dominant persons were also examined, using both hard and soft criteria.

Next, an audio-based speaker recognition system was designed with reference to the ICSI RT07s system. In the speech activity detection stage, the speech/non-speech detector was modeled with a Gaussian Mixture Model. Compared with the Hidden Markov Model based speech/non-speech detector in the ICSI RT07s system, the Gaussian Mixture Model based detector is simpler and more extensible, which is one innovation of this research. The tunable parameters of the model training process were then optimized, and training was completed with the optimal value of each parameter.

In the video recognition stage, two frame difference methods were compared mathematically. The better-performing method was used to detect the participant with the maximum face activity level in each video clip, and the detected person was treated as the speaker of the meeting.
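The face activity idea can be illustrated with a minimal sketch. It assumes a plain mean absolute difference between consecutive grayscale frames; the abstract does not say which two frame difference variants the thesis compared, so this is not necessarily either of them. The names `activity_level` and `most_active_participant`, and the assumption that each participant's face region has already been cropped, are hypothetical.

```python
from typing import Dict, Sequence

# A grayscale face crop: rows of pixel intensities (assumed pre-cropped).
Frame = Sequence[Sequence[float]]


def activity_level(frames: Sequence[Frame]) -> float:
    """Mean absolute pixel difference between consecutive frames,
    averaged over the whole clip. Assumption: simple absolute difference."""
    if len(frames) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(frames, frames[1:]):
        diff_sum = 0.0
        count = 0
        for row_prev, row_curr in zip(prev, curr):
            for p, c in zip(row_prev, row_curr):
                diff_sum += abs(c - p)
                count += 1
        total += diff_sum / count if count else 0.0
    return total / (len(frames) - 1)


def most_active_participant(clips: Dict[str, Sequence[Frame]]) -> str:
    """Return the participant whose face region changes the most."""
    return max(clips, key=lambda name: activity_level(clips[name]))
```

Because the sketch works on decoded pixel values only, it does not depend on how the video was compressed.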
Compared with the MPEG-based methods used in research abroad, the frame difference method in this thesis can be applied to any video format, which is another innovation of the research.

After obtaining the audio and video recognition results, this thesis matched the two sets of results with a greedy matching fusion method, completing the fusion of recognition results from the different modalities. The effectiveness of the fusion algorithm was tested on 58 synchronized audio-video meeting segments. Experiments showed that recognition accuracy increases with the length of the meeting segments. The overall recognition rate over all 58 meeting segments reached 74.14%.

Compared with traditional single-modality speaker recognition, recognizing speakers through information fusion considerably improves the continuity and robustness of the recognition process. When the signal of one modality is corrupted by interference or occluded, speaker recognition can still proceed using the signal of the other modality. In addition, combined audio-video speaker recognition lets researchers see the speaker's appearance while recognizing the voice, which makes recognition more direct and vivid. These points constitute the significance of the research.
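As a rough illustration of greedy matching between modalities, the sketch below pairs audio speaker clusters with video participants by repeatedly taking the highest remaining agreement score. The abstract does not specify the thesis's scoring or matching details, so the score definition (for example, seconds in which both are active), the function name `greedy_match`, and the one-to-one constraint are all assumptions.

```python
from typing import Dict, Tuple


def greedy_match(scores: Dict[Tuple[str, str], float]) -> Dict[str, str]:
    """Pair audio clusters with video participants, best score first.

    scores maps (audio_cluster, video_participant) to an agreement score
    (hypothetical: e.g. seconds in which both are active).
    """
    matches: Dict[str, str] = {}
    used_video = set()
    # Visit candidate pairs from the highest score to the lowest.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (audio, video), _score in ranked:
        if audio in matches or video in used_video:
            continue  # one side of this pair is already assigned
        matches[audio] = video
        used_video.add(video)
    return matches


# Hypothetical scores for two audio clusters and two participants.
example = {
    ("spk0", "A"): 12.0,
    ("spk0", "B"): 3.0,
    ("spk1", "A"): 10.0,
    ("spk1", "B"): 4.0,
}
pairs = greedy_match(example)
```

Greedy selection is simple and fast but not guaranteed to maximize the total score; an optimal assignment method would be an alternative design.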
Keywords/Search Tags: Audio Feature, Speech Activity Detection, Gaussian Mixture Model, Face Activity Detection, Frame Difference, Matching Fusion