Research On Speaker Tracking Algorithm Based On Fusion Of Audio And Video Information

Posted on: 2023-08-12    Degree: Master    Type: Thesis
Country: China    Candidate: Z C Xiong    Full Text: PDF
GTID: 2568307031492484    Subject: Electronic and communication engineering
Abstract/Summary:
Speaker tracking algorithms based on the fusion of audio and video information have attracted increasing attention due to their potential applications in video conferencing, individual speaker discrimination, surveillance, and monitoring, among others. In practice, audio-video fusion tracking faces many challenges, including the fusion of multi-modal information, estimation of the time-varying number of speakers and their states, and handling of tracking errors under conditions such as occlusion, limited camera field of view, changing lighting, and room reverberation. Integrating audio and video information to locate and track speakers is therefore an active research topic. This thesis addresses several of these challenges within a Bayesian framework and studies two speaker tracking algorithms based on audio-video multimodal information fusion, one for spatially distributed sensors and one for co-located sensors.

For tracking speakers with spatially distributed datasets, a particle filtering algorithm based on the random finite set (RFS) framework is proposed. In the RFS approach, the computational cost grows rapidly as the number of speakers increases. To address this problem, a probability hypothesis density (PHD) filter is adopted and combined with a sequential Monte Carlo (SMC) implementation. In the proposed audio-visual mean-shift sequential Monte Carlo probability hypothesis density (AVMS-SMC-PHD) tracking algorithm, audio data determine when to propagate and re-allocate particles according to their types, and the mean-shift (MS) method is applied to the tracking system, which moves the estimated position closer to the speaker's true position and improves both the estimation accuracy and the computational efficiency of the algorithm.

For tracking speakers with co-located datasets, a novel audio-video information fusion (AV) tracking algorithm is proposed for multi-speaker tracking. Audio location information is combined with three-dimensional (3D) mouth information from face detection to improve the video likelihood function, and 3D mouth-height information assists the audio observations to improve the audio likelihood. Moreover, an adjustable weight is introduced to better integrate the audio positioning information and the 3D mouth information across different scenes. Compared with previous methods, the proposed algorithm removes the computation and comparison of the color model and directly fuses the 3D audio and video positioning information, which not only performs well in different scenarios but also greatly improves tracking efficiency.
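The thesis does not provide code; as a rough illustration of the fusion ideas described above, the Python sketch below shows how audio and video likelihoods might be combined with an adjustable weight to re-weight particles, and how a mean-shift step can refine the weighted estimate. All function names, noise levels (sigma_a, sigma_v, bandwidth), and the mixing weight alpha are illustrative assumptions, not values taken from the thesis.

```python
import numpy as np

def gaussian_likelihood(particles, obs, sigma):
    """Isotropic Gaussian likelihood of each particle given one observation."""
    d2 = np.sum((particles - obs) ** 2, axis=1)
    return np.exp(-0.5 * d2 / sigma ** 2)

def fused_weights(particles, audio_obs, video_obs, alpha,
                  sigma_a=0.3, sigma_v=0.1):
    """Re-weight particles with an adjustable audio/video mixing weight alpha."""
    p_audio = gaussian_likelihood(particles, audio_obs, sigma_a)
    p_video = gaussian_likelihood(particles, video_obs, sigma_v)
    w = alpha * p_audio + (1.0 - alpha) * p_video
    return w / (w.sum() + 1e-12)

def mean_shift_estimate(particles, weights, bandwidth=0.2, iters=5):
    """Shift the weighted mean toward the local mode of the particle cloud."""
    x = np.average(particles, axis=0, weights=weights)
    for _ in range(iters):
        k = weights * np.exp(
            -0.5 * np.sum((particles - x) ** 2, axis=1) / bandwidth ** 2)
        if k.sum() < 1e-12:
            break
        x = np.average(particles, axis=0, weights=k)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_pos = np.array([1.0, 2.0, 1.5])          # speaker position (metres)
    particles = true_pos + rng.normal(0, 0.5, (500, 3))
    audio_obs = true_pos + rng.normal(0, 0.3, 3)  # noisy acoustic localisation
    video_obs = true_pos + rng.normal(0, 0.1, 3)  # noisy 3D mouth position
    w = fused_weights(particles, audio_obs, video_obs, alpha=0.4)
    print("estimate:", mean_shift_estimate(particles, w))
```

In this sketch a smaller alpha trusts the (typically more precise) visual mouth observation more, while a larger alpha leans on the acoustic localisation, mirroring the adjustable weighting between modalities described in the abstract.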
Keywords/Search Tags:Speaker tracking, audio and video fusion, particle filtering, likelihood function, probability hypothesis density