Speaker localization and tracking is an active research topic in human-computer interaction, with applications in multimedia systems, video conferencing, video surveillance, and intelligent robotics. Because real-world environments are noisy and strongly reverberant, localizing and tracking a speaker from audio information alone remains challenging. This thesis focuses on robust methods for speaker localization based on audio information and for speaker tracking that combines audio and video information. First, we introduce classical methods for estimating the time delay of arrival (TDOA) between a single microphone pair, as well as TDOA estimation based on the multi-channel correlation coefficient, which uses multiple microphone pairs. In real-world applications, these traditional TDOA estimation methods may fail because of noise and reverberation, and the audio signal may contain both valid and invalid time frames. This thesis models the probability density of activeness for circular microphone arrays, which is used to distinguish valid frames from invalid ones. We then propose a RANSAC algorithm that uses the TDOA estimates from the valid frames for robust speaker localization. Compared with the traditional methods, the proposed method achieves greater robustness and better accuracy. Furthermore, we introduce the mean shift algorithm for object tracking using video information. Finally, we propose to track the speaker with a distributed Kalman filter that uses both audio and video information. Its advantage is that, by using both cues, the shortcomings of each single modality are avoided while the two modalities complement each other, so that the speaker can be located more accurately.
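As an illustration of pairwise TDOA estimation, the following is a minimal sketch of generalized cross-correlation with phase transform (GCC-PHAT), a widely used classical estimator. It is offered under stated assumptions and is not necessarily the estimator used in this thesis; the function name `gcc_phat` and the parameters `fs` (sampling rate in Hz) and `max_tau` (largest physically plausible delay in seconds) are hypothetical names introduced here.

```python
import numpy as np


def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (in seconds) of `sig` relative to `ref` using GCC-PHAT.

    A positive result means `sig` lags behind `ref`.
    All names here are illustrative assumptions, not taken from the thesis.
    """
    # Zero-pad so the circular correlation behaves like a linear one.
    n = len(sig) + len(ref)
    sig_spec = np.fft.rfft(sig, n=n)
    ref_spec = np.fft.rfft(ref, n=n)

    # Cross-power spectrum with PHAT weighting: keep the phase, drop the magnitude.
    cross = sig_spec * np.conj(ref_spec)
    cross /= np.abs(cross) + 1e-12  # small constant avoids division by zero

    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to plausible lags (at least one sample either side).
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so index 0 is lag -max_shift and the centre is lag 0.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs


# Synthetic demo: a white-noise reference and a copy delayed by a few samples.
rng = np.random.default_rng(0)
fs = 16000
ref = rng.standard_normal(1024)
delay = 5
sig = np.concatenate((np.zeros(delay), ref[:-delay]))
print("estimated delay in samples:", round(gcc_phat(sig, ref, fs) * fs))
```

The PHAT weighting discards spectral magnitude, which tends to sharpen the correlation peak under moderate reverberation. In strongly noisy or reverberant frames the peak can still be wrong, which is the failure mode that motivates selecting valid frames before localization.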