
The Modeling Research In Speaker Diarization

Posted on: 2017-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y Xu
GTID: 2308330485953725
Subject: Information and Communication Engineering
Abstract/Summary:
With the growth of the internet and big data, the amount of information people receive in daily life is increasing rapidly. Much of it is speech: telephone conversations, television programs, and online shows. In addition, as technology develops and electronic devices spread, voice mail and voice search play an increasingly important role in daily life. Extracting useful information from this mass of data is both an active research topic and a difficult problem.

Speaker diarization is combined with a variety of audio processing techniques, and it enables precise management of speech recordings and their content. It has therefore attracted considerable research interest. For example, groups at MIT, LIMSI, Cambridge, and Berkeley have carried out research on it and reported results. However, several difficulties remain: inaccurate modeling in complex conditions, poor representation of short speech segments, slow clustering, and the difficulty of determining the number of speakers. This dissertation focuses on these issues. Its main contributions are summarized as follows.

To address inaccurate modeling in complex scenes, a supervised modeling method is applied to speaker diarization. A deep neural network (DNN) replaces the traditional modeling method in order to extract deep, complex information from the speech signal. The phoneme states at the DNN output define the initial set of classes and are combined with total variability modeling. In this way the phonetic features and the speaker features are effectively decoupled, which yields a more robust representation of the speech segments and improves system performance.

To address the poor representation of short-duration segments, the total variability model is built on the deep neural network. Explicit modeling of short-duration segments compensates for intra-conversation intra-speaker variability and reduces the negative impact of interfering information, so that the low-dimensional i-vector retains the correct speaker information. As a result, short-duration segments are represented well.

To make modeling and clustering more efficient, spectral clustering replaces hierarchical agglomerative clustering. The affinity matrix is constructed from the distances between segments. An improved eigengap method is used to find the best number of clusters, which is also the number of speakers in the recording. The clustering itself is based on an analysis of the eigenstructure of the affinity matrix. Spectral clustering not only solves the problem of detecting the number of speakers but is also more efficient than agglomerative hierarchical clustering.

Three methods are proposed in this dissertation to address the current difficulties in speaker diarization. The experimental results indicate that they improve system performance significantly.
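The spectral clustering step described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract does not specify the "improved" eigengap method, so the plain eigengap heuristic is used instead, and the Gaussian kernel on cosine distance, the `sigma` value, and all function names are assumptions chosen for illustration. Each row of `X` stands in for one segment's representation (for example an i-vector).

```python
import numpy as np


def affinity_from_segments(X, sigma=0.5):
    """Gaussian affinity from pairwise cosine distances (kernel choice is an assumption)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    dist = 1.0 - Xn @ Xn.T                      # cosine distance between segments
    A = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)                    # no self-affinity
    return A


def estimate_num_speakers(A, max_speakers=10):
    """Plain eigengap heuristic on the normalized Laplacian (not the thesis's improved variant)."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    k_max = min(max_speakers, len(A) - 1)
    gaps = np.diff(eigvals[:k_max + 1])         # gaps[i] = eigvals[i + 1] - eigvals[i]
    k = int(np.argmax(gaps)) + 1                # number of eigenvalues below the largest gap
    return k, eigvecs[:, :k]


def kmeans(Y, k, n_iter=50):
    """Small k-means with deterministic farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(1, k):
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(d2, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Y[labels == j].mean(axis=0)
    return labels


def cluster_segments(X, sigma=0.5, max_speakers=10):
    """Return (estimated speaker count, one cluster label per segment)."""
    A = affinity_from_segments(X, sigma)
    k, U = estimate_num_speakers(A, max_speakers)
    # Row-normalize the spectral embedding before k-means.
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    return k, kmeans(U, k)
```

Calling `cluster_segments(X)` on an array with one row per segment returns the estimated number of speakers together with a label for each segment. The speaker count comes directly from the eigenvalue spectrum, so no stopping threshold of the kind used in agglomerative clustering needs to be tuned, although `sigma` still has to be chosen for the data at hand.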
Keywords/Search Tags: speaker diarization, deep neural network, total variability, intra-conversation intra-speaker variability modeling, spectral clustering