Research On Multilingual Documents Clustering Based On Parallel Information Bottleneck

Posted on: 2017-05-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y E Lu
Full Text: PDF
GTID: 2348330485480423
Subject: Computer software and theory
Abstract/Summary:
The rapid development of the Internet and the deepening trend of globalization have made network data grow quickly. In particular, the advent of machine translation systems has produced a large amount of multilingual document data on the network. When clustering such data, traditional text clustering algorithms consider the information in each language separately and ignore the potential relations between languages, so the resulting data model is biased by single-language data.

The Information Bottleneck (IB) method is a data analysis method based on rate-distortion theory, and it has a distinct advantage in clustering high-dimensional sparse data. It treats the extraction of data patterns as a process of data compression: finding a maximally compressed mapping of the input variable that preserves as much information as possible about the output variable. This helps uncover the internal patterns contained in the data objects. The IB approach has already been applied successfully in many fields. Multivariate Information Bottleneck is an improved variant of the traditional IB method that is well suited to multilingual document data. It mainly comprises the parallel IB and the symmetric IB.

To address the problem of multilingual document clustering, this work proposes a clustering algorithm based on the parallel IB method: the ML-PIB algorithm. The proposed method not only considers multiple languages and mines the associated information between them, but also effectively improves clustering quality. First, the different languages are considered and a corresponding relevant variable is built for each. Then, mutual information is used to measure the amount of information shared among the feature information of the different languages. Finally, an information-theoretic optimization method is used to optimize the objective function, with the aim of reaching a local optimum.

Experimental results on the Reuters Multilingual data set show that the ML-PIB algorithm can efficiently handle the various kinds of information in multilingual documents. Compared with the sIB, k-means, PLSA and LDA algorithms, ML-PIB achieves higher Clustering Accuracy and Normalized Mutual Information, and it also shows clear advantages over five existing multilingual document clustering algorithms.
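
The abstract does not state the ML-PIB objective function, so the following is only a minimal sketch of the general idea it describes, not the thesis's algorithm. The standard IB method looks for a compressed variable T that minimizes I(T;X) − β·I(T;Y). The sketch below evaluates only the information-preservation side for a fixed hard partition of documents, summing I(T;Y_i) over one word-count matrix per language. The function names (`mutual_information`, `cluster_joint`, `preserved_information`), the equal weighting of languages, and the toy counts are assumptions made for illustration.

```python
import numpy as np


def mutual_information(joint):
    """Mutual information I(A;B) in bits from a joint table of counts or probabilities."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                 # normalize to a probability table
    pa = joint.sum(axis=1, keepdims=True)       # marginal p(a), shape (n, 1)
    pb = joint.sum(axis=0, keepdims=True)       # marginal p(b), shape (1, m)
    mask = joint > 0                            # treat 0 * log 0 as 0
    ratio = joint[mask] / (pa @ pb)[mask]
    return float((joint[mask] * np.log2(ratio)).sum())


def cluster_joint(doc_word_counts, labels, n_clusters):
    """Merge document rows by hard cluster label to obtain the table p(t, y)."""
    counts = np.asarray(doc_word_counts, dtype=float)
    merged = np.zeros((n_clusters, counts.shape[1]))
    np.add.at(merged, np.asarray(labels), counts)   # sum rows that share a label
    return merged / counts.sum()


def preserved_information(views, labels, n_clusters):
    """Sum of I(T;Y_i) over language views for one shared partition T (assumed equal weights)."""
    return sum(
        mutual_information(cluster_joint(view, labels, n_clusters))
        for view in views
    )


if __name__ == "__main__":
    # Toy data (invented): 4 documents, 2 word features per language view.
    english = [[4, 0], [4, 0], [0, 4], [0, 4]]
    french = [[3, 1], [3, 1], [1, 3], [1, 3]]
    views = [english, french]

    aligned = preserved_information(views, [0, 0, 1, 1], n_clusters=2)
    mixed = preserved_information(views, [0, 1, 0, 1], n_clusters=2)
    print("aligned partition:", aligned)
    print("mixed partition:  ", mixed)
```

In this toy setup the partition that groups documents with similar word distributions preserves more information about both language views than the mixed partition does. A search procedure, such as the sequential reassignment used by sIB-style methods, would try to improve such a score one document at a time, which is why only a local optimum is reached.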
Keywords/Search Tags: IB theory, Parallel IB, Multilingual, Document clustering, Mutual information