Research On Multilingual Documents Clustering Based On Parallel Information Bottleneck

Posted on:2017-05-22

Degree:Master

Type:Thesis

Country:China

Candidate:Y E Lu

GTID:2348330485480423

Subject:Computer software and theory

Abstract/Summary:

The rapid development of the Internet and deepening globalization have caused network data to grow rapidly. In particular, the advent of machine translation systems has put a large amount of multilingual document data on the network. When clustering such data, traditional text clustering algorithms consider the information of each language in isolation and ignore the potential relations between languages, so the resulting data model is biased toward a single language.

The Information Bottleneck (IB) method is a data analysis method based on rate-distortion theory, and it is particularly well suited to clustering high-dimensional sparse data. It treats the extraction of data patterns as a process of data compression: finding a maximally compressed mapping of the input variable that preserves as much information as possible about the output variable. This helps to uncover the internal patterns contained in the data objects effectively. The IB approach has already been applied successfully in many fields. Multivariate Information Bottleneck is an extension of the traditional IB method that is well suited to multilingual document data; its main forms are the parallel IB and the symmetric IB.

To address multilingual document clustering, this thesis proposes a clustering algorithm based on the parallel IB method, the ML-PIB algorithm. The proposed method takes multiple languages into account, mines the information shared between them, and effectively improves clustering quality. First, the different languages are considered and a corresponding relevant variable is built for each. Then mutual information is used to measure the amount of information carried by the feature information of the multiple languages. Finally, an information-theoretic optimization method is used to optimize the objective function, with the aim of reaching a local optimum.

Experimental results on the Reuters Multilingual data set show that the ML-PIB algorithm handles the varied information in multilingual documents efficiently. Compared with the sIB, k-means, PLSA and LDA algorithms, ML-PIB achieves higher Clustering Accuracy and Normalized Mutual Information, and it also shows clear advantages over five existing multilingual document clustering algorithms.
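The role mutual information plays in the abstract can be illustrated with a small Python sketch. This is not the ML-PIB algorithm, whose exact objective and optimizer are not given in the abstract. It assumes a single hard partition of the documents shared by all languages, and scores that partition by the summed mutual information between the cluster variable and each language's word variable. The function names, the toy count matrices and the summed score are all hypothetical choices made for illustration.

```python
import numpy as np


def mutual_information(joint):
    """Return I(A;B) in nats for a non-negative joint table over (a, b)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()            # normalize counts to p(a, b)
    pa = joint.sum(axis=1, keepdims=True)  # marginal p(a), shape (n, 1)
    pb = joint.sum(axis=0, keepdims=True)  # marginal p(b), shape (1, m)
    indep = pa @ pb                        # p(a) * p(b), shape (n, m)
    mask = joint > 0                       # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / indep[mask])))


def cluster_joint(doc_word, labels, n_clusters):
    """Merge document rows into cluster rows under a hard assignment."""
    doc_word = np.asarray(doc_word, dtype=float)
    labels = np.asarray(labels)
    merged = np.zeros((n_clusters, doc_word.shape[1]))
    np.add.at(merged, labels, doc_word)    # merged[labels[i]] += doc_word[i]
    return merged


def multilingual_score(views, labels, n_clusters):
    """Hypothetical score: sum over languages of I(T; Y_language)."""
    return sum(
        mutual_information(cluster_joint(view, labels, n_clusters))
        for view in views
    )


# Toy document-by-word counts for the same four documents in two languages
# (assumed data). Documents 0-1 share one vocabulary block, 2-3 another.
en = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]])
fr = np.array([[3, 1, 0, 0], [2, 2, 0, 0], [0, 0, 1, 3], [0, 0, 2, 2]])

good_labels = [0, 0, 1, 1]  # groups documents that share vocabulary
bad_labels = [0, 1, 0, 1]   # mixes the two vocabulary blocks

good = multilingual_score([en, fr], good_labels, n_clusters=2)
bad = multilingual_score([en, fr], bad_labels, n_clusters=2)
print(f"good partition: {good:.4f} nats, bad partition: {bad:.4f} nats")
```

A partition that keeps documents with shared vocabulary together preserves more information about the words of every language, so it scores higher. An IB-style optimizer would search over `labels` to increase a score of this kind, typically reaching only a local optimum, as the abstract notes.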

Keywords/Search Tags:

IB theory, Parallel IB, Multilingual, Document clustering, Mutual information

Related items

1	A New Approach To Improve Web Search Results For Multilingual Documents
2	Research On Efficient Document Clustering Using Improvised Sub-Document Based Framework
3	Research On Parallel Non-Intervention Document Clustering Algorithm
4	Application And Research Of Web Document Clustering In Search Engine
5	Multilingual education model construction based on superior cognitive skills of multilingual students
6	InforadarML: A multi-lingual information discovery tool exploiting automatic document categorization
7	Design And Implementation Of A Document Clustering Algorithm Based On The GPU
8	Research On Multilingual Text Clustering
9	The Design And Implementation Of Neural Machine Translation System For Multilingual Mutual Translation
10	Research On Mutual Information Hierarchical Clustering Based On Grassberger Entropy Estimator