Research And Application Of Distributed Clustering

Posted on: 2012-11-20
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Du
GTID: 2178330332976237
Subject: Computer application technology
Abstract/Summary:
With the rapid development of computer technology, digital libraries have a growing impact on social progress and have attracted attention from many countries. To improve content-based retrieval of digital books, information must be mined from the books themselves, which is an important research direction for digital libraries. This thesis studies distributed clustering and its application in digital libraries.

As data grows explosively, traditional machine learning algorithms struggle to handle large-scale data. Many parallel algorithms have been proposed to address this problem, such as the MapReduce-based K-means algorithm and the distributed spectral clustering algorithm. AP clustering (Affinity Propagation clustering) was introduced to overcome some drawbacks of traditional clustering methods such as K-means. However, its scalability and performance still need to be improved for large-scale data. In this thesis, two parallel approaches based on AP clustering are proposed, using two different strategies: affinity sparsifying and hierarchical sampling. With the hierarchical sampling method, large-scale data are first randomly partitioned into several smaller subsets. All subsets are then sampled in parallel. The resulting data are fused and clustered again, which yields a set of high-quality exemplars. Finally, all data are assigned to the exemplars in parallel. Experiments on synthetic datasets, human face image datasets, and the Iris dataset demonstrate that this algorithm performs well in both scalability and accuracy.

Having studied MapReduce-based distributed computing in the Hadoop environment, we designed a MapReduce-based distributed affinity propagation clustering algorithm named DisAP, applied it to data mining and analysis in a digital library, and verified its scalability on large-scale data. A framework for multimedia information retrieval related to Traditional Chinese Medicine is also proposed. In this framework, digital books are first processed with image processing, feature extraction, keyword selection, and related techniques. The illustrations in the books and their titles are extracted and associated with the rich resources available on the Internet. Distributed affinity propagation clustering is then used to generate visual words that represent the images' features. Finally, the text and images are used to construct an inverted index, and an image retrieval user interface is implemented.
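
The hierarchical sampling strategy described in the abstract (partition, sample each subset, fuse, cluster again, assign) can be illustrated with a minimal single-process sketch. This is not the thesis's DisAP implementation: the function names `affinity_propagation` and `hierarchical_ap`, the damping factor, the iteration count, the median preference, and the use of a local AP run as the per-subset "sampling" step are all assumptions made for illustration. The per-subset loop and the final assignment run sequentially here; they are the steps the abstract describes as running in parallel.

```python
import random
from statistics import median

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def affinity_propagation(points, damping=0.5, iterations=100):
    """Plain affinity propagation; returns the points chosen as exemplars.
    Preference (self-similarity) is the median similarity, an assumed default."""
    n = len(points)
    if n <= 1:
        return list(points)
    S = [[-sq_dist(p, q) for q in points] for p in points]
    pref = median(S[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        S[k][k] = pref
    R = [[0.0] * n for _ in range(n)]  # responsibilities
    A = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        # R[i][k] = S[i][k] - max over k' != k of (A[i][k'] + S[i][k'])
        for i in range(n):
            vals = [A[i][k] + S[i][k] for k in range(n)]
            best = max(range(n), key=vals.__getitem__)
            first = vals[best]
            second = max(v for k, v in enumerate(vals) if k != best)
            for k in range(n):
                new = S[i][k] - (second if k == best else first)
                R[i][k] = damping * R[i][k] + (1 - damping) * new
        # A[i][k] = min(0, R[k][k] + sum over i' not in {i, k} of max(0, R[i'][k]))
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    chosen = [points[k] for k in range(n) if R[k][k] + A[k][k] > 0]
    return chosen or [points[0]]  # fallback if no exemplar emerged

def hierarchical_ap(data, n_subsets=4, seed=0):
    """Partition randomly, cluster each subset, fuse, cluster again, assign."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    subsets = [shuffled[i::n_subsets] for i in range(n_subsets)]
    # Step that would run in parallel: one local AP run per subset.
    local = [e for s in subsets if s for e in affinity_propagation(s)]
    # Fuse the local exemplars and cluster them again.
    exemplars = affinity_propagation(local)
    # Step that would run in parallel: assign each point to its nearest exemplar.
    labels = [min(range(len(exemplars)), key=lambda j: sq_dist(p, exemplars[j]))
              for p in data]
    return exemplars, labels

# Hypothetical demo data: two Gaussian blobs in 2-D.
rng = random.Random(42)
data = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
        for cx, cy in [(0, 0), (5, 5)] for _ in range(40)]
exemplars, labels = hierarchical_ap(data)
print(len(exemplars), "exemplars for", len(data), "points")
```

Because each subset is clustered independently and the second pass only sees the fused exemplars, no step needs the full pairwise similarity matrix of the original data, which is the scalability argument the abstract makes for this strategy.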
Keywords/Search Tags: Digital Library, Distributed Clustering, MapReduce, Image Retrieval, Visual Words