Research On Cloud Computing Search Engine Design And Parallelization K-means Clustering Algorithms For Big Data

Posted on: 2016-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: L Li
GTID: 2308330479483295
Subject: Instrumentation engineering

Abstract/Summary:
This thesis presents a design method for a big-data cloud computing search engine built on the YARN cloud computing framework. The method addresses problems found in existing big-data search engines: complex structure, high maintenance cost, and difficulty of implementation. To improve the poor adaptability of parallel clustering algorithms, the thesis also proposes an adaptive parallel Canopy-K-means algorithm. A cloud computing platform based on Spark and YARN was built, and comparative experiments on that platform verify the effectiveness of the algorithm. The main work on the search engine design method and the data mining algorithm is as follows:

① The thesis surveys the work of research institutions and technology companies at home and abroad on big-data search engines, and summarizes the current state of development and the remaining difficulties. After a detailed study of the Hadoop cloud computing framework, its core Map-Reduce computation model, and the theory behind the Spark cloud computing framework, a Spark-based cloud computing platform was constructed.

② Based on a study of the YARN cloud computing framework, the thesis proposes a design method for a big-data cloud computing search engine that consists of two stages: data organization, and comparison and retrieval. The design of a massive-scale face search engine is used as a worked example to show in detail how a search engine is designed and implemented on YARN. As a result, a massive-scale face recognition search engine based on YARN was designed.

③ For the data organization stage of the search engine, the thesis proposes a parallel adaptive Canopy-K-means clustering algorithm that is based on the Map-Reduce computation model and runs on the Spark cloud computing framework. The algorithm improves the adaptive parameter estimation in parallel Canopy-K-means: it uses statistical methods to remove the heavy dependence of the Canopy parameters on manual experience. Extensive experiments on UCI (University of California, Irvine) datasets of different scales and on a self-built facial feature dataset indicate that the proposed algorithm is more stable and more efficient in parallel computation than the K-means and Canopy-K-means algorithms. Running the algorithm on the Spark-on-YARN platform lets the program run efficiently regardless of data scale, which helps ensure the efficiency and reliability of the algorithm.
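
The Canopy-K-means idea described in the abstract can be illustrated with a minimal single-machine sketch in plain Python. It is not the thesis's algorithm: the abstract does not state how the Canopy parameters are estimated, so the heuristic in `estimate_t2` (half the mean pairwise distance) is an assumption made only for illustration, and all function names are hypothetical. The sketch uses only the tight Canopy threshold T2 to pick initial centers, and it does not reproduce the Map-Reduce or Spark parallelization.

```python
import math
from itertools import combinations


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_t2(points):
    """ASSUMED heuristic (not from the thesis): T2 = half the mean pairwise distance."""
    pairs = list(combinations(points, 2))
    return 0.5 * sum(dist(a, b) for a, b in pairs) / len(pairs)


def canopy_centers(points, t2):
    """Canopy pass: take a point as a center, drop all points within T2 of it, repeat."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if dist(p, center) > t2]
    return centers


def kmeans(points, centers, iterations=20):
    """Lloyd's K-means, seeded with the given centers instead of random ones."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    # Two well-separated groups of points; the number of clusters is not given in advance.
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    seeds = canopy_centers(data, estimate_t2(data))
    final_centers, final_clusters = kmeans(data, seeds)
    print(len(seeds), final_centers)
```

The point of the Canopy pass is that the number of clusters and the initial centers come from the data rather than from a hand-picked K; estimating the threshold from the data, as the thesis proposes to do with statistical methods, removes the remaining manual parameter.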
Keywords/Search Tags:Big Data, Map-Reduce, Retrieval System, Spark, K-means Clustering