Research On Cloud Computing Search Engine Design And Parallelization K-means Clustering Algorithms For Big Data

Posted on:2016-07-21

Degree:Master

Type:Thesis

Country:China

Candidate:L Li

GTID:2308330479483295

Subject:Instrumentation engineering

Abstract/Summary:

This thesis presents a design method for a big data cloud computing search engine built on the YARN cloud computing framework. The method addresses problems found in existing big data search engines: complex structure, high maintenance cost, and difficulty of implementation. To improve the poor adaptability of parallel clustering algorithms, the thesis also proposes an adaptive parallel Canopy-K-means algorithm. A cloud computing platform based on Spark and YARN was built, and comparative experiments on that platform verify the effectiveness of the algorithm.

The main work of the thesis on the search engine design method and the data mining algorithm is as follows:

① The thesis surveys the work of research institutions and technology companies in China and abroad on search engines for big data, and summarises the current state of development and the open difficulties. It then studies the Hadoop cloud computing framework, its core Map-Reduce computation model, and the theory behind the Spark cloud computing framework, and on that basis builds a cloud computing platform on Spark.

② Based on a study of the YARN cloud computing framework, the thesis proposes a design method for a big data cloud computing search engine. The method has two stages: data organization, followed by comparison and retrieval. A massive face search engine is used as a worked example to show in detail how a search engine is designed and implemented on the YARN framework, and a search engine for large-scale face recognition based on YARN is designed as a result.

③ For the data organization stage of the search engine, the thesis proposes a parallel adaptive Canopy-K-means clustering algorithm that is based on the Map-Reduce computation model and runs on the Spark cloud computing framework. The algorithm improves parameter estimation in the parallel Canopy-K-means algorithm: it uses statistical methods to estimate the Canopy parameters, which otherwise depend heavily on manual experience. Extensive experiments on UCI (University of California, Irvine) datasets of different scales and on a self-built facial feature dataset indicate that the proposed algorithm is more stable and more efficient in parallel computation than the K-means and Canopy-K-means algorithms. Running the algorithm on the Spark-based YARN cloud computing platform lets the program run efficiently regardless of data scale, which helps ensure the efficiency and reliability of the algorithm.
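The Canopy-K-means idea summarised in the abstract can be illustrated with a small single-machine sketch. It is not the thesis's implementation: the abstract does not say which statistics are used to estimate the Canopy parameters, so the heuristic below (loose threshold T1 as the mean pairwise distance, tight threshold T2 as half of T1) is purely an assumption, as are all function names. The sketch only shows how canopy centers can supply both the number of clusters and the initial centers for K-means without hand-tuned parameters.

```python
import math
from itertools import combinations


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_thresholds(points):
    """Assumed heuristic (not from the thesis): T1 is the mean pairwise
    distance and T2 is half of T1."""
    pairs = list(combinations(points, 2))
    t1 = sum(dist(a, b) for a, b in pairs) / len(pairs)
    return t1, t1 / 2


def canopy_centers(points, t2):
    """Pick a point as a canopy center, drop every point within T2 of it,
    and repeat until no points remain."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        remaining = [p for p in remaining[1:] if dist(p, center) > t2]
    return centers


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


def kmeans(points, centers, iterations=20):
    """Plain K-means started from the given centers."""
    for _ in range(iterations):
        # Assignment step (would be the "map" side in a Map-Reduce version).
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step (the "reduce" side): mean of each group.
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels


def canopy_kmeans(points):
    """Estimate thresholds, seed K-means with canopy centers, then refine."""
    _, t2 = estimate_thresholds(points)
    return kmeans(points, canopy_centers(points, t2))


if __name__ == "__main__":
    # Toy data: three well-separated groups of four points each.
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11),
            (20, 0), (20, 1), (21, 0), (21, 1)]
    centers, labels = canopy_kmeans(data)
    print(len(centers), labels)
```

On this toy input the sketch finds three canopies, so K-means starts with three centers and no value of k has to be supplied by hand. In a Spark setting the assignment and update steps would be distributed across partitions; that part is omitted here.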

Keywords/Search Tags:

Big Data, Map-Reduce, Retrieval System, Spark, K-means Clustering

Related items

1	Parallelizing K-means-based Clustering On Spark
2	Optimized Design And Implementation Of K-means Algorithm Based On Big Data Spark Platform
3	Research And Application Of Clustering Method For Big Visual Data
4	Research On Parallel Clustering Algorithm Based On Map-Reduce
5	Optimization And Application Of K-means Clustering Algorithm Based On Spark Framework
6	Research On Spark Oriented Fuzzy C-means Clustering Algorithm
7	Research Of The Clustering Algorithm Based On The Spark
8	Oneof Text Clustering Algorithm Based On Big Data
9	Construct High Performance Text Clustering Systems Based On Map-Reduce
10	Research And Application Of Parallelization Optimization Of Spatial Clustering Algorithm Based On Spark