Research On Query Processing And Analysis Technique Of Big Data In Cloud Environment

Posted on:2016-03-09

Degree:Master

Type:Thesis

Country:China

Candidate:F Wang

Full Text:PDF

GTID:2308330479976615

Subject:Computer Science and Technology

Abstract/Summary:

In recent years, as data volumes have grown rapidly, data processing technology has advanced considerably. Unlike traditional query and analysis techniques, query processing and analysis of massive data with the aid of cloud computing has its own essential characteristics. At present, massive data query processing and analysis technology in cloud environments is not yet mature, but its advantages and practicality are beyond doubt. Research on massive data processing and analysis in cloud environments is therefore of significant value.

Existing research has solved simple query processing problems. More complex query processing and analysis problems either cannot be solved or are solved with low efficiency. This thesis focuses on the k nearest neighbor (kNN) join query and kMeans cluster analysis of massive data in cloud environments. The main work is as follows:

(1) Based on the characteristics of massive data query processing and analysis in cloud environments, a data-flow-based computing framework is presented for multiple MapReduce jobs that depend on one another. Because MapReduce jobs must read their input from and write their output to the distributed file system, expressing dependencies between jobs is inefficient. The framework models the computation as a data flow diagram rather than as individual tasks, which reduces the data read and written between MapReduce jobs. A reasonable combination of subprocesses can also reduce execution time.

(2) The kNN join is a common operation in spatial databases. With the explosive growth of data, designing a distributed kNN join algorithm has become an urgent problem. Because existing distributed kNN query algorithms consist of several rounds of serial MapReduce tasks, we propose an efficient kNN join algorithm based on the data flow framework. The algorithm maps multi-dimensional data sets to one dimension using space-filling curves (z-values) and transforms the kNN join into a sequence of one-dimensional range searches.

(3) The traditional centralized kMeans algorithm cannot handle data at the current scale. The existing distributed kMeans clustering algorithm based on the MapReduce framework does not consider the influence of the initial cluster centers. This work therefore presents an efficient kMeans algorithm based on the data flow framework. The algorithm selects initial cluster centers by multiple sampling, which achieves load balancing and reduces the number of iterations.

(4) The thesis improves and extends Hive. Considering the complexity of spatial data query and analysis, and the characteristics and demands of query processing and analysis in cloud environments, the extended system supports more complex query processing and analysis.
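The z-value idea in item (2) can be illustrated with a small single-machine sketch. It is not the thesis's distributed algorithm: the function names `z_value` and `knn_join_z`, the restriction to 2-D non-negative integer coordinates, and the fixed candidate window of k predecessors and k successors in z-order are all assumptions made for illustration. Because points that are close in space are not always adjacent on the curve, the result is approximate.

```python
import bisect
import math


def z_value(x: int, y: int, bits: int = 16) -> int:
    """Morton code: interleave bits of x (even positions) and y (odd positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def knn_join_z(R, S, k, bits=16):
    """Approximate kNN join: for each point r in R, pick k neighbours from S.

    Hypothetical helper for illustration; points are (x, y) integer tuples.
    """
    # Sort S once by z-value so every lookup becomes a 1-D range search.
    s_sorted = sorted(S, key=lambda p: z_value(p[0], p[1], bits))
    s_keys = [z_value(p[0], p[1], bits) for p in s_sorted]

    result = {}
    for r in R:
        pos = bisect.bisect_left(s_keys, z_value(r[0], r[1], bits))
        # Candidate window: up to k predecessors and k successors in z-order.
        window = s_sorted[max(0, pos - k): pos + k]
        # Re-rank the candidates by true Euclidean distance and keep k.
        window.sort(key=lambda s: math.dist(r, s))
        result[r] = window[:k]
    return result
```

Calling `knn_join_z([(0, 1)], [(0, 0), (1, 1), (10, 10), (11, 11)], k=2)` searches only a short run of the sorted z-values around the query point instead of scanning all of S, which is the property that makes the one-dimensional formulation attractive for partitioned, distributed execution.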

Keywords/Search Tags:

Cloud, Data Flow, kNN join, kMeans Cluster

Related items

1	The Bad Data Identification Of Power System Based On Cloud Computing And Improved KMeans
2	A Study On Cluster Analysis Of Comprehensive Stock Index Data Based On Kmeans
3	Optimizing Multi-Join In Cloud Environment
4	Research And Implementation Of The Big Spatial Data Join Query Processing Algorithms In Cloud Environment
5	Research On Partition Selection Strategy For Big Data Management Based On KNN Connection Processing
6	Research And Implementation Of Multi-Way Join Query Processing Algorithms Over Big Spatial Data In Cloud Environment
7	Join Method Research Based On MapReduce
8	Join Processing And Optimizing On Large Data Sets Based On Hadoop Framework
9	Improvement Of Kmeans Clustering Algorithm And Its Application In Information Retrieval System
10	Research Of Join Algorithm With Skew Data On Mapreduce