Research On Query Processing And Analysis Technique Of Big Data In Cloud Environment

Posted on: 2016-03-09
Degree: Master
Type: Thesis
Country: China
Candidate: F Wang
GTID: 2308330479976615
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, the rapid growth of data volumes has driven major advances in data processing technology. Unlike traditional data query and analysis techniques, query processing and analysis of massive data with the aid of cloud computing has its own essential characteristics. Massive-data query processing and analysis in cloud environments is not yet mature, but its advantages and practicality are beyond doubt. Research on massive-data processing and analysis in cloud environments is therefore of considerable significance.

Existing research has solved the problem of simple query processing. More complex query processing and analysis problems either cannot be solved or are solved with low efficiency. This thesis focuses on the k nearest neighbor (kNN) join query and k-means cluster analysis of massive data in cloud environments. The main work is as follows:

(1) Based on the characteristics of massive-data query processing and analysis in cloud environments, a computing framework based on data flow is presented for multiple MapReduce jobs that have dependency relationships. Because each MapReduce job reads its input from and writes its output to the distributed file system, expressing dependencies between jobs is inefficient. The framework models the computation as a data flow diagram rather than as single tasks, which reduces the data read and written between MapReduce jobs. A reasonable combination of subprocesses can also reduce the execution time.

(2) The kNN join is a common operation in spatial databases. With the explosive growth of data, designing a distributed kNN join algorithm has become an urgent problem. Because existing distributed kNN query algorithms consist of several rounds of serial MapReduce tasks, an efficient kNN join algorithm based on the data flow framework is proposed. The algorithm maps multi-dimensional data sets into one dimension using space-filling curves (z-values) and transforms kNN joins into a sequence of one-dimensional range searches.

(3) The traditional centralized k-means algorithm cannot handle the current scale of data, and existing distributed k-means clustering algorithms based on the MapReduce framework do not consider the influence of the initial cluster centers. This work therefore presents an efficient k-means algorithm based on the data flow framework. The algorithm selects the initial cluster centers using a method based on multiple sampling, which achieves load balancing and reduces the number of iterations.

(4) The thesis improves and extends Hive. Considering the complexity of spatial data query and analysis, and the characteristics and demands of query processing and analysis in cloud environments, the extended system supports more complex query processing and analysis.
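The z-value idea in item (2) can be illustrated with a minimal single-machine sketch. It is not the thesis's algorithm: it assumes 2-D points with non-negative integer coordinates, runs without MapReduce or the data flow framework, and returns only approximate neighbors, because points that are close in space can be far apart on the z-curve. The names `z_value` and `knn_join_z` and the window of k candidates on each side are hypothetical choices made for illustration.

```python
from bisect import bisect_left
from math import dist


def z_value(x, y, bits=16):
    """Interleave the bits of two non-negative integers (Morton code)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x occupies the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y occupies the odd bit positions
    return z


def knn_join_z(R, S, k):
    """Approximate kNN join: for each r in R, pick k neighbors from S.

    Assumption: candidates are the k points before and the k points after
    r's position in the z-ordered copy of S (a one-dimensional range search).
    """
    s_sorted = sorted(S, key=lambda p: z_value(*p))
    s_keys = [z_value(*p) for p in s_sorted]
    result = {}
    for r in R:
        pos = bisect_left(s_keys, z_value(*r))
        lo, hi = max(0, pos - k), min(len(s_sorted), pos + k)
        candidates = s_sorted[lo:hi]
        # Rank the small candidate set by true Euclidean distance.
        result[r] = sorted(candidates, key=lambda s: dist(r, s))[:k]
    return result


if __name__ == "__main__":
    S = [(0, 0), (1, 1), (10, 10), (11, 11)]
    R = [(1, 0), (10, 11)]
    print(knn_join_z(R, S, k=2))
```

Sorting S once by z-value turns each neighbor lookup into a binary search plus a short scan, which is why the multi-dimensional join reduces to one-dimensional range searches. A practical version would need a strategy for neighbors that the curve separates, which this sketch omits.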
Keywords/Search Tags: Cloud, Data Flow, kNN join, kMeans Cluster