Research On Some Key Technologies Of Parallel Processing For Big Data Based On Map Reduce

Posted on:2018-09-19

Degree:Doctor

Type:Dissertation

Country:China

Candidate:B Zhang

GTID:1318330536952278

Subject:Pattern Recognition and Intelligent Systems

Abstract/Summary:

The characteristics of big data are volume, variety, velocity, commodity hardware, and open source. Modern big data analysis, however, faces a confluence of growing challenges, and systems are becoming less and less efficient. MapReduce, introduced by Google in 2004, is already widely used in big data processing. Relational database technology for traditional scientific data is comparatively mature, but because CPU computing power and disk read/write speed have not developed in balance, I/O has become the performance bottleneck of the traditional database. In addition, the traditional database is not well suited to analyzing and processing unstructured data. As research on data has deepened, new query processing methods of many kinds have emerged. Integrating relational and non-relational database technology is an active research area in data science and engineering, and many key problems remain unsolved. How to use MapReduce distributed parallel computing to process big data queries efficiently, and which query optimization strategies to apply, are still open questions.

Starting from distributed query processing systems, this work applies the parallel ideas and computing methods of MapReduce to construct several new distributed computing models for large data sets and a cost model for MapReduce materialization strategies, serving both database theory and practical application. In addition, it studies dynamic detection of skew in the loaded data, constructs three materialization strategies for the MapReduce parallel environment according to different materialization timing, and verifies them experimentally on a large volume of standard test data, as a meaningful attempt at parallel processing methods for big data.

The main contributions of our work are as follows:

1. After introducing the motivation and significance of chaos control and synchronization, we list the challenges that chaos control and synchronization face. We then present the main contents and innovations of this study.

2. System performance declines seriously when a traditional relational database processes big data, so big data management technology has become a research hotspot. We survey and comment on domestic and international big data research from four angles: parallel databases, the MapReduce model for big data processing, NoSQL versus MapReduce, and the integration of MapReduce with database technology.

3. In a traditional relational database, materialization can greatly speed up query processing, but in modern big data analysis such systems become increasingly inefficient and hard to scale. This paper therefore presents column-store-based materialization strategies that provide an effective environment for big data analysis. First, it uses a MapReduce cost model to analyze materialization efficiency. Second, it designs the MapReduce column-store file and optimizes it with a cooperative localization strategy. Third, according to the different materialization time windows, it proposes materialization strategies in MapReduce based on column-store (MSMC), composed of three strategies: the MapReduce early materialization strategy (MEMS), the MapReduce late materialization strategy (MLMS), and the MapReduce early-late materialization strategy (MELMS). Fourth, to avoid runaway growth of the materialization sets, it designs the adaptive materialization sets adjust strategy (AMSAS), which optimizes MSMC effectively.

4. To address the inefficiency and limited scalability of traditional relational databases in big data analysis, this paper introduces the MapReduce computing model and presents a column-store-based hash join algorithm for a distributed MapReduce environment. First, it proposes a design for big-data-oriented distributed computing models, designs the MapReduce column-store file, and optimizes it with a cooperative localization strategy. Second, it proposes partition aggregation and a heuristic optimization strategy to implement the hash join algorithm. Finally, experiments evaluate execution time and load capacity and verify that the proposed method is effective and scales well in big data analysis.

5. The growing demand for big data analytics has led to the design of highly scalable data-intensive computing infrastructures such as MapReduce. Recurring queries, executed repeatedly over long periods on rapidly evolving high-volume data, have become a bedrock component of big data analytic applications. This paper therefore presents optimization strategies for recurring queries in big data analysis. First, it uses a MapReduce recurring queries model to analyze the efficiency of recurring queries. Second, it proposes the MapReduce consistent window slice algorithm, which creates more opportunities for reuse across recurring queries and, through fine-grained scheduling, greatly reduces redundant data when loading input. Third, for data scheduling, it designs the MapReduce late scheduling strategy, which improves data processing throughput and optimizes the allocation of computing resources in a MapReduce cluster. Fourth, it constructs efficient data reuse execution plans with the MapReduce recurring queries reuse strategy. Finally, experiments evaluate execution time and load capacity. The results show that the optimization strategies effectively reduce MapReduce intermediate data processing, network bandwidth consumption, and unnecessary I/O, which verifies the effectiveness of the proposed method in big data analysis.

Experiments on the materialization strategies likewise evaluate execution time and load capacity. The results show that the materialization strategies in MapReduce based on column-store, together with the adaptive materialization sets adjust strategy, effectively reduce MapReduce intermediate data processing, network bandwidth consumption, and unnecessary I/O, which verifies the effectiveness of the proposed method in big data analysis.
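
The hash join idea mentioned in the abstract can be illustrated with a minimal single-process sketch of a generic partitioned (reduce-side) hash join. This is not the dissertation's algorithm: it has no column-store file, partition aggregation, or heuristic optimization, and every name in it (`map_phase`, `shuffle`, `reduce_phase`, `mapreduce_hash_join`, the `"L"`/`"R"` tags, and the sample tables) is an assumption made for illustration only.

```python
from collections import defaultdict

def map_phase(rows, tag, key_col, num_partitions):
    """Tag each row with its source table and route it by the hash of its join key."""
    for row in rows:
        key = row[key_col]
        yield hash(key) % num_partitions, (tag, key, row)

def shuffle(pairs):
    """Group mapper output by partition, standing in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for partition, record in pairs:
        groups[partition].append(record)
    return groups

def reduce_phase(records):
    """Hash join inside one partition: build on the 'L' rows, probe with the 'R' rows."""
    build = defaultdict(list)
    for tag, key, row in records:
        if tag == "L":
            build[key].append(row)
    for tag, key, row in records:
        if tag == "R":
            for left in build.get(key, []):
                yield {**left, **row}

def mapreduce_hash_join(left, right, key_col, num_partitions=4):
    """Run map, shuffle, and reduce in sequence (hypothetical driver, one process)."""
    pairs = list(map_phase(left, "L", key_col, num_partitions))
    pairs += list(map_phase(right, "R", key_col, num_partitions))
    joined = []
    for records in shuffle(pairs).values():
        joined.extend(reduce_phase(records))
    return joined

# Illustrative data only; not taken from the dissertation's experiments.
customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 3, "name": "Bo"}]
orders = [{"cust_id": 1, "amount": 30}, {"cust_id": 2, "amount": 12},
          {"cust_id": 1, "amount": 7}]
result = mapreduce_hash_join(customers, orders, "cust_id")
```

Because rows with equal keys always hash to the same partition, each reducer can join its partition independently of the others, which is what lets this kind of join spread across a cluster. The build side is held in memory per partition, so in practice the smaller table would normally be chosen as the build side.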

Keywords/Search Tags:

Big data, column-store, MapReduce, hash join, materialization strategy, recurring queries

Related items

1	Efficient Star Join For Column-Oriented Data Store In The MAP Reduce Environment
2	Research On Some Key Technologies For Column-stores
3	Research And Implementation Of Query Optimizing Of Column Store In Data Warehouse Management System
4	Research And Implementation Of Key Techniques For Query Rewriting In Column-Store Data Warehouse
5	The Optimization Of The Query Execution Engine In Column Oriented DWMS
6	Research And Optimization Of Multidimensional Data Warehouse Model Based On Column Storage
7	Research On Query Optimization In Column-Oriented Data Warehouse
8	Research Of Key Technology Of Index In Column-Oriented DWMS
9	Research On Optimization For Multi-way Join In A Map-Reduce Environment
10	Column Store Database---A New Approach to GIS Application