Design And Implementation Of A Data Integration System Based On MapReduce

Posted on:2016-10-11

Degree:Master

Type:Thesis

Country:China

Candidate:H T Zhang

Full Text:PDF

GTID:2348330521951099

Subject:Computer technology

Abstract/Summary:

With the development of informatization and computer technologies, more and more data-intensive applications are emerging, and large amounts of data are analyzed to mine valuable information. Building a data warehouse system to integrate massive, distributed, heterogeneous data is one of the methods enterprises adopt to support decision making. However, traditional data integration approaches set up a central node to aggregate the distributed heterogeneous data into the distributed data warehouse. This central node becomes a bottleneck when integrating large-scale datasets, resulting in heavy data migration and low performance, so traditional approaches cannot satisfy the demands of massive data integration. This thesis proposes a distributed parallel data integration approach for massive data under the MapReduce programming model. The idea of "moving computation to data" is adopted to implement the integration process, so the data on each node can be integrated in parallel, eliminating the central-node bottleneck. Four integration strategies with different join implementation methods are proposed, together with an evaluation model based on I/O cost, according to which a strategy selection algorithm is designed. In addition, to support parallel computation on the integrated data, a data layout strategy is designed to distribute the integrated data, focusing on data location and data balance across the nodes of the data warehouse. The proposed approach is implemented in a Hadoop environment, and a series of experiments with different data volumes is designed. The experiments show that the integration approach is feasible and that the I/O evaluation model is correct. The layout strategy for the integrated data is also simulated and yields the expected balanced data distribution. The proposed approach is another application of the MapReduce model to massive data analysis; it addresses the poor parallel performance and high cost of massive data integration and has both theoretical and practical significance.
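The abstract does not give the four strategies or the I/O cost formulas, so the following is only a minimal sketch of how cost-based strategy selection could look. The two strategy names, the `JoinInput` fields, and both cost formulas are hypothetical stand-ins, not the thesis's model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class JoinInput:
    """Hypothetical description of one join between two source tables."""
    left_bytes: int   # size of the larger table
    right_bytes: int  # size of the smaller table
    num_nodes: int    # nodes taking part in the join


def reduce_side_cost(j: JoinInput) -> int:
    # Assumed model: both tables are read by mappers, written as shuffle
    # output, and read again by reducers (three passes over all bytes).
    return 3 * (j.left_bytes + j.right_bytes)


def broadcast_cost(j: JoinInput) -> int:
    # Assumed model: the large table is read once where it already lives;
    # the small table is read once and then copied to every node.
    return j.left_bytes + j.right_bytes * (1 + j.num_nodes)


# Illustrative candidates only; the thesis proposes four strategies
# that the abstract does not name.
STRATEGIES: Dict[str, Callable[[JoinInput], int]] = {
    "reduce_side_join": reduce_side_cost,
    "map_side_broadcast_join": broadcast_cost,
}


def select_strategy(j: JoinInput) -> Tuple[str, Dict[str, int]]:
    """Return the strategy with the lowest estimated I/O cost, plus all estimates."""
    costs = {name: estimate(j) for name, estimate in STRATEGIES.items()}
    best = min(costs, key=costs.get)
    return best, costs


if __name__ == "__main__":
    small_dim = JoinInput(left_bytes=1000, right_bytes=10, num_nodes=4)
    print(select_strategy(small_dim))
```

Under these assumed formulas, a small second table favors the broadcast variant, while two tables of similar size on many nodes favor the reduce-side variant; a real model would also have to account for the factors the thesis measures in its Hadoop experiments.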

Keywords/Search Tags:

Hadoop, MapReduce, Cloud Computing, data-intensive applications, data integration, data distribution layout, IO cost

Related items

1	Research On Optimization Of Map Reduce For Interactive Analysis On Big Data
2	Job Scheduling Technologies In Data Intensive Supercomputing Systems
3	Research On Data Placement Strategy For Data-intensive Applications In Cloud
4	Energy efficient data-intensive computing with MapReduce
5	Performance-Aware Scheduling For Data-Intensive Cloud Computing
6	Task Scheduling And Virtual Machine Integration Of Data Intensive Batch Processing Workflow
7	Mass Sales Data Processing Platform Design And Implementation
8	Research Of Data Classification Algorithms In Data-intensive Computing Environments
9	Research On Minimum Cost Data Storage Problem In Multi-clouds
10	Design And Implementation Of The Data Analysis System Based On Hadoop