
Design And Implementation Of A Data Integration System Based On MapReduce

Posted on: 2016-10-11; Degree: Master; Type: Thesis
Country: China; Candidate: H T Zhang; Full Text: PDF
GTID: 2348330521951099; Subject: Computer technology
Abstract/Summary:
With the development of informatization and computer technologies, more and more data-intensive applications are emerging, and large amounts of data are analyzed to mine valuable information. Building a data warehouse system to integrate massive, distributed, heterogeneous data is one of the methods enterprises adopt to support decision making. However, traditional data integration approaches set up a central node to aggregate the distributed heterogeneous data into the distributed data warehouse. This node becomes a bottleneck when integrating large-scale datasets, resulting in heavy data migration and low performance, so traditional approaches cannot satisfy the demands of massive data integration. This thesis proposes a distributed parallel data integration approach for massive data under the MapReduce programming model. The idea of "moving computation to data" is adopted to implement the integration process, so that the data on each node can be integrated in parallel, eliminating the central-node bottleneck. Four integration strategies with different join implementation methods are proposed, together with an evaluation model based on I/O cost, according to which a strategy selection algorithm is designed. In addition, to support parallel computation on the integrated data, a data layout strategy is designed to distribute the integrated data, focusing on data location and data balance across the nodes of the data warehouse. The proposed approach is implemented in a Hadoop environment, and a series of experiments with different data volumes is conducted. The experiments show that the proposed integration approach is feasible and that the I/O evaluation model is correct. The layout strategy for the integrated data is also simulated and yields the expected balanced data distribution. The proposed approach is another application of the MapReduce model to massive data analysis, addressing the poor parallel performance and high cost of massive data integration, and it has both theoretical and practical significance.
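
The abstract does not spell out the four integration strategies or the I/O cost formulas, so the following is only a minimal sketch of what cost-based strategy selection could look like. The two strategy names, the cost formulas, and the function `select_strategy` are all hypothetical placeholders chosen for illustration; they are not the thesis's model.

```python
from typing import Callable, Dict, Tuple

# Hypothetical I/O cost estimators, in bytes moved.
# ASSUMPTION: these formulas are illustrative only and are not
# taken from the thesis, whose cost model is not given here.

def reduce_side_join_cost(r_bytes: int, s_bytes: int, nodes: int) -> int:
    # Both inputs are read once and shuffled once to the reducers.
    return 2 * (r_bytes + s_bytes)

def broadcast_join_cost(r_bytes: int, s_bytes: int, nodes: int) -> int:
    # The smaller input is copied to every node; the larger is read locally.
    small, large = min(r_bytes, s_bytes), max(r_bytes, s_bytes)
    return small * nodes + large

STRATEGIES: Dict[str, Callable[[int, int, int], int]] = {
    "reduce_side_join": reduce_side_join_cost,
    "broadcast_join": broadcast_join_cost,
}

def select_strategy(r_bytes: int, s_bytes: int, nodes: int) -> Tuple[str, Dict[str, int]]:
    """Return the name of the cheapest strategy and all estimated costs."""
    costs = {name: cost(r_bytes, s_bytes, nodes) for name, cost in STRATEGIES.items()}
    best = min(costs, key=costs.get)
    return best, costs
```

Under these assumed formulas, a very small second input favors the broadcast variant, while two inputs of similar size favor the reduce-side variant, which is the kind of trade-off an I/O-based evaluation model is meant to capture.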
Keywords/Search Tags:Hadoop, MapReduce, Cloud Computing, data-intensive applications, data integration, data distribution layout, IO cost