
Research On Data Placement Technology In MapReduce-Styled Data Processing Platform

Posted on: 2017-07-29
Degree: Master
Type: Thesis
Country: China
Candidate: H C Wang
Full Text: PDF
GTID: 2348330503992872
Subject: Computer Science and Technology
Abstract/Summary:
The MapReduce-styled data processing platform (“MapReduce platform” for short) is one of the key technologies in the field of massive data processing. Data-locality-aware processing on the MapReduce platform means that the massive volume of data is stored on the local disks of the computing nodes and that computing tasks are scheduled, as far as possible, to the nodes holding the data they consume, so as to reduce the communication overhead of remote data access and improve data processing efficiency. Increasing the probability of data-local processing is therefore one of the main objectives of the MapReduce platform.

Data placement is one of the core technologies of the data processing platform: it distributes the data across all storage nodes in a reasonable and effective way. Unlike traditional data processing, the new characteristics of storing massive data on the computing nodes and processing it locally require data placement to serve not only storage efficiency but also computing efficiency. Most existing data placement techniques aim at improving data access efficiency and relieving the data I/O bottleneck. When applied to the MapReduce platform, however, they lead to poor data processing efficiency, because they neglect the computing load on the storage nodes and reduce the ratio of data-local processing of hot-spot data.

To solve these problems, this dissertation studies data placement techniques for the MapReduce platform, with the goal of raising the probability of data-local processing. It improves data processing efficiency by introducing new factors into the data placement decision, namely the data-local access ratio of block replicas and the remaining computing capacity of nodes. The main contributions of this dissertation are as follows:

(1) A data placement decision-making information set is defined. According to the new characteristics of the MapReduce platform, this dissertation defines the decision-making information set for data placement and introduces, for the first time, the access frequency of block replicas, the data-local access ratio of block replicas, the remaining resource capacity of nodes, and other information as new decision-making factors for data placement.

(2) A decision-making information acquisition mechanism is designed and implemented. The mechanism covers information collection, information prediction, and information aggregation. A decision-making information acquisition framework based on a master/slave structure separates information collection and prediction onto the computing (slave) nodes, so that the central node only performs information aggregation, which reduces its load. Information prediction is designed on the basis of gray prediction; a minimal sketch of such a predictor is given after this list of contributions.
(3) A dynamic replacement strategy for existing block replicas is designed and implemented. By analysing the relationships between the decision-making factors and block replicas as well as data nodes, evaluation methods for block replicas and for data nodes are designed. Using the evaluation values, a candidate set of block replicas to migrate and a candidate set of destination nodes are selected; finally, the best destination node is chosen for each candidate block replica without reducing the fault tolerance of the file system.

(4) A data placement strategy for newly added data blocks is designed and implemented. When data is written into the distributed file system, in order to increase the probability of data-local access and balance storage resource usage, each block replica is placed on the node with the largest remaining resources among candidate nodes chosen at random from the cluster (this placement rule is also sketched after the list).

(5) A simulation environment is set up and performance tests are completed. The simulation software CloudSim is extended, and a MapReduce simulation platform consisting of hundreds of nodes is configured. Under the same task and data submissions, the improved placement strategy is compared with the default HDFS placement strategy on metrics such as the average job execution time. The results show that the strategy proposed in this dissertation reduces the average task execution time by 12.03%.
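The abstract names gray prediction as the basis of the information prediction step in contribution (2) but does not reproduce the model. The following Java sketch assumes the standard GM(1,1) gray-prediction model applied to one decision factor; the class name GreyPredictor and the access-frequency example are hypothetical, not taken from the dissertation.

/**
 * Minimal GM(1,1) gray-prediction sketch for forecasting one decision factor
 * (for example, the access frequency of a block replica) from a short history
 * of observations collected on a slave node. Names are illustrative only.
 */
public class GreyPredictor {

    /** Predicts the next value of a positive series with at least 4 samples. */
    public static double predictNext(double[] x0) {
        int n = x0.length;

        // 1-AGO: accumulated generating operation x1(k) = sum of x0(1..k).
        double[] x1 = new double[n];
        x1[0] = x0[0];
        for (int k = 1; k < n; k++) {
            x1[k] = x1[k - 1] + x0[k];
        }

        // Least-squares fit of x0(k) + a * z1(k) = b, where
        // z1(k) = 0.5 * (x1(k) + x1(k-1)) is the background value.
        double szz = 0, sz = 0, szy = 0, sy = 0;
        int m = n - 1;
        for (int k = 1; k < n; k++) {
            double z = 0.5 * (x1[k] + x1[k - 1]);
            double y = x0[k];
            szz += z * z;
            sz += z;
            szy += z * y;
            sy += y;
        }
        double det = szz * m - sz * sz;          // determinant of B^T B
        double a = (sz * sy - m * szy) / det;    // development coefficient
        double b = (szz * sy - sz * szy) / det;  // gray action quantity

        // Time-response function of the whitened equation, restored by 1-IAGO.
        double c = x0[0] - b / a;
        double x1Next = c * Math.exp(-a * n) + b / a;       // x1_hat(n+1)
        double x1Curr = c * Math.exp(-a * (n - 1)) + b / a; // x1_hat(n)
        return x1Next - x1Curr;                             // x0_hat(n+1)
    }

    public static void main(String[] args) {
        // Hypothetical access-frequency history of one block replica.
        double[] history = {120, 134, 151, 166, 182};
        System.out.printf("predicted next access frequency: %.1f%n",
                predictNext(history));
    }
}

Gray prediction works from a short positive history of a few samples, which suits per-replica statistics that the slave nodes can collect locally before reporting aggregated information to the central node.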
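For the new-block placement rule of contribution (4), a minimal sketch in the same vein is given below: sample a random candidate set of data nodes and place the replica on the candidate with the largest remaining resources. The DataNode record, the equal weighting of storage and compute capacity, and the candidate count are illustrative assumptions, not the dissertation's exact definitions.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/**
 * Illustrative sketch of the new-block placement idea: pick a random subset
 * of data nodes and place the replica on the candidate with the most
 * remaining resources. All concrete numbers and weights are assumptions.
 */
public class NewBlockPlacement {

    /** Minimal view of a data node's remaining capacity. */
    static final class DataNode {
        final String id;
        final double freeStorageRatio;  // 0..1, remaining disk capacity
        final double freeComputeRatio;  // 0..1, remaining task slots / CPU

        DataNode(String id, double freeStorageRatio, double freeComputeRatio) {
            this.id = id;
            this.freeStorageRatio = freeStorageRatio;
            this.freeComputeRatio = freeComputeRatio;
        }

        /** Equal-weight score of remaining resources (an assumption). */
        double remainingResourceScore() {
            return 0.5 * freeStorageRatio + 0.5 * freeComputeRatio;
        }
    }

    /** Randomly samples candidate nodes, then picks the best-scored one. */
    static DataNode chooseTarget(List<DataNode> cluster, int candidateCount) {
        List<DataNode> shuffled = new ArrayList<>(cluster);
        Collections.shuffle(shuffled);
        List<DataNode> candidates =
                shuffled.subList(0, Math.min(candidateCount, shuffled.size()));
        return Collections.max(candidates,
                Comparator.comparingDouble(DataNode::remainingResourceScore));
    }

    public static void main(String[] args) {
        List<DataNode> cluster = List.of(
                new DataNode("dn-01", 0.30, 0.60),
                new DataNode("dn-02", 0.75, 0.40),
                new DataNode("dn-03", 0.55, 0.80),
                new DataNode("dn-04", 0.20, 0.90));
        DataNode target = chooseTarget(cluster, 3);
        System.out.println("place new block replica on " + target.id);
    }
}

Sampling only a small random candidate set keeps the placement decision cheap, while the score comparison steers new replicas away from nodes that are already heavily loaded.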
Keywords/Search Tags: MapReduce, HDFS, block replica placement, gray prediction, CloudSim