
Research And Implementation On Optimization Of Data Placement Mechanism In Hadoop

Posted on: 2019-07-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y Du
Full Text: PDF
GTID: 2428330590975365
Subject: Computer technology
Abstract/Summary:
As a next-generation core of enterprise data storage architecture, the Hadoop Distributed File System (HDFS) has been widely adopted to address problems such as storage capacity limits, I/O performance bottlenecks, and storage cost. HDFS stores big data in blocks and distributes them across data nodes according to a data placement strategy, which not only improves the storage and processing efficiency of the data center but also achieves high availability and high reliability. However, with the continuous expansion of cloud-computing applications and the evolution of the data center, the data generated by upper-tier big-data applications increasingly exhibits a distinction between "cold" and "hot". This brings new challenges to data management in HDFS. On the one hand, if large volumes of rarely accessed cold data are still stored under the default three-replica redundancy strategy, the storage cost is enormous. On the other hand, as the data center operates and expands, node heterogeneity becomes more pronounced; rack-aware data placement strategies that ignore this heterogeneity can produce uneven load and uneven resource allocation across nodes, degrading the overall performance of HDFS. To address these problems, this thesis studies mechanisms and strategies for hotness-based data clustering and data placement, aiming to reduce storage cost while maintaining a given level of availability. The research proceeds in three parts.

First, this thesis analyzes hotness-aware clustering of data access time series. The three-replica redundancy strategy in HDFS incurs high storage cost because it ignores differences in data hotness. A hotness-aware algorithm is therefore proposed: access time series collected over a period are compared by DTW (dynamic time warping) distance and clustered by K-Means, translating access-frequency patterns in the time series into a data-hotness property. This work provides the foundation for the subsequent placement strategies.

Second, this thesis proposes hotness-sensitive data placement optimization strategies. To counter the performance degradation caused by ignoring node heterogeneity during placement, separate placement optimization strategies are proposed for hot and cold data. For frequently accessed hot data, a placement strategy is presented that improves the resource utilization of the storage system from multiple resource perspectives. For rarely accessed cold data, a placement strategy based on erasure-code redundancy is established, reducing storage cost while maintaining availability.

Finally, this thesis designs and implements Kitty-Twinkle, a data placement optimization system built on HDFS. The system adds a data-statistics module, modifies the data placement process, and is deployed on the SEU CLOUD. Experimental results show that the hotness-aware time-series clustering method and the hotness-sensitive placement optimization strategies significantly reduce storage cost, improve data availability and load capacity, and enhance system performance. This thesis thus provides an effective solution for the management and storage of big data.
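The clustering step can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names are invented, the series are assumed to have equal length, and centroids are updated by a simple element-wise mean (a stand-in for DTW barycenter averaging, which a production version would likely use).

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two numeric series."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

def kmeans_dtw(series, k, iters=10):
    """Cluster access-frequency time series into k hotness groups.

    The first k series seed the centroids (k-means++ would be a better
    choice); centroids are refreshed by an element-wise mean, which
    assumes all series share the same length.
    """
    centroids = [list(s) for s in series[:k]]
    labels = [0] * len(series)
    for _ in range(iters):
        # Assign each series to the nearest centroid under DTW distance.
        labels = [min(range(k), key=lambda c: dtw_distance(s, centroids[c]))
                  for s in series]
        # Recompute each centroid from its members (skip empty clusters).
        for c in range(k):
            members = [s for s, lab in zip(series, labels) if lab == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels

# Example: two "hot" and two "cold" access-frequency series.
series = [[10, 12, 11, 13], [1, 0, 1, 0], [11, 10, 12, 11], [0, 1, 0, 1]]
labels = kmeans_dtw(series, k=2)  # → [0, 1, 0, 1]
```

DTW rather than Euclidean distance lets two series with the same access rhythm but a slight phase shift (e.g. daily peaks an hour apart) land in the same hotness cluster.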
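The storage-cost motivation for the cold-data strategy can be made concrete with a quick calculation. The RS(6,3) layout below is an illustrative choice (it is a common HDFS erasure-coding default) rather than a scheme named in this abstract:

```python
def storage_overhead(data_blocks, redundancy_blocks):
    """Raw blocks stored per block of user data."""
    return (data_blocks + redundancy_blocks) / data_blocks

# Three-way replication: 1 data block + 2 extra copies,
# tolerates the loss of any 2 copies.
replication = storage_overhead(1, 2)  # 3.0

# Reed-Solomon RS(6,3): 6 data blocks + 3 parity blocks,
# tolerates the loss of any 3 blocks in the stripe.
rs_6_3 = storage_overhead(6, 3)       # 1.5

# Erasure coding halves the raw footprint for cold data here.
savings = 1 - rs_6_3 / replication    # 0.5
```

The trade-off is that reads of a lost block require decoding from the surviving stripe members, which is why the thesis reserves erasure coding for rarely accessed cold data and keeps replication for hot data.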
Keywords/Search Tags: data placement, data hotness, HDFS, erasure code, replication