With the rapid expansion of information, people prefer mining and filtering information to searching for it. Everything we do on the Internet every day also produces a large amount of data. Corporate and production environments generate huge volumes of data daily, and this data is stored for future data mining, because by mining and analyzing it enterprises can learn how to improve their production and marketing. This is the value of the data. With the popularity of cloud computing, many people and enterprises use cloud solutions. Because of cloud computing's unique service delivery model, large amounts of data are generated in the cloud back end. Storing this data reliably and securely therefore poses a huge challenge to vendors.

After studying many mainstream distributed file systems, the author proposes the node status based storage replication distribute policy (NSRD policy). The author also describes the broader context, cloud computing. The NSRD policy analyzes node statuses such as CPU usage, disk I/O usage, memory usage, network bandwidth usage, and disk capacity utilization, and from these statuses the author derives the performance point (KPI) mechanism. Used as a benchmark, the KPI guides the control node in offering suitable nodes to the client. To better explain the NSRD policy, the author abstracts a model and divides it into three services: the node status obtaining service, the node status information forwarding service, and the target node selection service.

To better explain these three services, the author uses the HDFS file system as an example, and explains why the NSRD policy is necessary on the basis of the HDFS workflow. From the analysis in Chapters 3 and 4, we can see that many distributed file systems invariably slice a large file and store it across the whole cluster.
When the file system performs a write, the control node gives a list of nodes to the client. However, when recommending storage nodes in the cluster to the client, the control node often uses a Round-Robin random selection strategy. Although this strategy is simple and easy to implement, it does not fully consider each node's CPU usage rate, memory usage rate, disk I/O rate, disk usage rate, and network bandwidth. As a result, it may select a target node that is heavily loaded and unreliable.

To solve these problems, NSRD lets each storage node obtain accurate, real-time node status through the node status obtaining service and forward that status to the control node through the node status information forwarding service. Finally, the control node calculates a KPI for each node and returns the node with the highest KPI value to the client through the target node selection service.

To show that the NSRD policy can be realized, the author improves the HDFS file system's replication distribution strategy, integrates the node status obtaining service, the node status information forwarding service, and the target node selection service with HDFS, and runs sub-experiments under different scenarios. Because a large-scale cluster could not be deployed in the author's lab environment, the author simulated the NSRD strategy and the HDFS default policy in MATLAB, and then compared and analyzed their transmission stability and transmission efficiency.

Distribution mechanisms for storage replicas in distributed file systems are still at the research stage, and many distributed file systems have not integrated an intelligent distribution mechanism. This work therefore opens a discussion of a new method that determines the final storage destination node by node status.
The weights in each node's KPI valuation algorithm were obtained through simple experiments, and more accurate weights still need to be determined through further experiments.
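
The target node selection step can be illustrated with a minimal sketch. It is not the author's implementation: the weight values, the linear scoring formula, and the names `NodeStatus`, `kpi`, and `select_target_nodes` are assumptions made for illustration only. Each status metric is assumed to be a utilization fraction between 0 and 1, and the KPI is assumed to be a weighted sum of the unused share of each resource, so that a less loaded node scores higher.

```python
from dataclasses import dataclass

# Hypothetical weights (they sum to 1.0). The abstract only says the real
# weights came from a simple experiment; these values are placeholders.
WEIGHTS = {
    "cpu": 0.25,
    "disk_io": 0.25,
    "memory": 0.20,
    "network": 0.20,
    "disk_capacity": 0.10,
}


@dataclass
class NodeStatus:
    """Status reported by one storage node; every metric is a 0..1 utilization."""
    name: str
    cpu: float
    disk_io: float
    memory: float
    network: float
    disk_capacity: float


def kpi(status, weights=WEIGHTS):
    """Assumed scoring rule: weighted sum of the free share of each resource."""
    return sum(w * (1.0 - getattr(status, metric)) for metric, w in weights.items())


def select_target_nodes(statuses, replicas=3):
    """Return up to `replicas` nodes, highest KPI first."""
    return sorted(statuses, key=kpi, reverse=True)[:replicas]


if __name__ == "__main__":
    cluster = [
        NodeStatus("node-a", 0.9, 0.9, 0.9, 0.9, 0.9),  # heavily loaded
        NodeStatus("node-b", 0.1, 0.1, 0.1, 0.1, 0.1),  # mostly idle
        NodeStatus("node-c", 0.5, 0.5, 0.5, 0.5, 0.5),
    ]
    for node in select_target_nodes(cluster, replicas=2):
        print(node.name, round(kpi(node), 3))
```

Under these assumptions a heavily loaded node is ranked last, which is the behavior the Round-Robin random selection strategy cannot guarantee. Changing the weights changes only the ranking, not the structure of the selection.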