Research On Data Storage Management Technology Of Science And Technology Cloud Platform

Posted on: 2017-03-31, Degree: Master, Type: Thesis
Country: China, Candidate: Q Li, Full Text: PDF
GTID: 2308330482488694, Subject: Software engineering
Abstract/Summary:
In recent years, the country has continued to promote the cloud computing industry and its integration with specific industries. As one of the best-known open source cloud computing frameworks, Hadoop has attracted particular attention, and many enterprises are developing on this technology. The national science and technology management system will also be built on cloud computing technology, to ensure highly available data storage and convenient elastic expansion of storage space or computing performance in the future. We undertook the design and development of the similarity check system for science and technology data. It uses MapReduce to perform the full-text comparison of projects in parallel on the Hadoop platform. All file data is stored on the Hadoop Distributed File System (HDFS). The cluster contains both legacy devices and newly purchased ones, so as to make full use of the old devices, and these devices differ considerably in storage performance, computing performance, and I/O performance. In actual operation, we found that uneven distribution of data blocks reduces the running speed of MapReduce, which in turn affects the response speed of the Hadoop cluster. Because the default rack-aware storage policy of HDFS does not consider differences in node performance, high-frequency data may be stored on low-performance nodes while low-frequency data is more likely to be stored on high-performance nodes, which lengthens the cluster response time and reduces resource utilization. To solve these problems, our team proposes a hierarchical storage scheduling mechanism.
Building on the HDFS rack-aware scheduling policy, the mechanism first divides nodes into high-configuration and low-configuration nodes according to inherent hardware characteristics such as CPU, memory size, disk size, and disk I/O. Second, it establishes a performance evaluation model for each node from dynamic factors such as CPU usage, memory usage, network bandwidth usage, and disk usage, and defines three performance levels, p1, p2, and p3, from high to low, to evaluate node performance. Scheduling then integrates node configuration, performance level, network location, and other factors. While the cluster is running, the distribution of data blocks is adjusted dynamically according to data access frequency: high-frequency data is stored on high-configuration nodes with high processing performance to improve the cluster's response rate, while low-frequency data is removed from high-configuration nodes to save space. Applying the improved hierarchical replica storage scheduling strategy to the similarity check system of science and technology data, which detects duplication, improved the computation time by 6%.
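The two-stage classification and the frequency-driven placement described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `Node` fields, the weights, the thresholds separating p1, p2, and p3, and the rack-spreading rule are all assumptions chosen for illustration, and the rack handling is a simplification rather than the actual HDFS placement policy.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    rack: str
    high_config: bool   # static tier (CPU, memory size, disk size, disk I/O)
    cpu_usage: float    # dynamic factors, each a fraction in [0, 1]
    mem_usage: float
    net_usage: float
    disk_usage: float


# Hypothetical weights for the dynamic factors; the abstract gives no values.
WEIGHTS = {"cpu_usage": 0.4, "mem_usage": 0.3, "net_usage": 0.2, "disk_usage": 0.1}
LEVEL_RANK = {"p1": 0, "p2": 1, "p3": 2}


def load_score(node):
    """Weighted load of a node; a lower score means more spare capacity."""
    return sum(weight * getattr(node, field) for field, weight in WEIGHTS.items())


def performance_level(node, low=0.4, high=0.7):
    """Map the load score to p1 (best), p2, or p3; thresholds are assumed."""
    score = load_score(node)
    if score < low:
        return "p1"
    if score < high:
        return "p2"
    return "p3"


def choose_targets(nodes, hot, replicas=3):
    """Pick replica targets for one block.

    Hot (frequently accessed) blocks prefer high-configuration nodes and
    cold blocks prefer low-configuration ones; ties are broken by
    performance level, then by load. Replicas are spread across racks
    where possible.
    """
    def sort_key(node):
        config_penalty = 0 if node.high_config == hot else 1
        return (config_penalty, LEVEL_RANK[performance_level(node)], load_score(node))

    ranked = sorted(nodes, key=sort_key)
    chosen, racks = [], set()
    # First pass: at most one replica per rack.
    for node in ranked:
        if len(chosen) == replicas:
            break
        if node.rack not in racks:
            chosen.append(node)
            racks.add(node.rack)
    # Second pass: fill remaining replicas with the best nodes left.
    for node in ranked:
        if len(chosen) == replicas:
            break
        if node not in chosen:
            chosen.append(node)
    return chosen
```

In this sketch, a block whose access frequency rises would be re-placed by calling `choose_targets(nodes, hot=True)`, and a block that has gone cold by calling it with `hot=False`; how access frequency is measured and when migration is triggered are left open, since the abstract does not specify them.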
Keywords/Search Tags: Cloud storage, HDFS, Heterogeneous cluster, Hierarchical storage, Storage scheduling