Research On Optimization Technology Of Distributed File System Based On Hadoop

Posted on: 2014-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: D Z Zhang
GTID: 2268330401976285
Subject: Circuits and Systems
Abstract/Summary:
With the development of the mobile Internet, the amount of data on the network has increased dramatically. After analysis and data mining, this data can be very valuable and can be used in commerce, scientific research, production, and other fields. Handling such rapidly growing volumes of data with traditional supercomputers is costly and wastes too much energy. Cloud computing, as a cheap, efficient, and reliable solution, has attracted a great deal of attention. Hadoop is an open-source cloud data processing platform that can be widely used for processing and analyzing huge amounts of data. Cloud platforms rely on a distributed file system. Well-known distributed file systems such as Lustre and GPFS (General Parallel File System) were designed for mainframes and are not suited to today's microcomputer-based cloud computing environments.
This paper uses GlusterFS, a distributed file system that can run on microcomputers, as the distributed file system of the cloud platform. The paper first implements the connection between GlusterFS and Common, the core module of Hadoop, using GlusterFS's Translator mechanism, through which all GlusterFS extensions can be implemented. The paper uses the Translator's library functions to connect to Hadoop Common, obtains the appropriate storage permissions, defines the org.apache.hadoop.fs.glusterfs class, and creates a data flow that conforms to the GlusterFS data format. The paper then uses FUSE (Filesystem in Userspace) to mount GlusterFS for Hadoop, replacing Hadoop's own distributed file system, HDFS (Hadoop Distributed File System). In this way the paper avoids the defects of HDFS and uses the advantages of GlusterFS to enhance the performance of the whole Hadoop cloud computing platform.
To optimize the platform, the paper uses an InfiniBand RDMA (Remote Direct Memory Access) transmission network, which keeps Hadoop from being limited by network bandwidth and speed and improves its performance. Depending on the network congestion in the system, the paper uses a judgment function to decide whether to compress data in order to save network bandwidth, further increasing Hadoop's data transfer rate on the current network. Because the current data caching algorithms of GlusterFS are not very comprehensive, the paper uses a new data caching algorithm, GAC (GlusterFS Automatic Cache Algorithm). The algorithm first determines whether the current data is ordered; for ordered data, it determines the strength of the ordering and uses a read-ahead size formula to calculate a reasonable read-ahead size. Reasonable read-ahead improves the performance of Hadoop's file system. Together, these optimization measures greatly improve the performance of the distributed file system of the Hadoop platform. Tests on the Hadoop cloud platform show that the performance of the optimized Hadoop distributed file system increases by 10 times, and the cloud computing performance of the Hadoop platform increases by more than 2 times.
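The abstract does not give the judgment function or the read-ahead size formula, so the following is only a minimal sketch of what the two decisions could look like, not the thesis's actual method. It assumes that compression is chosen when the estimated time to compress and send is lower than the time to send raw data, and that the strength of the ordering is the fraction of requests that directly follow the previous one. All names, thresholds, and formulas here (`should_compress`, `sequential_strength`, `read_ahead_size`, `max_blocks`, `min_strength`) are hypothetical.

```python
def should_compress(size_bytes, bandwidth_bps, compress_bps, ratio):
    """Hypothetical judgment function (not the one from the thesis).

    bandwidth_bps: bandwidth currently available, in bytes/s (low when congested)
    compress_bps:  compression throughput, in bytes/s
    ratio:         compressed size divided by original size
    """
    raw_time = size_bytes / bandwidth_bps
    compressed_time = size_bytes / compress_bps + (size_bytes * ratio) / bandwidth_bps
    return compressed_time < raw_time


def sequential_strength(offsets, block_size):
    """Fraction of requests that start exactly where the previous one ended."""
    if len(offsets) < 2:
        return 0.0
    pairs = zip(offsets, offsets[1:])
    hits = sum(1 for prev, cur in pairs if cur == prev + block_size)
    return hits / (len(offsets) - 1)


def read_ahead_size(offsets, block_size, max_blocks=32, min_strength=0.5):
    """Assumed read-ahead formula: scale the window with the ordering strength.

    Returns 0 (no read-ahead) when the access pattern is not ordered enough.
    """
    strength = sequential_strength(offsets, block_size)
    if strength < min_strength:
        return 0
    return block_size * max(1, round(strength * max_blocks))


if __name__ == "__main__":
    # Congested link (10 MB/s): compressing first is expected to be faster.
    print(should_compress(100e6, 10e6, 200e6, 0.5))
    # Fast link (1 GB/s): compression only adds overhead.
    print(should_compress(100e6, 1e9, 200e6, 0.5))
    # Mostly sequential 4 KiB reads get a large read-ahead window.
    print(read_ahead_size([0, 4096, 8192, 100000, 104096], 4096))
```

In this sketch a fully sequential stream receives the maximum window, a random stream receives none, and partially ordered streams fall in between, which mirrors the idea that the read-ahead size should follow the strength of the ordering.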
Keywords/Search Tags:GlusterFS, Hadoop, GtoH Interface, GAC Algorithm, Data Compression