The Research And Improvement For General Distributed File System

Posted on: 2011-10-16
Degree: Master
Type: Thesis
Country: China
Candidate: G S Gong
Full Text: PDF
GTID: 2178360308963598
Subject: Computer system architecture
Abstract/Summary:
Today the volume of information is so large that distributed file systems are used more and more widely, especially HDFS (Hadoop Distributed File System), owing to its scalability, robustness, and open-source availability. Because HDFS has many similarities with GFS (Google File System), it performs well in search engine applications but may not do well in other applications.

Search engines mainly process large data sets, but other applications also need to process large numbers of small files. HDFS is not geared toward accessing small files efficiently, which is why it needs to be studied and improved. So far the best solution to the problem is to replace the small files with a SequenceFile, a file format in HDFS. This thesis therefore designs and implements a conversion tool called seqtool. The tool efficiently converts a large number of small files into a SequenceFile; it can also process archive files; it supports append writes; and it can compress the SequenceFile in two ways to save space.

This thesis also designs and implements a random read algorithm that speeds up random reads from a SequenceFile. To achieve high efficiency, a metadata file is created for every SequenceFile, recording the position of every small file within it; the metadata file is then accessed using a dictionary tree (trie) algorithm combined with a secondary index organization. In addition, this thesis designs and implements an HDFS web management interface through which files in HDFS can be conveniently browsed, deleted, uploaded, and downloaded.

Finally, this thesis runs a series of performance tests comparing the SequenceFile with small files. In the batch read and write tests, the SequenceFile performs better, and the difference is very large; in the random read test, the random read algorithm also delivers good performance. Lastly, with wordcount as a MapReduce test case, the results show that the SequenceFile performs much better, as expected.
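The idea of packing many small files into one container and locating each of them through a separate position index can be illustrated with a minimal Python sketch. This is not seqtool and does not use the Hadoop SequenceFile format: the record layout (two 4-byte lengths, then name and data), the names `TrieIndex`, `pack_small_files`, and `read_small_file`, and the in-memory buffer are all assumptions made for illustration. The secondary index organization mentioned in the abstract is not modeled here.

```python
import io
import struct

_END = ""  # marker key for "a name ends here"; cannot collide with a 1-char key


class TrieIndex:
    """Hypothetical in-memory dictionary tree mapping file names to offsets."""

    def __init__(self):
        self.root = {}

    def insert(self, name, offset):
        node = self.root
        for ch in name:
            node = node.setdefault(ch, {})
        node[_END] = offset

    def lookup(self, name):
        node = self.root
        for ch in name:
            node = node.get(ch)
            if node is None:
                return None
        return node.get(_END)


def pack_small_files(files, container, index):
    """Append (name, data) pairs to the container and record their offsets."""
    container.seek(0, io.SEEK_END)  # always append, never overwrite
    for name, data in files:
        key = name.encode("utf-8")
        index.insert(name, container.tell())
        # Assumed record layout: key length, data length, key bytes, data bytes.
        container.write(struct.pack(">II", len(key), len(data)))
        container.write(key)
        container.write(data)


def read_small_file(container, index, name):
    """Random read: jump straight to the record instead of scanning."""
    offset = index.lookup(name)
    if offset is None:
        return None
    container.seek(offset)
    key_len, data_len = struct.unpack(">II", container.read(8))
    container.seek(key_len, io.SEEK_CUR)  # skip the stored name
    return container.read(data_len)


if __name__ == "__main__":
    buf = io.BytesIO()
    idx = TrieIndex()
    pack_small_files([("a.txt", b"alpha"), ("ab.txt", b"beta")], buf, idx)
    print(read_small_file(buf, idx, "ab.txt"))
```

In this sketch a lookup costs one trie walk proportional to the length of the file name plus a single seek, independent of how many small files the container holds, which is the property a position index is meant to provide.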
Keywords/Search Tags: HDFS, Distributed File System, SequenceFile, Small files