
Fast File System For Access Of Massive Urls

Posted on: 2011-06-29
Degree: Master
Type: Thesis
Country: China
Candidate: X P Wang
GTID: 2198330338489592
Subject: Computer Science and Technology
Abstract/Summary:
The performance of a web crawler depends largely on how quickly massive numbers of URLs can be stored and accessed. The crawler fetches URLs from the Internet in an order determined by its crawling strategy. To allow fast lookup, the URLs are often stored in a relational database. When the database becomes very large, however, its performance degrades so severely that it can no longer meet the crawler's demands. The storage and access of massive URLs is therefore the bottleneck of the web crawler.

This study addresses this bottleneck in massive URL management. Based on an in-depth analysis of the crawling process in a real environment, we summarize the technical requirements of the crawler. On the basis of these requirements, we present a fast file system for the storage and access of massive URLs. According to its functions, the file system is divided into a logical access model and a physical access model, so that its performance ultimately meets the needs of the crawler.

The contributions of this thesis are as follows:
1) To improve the efficiency of the file system, a B+Tree is used to index URLs and their associated information. The hash value of the URL combined with its domain is used as the B+Tree key. We improve the storage utilization of the B+Tree by moving keys between sibling nodes.
2) In the physical access model, we exploit the locality of URL access, sequential disk access, the repeatability of crawling, and delayed-write techniques to reduce I/O and improve the efficiency of the file system.
3) We design the file system and implement a prototype that includes all the features discussed above. This lays a solid foundation for further research.
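The abstract does not give the exact key layout or buffering scheme, so the following is only a minimal sketch of one possible reading of the first two contributions. It assumes a 64-bit key with a hash of the domain in the high 32 bits and a hash of the full URL in the low 32 bits, so that URLs of the same domain sit next to each other in key order. A sorted Python list stands in for the B+Tree leaf order, and a small buffer illustrates delayed writes that are merged in key order. The names `url_key`, `DelayedWriteIndex`, and `flush_threshold`, and the choice of MD5, are hypothetical and not taken from the thesis.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def _hash32(text: str) -> int:
    """Stable 32-bit hash (first 4 bytes of MD5); the hash choice is an assumption."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:4], "big")


def url_key(url: str) -> int:
    """Assumed key layout: domain hash in the high 32 bits, URL hash in the low 32 bits."""
    domain = urlsplit(url).hostname or ""
    return (_hash32(domain) << 32) | _hash32(url)


class DelayedWriteIndex:
    """Sorted in-memory index standing in for a B+Tree, with a delayed-write buffer."""

    def __init__(self, flush_threshold: int = 3):
        self.keys = []          # sorted keys, a stand-in for B+Tree leaf order
        self.records = {}       # key -> URL record already merged into the index
        self.pending = {}       # buffered writes not yet merged
        self.flush_threshold = flush_threshold
        self.flush_count = 0

    def put(self, url: str, info: dict) -> None:
        """Buffer a write; merge the buffer once it reaches the threshold."""
        self.pending[url_key(url)] = {"url": url, **info}
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Merge buffered writes in key order, mimicking one sequential pass."""
        for key in sorted(self.pending):
            if key not in self.records:
                bisect.insort(self.keys, key)
            self.records[key] = self.pending[key]
        self.pending.clear()
        self.flush_count += 1

    def get(self, url: str):
        """Look up a URL, checking the write buffer before the merged index."""
        key = url_key(url)
        if key in self.pending:
            return self.pending[key]
        return self.records.get(key)

    def urls_for_domain(self, domain: str) -> list:
        """Range scan over merged keys sharing the domain-hash prefix.

        Buffered (unflushed) writes are not included, and two domains whose
        32-bit hashes collide would fall into the same range.
        """
        prefix = _hash32(domain) << 32
        lo = bisect.bisect_left(self.keys, prefix)
        hi = bisect.bisect_right(self.keys, prefix | 0xFFFFFFFF)
        return [self.records[k]["url"] for k in self.keys[lo:hi]]
```

With this layout, a crawler that works through one domain at a time touches a contiguous key range, which is one way the locality and sequential-access principles mentioned above could be realized. A real implementation would replace the sorted list with on-disk B+Tree pages and would have to handle hash collisions explicitly.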
Keywords/Search Tags: fast file system, URL management, B+Tree, logical access model, physical access model