Font Size: a A A

Study And Application Of L-Priority Bloom Filter

Posted on:2011-10-31Degree:MasterType:Thesis
Country:ChinaCandidate:F MiFull Text:PDF
GTID:2178360305455158Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
The 21st century is the century of information technology. With the growing popularityof computers and networks, the structure of people's lives gradually transfer to the computersand networks from traditional television, radio, newspapers and other media,Computer-related industry is rapidly developing. The growing network size and the rapidincrease of internet users lead to a rapid growth network information, which forms massivedatabases. It has become a problem which need to get more attention and to be solvedurgently that how to express and search these massive data. Therefore, it is very necessary tostudy how to express and search massive data。This paper cites the data from the 24th China Internet Development of survey ofChina's Internet Center in July 2009, describes the current rapid development of domesticInternet, and proposes the focus of this article - Bloom Filter.Bloom Filter is proposed by Burton H. Bloom in 1970 as a space-efficient algorithm fordata representation, which has been widely used in databases, networks, and distributedsystems. It expresses and searches massive data by using a bit space. This paper conductsin-depth research of the Bloom Filter both in theory and application, describes the currentstatus and research results. After the careful study, analyze and research of the Bloom Filter,proposes L-priority Bloom Filter in response to the limitations and the lack of the originalalgorithm in large-scale data representation and search.The design of L priority Bloom Filter is very simple. It uses a multidimensional bitspace matrix to replace the bit space of Bloom Filter. Multidimensional bit space matrixrepresents the element with different dimensions of the bit space, so as to support thedifferent priority of the elements. Taking into account the practical problems to be solvedand the memory space required, the multidimensional bit space matrix uses some specialprovisions, so as to achieve better the representation and the search with various elementspriority and low memory space in the massive data. Compared to Bloom Filter, L priority Bloom Filter cost more time and memory space regardless of query or insertion due to therepresentation of matrix elements with bit space. However, the advantage of the L-priorityBloom Filter not to be ignored is to represent the multi-priority set element, and significantlyreduce the incorrect judgments rate of elements with high-priority without changing theincorrect judgments rate of elements with low-priority. At the same time compared to otheralgorithm, L priority Bloom Filter is more simple and easier for practical application.After proposing the L priority Bloom Filter and use it in the duplication exclusion andnew page judgment of large-scale web pages, the author propose a new model of duplicationexclusion and prejudgment of large-scale web pages based on the LPBF and the URL. Theduplication exclusion and new page judgment have always been important parts of thesearch engine. When users search the web site, search engine clusters similar content pages,and provides relevant content to users. But there is a lot of duplication in the clustered pages,which would cause inconvenience if you send directly to the users without any processing.Meanwhile, to a portal whose pages update frequently, a good search engine must be able todetect and determine the pages newly created, so as to promptly compile and feedback to theusers. Therefore, it is an important task to quickly and effectively determine whether a pageis new or should be included.L priority Bloom Filter is designed precisely to solve the problem of large-scale webpage duplication exclusion and new page judgment. A large number of web analysis foundthat in all the pages, the relatively important pages occupied only a tiny fraction. If directlyuse the Bloom Filter, these important pages will be affected by a large number of ordinaryweb pages and increase the misjudgment rate, bring more serious consequences. Thus, asmentioned before, L-priority Bloom Filter significantly reduces the incorrect judgments rateof elements with high-priority without changing the incorrect judgments rate of elementswith low-priority. Through a combination of L-priority Bloom Filter and the URL, the authorproposes a new model of duplication exclusion and prejudgment of large-scale web pages.The model represent the URL using L-priority Bloom Filter of the in a multidimensional bitspace matrix, and conduct a related insertion and query operations.Finally, the author applies the model based on the Bloom Filter and large-scale web page URL. The model uses second-order LPBF and 8 hash functions. In the model'simplementation, uses the C/S architecture model, defines the server and the client. Throughthe packet transmission of information exchange between the client and server, the clientprogram sends operations package, and the server conducts operations, return operationalstatus and results according to client's instructions and data, at the same time, the programalso uses multi-threading technology, allowing multiple clients to connect with the server,achieve the interaction better. In addition, the focus of the overall system implementation isthe realization of 8 hash functions, the authors adapts and implements these 8 hash functionafter extensive study and research of hash functions with good performance to URL, in orderto achieve this simple model system better.Through above research, the author has a certain understanding of representation andsearch of large scale data collection, but it is only the tip of the iceberg in this area. It alsoneeds more efforts and to exploration.
Keywords/Search Tags:Bloom Filter, LPBF, Set, Hash, Large-ScaleWeb Page
PDF Full Text Request
Related items