The Implementation And Application Of Removing Duplicated Web Pages Based On Bloom Filter

Posted on: 2011-08-05  Degree: Master  Type: Thesis
Country: China  Candidate: Z Wang  Full Text: PDF
GTID: 2178360305959858  Subject: Software engineering
Abstract/Summary:
With the continued development of the Internet, the volume of online information has been growing exponentially, which makes searching for information increasingly difficult. Removing duplicated web pages is therefore of considerable practical significance.

This thesis examines the removal of duplicated web pages in both theory and practice, and makes the following contributions:

First, for a given set of requirements, it designs and implements a web crawler for a question-and-answer website and describes the implementation procedure in detail. The crawler filters repeated URLs using a Bloom Filter.

Second, it extracts the content body of each target page by using XPath expressions that locate the paths of the relevant content.

Finally, it removes duplicates among the extracted content bodies. The procedure uses CDC (content-defined chunking) to divide the content body of a web page into chunks. It then applies hash functions to each chunk to build one Bloom Filter per web page, and estimates the similarity of two documents by taking the bitwise AND of their Bloom Filters.

The duplicate-removal method studied in this thesis has been applied in practice. The results show that the method removes duplicated web pages effectively and improves both the performance of the information retrieval system and the user experience.
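The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the filter size, the number of hash functions, the CRC32-based chunk boundary test, the chunk size limits, and the overlap-based similarity score are all assumptions chosen for illustration, and the names `BloomFilter`, `cdc_chunks`, `page_filter`, `similarity`, and `is_new_url` are hypothetical. The XPath extraction step is omitted; the sketch assumes the content body is already available as a string.

```python
import hashlib
import zlib


class BloomFilter:
    """Bloom filter backed by a Python int used as a bit array (sizes are assumed)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: bytes):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def is_new_url(seen: BloomFilter, url: str) -> bool:
    """Return True and record the URL if the filter has not seen it yet."""
    key = url.encode("utf-8")
    if key in seen:
        return False  # may be a false positive, which skips a genuinely new URL
    seen.add(key)
    return True


def cdc_chunks(data: bytes, window=8, mask=0x0F, min_size=8, max_size=64):
    """Content-defined chunking: cut where a hash of the trailing window matches
    a bit pattern, so boundaries follow content rather than fixed offsets.
    Requires min_size >= window so the window never starts before the data."""
    chunks, start = [], 0
    for i in range(len(data)):
        size = i - start + 1
        if size < min_size:
            continue
        # Recomputed per position for clarity; a rolling hash would be faster.
        fingerprint = zlib.crc32(data[i - window + 1 : i + 1])
        if fingerprint & mask == 0 or size >= max_size:
            chunks.append(data[start : i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def page_filter(text: str, num_bits=1024) -> BloomFilter:
    """Build one Bloom filter per page from the hashes of its chunks."""
    bf = BloomFilter(num_bits)
    for chunk in cdc_chunks(text.encode("utf-8")):
        bf.add(chunk)
    return bf


def similarity(a: BloomFilter, b: BloomFilter) -> float:
    """Assumed score: bits shared via bitwise AND, relative to the sparser filter."""
    common = bin(a.bits & b.bits).count("1")
    smaller = min(bin(a.bits).count("1"), bin(b.bits).count("1"))
    return common / smaller if smaller else 0.0
```

In this sketch, two pages whose content bodies are identical produce identical filters and a score of 1.0, while a threshold below 1.0 (to be chosen empirically) would flag near-duplicates. Both filters must use the same size and hash functions for the bitwise AND to be meaningful.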
Keywords/Search Tags: Bloom Filter, web crawler, removal of duplicated URLs, removal of duplicated web pages