
Research on a Large-Scale Paragraph-Fingerprint-Based Near-Duplicate Web Page Detection Algorithm

Posted on: 2013-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y Luan
Full Text: PDF
GTID: 2218330371959961
Subject: Computer application technology
Abstract/Summary:
The rapid development of the Internet in recent years has made web search engines remarkably significant, and the enormous volume of web pages poses great challenges to search engine technology. In particular, duplicate and near-duplicate web pages create additional overhead for search engines and critically affect their performance and quality. The detection of duplicate and near-duplicate web pages has therefore long been recognized as an important problem in the web crawling research community. In this paper, we first introduce the key technologies of near-duplicate detection for web pages and then analyze several classical algorithms.

Before detection, the noise in web pages, such as navigation bars, advertisements, and copyright notices, must be removed. We propose a new algorithm for extracting the main content of web pages based on the priority of HTML tags and the DOM tree. The experimental results indicate that the main content of web pages can be extracted accurately by our algorithm.

In addition, we propose a paragraph-fingerprint-based algorithm for detecting duplicate and near-duplicate web pages. The experimental results indicate that the algorithm achieves both high precision and high recall. We also implement three other algorithms, Shingling, SimHash, and an algorithm based on extracting long sentences from the full text, and compare them with ours. The experimental results show that our algorithm performs better on three benchmark measures, precision, recall, and efficiency, while also reducing the size of the feature set.

Finally, we redesign our near-duplicate detection algorithm on the MapReduce programming model and implement it on the open source framework Hadoop 0.20.2, with HDFS providing distributed storage. The experimental results indicate that the parallel design enables our algorithm to handle large-scale data successfully.
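The abstract does not give the details of the paragraph-fingerprint scheme, so the following is only a minimal sketch of the general idea it describes: each paragraph of the extracted main content is hashed to a fingerprint, and two pages are treated as near duplicates when their fingerprint sets overlap strongly. The function names (paragraph_fingerprints, is_near_duplicate), the use of MD5, and the Jaccard threshold are illustrative assumptions, not the thesis's actual design.

```python
import hashlib

def paragraph_fingerprints(text):
    """Hypothetical sketch: split the extracted main content into paragraphs
    (blank-line delimited here) and hash each normalized paragraph to a
    short MD5 fingerprint."""
    fingerprints = set()
    for para in text.split("\n\n"):
        normalized = " ".join(para.split()).lower()
        if normalized:
            fingerprints.add(hashlib.md5(normalized.encode("utf-8")).hexdigest())
    return fingerprints

def is_near_duplicate(fps_a, fps_b, threshold=0.8):
    """Treat two pages as near duplicates when the Jaccard similarity of
    their paragraph-fingerprint sets reaches an assumed threshold."""
    if not fps_a or not fps_b:
        return False
    return len(fps_a & fps_b) / len(fps_a | fps_b) >= threshold

# Usage: compare the extracted main content of two pages.
page_a = "First paragraph of the page.\n\nSecond paragraph with details."
page_b = "First paragraph of the page.\n\nSecond paragraph, slightly edited."
print(is_near_duplicate(paragraph_fingerprints(page_a),
                        paragraph_fingerprints(page_b)))
```

The MapReduce redesign is likewise described only at a high level. The sketch below simulates, in plain Python, one plausible data flow for parallelizing fingerprint-based detection: the map step emits (fingerprint, page_id) pairs and the reduce step groups pages that share a fingerprint as candidate near-duplicate sets. An actual Hadoop 0.20.2 job would be written against the Hadoop Java API; this is only an illustration of the grouping idea and reuses paragraph_fingerprints from the sketch above.

```python
from collections import defaultdict

def map_phase(pages):
    """Map step: emit (fingerprint, page_id) for every paragraph fingerprint
    of every page. `pages` maps page ids to their extracted main content."""
    for page_id, text in pages.items():
        for fp in paragraph_fingerprints(text):
            yield fp, page_id

def reduce_phase(mapped):
    """Reduce step: group page ids that share a paragraph fingerprint,
    yielding candidate near-duplicate groups for closer comparison."""
    groups = defaultdict(set)
    for fp, page_id in mapped:
        groups[fp].add(page_id)
    return {fp: ids for fp, ids in groups.items() if len(ids) > 1}
```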
Keywords/Search Tags:Noise elimination, Paragraph fingerprint, Near duplicate detection, MapReduce, Parallel computing