
Research on a Large-Scale Paragraph-Fingerprint-Based Near-Duplicate Web Page Detection Algorithm

Posted on: 2013-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y Luan
Full Text: PDF
GTID: 2218330371959961
Subject: Computer application technology
Abstract/Summary:
The rapid development of the Internet in recent years has made web search engines remarkably significant, and the enormous volume of web pages poses great challenges to search engine technology. In particular, duplicate and near-duplicate web pages create additional overhead for search engines and critically affect their performance and quality. The detection of duplicate and near-duplicate web pages has therefore long been recognized as an important problem in the web crawling research community. In this paper, we first introduce the key technologies of near-duplicate detection for web pages and then analyze several classical algorithms.

Before detection, the noise in web pages, such as navigation bars, advertisements, and copyright notices, must be removed. We propose a new algorithm for extracting the main content of web pages based on the priority of HTML tags and the DOM tree. The experimental results indicate that the main content of web pages can be extracted accurately by our algorithm.

In addition, we propose a paragraph-fingerprint-based algorithm for detecting duplicate and near-duplicate web pages. The experimental results indicate that the algorithm achieves both high precision and high recall. We also implement three other algorithms, Shingling, SimHash, and an algorithm based on extracting long sentences from the full text, and compare them with ours. The experimental results show that our algorithm performs better on three benchmark measures, precision, recall, and efficiency, while also reducing the size of the feature set.

Finally, we redesign our near-duplicate detection algorithm on the MapReduce programming model and implement it on the open source framework Hadoop 0.20.2, with HDFS providing distributed storage. The experimental results indicate that the parallel design enables our algorithm to handle large-scale data successfully.
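The abstract does not give the details of the paragraph-fingerprint scheme, so the following is only a minimal sketch of the general idea it describes: each paragraph of the extracted main content is hashed to a fingerprint, and two pages are treated as near duplicates when their fingerprint sets overlap strongly. The function names (paragraph_fingerprints, is_near_duplicate), the use of MD5, and the Jaccard threshold are illustrative assumptions, not the thesis's actual design.

```python
import hashlib

def paragraph_fingerprints(text):
    """Hypothetical sketch: split the extracted main content into paragraphs
    (blank-line delimited here) and hash each normalized paragraph to a
    short MD5 fingerprint."""
    fingerprints = set()
    for para in text.split("\n\n"):
        normalized = " ".join(para.split()).lower()
        if normalized:
            fingerprints.add(hashlib.md5(normalized.encode("utf-8")).hexdigest())
    return fingerprints

def is_near_duplicate(fps_a, fps_b, threshold=0.8):
    """Treat two pages as near duplicates when the Jaccard similarity of
    their paragraph-fingerprint sets reaches an assumed threshold."""
    if not fps_a or not fps_b:
        return False
    return len(fps_a & fps_b) / len(fps_a | fps_b) >= threshold

# Usage: compare the extracted main content of two pages.
page_a = "First paragraph of the page.\n\nSecond paragraph with details."
page_b = "First paragraph of the page.\n\nSecond paragraph, slightly edited."
print(is_near_duplicate(paragraph_fingerprints(page_a),
                        paragraph_fingerprints(page_b)))
```

The MapReduce redesign is likewise described only at a high level. The sketch below simulates, in plain Python, one plausible data flow for parallelizing fingerprint-based detection: the map step emits (fingerprint, page_id) pairs and the reduce step groups pages that share a fingerprint as candidate near-duplicate sets. An actual Hadoop 0.20.2 job would be written against the Hadoop Java API; this is only an illustration of the grouping idea and reuses paragraph_fingerprints from the sketch above.

```python
from collections import defaultdict

def map_phase(pages):
    """Map step: emit (fingerprint, page_id) for every paragraph fingerprint
    of every page. `pages` maps page ids to their extracted main content."""
    for page_id, text in pages.items():
        for fp in paragraph_fingerprints(text):
            yield fp, page_id

def reduce_phase(mapped):
    """Reduce step: group page ids that share a paragraph fingerprint,
    yielding candidate near-duplicate groups for closer comparison."""
    groups = defaultdict(set)
    for fp, page_id in mapped:
        groups[fp].add(page_id)
    return {fp: ids for fp, ids in groups.items() if len(ids) > 1}
```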
Keywords/Search Tags:Noise elimination, Paragraph fingerprint, Near duplicate detection, MapReduce, Parallel computing