A Duplicate Document Detect System Based On GPU Parallel Computation

Posted on:2012-06-18

Degree:Master

Type:Thesis

Country:China

Candidate:P Y Xiao

GTID:2178330332476266

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, the volume of documents on the network has been growing exponentially. Because information circulates so freely online, a large amount of content is duplicated across documents. These duplicates pose a major challenge for information retrieval tools, so identifying duplicate documents, which are of no value to the user, quickly and accurately is an important problem for the Internet industry.

This thesis combines duplicate document detection with GPU parallel computing and proposes a GPU-based parallel method for detecting duplicate documents. The method is derived from the shingling algorithm, which it extends and optimizes in three ways:

1. It stores features in a Bloom filter, providing a fast pre-filter that avoids unnecessary duplicate-document searches.
2. It borrows the idea of the inverted index (reverse index) used by search engines to speed up duplicate detection.
3. It performs duplicate detection with GPU parallel computation, accelerating the entire detection process.

Experiments indicate that the method detects duplicate news documents quickly. On news documents it achieves high precision, although its recall is only moderate.
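As an illustration of how the first two extensions named in the abstract could fit together, here is a minimal CPU-only Python sketch. It is not the thesis implementation: the class and function names, the word-level shingle size `k`, the Jaccard threshold, and the Bloom filter parameters are all assumptions chosen for the example, and the GPU stage is omitted entirely.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) of text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b):
    """Jaccard resemblance of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class BloomFilter:
    """Bit-array membership filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


class DuplicateDetector:
    """Hypothetical detector: Bloom pre-filter, then inverted-index lookup."""

    def __init__(self, k=3, threshold=0.5):
        self.k = k
        self.threshold = threshold
        self.seen = BloomFilter()
        self.index = defaultdict(set)  # shingle -> ids of documents containing it
        self.docs = {}                 # document id -> shingle set

    def add(self, doc_id, text):
        feats = shingles(text, self.k)
        self.docs[doc_id] = feats
        for s in feats:
            self.seen.add(s)
            self.index[s].add(doc_id)

    def find_duplicates(self, text):
        feats = shingles(text, self.k)
        # Pre-filter: if fewer than `threshold` of the shingles may have been
        # seen before, no stored document can reach the resemblance threshold.
        known = [s for s in feats if s in self.seen]
        if not feats or len(known) / len(feats) < self.threshold:
            return []
        # Inverted index: only documents sharing a shingle are candidates.
        candidates = set()
        for s in known:
            candidates |= self.index.get(s, set())
        scored = [(d, resemblance(feats, self.docs[d])) for d in candidates]
        return sorted((d, r) for d, r in scored if r >= self.threshold)


detector = DuplicateDetector(k=2, threshold=0.5)
detector.add("a", "the quick brown fox jumps over the lazy dog")
near = detector.find_duplicates("the quick brown fox jumps over the lazy cat")
far = detector.find_duplicates("completely unrelated text about gpu kernels")
```

The pre-filter is safe because a Bloom filter never reports a stored shingle as absent, so a query rejected at that step cannot resemble any indexed document at or above the threshold. The per-candidate resemblance computation at the end is independent for each candidate, which is the kind of work that lends itself to parallel execution; how the thesis actually maps detection onto the GPU is not stated in the abstract.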

Keywords/Search Tags:

duplicate document detection, near duplicate detection, Bloom Filter, reverse index, GPU, CUDA, crawler

Related items

1	Design And Implementation Of Duplicate Objects Detection In XML Document
2	Near-duplicate Video Fast Detection Based On Global And Local Features Fusion
3	Research On High-Performance Duplicate Detection And Elimination
4	Research Of Chinese News Web Page Duplicate Detection
5	Research On Mobile Search Oriented WAP Duplicate Data Detection
6	Research On Near-duplicate Detection Algorithm
7	Research Of Automated Duplicate Bug Report Detection
8	Near-duplicate Detection In Large Scale Video Dataset
9	A Detection Method Of Duplicate Defect Reports Based On Fusing Text And Categorization Information
10	The Design And Implement Of Sharing Website Based On Near-duplicate Web Video Detection