
A Duplicate Document Detect System Based On GPU Parallel Computation

Posted on: 2012-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: P Y Xiao
Full Text: PDF
GTID: 2178330332476266
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the volume of documents on the network has been growing exponentially. Because information circulates so freely online, a large share of these documents duplicate one another. Duplicate documents pose a serious challenge for information retrieval tools, so identifying duplicates, which are of no value to the user, quickly and accurately is an important problem for the Internet industry.

This thesis combines duplicate document detection with GPU parallel computing and proposes a GPU-based parallel method for detecting duplicate documents. The method builds on the shingling algorithm and extends and optimizes it in three ways:
1. It stores features in a Bloom Filter, providing a fast pre-filter that avoids unnecessary duplicate-document searches.
2. It borrows the idea of the inverted index used by search engines to speed up duplicate detection.
3. It performs the detection with GPU parallel computation, accelerating the entire detection process.

Experiments show that the method detects duplicate news documents quickly. On news documents it achieves high precision, although its recall is only moderate.
Keywords/Search Tags: duplicate document detection, near duplicate detection, Bloom Filter, inverted index, GPU, CUDA, crawler
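
The first two extensions named in the abstract (a Bloom Filter pre-filter and an index from shingles to documents) can be illustrated with a small CPU-only Python sketch. It is not the thesis implementation: the word-level shingles, the shingle size `k`, the Jaccard similarity with a fixed `threshold`, the Bloom Filter sizing, and every class and function name below are assumptions made for illustration. The GPU step is not shown; in a GPU version, the per-candidate overlap scoring would be the natural part to run in parallel.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (word-level is an assumption)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


class DuplicateDetector:
    """Hypothetical detector: Bloom pre-filter, then inverted-index lookup."""

    def __init__(self, k=3, threshold=0.8):
        self.k = k
        self.threshold = threshold        # Jaccard cut-off (assumed measure)
        self.seen = BloomFilter()         # every shingle indexed so far
        self.index = defaultdict(set)     # shingle -> ids of documents containing it
        self.doc_shingles = {}            # doc id -> its shingle set

    def add(self, doc_id, text):
        feats = shingles(text, self.k)
        self.doc_shingles[doc_id] = feats
        for s in feats:
            self.seen.add(s)
            self.index[s].add(doc_id)

    def find_duplicates(self, text):
        feats = shingles(text, self.k)
        # Pre-filter: a Bloom filter has no false negatives, so shingles it
        # rejects cannot be in the index and need no lookup at all.
        maybe_known = [s for s in feats if s in self.seen]
        if not maybe_known:
            return []
        # Inverted index: count shared shingles per candidate document.
        shared = defaultdict(int)
        for s in maybe_known:
            for other in self.index.get(s, ()):
                shared[other] += 1
        duplicates = []
        for other, n in shared.items():
            union = len(feats) + len(self.doc_shingles[other]) - n
            if n / union >= self.threshold:
                duplicates.append(other)
        return sorted(duplicates)


if __name__ == "__main__":
    detector = DuplicateDetector(k=3, threshold=0.5)
    detector.add("a", "the quick brown fox jumps over the lazy dog")
    print(detector.find_duplicates("the quick brown fox jumps over the lazy cat"))
    print(detector.find_duplicates("completely unrelated text about gpu kernels"))
```

The pre-filter only pays off when most incoming documents share no shingles with the indexed collection. Otherwise nearly every query falls through to the index lookup, which is where most of the work is done.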