Research And Application On Document Similarity Detection Based On Minwise Hashing

Posted on: 2013-11-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X P Yuan
GTID: 1268330401479270
Subject: Computer application technology
Abstract/Summary:
With the explosive growth of information on the Web, a large amount of similar content has appeared. These similar documents consume substantial indexing resources and adversely affect users. Document similarity detection is an important topic in the information processing field, and it is a powerful tool for protecting authors' intellectual property and improving the efficiency of information retrieval. Duplicate Document Detection (DDD) is widely used to find similar documents, or in other words, to detect plagiarism in documents. Plagiarism includes not only verbatim copying, but also close imitation of the language and thoughts of another author and the representation of them as one's own original work.

Taking similarity detection for fund projects as its research background, and aiming to detect similar documents quickly and accurately in massive data sets, this dissertation studies the theories and methods of document similarity detection in depth, in particular the similarity estimation algorithm, the similarity retrieval algorithm, and similarity matching techniques based on SIMD optimization. The research work is as follows:

(1) An f-fractional bit minwise hash algorithm is proposed to offer a wider range of trade-offs between accuracy and storage space. The dissertation studies the feasibility of the f-fractional bit minwise hash algorithm and constructs the optimal fractional bit that minimizes the variance of the estimator. The innovation of the algorithm is that it extends the b bits of b-bit minwise hashing to f fractional bits. This removes the restriction that b must be an integer, so that similarity can be estimated with a fractional number of bits. It not only enriches the theory of minwise hashing, but also supports the diverse precision and storage requirements of practical systems.

(2) A connected bit minwise hash algorithm is proposed to improve the efficiency of similarity estimation: the number of comparisons is greatly reduced, roughly by half, with a 5% loss of accuracy. Connected bits are easy to construct and the performance increases exponentially, which is of strong practical significance when processing massive amounts of data.

(3) A fingerprint group merging retrieval algorithm is proposed, largely to address two sides of one problem: the similarity threshold cannot be set too low, and using fewer fingerprints leads to low accuracy. The algorithm can quickly find the documents with higher similarity in an existing document set while using a lower similarity threshold. Because the similarity threshold is reduced, the algorithm applies to a wider range of scenarios.

(4) To address the difficulty of presenting comparison results clearly, an optimized similarity comparison algorithm is proposed and implemented using Intel SIMD technology and GPUs, based on an analysis of the whole non-parallel algorithm and of data statistics. Experimental results show that certain performance improvements can be obtained. The evidence of similarity is displayed prominently in the system, so that the similarity can be traced and manual review is made easier.

(5) A fund project similarity detection system faces three key problems: existing project data is difficult to extract quickly and accurately, comparing massive amounts of project data takes too long, and comparison results are difficult to present clearly. The dissertation applies the key technologies above to build a complete fund project similarity detection system for the formal review of fund projects.
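For readers unfamiliar with the baseline that items (1) and (2) extend, the following is a minimal sketch of standard minwise hashing and of the integer b-bit variant. It is not the f-fractional bit or connected bit algorithm of the dissertation. All function names, the choice of BLAKE2b as the hash family, and the 256-hash signature length are assumptions made for illustration. The b-bit estimator uses the common simplification that two unequal minima agree on their lowest b bits with probability about 1/2^b, which is only an approximation.

```python
import hashlib


def _hash(item, seed):
    """64-bit hash of an item under a given seed (illustrative hash family)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """One minimum hash value per seed; `items` must be a non-empty set."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the full minima agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def estimate_jaccard_b_bit(sig_a, sig_b, b=2):
    """Estimate similarity from only the lowest b bits of each minimum.

    Assumes unequal minima collide on b bits with probability 2**-b,
    then removes that chance agreement from the observed match rate.
    """
    mask = (1 << b) - 1
    match_rate = sum((x & mask) == (y & mask)
                     for x, y in zip(sig_a, sig_b)) / len(sig_a)
    chance = 1.0 / (1 << b)
    return max(0.0, (match_rate - chance) / (1.0 - chance))


if __name__ == "__main__":
    doc_a = set(range(0, 300))      # stand-ins for shingle sets
    doc_b = set(range(150, 450))    # true Jaccard similarity is 1/3
    sig_a = minhash_signature(doc_a)
    sig_b = minhash_signature(doc_b)
    print("full minima:", estimate_jaccard(sig_a, sig_b))
    print("2-bit:      ", estimate_jaccard_b_bit(sig_a, sig_b, b=2))
```

Keeping only b bits per minimum reduces storage per fingerprint from 64 bits to b bits, at the price of a noisier estimate for the same number of hashes. This is the accuracy versus storage trade-off that the abstract says the fractional bit extension makes finer-grained.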
Keywords/Search Tags: document similarity detection, minwise hashing, similarity estimation, fingerprint, fractional bit, connected bit
Related items