Research And Application Of Weighted One Permutation Hash

Posted on: 2022-07-25
Degree: Master
Type: Thesis
Country: China
Candidate: S L Wang
GTID: 2518306332995789
Subject: Electronics and Communications Engineering
Abstract/Summary:
Document similarity detection plays an important role in maintaining research integrity, ensuring academic fairness, and protecting intellectual property rights. Motivated by similarity detection for National Natural Science Foundation of China (NSFC) project applications, and targeting the low efficiency and limited accuracy of similarity detection over massive text collections, this thesis proposes a weighted one permutation hashing algorithm, a one permutation hashing algorithm based on position encoding, and a filtering algorithm for consistent weighted sampling. The main contributions are as follows:

(1) Weighted one permutation hashing (WOPH). One permutation hashing (OPH) builds its estimator on a uniform partition of the sample space, so its accuracy is fixed once the partition is chosen: the only way to improve accuracy is to narrow the partitions by increasing their number k and re-dividing the sample space Ω accordingly. To remove this limitation, the thesis divides the full set Ω evenly into k1 partitions and into k2 partitions (k1 ≠ k2), and then combines k1 and k2 in a chosen ratio to form a weighted partition count kw. Adjusting the ratio yields different values of kw that meet different accuracy requirements. The WOPH estimator is derived theoretically, and experimental results show that WOPH can vary its accuracy without repeating preprocessing or feature extraction.

(2) One permutation hashing based on position encoding (POPH). In OPH and WOPH, feature fingerprints that do not affect the result still take part in fingerprint comparison, which reduces computational efficiency. The thesis therefore proposes POPH, which combines the value of each feature fingerprint with its position to form a fingerprint in <key,value> form, stores only the values and positions of non-empty fingerprints, and computes similarity with a cross-comparison method.

(3) Filtering algorithm for consistent weighted sampling. Consistent weighted sampling is accurate but computationally inefficient. During feature matching, the proposed filtering algorithm sets observation points within the fingerprint comparison and applies hypothesis testing and the small-probability principle to judge the similarity of the document under test, so that the full fingerprint comparison need not be completed, which improves computational efficiency.

(4) Document similarity detection system. The thesis applies the weighted one permutation hashing algorithm and the consistent weighted sampling filtering algorithm to the similarity detection system for NSFC project applications. The detection efficiency of the system improves significantly and the detection results are reliable, providing a scientific basis for the NSFC project application similarity detection task undertaken by the Key Laboratory of Central South University of Hunan Province.
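
For readers unfamiliar with the baseline, the following is a minimal Python sketch of standard one permutation hashing with a sparse, position-keyed sketch. It is not the thesis's WOPH estimator or its exact POPH scheme, neither of which is specified in this abstract. It only illustrates the general idea of keeping non-empty bins as <position, value> pairs and comparing them by key. The function names, the SHA-1 based stand-in for a random permutation, and the defaults `k=64` and `space=2**32` are all assumptions made for illustration.

```python
import hashlib


def _hash(token: str, space: int) -> int:
    """Map a token to a pseudo-random position in [0, space).

    Assumption: a SHA-1 digest stands in for the random permutation.
    """
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % space


def oph_sketch(tokens, k=64, space=1 << 32):
    """Build a sparse OPH sketch: {bin position: smallest offset in that bin}.

    The hashed space is split evenly into k bins. Empty bins are simply
    absent from the dictionary, so they cost no storage.
    """
    if space % k != 0:
        raise ValueError("space must be divisible by k for an even partition")
    width = space // k
    sketch = {}
    for token in set(tokens):
        position, offset = divmod(_hash(token, space), width)
        if position not in sketch or offset < sketch[position]:
            sketch[position] = offset
    return sketch


def estimate_resemblance(a, b):
    """Standard OPH estimate: matching bins / bins non-empty in either sketch.

    The denominator equals k minus the number of jointly empty bins,
    so jointly empty bins never need to be visited.
    """
    non_empty = a.keys() | b.keys()
    if not non_empty:
        return 0.0
    matches = sum(1 for i in a.keys() & b.keys() if a[i] == b[i])
    return matches / len(non_empty)


if __name__ == "__main__":
    doc_a = "document similarity detection with one permutation hashing".split()
    doc_b = "text similarity detection with minwise hashing".split()
    print(estimate_resemblance(oph_sketch(doc_a), oph_sketch(doc_b)))
```

With a dense array of k slots, every comparison touches all k positions, including those empty in both documents. The dictionary form only iterates over bins that hold a value, which is the kind of saving the position-encoding idea aims at.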
Keywords/Search Tags: one permutation hash, dynamic double threshold, minwise hashing, consistent sampling