Font Size: a A A

Optimlzation And Implementation Of Hash Based Text Copy Detection Algorithms

Posted on:2015-02-20Degree:MasterType:Thesis
Country:ChinaCandidate:Y N GuoFull Text:PDF
GTID:2268330431456310Subject:Computer technology
Abstract/Summary:PDF Full Text Request
As the growth of social website and various service-oriented website, numerous user generated contents appear. These contents are very useful in fields such as data mining and so on. However, duplicate appears very common in these contents, which gives us a big challenge. It is important to research on the methods which can be used to eliminate these duplicate things. In the procedure of text copy detection, we should extract features of texts and transform each text to a vector first. A word is seemed as a feature in this text. By doing this, a text is transformed to a high dimensional vector. And the multiplicity of two texts can be measured by the distance between the two corresponding vectors. As the distance between two high dimensional vectors is can’t be calculated directly, we need a fast computing method to solve this problem.Hash-based copy detection algorithm is widely used due to its simplity and high efficiency. Firstly, hash-based algorithm is used for mapping high dimensional vectors to low dimensional space. And then the similarity of texts can be reflected by the distance of corresponding low dimensional vectors.A new text copy detection algorithm is proposed in this paper, which is very effective in accomplishing copy detection for short texts, such as microblog and comments. This algorithm is called Optimized qSign Algorithm. Firstly, we introduce the basic qSign algorithm and an optimization method on basic qSign algorithm. There are three problems exist in the optimization method, which can be solved effectively by the optimized qSign algorithm proposed in this thesis. A large scale high dimensional non-linear optimization problem needs to be solved in the optimized qSign algorithm, which is hardly solved by some direct methods. So, how to solve the optimization problem is the difficulty in this paper. Markov Chain Monte Carlo algorithm is implemented, and the problem is solved by word clustering. Finally, we introduce several classic hash-based copy detection algorithms, includes Locality-Sensitive Hashing, Min-Hash, Spectral Hashing and Semantic Hashing. Their implementations are introduced, too. Three experiments are designed to compare the efficiency and effective of each algorithm. The effective and efficiency of each algorithm, the optimization series of qSign algorithm and the selection of parameters of the proposed optimization qSign algorithm are compared in these three experiments respectively. Experimental results show that the proposed optimization algorithm outperforms the state-of-the art, and only need several steps of offline computation.
Keywords/Search Tags:Text Copy Detection, Hash Code, Optimization, Markov Chain MontaCarlo
PDF Full Text Request
Related items