Research On High Efficient Text Copy Detection Based On Hash Learning

Posted on:2014-03-12

Degree:Master

Type:Thesis

Country:China

Candidate:Wu

Full Text:PDF

GTID:2208330434972758

Subject:Computer application technology

Abstract/Summary:

With the growth and spread of the internet and computers, more and more data is stored electronically. This brings a massive number of duplicate documents, which cause problems for many people. For example, many companies and organizations suffer from excessive storage consumption and inefficient search. For many websites, especially news websites, large amounts of copied web content harm the interests and enthusiasm of content producers, which is damaging to the internet as a whole. In addition, large amounts of duplicate data reduce the effectiveness of search engines to some extent.

Research on duplicate document detection follows two main directions: 1) text representation; 2) efficiency and scalability. The former focuses on how to extract features from text and use them to improve duplicate detection. The latter studies how to detect duplicate documents efficiently on massive data. In much of the research, however, the two directions are not isolated: the representation serves the efficiency goal, and an efficient duplicate detection method may require a special form of text representation. In addition, the granularity of duplicate detection varies across applications. Those that need finer granularity, e.g. sentence-level duplicate detection, face much more severe efficiency problems.

The main contents of this thesis cover both directions, as follows:

1) First, a complete sentence-level duplicate document detection framework is presented, including the main processes and detection algorithms.
2) Second, the practicality of using hash codes to represent text is discussed. It is shown that the existing hash code generation method leaves much room for improvement in precision. Because the hash code space is limited and needs to be fully used, a hash code learning method is proposed. The experimental results show that this method generates better hash codes for this task, and thus largely improves detection precision without harming recall.
3) Finally, by implementing the most time-consuming algorithms on the CUDA platform, the method achieves a speedup of more than 1500X while maintaining good scalability.
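
To make the idea of representing text by hash codes concrete, the following is a minimal sketch of sentence-level near-duplicate detection with fixed-length binary codes compared by Hamming distance. The abstract does not name the existing hash code generation method or describe the learned one, so this sketch assumes SimHash as a stand-in baseline; the tokenizer, the 64-bit code length, the `max_distance` threshold, and all function names are illustrative assumptions rather than the thesis's implementation.

```python
import hashlib
import re

BITS = 64  # assumed code length; the abstract does not state one


def tokens(sentence):
    """Lowercase alphanumeric word tokens (a deliberately simple, assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def token_hash(tok):
    """Stable 64-bit hash of a token (Python's built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")


def simhash(sentence):
    """Return a 64-bit SimHash fingerprint of a sentence as an int."""
    counts = [0] * BITS
    for tok in tokens(sentence):
        h = token_hash(tok)
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    code = 0
    for i, c in enumerate(counts):
        if c > 0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(s1, s2, max_distance=3):
    """Flag two sentences as near-duplicates if their codes are close.

    max_distance is an assumed threshold, not a value from the thesis.
    """
    return hamming(simhash(s1), simhash(s2)) <= max_distance
```

Comparing codes needs only an XOR and a bit count per sentence pair, which is the kind of simple, uniform work that can be run in parallel on a GPU. A learned hash function, as proposed in the thesis, would replace `simhash` while leaving the comparison step unchanged.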

Keywords/Search Tags:

learning to hash, copy detection, GPU

Related items

1	Optimization And Implementation Of Hash Based Text Copy Detection Algorithms
2	Multimedia Copy Detection Technology Based On Robust Hashing
3	Research On Content-similarity Based Video Segment Copy Detection
4	Research On Hash Algorithm Based Image Copy Detection
5	Perceptual Hashing For Image Copy Detection
6	Research On The Techniques For Content-based Video Copy Detection
7	Research On Deep Hash For Video Copy Retrieval
8	Design And Implementation Of Digital Watermarking And Copy Detection Based Internet Information Security Monitoring And Service System
9	Research On Image Hash Algorithm For Copy Detection And Tamper Detection
10	Researches On Visual Saliency Based Video Hashing For Video Copy Detection