
Research on Highly Efficient Text Copy Detection Based on Hash Learning

Posted on: 2014-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: Wu
GTID: 2208330434972758
Subject: Computer application technology
Abstract/Summary:
With the spread of the internet and computers, more and more data is stored electronically. This brings with it a massive number of duplicate documents, which cause problems for many parties. Many companies and organizations suffer from excessive storage use and inefficient search. For many websites, especially news websites, copied and reused web content harms the interests and motivation of content producers, which has a very bad impact on the internet. In addition, large amounts of duplicate data reduce the effectiveness of search engines to some extent.

Research on duplicate document detection follows two main directions: 1) text representation; 2) efficiency and scalability. The first focuses on how to extract features from text and use them to detect duplicates more accurately. The second studies how to detect duplicate documents efficiently when the data is massive. In much of the research, however, the two directions are not isolated: the representation serves the detection method, and an efficient detection method may require a particular form of text representation. The granularity of detection also varies by application. Applications that need finer granularity, such as sentence-level duplicate detection, face much more severe efficiency problems.

This thesis covers both directions, as follows:
1) First, a complete sentence-level duplicate document detection framework is presented, including the main processes and detection algorithms.
2) Second, the practicality of using hash codes to represent text is discussed. The existing hash code generation method is shown to have considerable room for improvement in precision. Because the hash code space is limited and needs to be fully used, a hash code learning method is proposed. Experimental results show that this method generates better hash codes for this task, largely improving detection precision without harming recall.
3) Finally, by implementing the most time-consuming algorithms on the CUDA platform, the method achieves a speedup of more than 1500x while maintaining good scalability.
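
The idea of representing sentences as hash codes and comparing them by bit difference can be illustrated with a minimal Python sketch. It is not the learned hashing method proposed in the thesis: it uses a fixed SimHash-style fingerprint as a stand-in, and the function names, the 64-bit code length, and the distance threshold of 3 are all assumptions made for illustration only.

```python
import hashlib


def tokens(sentence):
    # Assumption: lowercase whitespace tokenization; a real system
    # would likely use a more careful tokenizer or n-gram features.
    return sentence.lower().split()


def simhash(sentence, bits=64):
    """Fixed (non-learned) SimHash-style fingerprint of a sentence."""
    counts = [0] * bits
    for tok in tokens(sentence):
        # Take the first 8 bytes of MD5 as a 64-bit token hash.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the code is set when the token votes for that bit are positive.
    code = 0
    for i, c in enumerate(counts):
        if c > 0:
            code |= 1 << i
    return code


def hamming(a, b):
    # Number of bit positions in which the two codes differ.
    return bin(a ^ b).count("1")


def is_duplicate(s1, s2, max_distance=3, bits=64):
    # Hypothetical decision rule: codes within max_distance bits
    # of each other are treated as duplicate candidates.
    return hamming(simhash(s1, bits), simhash(s2, bits)) <= max_distance
```

Comparing fixed-length codes reduces each sentence pair to an XOR and a bit count, which is the kind of uniform, data-parallel work that suits a GPU. A learned hash function would replace `simhash` here while leaving the Hamming comparison unchanged.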
Keywords/Search Tags:learning to hash, copy detection, GPU