Document Parts Duplicate Detection Research

Posted on: 2013-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: H M Yu
GTID: 2248330395950174
Subject: Computer application technology
Abstract/Summary:
With the explosive growth of the Internet, the enormous volume of duplicated data causes serious problems for search engines, opinion mining, and many other Web applications. Most existing near-duplicate detection approaches work at the document level, so they cannot find duplicated content that makes up only a small part of two documents.

To solve this problem, this thesis proposes a novel algorithm. The main idea is to divide the task into two subtasks: sentence-level copy detection and sequence matching.

An effective and efficient feature extraction algorithm, Low-IDF-SIG, is proposed, and an efficient sentence-level near-duplicate detection system is built on it. For evaluation, the proposed method is compared with other approaches on a real corpus. Experimental results show that the proposed method improves both the precision and the efficiency of sentence-level near-duplicate detection.

The author also proposes a novel partial copy detection algorithm, PDC-MR-Ⅱ, based on the MapReduce framework.

The algorithm and system proposed in this thesis can be applied to many problems, such as copy detection in papers, detection of copied topics in forums, and copy detection in paginated news articles.
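
The two-subtask decomposition described in the abstract (sentence-level copy detection followed by sequence matching) can be illustrated with a minimal Python sketch. This is not the thesis's method: `sentence_key` is a hypothetical stand-in that uses a sentence's sorted word set as its signature (it is not Low-IDF-SIG), the sequence-matching step uses Python's standard `difflib.SequenceMatcher`, and nothing here runs on MapReduce. The function names and the `min_run` parameter are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    # Naive splitter on '.', '!' and '?'; a real system would use a proper tokenizer.
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def sentence_key(sentence):
    # Hypothetical signature (NOT Low-IDF-SIG): the sorted set of lowercase words,
    # so sentences that differ only in case or word order get the same key.
    return tuple(sorted(set(re.findall(r"\w+", sentence.lower()))))


def shared_passages(doc_a, doc_b, min_run=2):
    # Subtask 1: reduce each document to a sequence of sentence signatures.
    keys_a = [sentence_key(s) for s in split_sentences(doc_a)]
    keys_b = [sentence_key(s) for s in split_sentences(doc_b)]
    # Subtask 2: find runs of consecutive sentences whose signatures match.
    matcher = SequenceMatcher(None, keys_a, keys_b, autojunk=False)
    # Each result is (start sentence in doc_a, start sentence in doc_b, run length).
    return [(m.a, m.b, m.size)
            for m in matcher.get_matching_blocks()
            if m.size >= min_run]


doc_a = "Intro text here. The cat sat on the mat. The dog barked loudly. Closing remarks."
doc_b = "Something different. Another line. The cat sat on the mat. The dog barked loudly."
print(shared_passages(doc_a, doc_b))
```

In this sketch the two documents are not near-duplicates as wholes, yet the shared two-sentence passage is still reported, which is the kind of partial copy that document-level detection would miss.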
Keywords/Search Tags: Partial Copy Detection, Low-IDF-SIG, PDC-MR-Ⅱ, MapReduce