Document Parts Duplicate Detection Research

Posted on:2013-02-10

Degree:Master

Type:Thesis

Country:China

Candidate:H M Yu

Full Text:PDF

GTID:2248330395950174

Subject:Computer application technology

Abstract/Summary:

With the explosive growth of the Internet, the enormous volume of duplicated data causes serious problems for search engines, opinion mining, and many other Web applications. Most existing near-duplicate detection approaches work at the document level, so they cannot find a duplicated part that makes up only a small piece of two documents.

To solve this problem, this thesis proposes a novel algorithm. The main idea is to divide the task into two subtasks: sentence-level copy detection and sequence matching.

An effective and efficient feature extraction algorithm, Low-IDF-SIG, is proposed, and an efficient sentence-level near-duplicate detection system is built on it. For evaluation, the proposed method was compared with other approaches on a real corpus. Experimental results show that the proposed method improves both the precision and the efficiency of sentence-level near-duplicate detection.

The author also proposes a novel partial copy detection algorithm, PDC-MR-Ⅱ, based on the MapReduce framework. The algorithm and system proposed in this thesis can be applied to many problems, such as paper copy detection, topic copy detection in forums, and paged news copy detection.
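The two-subtask decomposition described in the abstract (match sentences first, then match sequences of matched sentences) can be illustrated with a rough sketch. The abstract does not describe how Low-IDF-SIG or PDC-MR-Ⅱ work internally, so neither is reproduced here: the sentence matcher below is a plain Jaccard word-overlap stand-in, and the sequence matcher simply looks for runs of consecutive matched sentences. All function names, the naive sentence splitter, and the `threshold` and `min_len` parameters are assumptions made for illustration only.

```python
import re


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption for this sketch)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two sentences."""
    sa = set(re.findall(r"\w+", a.lower()))
    sb = set(re.findall(r"\w+", b.lower()))
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def match_sentences(doc_a, doc_b, threshold=0.8):
    """Subtask 1 (stand-in, NOT Low-IDF-SIG): near-duplicate sentence pairs.

    Returns the set of index pairs (i, j) whose sentences are similar enough.
    """
    return {
        (i, j)
        for i, sa in enumerate(doc_a)
        for j, sb in enumerate(doc_b)
        if jaccard(sa, sb) >= threshold
    }


def copied_runs(pairs, min_len=2):
    """Subtask 2 (stand-in): maximal runs of consecutive matched sentences.

    Returns (start_a, start_b, length) for each run
    (i, j), (i+1, j+1), ... containing at least min_len pairs.
    """
    runs = []
    for (i, j) in sorted(pairs):
        if (i - 1, j - 1) in pairs:
            continue  # this pair continues an earlier run; not a run start
        length = 1
        while (i + length, j + length) in pairs:
            length += 1
        if length >= min_len:
            runs.append((i, j, length))
    return runs


def partial_copies(text_a, text_b, threshold=0.8, min_len=2):
    """Find partially copied passages shared by two documents."""
    a, b = split_sentences(text_a), split_sentences(text_b)
    return copied_runs(match_sentences(a, b, threshold), min_len)
```

Calling `partial_copies(text_a, text_b)` on two documents that share a short passage returns the sentence offsets and length of that passage, even when the rest of the two documents is unrelated. The all-pairs comparison in `match_sentences` is quadratic and only suitable for small inputs; a signature-based or distributed approach, such as the ones the thesis proposes, would be needed at scale.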

Keywords/Search Tags:

Partial Copy Detection, Low-IDF-SIG, PDC-MR-Ⅱ, MapReduce

Related items

1	A Video Copy Detection System Based On Graphs
2	Research On Video Copy Detection Based On The Hadoop Platform
3	Research And Implementation Of Document Copy Detection
4	Multimedia Copy Detection Technology Based On Robust Hashing
5	Research And Implementation On Video Copy Detection Based On SIFT Features
6	Design And Implementation Of MapReduce-based Structured Query Mechanism
7	Research And Implementation Of Advertising Video Copy Detection
8	Distributed Text Copy Detection Algorithms Based On The Index
9	Research On Image Copy Tamper Detection Algorithm Based On SIFT Feature Points
10	Research On The Copy Detection Technology For Source Code