Distributed Text Copy Detection Algorithms Based On The Index

Posted on:2013-10-03

Degree:Master

Type:Thesis

Country:China

Candidate:Y Zhang

Full Text:PDF

GTID:2248330395450570

Subject:Computer application technology

Abstract/Summary:

PDF Full Text Request

In internet, there are a lot of duplicated web pages. How to efficiently detect near duplicate in large scale document corpus has been a hot topic for a long time. A effective and efficient near duplicate detection algorithm is useful in many ways, such as reducing duplicated result in search engine, plagiarizing detection and text clustering. As the corpus size increases, near duplicate detection algorithm for single-node computer is intractable. Distributed computing framework such as Map-Reduce has been introduced in this task. To perform near duplicate detection on a distributed platform, several problems have to be solved. The first problem is how to split data across different nodes. The second one is how to reduce data traffic between different nodes. The third one is how to balance the load on every node. Previous research fails to handle these three problems properly. In this paper, we introduce two efficient near duplicate detection algorithms that could solve these problems and have good scalability.Usually. near duplicate detection algorithms use inverted index to improve their efficiency. In this paper, we first investigate the index structure on a distributed computing platform. We present an effective index structure--Doc-Split Index. DSI. In DSI, each block contains the index of a subset of the whole corpus. Therefore, at each time only one index block is needed in a single computing node. This will reduce the data transport between different nodes and balance the load on every node.。Based on the discussion on index structure, we propose two distributed algorithms for duplicate detection, PQ on DSI and PCP on DSI. These two algorithms are both efficient. Beside of that, the algorithms are implemented on Map-Reduce and in hence, have good scalability.In the experiments, we firstly perform parameter selection on an Oracle Set which is randomly selected from the real corpus with duplicated documents manually marked up. Then on the WT10G, we compared our algorithms (PQ on DSI, PCP on DSI) with two other algorithms (PQ, PCP) which were proposed by former researchers. The experiment shows our algorithms are better than theirs in efficiency.

Keywords/Search Tags:

Near Duplicate Detection, Copy Detection, Map-Reduce

PDF Full Text Request

Related items

1	Near-Duplicate Video Retrieval And Copy Video Detection
2	Research And Implementation Of High Efficiency Video Copy Detection Technology Based On
3	Document Parts Duplicate Detection Research
4	Feature Detector Based Image Copy Detection
5	Research Of Near-Duplicate Image Elimination
6	Research And Implementation Of Document Copy Detection
7	Near-duplicate Video Fast Detection Based On Global And Local Features Fusion
8	A Duplicate Document Detect System Based On GPU Parallel Computation
9	Multimedia Copy Detection Technology Based On Robust Hashing
10	Research Of Content-based Image Copy Detection