Font Size: a A A

Distributed Text Copy Detection Algorithms Based On The Index

Posted on:2013-10-03Degree:MasterType:Thesis
Country:ChinaCandidate:Y ZhangFull Text:PDF
GTID:2248330395450570Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
In internet, there are a lot of duplicated web pages. How to efficiently detect near duplicate in large scale document corpus has been a hot topic for a long time. A effective and efficient near duplicate detection algorithm is useful in many ways, such as reducing duplicated result in search engine, plagiarizing detection and text clustering. As the corpus size increases, near duplicate detection algorithm for single-node computer is intractable. Distributed computing framework such as Map-Reduce has been introduced in this task. To perform near duplicate detection on a distributed platform, several problems have to be solved. The first problem is how to split data across different nodes. The second one is how to reduce data traffic between different nodes. The third one is how to balance the load on every node. Previous research fails to handle these three problems properly. In this paper, we introduce two efficient near duplicate detection algorithms that could solve these problems and have good scalability.Usually. near duplicate detection algorithms use inverted index to improve their efficiency. In this paper, we first investigate the index structure on a distributed computing platform. We present an effective index structure--Doc-Split Index. DSI. In DSI, each block contains the index of a subset of the whole corpus. Therefore, at each time only one index block is needed in a single computing node. This will reduce the data transport between different nodes and balance the load on every node.。Based on the discussion on index structure, we propose two distributed algorithms for duplicate detection, PQ on DSI and PCP on DSI. These two algorithms are both efficient. Beside of that, the algorithms are implemented on Map-Reduce and in hence, have good scalability.In the experiments, we firstly perform parameter selection on an Oracle Set which is randomly selected from the real corpus with duplicated documents manually marked up. Then on the WT10G, we compared our algorithms (PQ on DSI, PCP on DSI) with two other algorithms (PQ, PCP) which were proposed by former researchers. The experiment shows our algorithms are better than theirs in efficiency.
Keywords/Search Tags:Near Duplicate Detection, Copy Detection, Map-Reduce
PDF Full Text Request
Related items