The Key Technology Research Of Distributed Plagiarism Detection Based On Hadoop

Posted on:2015-04-21

Degree:Master

Type:Thesis

Country:China

Candidate:M X Wang

GTID:2348330518470620

Subject:Engineering

Abstract/Summary:

Text plagiarism detection is an important research direction in natural language processing and plays an important role in areas such as intellectual property protection and information retrieval. As networks become ubiquitous and Internet applications grow richer, they produce large volumes of data. This brings a wealth of information and great convenience, but it also places higher demands on the design and performance of plagiarism detection systems. First, the system framework must adapt to the processing requirements of large-scale data sets, such as storage capacity, computing resources, and system reliability. Second, an online plagiarism detection system must return detection results quickly on large data sets. One way to address these problems is to apply big data processing methods to the plagiarism detection system. To meet these new requirements, we adopt a distributed cluster approach and extend and improve the traditional system modules so that they can handle large data sets. Research on plagiarism detection systems for large data sets currently not only has high theoretical value but is also very promising. This paper systematically describes the two components of the plagiarism detection system, the source retrieval module and the text alignment module, focusing on the architecture of the source retrieval module. The traditional source retrieval structure cannot meet the storage and computing requirements of large data sets, nor can it guarantee high reliability. In response to these problems, this paper presents an architecture for the source retrieval module based on index slices.
This architecture is built on a Hadoop distributed environment, which integrates the storage and computing resources of the whole cluster. It also combines Hadoop's replication mechanism with index slices, so that multiple copies of the data reside on different machines. Compared with the traditional source retrieval structure, the new structure is highly scalable and meets the requirements of processing large data sets. On this basis, the paper improves the plagiarism fragment merging algorithm in the text alignment module and proposes a fragment merging algorithm based on graph theory. Compared with the original algorithm, the new method improves the time performance of both the text alignment module and the entire plagiarism detection system. Finally, this paper designs and implements a distributed plagiarism detection system based on Hadoop. Experiments show that the source retrieval module based on index slices can accommodate the processing requirements of large-scale data sets, and that the graph-theory-based fragment merging algorithm effectively improves the time performance of the system.
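
The abstract does not spell out the graph-based fragment merging algorithm, so the following is only a minimal sketch of one plausible reading, not the thesis's actual method. It assumes each matched fragment is a pair of character spans (one in the suspicious document, one in the source), treats fragments as graph nodes, connects two nodes when their spans lie within a hypothetical `max_gap` in both documents, and merges each connected component into a single passage. The names `Fragment`, `gap`, `merge_fragments`, and `max_gap` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Fragment:
    """A matched passage (hypothetical representation): half-open character
    spans in the suspicious document and in the source document."""
    susp_start: int
    susp_end: int
    src_start: int
    src_end: int


def gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Distance between two intervals; 0 if they overlap or touch."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def merge_fragments(fragments: List[Fragment], max_gap: int = 50) -> List[Fragment]:
    """Merge fragments that fall in the same connected component of the
    'closeness' graph. max_gap is an assumed tuning parameter."""
    n = len(fragments)
    parent = list(range(n))  # union-find forest over fragment indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Add an edge (union) when two fragments are close in BOTH documents.
    for i in range(n):
        for j in range(i + 1, n):
            a, b = fragments[i], fragments[j]
            close_susp = gap(a.susp_start, a.susp_end, b.susp_start, b.susp_end) <= max_gap
            close_src = gap(a.src_start, a.src_end, b.src_start, b.src_end) <= max_gap
            if close_susp and close_src:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    # Collect each connected component and replace it by its bounding spans.
    groups: Dict[int, List[Fragment]] = {}
    for i, frag in enumerate(fragments):
        groups.setdefault(find(i), []).append(frag)

    merged = [
        Fragment(
            min(f.susp_start for f in group),
            max(f.susp_end for f in group),
            min(f.src_start for f in group),
            max(f.src_end for f in group),
        )
        for group in groups.values()
    ]
    return sorted(merged, key=lambda f: (f.susp_start, f.src_start))
```

Using connected components means that a chain of nearby fragments merges into one passage even when its two ends are far apart. The pairwise comparison here is quadratic in the number of fragments; it is kept for clarity, and the sketch makes no claim about how the thesis obtains its reported time improvement.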

Keywords/Search Tags:

plagiarism detection, massive data sets, source retrieval, text alignment, Hadoop

Related items

1	Research And Implementation Of Retrieval Model For Plagiarism Detection
2	Research On Plagiarism Detection Modeling Based On Statistical Machine Learning
3	Research Of Cross-Lingual Plagiarism Detection Mixed Translation And Bilingual Features
4	Research On Text Plagiarism Detection Methods
5	Source Code Plagiarism Detection Based On Information Retrieval And Stacking Integrated Learning
6	The Research And Implementation Of Parallelism Of Information Retrieval Related Algorithms Based On Mapreduce
7	Research On The Hadoop-based Distributed Full-text Retrieval And Related Technologies
8	The Study And Realization Of Paper Plagiarism Identification System Based On The Text Structure
9	Research And Implementation Of Distribute Massive Text Data Index And Retrieval System
10	Chinese Text Plagiarism Detection Algorithm Based On The Double Feature Extraction