Text plagiarism detection is an important research direction in natural language processing and plays a significant role in areas such as intellectual property protection and information retrieval. The spread of the Internet and the growing richness of its applications generate large volumes of data, which bring a wealth of information and great convenience but also place higher demands on the design and performance of plagiarism detection systems. First, the system framework must adapt to the processing requirements of large-scale data sets in terms of storage capacity, computing resources, and system reliability. Second, an online plagiarism detection system must be able to return detection results quickly on large data sets. One way to address these problems is to apply large-scale data processing methods to the plagiarism detection system. To accommodate these new requirements, we adopt the idea of a distributed cluster and extend and improve the traditional system modules so that they meet the processing requirements of large data sets. Research on plagiarism detection systems for large data sets currently not only has high theoretical value but is also very promising.

This paper systematically describes the two components of a plagiarism detection system, the source retrieval module and the text alignment module, with a focus on the architecture of the source retrieval module. The traditional source retrieval structure cannot meet the storage and computing requirements of large data sets, nor can it guarantee high reliability. In response to these problems, this paper presents an architecture for the source retrieval module based on index slices.
This architecture is built on a Hadoop distributed environment, which integrates all of the storage and computing resources in the cluster. It also uses Hadoop's replication mechanism together with index slices, so that multiple copies of the data reside on different machines. Compared with the traditional source retrieval structure, the new structure is highly scalable and meets the requirements of processing large data sets. On this basis, the paper improves the plagiarism fragment merging algorithm in the text alignment module and proposes a fragment merging algorithm based on graph theory. Compared with the original algorithm, the new method improves the time performance of both the text alignment module and the plagiarism detection system as a whole. Finally, this paper designs and implements a distributed plagiarism detection system based on Hadoop. Experiments show that the source retrieval module based on index slices can accommodate the processing requirements of large-scale data sets, and that the graph-based fragment merging algorithm effectively improves the time performance of the system.
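
The abstract does not spell out the graph-based fragment merging algorithm, so the following is only a minimal sketch of one plausible reading, not the paper's actual method. It assumes each matched fragment is a tuple of character offsets `(susp_start, susp_end, src_start, src_end)`, treats fragments as graph nodes, connects two fragments when they lie within a hypothetical `max_gap` of each other in both the suspicious and the source document, and merges each connected component into one bounding fragment. The names `Fragment`, `merge_fragments`, and `max_gap` are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical fragment layout (an assumption, not taken from the paper):
# (susp_start, susp_end, src_start, src_end) as character offsets.
Fragment = Tuple[int, int, int, int]


def _gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Distance between two intervals; 0 if they overlap or touch."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def merge_fragments(fragments: List[Fragment], max_gap: int = 50) -> List[Fragment]:
    """Merge fragments that fall in the same connected component of the
    adjacency graph, using union-find to track components."""
    parent = list(range(len(fragments)))

    def find(i: int) -> int:
        # Follow parent links to the component root, halving paths as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge when two fragments are close in BOTH documents.
    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            a, b = fragments[i], fragments[j]
            close_in_susp = _gap(a[0], a[1], b[0], b[1]) <= max_gap
            close_in_src = _gap(a[2], a[3], b[2], b[3]) <= max_gap
            if close_in_susp and close_in_src:
                parent[find(i)] = find(j)

    # Collapse each component into the bounding fragment of its members.
    merged: Dict[int, Fragment] = {}
    for i, f in enumerate(fragments):
        root = find(i)
        if root in merged:
            m = merged[root]
            merged[root] = (min(m[0], f[0]), max(m[1], f[1]),
                            min(m[2], f[2]), max(m[3], f[3]))
        else:
            merged[root] = f
    return sorted(merged.values())


if __name__ == "__main__":
    seeds = [(0, 100, 10, 110), (120, 200, 130, 210), (1000, 1100, 2000, 2100)]
    print(merge_fragments(seeds, max_gap=50))
```

The pairwise edge construction here is quadratic in the number of fragments and is written for clarity; a time-sensitive implementation would more likely sort the fragments by offset and compare only neighbouring ones.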