
The System Design Of The Paper Similarity Analysis

Posted on: 2013-11-17
Degree: Master
Type: Thesis
Country: China
Candidate: C Xu
Full Text: PDF
GTID: 2248330395969898
Subject: Software engineering
Abstract/Summary:
Plagiarism is a serious problem in academia. Similarity analysis of English papers relies mainly on digital fingerprinting and string matching, and these techniques are relatively mature. For Chinese papers, however, most algorithms suffer from problems such as low recognition rates and low efficiency, so the similarity analysis technology must be improved. Text copy detection has important applications in both intellectual property protection and information retrieval. Early work on text copy detection concentrated on program plagiarism, whereas most current studies address the detection of academic misconduct in papers. This thesis first introduces the significance of the research and the development of text copy detection technology. It then describes Chinese word segmentation, which is the basis for analyzing the similarity of Chinese papers, and introduces the ICTCLAS Chinese word segmentation system. Next, the thesis designs a comprehensive method for analyzing paper similarity. The design rests on two basic text copy detection methods: paragraph-based word frequency statistics and step-by-step fingerprint identification. The comprehensive scheme improves on and refines a scheme that combines word frequency counting, fingerprint identification, and KMP string matching. With paragraph-based word frequency statistics, only a plagiarized paragraph needs to be matched rather than the complete paper, which greatly improves performance. Direct string matching is the most accurate method, so it improves the accuracy of plagiarism determination. Step-by-step fingerprint identification offers much better performance and is better suited to large-scale computation in text copy detection. Fingerprints are generated with the k-words method and selected with the winnowing strategy.
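The fingerprinting step described above (k-words generation followed by winnowing selection) can be illustrated with a minimal Java sketch. This is not the thesis's implementation: the class name WinnowingSketch, the use of String.hashCode as the hash function, the rightmost-minimum tie rule, and the Jaccard comparison of fingerprint sets are all assumptions made for illustration. The input is assumed to be a list of words already produced by a segmenter such as ICTCLAS.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not taken from the thesis.
class WinnowingSketch {

    // Hash every run of k consecutive words (one "k-words" shingle per position).
    // Assumption: String.hashCode stands in for whatever hash the real system uses.
    static List<Integer> kWordHashes(List<String> words, int k) {
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + k <= words.size(); i++) {
            hashes.add(String.join(" ", words.subList(i, i + k)).hashCode());
        }
        return hashes;
    }

    // Winnowing: slide a window of w hashes and keep the minimum of each window.
    // On ties the rightmost minimum is chosen (an assumed tie rule).
    static Set<Integer> winnow(List<Integer> hashes, int w) {
        Set<Integer> selected = new LinkedHashSet<>();
        if (hashes.isEmpty()) {
            return selected;
        }
        if (hashes.size() < w) {
            // Fewer hashes than one full window: keep the overall minimum.
            selected.add(Collections.min(hashes));
            return selected;
        }
        for (int start = 0; start + w <= hashes.size(); start++) {
            int minPos = start;
            for (int j = start + 1; j < start + w; j++) {
                if (hashes.get(j) <= hashes.get(minPos)) {
                    minPos = j;
                }
            }
            selected.add(hashes.get(minPos));
        }
        return selected;
    }

    // Jaccard overlap of two fingerprint sets (an assumed similarity measure).
    static double similarity(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }
}
```

Because only one hash per window is kept, the fingerprint set is much smaller than the full list of k-word hashes, which is consistent with the abstract's claim that the approach suits large-scale computation.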
We then implement part of this approach in Java: a paragraph word frequency statistics program and an interactive paper similarity analysis system that includes word frequency statistics, fingerprint identification, and double KMP. The thesis also tests and analyzes the system, covering both the interactive similarity system and the paragraph word frequency statistics program, and the tests demonstrate the feasibility and performance advantages of the approach. Finally, drawing on this work, the thesis summarizes the similarities and differences between Chinese and English text copy detection and points out directions for developing Chinese text copy detection methods.
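The paragraph-level word frequency comparison and the KMP matching mentioned in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the thesis's code: the class ParagraphMatchSketch and its method names are hypothetical, cosine similarity is an assumed way to compare frequency vectors, and the "double KMP" variant named in the abstract is not reproduced here because the abstract does not define it. Paragraphs are assumed to be already segmented into words.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not taken from the thesis.
class ParagraphMatchSketch {

    // Count how often each word occurs in one paragraph.
    static Map<String, Integer> wordFrequencies(List<String> words) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            freq.merge(w, 1, Integer::sum);
        }
        return freq;
    }

    // Cosine similarity of two word frequency maps (assumed comparison measure).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    // KMP failure table: fail[i] is the length of the longest proper prefix
    // of pattern[0..i] that is also a suffix of it.
    static int[] failureTable(String pattern) {
        int[] fail = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // Index of the first occurrence of pattern in text, or -1 if absent.
    // Comparison is per UTF-16 char, which covers common Chinese characters.
    static int kmpIndexOf(String text, String pattern) {
        if (pattern.isEmpty()) {
            return 0;
        }
        int[] fail = failureTable(pattern);
        int k = 0;
        for (int i = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == pattern.length()) {
                return i - k + 1;
            }
        }
        return -1;
    }
}
```

One plausible way to combine the two, in line with the abstract's reasoning, is to use the inexpensive frequency comparison to shortlist suspicious paragraphs and then run exact KMP matching only on the shortlisted text.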
Keywords/Search Tags: text copy detection, word frequency statistics, fingerprint identification, ICTCLAS, KMP algorithm