
Research Of Documents Copy Detection And Implementation Of System

Posted on: 2017-11-18
Degree: Master
Type: Thesis
Country: China
Candidate: G P Tong
Full Text: PDF
GTID: 2428330485461677
Subject: Information Science
Abstract/Summary:
As electronic documents and network resources grow richer, it becomes increasingly easy to obtain the academic work of others, and also easier to copy it. Academic plagiarism remains common even though China established its Copyright Law long ago and academic institutions have imposed harsh penalties for plagiarism. Without a good document copy detection system, unscrupulous scholars will copy others' work with even less restraint.

This thesis studies the prototypes and technologies of document copy detection systems, analyzes the advantages and disadvantages of each technology, and identifies the future direction of such systems. It focuses on several representative prototypes and their related technologies, including frequency statistics, string matching, and several digital fingerprinting techniques. It then proposes a prototype document copy detection system based on an improved sliding sentence segmentation method, which builds an inverted index after clustering. The clustering is soft clustering based on an orthogonal basis. During exact matching, the system uses the longest common subsequence and bag-of-words to mark similarity. Finally, a large number of experiments were carried out. They show that the system does not support English detection well and that registration and detection are slow, but that it achieves high precision and recall and is particularly effective at marking similar text.

Innovations: This thesis proposes an improved sliding sentence segmentation method and a soft clustering method based on an orthogonal basis. The system uses the longest common subsequence and bag-of-words to mark similarity during exact matching. To detect plagiarism of short sentences, loop matching of short sentences is used.

The system can detect similarity only after documents have been registered. Document registration consists of four sub-processes. The first is document pre-processing, which mainly cleans the document data and includes paragraph merging, page stitching, structure recognition, chart identification, and garbage detection. The second is text block segmentation, which includes sentence segmentation, sliding, word segmentation, and filtering. The third is soft clustering based on an orthogonal basis, which includes clustering and soft classification. The fourth is building an inverted index grouped by classification.

Document similarity detection is divided into quick search and exact match. Quick search aims to find suspicious documents; exact match aims to confirm whether similar sentences were copied and to mark the similarity.

Based on module tests, the system was designed to use nouns, verbs, and adjectives; the improved sliding sentence segmentation method; soft clustering based on an orthogonal basis; term frequency statistics; and loop matching of short sentences to improve recall. System integration tests show that document registration time is proportional to the size of the documents being registered, that detection time is proportional to the size of the documents being checked, and that the system performs well on every category of the test corpus. The system also performs well in comparison with other detection systems.

In future work, the author will study algorithms to reduce registration and detection time, as well as English text detection and semantic detection. The precision of similarity marking can also be improved further. Distributed implementation and bilingual detection are the future direction of document copy detection.
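As a rough illustration of the two sentence-level measures named for exact matching, the sketch below computes a longest-common-subsequence score and a bag-of-words score over token lists. It is not the thesis's implementation: the function names, the normalisation by the longer sentence, and the use of cosine similarity for bag-of-words are all assumptions made for this example, and the abstract does not say how the two scores are combined.

```python
from collections import Counter
from math import sqrt


def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match diagonally, or carry over the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length divided by the longer sequence (assumed normalisation)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def bag_of_words_similarity(a, b):
    """Cosine similarity of term-frequency vectors (assumed); ignores order."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0
```

The LCS score is sensitive to word order, while the bag-of-words score is not, which is one plausible reason for using both: a sentence whose words were reordered keeps a high bag-of-words score even when its LCS score drops.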
Keywords/Search Tags:Similarity Detection, Documents Copy Detection, Prototype System, Term Frequency Statistics, String Matching, Digital Fingerprints