Research on Document Copy Detection and System Implementation

Posted on:2017-11-18

Degree:Master

Type:Thesis

Country:China

Candidate:G P Tong

Full Text:PDF

GTID:2428330485461677

Subject:Information Science

Abstract/Summary:

As electronic documents and network resources grow richer, it becomes easier to access the academic work of others, and also easier to copy it. Academic plagiarism remains common even though China established its Copyright Law very early and academic institutions have introduced harsh penalties for it. Without a good document copy detection system, unscrupulous scholars will copy others' academic work even more freely.

This paper studies the prototypes and technologies of document copy detection systems, analyzes the advantages and disadvantages of each technology, and identifies the future direction of such systems. It then focuses on several representative prototypes and related technologies, including frequency statistics, string matching, and several digital fingerprinting techniques.

The paper then proposes a prototype document copy detection system based on an improved sliding sentence segmentation method, with an inverted index built after clustering. The clustering is soft clustering based on an orthogonal basis. During exact matching, the system uses the longest common subsequence and bag-of-words to mark similarity. Finally, a large number of experiments were carried out. They show that the system does not support English detection well and that registration and detection are slow, but that precision and recall are high and the marking of similar text is especially effective.

Innovations: This paper proposes an improved sliding sentence segmentation method and soft clustering based on an orthogonal basis. The system uses the longest common subsequence and bag-of-words to mark similarity during exact matching. To detect plagiarism of short sentences, loop matching of short sentences is used.

The system can detect similarity only after documents have been registered. Document registration consists of several sub-processes. The first is document preprocessing, which mainly cleans the document data: paragraph merging, page stitching, structure recognition, chart identification, and garbage detection. The second is text block segmentation: sentence segmentation, sliding, word segmentation, and filtering. The third is soft clustering based on an orthogonal basis, comprising clustering and soft classification. The fourth is building the inverted index, grouped by class.

Document similarity detection is divided into quick search and exact match. Quick search aims to find suspicious documents; exact match aims to confirm whether similar sentences are copied and to mark the similarity.

Based on module tests, the system was designed to use nouns, verbs, and adjectives; the improved sliding sentence segmentation method; soft clustering based on an orthogonal basis; term frequency statistics; and loop matching of short sentences to improve recall. System integration tests show that document registration time is proportional to the size of the documents being registered, that detection time is proportional to the size of the documents being checked, and that the system performs well on every category of the test corpus. The system also performs well in comparison with other detection systems.

In future work, the author will study algorithms to reduce registration and detection time, as well as English text detection and semantic detection. The precision of similarity marking can also be improved further. Distributed implementation and bilingual detection are the future direction of document copy detection.
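The quick search and exact match stages described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the window size, the regex-based tokenizer (a stand-in for real word segmentation and part-of-speech filtering), and the plain sliding window (not the improved method the abstract mentions) are all assumptions made for illustration. Soft clustering is omitted entirely.

```python
import re
from collections import Counter, defaultdict
from math import sqrt


def split_sentences(text):
    # Split on common Chinese and English sentence terminators.
    return [s.strip() for s in re.split(r"[。！？!?.;；]+", text) if s.strip()]


def sliding_blocks(sentences, window=2):
    # Overlapping blocks of `window` consecutive sentences (plain sliding window).
    if len(sentences) <= window:
        return [" ".join(sentences)] if sentences else []
    return [" ".join(sentences[i:i + window])
            for i in range(len(sentences) - window + 1)]


def tokenize(text):
    # Hypothetical stand-in for word segmentation and noun/verb/adjective filtering.
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    # Inverted index: token -> set of ids of registered documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in set(tokenize(text)):
            index[token].add(doc_id)
    return index


def quick_search(index, tokens):
    # Count, per registered document, how many distinct query tokens it shares.
    hits = Counter()
    for token in set(tokens):
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    return hits


def lcs_length(a, b):
    # Dynamic-programming longest common subsequence over token lists,
    # keeping only the previous row of the table.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def bow_cosine(a, b):
    # Cosine similarity between bag-of-words term-frequency vectors.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    registered = {
        "d1": "The cat sat on the mat. It was a sunny day.",
        "d2": "Stock prices fell sharply. Investors were worried.",
    }
    query = "The cat sat on a mat."
    q = tokenize(query)
    # Quick search narrows the candidates; exact match scores each sentence.
    for doc_id in quick_search(build_index(registered), q):
        for sentence in split_sentences(registered[doc_id]):
            s = tokenize(sentence)
            print(doc_id, repr(sentence), lcs_length(q, s), round(bow_cosine(q, s), 3))
```

The longest common subsequence rewards shared word order, while the bag-of-words cosine ignores order, so the two scores together can flag both verbatim copies and reordered ones. How the thesis combines them into a final mark is not stated in the abstract.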

Related items

1	Research On Improved Copy Detection Methods For Chinese Documents Based On String Matching
2	Research On The Copy Detection System For Documents Based On String Matching Method
3	Research On The Copy Detection Technology For Source Code
4	Research On Content-similarity Based Video Segment Copy Detection
5	The System Design Of The Paper Similarity Analysis
6	Research On Key Technologies Of Video Copy Detection
7	On The Irregular Planar Fragments Matching
8	Chinese Text Copy Detection Based On N-Gram
9	Research On Key Issues Of Copy Detection Between Documents
10	Research On Intrusion Detection System Orientated String Matching Optimization