Research On Key Issues Of Copy Detection Between Documents

Posted on: 2014-07-09 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: D Zou | Full Text: PDF
GTID: 1318330566454643 | Subject: Computer application technology
Abstract/Summary:
Plagiarism occurs mainly in papers and project reports among researchers and in homework and course papers among students. Unlike researcher plagiarism, the original source of student plagiarism is mostly located inside the learning community. Many systems on the market can check whether a paper was plagiarized "globally". Detection tools designed specifically for local plagiarism among students are less common; one well-known example is the Moss system from Stanford University, which detects software code plagiarism among students.

Detection methods based on approximate fingerprinting and on word frequency statistics are the most deeply studied and widely adopted of all plagiarism detection technologies. Approximate fingerprinting was proposed to overcome the fragility of checksum-based detection methods when random characters are added to a document. A prominent characteristic of an approximate fingerprinting scheme is that when the content of a document changes a little, its fingerprint does not change much either. When the scheme is combined with random sampling, a plagiarist cannot predict the rules by which fingerprints are sampled, which significantly reduces the interference caused by noise that the plagiarist adds to a document to avoid detection, but it also demands more computation and storage resources. Detection methods based on word frequency statistics have recently drawn much attention in the plagiarism detection domain. This approach analyzes the distribution and frequency of keywords in a document to determine whether there is plagiarism; unlike approximate fingerprinting, it can retain semantic information. However, the method demands excessive resources, particularly storage space, which limits its scope of application.

Heuristic merging is the most commonly used method for locating copied text. By defining merging rules, the method can reduce the impact of interfering information on the locating results to a certain extent. The exhaustive iteration method and the approximate longest common substring method are often adopted among heuristic merging methods, but their efficiency is relatively low and their locating accuracy is vulnerable to noise. Commercial plagiarism detection software uses index and database technology: it splits documents into fragments, stores the fragments in a database, and builds an index over them. This approach can avoid the impact of interference and improve detection efficiency to a certain extent, but mass storage and efficient indexing remain problems to be solved.

Plagiarism in different areas has its own characteristics but also a certain commonality, and no existing plagiarism detection technology can completely solve the problem of plagiarism in all fields. Based on a full analysis of the characteristics of student homework and researcher paper plagiarism, this work has studied various plagiarism detection algorithms, proposed solutions for student homework and researcher paper plagiarism, and won second place in the 2010 international plagiarism detection competition. Moreover, we have designed and developed a plagiarism detection system that is used to check student homework in the e-learning system of SCUT, with good results. This research has completed the following work:

1. Designed and created two data sets for detecting plagiarism in student homework and researcher papers. After a detailed analysis of the causes, manner, and scope of plagiarism among college students, we collected over 10,000 copies of submitted student homework and downloaded thousands of research papers from the IEEE web site, organized them into two data sets for detecting student homework and researcher paper plagiarism, and used them to test the proposed algorithms in a real detection environment. To facilitate comparison with other algorithms, the paper also uses the Western corpus proposed by the international plagiarism detection competition (PAN).

2. Proposed a plagiarism detection method based on rapid semantic matching for the heuristic retrieval step. When traditional similarity methods deal with documents in which only a very small proportion is plagiarized, detection accuracy can be affected by the size of the document. To solve this problem, the relationship between text semantics and fingerprint order is analyzed, and a semantic matching method is proposed that projects the fingerprint vector into a binary sequence to reduce the dimension while retaining the position information of the fingerprints.

3. Proposed two dimension reduction algorithms to solve the low efficiency of existing preselection methods when dealing with massive document sets. One is a feature vector dimensionality reduction method based on the Pearson coefficient; the other is a similarity-preserving dimensionality reduction method based on the Cauchy coefficient. Both methods aim to minimize the dimension of the feature vectors while keeping the deviation of the correlation between two documents within a low range.

4. For the first time, applied the concept of clustering to similar text positioning and proposed a similar text positioning method based on the slope-density cluster. Using the rapid semantic matching method of the heuristic retrieval step, combined with the principle of density-based clustering, we proposed the concept of the slope-density cluster and used it to position similar text.

5. Designed and developed a student homework plagiarism detection system, which has been deployed in our e-learning system to detect plagiarism in student homework; the system has been running online for more than two years.
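
The approximate fingerprinting idea discussed in the abstract can be illustrated with a minimal Python sketch. This is not the dissertation's algorithm, only one common way to build such a scheme, under stated assumptions: the text is cut into character k-grams, each k-gram is hashed together with a secret salt (so the sampling rule is hard to predict from outside), and only hashes divisible by `p` are kept as the sample. The function names `kgram_hashes`, `fingerprint`, and `resemblance`, and the parameters `k`, `p`, and `salt`, are hypothetical and chosen only for illustration.

```python
import hashlib
import re


def kgram_hashes(text, k=5, salt=b"secret"):
    """Hash every k-character window of the normalized text.

    Normalization (lowercase, letters and digits only) is an assumption
    of this sketch, not something specified in the abstract.
    """
    norm = re.sub(r"[^a-z0-9]", "", text.lower())
    hashes = []
    for i in range(len(norm) - k + 1):
        gram = norm[i:i + k].encode("utf-8")
        digest = hashlib.sha1(salt + gram).digest()
        # Use the first 8 bytes of the digest as a 64-bit integer.
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes


def fingerprint(text, k=5, p=4, salt=b"secret"):
    """Keep only the hashes divisible by p (p=1 keeps every hash)."""
    return {h for h in kgram_hashes(text, k, salt) if h % p == 0}


def resemblance(fp_a, fp_b):
    """Jaccard overlap of two fingerprint sets, a value in [0, 1]."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```

With `p=1`, appending a few noise characters to a document only adds a handful of new k-grams, so the resemblance to the original stays high, whereas a whole-document checksum would change completely. Larger values of `p` trade detection sensitivity for a smaller fingerprint.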
Keywords/Search Tags: plagiarism detection, similar text positioning, semantic matching, dimension reduction, slope-density cluster