
Research On Improved Copy Detection Methods For Chinese Documents Based On String Matching

Posted on: 2013-06-14    Degree: Master    Type: Thesis
Country: China    Candidate: J Song    Full Text: PDF
GTID: 2248330395484845    Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, people can easily obtain large numbers of digital documents. However, the Internet is a double-edged sword. On the one hand, it provides researchers with a large amount of information about related technologies, which promotes the growth of scientific research. On the other hand, it makes it convenient for plagiarists to appropriate others' research results, which contributes to an unhealthy academic atmosphere. Unlike document watermarking, which embeds additional information in a document, document copy detection detects plagiarism by extracting features from the document itself.

According to the manner of feature extraction, existing methods can be divided into three categories: methods based on string matching, methods based on feature vectors, and methods based on the semantic representation of chunks. The first category is the most studied and the most widely used. To solve the problems of the methods based on string matching, we draw on synonym substitution, a mature technique in watermarking and information hiding, and propose two improved methods based on string matching.

1) Document copy detection based on synonym substitution and N-grams. The proposed method introduces a new fingerprint extraction algorithm in which Chinese word segmentation, keyword selection, and synonym substitution are used to improve the security of the copy detection method; an inverted index is used for fingerprint storage to improve detection speed. Experimental results show that, compared with the improved N-gram-based method, our method achieves better detection results on plagiarized documents.

2) Document copy detection based on fingerprint extraction from multiple chunks. Based on an analysis of chunk selection strategies, the presented method selects sentences and K-words as chunks from which to extract fingerprints. Fingerprint extraction for sentences involves keyword selection, hash processing, MD5, etc.; fingerprint extraction for K-words uses synonym substitution, hash sorting, fingerprint generation, etc. Overlap is calculated by an improved method. Experimental comparisons show that our method can more effectively detect plagiarized documents produced by synonym substitution and by simple sentence restatement.
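The first method's pipeline (synonym substitution, N-gram fingerprints, inverted index) can be illustrated with a minimal sketch. It is not the thesis's algorithm: the synonym table `SYNONYMS`, the class `FingerprintIndex`, the truncated MD5 hash, and the simple containment score are all assumptions made for illustration, keyword selection is omitted, and the input is assumed to be already segmented into words (English tokens stand in for segmented Chinese text).

```python
import hashlib
from collections import defaultdict

# Hypothetical synonym table: each word maps to one canonical representative,
# so that swapping a word for its synonym does not change the fingerprint.
SYNONYMS = {"big": "large", "huge": "large", "quick": "fast", "rapid": "fast"}


def normalize(tokens):
    """Replace every token by its canonical synonym (assumed table above)."""
    return [SYNONYMS.get(t, t) for t in tokens]


def fingerprints(tokens, n=3):
    """Hash every word N-gram of the normalized token sequence."""
    canon = normalize(tokens)
    grams = (" ".join(canon[i:i + n]) for i in range(len(canon) - n + 1))
    # Truncated MD5 is an illustrative choice of hash, not the thesis's.
    return {hashlib.md5(g.encode("utf-8")).hexdigest()[:16] for g in grams}


class FingerprintIndex:
    """Inverted index: fingerprint -> set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, tokens, n=3):
        for fp in fingerprints(tokens, n):
            self.postings[fp].add(doc_id)

    def query(self, tokens, n=3):
        """Return, per indexed document, the fraction of the query's
        fingerprints that it shares (a plain containment score)."""
        fps = fingerprints(tokens, n)
        hits = defaultdict(int)
        for fp in fps:
            for doc_id in self.postings.get(fp, ()):
                hits[doc_id] += 1
        return {doc_id: count / len(fps) for doc_id, count in hits.items()}


index = FingerprintIndex()
index.add("src", "the big dog runs quick".split())
scores = index.query("the huge dog runs rapid".split())
```

Because both sentences normalize to the same canonical word sequence, the suspicious text shares all of its fingerprints with `src` even though two words were replaced by synonyms. The inverted index means a query only touches documents that share at least one fingerprint, rather than comparing against every stored document.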
Keywords/Search Tags: Copy Detection, Document Fingerprints, Synonym Substitution, Inverted Index