
Research On The Copy Detection System For Documents Based On String Matching Method

Posted on: 2007-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: X Li
GTID: 2178360182483122
Subject: Computer application technology
Abstract/Summary:
Copy detection for documents is an important topic in the data security field. It is a powerful means of protecting intellectual property and improving the efficiency of information retrieval. Copy detection judges whether a given document plagiarizes the contents of other documents in a database. Plagiarism can take several forms, such as duplicating part or all of a document, or using different words or sentences to express the same meaning as the text of previous documents in the database.

First, this paper introduces the basic theory of the technology and analyzes the functions and characteristics of current copy detection systems for documents. The key technologies of such systems are described.

Second, drawing on the idea of the Karp-Rabin string matching algorithm, this paper presents a copy detection system for documents based on the string matching method to address the deficiencies of current copy detection systems. The architecture of the system and the basic theory of each module are given.

Third, this paper describes the properties and techniques of the proposed system. Documents are divided into overlapping chunks. A "rolling" hash function is adopted to compute the hash values of the chunks. A sampling algorithm is designed to extract text features from the sequence of hash values. The expected density of the extracted features and the validity and complexity of the algorithm are proved. A similarity measure is presented that can find overlaps among documents. A digital search tree is used to store the features of the documents in the database, and the system uses a double-linked tree to represent the structure of the digital search tree.

Finally, based on this research, a prototype of the system is designed and implemented using an object-oriented method. The accuracy of its copy detection results is then evaluated.
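
The chunking, rolling hash, feature sampling, and similarity steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual parameters or its sampling rule, so everything specific here is an assumption made for illustration: the base and modulus of the hash, the "keep hashes divisible by p" sampling rule (a common choice with expected density of roughly 1/p), the containment-style similarity score, and all function names.

```python
BASE = 257               # assumed hash base (not from the thesis)
MOD = (1 << 61) - 1      # assumed modulus: a large Mersenne prime


def rolling_hashes(text, k):
    """Return Karp-Rabin hashes of every overlapping k-character chunk."""
    n = len(text)
    if k <= 0 or n < k:
        return []
    high = pow(BASE, k - 1, MOD)   # weight of the character leaving the window
    h = 0
    for ch in text[:k]:            # hash of the first chunk, computed directly
        h = (h * BASE + ord(ch)) % MOD
    hashes = [h]
    for i in range(k, n):
        # Roll the window: drop text[i - k], then append text[i].
        h = (h - ord(text[i - k]) * high) % MOD
        h = (h * BASE + ord(text[i])) % MOD
        hashes.append(h)
    return hashes


def sample_features(hashes, p):
    """Keep only hashes divisible by p (assumed sampling rule)."""
    return {h for h in hashes if h % p == 0}


def containment(query_features, source_features):
    """Fraction of the query's features that also occur in the source."""
    if not query_features:
        return 0.0
    return len(query_features & source_features) / len(query_features)


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    query = "a quick brown fox jumps over a sleeping cat"
    src = sample_features(rolling_hashes(source, 8), 1)
    qry = sample_features(rolling_hashes(query, 8), 1)
    print(f"containment: {containment(qry, src):.2f}")
```

With p = 1 every chunk hash is kept; a larger p trades recall for a smaller feature set. A real system would store the sampled features in an index (the abstract mentions a digital search tree) rather than in Python sets.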
Keywords/Search Tags: Copy Detection, Feature Extraction, Chunk, Similarity, Plagiarism