The Design And Implementation Of Real-time Online Similarity Retrieval System

Posted on: 2012-09-09
Degree: Master
Type: Thesis
Country: China
Candidate: J Ren
Full Text: PDF
GTID: 2178330332476249
Subject: Computer application technology
Abstract/Summary:
Similarity retrieval is widely used, especially in intellectual property protection and information retrieval. This thesis focuses on similarity retrieval for plagiarism detection.

To detect plagiarism in student work, a similarity retrieval system must support real-time data updates and return highly accurate search results. The system therefore has to be extended so that it can still retrieve quickly when the amount of data is large. To this end we introduce a rapid retrieval module, which can be implemented with three strategies: an inverted index, pruning, and a clustered multi-dimensional index. Commonly used inverted indexes are updated either by incremental updates at fixed intervals or by a full rebuild, neither of which meets the needs of a real-time system, so we propose and implement a way to build a real-time search engine.

For code plagiarism detection, we analyze program code as non-natural-language text and propose a fingerprint based on structural features of the code, which is used for fast retrieval of candidates. We then use the classical RKR-GST (Running Karp-Rabin Greedy String Tiling) algorithm to calculate the similarity between programs, and we highlight the similar regions of the code in different colors.

We also implement a similar-document retrieval service. Using a search engine directly for similar-document retrieval gives unsatisfactory results, because individual words are too fine-grained a unit and the structural information of the document is discarded. We therefore propose a retrieval method based on the spot signatures of the SpotSigs algorithm. Finally, we propose a similarity measure that combines the vector space model with structural features of the document.
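The tiling step behind the code similarity score can be illustrated with a minimal sketch. It shows plain Greedy String Tiling over two token sequences; the Karp-Rabin hashing that gives RKR-GST its speed is omitted for brevity, so this is not the implementation used in the thesis. The names `greedy_string_tiling`, `similarity` and `min_match`, and the normalization 2 x covered tokens / (len(a) + len(b)), are assumptions made for illustration.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Return tiles (start_a, start_b, length) shared by token lists a and b.

    Plain Greedy String Tiling: in each round, find the longest common runs
    of still-unmarked tokens and mark them as tiles. No hashing is used,
    so this is a slow illustrative version.
    """
    if min_match < 1:
        raise ValueError("min_match must be at least 1")
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_match = min_match
        matches = []
        # Phase 1: collect all maximal matches of unmarked tokens.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k]
                       and not marked_b[j + k]):
                    k += 1
                if k == max_match:
                    matches.append((i, j, k))
                elif k > max_match:
                    matches = [(i, j, k)]
                    max_match = k
        if not matches:
            break
        # Phase 2: turn matches into tiles, skipping any match that
        # overlaps a tile already created in this round.
        for i, j, k in matches:
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for t in range(k):
                marked_a[i + t] = True
                marked_b[j + t] = True
            tiles.append((i, j, k))
    return tiles


def similarity(a, b, min_match=3):
    """Fraction of tokens covered by tiles (assumed normalization)."""
    if not a and not b:
        return 0.0
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2.0 * covered / (len(a) + len(b))


# Hypothetical token streams; a real system would tokenize source code first.
tokens_a = list("abcxyz")
tokens_b = list("xyzqabc")
print(greedy_string_tiling(tokens_a, tokens_b))  # [(0, 4, 3), (3, 0, 3)]
print(similarity(tokens_a, tokens_b))
```

Each tile records where a shared run starts in both sequences, which is the information needed to color matching regions side by side. Because tiles are found regardless of position, reordering blocks of code does not hide the match.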
Keywords/Search Tags: similarity retrieval, plagiarism detection, real-time retrieval, RKR-GST, SpotSigs, structural features