
Chinese Text Copy Detection Based On N-Gram

Posted on: 2015-08-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Zhang
GTID: 1368330488499507
Subject: Computer application technology
Abstract/Summary:
With the rise and spread of Internet technologies, text copy detection has become an active research topic in natural language processing, and its importance for protecting intellectual property rights is increasingly evident. Copy detection for English text developed earlier, but because of the differences between English and Chinese, many English copy detection techniques are not fully applicable to Chinese text. How to design effective detection algorithms around the characteristics of the Chinese language is therefore a growing concern. Plagiarism takes many forms, including adding or deleting sentences, substituting synonyms, and even restating whole statements. Each form has its own characteristics, so a single detection method often fails to achieve the desired effect. Copy detection is currently limited by the state of natural language processing technology, which makes deep semantic analysis hard to achieve. The two most commonly used approaches are string matching and word frequency statistics.

This dissertation focuses on Chinese natural language text and proposes a statistical method based on the frequencies of n-grams of arbitrary length. On this basis, it addresses common forms of plagiarism from three angles: fragment matching, synonym replacement, and text fingerprints. The main results are as follows.

Based on the length characteristics of Chinese words, a two-level variable-length index built on bigrams is presented. The method uses a sliding window of length 2. Chinese characters are mapped through their encoding to relative positions in the index, which greatly reduces index space without compromising retrieval results. By exploiting the storage characteristics of the address codes together with an efficient set algorithm, the index supports retrieval and frequency counting of n-grams of any length. In addition, when the text library is expanded, the indexes do not need to be rebuilt.

Using the Ferret method with detection units of varying length, precision, recall, and other indicators are computed for each length to determine the fragment length best suited to Chinese. On this basis, a Chinese text copy detection method based on the center distance of core vocabulary is proposed. The overlap-degree formula derived from this method further improves Chinese text copy detection based on fragment matching.

Existing synonym-based copy detection methods all expand synonyms one word at a time and ignore the customary collocations of words in real natural language. To address this, the dissertation presents a method based on synonym expansion with collocation matching. Filtering the expanded set by word collocation greatly reduces its size and also suppresses detection noise. On this basis, an overlap-degree algorithm based on synonym collocation is proposed. Experiments show that the method achieves good results in detecting synonym substitution.

The part-of-speech sequence of a sentence is used as a sentence "template", and text fingerprints are generated by hashing this template together with low-frequency fragments. Fingerprint comparison is then used to determine plagiarism between sentences. The method takes sentences as detection units, which avoids the influence of context during detection. It can be used as a supplement to other detection methods.
Keywords/Search Tags: copy detection, two-level index, center distance, synonym expansion, text fingerprints, part-of-speech sequence
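
The abstract describes an index built with a sliding window of length 2 that can still retrieve and count n-grams of any length, and that does not need rebuilding when the text library grows. The Python sketch below illustrates that general idea only. It is not the dissertation's two-level index: the class and method names (`BigramIndex`, `add_document`, `find`, `frequency`) are hypothetical, the postings are held in a plain in-memory dictionary, and the character-encoding-based address compression mentioned in the abstract is left out.

```python
from collections import defaultdict


class BigramIndex:
    """Illustrative positional index keyed by character bigrams."""

    def __init__(self):
        # bigram -> set of (doc_id, position of the bigram's first character)
        self.postings = defaultdict(set)

    def add_document(self, doc_id, text):
        # Slide a window of length 2 over the text. A new document only
        # adds postings, so nothing indexed earlier has to be rebuilt.
        for i in range(len(text) - 1):
            self.postings[text[i:i + 2]].add((doc_id, i))

    def find(self, ngram):
        """Return sorted (doc_id, start) pairs where ngram (length >= 2) occurs."""
        if len(ngram) < 2:
            raise ValueError("n-gram must have at least 2 characters")
        # Candidates: every occurrence of the n-gram's first bigram.
        hits = set(self.postings.get(ngram[:2], ()))
        # Each later bigram must occur at the matching offset.
        for k in range(1, len(ngram) - 1):
            later = self.postings.get(ngram[k:k + 2], ())
            hits = {(d, p) for (d, p) in hits if (d, p + k) in later}
            if not hits:
                break
        return sorted(hits)

    def frequency(self, ngram):
        """Total number of occurrences of ngram across all indexed documents."""
        return len(self.find(ngram))


index = BigramIndex()
index.add_document("d1", "文本复制检测技术")
index.add_document("d2", "中文文本复制检测")
print(index.find("复制检测"))       # [('d1', 2), ('d2', 4)]
print(index.frequency("检测技术"))  # 1
```

A longer n-gram is matched by checking that its consecutive bigrams occur at consecutive positions, which is why an index over windows of length 2 is enough to answer queries of any length from 2 upward.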