Chinese Text Copy Detection Based On N-Gram

Posted on:2015-08-18

Degree:Doctor

Type:Dissertation

Country:China

Candidate:W Zhang

GTID:1368330488499507

Subject:Computer application technology

Abstract/Summary:

With the rise and spread of internet technologies, text copy detection has become an active research topic in natural language processing, and its importance for protecting intellectual property rights is increasingly evident. Copy detection for English text developed earlier, but because of the differences between English and Chinese, many English copy detection techniques are not fully applicable to Chinese text. How to design effective detection algorithms that fit the characteristics of the Chinese language is therefore a growing concern. Plagiarism takes many forms, including adding or deleting sentences, substituting synonyms, and even restating whole statements. Each form has its own characteristics, so a single detection method often fails to achieve the desired effect. The current level of copy detection is also limited by the state of natural language processing technology, which makes deep semantic analysis difficult. Copy detection based on string matching and copy detection based on word frequency statistics are the two most commonly used kinds of method.

This dissertation focuses on Chinese natural language and proposes a statistical method based on the frequencies of n-grams of arbitrary length. On this basis, it addresses common forms of plagiarism with three detection approaches, based respectively on fragments, synonym replacement, and text fingerprints. The main results are as follows:

Based on the length characteristics of Chinese words, a two-level variable-length index built on bigrams is presented. The method uses a sliding window of length 2. Chinese characters are mapped by their encoding to relative positions in the index, which greatly reduces the index space without affecting retrieval results. By exploiting the storage characteristics of the address codes together with an efficient set algorithm, the index supports retrieval and frequency statistics for n-grams of any length. In addition, the index does not need to be rebuilt when the text library is expanded.

The Ferret method is applied with varying detection unit lengths, and precision, recall, and other indicators are calculated for each length to determine the fragment length best suited to Chinese. On this basis, a Chinese text copy detection method based on the center distance of core vocabulary is proposed. The overlap degree formula of the proposed method further improves the effect of Chinese text copy detection based on fragment matching.

Existing synonym copy detection methods are all based on synonym expansion of single words, ignoring the customary collocations of words in real natural language. To address this, the dissertation presents a method based on synonym expansion with collocation matching. Filtering the expanded set by word collocation greatly reduces its size and also suppresses detection noise. On this basis, an overlap degree algorithm based on synonym collocation is proposed. Experiments show that the method obtains good results for detecting synonym substitution.

The part-of-speech sequence of a sentence is used as a sentence "template", and text fingerprints are generated by hashing the template together with low-frequency fragments. Fingerprint comparison is then used to determine plagiarism between sentences. The method takes sentences as detection units, which avoids the influence of context during detection. It can be used as a supplement to other detection methods.
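The fragment-based ideas in the abstract (a sliding window over Chinese characters, n-gram frequency statistics, and an overlap degree between two texts) can be illustrated with a minimal Python sketch. This is not the dissertation's two-level index or its overlap degree formula, neither of which is given in the abstract. The function names `char_ngrams`, `ngram_frequencies`, and `overlap_degree` are hypothetical, and the overlap score used here (the share of the suspect text's distinct n-grams that also occur in the source) is an assumed stand-in.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all character n-grams of `text` from a sliding window of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_frequencies(text, n):
    """Count how often each character n-gram occurs in `text`."""
    return Counter(char_ngrams(text, n))


def overlap_degree(suspect, source, n=2):
    """Assumed overlap score: fraction of the suspect's distinct n-grams found in the source.

    This is an illustrative measure, not the formula proposed in the dissertation.
    """
    suspect_set = set(char_ngrams(suspect, n))
    if not suspect_set:
        # Text shorter than n characters has no n-grams to compare.
        return 0.0
    source_set = set(char_ngrams(source, n))
    return len(suspect_set & source_set) / len(suspect_set)


if __name__ == "__main__":
    print(char_ngrams("自然语言处理", 2))
    print(ngram_frequencies("自然语言处理与自然语言理解", 2).most_common(3))
    print(overlap_degree("自然语言处理技术", "自然语言处理方法", n=2))
```

Changing `n` in `overlap_degree` corresponds loosely to varying the length of the detection unit, which is how the abstract describes searching for the fragment length best suited to Chinese.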

Keywords/Search Tags:

Copy detection, two-level index, center distance, synonym expansion, text fingerprints, part-of-speech sequence

Related items

1	Research On Improved Copy Detection Methods For Chinese Documents Based On String Matching
2	A Feature Space Optimized Algorithm Based On Word Embeddings For Synonym Expansion
3	Text Copy Detection Research On Fingerprint Feature
4	Research Of Documents Copy Detection And Implementation Of System
5	Research Of Text Recommended Methods Based On Synonym Network
6	Research And Implementation Of Document Copy Detection
7	The Research On Document Copy Detection Technology Based On Chinese Character Component Histogram
8	Research Of Copy Detection Of Chinese Scientific Papers Base On Text Structure And Content
9	Research On Text Similarity Detection Algorithm Based On Simhash
10	Synonym Discovery Based On The Searching Information