The Research Of Text Copy Detection Based On Top-Down Topic Tree

Posted on: 2011-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: S Wang
GTID: 2178330332460703
Subject: Information management and e-government
Abstract/Summary:
In recent years, plagiarism has become increasingly serious in academia and has raised public concern. To protect intellectual property rights, improve the academic climate, and reduce the serious consequences of plagiarism, it has become necessary to design and implement a text copy detection system.

String-matching algorithms have difficulty selecting blocks of text, and word-frequency algorithms ignore semantic and structural information. This thesis therefore proposes a text copy detection algorithm based on a top-down topic tree. First, it presents a new text representation based on a topic tree. The root node holds the title, author, affiliation, abstract, keywords, and category information. A branch node is a topic bag: a semantic clustering method forms the topics, and a sentence relationship map then extracts the topic sentences. A leaf node is a sentence. Second, it defines a similarity calculation that proceeds layer by layer. (1) The similarity of the root nodes is computed by comparing the root information of the two texts. If the root nodes are not similar, the calculation stops; otherwise it continues to the next layer. (2) The similarity of the branch nodes is computed, with topic-bag similarity derived from sentence similarity. If the branch-node similarity is below the threshold, the calculation stops; otherwise it continues to the next layer. (3) The similarity of the leaf nodes, that is, the sentence similarity, is computed. If the leaf-node similarity is below the threshold, the two texts are judged not to contain copied content; otherwise plagiarism is reported.

We designed and implemented the text copy detection system. For the experiments, 1000 texts were collected from five different areas. Sections of the texts were then copied at different levels, yielding 20 test texts, and five irrelevant texts were added. These 25 texts were tested against the database. To validate the new copy detection algorithm, we ran three experiments comparing it with several other copy detection algorithms. The results show that the proposed algorithm takes less time, has a higher degree of discrimination, and achieves a higher accuracy rate.
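
The layered comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not specify the similarity measures, the clustering method, or the threshold values, so the sketch assumes a simple word-overlap (Jaccard) similarity at every layer, assumes topic bags are already built, and uses hypothetical names (`TopicBag`, `TopicTree`, `detect_copy`) and arbitrary thresholds. It only shows how a failed check at an upper layer stops the calculation before the more expensive sentence-level comparison.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity. ASSUMPTION: stands in for the thesis's
    unspecified root, topic-bag, and sentence similarity measures."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


@dataclass
class TopicBag:
    """Branch node: a topic and its sentences (the leaf nodes)."""
    topic: str
    sentences: List[str] = field(default_factory=list)


@dataclass
class TopicTree:
    """Root node info (title, author, abstract, keywords, ...) plus topic bags."""
    root_info: str
    bags: List[TopicBag] = field(default_factory=list)


def bag_similarity(a: TopicBag, b: TopicBag) -> float:
    """Average, over sentences in a, of the best-matching sentence in b."""
    if not a.sentences:
        return 0.0
    total = 0.0
    for s in a.sentences:
        total += max((jaccard(s, t) for t in b.sentences), default=0.0)
    return total / len(a.sentences)


def detect_copy(x: TopicTree, y: TopicTree,
                root_t: float = 0.2, branch_t: float = 0.3,
                leaf_t: float = 0.8) -> List[Tuple[str, str]]:
    """Return suspected copied sentence pairs; thresholds are illustrative."""
    # Layer 1: root nodes. Stop early if the texts are not related at all.
    if jaccard(x.root_info, y.root_info) < root_t:
        return []
    # Layer 2: branch nodes. Keep only topic-bag pairs above the threshold.
    pairs = [(a, b) for a in x.bags for b in y.bags
             if bag_similarity(a, b) >= branch_t]
    if not pairs:
        return []
    # Layer 3: leaf nodes. Compare sentences only inside the surviving pairs.
    hits = []
    for a, b in pairs:
        for s in a.sentences:
            for t in b.sentences:
                if jaccard(s, t) >= leaf_t:
                    hits.append((s, t))
    return hits
```

An empty result means no copying was found at the layer where the comparison stopped; a non-empty result lists the sentence pairs that exceeded the leaf threshold. In practice each layer would likely use its own measure and a tuned threshold rather than the single word-overlap score assumed here.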
Keywords: Topic Tree, Copy Detection, Similarity, Text Representation