The Research Of Text Copy Detection Based On Top-Down Topic Tree

Posted on:2011-02-27

Degree:Master

Type:Thesis

Country:China

Candidate:S Wang

GTID:2178330332460703

Subject:Information management and e-government

Abstract/Summary:

In recent years, plagiarism has become increasingly serious in academia and has raised public concern. To protect intellectual property rights, improve the academic climate, and limit the consequences of plagiarism, it has become necessary to design and implement a text copy detection system.

Selecting blocks of text is a difficulty for string matching algorithms, and word frequency algorithms ignore semantic and structural information. This thesis therefore proposes a text copy detection algorithm based on a top-down topic tree. First, it introduces a text representation based on a topic tree. The root node holds the title, author, affiliation, abstract, keywords, and category information. A branch node is a topic bag: a semantic clustering method creates the topics, and a sentence relationship map then extracts the topic sentences. A leaf node is a sentence. Second, it defines a top-down similarity calculation. The similarity of the root nodes is calculated first, which amounts to comparing the root information of the two texts. If the two root nodes are not similar, the calculation stops; otherwise it continues to the next layer. The similarity of the branch nodes is calculated next, with topic bag similarity derived from sentence similarity. If the branch node similarity is below the threshold, the calculation stops; otherwise it continues to the next layer. Finally, the similarity of the leaf nodes, that is, of the sentences, is calculated. If the leaf node similarity is below the threshold, the two texts are judged not to be copies; otherwise plagiarism is judged to exist.

We designed and implemented the text copy detection system. The experiment collected 1000 texts from five different areas. Sections of the texts were then replicated at different levels, giving 20 texts, and five irrelevant texts were collected. These 25 texts were tested against the database. To test the validity of the new copy detection algorithm, we ran three corresponding experiments against several copy detection algorithms. The results show that the proposed algorithm takes less time, has a higher degree of discrimination, and achieves a higher accuracy rate.
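
The three-layer, early-stopping comparison described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the thesis's implementation: the names TopicTree, TopicBag, jaccard, bag_similarity and detect_copy are hypothetical, word-overlap (Jaccard) similarity stands in for whatever similarity measures the thesis uses, the root information is flattened into one string, the threshold values are arbitrary, and the topic bags are assumed to be built already (the semantic clustering and sentence relationship map steps are not shown).

```python
from __future__ import annotations

from dataclasses import dataclass, field


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a stand-in, not the thesis's measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class TopicBag:
    """Branch node: one topic and the sentences (leaf nodes) under it."""
    sentences: list[str] = field(default_factory=list)


@dataclass
class TopicTree:
    """Root information plus the topic bags of one text."""
    root: str  # title, author, keywords, ... joined into one string (simplification)
    bags: list[TopicBag] = field(default_factory=list)


def bag_similarity(x: TopicBag, y: TopicBag) -> float:
    """Average, over the sentences of x, of the best-matching sentence in y."""
    if not x.sentences or not y.sentences:
        return 0.0
    best = [max(jaccard(s, t) for t in y.sentences) for s in x.sentences]
    return sum(best) / len(best)


def detect_copy(a: TopicTree, b: TopicTree,
                root_t: float = 0.1, bag_t: float = 0.3,
                leaf_t: float = 0.6) -> list[tuple[str, str]]:
    """Return suspicious sentence pairs; an empty list means no copy was found.

    The thresholds are arbitrary illustrative values.
    """
    # Layer 1: root nodes. Dissimilar roots stop the calculation.
    if jaccard(a.root, b.root) < root_t:
        return []
    # Layer 2: branch nodes. Keep only topic bag pairs above the threshold.
    pairs = [(x, y) for x in a.bags for y in b.bags
             if bag_similarity(x, y) >= bag_t]
    # Layer 3: leaf nodes. Compare sentences only inside the surviving pairs.
    hits = []
    for x, y in pairs:
        for s in x.sentences:
            for t in y.sentences:
                if jaccard(s, t) >= leaf_t:
                    hits.append((s, t))
    return hits
```

Calling detect_copy on two trees returns the sentence pairs that pass all three layers. Because each layer filters what the next one examines, most unrelated text pairs are rejected at the root or branch level without any sentence-by-sentence comparison, which is one plausible reading of why the abstract reports a lower running time.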

Keywords/Search Tags:

Topic Tree, Copy Detection, Similarity, Text Representation

Related items

1	Research Of Copy Detection Of Chinese Scientific Papers Base On Text Structure And Content
2	Research And Implement Of The Computer-Aided Copy Detection System For Document
3	Research On Microblog Text Processing And Topic Analysis Methods
4	Improved Text Topic Representation And Learning Method
5	Study On Chinese Text Replication Detection Based On Sentence Similarity
6	Research On Hot Topic Detection Methods For Microblog
7	Research On Hot Topic Discovery Of Sina Microblog
8	Mongolian Short Text Semantic Similarity Calculation Based On Deep VAE Integrated With Topic Information
9	Research On Topic Clustering Algorithm Based On Topic Models
10	Research And Implementation Of Document Copy Detection