Research On Plagiarism Detection Modeling Based On Statistical Machine Learning

Posted on: 2018-03-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L L Kong
GTID: 1368330548495874
Subject: Information and Communication Engineering
Abstract/Summary:
Plagiarism has become easier with the rapid development of the Internet, especially with the growth of online literature resources and the spread of search engines and machine translation. The rise in plagiarism has in turn accelerated research on plagiarism detection technologies. In recent years, plagiarism detection has attracted extensive attention from researchers and organizations in both academia and industry, and it has become an active research topic.

This dissertation takes plagiarism detection modeling based on statistical machine learning as its research object, with the aim of improving plagiarism detection performance. It focuses on three key issues: corpus construction, the modeling process of plagiarism detection, and model solutions. Using plagiarism detection as a representative task, it explores technologies from natural language processing, information retrieval, and statistical machine learning. The primary contents are as follows.

A plagiarism corpus construction approach based on natural annotation is presented to address the shortcomings of existing automatic and manual approaches to constructing plagiarism corpora. The idea of natural language processing based on huge-scale naturally annotated corpora is introduced. Real plagiarized documents are chosen as the data source, and a text alignment algorithm is used to align the plagiarized segments. With the proposed method, both the construction and the labeling derive from the corpus itself, which paves the way for the automatic construction of a high-quality plagiarism corpus with minimal interaction and little prior knowledge. The evaluation results from PAN@CLEF 2015 verified the quality of the corpus.

A ranking-based model for query generation in plagiarism source retrieval is proposed to address the limitations of heuristic-based query generation methods. Query generation for source retrieval is formulated under a ranking framework in order to achieve the optimal source retrieval performance on each suspicious document segment. To solve the essential problem of absent training data, a method for building training samples for source retrieval is also proposed. Various aspects of the proposed method are rigorously evaluated on the publicly available PAN@CLEF 2013 source retrieval corpus. The experimental results show that the proposed query generation model based on learning to rank yields statistically significant improvements over the established baselines in source retrieval effectiveness.

Two source retrieval filtering models based on the aggregation of retrieved results are proposed, one based on the aggregation degree and the other on the aggregation loss, to handle the global characteristics of retrieved results that arise from correlated queries. The source retrieval filtering problem is formalized within a ranking framework, and a ranking logistic regression model is presented to implement the framework. Experimental results on PAN@CLEF 2013 show that the proposed models outperform several baseline models with statistical significance.

A deep paraphrase identification model in which semantics interacts with syntax is investigated to improve the performance of paraphrase plagiarism detection. Paraphrase plagiarism detection is formalized as a problem of deep semantic matching between sentences. Semantics are represented using distributed representations, and linguistic features are also incorporated into the proposed model. The interaction between syntactic and semantic features is captured by a tensor, and text matching patterns are extracted by a convolutional neural network. Experiments were carried out on the Microsoft Research Paraphrase (MSRP) corpus and on the PAN@CLEF 2010 and PAN@CLEF 2012 paraphrase plagiarism detection corpora. The experimental results show that the proposed method outperforms several classical text matching models and deep text matching models.

A plagiarism text alignment model based on sequence labeling is proposed to take into account the sequential relationships among text segments and the similarity of the observation variables. The proposed model overcomes the obstacles of heuristic-based methods, for which it is difficult to define text alignment rules and to take context information into account. The model is evaluated on the PAN@CLEF 2012 high obfuscation corpus and simulated paraphrase obfuscation corpus and compared with the golden baselines. Experimental results demonstrate that the proposed model significantly improves text alignment performance over the baseline methods.
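
The ranking formulations mentioned in the abstract (learning to rank for query generation, and ranking logistic regression for source retrieval filtering) can be illustrated with a generic pairwise logistic ranking sketch. This is not the dissertation's model: the abstract does not specify features, loss, or training procedure, so the two toy features, the data, and all function names below are hypothetical and chosen only to show the general idea of learning a scoring function from preference pairs within each suspicious segment.

```python
import math
from itertools import combinations


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise_logistic(groups, n_features, lr=0.5, epochs=200):
    """Fit a linear scoring function from preference pairs.

    groups: one list per suspicious segment (hypothetical grouping), each
    holding (feature_vector, relevance) tuples for candidate queries/results.
    """
    # Build (better, worse) pairs inside each group only.
    pairs = []
    for items in groups:
        for (xa, ya), (xb, yb) in combinations(items, 2):
            if ya > yb:
                pairs.append((xa, xb))
            elif yb > ya:
                pairs.append((xb, xa))

    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            # Derivative of log(1 + exp(-w.diff)) with respect to w.diff.
            coeff = -1.0 / (1.0 + math.exp(dot(w, diff)))
            for k in range(n_features):
                grad[k] += coeff * diff[k]
        # Plain batch gradient descent step, averaged over the pairs.
        for k in range(n_features):
            w[k] -= lr * grad[k] / max(len(pairs), 1)
    return w


def rank(w, candidates):
    """Sort candidate feature vectors by descending score."""
    return sorted(candidates, key=lambda x: dot(w, x), reverse=True)


# Toy data (assumed): feature 0 tends to be high for relevant candidates,
# feature 1 tends to be high for irrelevant ones.
toy_groups = [
    [([0.9, 0.2], 1), ([0.1, 0.8], 0), ([0.5, 0.5], 0)],
    [([0.8, 0.1], 1), ([0.2, 0.9], 0)],
]
weights = train_pairwise_logistic(toy_groups, n_features=2)
ranked = rank(weights, [[0.1, 0.9], [0.9, 0.1]])
```

Pairs are formed only within a group, which mirrors the abstract's goal of optimizing retrieval performance per suspicious document segment rather than comparing candidates across unrelated segments.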
Keywords/Search Tags: Plagiarism Detection, Statistical Machine Learning, Source Retrieval, Text Alignment, Corpus Construction, Paraphrase Matching