Key Technologies For We Media Text Plagiarism Detection Based On Deep Learning

Posted on: 2022-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Tian
Full Text: PDF
GTID: 2518306740494684
Subject: Cyberspace security
Abstract/Summary:
The continued spread of Internet technology and users' growing enthusiasm for engaging with current affairs have greatly accelerated the development of the We Media industry. However, limited review capacity on platforms and a lack of self-discipline among creators have led to unhealthy development of the industry, and plagiarism has worsened. Text similarity detection can compare two texts at fine granularity, and its results can provide quantitative indicators for judging whether a We Media work is plagiarized. Still, applying traditional text similarity detection to We Media raises several problems. First, because We Media texts are massive in volume and highly varied, traditional methods cannot match similar texts quickly enough. Second, plagiarism takes diverse forms, and traditional methods cannot capture the deep semantic information contained in We Media texts, which results in low detection accuracy.

To address these problems, this dissertation first proposes a similar We Media text matching method based on an enhanced Simhash (SWMTMM-S) to quickly recall the set of similar texts from massive collections. It then proposes a text similarity detection method based on XLNet and BiLSTM (TSDA-XBL), which analyzes the degree of similarity between texts at fine granularity. Finally, based on the Uniform Content Label (UCL), a UCL indexing mechanism and a similarity detection system for We Media texts are designed to verify the proposed algorithms. The main content of this dissertation is as follows:

(1) To improve the efficiency of matching similar texts among massive We Media texts, this dissertation proposes SWMTMM-S, a similar We Media text matching method based on an enhanced Simhash algorithm. First, instead of the traditional Simhash approach to representing words, a Skip-gram model trained on a massive corpus is used to obtain word vector representations, which enhances the semantic representation of feature words. Then, the method combines We Media text features, TF-IDF weight, part-of-speech weight, and position weight to optimize weight selection, so as to distinguish the effects of different word types on the text representation and enrich the semantic information of the text. The text fingerprint is obtained through this procedure, and a text fingerprint index is constructed to quickly match the similar text set.

(2) To effectively detect deep semantic similarity between a target text and a plagiarized text, this dissertation proposes TSDA-XBL, a text similarity detection method based on XLNet and BiLSTM. First, the XLNet module obtains word vector representations, and the BiLSTM module then learns the bidirectional dependencies of words to produce a text representation matrix at sentence granularity. Adversarial training is introduced in the word embedding training procedure to enhance the robustness of the model. Next, a self-attention layer extracts the contribution of different sentences to the text representation and generates the deep semantic features of the text. Finally, a semantic similarity matrix is computed from the deep text representation matrices of the target text and the plagiarized text, and a Convolutional Neural Network extracts features from it to make the final judgment.

(3) Based on the characteristics of We Media texts, this dissertation designs the UCL indexing mechanism and the similarity detection system for We Media texts. The SWMTMM-S and TSDA-XBL methods are tested and analyzed experimentally. The results show that SWMTMM-S achieves a higher recall rate than the traditional Simhash algorithm and can quickly retrieve similar text sets from massive texts. TSDA-XBL is robust, extracts the semantic features of text more effectively, and improves the accuracy of text similarity detection.
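
For readers unfamiliar with Simhash, the fingerprinting and matching step can be illustrated with a minimal sketch. This is not the dissertation's SWMTMM-S implementation: it uses classic Simhash with hashed tokens rather than Skip-gram word vectors, and the helper `combined_weight`, which multiplies the TF-IDF, part-of-speech, and position weights, is a hypothetical combination rule, since the abstract does not give the actual formula. All function names and sample weights are assumptions made for illustration.

```python
import hashlib
from typing import Mapping


def feature_hash(token: str, bits: int = 64) -> int:
    """Map a token to a fixed-width integer using MD5 (classic Simhash style)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def combined_weight(tf_idf: float, pos_weight: float, position_weight: float) -> float:
    """Hypothetical rule: multiply the three weights named in the abstract."""
    return tf_idf * pos_weight * position_weight


def simhash(weighted_features: Mapping[str, float], bits: int = 64) -> int:
    """Compute a Simhash fingerprint from a token -> weight mapping."""
    acc = [0.0] * bits
    for token, weight in weighted_features.items():
        h = feature_hash(token, bits)
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            acc[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Illustrative weights only; real values would come from the corpus.
doc_a = {
    "plagiarism": combined_weight(0.8, 1.2, 1.5),
    "detection": combined_weight(0.6, 1.0, 1.0),
    "media": combined_weight(0.4, 1.0, 1.0),
}
doc_b = dict(doc_a, media=combined_weight(0.5, 1.0, 1.0))

distance = hamming_distance(simhash(doc_a), simhash(doc_b))
print(f"Hamming distance between fingerprints: {distance}")
```

Texts whose fingerprints differ in only a few bits are treated as candidate matches, which is why a fingerprint index allows fast recall before a more expensive semantic comparison is run.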
Keywords/Search Tags: deep learning, text similarity, We Media, Simhash algorithm, Uniform Content Label