
Key Technologies For Infringement Detection Of News Text

Posted on: 2020-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: L Sun
Full Text: PDF
GTID: 2428330623459902
Subject: Computer technology
Abstract/Summary:
Because news is publicly available, news content is exposed to infringements such as plagiarism, tampering and illegal redistribution, which in turn cause economic losses to the originators of that content. Text similarity detection is one of the key technologies for addressing news content infringement. However, traditional text similarity detection algorithms are inefficient when processing massive data. In addition, traditional methods usually extract features from character or word frequencies to compare texts, which leads to low accuracy because semantic information is missing. Based on this analysis, this thesis studies methods for detecting news text infringement. To achieve efficient and accurate detection, it proposes a similar news text screening method based on semantic fingerprinting (SF-SNTSM) and a text similarity detection algorithm based on the BERT model and an interactive inference network (BERT-IIN-TSDA). On this basis, a prototype news copyright protection system is designed to help users detect infringement efficiently and accurately. The main work of this thesis is as follows:

(1) To address detection efficiency on massive news data, this thesis proposes the similar news text screening method SF-SNTSM, based on semantic fingerprints, together with a text fingerprint generation algorithm, WS-TFGA, based on Word2vec and Simhash. SF-SNTSM first generates a semantic digital fingerprint of the news text with WS-TFGA, then searches the copyright database for the set of similar texts according to that fingerprint, and finally uses an auxiliary filtering mechanism to decide whether the text set needs further in-depth infringement detection. Compared with the traditional locality-sensitive hashing detection method, SF-SNTSM effectively improves accuracy and recall while maintaining the detection rate.

(2) To address the low accuracy of traditional detection methods caused by the lack of semantic information, this thesis proposes the text similarity detection algorithm BERT-IIN-TSDA, based on the BERT model and an interactive inference network. BERT-IIN-TSDA is the detection step that follows SF-SNTSM, and it consists mainly of a text representation matrix generation module and a text infringement determination module. First, the pre-trained BERT language model generates representation matrices for the news text to be detected and for the source news text. Second, a self-attention encoding layer extracts the correlation information within each text. Then, an information interaction layer matches the sentence-level relationships between the text to be detected and the source news text, producing a text interaction matrix. The deep network DenseNet extracts deep semantic information from this interaction matrix. Finally, a classification module determines whether the news text is infringing. Experiments show that the algorithm further improves the accuracy of text similarity detection.

(3) Based on the above methods, this thesis develops a prototype news copyright protection system and uses the Uniform Content Label (UCL) to manage digital copyright for news. A news plagiarism data set is constructed by crawling real news data, and the performance of SF-SNTSM and BERT-IIN-TSDA is verified on it. The experimental results show that SF-SNTSM has better Hamming distance quantization ability and higher accuracy and recall than the traditional locality-sensitive hashing method, and that BERT-IIN-TSDA has higher accuracy than traditional similarity detection methods.
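The screening idea in (1), a semantic fingerprint followed by a Hamming-distance lookup, can be illustrated with a minimal sketch. This is not the thesis's WS-TFGA algorithm, whose details are not given in the abstract. It assumes the random-hyperplane form of Simhash applied to a term-frequency-weighted sum of word vectors. The names `EMBED`, `PLANES`, `fingerprint`, `hamming` and `screen`, the toy vocabulary, and the threshold `max_distance=3` are all hypothetical, and the random toy embeddings only stand in for a trained Word2vec model.

```python
import random
from collections import Counter

DIM = 8    # toy embedding size; trained Word2vec vectors are usually far larger
BITS = 16  # fingerprint length in bits (assumed for illustration)

# Hypothetical toy embeddings standing in for a trained Word2vec model.
_rng = random.Random(0)
VOCAB = ["court", "ruling", "judge", "verdict",
         "football", "match", "goal", "league"]
EMBED = {w: [_rng.gauss(0, 1) for _ in range(DIM)] for w in VOCAB}

# One random hyperplane per fingerprint bit (random-projection Simhash).
PLANES = [[_rng.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]


def fingerprint(tokens):
    """Return a BITS-bit integer fingerprint for a tokenized text."""
    # Term-frequency-weighted sum of word vectors; unknown words are skipped.
    counts = Counter(t for t in tokens if t in EMBED)
    doc = [0.0] * DIM
    for word, tf in counts.items():
        for i, x in enumerate(EMBED[word]):
            doc[i] += tf * x
    # Each bit records on which side of a hyperplane the document vector lies.
    bits = 0
    for plane in PLANES:
        dot = sum(p * d for p, d in zip(plane, doc))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def screen(query_fp, database, max_distance=3):
    """Return ids of stored fingerprints within max_distance of the query."""
    return [doc_id for doc_id, fp in database.items()
            if hamming(query_fp, fp) <= max_distance]
```

In this sketch, a candidate returned by `screen` would then be passed on to a slower, more accurate comparison, which is the role the abstract assigns to BERT-IIN-TSDA. A linear scan over the database is used only for brevity; a production system would index the fingerprints rather than compare against every entry.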
Keywords/Search Tags: Text infringement detection, locality-sensitive hashing, semantic fingerprinting, BERT, interactive inference network